Blog Post

Nov 12, 2024

Written by Frederik Junge, Nadja Reischel, Jasper Masemann

Harvester - How Cherry Becomes a Better Investor With the Help of AI to Allow Us to Focus on What Matters Most, Our Founders

See how Cherry's team created Harvester, an AI-powered approach to venture capital that maximises meaningful founder interactions while maintaining highly selective investment criteria. A technical exposé on architecture, backtesting, and lessons learned.



In venture capital, three elements define success: identifying exceptional teams, selecting the right combination of team, model, and timing (which demands experience and human connection), and accelerating portfolio companies with founder-first support. This framework guides our approach to technological augmentation.


1. Why We Created Harvester

As an early-stage investor, you receive signals for potential leads every day—sometimes numbering in the thousands. These signals can come from potential founders leaving their roles, new companies being registered, or insights from our network. While each signal holds potential, the challenge is determining which ones to prioritise. This leads to a high volume of repetitive tasks, such as reviewing LinkedIn profiles, checking Crunchbase histories, and evaluating founders’ previous employers. Every signal needs to be thoroughly checked for all available information, but doing this for every lead can easily turn into a full-time job when done manually.

For a team of twelve making only ten to twelve investments annually, selectivity isn't just important—it's existential. We must be ruthlessly efficient in identifying founders who align with our investment thesis, even if it means passing on otherwise compelling opportunities.

We realised that this mirrors the inefficiencies we often seek to address when investing in AI application-layer startups—we love companies that focus on under-digitised verticals with highly repetitive tasks. Given that everyone at Cherry is an ex-entrepreneur or ex-operator, we saw an opportunity to optimise our internal processes and created Harvester to prioritise startup signals more efficiently.

Harvester's primary goal is not to capture every lead in the market but to enrich the leads we receive, enabling us to be more knowledgeable about the ones we want to prioritise. This approach allows us to make better use of our investment resources by augmenting our workflows with AI.

2. Approach

At its core, venture capital is about making informed decisions through a blend of experience, insight, and intuition. We believe these processes can be significantly enhanced by integrating technology. Our vision of Augmented Venture Capital transforms traditional VC operations into a seamless collaboration between human expertise and machine intelligence, potentially reducing human biases and enhancing the consistency and objectivity of our outcomes. This approach allows us to optimise our decision-making processes, freeing up our most valuable resource—human attention—for critical activities like relationship building, ecosystem development, and understanding the opportunities and challenges of tomorrow.

As good engineers, we began by clearly defining the scope of our project. We created a high-level overview of our approach, ensuring that every step was aligned with our objectives and setting the stage for developing a robust, data-driven model that enhances human decision-making in venture capital.

This approach includes several critical components:

  • Data Sources: The foundation of our model is the data it is based on. We needed to identify what kind of data was essential and assess the reliability of these data sources. This included data on companies, founders, market conditions, and previous investment rounds. We also evaluated the availability and accuracy of this data to ensure it would provide a solid base for our model’s predictions.
  • Enrichment: Once the initial data sources were identified, the next step was to enrich this data. This involved correlating data from different sources to enhance the initial signals we received. For example, we integrated information from LinkedIn profiles, GitHub repositories, ProductHunt, and financial databases to gain a more comprehensive understanding of each startup’s potential.
  • Preferences: We needed to define what characteristics we are looking for in a startup, such as the experience of the founding team, the scalability of the product, and the size of the target market. Additionally, we had to establish how these different dimensions should be weighted (see the sketch after this list). This allows us to fine-tune the model later on, ensuring that it aligns with our firm’s investment thesis when looking at new opportunities.
  • Integration to Our Workflows: Finally, we considered how to integrate the model into our existing workflows seamlessly. It was essential that the companies flagged by the algorithm could be easily transferred into our CRM tool, along with all relevant information. This ensures that our team can efficiently act on the model’s recommendations and keep the additional effort to a minimum.
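
To make the weighting idea concrete, here is a minimal sketch of how a preference-weighted priority score could be computed. The dimension names and weights are illustrative assumptions; Harvester's actual preference dimensions and weightings are not public:

```python
from dataclasses import dataclass

# Illustrative dimensions and weights -- Harvester's actual preference
# dimensions and weightings are not public.
WEIGHTS = {
    "founder_experience": 0.4,
    "product_scalability": 0.35,
    "market_size": 0.25,
}

@dataclass
class StartupSignal:
    name: str
    scores: dict[str, float]  # each dimension normalised to 0..1 by upstream enrichment

def weighted_score(signal: StartupSignal) -> float:
    """Combine per-dimension scores into a single priority score."""
    return sum(w * signal.scores.get(dim, 0.0) for dim, w in WEIGHTS.items())

lead = StartupSignal(
    name="ExampleCo",
    scores={"founder_experience": 0.9, "product_scalability": 0.6, "market_size": 0.7},
)
print(f"{lead.name}: {weighted_score(lead):.2f}")
```

Keeping the weights in one place makes the later fine-tuning step simple: adjusting the model towards the firm's thesis is a matter of changing a few numbers rather than retraining anything.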

Based on these guardrails, we iteratively developed an architecture that adheres to the core principles of data engineering, featuring an ETL pipeline and a data warehouse. At a high level, this architecture functions as follows:

  1. Extract: Data is pulled from various sources, including public databases, social media profiles, and internal CRM records.
  2. Transform: The extracted data is then processed to ensure consistency and accuracy. This step involves data cleaning, enrichment through correlation with additional sources, and applying the weighting preferences defined earlier.
  3. Load: Finally, the transformed data is loaded into our data warehouse, where it is stored and made accessible for the model to analyse and generate predictions.
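
As a rough, self-contained sketch, the three stages can be expressed as plain Python functions. The sources and fields here are placeholders; the real pipeline is orchestrated with the tooling credited in the next section:

```python
from typing import Any

def extract() -> list[dict[str, Any]]:
    """Pull raw signals from the various sources (placeholder record)."""
    # In practice: public databases, social media profiles, internal CRM records.
    return [{"company": " ExampleCo ", "source": "crm", "founder": "Jane Doe"}]

def transform(raw: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Clean and enrich records, then apply the weighting preferences."""
    transformed = []
    for record in raw:
        record = {**record, "company": record["company"].strip()}  # basic cleaning
        record["priority_score"] = 0.0  # placeholder for the scoring model's output
        transformed.append(record)
    return transformed

def load(records: list[dict[str, Any]]) -> None:
    """Persist transformed records into the data warehouse (stubbed out here)."""
    for record in records:
        print("upsert into warehouse:", record)

if __name__ == "__main__":
    load(transform(extract()))
```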


[Figure: high-level overview of the Harvester architecture and tech stack]

2.1 Tools and Libraries We Loved During Development

Throughout the development process, we had the opportunity to work with a variety of tools and libraries that significantly simplified our workflow. Below, we want to give some credit to the resources that became essential to our project.

  • Supabase (supabase.com): We needed a simple, relational database solution that could provide a user-friendly web interface for non-technical team members. Supabase was the perfect fit. It offered an intuitive UI and robust backend, making it easy for everyone on the team to interact with the data without needing deep technical knowledge. The seamless integration with our existing tech stack was a major plus.
  • Pydantic (docs.pydantic.dev/latest/): While Python was our go-to language for quick prototyping, its dynamically typed nature presented some challenges during the ETL process. Pydantic came to our rescue by providing data validation and settings management using Python type annotations. It allowed us to create structured, validated data models, which significantly reduced errors and streamlined our ETL pipelines (see the sketch after this list).
  • Prefect (prefect.io): For orchestrating our data workflows, Prefect proved invaluable. It offered powerful orchestration capabilities, enabling us to manage and monitor complex data pipelines with ease. However, we did encounter a steep learning curve when it came to Infrastructure as Code (IaC) deployment. Despite this, Prefect’s flexibility and scalability made it a key component of our infrastructure.
  • Instructor (python.useinstructor.com/): Structuring data for Large Language Model (LLM) enrichment and evaluation required a specialised approach. Instructor provided the tools we needed to efficiently manage and prepare our data for LLM processing. Its capabilities in handling complex data structures allowed us to focus on model training and evaluation without getting bogged down by data preparation challenges.
  • Langfuse (langfuse.com): Monitoring and continuous improvement of our models were critical aspects of our development process. Langfuse offered an easy-to-integrate solution for tracking and enhancing model performance, ensuring our models were always performing at their best.
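
To give a flavour of how Pydantic and Instructor fit together, here is a minimal sketch of extracting a structured founder profile from unstructured text. The schema fields are illustrative assumptions rather than Harvester's actual data model, and the snippet assumes an OpenAI API key is configured in the environment:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

# Illustrative schema -- the fields Harvester actually extracts are not public.
class FounderProfile(BaseModel):
    name: str
    previous_employers: list[str] = Field(default_factory=list)
    years_of_experience: int | None = None

# Instructor wraps the OpenAI client so responses come back as validated
# Pydantic objects instead of free-form text.
client = instructor.from_openai(OpenAI())

def enrich_founder(raw_text: str) -> FounderProfile:
    """Ask the LLM to structure a raw signal (e.g. a scraped bio) into a profile."""
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=FounderProfile,
        messages=[{"role": "user", "content": f"Extract the founder profile:\n{raw_text}"}],
    )

profile = enrich_founder("Jane Doe spent 6 years at BigCo before founding ExampleCo.")
print(profile.model_dump())
```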

3. Results/Backtesting

We wouldn't be good investors if we didn't hold ourselves accountable, just as we do with our portfolio companies. To that end, we set up a backtesting pipeline, based on the historical CRM data of all the companies we have screened over the lifetime of Cherry, to verify our results.

This approach allows us to validate our hypotheses about great founders through both internal and external factors, while maintaining our core value of always being prepared in founder interactions.

We specifically looked at the data from the early stages of the screening process (companies we had entered as leads and the ones we conducted first calls with) in order not to bias the algorithm or give an unfair advantage to its human counterpart. A Crunchbase dataset served as the baseline, or source of truth, for whether companies had received follow-up funding and thus become successful from a funding perspective.

The performance of our model was then measured with AUC-ROC on different subsamples of the dataset, a standard evaluation method for classification models. Here, we want to understand whether the model is generally capable of predicting the outcome of a company based on the data we provided, measured against Crunchbase data points such as follow-up funding rounds (by high-ranking VCs), IPOs, or acquisitions.
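
Concretely, the evaluation boils down to comparing the model's scores against binary outcome labels derived from Crunchbase. A minimal sketch with scikit-learn, using made-up numbers:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels: 1 = company later raised follow-up funding or exited, 0 = did not.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
# Harvester-style priority scores for the same companies (illustrative values).
y_score = np.array([0.82, 0.31, 0.45, 0.67, 0.12, 0.91, 0.58, 0.22])

auc = roc_auc_score(y_true, y_score)          # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points for plotting the curve
print(f"AUC-ROC: {auc:.2f}")
```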

[Figure: AUC-ROC curves of the model on different subsamples of the dataset]

The initial results indicate that the model is good at identifying companies that go on to become successful, with a successful pick rate of 72% and over 90% identification of less relevant opportunities. However, we also wanted to benchmark our approach against the selections made by our human investors to understand the potential shortcomings of our model and fine-tune our weighting system.

[Figure: benchmark results comparing Harvester's picks with those of our human investors]

Here, the results of a logistic regression clearly show that the model performs worse than its human counterparts. The two lower rows are the results of Harvester, which show that the model improves when we introduce a threshold on the scoring (we were able to focus the model better because we know what we are looking for and optimised for the highest Cherry fit), but it still falls well short of human judgement.
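
For illustration, thresholding simply means only surfacing companies whose score clears a cut-off before measuring hit rates. A hypothetical sketch, continuing the toy example from the evaluation snippet above (the cut-off value is an assumption, not Harvester's actual threshold):

```python
from sklearn.metrics import precision_score, recall_score

# Reusing y_true / y_score from the evaluation sketch above.
THRESHOLD = 0.5  # illustrative cut-off; in practice tuned for the highest Cherry fit

y_pred = (y_score >= THRESHOLD).astype(int)  # only score-above-threshold picks count
print(f"precision: {precision_score(y_true, y_pred):.2f}")
print(f"recall:    {recall_score(y_true, y_pred):.2f}")
```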

This suggests that companies deemed worthy of investment by our analysts are substantially more likely to succeed than companies selected by the model, indicating an advantage of human intuition and potentially of additional data sources, such as expert calls. So don't worry; the machines won't completely take over our jobs in the near future.


4. Summary

Data-driven methods in venture capital are here to stay. However, instead of replacing human interaction, including founder outreach ("always prepared" is one of our core values) and decision-making, they augment the investor, enabling them to screen more opportunities and spend more time on the teams and companies that genuinely fit the fund's investment thesis. This was especially important for us at Cherry: we make only ten to twelve investments a year and therefore have to be very selective with our investment choices, which sometimes means passing on great teams.

Our vision is threefold: optimise selection for our ten to twelve annual investments, augment analysis during founder discussions with data-driven insights, and, most critically, identify technical founders earlier in their journey. We optimise for depth over breadth, maximising time with the founders who will define the next generation of technology.