AI for drug discovery and development


The use of AI in drug discovery and development is not new: biotech and pharma companies have been leveraging AI, either through technology partners or in-house teams, for more than 10 years. However, recent advancements in deep learning have renewed interest in the space and created new opportunities to reduce both the time and cost of bringing new therapeutics to market. With the average drug costing north of $2B and taking over a decade to bring to market, new approaches are needed to create cost-effective drugs that can lead to better outcomes for patients across the world.

Drug development roughly follows these steps:

  1. Discovery and pre-clinical research: this includes target identification and validation, compound screening (or generation) and optimization, and pre-clinical studies. This stage has been the primary focus of AI-driven drug discovery and development to date. More on this in later sections.
  2. Clinical trials: a multi-phase process in which drug candidates are tested for efficacy, safety, side effects, and other factors. As clinical trials progress, they scale in the number of participants, with the end goal of validating that the drug has novel therapeutic behavior that is competitive with the current standard of care (and ideally much better) with minimal side effects.
  3. Regulatory approval: application to relevant regulatory bodies (such as FDA in the US) of findings and data with the request to move from clinical trials to commercial development.
  4. Manufacturing and scale up: the process of designing processes, equipment, and teams that can manufacture the drug at industrial scale while meeting the quality and purity needed to bring it to market.

The cost of drug development is driven by the high failure rate of drugs progressing through this pipeline, long feedback cycles, and expensive testing in the physical world (e.g., wet lab iterations in pre-clinical trials). Roughly 95% of drugs fail to make it through the development pipeline, which leaves the successful drugs to bear the cost of the ones that didn’t make it. Companies are looking to AI for ways to tackle these failure modes across different layers of the stack, which are outlined below.
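To make the "successful drugs bear the cost of the failures" point concrete, here is a minimal arithmetic sketch. It assumes a single average spend per candidate, backed out from the article's rough figures (about $2B per approved drug and about a 5% success rate). The function name and the alternative success rates are illustrative assumptions, not industry data.

```python
def cost_per_approved_drug(spend_per_candidate: float, success_rate: float) -> float:
    """Expected total spend per approved drug when failures are paid for by successes."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return spend_per_candidate / success_rate


# Assumption: derive an average spend per candidate from the article's
# rough figures (~$2B per approved drug at a ~5% success rate).
BASELINE_SUCCESS_RATE = 0.05
COST_PER_APPROVAL = 2_000_000_000
spend_per_candidate = COST_PER_APPROVAL * BASELINE_SUCCESS_RATE  # about $100M

# Hypothetical scenarios: what happens if better models lift the success rate.
for rate in (0.05, 0.10, 0.20):
    cost = cost_per_approved_drug(spend_per_candidate, rate)
    print(f"success rate {rate:.0%}: ~${cost / 1e9:.2f}B per approved drug")
```

Under this simplified model, doubling the success rate halves the cost per approved drug even if the spend per candidate does not change, which is why so much of the AI effort targets failure prediction.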

Applications of AI

Discovery and Pre-Clinical Research

AI and large scale compute are being used in a few different ways to speed up and improve the effectiveness of the discovery and pre-clinical phase of drug development. For target identification and the identification of potential drug candidates, some of the favored approaches include large scale screening of known compounds against targets, using NLP for literature review to identify off-label uses for existing drugs that can be applied to other targets, and, more recently, using generative models for de novo molecule design. Companies such as Variational AI, which focuses on oncology, are taking a generative AI approach to drug discovery: rather than screening the chemical space, they design molecules directly using generative models while targeting certain characteristics.

Most failures in drug development can be attributed to causes such as failing to identify the proper target for a given disease or process, poor bioavailability or stability, side effects due to interaction with unintended biological targets, and an inability to translate from animal models or other preclinical systems to humans. Both AI-focused and biotech companies are investing in developing AI models that can predict how likely each of these failure modes is so that higher potential candidates can be pursued.
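As a rough illustration of how per-failure-mode predictions could be used to prioritize candidates, the sketch below combines hypothetical risk scores into a single probability of avoiding every failure mode and ranks candidates by it. The candidate names and risk values are made up, and treating the failure modes as independent is a simplifying assumption. This is not how any particular company's models work.

```python
from math import prod


def pass_probability(failure_risks: dict[str, float]) -> float:
    """Probability of avoiding every listed failure mode, assuming independence."""
    return prod(1.0 - risk for risk in failure_risks.values())


# Hypothetical candidates with made-up predicted risks per failure mode.
candidates = {
    "cand_A": {"wrong_target": 0.30, "poor_bioavailability": 0.20,
               "off_target_effects": 0.10, "translation_failure": 0.40},
    "cand_B": {"wrong_target": 0.10, "poor_bioavailability": 0.50,
               "off_target_effects": 0.30, "translation_failure": 0.50},
    "cand_C": {"wrong_target": 0.20, "poor_bioavailability": 0.10,
               "off_target_effects": 0.20, "translation_failure": 0.30},
}

# Rank candidates from most to least likely to clear all failure modes.
ranked = sorted(candidates, key=lambda name: pass_probability(candidates[name]), reverse=True)

for name in ranked:
    print(f"{name}: {pass_probability(candidates[name]):.3f}")
```

In practice the risks are correlated and the predictions are uncertain, so a real triage step would need calibrated models rather than a simple product of probabilities.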

IQVIA (a CRO) acquired Linguamatics, which developed NLP technology for extracting data from unstructured sources such as research papers. While this created a competitive advantage for them in the past, the availability of more powerful models such as GPT-4, Claude 2, and a variety of open source models has nullified this advantage as other CROs aim to leverage LLMs for NLP tasks across their business. It’s important to note that, to date, LLMs in this space have been leveraged more heavily for text synthesis and data extraction than for generation.

Once a potential molecule has been identified, it still needs to be optimized to reduce toxicity, improve efficacy, and ensure the drug can be a successful candidate when moved to later stages of the process. Reinforcement learning can be used here to further optimize the properties of a given molecule, improving its ability to bind to select sites and reducing the chances of side effects.

Clinical Trials

A relatively new approach that has been gaining interest is the use of digital twins to aid in selecting trial participants, predicting drug responses, and anticipating side effects. (Series B+) is one example of startups that are tackling this space with the aim of improving the efficiency of clinical trials. By leveraging digital twin generative models, the hope is that an effective trial can be run with a smaller control group. There is non-trivial risk from a regulatory standpoint around using AI models to substitute for actual study subjects, and regulatory bodies aren’t known to be the fastest moving organizations in the world. However, shrinking the size of required control groups can help therapeutics companies execute faster by easing patients’ fear of receiving a placebo when participating in studies.

Related to shrinking the population needed for effective clinical trials is the use of synthetic data to predict outcomes for drug development. For some diseases, it is very difficult to find enough patients who currently have the condition and are willing to participate in clinical trials. Disease progression can also occur over a very long timeline, which lengthens the feedback loop for researchers. There is an open question as to whether synthetic data can act as effective input data for predictive models that move drug development forward. This is currently most applicable in the pre-clinical environment, but if successful it could expand into the clinical trial phases for certain data points.

Identifying, screening, and managing participants during clinical trials is also a heavy burden that is usually placed on research sites and CRAs. Managing data collection, ensuring study compliance, noting and following up on deviations, and summarizing findings are all potential applications for LLMs that could reduce the headcount and contracted dollars allocated to studies.

Regulatory Approval

While not a large component of the overall cost of drug development, this phase can be further optimized by leveraging LLMs for structured data extraction and summarization. Co-pilot type products will no doubt exist in this world. However, I’m skeptical of the ability of a venture scale company to break out in this space given the relatively small universe of enterprise customers, concerns over hallucinations, and low perceived value from pharma companies. A lot of work (and spending) has taken place to get to this point, so a co-pilot assistant likely creates only incremental value that commands a relatively low ACV across a constrained universe of buyers, along with high churn via company attrition.

Manufacturing and Scale Up

Although a majority of the spend in this phase is related to equipment CAPEX and facility construction, the design phase can be optimized through engineering design software infused with models that can more rapidly iterate to an optimal design and identify potential paths for process engineering to move from batch to flow chemistry. Most of the value is likely to be captured by existing engineering design software companies such as AspenTech and by EPC firms, because they can layer new models and functionality onto an existing platform with a captive customer base, as we’ve seen with incumbents in other industries like Microsoft and Adobe.

AI + Robotics

Even though a tremendous amount of progress has been made in generating, screening, and optimizing candidates through computational means, a huge amount of work remains for the wet lab to perform validation and testing of candidates in live cells and tissue. One area of ongoing development is progressing towards fully automated laboratory environments that can execute experiments with the click of a mouse and return relevant data with as little latency as possible. Full (or near full) automation of this phase could result in more rapid iteration of leading candidates at a much lower cost due to lower lab tech headcount and higher equipment utilization. These automation focused labs can also play a role in building out high quality data sets that can be used in the training of drug discovery models that can reduce the risk of failure at this stage of preclinical research.

Personalized Medicine

The holy grail for medicine is the ability to create customized treatment based on the genetics or other personal attributes of the patient. In this domain AI is crucial for analyzing large amounts of genetic data to identify variations linked to diseases and predict health risks. By analyzing existing data sets systems will be able to predict individual risk to certain diseases and anticipate the progression of those diseases. While there is a large amount of promise in this space there are still barriers that prevent effective deployment of personalized medicine. A few of these barriers include access to high quality data sets that are diverse enough for application in models, better data sharing protocols and privacy compliance, integration of heterogeneous data sets (genomic, lifestyle, clinical, etc), and model interpretability for understanding AI driven decisions.

Why Now?

Advancements in deep learning such as AlphaFold2 and powerful generative models have reignited interest in the tech bio space from investors and founders who may have previously avoided the space to pursue potentially more lucrative opportunities. These advances will help to draw AI/ML talent into the arena of biological systems. However, it’s crucial that these AI practitioners are paired with founders and teams who understand the current shortcomings of AI in the drug development process, so that the technology brought to market has a material impact on lowering the cost curve for drug development.

Databases such as PubChem, PDB, and ChEMBL along with open source libraries such as DeepChem, DeepAffinity, Fpocket, DOCK, and more have made the space much more accessible to ML/AI practitioners, but further work is needed to unlock real step changes in reduction of drug costs. Data from the wet lab stage that provides critical feedback for drug discovery and small molecule generation models is hard to come by and often held closely by pharma and biotech companies given the costs required to run the wet lab experiments to gather the data.

The use and commercialization of AI in the drug discovery and development process is still very early, with an estimated market size of $1B - $1.5B, but it is growing rapidly at an estimated 25% - 30% CAGR from 2023 - 2030. Many generational companies are formed at a stage in which they help grow the market rather than simply capture a large existing market, which makes this a particularly attractive time to meet with founders who want to build the future of drug development.
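For a sense of what those growth rates imply, the following sketch compounds the estimated market size forward. It assumes the $1B - $1.5B figure is the 2023 base and that growth compounds annually over the seven years to 2030. The function name is illustrative.

```python
def project_market_size(start_size_bn: float, cagr: float, years: int) -> float:
    """Compound a starting market size (in $B) at a constant annual growth rate."""
    return start_size_bn * (1 + cagr) ** years


YEARS = 2030 - 2023  # assumption: seven annual compounding periods

low_2030 = project_market_size(1.0, 0.25, YEARS)   # low base, low growth
high_2030 = project_market_size(1.5, 0.30, YEARS)  # high base, high growth

print(f"Implied 2030 market size: ~${low_2030:.1f}B to ~${high_2030:.1f}B")
```

Under these assumptions the estimates imply a market of roughly $4.8B to $9.4B by 2030, which is still small relative to overall pharma R&D spending and consistent with the view that companies formed now are growing the market rather than capturing an existing one.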

Early results from applying AI in drug discovery and development are promising with Insilico Medicine announcing their ability to progress to a phase I clinical trial with an AI designed candidate in just 30 months compared to the average duration of 5-6 years to reach this milestone. Estimates from BCG indicate that AI has the ability to produce time and cost savings of up to 50% in the discovery and preclinical stages of development.

Business Models

There are a few common business models that companies pursue in this space:

  1. In-house development - These firms develop and leverage the technology in house to build their own pipelines, with the aim of eventually bringing drugs to market. This is currently extremely capital intensive and as such could be very dilutive for funds that invest with smaller check sizes and do not carry significant reserves to maintain pro-rata in follow on rounds. This area is better suited to traditional biotech firms with deeper clinical expertise and larger fund sizes.
  2. Development via partnerships - These firms focus on early stage discovery and then lean on larger pharma companies to handle clinical development through commercialization. These companies generate revenue through upfront payments, research milestones, and royalties. Sufficiently capital efficient businesses could be interesting, but in practice these companies are also capital intensive to the point of being extremely dilutive.
  3. Technology or data licensing - Companies that develop proprietary technology or data sets and license them to biotech or pharma companies fall into this bucket. This area is the most promising for funds that want to invest in earlier stages and are dilution sensitive. Creation of robust data sets and making them accessible to AI/ML practitioners within the industry will be crucial for developing models that can expedite the feedback loop from in silico screening and discovery to clinically viable solutions. Access to high quality, diverse, and comprehensive data sets remains a primary barrier to developing personalized medicine and new drug discovery models.
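The dilution concern running through these business models can be sketched with simple arithmetic. The example below assumes a fund that does not follow on and a company that issues new shares in each later round. The initial stake and per-round dilution figures are hypothetical and chosen only for illustration.

```python
def stake_after_rounds(initial_stake: float, round_dilutions: list[float]) -> float:
    """Ownership remaining after later rounds when the fund does not invest pro-rata."""
    stake = initial_stake
    for dilution in round_dilutions:
        if not 0 <= dilution < 1:
            raise ValueError("each dilution must be in [0, 1)")
        stake *= 1 - dilution  # new shares shrink every existing holder's percentage
    return stake


INITIAL_STAKE = 0.10  # hypothetical 10% ownership after the first check

# Hypothetical: a capital efficient data company raises two more rounds,
# while a capital intensive drug developer raises five, each selling 20%.
capital_light = stake_after_rounds(INITIAL_STAKE, [0.20] * 2)
capital_heavy = stake_after_rounds(INITIAL_STAKE, [0.20] * 5)

print(f"Two later rounds:  {capital_light:.1%} ownership at exit")
print(f"Five later rounds: {capital_heavy:.1%} ownership at exit")
```

With these made-up numbers the same initial stake ends at 6.4% after two later rounds but only about 3.3% after five, which is the mechanism behind the preference for capital efficient licensing businesses among dilution sensitive funds.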


The application of AI in the drug discovery and development process holds a lot of promise, but it’s clear that we’re still in the early innings, with many foundational challenges that need to be addressed in both the technology and the regulatory process. Full stack or partnership driven tech bio firms have the potential to develop promising technology and bring novel drugs to market. However, their capital intensive nature, long development cycles, and high risk of failure make them better suited to traditional biotech investment firms.

I recommend that early stage firms that aren't exclusively biotech driven focus on companies that are working towards breaking down the data barriers that are currently holding back application of AI in the drug discovery and development process and personalized medicine. These data focused firms can operate in a capital efficient manner which allows firms to deploy funds at smaller check sizes and reduce the risk of excessive dilution prior to an exit.

Due to the complexity of biological systems and the drug development process, an ideal team in this space is technically strong in data/ML and has at least one co-founder with experience in drug development or biotech. This is important to ensure that the data products, data sharing platforms, and compliance are built with the realities and complexity of the biotech industry in mind. Companies that focus on this area can provide holistic data platforms that integrate both proprietary and open source data sets across multiple domains (clinical, research + simulation, behavioral) for application in AI model development.

Data platforms could be purely computational or could be combined with laboratory operations and robotics. At the earliest stage, firms should remain open to companies taking either approach to solving problems related to data access for life sciences.

Particular areas of application that are compelling from a data standpoint include platforms that allow for increased success of target identification and compound-target matching, building data sets and training platforms that incorporate wet lab results from in vivo and in vitro testing, synthesis platforms that effectively mesh heterogeneous data sets for application in personalized medicine, and the development of generative models for synthetic data set generation.

These data driven companies for drug development and discovery may grow to take on a similar structure to large data platforms in other industries such as Bloomberg, IHS, Thomson Reuters, and others. They may serve as the data foundation that facilitates a Cambrian explosion of AI driven drugs that have a drastically reduced cost basis, higher success rate, and more rapid time to market. In the same way that Scale has made higher quality data accessible to self-driving car companies, there could be a similar play in the biotech/pharma space that aims to provide high quality and comprehensive data sets for application in generative models, RL, and GNNs geared towards drug discovery and development.