
How Harvard is Building Machine Learning Models to Better Identify Targets and Design Drugs

John Quackenbush, PhD, describes how combining human expertise in biological systems with machine learning models can help find new targets, design drugs and predict better interventions.

July 10, 2024

How are artificial intelligence and machine learning models impacting drug development and delivery?

There are a few different areas where machine learning is starting to have an impact on the whole paradigm of drug development. The first thing you need to do in drug development is to identify a potential target: which molecule in a cell are we going to choose as the point of intervention to alter the trajectory of the system? If you think about it that way, you have to look at what drives the development of biological states. For example, what makes a cancer develop to be resistant or sensitive to chemotherapy? While these trajectories have stochastic components, they also reflect key elements in the cell that drive the transition and the fate of the system as it continues to evolve.

A second component is understanding the goal of the intervention. I tend to separate this into two pieces. The first piece is target discovery, where machine learning is starting to provide us with better ways to identify targets and to understand the places where we can make interventions or where we can perturb the system and allow it to move along a different trajectory or alter its state such that it's more likely to respond to a particular intervention.

The second piece is building predictive models. If I make a perturbation in a particular location, how do I predict whether or not it will have the effect we want? For that, we need large bodies of data where we can look at the effects of perturbation and understand how they might alter these disease development trajectories or shift them in a new direction. In summary, machine learning can help us find the targets, think about how we perturb those targets and predict the way in which perturbing those targets changes the trajectory of the system.


Do you have any real world examples?

I’m very excited about the release of AlphaFold3. AlphaFold is Google DeepMind's large-scale project that turned out to be a very accurate predictor of three-dimensional protein structures. AlphaFold2 refined that process and AlphaFold3 now goes beyond predicting individual protein structures to provide predictions of interactions between proteins and other proteins or between proteins and DNA. We can now think about ways in which we might be able to actually perturb systems with a much higher degree of success.

If we have a mutant protein, we can potentially find a protein whose predicted structure will bind that protein and lock or alter its functional role. Or we can even design a small molecule to bind to DNA and interfere with the way certain genes are turned on and off. There are many interesting opportunities to bring these new machine learning approaches and think about altering the overall trajectory in drug development.

"Machine learning can help us find the targets, think about how we perturb those targets and predict the way in which perturbing those targets changes the trajectory of the system."


What are the biggest challenges for these machine learning approaches to achieve meaningful clinical impact?

While we have access to massive quantities of data, we don’t necessarily have the right data. The genome is 3 billion bases long and we have two copies of that genome. Each one of us differs from any other one of us at 6 million or more individual sites in our genomes. So there are a lot of genetic variants we have to look at and understand in the context of identifying those that are relevant for disease development and progression. We have genome sequence data, RNA sequencing data and quantitative protein-level data but those may or may not be linked to the relevant clinical data to be able to make good decisions. We have lots of model data from cells and cell lines but ensuring we have the right data at the right time to train these models is a key challenge.

The second piece to understand is that we have a long history of designing targeted therapies but even if we target a particular system or a particular protein for intervention, we have no guarantee that the perturbation we make will give us the right result. Part of that has to do with the fact that each one of us has a unique genetic background and has a different capacity to respond to perturbation. While there are a lot of targeted therapies and many of them are incredibly effective, not all of them work in each and every individual. We need to recognize that building these models and improving upon them will be an iterative process. 

The third thing is that we have multiple objectives to balance when building machine learning models. One is going to be the predictive power of the model. When taking a large data set with outcomes data and trying to predict whether or not someone will respond to therapy, the challenge we often face is that we invariably see predictive accuracy start to fall as we validate on independent data sets and on individuals. This may mean that the models aren’t accurately reflecting the underlying processes that actually drive the system we’re looking at. One of the things my colleagues and I have started to think about is not just predictability but interpretability and explainability. Can we build a model where we can look at the predictions with the greatest weights and understand why those are influencing the outcomes? We are trying to find a causal explanation for why systems respond the way they do. 


How do you do that?

The good news is that we have a lot of prior knowledge about biological systems that we can start to build into our models as soft constraints when we try to optimize an objective function. By building these in, we are able to get models sparse enough to be interpretable and to better understand what features are driving the model. If a system doesn’t respond the way we predicted it to, we can implement changes to the model or introduce additional interventions.
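As a minimal sketch of what such a soft constraint might look like in practice, the Python (PyTorch) example below fits a simple linear outcome model while penalizing weight on genes that a prior-knowledge mask marks as irrelevant. The data, the mask and the penalty strength are illustrative placeholders, not anything from an actual model.

```python
import torch
import torch.nn as nn

# Hypothetical setup: predict a clinical outcome from gene expression,
# with prior knowledge encoded as a 0/1 mask over genes.
n_genes, n_samples = 200, 64
X = torch.randn(n_samples, n_genes)               # expression matrix (illustrative)
y = torch.randint(0, 2, (n_samples,)).float()     # binary outcome (illustrative)
prior_mask = (torch.rand(n_genes) > 0.8).float()  # 1 = gene believed to be relevant

model = nn.Linear(n_genes, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
bce = nn.BCEWithLogitsLoss()
lam = 0.1  # strength of the soft constraint

for epoch in range(200):
    optimizer.zero_grad()
    logits = model(X).squeeze(-1)
    # Data-fit term: how well do we predict the outcome?
    loss = bce(logits, y)
    # Soft constraint: penalize weight on genes the prior says are irrelevant.
    loss = loss + lam * (model.weight.squeeze() * (1 - prior_mask)).abs().sum()
    loss.backward()
    optimizer.step()
```

Because the prior enters only as a penalty rather than a hard restriction, a strong signal in the data can still override it.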

One of the things I got very interested in was the lottery ticket hypothesis as a way of pruning neural networks and deep learning models to try to get better predictions. Essentially, the lottery ticket hypothesis comes down to taking a large model and reducing its complexity by pruning some of the connections in the model and then testing to see whether or not we get a model with better predictive power. The surprising thing is that as you get down to the key predictors, you often end up improving the overall performance of the machine learning system. If we can take that kind of approach and combine it with additional tests to see whether or not that system is also consistent to a degree with our understanding of how the biological system we're studying functions, then we have a better opportunity to build better, more predictive and more interpretable models. 
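As a rough illustration of that pruning idea, the sketch below performs one round of magnitude pruning in the lottery-ticket spirit: train a small network, keep only the largest weights, rewind the survivors to their initial values and retrain. The network, the pruning fraction and the elided training steps are hypothetical.

```python
import copy
import torch
import torch.nn as nn

def magnitude_prune(model, fraction):
    """Return a binary mask per weight tensor that zeroes out the smallest
    `fraction` of weights by magnitude -- the surviving 'winning ticket'."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:  # skip biases
            continue
        k = max(1, int(param.numel() * fraction))
        threshold = param.abs().flatten().kthvalue(k).values
        masks[name] = (param.abs() > threshold).float()
    return masks

def apply_masks(model, masks):
    """Keep only the pruned sub-network's connections."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])

# Lottery-ticket loop (illustrative): save the initialization, train,
# prune, rewind the surviving weights to their initial values, retrain.
model = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 1))
init_state = copy.deepcopy(model.state_dict())
# ... train(model) ...
masks = magnitude_prune(model, fraction=0.8)
model.load_state_dict(init_state)   # rewind to initialization
apply_masks(model, masks)           # keep only the winning ticket
# ... retrain, re-applying the masks after each optimizer step ...
```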

"Can we build a model where we can look at the predictions with the greatest weights and understand why those are influencing the outcomes?"


Is that where the role of the human expert comes in? 

I think this is one of the places where the no free lunch theorem from Wolpert and Macready demonstrates its power. Their 1997 paper on no free lunch theorems for optimization argues that there's no general-purpose optimization algorithm that works in all situations. If we think about machine learning or statistical model fitting or almost any place in which we’re trying to learn weights in a predictive model, I can adjust the model in a variety of ways to try to ensure the predictive power is as high as possible.

Wolpert and Macready argued that there is no general-purpose way to do that. The best way to get arbitrarily close to a near-optimal solution in finite time is to introduce prior knowledge into the model about how the system behaves. It’s similar to what I said earlier: combining something like the lottery ticket hypothesis with a guess of how the system behaves when building an objective function. That can be extraordinarily important because it may not only help improve the robustness of the predictive power, but it may also contribute to extending or improving the explainability of the system.


At PODD 2023, you said you were excited about spatial transcriptomics mixed with single cell transcriptomics. Where do you see these playing a role in precision medicine?

Single cell sequencing has been a tremendous new technology for understanding the complexity of the cell types that exist in biological systems. In oncology, we analyze lots of solid tumors. From the outside, we think of a tumor as being a bunch of cancer cells. But when we look at that tumor, it’s actually composed of many different cell types. Some are tumor cells. Some are surrounding epithelial tissue. Some are blood vessels. Some are invading lymphocytes. Even if we look at the tumor cells themselves, there may be different subpopulations that reflect different mutational states within the tumor. Single cell transcriptomics allows us to learn what individual cells look like in terms of their overall genomic profiles and particularly their gene expression profiles. We now have single cell atlases to build catalogs of different cell types. 

From this level of analysis, we can begin to understand the differences in the makeup of the tumor that may influence whether or not the tumor will respond to particular therapies. Are there enough invading immune cells that it's likely to respond with something like immunotherapy? Are they the right kind of immune cells? We can begin to explore these questions by combining laboratory and computational methods.


What about spatial transcriptomics?

Spatial transcriptomics is a relatively new technology that allows us to take pieces of tissue from a tumor biopsy and look at the patterns of gene expression across that tissue. While we don’t yet have single cell resolution, we can look at the different tissue types in little regions across the tumor. There are machine learning tools, like SPOTlight, where you can take a single cell atlas, learn the profiles of different cell types and then apply those profiles to the spatial data to predict which cell types are where. We’ve gone from a situation where we saw a tumor as monolithic to now seeing a tumor as a complex mix of things, and we can predict where those individual cell types are and tease apart the architecture of the tumor itself.
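A simplified illustration of that kind of deconvolution, assuming simulated data: estimate non-negative cell-type weights for each spatial spot from per-cell-type expression signatures learned from a single cell atlas. This is a stand-in for the idea rather than SPOTlight itself, which uses an NMF-based regression.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical inputs (illustrative shapes, not real data):
#   signatures: genes x cell_types, mean expression per cell type from an atlas
#   spots:      genes x spots, spatial transcriptomics counts
n_genes, n_types, n_spots = 500, 6, 100
rng = np.random.default_rng(0)
signatures = rng.gamma(2.0, 1.0, size=(n_genes, n_types))
spots = rng.poisson(signatures @ rng.dirichlet(np.ones(n_types), n_spots).T)

# For each spot, find the non-negative mixture of cell-type signatures that
# best explains the observed expression, then normalize to proportions.
proportions = np.zeros((n_spots, n_types))
for j in range(n_spots):
    weights, _ = nnls(signatures, spots[:, j].astype(float))
    total = weights.sum()
    proportions[j] = weights / total if total > 0 else weights

print(proportions[0])  # estimated cell-type mix of the first spot
```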

I’m involved with a project now with a PhD student, Lauren Su, and with Judith Agudo of the Dana-Farber Cancer Institute looking at the context of tumors. Even among tumor cells there are different cell types, like quiescent cells that look like they’ve gone to sleep and other rapidly dividing cells. By combining spatial transcriptomics and single cell data, we can see that the quiescent cells don’t intermingle, how the infiltrating immune cells differ and how the interplay and intercommunication between cell types establishes the tumor system. This can inform our understanding of how disease develops and also gives us opportunities to think about better ways to intervene.

"Every great scientific advance, from Galileo to sequencing genomes, has been driven by access to data."


Can you tell us about some of the other work you are currently leading?

We're getting to the point now in combining data and methods where we can think about how diseases develop and progress. In oncology, we’d love to be able to look at a normal, healthy cell and follow it as it develops and progresses into a tumor cell. But we don’t know which cells in a person’s body will, over time, progress into tumors. And if someone had an early stage tumor, it would be unethical to follow it over time as it develops. All we really get is the static snapshot when a tumor is detected.

But if we think about a population of individuals, for example, people with lung tumors, we can look at distributions. One of the greatest risk factors for lung tumors is smoke exposure, so there is a distribution of smokers who have cancer, who don’t have cancer yet and who might never develop cancer. If I look at early stage tumors, I’m not going to see a monolithic group, but a distribution of individuals. We are working on ways to order individuals along that progression and build machine learning methods that approximate a dynamic time course and model progression.
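A crude illustration of that ordering idea, assuming simulated cross-sectional expression data: project each patient onto a leading axis of variation and treat the position along that axis as an approximate progression. Real pseudotime and trajectory-inference methods are considerably more sophisticated, but the principle of replacing an impossible longitudinal study with an ordered cross-section is the same.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical cross-sectional data: one expression profile per patient,
# spanning normal tissue through early- and late-stage tumors (simulated).
rng = np.random.default_rng(1)
n_patients, n_genes = 120, 300
true_stage = rng.uniform(0, 1, n_patients)  # unobserved progression
expression = np.outer(true_stage, rng.normal(size=n_genes)) \
             + rng.normal(scale=0.5, size=(n_patients, n_genes))

# A crude pseudotime: the ordering of patients along the leading principal
# component serves as an approximate "time course" of progression.
pseudotime = PCA(n_components=1).fit_transform(expression).ravel()
order = np.argsort(pseudotime)
print("first five patients along the inferred progression:", order[:5])
```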

If you think about trying to learn the rules of American football, you can send someone to a game and have them take photos. If they take those photos and mix them all up, it’s going to be hard to learn the rules of the game. If you order them in progression, even though there will be gaps, you can start to tease out the different rules or the principles of the game. 

We just published a paper in Genome Biology on a method we call PHOENIX that sits at this intersection between machine learning and the no free lunch theorem. First, we introduce a kinetic model of the gene regulatory processes that happen over time, like transcription factors binding to DNA and altering gene expression. Then we guess at the structure of the network given other sources of data. We use that as a soft constraint to build models that are explainable and capture the essence of our understanding of the system yet are highly predictive. In the paper, we describe the application of this model first using simulated data, then using data from the model organism yeast and finally using data from breast cancer samples ordered in approximate time – pseudotime – to develop gene regulatory network models that capture the temporal changes in the progression of the system.
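As a highly simplified sketch of the general idea, not the published PHOENIX implementation, the snippet below fits a toy model of expression dynamics along pseudotime while softly penalizing regulatory edges that an assumed prior network does not contain. All data, dimensions and penalty weights are illustrative.

```python
import torch
import torch.nn as nn

# Toy setup: learn influences W such that the next expression state is
# approximately x + dt * tanh(W x), while discouraging edges absent from
# an assumed prior regulatory network (soft constraint, not a hard one).
n_genes, n_timepoints = 50, 20
prior_network = (torch.rand(n_genes, n_genes) > 0.9).float()  # 1 = plausible edge
expr = torch.randn(n_timepoints, n_genes)                     # pseudotime-ordered data
dt = 1.0

W = nn.Parameter(torch.zeros(n_genes, n_genes))               # learned influences
optimizer = torch.optim.Adam([W], lr=1e-2)
lam = 0.05

for step in range(500):
    optimizer.zero_grad()
    # Euler step: predict the next expression state from the current one.
    pred_next = expr[:-1] + dt * torch.tanh(expr[:-1] @ W.T)
    fit = ((pred_next - expr[1:]) ** 2).mean()
    # Soft constraint: penalize edges the prior network says should not exist.
    prior_penalty = (W.abs() * (1 - prior_network)).sum()
    loss = fit + lam * prior_penalty
    loss.backward()
    optimizer.step()
```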

By introducing these two no free lunch theorem-inspired constraints, we have been able to scale this to the analysis of the whole genome. By combining these different pieces, we are able to maintain predictive power while gaining the explainability that comes from the model. I find this exciting because we can use this model to learn about key points and control modes in the development of disease and opportunities where we can intervene and block tumor progression.


How will the field of computational approaches to drug development evolve?

While I think we are going to see rapid progress, we are still in the hype cycle for the application of machine learning in a lot of different areas. It’s easy to leap to conclusions and say that we will solve all of these problems immediately. For example, ChatGPT is an amazing tool, but it is subject to hallucinations. We have to safeguard against that happening by introducing soft constraints and guidance in our systems to make them better. By doing that, our models will become more predictive and more reliable.

Five years from now, we are going to have much deeper biological knowledge of the systems we’re trying to study. Every great scientific advance, from Galileo to sequencing genomes, has been driven by access to data. And those data are made available by the development and application of new technologies, like the telescope. In the next five years, we will gain access to data on chromatin accessibility from techniques like ATAC-seq, on patterns of DNA methylation and on quantitative protein expression.

I’m also really excited about quantitative phenotyping. One of my colleagues, JP Onnela, is involved in something called “digital phenotyping” where he uses apps on smartphones to collect data that address questions like “how depressed are you?” in a quantifiable manner. Becoming more quantitative in how we measure biological states is going to help us in our quest to understand how those states develop.

The last piece is that we are going to see an acceleration of the timeline in which we identify particular drug targets, develop drug candidates, test them and then optimize them. That will be driven by smarter approaches to the development pipeline, but also by smarter approaches to running trials in laboratory systems and animals as well as in clinical trials.

