Over the past decade, machine learning models have cemented themselves as one of the cornerstones of the computational revolution in drug discovery. Despite substantial progress, several challenges remain, and machine learning methods have yet to achieve their full transformative potential. One of the most pressing issues is the limited availability of high-quality, uniformly sourced experimental data that covers diverse chemical space. Without such data, AI models are limited in quality and generalisability.
Enter physics-based simulations. These simulations, specifically free energy calculations, can predict highly accurate binding data for potential drug molecules at a fraction of the time and cost of real-world experiments. By modelling how molecules interact with protein targets at the atomic level, simulation-based methods massively outperform simpler approaches to binding energy prediction, such as molecular docking. In particular, they are able to account for the flexibility of both the drug and the protein as well as how water molecule dynamics influence binding. However, free energy calculations have traditionally been difficult to set up and require access to substantial amounts of computational resources.
At Evariste, we have developed an incredibly fast and highly automated free energy pipeline that can generate huge swathes of uniformly sourced binding data on the fly. Integrating this data into Frobenius Discovery allows our models to reach even greater predictive accuracies and better navigate sparsely explored chemical landscapes, focussing on the most promising areas of search.
As mentioned, the predictive power of traditional approaches to free energy calculations comes at a hefty computational cost. The mainstay of these calculations, relative binding free energy perturbation (FEP), demands a large number of lengthy simulations. Furthermore, setting up these simulations often requires laborious planning and careful execution by domain experts, with rigorous testing and meticulous strategy adjustments necessary to avoid incorrect or misleading results.
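For readers less familiar with the quantity being computed, a relative binding free energy compares two ligands, A and B, through a thermodynamic cycle. The identity below uses standard textbook notation; the symbol names are not taken from the pipeline described here.

```latex
\Delta\Delta G_{\text{bind}}(A \to B)
  = \Delta G_{\text{bind}}(B) - \Delta G_{\text{bind}}(A)
  = \Delta G_{\text{complex}}(A \to B) - \Delta G_{\text{solvent}}(A \to B)
```

Because the ligand has to be transformed both in the protein complex and in solvent, each ligand pair needs two sets of simulations, which is part of why the cost adds up so quickly.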
Non-Equilibrium Switching (NES) offers a powerful, cost-efficient alternative. To briefly summarise, NES leverages short (80 × 100 picosecond), non-equilibrium transitions between two equilibrated end states of a protein-ligand system[4]. Critically, this approach is both accurate and significantly faster than standard equilibrium FEP methodologies.
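To make the idea concrete, here is a minimal sketch of how the work values recorded during the switching transitions can be turned into a free energy estimate. It uses Jarzynski's equality, a textbook one-directional estimator. This is an assumption chosen for brevity: this post does not say which estimator the pipeline uses, and bidirectional estimators that combine forward and reverse transitions are common in practice. The function name and the default temperature are illustrative.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def jarzynski_free_energy(work_values, temperature=298.15):
    """Estimate a free energy difference (kcal/mol) from non-equilibrium work.

    Implements Jarzynski's equality, dG = -kT * ln(mean(exp(-W / kT))),
    where each W is the work (kcal/mol) from one switching transition.
    """
    if not work_values:
        raise ValueError("at least one work value is required")
    kt = K_B * temperature
    exponents = [-w / kt for w in work_values]
    # Log-sum-exp trick: subtract the largest exponent before exponentiating
    # so that large work values do not overflow or underflow.
    largest = max(exponents)
    total = sum(math.exp(e - largest) for e in exponents)
    log_mean = largest + math.log(total / len(exponents))
    return -kt * log_mean
```

The exponential average is dominated by the lowest work values, so the estimate never exceeds the mean work. That sensitivity to rare low-work transitions is one reason many short transitions are run rather than a handful.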
Where conventional methods typically consume at least 50 nanoseconds of simulation time per calculation (minimally 5ns simulations at 10 different lambda windows[2, 3]), NES can achieve comparable accuracy in roughly 20 nanoseconds, a reduction of about 60%. Evariste’s “Node-based NES” protocol, which will be the subject of a future blog post, further improves both the speed and accuracy of the calculations. These calculations are massively parallelisable, as each of the 100ps transitions can be run as an independent simulation. Evariste’s acceptance into Google Cloud and NVIDIA’s startup programs has provided us with access to extensive on-demand cloud computing resources, making NES's parallelisability a game-changer for accelerating our workflows.
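The arithmetic behind the 60% figure, and the reason independent transitions parallelise so well, can be sketched in a few lines. The simulation lengths come from the numbers quoted above; `run_transition` is a hypothetical placeholder standing in for whatever launches one switching simulation, and the thread pool is only a local stand-in for a cloud job scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

FEP_WINDOWS = 10          # lambda windows in the equilibrium FEP baseline
FEP_NS_PER_WINDOW = 5.0   # minimum simulation length per window, in ns
NES_TOTAL_NS = 20.0       # approximate NES cost per calculation, in ns
N_TRANSITIONS = 80        # number of switching transitions
TRANSITION_PS = 100.0     # length of each transition, in ps


def fractional_saving(baseline_ns, new_ns):
    """Fraction of simulation time saved relative to the baseline."""
    return 1.0 - new_ns / baseline_ns


def run_transition(index):
    """Hypothetical placeholder: a real version would launch one simulation."""
    return {"transition": index, "length_ps": TRANSITION_PS}


def run_all_transitions(n_transitions=N_TRANSITIONS, max_workers=8):
    """Dispatch every transition independently; none depends on another."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_transition, range(n_transitions)))


fep_ns = FEP_WINDOWS * FEP_NS_PER_WINDOW          # 10 windows x 5 ns
saving = fractional_saving(fep_ns, NES_TOTAL_NS)  # fraction of time saved
results = run_all_transitions()
print(f"FEP baseline: {fep_ns:.0f} ns, saving with NES: {saving:.0%}")
```

Since no transition needs the output of another, wall-clock time is limited by the number of workers available rather than by the total simulation time.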
A common challenge in free energy calculations is the reliance on high-quality starting structures, typically sourced from experimentally determined protein-ligand crystal complexes. Without these crystal structures, users often resort to docking each new compound individually, hoping the predicted pose resembles the true binding mode. This process can be both time-consuming and error-prone, and suffers from poor ligand-receptor complementarity as well as non-optimised water networks in the binding site, leading to worse estimates of binding energy.
To overcome this hurdle and ensure the robust generation of ligand poses without crystal structures, we developed an approach called the Dynamic Pose Database (PoseDB). The PoseDB stores a library of protein-ligand complex snapshots generated via an extensive simulation pipeline. This pipeline captures conformational flexibility and water dynamics, effectively mimicking the protein-ligand interactions that occur in vivo and that crystal structures capture.
For any new target molecule, our platform selects the best-matching entry from the PoseDB and aligns the new compound to the pre-validated pose. This eliminates the need to perform individual pose generation for new target molecules, reducing setup time. More importantly, it enables accurate free energy calculations even in the absence of crystal structures. In backtests on our lead project where we compare this method to utilising publicly available crystal structures, the enhanced pipeline improves the experimental vs predicted Pearson r value from 0.28 to 0.65 and reduces the RMSE from 3.13 kcal/mol to 2.12 kcal/mol.
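For reference, the two metrics quoted above can be computed as follows. This is a minimal, dependency-free sketch; the function names are illustrative, and the example values are made up purely to show the call pattern and are not data from the backtests.

```python
import math


def pearson_r(experimental, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(experimental)
    mean_x = sum(experimental) / n
    mean_y = sum(predicted) / n
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in zip(experimental, predicted))
    var_x = sum((x - mean_x) ** 2 for x in experimental)
    var_y = sum((y - mean_y) ** 2 for y in predicted)
    return covariance / math.sqrt(var_x * var_y)


def rmse(experimental, predicted):
    """Root-mean-square error, in the same units as the inputs."""
    squared_errors = [(x - y) ** 2 for x, y in zip(experimental, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up binding free energies in kcal/mol, for illustration only.
experimental = [-9.1, -8.4, -10.2, -7.5]
predicted = [-8.7, -8.9, -9.6, -7.9]
print(f"r = {pearson_r(experimental, predicted):.2f}, "
      f"RMSE = {rmse(experimental, predicted):.2f} kcal/mol")
```

The two metrics answer different questions: Pearson r measures how well the predictions rank compounds relative to one another, while RMSE measures how far the predicted values are from experiment in absolute terms. A constant offset leaves r untouched but raises the RMSE.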
The data generated by these rapid free energy calculations is stored in a database used to train our machine learning models. Traditional ML models, often limited to 2D ligand data, can overlook critical structural insights. By integrating free energy data, Frobenius Discovery incorporates dynamic, structural information about protein-ligand complexes, enhancing predictive accuracy. The top-ranked compounds are then fed back into the NES pipeline, which refines and confirms the predictions. Thanks to this integration, we can evaluate millions of potential compounds in seconds using Frobenius Discovery and rapidly prioritise the most promising candidates.