This project aims to streamline the early phases of drug discovery using AI and bioinformatics, focusing on identifying and evaluating potential drug candidates. Using the ChEMBL database, it searches for compounds that interact with a specific biological target, with Histone Deacetylase 1 as the primary example. The project assesses these compounds against Lipinski's Rule of Five and other molecular descriptors, and classifies them as active or inactive based on their IC50 values. The IC50 is the concentration of a compound required to inhibit a biological process by half, so it is a measure of potency: the lower the IC50, the more potent the compound.
Furthermore, the project uses PaDEL descriptors for a more in-depth analysis and trains a Random Forest regression model to predict the IC50 values of compounds. This approach helps identify promising drug candidates and can reduce research and development costs by narrowing down which compounds are worth testing experimentally.
To run this project, the following dependencies are required:
- ChEMBL and RDKit:
conda install -c rdkit rdkit -y
- Bash (either through Conda or Git Desktop):
conda install -c conda-forge bash
- TextWrap3:
pip install textwrap3
- Other Essential Libraries (Matplotlib, Seaborn, Pandas, Numpy, Scikit-Learn, SciPy):
pip install matplotlib seaborn pandas numpy scikit-learn scipy
- Introduction
  - Overview of the project, its objectives, and the methodology used.
- Getting Started and Example Inputs
  - How to get started, with suggested targets for trial runs: CHEMBL325, CHEMBL220, and CHEMBL3927, with CHEMBL325 as the primary example.
- Plotting
  - Details on how the Lipinski descriptor data are plotted, including examples.
- Regression
  - Explanation of how Random Forest regression is used to predict IC50 values.
- More Info and Credits
  - Additional resources and acknowledgments.
To get started, follow the installation steps to set up the environment and install the necessary dependencies. Next, select a target from the suggested list or choose one of interest to you. The process involves extracting data on compounds that interact with the target, analyzing their properties against Lipinski's Rule of Five, and using statistical and machine learning models to evaluate their potential as drug candidates.
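The classification and Rule of Five steps described above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the activity cut-offs (1,000 nM and 10,000 nM), the "intermediate" class, and all function names are assumptions made for this example, and the descriptor values are expected to be computed elsewhere (for instance with RDKit).

```python
import math

# Assumed cut-offs in nM; the README does not state the thresholds it uses.
ACTIVE_MAX_NM = 1_000.0
INACTIVE_MIN_NM = 10_000.0


def pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(x * 1e-9) = 9 - log10(x)
    return 9.0 - math.log10(ic50_nm)


def activity_class(ic50_nm: float) -> str:
    """Label a compound by IC50; 'intermediate' is an assumed middle band."""
    if ic50_nm <= ACTIVE_MAX_NM:
        return "active"
    if ic50_nm >= INACTIVE_MIN_NM:
        return "inactive"
    return "intermediate"


def lipinski_violations(mol_wt: float, log_p: float,
                        h_donors: int, h_acceptors: int) -> int:
    """Count how many of the four Rule of Five limits a compound exceeds."""
    return sum([
        mol_wt > 500,      # molecular weight above 500 Da
        log_p > 5,         # octanol-water partition coefficient above 5
        h_donors > 5,      # more than 5 hydrogen bond donors
        h_acceptors > 10,  # more than 10 hydrogen bond acceptors
    ])


# Hypothetical compound, values made up for illustration only.
print(activity_class(250.0), round(pic50(250.0), 2))
print(lipinski_violations(mol_wt=350.0, log_p=2.1, h_donors=2, h_acceptors=5))
```

Working on the pIC50 scale is a common choice because raw IC50 values span several orders of magnitude, while the negative logarithm spreads them more evenly.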
Data to try: CHEMBL325 (Histone deacetylase 1), CHEMBL220 (Homo sapiens acetylcholinesterase), CHEMBL3927 (SARS coronavirus 3C-like proteinase)
In this example, I used CHEMBL325:
The project plots the evaluated data with Matplotlib and Seaborn to visualize how active and inactive compounds are distributed and how they compare across different molecular descriptors. These plots help show which characteristics are associated with a compound's effectiveness and bioactivity.
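As a rough sketch of this kind of comparison, the snippet below draws a box plot of one descriptor grouped by bioactivity class, using Matplotlib only. The function name `plot_descriptor_by_class` and the molecular weight values are made up for illustration; the project's own plots may be built differently.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt


def plot_descriptor_by_class(values_by_class, descriptor_name):
    """Draw one box per bioactivity class for a single descriptor."""
    labels = list(values_by_class)
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.boxplot([values_by_class[label] for label in labels])
    # boxplot places the boxes at positions 1..n by default
    ax.set_xticks(range(1, len(labels) + 1))
    ax.set_xticklabels(labels)
    ax.set_xlabel("Bioactivity class")
    ax.set_ylabel(descriptor_name)
    return fig


# Made-up molecular weights, for illustration only.
example = {
    "active": [320.4, 410.2, 389.9, 455.1],
    "inactive": [510.7, 298.3, 602.5, 475.0],
}
fig = plot_descriptor_by_class(example, "Molecular weight (Da)")
```

Calling `fig.savefig(...)` with a file name writes the figure to disk.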
Using PaDEL descriptors and Random Forest regression, the project aims to predict the IC50 values of compounds. The model learns to relate the descriptors (features) to the IC50 values (target) across the training dataset. A Random Forest builds many decision trees, each trained on a random subset of the data and features, and averages their predictions. This randomness makes the model more robust and less prone to overfitting the training data. The resulting predictions give an early estimate of a compound's viability as a drug candidate before laboratory testing, which can make screening cheaper and faster than relying on experimental assays alone.
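The regression step can be illustrated with scikit-learn. This is a sketch under stated assumptions rather than the project's pipeline: the feature matrix is random binary data standing in for PaDEL fingerprints, the target is a synthetic pIC50-like value, and the hyperparameters (100 trees, an 80/20 split) are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for a fingerprint matrix: 200 compounds x 50 binary bits.
X = rng.integers(0, 2, size=(200, 50))
# Synthetic target: depends on the first three bits plus a little noise.
y = (5.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]
     + rng.normal(0.0, 0.2, size=200))

# Hold out 20% of the compounds to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each tree is fit on a bootstrap sample; predictions are averaged over trees.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R^2 on held-out compounds: {r2:.2f}")
```

Scoring on held-out compounds rather than the training set matters here, because a forest can fit its training data closely even when it generalizes poorly.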
This project is intended as a practical tool for drug discovery, using bioinformatics and artificial intelligence to streamline the search for and evaluation of new drug candidates. By reducing the amount of experimental testing needed in the early stages, it shows promise for speeding up and lowering the cost of developing effective treatments for a variety of conditions.
This project was inspired by resources and tutorials from Data Professor on YouTube and machinelearningmastery.com.
Written by William Huang