Machine learning on multiple epigenetic features reveals H3K27Ac as a driver of gene expression prediction across patients with glioblastoma.
Two patient epigenetic marker files are provided in the data folder. Specifically, the latest versions of these GSC measurements are located in the latest_versions_of_all_raw subfolder. Additionally, the ind_shuffle.npy file is provided. This file was used to create consistent dataset splits for train, validation, and test sets.
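As a rough illustration of why a saved index file keeps splits consistent, the sketch below stores a permutation once and reuses it to reorder data identically on every run. It assumes ind_shuffle.npy holds a permutation of row indices. The file name ind_shuffle_demo.npy, the function apply_saved_shuffle, and the toy array are hypothetical and are not taken from the project's scripts.

```python
import os
import tempfile

import numpy as np


def apply_saved_shuffle(data, index_path):
    """Reorder the rows of `data` using a permutation stored in a .npy file."""
    ind_shuffle = np.load(index_path)
    return data[ind_shuffle]


# Toy stand-in for a data array with 10 rows (hypothetical values).
toy_data = np.arange(10) * 10

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create and save a permutation once (demo file, not the project's file).
    demo_path = os.path.join(tmp_dir, "ind_shuffle_demo.npy")
    np.save(demo_path, np.random.default_rng(seed=0).permutation(len(toy_data)))

    # Every later run that loads the same file gets the same row order.
    first_run = apply_saved_shuffle(toy_data, demo_path)
    second_run = apply_saved_shuffle(toy_data, demo_path)
```

Because the order comes from the file rather than from a fresh random draw, any split boundaries applied afterwards select the same rows each time.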
Clone this repository to the local filesystem using the link provided by the "Code" dropdown button above. For example:
git clone https://github.com/rsinghlab/ML_epigenetic_features_glioblastoma.git
Change the current working directory to the folder created by the clone process:
cd ./ML_epigenetic_features_glioblastoma
We recommend creating a virtual environment so that the required packages and libraries can be installed without potential conflicts with packages already installed on the system. In the example here, the virtual environment is given the same name as the project folder.
python3 -m venv ML_epigenetic_features_glioblastoma
Activate the new Python environment.
source ./ML_epigenetic_features_glioblastoma/bin/activate
You can now install packages into the new environment using the included requirements.txt file.
pip3 install -r requirements.txt
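If it is unclear whether the virtual environment was actually active when the packages were installed, a small check like the following may help. It is only a sketch; the function name in_virtual_environment is hypothetical and not part of this repository.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtual_environment())
```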
The project's cross-patient prediction models are available in the following locations:
XGBoost (XGBR) code/models/xgboost/xgboost_cross_patient_pred_regression_gsc_stem_standard_log2.py
"Branched" Multi-layered Perceptron ("Branched" MLP) code/models/mlp/mlp_cross_patient_regression_gsc_stem_sequence_standard_log2.py
Multi-layered Perceptron (MLP) code/models/mlp/mlp_cross_patient_regression_gsc_stem_standard_log2.py
Convolutional Neural Network (CNN) code/models/cnn/cnn_cross_patient_pred_regression_gsc_stem_standard_log2.py
Recurrent Neural Network (RNN) code/models/rnn/rnn_models.py
Gradient Boosting Regression (GBR) code/models/gbr/gbr_cross_patient_pred_regression_gsc_stem_standard_log2.py
Support Vector Machine (SVR) code/models/svm/svm_cross_patient_pred_regression_gsc_stem_standard_log2.py
Multiple Linear Regression (MLR) code/models/mlr/mlr_cross_patient_pred_regression_gsc_stem_standard_log2.py
Each model script is run with the following arguments:
A) The script's path and filename.
B) The first data file's path and filename. This script creates the model's training and validation (or training only) sets from this file.
NOTE: The creation of a validation set is controlled by the validation = True or False statement in the script's main() function. The proportions given to each set are specified in the get_data_patient_1 function under the comments #HYPERPARAMETER TUNING SPLITS and TESTING SPLITS.
C) The second data file's path and filename.
D) The ind_shuffle.npy file (or equivalent) mentioned in the "Datasets" section above.
E) The absolute or relative directory path where the various script functions will direct model output, predictions, and visualizations. If no directory is specified by the user, then the default directory './cross_patient_regression_using_xgboost_results_and_figures' will be used.
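The validation switch described in the note under argument B can be pictured with the sketch below. It is not the project's get_data_patient_1 function: the name split_first_patient, the 80/20 proportion, and the choice to send all rows to training when validation is off are assumptions made for illustration.

```python
import numpy as np


def split_first_patient(data, ind_shuffle, validation=True, train_frac=0.80):
    """Split the first patient's rows into training and validation sets.

    With validation=False, every row goes to training (assumed behavior).
    The 0.80 proportion is hypothetical, not the project's setting.
    """
    shuffled = data[ind_shuffle]  # apply the fixed shuffle order first
    if not validation:
        return shuffled, None
    train_end = int(round(shuffled.shape[0] * train_frac))
    return shuffled[:train_end], shuffled[train_end:]


# Toy stand-in: 20 rows x 3 features, with a fixed-seed permutation
# taking the place of a loaded shuffle index.
toy_data = np.arange(60).reshape(20, 3)
toy_index = np.random.default_rng(seed=0).permutation(toy_data.shape[0])

train, val = split_first_patient(toy_data, toy_index, validation=True)
train_only, no_val = split_first_patient(toy_data, toy_index, validation=False)
```

Keeping the switch as a single boolean means the same function serves both hyperparameter tuning (with a validation set) and final testing (training on all rows of the first file).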
NOTE (10/18/24): The 'script output save directory' argument and functionality is specific to the XGBoost, Support Vector Machine, and Gradient Boosting Regression model scripts. This functionality is planned for implementation in the other scripts. For now, only arguments A-D are active for the other scripts.
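One way to picture how arguments A-E and the default save directory fit together is the sketch below. It assumes the arguments are passed positionally in the order listed, which this README does not confirm. The function parse_script_arguments, the dictionary keys, and the example file names are hypothetical; only the default directory string is taken from the description of argument E.

```python
DEFAULT_SAVE_DIR = "./cross_patient_regression_using_xgboost_results_and_figures"


def parse_script_arguments(argv):
    """Map a sys.argv-style list onto arguments A-E (hypothetical names)."""
    if len(argv) < 4:
        raise ValueError("expected at least arguments A-D")
    return {
        "script": argv[0],              # A) script path and filename
        "first_data_file": argv[1],     # B) training/validation data
        "second_data_file": argv[2],    # C) second data file
        "shuffle_index_file": argv[3],  # D) ind_shuffle.npy or equivalent
        # E) optional save directory, falling back to the default
        "save_directory": argv[4] if len(argv) > 4 else DEFAULT_SAVE_DIR,
    }


# Example with placeholder file names and no save directory given.
example = parse_script_arguments(
    ["model_script.py", "patient_1_data.npy", "patient_2_data.npy", "ind_shuffle.npy"]
)
```

In this sketch, omitting the fifth argument leaves example["save_directory"] set to the default directory.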