Author: Sofia Martello
Master's Thesis Project | TU Munich Campus Straubing | June 30, 2023
This repository contains the code and data for my master's thesis project on predicting interactions of m6A-regulated RNA-binding proteins. The aim is to understand how m6A methylation affects the binding preferences of RNA-binding proteins.
- RNA-seq for HEK293 cells (Sun et al. 2018)
- RNA-seq for HepG2 (Wold, ENCODE)
  - Dataset: (Wold, ENCODE)
- ENCODE eCLIP for protein-RNA interactions (Wold, ENCODE)
  - Dataset: (Wold, ENCODE)
- miCLIP for m6A modifications on HEK293 cells (Linder et al. 2015)
  - Dataset: (Linder et al. 2015), GSE63753_hek293.abcam.CIMS.m6A.9536.bed (processed)
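The processed miCLIP file is in BED format, so its m6A sites can be read with a few lines of Python. The following is a minimal sketch, assuming the file follows the standard tab-separated BED layout (chromosome, start, end, name, score, strand); the function name `read_bed_sites` and the two example records are made up for illustration and are not taken from the repository.

```python
import csv
import io
from collections import defaultdict


def read_bed_sites(handle):
    """Collect (start, end, strand) tuples per chromosome from a BED stream.

    Hypothetical helper: assumes standard tab-separated BED columns.
    """
    sites = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and optional track/browser/comment header lines.
        if not row or row[0].startswith(("track", "browser", "#")):
            continue
        chrom, start, end = row[0], int(row[1]), int(row[2])
        # Strand is the sixth BED column when present; "." means unknown.
        strand = row[5] if len(row) > 5 else "."
        sites[chrom].append((start, end, strand))
    return sites


# Two made-up records standing in for the real file contents.
example = io.StringIO(
    "chr1\t100\t101\tsite_a\t1\t+\n"
    "chr2\t500\t501\tsite_b\t1\t-\n"
)
m6a_sites = read_bed_sites(example)
print(dict(m6a_sites))
```

To use the sketch on the real data, pass an open file handle for the processed BED file instead of the `io.StringIO` object.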
For dataset preparation, consider three settings:
- Baseline:
  - Does not contain m6A information.
- Setting A:
  - Contains only sequences with at least one m6A site.
- Setting B:
  - Ratio 1:1 between sequences containing and not containing m6A sites.
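The three settings can be illustrated with a small sketch. It rests on several assumptions that are not confirmed by the repository: each record is a `(sequence, m6a_positions)` pair, the baseline is read as "the same sequences with the m6A annotation dropped", and the 1:1 ratio in Setting B is reached by random downsampling of the larger group. The function name `build_settings` and the example records are hypothetical.

```python
import random


def build_settings(records, seed=0):
    """Split records into Baseline, Setting A, and Setting B.

    records: list of (sequence, m6a_positions) pairs (assumed layout).
    """
    # Baseline: sequences only, m6A annotation dropped (assumed interpretation).
    baseline = [seq for seq, _ in records]

    with_m6a = [r for r in records if r[1]]
    without_m6a = [r for r in records if not r[1]]

    # Setting A: only sequences with at least one m6A site.
    setting_a = list(with_m6a)

    # Setting B: 1:1 ratio, here obtained by downsampling the larger group.
    n = min(len(with_m6a), len(without_m6a))
    rng = random.Random(seed)
    setting_b = rng.sample(with_m6a, n) + rng.sample(without_m6a, n)
    return baseline, setting_a, setting_b


# Made-up records; positions are 0-based indices of methylated adenosines.
records = [
    ("ACGUA", [0]),
    ("GGACU", [2]),
    ("UUUUU", []),
    ("CCGCC", []),
    ("UGGCU", []),
]
baseline, setting_a, setting_b = build_settings(records)
print(len(baseline), len(setting_a), len(setting_b))
```

Fixing the seed keeps the downsampled Setting B reproducible between runs.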
- Duplicate the environment:
  - Use `denbi-conda_env.yml` and `denbi-jupyterlab_env.yml`.
- Create a package 'src':
  - Utilize `encoding.py`, `filtering.py`, and `model.py`.
  - Install with `pip install -e .`
- Organize folders as suggested:
  |-- data
  |-- docs
  |-- results
  |-- scripts
  |-- src
  |-- tests
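The suggested folder layout can be created in one step. This is a minimal sketch using only the standard library; the function name `scaffold` is hypothetical, and a temporary directory stands in for the actual project root.

```python
import tempfile
from pathlib import Path

# Folder names taken from the suggested layout.
FOLDERS = ["data", "docs", "results", "scripts", "src", "tests"]


def scaffold(root):
    """Create the suggested folders under root and return their names."""
    root = Path(root)
    for name in FOLDERS:
        # exist_ok=True makes the call safe to repeat on an existing checkout.
        (root / name).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Demonstrated in a throwaway directory; pass the project root in practice.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold(tmp)
print(created)
```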
Run the Jupyter Notebooks in the following order:
- Preprocessing + map_gene_ids.R
- Encoding
- Plots
- Model
- Baseline_vs_Setting_A
- Baseline_vs_Setting_B
- Baseline_vs_Setting_A_aug
- Baseline_vs_Setting_B_aug
For the Baseline_vs_Setting notebooks, input either negative-1 (unbound sequences) or negative-2 (sequences bound to other RBPs) as class 0.
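The choice of negative set can be sketched as follows. This is an illustration only: the function `make_labeled_dataset`, the dictionary layout, and the example sequences are hypothetical, and treating the bound sequences as class 1 is an assumption, since only class 0 is described above.

```python
def make_labeled_dataset(bound, negative_sets, negative="negative-1"):
    """Combine bound sequences (class 1, assumed) with one negative set (class 0)."""
    if negative not in negative_sets:
        raise ValueError(f"unknown negative set: {negative!r}")
    chosen = list(negative_sets[negative])
    sequences = list(bound) + chosen
    labels = [1] * len(bound) + [0] * len(chosen)
    return sequences, labels


# Made-up sequences for illustration.
bound = ["GGACU", "AGACA"]
negative_sets = {
    "negative-1": ["UUUUU", "CCCCC"],  # unbound sequences
    "negative-2": ["GCGCG"],           # sequences bound to other RBPs
}

sequences, labels = make_labeled_dataset(bound, negative_sets, "negative-2")
print(sequences, labels)
```

Raising an error on an unknown key guards against typos such as `negative_1`, which would otherwise lead to an empty or wrong class 0.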