RiboSplitter

RNA alternative splicing protein translation prediction

In order to gain meaningful insight into the biological consequences of alternative splicing events, we need to know the relative differences between the two isoforms of the alternative splicing event at the protein level, which exons are involved relative to the full transcript, and what protein domains are encoded by the involved exons. RiboSplitter was created to address these issues.

RiboSplitter will:

Apply filtering to alternative splicing events and run differential testing
Predict relative protein differences between the two isoforms of each event
Create 3 figures: a zoomed-in view to show details and protein changes of the event, a zoomed-out view to show the location of the event relative to the full transcript, and a figure with protein domains aligned to exons.

Citation

Najjar, R., Mustelin, T. Prediction of alternative pre-mRNA splicing outcomes. Sci Rep 13, 20000 (2023). https://doi.org/10.1038/s41598-023-47348-6

Requirements

R (4.2.1) libraries: tidyverse (2.0.0), Biostrings (2.66.0), biomaRt (2.54.1), patchwork (1.1.3), rhdf5 (2.42.1), aod (1.3.2)
bedtools (2.30.0)

Use

First run SplAdder (3.0.3)

Clone the repository, or load functions from raw url

source ([raw url ribosplitter.R])

Two datasets are required to start:

A dataset of details of alternative splicing events including gene name, event ID, genomic locations of involved exons. This will be called 'details'. These are SplAdder confirmed events. Provide the dir path to SplAdder output

details= read_details (dir)

A dataset of read counts supporting isoforms 1 and 2 for each sample per event plus percent spliced-in (PSI). This will be called 'isoforms'. PSI is set to missing when isoform counts are <10. Provide the dir path to SplAdder output and prefix of the sample names

isoforms= read_isoforms (dir, 'sample_prefix')

Note 1: SplAdder's naming convention is to name the longer isoform as isoform 2. This works for most event types but I think it creates confusion for mutually exclusive events where either of the middle exons can be longer, or they can have the same length. I solve this by renaming them so that isoform 1 is always the first one in genomic location and by adjusting the PSI accordingly.

Note 2: If using a different tool than SplAdder, you can still run RiboSplitter by creating the two datasets above from the tool's output. Include the following variables in the details dataset: gene_name, event_id, chrm, strand, type; start:stop genomic positions for involved exons (e1, e2, e3, e4); and for multiple exon skip events only, e2_starts contains colon separated start positions for all skipped exons, and e2_ends contains the end positions for multiple skipped exons. Format the isoforms dataset to include: event_id, sample, iso1, iso2, iso_total, and psi.

Next, merge your metadata with the isoforms dataset (adding a variable called "group" that classifies the samples into "disease" and "control")

Running pipeline

Use ribosplitter function to run full pipeline: filtering, statistical testing, multiple comparison adjustment, reading frame prediction, protein translation, amino acid differences between isoforms, and generate event genomic names.

Input parameters

"isoforms" and "details" dataframes. "isoforms" must contain a "group" variables with 2 values: "disease" and "control"
min_disease: alternative splicing events with this minimum number of disease samples will be included
min_control: alternative splicing events with this minimum number of control samples will be included
sd_cutoff: The lowest PSI SD for an event to be included
dir: path to save output files
ref_fasta: path to reference genome fasta (needed to extract exon DNA sequences)
q_cutoff: alpha value for q values (p values adjusted for multiple comparisons)
ensembl_version: Make sure this corresponds to the reference genome you use.

After filtering (events that are non-coding, have low variability, or have many missing values), RiboSplitter uses a beta binomial model with overdispersion to allow for variability by group, with p values adjusted for multiple comparisons using the BH methodology.

Not all events will be present in all samples. This could reflect real biology (e.g. disease heterogeneity), or incomplete/low sequencing coverage in some samples for specific splicing events. Therefore, we must decide how many samples per group are required for the event to be included. Additionally, low-variability events are excluded, based on PSI SD, I recommend using a low cutoff (e.g. sd_cutoff=0.05)

Example

de_events= ribosplitter (isoforms_df=isoforms, details_df=details,
                min_disease=10, min_control=10, sd_cutoff=0.05,
                dir='~/rna/sjogrenB',
                ref_fasta='~/rna/nw/GRCh38.primary_assembly.genome.fa',
                q_cutoff=0.05, ensembl_version=108)

Output

The output dataset will have the following elements:

Variable(s)	Description
gene_name, gene_id, event_id	names for genes and splicing events
genomic_name	splicing event name based on chromosomal positions
novel	which isoforms are annotated vs not
n_control, n_disease	number of samples included for this event
avg_control, avg_disease	average percent spliced-in (PSI) by group
sd_control, sd_disease	standard deviation by group
pval, adj_pval	pvalues: raw and adjusted for multiple comparisons
frame, validframe	reading frames and their description (frame=9 means ambiguous reading frame)
iso1_dna, iso2_dna	DNA code for isoforms 1 and 2
iso1_aa, iso2_aa	translated amino acid sequences for isoforms 1 and 2
iso1_stop, iso2_stop	distance in amino acid letters from start of sequence to first stop codon
frameshift	relative frameshift introduced by the middle exons. 0 means no frameshift, 1 or 2 means a frameshift by 1 or 2 nucleotides
same_start	amino acid sequences that are the same in the two isoforms at the beginning
same_end	amino acid sequences that are the same in the two isoforms at the end
diff_iso1	amino acid sequences that are different in isoform 1 (compared to 2)
diff_iso2	amino acid sequences that are different in isoform 2 (compared to 1)

A dataset called "positions" will be saved in the provided "dir". This contains event genomic positions and is needed to generate figures.

Visualization

This function will create a zoomed-in view of alternative splicing events. Fig2 can be 'jitter' or 'dot'

f= splicing_figure(final_df, positions, isoforms, fig2='jitter')

# You can either save an individual figure or output multiple in one large figure as below
ggsave(paste0(out_dir,'/exons_figure.png'), plot=wrap_plots(f[1:30], ncol=3), 
       dpi=300,units='cm', height=45, width=60)

This figure will show the gene name, splicing event ID, chromosome, strand, and adjusted p value (q value). The x axis is genomic position. The upper left corner shows the relative frameshift between the two isoforms. The counts above and below the isoforms are reads supporting either isoform. Light blue exons mean they have the same amino acid sequences in both isoforms, while light green exons indicate relative differences in amino acid sequences. The adjacent figure shows PSI for all samples by group. Note that a PSI of 1 means there is 100% isoform 2 expression in that sample.

A 2nd figure can be generated as below that will zoom out and show the event parallel to its best fit transcript, so we can see the location of the splicing event.

f= splice_figure_ref_TX (final_df , positions3, ensembl_version=108)
ggsave(paste0(out_dir,'/exons_ref_transcript.png'), plot=wrap_plots(f[1:30], ncol=3),
           dpi=300,units='cm', height=40, width=60)

A 3rd figure aligns exons (grey) to protein domains

domain_fig('ENSG00000064012.24', ensembl_version=108)

# you can also create domain figures for all genes of interest as below:
domains_figs= domain_fig (final_df$gene_name)

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
LICENSE		LICENSE
README.md		README.md
RiboSplitter.R		RiboSplitter.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RiboSplitter

Citation

Requirements

Use

Running pipeline

Output

Visualization

About

Releases

Packages

Languages

License

R-Najjar/RiboSplitter

Folders and files

Latest commit

History

Repository files navigation

RiboSplitter

Citation

Requirements

Use

Running pipeline

Output

Visualization

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages