-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy pathlab_download.Rmd
312 lines (259 loc) · 14.5 KB
/
lab_download.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
---
title: "Annotation data"
subtitle: Workshop on RNA-Seq
output:
bookdown::html_document2:
highlight: textmate
toc: true
toc_float:
collapsed: true
smooth_scroll: true
print: false
toc_depth: 4
number_sections: true
df_print: default
code_folding: none
self_contained: false
keep_md: false
encoding: 'UTF-8'
css: "assets/lab.css"
include:
after_body: assets/footer-lab.html
---
```{r,child="assets/header-lab.Rmd"}
```
```{r,include=FALSE, eval = FALSE}
library(dplyr)
library(biomaRt)
```
# Genomic data
Reference genomic data for your projects are available from Ensembl. This is usually the latest build of the genome, transcriptome etc as well as the annotations in GTF or GFF format. Most common organisms are available from [ensembl.org](https://www.ensembl.org/index.html). You can select the organism and then click on **Download FASTA**/**Download GTF/GFF** which takes you to the FTP site.
You can also go directly to their FTP site [https://www.ensembl.org/info/data/ftp/index.html](https://www.ensembl.org/info/data/ftp/index.html) where you can select the type of data you need, and then select the organism. For eg; **homo_sapiens**, under which you find **cdna**, **cds**, **dna**, **dna_index**, **ncrna** and **pep**. Under **dna**, the FASTA files are available as full genome or as separate chromosomes. Each of them are again available as regular (repeat content as normal bases), soft-masked (sm, repeat content in lowercase) or repeat-masked (rm, repeat content as Ns). Full genomes are also available as **primary assembly** or **top-level**. **Primary assembly** is what most people would need. The **top-level** is much larger in size and contains non-chromosomal contigs, patches, haplotypes etc. This is significantly larger in size compared to the primary assembly.
<i class="fas fa-exclamation-circle"></i> Clades such as metazoa, protists, bacteria, fungi and plants are available through separate ensembl websites. These are listed on [http://ensemblgenomes.org/](http://ensemblgenomes.org/).
## GTF/GFF
The GTF (General Transfer Format) format is one line per feature, described in 9 columns of data.
The GFF (General Feature Format) (GFF3) is identical to GTF version 2.
![](data/GTF_Example.png)
For conversions and extracting features from either of them you can use [AGAT](https://github.com/NBISweden/AGAT).
# Biomart
By using biomaRt you can access three main databases available in Ensembl (core, compara, and variation). Ensembl uses MySQL to store information and the tables within each database is accessible and searchable by using R package **biomaRt**.
```{r, out.width = "85%", fig.align="center", fig.cap = "Ensembl core schema.", echo = F, crop = F}
library(knitr)
include_graphics("data/Core.assembly_tables.png")
```
## Genes
In core database we can download annotation data. Annotations refer to known features (verified experimentally or predicted) in the genome. Usually, our features of interest in RNA-Seq are genes, their IDs, position in the genome, gene biotype (protein coding, non-coding etc) etc. We will also use the **dplyr** package to pipe data through functions.
```{r, eval = FALSE}
library(biomaRt)
listMarts()
```
```
biomart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 109
2 ENSEMBL_MART_MOUSE Mouse strains 109
3 ENSEMBL_MART_SNP Ensembl Variation 109
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 109
```
We will use the code below to find the name of mouse ensembl genes dataset under ensembl mart.
```{r, eval = FALSE}
mart <- useMart("ENSEMBL_MART_ENSEMBL")
ds <- as.data.frame(listDatasets(mart=mart))
# find all rows in dataset 'ds' where column 'description' contains the string 'mouse'
ds %>% filter(grepl("mouse",tolower(description)))
```
```
dataset description version
1 mcaroli_gene_ensembl Ryukyu mouse genes (CAROLI_EIJ_v1.1) CAROLI_EIJ_v1.1
2 mmurinus_gene_ensembl Mouse Lemur genes (Mmur_3.0) Mmur_3.0
3 mmusculus_gene_ensembl Mouse genes (GRCm39) GRCm39
4 mpahari_gene_ensembl Shrew mouse genes (PAHARI_EIJ_v1.1) PAHARI_EIJ_v1.1
5 mspicilegus_gene_ensembl Steppe mouse genes (MUSP714) MUSP714
6 mspretus_gene_ensembl Algerian mouse genes (SPRET_EiJ_v1) SPRET_EiJ_v1
7 pmbairdii_gene_ensembl Northern American deer mouse genes (HU_Pman_2.1) HU_Pman_2.1
```
Now that we know the name of the dataset, we can list all the columns (filters) in this dataset.
```{r, eval = FALSE}
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset(mart=mart,dataset="mmusculus_gene_ensembl")
la <- listAttributes(mart=mart)
head(la)
```
```
name description page
1 ensembl_gene_id Gene stable ID feature_page
2 ensembl_gene_id_version Gene stable ID version feature_page
3 ensembl_transcript_id Transcript stable ID feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5 ensembl_peptide_id Protein stable ID feature_page
6 ensembl_peptide_id_version Protein stable ID version feature_page
```
One can also search for attributes of interest.
```{r, eval = FALSE}
searchAttributes(mart=mart,pattern="entrez")
```
```
name description page
52 entrezgene_trans_name EntrezGene transcript name ID feature_page
65 entrezgene_description NCBI gene (formerly Entrezgene) description feature_page
66 entrezgene_accession NCBI gene (formerly Entrezgene) accession feature_page
67 entrezgene_id NCBI gene (formerly Entrezgene) ID feature_page
```
We create a vector of our columns of interest.
```{r,eval=FALSE}
myattributes <- c("ensembl_gene_id",
"entrezgene_id",
"external_gene_name",
"chromosome_name",
"start_position",
"end_position",
"strand",
"gene_biotype",
"description")
```
We then use this to download our data. Note that this can be a slow step.
```{r,eval=FALSE}
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset(mart=mart,dataset="mmusculus_gene_ensembl")
bdata <- getBM(mart=mart,attributes=myattributes,uniqueRows=T,
useCache=FALSE)
head(bdata)
```
```
ensembl_gene_id entrezgene_id external_gene_name chromosome_name start_position
1 ENSMUSG00000064336 NA mt-Tf MT 1
2 ENSMUSG00000064337 NA mt-Rnr1 MT 70
3 ENSMUSG00000064338 NA mt-Tv MT 1025
4 ENSMUSG00000064339 NA mt-Rnr2 MT 1094
5 ENSMUSG00000064340 NA mt-Tl1 MT 2676
6 ENSMUSG00000064341 17716 mt-Nd1 MT 2751
end_position strand gene_biotype
1 68 1 Mt_tRNA
2 1024 1 Mt_rRNA
3 1093 1 Mt_tRNA
4 2675 1 Mt_rRNA
5 2750 1 Mt_tRNA
6 3707 1 protein_coding
description
1 mitochondrially encoded tRNA phenylalanine [Source:MGI Symbol;Acc:MGI:102487]
2 mitochondrially encoded 12S rRNA [Source:MGI Symbol;Acc:MGI:102493]
3 mitochondrially encoded tRNA valine [Source:MGI Symbol;Acc:MGI:102472]
4 mitochondrially encoded 16S rRNA [Source:MGI Symbol;Acc:MGI:102492]
5 mitochondrially encoded tRNA leucine 1 [Source:MGI Symbol;Acc:MGI:102482]
6 mitochondrially encoded NADH dehydrogenase 1 [Source:MGI Symbol;Acc:MGI:101787]
```
We find that there are several duplicates for all the IDs. This needs to be fixed when this information is to be used downstream.
```{r,eval=FALSE}
sum(duplicated(bdata$ensembl_gene_id))
sum(duplicated(bdata$entrezgene_id))
sum(duplicated(bdata$external_gene_name))
```
```
[1] 375
[1] 29854
[1] 2130
```
```{r,eval=FALSE}
# arrange table by chr name and start position
bdata <- dplyr::arrange(bdata,chromosome_name,start_position)
write.table(bdata,"./data/mouse_genes.txt",sep="\t",dec=".",row.names=FALSE,quote=FALSE)
head(bdata)
```
```
ensembl_gene_id entrezgene_id external_gene_name chromosome_name start_position end_position strand gene_biotype
1 ENSMUSG00000102693 NA 4933401J01Rik 1 3143476 3144545 1 TEC
2 ENSMUSG00000064842 115487594 Gm26206 1 3172239 3172348 1 snRNA
3 ENSMUSG00000051951 497097 Xkr4 1 3276124 3741721 -1 protein_coding
4 ENSMUSG00000102851 NA Gm18956 1 3322980 3323459 1 processed_pseudogene
5 ENSMUSG00000103377 NA Gm37180 1 3435954 3438772 -1 TEC
6 ENSMUSG00000104017 NA Gm37363 1 3445779 3448011 -1 TEC
description
1 RIKEN cDNA 4933401J01 gene [Source:MGI Symbol;Acc:MGI:1918292]
2 predicted gene, 26206 [Source:MGI Symbol;Acc:MGI:5455983]
3 X-linked Kx blood group related 4 [Source:MGI Symbol;Acc:MGI:3528744]
4 predicted gene, 18956 [Source:MGI Symbol;Acc:MGI:5011141]
5 predicted gene, 37180 [Source:MGI Symbol;Acc:MGI:5610408]
6 predicted gene, 37363 [Source:MGI Symbol;Acc:MGI:5610591]
```
## Transcript
Here we download transcript to gene mappings. Notice that we can specify the `mart` and `dataset` in the `useMart()` function.
```{r,eval=FALSE}
mart <- useMart(biomart="ensembl",dataset="mmusculus_gene_ensembl")
t2g <- getBM(attributes=c("ensembl_transcript_id","ensembl_gene_id","external_gene_name"),mart=mart,useCache=FALSE)
write.table(t2g,"./data/mouse_transcripts.txt",sep="\t",dec=".",row.names=F,quote=F)
head(t2g)
```
```
ensembl_transcript_id ensembl_gene_id external_gene_name
1 ENSMUST00000082387 ENSMUSG00000064336 mt-Tf
2 ENSMUST00000082388 ENSMUSG00000064337 mt-Rnr1
3 ENSMUST00000082389 ENSMUSG00000064338 mt-Tv
4 ENSMUST00000082390 ENSMUSG00000064339 mt-Rnr2
5 ENSMUST00000082391 ENSMUSG00000064340 mt-Tl1
6 ENSMUST00000082392 ENSMUSG00000064341 mt-Nd1
```
The transcipt information file is saved to a file and will be used in the lab on [Kallisto](lab_kallisto.html).
## Gene ontology
Similarly, we can get entrez gene ID to GO ID relationships. List all the GO related filters:
```{r,eval=FALSE}
mart <- biomaRt::useMart(biomart="ensembl",dataset="mmusculus_gene_ensembl")
la <- listAttributes(mart=mart)
# find all rows in dataset 'lf' where column 'name' contains the string 'go'
head(la[grepl("go",tolower(la$name)),])
```
```
name description
18 with_go With GO ID(s)
19 with_goslim_goa With GOSlim GOA ID(s)
70 go GO ID(s) [e.g. GO:0000002]
71 goslim_goa GOSlim GOA ID(s) [e.g. GO:0000003]
169 go_parent_term Parent term accession
170 go_parent_name Parent term name
171 go_evidence_code GO Evidence code
212 with_cgobio_homolog Orthologous Channel bull blenny Genes
233 with_cldingo_homolog Orthologous Dingo Genes
257 with_ggorilla_homolog Orthologous Gorilla Genes
```
```{r,eval=FALSE}
mart <- biomaRt::useMart(biomart="ensembl",dataset="mmusculus_gene_ensembl")
bdata <- getBM(mart=mart,attributes=c("entrezgene_id","go_id","go_linkage_type"),uniqueRows=T,useCache=FALSE)
write.table(bdata,"./data/mouse_go.txt",sep="\t",dec=".",row.names=FALSE,quote=FALSE)
```
```
entrezgene_id go_id go_linkage_type
1 NA GO:0006414 TAS
2 NA GO:0030533 IEA
3 NA GO:0030533 ISS
4 NA GO:0005739 TAS
5 NA GO:0000028 TAS
6 NA GO:0006412 TAS
```
## ID conversion
We can also take a quick look at converting IDs. It is often desirable to convert a certain gene identifier to another (ensembl gene ID, entrez gene ID, gene ID). Sometimes, it may be necessary to convert gene IDs of one organism to another. biomaRt has a convenient function for this called `getLDS()`.
Here is an example where we convert a few mouse ensembl IDs to Human Hugo gene IDs.
```{r,eval=FALSE}
mouse_genes <- c("ENSMUSG00000035847","ENSMUSG00000000214")
human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
attributes = c("external_gene_name", "ensembl_gene_id", "description")
attributesL = c("hgnc_symbol", "ensembl_gene_id")
getLDS(attributes, attributesL = attributesL, uniqueRows = T, filters = 'ensembl_gene_id', values= mouse_genes, mart = mart, martL = human)
```
If you get this error:
`Error: biomaRt has encountered an unexpected server error.
Consider trying one of the Ensembl mirrors (for more details look at ?useEnsembl)`
Try the following chunk instead.
```{r, eval = FALSE}
mouse_genes <- c("ENSMUSG00000035847","ENSMUSG00000000214")
human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl", host = "https://dec2021.archive.ensembl.org/")
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl", host = "https://dec2021.archive.ensembl.org/")
attributes = c("external_gene_name", "ensembl_gene_id", "description")
attributesL = c("hgnc_symbol", "ensembl_gene_id")
getLDS(attributes, attributesL = attributesL, uniqueRows = T, filters = 'ensembl_gene_id', values= mouse_genes, mart = mart, martL = human)
```
```
Gene.name Gene.stable.ID Gene.description HGNC.symbol Gene.stable.ID.1
1 Ids ENSMUSG00000035847 iduronate 2-sulfatase [Source:MGI Symbol;Acc:MGI:96417] IDS ENSG00000010404
2 Ids ENSMUSG00000035847 iduronate 2-sulfatase [Source:MGI Symbol;Acc:MGI:96417] ENSG00000241489
3 Th ENSMUSG00000000214 tyrosine hydroxylase [Source:MGI Symbol;Acc:MGI:98735] TH ENSG00000180176
```
***