The tuberculosis R/Bioconductor package features tuberculosis gene expression data for machine learning. All human samples from GEO that did not come from cell lines, were not taken postmortem, and did not feature recombination have been included. The package has more than 10,000 samples from both microarray and sequencing studies that have been processed from raw data through a hyper-standardized, reproducible pipeline.
To fully understand the provenance of data in the tuberculosis R/Bioconductor package, please see the tuberculosis.pipeline GitHub repository; all but the most curious users can ignore these details without consequence. Still, a brief summary of data processing is appropriate here. Microarray data were processed from raw files (e.g. CEL files) and background corrected using the normal-exponential method and the saddle-point approximation to maximum likelihood as implemented in the limma R/Bioconductor package; no normalization of expression values was done. Where platforms necessitated it, the RMA (robust multichip average) algorithm without background correction or normalization was used to generate an expression matrix. Sequencing data were processed from raw files (i.e. fastq files) using the nf-core/rnaseq pipeline inside a Singularity container; the GRCh38 genome build was used for alignment. Gene names for both microarray and sequencing data are HGNC-approved GRCh38 gene names from the genenames.org REST API.
To install tuberculosis from Bioconductor, use BiocManager as follows.
BiocManager::install("tuberculosis")
To install tuberculosis from GitHub, use BiocManager as follows.
BiocManager::install("schifferl/tuberculosis", dependencies = TRUE, build_vignettes = TRUE)
Most users should simply install tuberculosis from Bioconductor.
To use the package without double colon syntax, load it as follows.
library(tuberculosis)
The package is lightweight, with few dependencies, and contains no data itself.
To find data, users will use the tuberculosis function with a regular expression pattern to list available resources. The resources are organized by GEO series accession numbers. If multiple platforms were used in a single study, the platform accession number follows the series accession number and is separated by a dash. The date before the series accession number denotes the date the resource was created.
tuberculosis("GSE103147")
## 2021-09-15.GSE103147
The tuberculosis function will print the names of matching resources as a message and return them invisibly as a character vector. To see all available resources, use "." for the pattern argument.
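The pattern matching and naming convention described above can be illustrated offline in base R. This is only a sketch: it assumes ordinary R regular expression semantics, match_resources and parse_resource_name are hypothetical helpers (not functions of the package), and the multi-platform name is invented to show the dash-separated layout.

```r
# Resource names that appear in this document's examples.
available <- c(
  "2021-09-15.GSE103147",
  "2021-09-15.GSE107991",
  "2021-09-15.GSE107992",
  "2021-09-15.GSE107993",
  "2021-09-15.GSE107994",
  "2021-09-15.GSE107995"
)

# Hypothetical stand-in for the lookup step: keep names matching a regex.
match_resources <- function(pattern, resources = available) {
  grep(pattern, resources, value = TRUE)
}

one_match  <- match_resources("GSE103147")  # a single series
five_match <- match_resources("GSE10799.")  # "." matches any one character
all_match  <- match_resources(".")          # every name matches

# Hypothetical helper: split a name into creation date, series, and platform.
# Assumed layout: <YYYY-MM-DD>.<GSE accession>[-<GPL accession>]
parse_resource_name <- function(x) {
  pattern <- "^([0-9]{4}-[0-9]{2}-[0-9]{2})\\.(GSE[0-9]+)(-(GPL[0-9]+))?$"
  parts <- regmatches(x, regexec(pattern, x))
  data.frame(
    name     = x,
    created  = as.Date(vapply(parts, function(p) p[2], character(1))),
    series   = vapply(parts, function(p) p[3], character(1)),
    # An unmatched optional group comes back as "", reported here as NA.
    platform = vapply(parts, function(p) {
      if (is.na(p[5]) || !nzchar(p[5])) NA_character_ else p[5]
    }, character(1)),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_resource_name(c(
  "2021-09-15.GSE103147",         # name shown in this document
  "2021-09-15.GSE000000-GPL0000"  # hypothetical multi-platform name
))
print(parsed)
```

Because the pattern is a regular expression, a trailing "." widens the match to every series whose accession shares the given prefix.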
To get data, users will also use the tuberculosis function, but with an additional argument, dryrun = FALSE. This will either download resources from ExperimentHub or load them from the user’s local cache. If a resource has multiple creation dates, the most recent is selected by default; add a date to override this behavior.
tuberculosis("GSE103147", dryrun = FALSE)
## $`2021-09-15.GSE103147`
## class: SummarizedExperiment
## dim: 24353 1649
## metadata(0):
## assays(1): exprs
## rownames(24353): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(1649): SRR5980424 SRR5980425 ... SRR5982072 SRR5982073
## colData names(0):
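The default of choosing the most recent creation date can be pictured with a few lines of base R. This is a sketch of the idea rather than the package's implementation: select_latest is a hypothetical helper, and the 2021-01-01 name is invented so that one series has two creation dates.

```r
# One real name from this document plus a hypothetical older copy of it.
candidates <- c(
  "2021-01-01.GSE103147",  # hypothetical earlier creation date
  "2021-09-15.GSE103147"   # name shown in this document
)

# Hypothetical helper: within each series, keep the most recently created name.
select_latest <- function(resources) {
  created <- as.numeric(as.Date(substr(resources, 1, 10)))
  series  <- sub("^[0-9]{4}-[0-9]{2}-[0-9]{2}\\.", "", resources)
  # ave() returns, for every element, the maximum date within its series.
  resources[created == ave(created, series, FUN = max)]
}

latest <- select_latest(candidates)

# Including the date in the pattern pins one specific creation date instead.
pinned <- grep("2021-01-01.GSE103147", candidates, value = TRUE)
```

With the real package, the override presumably works the same way, by including the creation date in the pattern passed to the tuberculosis function.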
The tuberculosis function returns a list of SummarizedExperiment objects, each with a single assay, exprs, where the rows are features (genes) and the columns are observations (samples). If multiple resources are requested, multiple resources will be returned, each as a list element.
tuberculosis("GSE10799.", dryrun = FALSE)
## $`2021-09-15.GSE107991`
## class: SummarizedExperiment
## dim: 24353 54
## metadata(0):
## assays(1): exprs
## rownames(24353): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(54): SRR6369879 SRR6369880 ... SRR6369931 SRR6369932
## colData names(0):
##
## $`2021-09-15.GSE107992`
## class: SummarizedExperiment
## dim: 24353 47
## metadata(0):
## assays(1): exprs
## rownames(24353): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(47): SRR6369945 SRR6369946 ... SRR6369990 SRR6369991
## colData names(0):
##
## $`2021-09-15.GSE107993`
## class: SummarizedExperiment
## dim: 24353 138
## metadata(0):
## assays(1): exprs
## rownames(24353): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(138): SRR6370167 SRR6370168 ... SRR6370303 SRR6370304
## colData names(0):
##
## $`2021-09-15.GSE107994`
## class: SummarizedExperiment
## dim: 24353 175
## metadata(0):
## assays(1): exprs
## rownames(24353): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(175): SRR6369992 SRR6369993 ... SRR6370165 SRR6370166
## colData names(0):
##
## $`2021-09-15.GSE107995`
## class: SummarizedExperiment
## dim: 24353 414
## metadata(0):
## assays(1): exprs
## rownames(24353): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(414): SRR6369879 SRR6369880 ... SRR6370303 SRR6370304
## colData names(0):
The assay of each SummarizedExperiment object is named exprs rather than counts because it can come from either a microarray or a sequencing platform. If colnames begin with GSE, the data come from a microarray platform; if colnames begin with SRR, the data come from a sequencing platform.
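The prefix rule above is easy to apply programmatically. The following is a minimal sketch in base R: platform_type is a hypothetical helper, the SRR names are taken from the output shown earlier, and the GSE-prefixed names are invented placeholders that simply follow the stated rule.

```r
# Hypothetical helper: classify a resource by the prefix of its column names.
platform_type <- function(sample_names) {
  if (all(startsWith(sample_names, "SRR"))) {
    "sequencing"
  } else if (all(startsWith(sample_names, "GSE"))) {
    "microarray"
  } else {
    "unknown"
  }
}

sequencing_names <- c("SRR5980424", "SRR5980425")  # from the output above
microarray_names <- c("GSE0000001", "GSE0000002")  # hypothetical placeholders

sequencing_type <- platform_type(sequencing_names)
microarray_type <- platform_type(microarray_names)
mixed_type      <- platform_type(c(sequencing_names, microarray_names))
```

For a downloaded object, the same helper could be called as platform_type(colnames(se)), where se is one element of the returned list.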
The SummarizedExperiment objects do not have sample metadata as colData, and this limits their use to unsupervised analyses for the time being. Sample metadata are currently undergoing manual curation, with the same level of diligence that was applied in data processing, and will be included in the package when they are ready.
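As an illustration of the kind of unsupervised analysis that is possible without sample metadata, the sketch below runs a principal component analysis in base R. It uses a small simulated matrix as a stand-in for an exprs assay (features in rows, samples in columns); the simulated values, the log2 transform, and the variance filter are assumptions made for the example, not steps prescribed by the package.

```r
set.seed(1)

# Toy stand-in for an exprs matrix: 50 features (rows) by 12 samples (columns).
exprs <- matrix(
  rpois(50 * 12, lambda = 20),
  nrow = 50,
  dimnames = list(paste0("gene", 1:50), paste0("sample", 1:12))
)

# Assumed preprocessing for count-like values: log2 transform with a
# pseudocount, then drop features that do not vary across samples.
log_exprs <- log2(exprs + 1)
variable  <- apply(log_exprs, 1, var) > 0

# prcomp() expects observations in rows, so the matrix is transposed.
pca <- prcomp(t(log_exprs[variable, , drop = FALSE]), center = TRUE, scale. = TRUE)

# Sample coordinates on the first two principal components.
scores <- pca$x[, 1:2]
```

With a real resource, the matrix would instead come from SummarizedExperiment::assay(se, "exprs"), and whether a log transform is appropriate depends on the platform the data came from.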
To contribute to the tuberculosis R/Bioconductor package, first read the contributing guidelines and then open an issue. Also note that in contributing you agree to abide by the code of conduct.