MultiAssayExperiment API
The MultiAssayExperiment
class can be used to manage results of diverse
assays on a collection of samples. Currently the class can handle assays that
are organized as instances of RangedSummarizedExperiment
, ExpressionSet
(legacy), matrix
, RaggedExperiment
(inherits from GRangesList
), and
RangedVcfStack
(defined in the GenomicFiles
package). Create new
MultiAssayExperiment
instances with the eponymous constructor, minimally with
the argument ExperimentList
, potentially also with the arguments colData
and sampleMap
.
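As a minimal sketch of the constructor call described above, the following builds a small object from two toy matrices. The assay names (`rna`, `prot`), the patient identifiers, and the `age` column are made up for illustration. The `ExperimentList` is passed positionally so the sketch does not depend on the name of the first argument.

```r
library(MultiAssayExperiment)

# Two toy assays as plain matrices: columns are samples, rows are features.
rna <- matrix(seq_len(12), nrow = 3,
              dimnames = list(c("geneA", "geneB", "geneC"),
                              c("rna_p1", "rna_p2", "rna_p3", "rna_p4")))
prot <- matrix(seq_len(6), nrow = 2,
               dimnames = list(c("protX", "protY"),
                               c("prot_p1", "prot_p2", "prot_p4")))

# One row per biological unit; the rownames are the primary identifiers.
patients <- S4Vectors::DataFrame(
    age = c(51, 63, 47, 70),
    row.names = c("p1", "p2", "p3", "p4")
)

# Map every assay column back to the biological unit it was measured on.
smap <- S4Vectors::DataFrame(
    assay = factor(c(rep("rna", 4), rep("prot", 3))),
    primary = c("p1", "p2", "p3", "p4", "p1", "p2", "p4"),
    colname = c(colnames(rna), colnames(prot))
)

mae <- MultiAssayExperiment(
    ExperimentList(rna = rna, prot = prot),
    colData = patients,
    sampleMap = smap
)
mae
```

Here the protein assay was run on only three of the four patients, which the sampleMap records explicitly.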
Other data classes can be used in the MultiAssayExperiment
, as long as they
provide three methods: dimnames()
, [i, j]
, and dim()
. See the
ExperimentList section for details on requirements for
incorporating new data classes.
Note: For a brief visual summary of classes and methods involved in the package, please see the MultiAssayShiny package.
Note: For an essential overview of the methods from a user perspective, please see the MultiAssayExperiment cheat sheet
The most important class exported by this package is the MultiAssayExperiment
for coordinated representation of multiple experiments on partially overlapping
samples, with associated metadata at the level of entire study and the level of
"biological unit". The biological unit may be a patient, plant, yeast strain,
etc. This package is designed around the following hierarchy of information:
study (highest level). The study can encompass several different types of
experiments performed on one set of biological units, for example cancer
patients. A MultiAssayExperiment
represents a whole study, containing:
- metadata about the study as a whole
- metadata about each biological unit: for example, age, grade, stage for cancer patients
- results from a set of experiments performed on the biological units
- a map for matching data from the experiments back to the corresponding biological units.
experiment. A set of assays of a single type performed on some or all of the biological units. It is permissible that an experiment may be performed only on a subset of the biological units, and may be performed in duplicate on some of the biological units. For example, an experiment could be somatic mutation calls for some or all of the biological units.
Data from multiple experiments are stored in a list object called the
ExperimentList
, which provides flexibility for partially overlapping samples
(column names) and features (row names), while keeping samples correctly matched
to study-level metadata and to other experiments on the same samples.
Experiments may be ID-based, where measurements are indexed by identifiers of
genes, microRNA, proteins, microbes, etc. Alternatively, experiments may be
range-based, where measurements correspond to genomic ranges that can be
represented as GRanges
objects, such as gene expression or copy number. Note
that for ID-based experiments, there is no requirement that the same IDs be
present for different experiments. For range-based experiments, there is also
no requirement that the same ranges be present for different experiments;
furthermore, it is possible for different samples within an experiment to be
represented by different ranges. Note however that even range-based features
must be named, so that genomic features can be referred to by character IDs.
The following data classes have so far been tested to work as elements of
ExperimentList
:
- matrix: the most basic class for ID-based datasets, could be used for example for gene expression summarized per-gene, microRNA, metabolomics, or microbiome data.
- SummarizedExperiment: A richer representation for ID-based datasets, could be used for the same types of data as matrix, but storing additional assay-level metadata.
- RangedSummarizedExperiment: For rectangular range-based datasets, meaning that one set of genomic ranges is assayed for multiple samples. Could be used for gene expression, methylation, or other data types referring to genomic positions.
- ExpressionSet: Another rich representation for ID-based datasets, supported only for legacy reasons as the SummarizedExperiment class already provides numerous improvements over the ExpressionSet structure.
- RaggedExperiment: For non-rectangular (ragged) range-based datasets, meaning that a potentially different set of genomic ranges is assayed for each sample. A typical example would be segmented copy number, where segmentation of copy number alterations occurs at different genomic locations in each sample.
- RangedVcfStack: For VCF archives broken up by chromosome (see the VcfStack class defined in the GenomicFiles package).
- DelayedMatrix: An on-disk representation of matrix-like objects for large datasets. It reduces memory usage and optimizes performance with delayed operations. This class is part of the DelayedArray package.
samples (lowest level). An individual set of measurements performed on a
single biological unit. These measurements must be indexed by character IDs,
however datasets may be ID-based (such as matrix
or SummarizedExperiment
) or
range-based (such as RangedSummarizedExperiment
). In the experimental
datasets, columns refer to samples, and rows refer to genomic features that are
represented by IDs or ranges.
The MultiAssayExperiment
class is the main representation of multiple
experiment data. It contains all information required to subset and match sample
identifiers with clinical records.
- ExperimentList - slot of class ExperimentList containing data for each experiment/assay
  - extends the "SimpleList" class from S4Vectors
  - access using "experiments"
- colData - slot of class DataFrame describing the clinical data available across all experiments
- sampleMap - slot of class DataFrame of translatable identifiers of samples and participants
- metadata - slot of any class providing additional information about the MultiAssayExperiment object
- drops - slot of class list to keep a log of all residuals from subset operations
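The slots listed above are normally reached through accessor functions rather than directly. A short sketch, assuming a single toy matrix named `rna` and letting the constructor generate the colData and sampleMap:

```r
library(MultiAssayExperiment)

# A single toy assay; colData and sampleMap are generated by the constructor.
rna <- matrix(1:6, nrow = 2,
              dimnames = list(c("geneA", "geneB"), c("p1", "p2", "p3")))
mae <- MultiAssayExperiment(ExperimentList(rna = rna))

experiments(mae)  # ExperimentList holding the assay data
colData(mae)      # DataFrame with one row per biological unit
sampleMap(mae)    # DataFrame with columns assay, primary, colname
metadata(mae)     # study-level metadata
```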
- ExperimentList
  - ExperimentList length should be the same as the number of unique values in the sampleMap "assay" column.
  - Element names of the ExperimentList should be found in the sampleMap "assay" column.
  - For each ExperimentList element (say, an element named "assay X"), the colnames of that element must match the values found in the "colname" column of the sampleMap within the rows where "assay" equals the name of that element (in this example, "assay X"). The two sets of names must be identical, but the order does not need to be the same.
- colData
  - Ensure that this slot is of class DataFrame
- sampleMap - validity checks include checks for consistency between the sampleMap and the colData primary (or phenotype) data slot
  - All names in the sampleMap "primary" column must be found in the rownames of the colData DataFrame.
  - Within rows of sampleMap corresponding to a single value in the "assay" column, there can be no duplicated values in the "colname" column.
Note. These validity checks only apply when at least an ExperimentList
slot
is provided at MultiAssayExperiment
object creation.
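To make the checks above concrete, here is a rough base-R restatement. It is not the package's validity code: `check_sample_map` is a hypothetical helper, and a plain list and data.frame stand in for the ExperimentList and DataFrame objects.

```r
# Hypothetical helper (not part of MultiAssayExperiment): re-states the
# documented consistency checks and returns a character vector of problems.
check_sample_map <- function(experiments, sample_map, col_data) {
    problems <- character()
    assays <- unique(as.character(sample_map$assay))
    if (length(experiments) != length(assays))
        problems <- c(problems,
                      "number of experiments differs from unique 'assay' values")
    if (!all(names(experiments) %in% assays))
        problems <- c(problems, "experiment names missing from 'assay' column")
    for (nm in intersect(names(experiments), assays)) {
        mapped <- sample_map$colname[sample_map$assay == nm]
        if (anyDuplicated(mapped))
            problems <- c(problems,
                          sprintf("duplicated 'colname' values for assay '%s'", nm))
        # Same set of names is required; the order is allowed to differ.
        if (!setequal(colnames(experiments[[nm]]), mapped))
            problems <- c(problems,
                          sprintf("colnames of '%s' do not match the sampleMap", nm))
    }
    if (!all(sample_map$primary %in% rownames(col_data)))
        problems <- c(problems, "'primary' values missing from colData rownames")
    problems
}

exps <- list(rna = matrix(0, 2, 2, dimnames = list(c("g1", "g2"), c("r1", "r2"))))
cd <- data.frame(age = c(50, 60), row.names = c("p1", "p2"))
sm <- data.frame(assay = "rna", primary = c("p1", "p2"), colname = c("r1", "r2"))

ok_result <- check_sample_map(exps, sm, cd)

# "p9" has no row in the colData, so one problem is reported.
sm_bad <- transform(sm, primary = c("p1", "p9"))
bad_result <- check_sample_map(exps, sm_bad, cd)
```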
The updateObject
method attempts to repair previously serialized instances of
the MultiAssayExperiment
so that it conforms with the updated API
. It is
advised to run updateObject
on old
instances of the MultiAssayExperiment
and reserialize the object. This
should be a one-time operation.
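A sketch of the one-time update described above. The temporary file stands in for a previously saved object; in practice `old_path` would point to an existing serialized file.

```r
library(MultiAssayExperiment)

# Stand-in for an old serialized instance (hypothetical path and toy content).
old_path <- tempfile(fileext = ".rds")
toy <- matrix(1:4, nrow = 2,
              dimnames = list(c("geneA", "geneB"), c("p1", "p2")))
saveRDS(MultiAssayExperiment(ExperimentList(rna = toy)), old_path)

# Read the object, bring it in line with the current class definition,
# check that it is valid, and write it back.
mae <- readRDS(old_path)
mae <- BiocGenerics::updateObject(mae)
validObject(mae)
saveRDS(mae, old_path)
```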
Recent changes to the API
include changing the name of the workhorse container
class from Elist
to ExperimentList
with an accessor function named
experiments
.
Other changes include renaming and reordering of the sampleMap
columns from
primary
, assay
, and assayname
to assay
(previously "assayname"),
primary
, and colname
(previously "assay"), respectively.
The ExperimentList
slot and class is the driver for the
MultiAssayExperiment
class as it contains necessary data from experiments and
sample identifiers. The purpose of the ExperimentList
is to store results
from a set of experiments, as a SimpleList
. Each element in the
ExperimentList represents an experiment performed. All ExperimentList elements
should be named.
- ExperimentList - inherits from SimpleList with no additions. Contains separate validity checks and a show method.
- ExperimentList elements
  - For data classes stored in each ExperimentList element, ensure that the methods [ (bracket), dimnames, and dim are available.
  - For each ExperimentList element, ensure that dimensions of non-zero length in each ExperimentList element have non-null colnames.
  - Ensure ExperimentList elements are appropriate for the API; warn when a DataFrame or data.frame is present.
Rationale
- ExperimentList element requirements
  - The requirement of methods [ (bracket), dimnames, and dim allows for predictable subsetting operations and metadata acquisition.
  - Standard subsetting by columns matches character vectors to the colnames, so any ExperimentList element with more than zero columns must have non-NULL colnames.
  - Rectangular objects that allow multiple data types and nested lists within their columns are discouraged and may interfere with data manipulation operations; therefore, matrix-based assays are preferred.
Any data class that provides the following methods can be used as an element of
ExperimentList
. RangedSummarizedExperiment
provides the template behavior
for ExperimentList
elements, as follows. These describe "template" behavior and are
not explicit requirements:
- dimnames(), by returning a list of character vectors for feature identifiers (genes, proteins, etc.) and sample identifiers
- [i, j], by returning the restriction of the instance to rows i and columns j
- dim(), by returning an integer vector of length two for the number of rows and columns
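The three methods above can be sketched for a new class as follows. `TinyAssay` is a hypothetical S4 class invented for this illustration, wrapping a plain matrix. The sketch only uses the `methods` package and does not test the class inside a MultiAssayExperiment.

```r
library(methods)

# Hypothetical container that simply wraps a matrix.
setClass("TinyAssay", representation(values = "matrix"))

# dim(): integer vector of length two (rows, columns).
setMethod("dim", "TinyAssay", function(x) dim(x@values))

# dimnames(): list of feature identifiers and sample identifiers.
setMethod("dimnames", "TinyAssay", function(x) dimnames(x@values))

# [i, j]: restrict to rows i and columns j, always returning a TinyAssay.
# The drop argument is accepted for compatibility but ignored.
setMethod("[", "TinyAssay", function(x, i, j, ..., drop = TRUE) {
    vals <- x@values
    if (!missing(i)) vals <- vals[i, , drop = FALSE]
    if (!missing(j)) vals <- vals[, j, drop = FALSE]
    new("TinyAssay", values = vals)
})

x <- new("TinyAssay",
         values = matrix(1:6, nrow = 2,
                         dimnames = list(c("g1", "g2"), c("s1", "s2", "s3"))))
dim(x)
colnames(x[, c("s1", "s3")])
```

Because `colnames()` and `ncol()` are built on `dimnames()` and `dim()`, they work for the new class without further definitions.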
The RaggedExperiment
class is an extension of the GRangesList
Bioconductor
class. It is intended to handle segmented copy number data. The package aims to
represent this type of data as a table where columns are samples and rows are
ranges. Please see the RaggedExperiment
package for more information.
The standard assay
functionality allows the user to obtain a numeric matrix of
data.
The current hasAssay
function includes a "soft" check that ensures all classes
in an existing MultiAssayExperiment
class object have listed assay
methods
via the hasMethods
function. For convenience, the argument passed to the
hasAssay
function can either be a MultiAssayExperiment
or a list
class
object.
This helper function checks to see whether any elements in the ExperimentList
support the rowRanges
method. This is important for future expansion of
methods where operations involve genomic ranges. The requirement for this check
is that all qualifying objects should return a GRanges
class from a
rowRanges
method.
A couple of methods for subsetting were created for the MultiAssayExperiment
with a user-friendly interface in mind. Both the bracket notation [
and
subset
methods are available. Each allows for subsetting via numeric
,
character
, and logical
vectors. Additional support for list
and List
objects is available.
Users are able to subset by:
- rows
- columns
- assays
in that order, within the bracket notation and separated by commas.
When subsetting a MultiAssayExperiment
via a numeric
vector, all rows
and
columns
of each element in the ExperimentList
will be subset by that vector.
When subsetting by a character
vector, the vector will be
matched against either the rownames
, colnames
or assays
of the
MultiAssayExperiment
. Logical vectors can be passed to all dimensions of the
MultiAssayExperiment
(i.e., rows
, columns
, assays
) and recycled if
necessary, following standard R language practice.
Subsetting assays follows the list-like methods closely.
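The bracket notation described above can be sketched on a toy object. The assay names, patient identifiers, and `age` column are made up. The column names equal the colData rownames so the sampleMap can be generated automatically, and `drop` is always given explicitly so the sketch does not rely on its default.

```r
library(MultiAssayExperiment)

rna <- matrix(seq_len(12), nrow = 3,
              dimnames = list(c("geneA", "geneB", "geneC"),
                              c("p1", "p2", "p3", "p4")))
prot <- matrix(seq_len(6), nrow = 2,
               dimnames = list(c("protX", "protY"), c("p1", "p2", "p4")))
patients <- S4Vectors::DataFrame(age = c(51, 63, 47, 70),
                                 row.names = c("p1", "p2", "p3", "p4"))
mae <- MultiAssayExperiment(ExperimentList(rna = rna, prot = prot),
                            colData = patients)

# Third position: keep a single assay by name.
by_assay <- mae[, , "rna", drop = FALSE]

# Second position: keep two biological units by character ID.
by_column <- mae[, c("p1", "p2"), , drop = FALSE]

# Second position with a logical vector computed from the colData.
by_logical <- mae[, colData(mae)$age > 60, , drop = FALSE]
```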
Subsetting with list-like and List-like objects is allowed as long as element
names in said lists match the experiment names in the
MultiAssayExperiment
/ExperimentList
. Subsetting with list
and List
is only available for rows
and columns
.
Examples of List
classes include but are not limited to:
- CharacterList
- LogicalList
- IntegerList
- SimpleList (in S4Vectors)
Apart from SimpleList, these classes can be found in the IRanges
package.
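A sketch of row subsetting with a list and with a CharacterList, using made-up assay and feature names. The element names of the list match the experiment names, as required above.

```r
library(MultiAssayExperiment)

rna <- matrix(seq_len(12), nrow = 3,
              dimnames = list(c("geneA", "geneB", "geneC"),
                              c("p1", "p2", "p3", "p4")))
prot <- matrix(seq_len(6), nrow = 2,
               dimnames = list(c("protX", "protY"), c("p1", "p2", "p4")))
mae <- MultiAssayExperiment(ExperimentList(rna = rna, prot = prot))

# One entry per experiment, each naming the rows to keep in that experiment.
rows <- list(rna = c("geneA", "geneC"), prot = "protY")

sub_list <- mae[rows, , , drop = FALSE]
sub_List <- mae[IRanges::CharacterList(rows), , , drop = FALSE]
```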
The subsetting of experiments with genomic ranges is possible when a GRanges
object is introduced to the subsetting operation. Example classes that contain
genomic ranges and support the current API are: 1) RangedSummarizedExperiment
and 2) RaggedExperiment. Additional arguments may be passed on to either the
subsetByRow function or to the bracket [ notation.
The drop
argument indicates whether to keep assays with zero dimensions (after
subsetting) in the ExperimentList
class object.
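The effect of drop can be sketched on a toy object with made-up names. Selecting a row that exists in only one assay leaves the other assay with zero rows, and drop decides whether that empty assay is retained.

```r
library(MultiAssayExperiment)

rna <- matrix(seq_len(12), nrow = 3,
              dimnames = list(c("geneA", "geneB", "geneC"),
                              c("p1", "p2", "p3", "p4")))
prot <- matrix(seq_len(6), nrow = 2,
               dimnames = list(c("protX", "protY"), c("p1", "p2", "p4")))
mae <- MultiAssayExperiment(ExperimentList(rna = rna, prot = prot))

# "geneA" is a row of the rna assay only.
kept <- mae["geneA", , , drop = FALSE]    # empty prot assay is retained
dropped <- mae["geneA", , , drop = TRUE]  # empty prot assay is removed

names(kept)
names(dropped)
```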