multiplierz.mgf

Max Alexander edited this page May 10, 2017 · 3 revisions

Mascot Generic Format (MGF) files are used as input to several database search engines; these files contain MS2 data with the required metadata (the charge and M/Z value of the analyte) necessary to identify the peptide being examined. MGF is also a conveniently parsable and human-readable format, which makes it ideal for data pre-processing scripts. This module provides several utilities to produce and manipulate MGF files.


extract(datafile, outputfile = None, default_charge = 2, centroid = True, scan_type = None, deisotope_and_reduce_charge = True, min_mz = 140, precursor_tolerance = 0.005, isobaric_labels = None, label_tolerance = 0.01)

An important step in a typical MS proteomic workflow is extracting MS2 data from the proprietary instrument data format into a format (in this case, MGF) that can be handled directly by the search program. It is important, at this stage, to make the best use of all the data available in the instrument data format: since MGF files do not, for example, store information about MS1 scans, any analysis of precursor peaks (for instance, to correct the initially imputed charge or mass) must be done before the data is converted to MGF. It is also useful to centroid and otherwise reduce the data at this step, to avoid producing unduly large MGF files. extract() takes a number of arguments, of which only datafile, the path of the target MS instrument data file, is required; the defaults are set to produce good results in most known cases, though the ideal settings will depend on the instrument and methods being used.

  • default_charge controls the default charge assumed for an MS2 spectrum, when the precursor charge cannot be determined.
  • centroid controls whether to centroid the data; centroided MS data is much smaller, and centroiding is a necessary step before using most database search algorithms. Depending on the data format, either the vendor-provided software or Multiplierz' own centroid algorithm will be used. Also note that setting this to False does not guarantee that the spectra won't be centroided, since many data formats store only the centroided spectra to begin with.
  • scan_type allows the user to specify a particular dissociation or detector mode to be extracted from the data file. This can be useful in cases where the instrument has performed two types of analysis in a single run; for example, certain instruments can perform both ITMS (Ion Trap Mass Spectrometry) and FTMS (Fourier Transform Mass Spectrometry) on acquired analytes in tandem, producing two spectra for each target which will have different mass precisions and other attributes. It is usually advantageous to split the scan modes into separate streams in a processing pipeline; thus, you might specify 'FTMS' in the scan_type argument to extract only FTMS scans, which can then be passed on to FTMS-specific processing steps.
  • deisotope_and_reduce_charge controls whether the charge state of fragment ions is determined (when possible) and multiply-charged fragment peaks are replaced with singly-charged peaks of the same mass, which is often found to produce more accurate results in database search. This is equivalent to the processing performed by multiplierz.spectral_process.deisotope_reduce_scan().
  • min_mz Fragment ions below this M/Z value are ignored.
  • isobaric_labels Allowed values are None, 4, 6, 8, and 10. If not None, the extraction process looks for peaks produced by the corresponding type of isobaric peptide quantitation reagent: 4-plex iTRAQ, 6-plex TMT, 8-plex iTRAQ, or 10-plex TMT. The intensity values for each quantitation channel are written out into the spectrum description for later processing (see standard_title_parse() below).
  • label_tolerance Allowed inaccuracy of the M/Z reading of an isobaric label peak; this should typically correspond to the mass accuracy of your instrument at the mass range of the label peaks.
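
The arithmetic behind the charge-reduction and min_mz options can be sketched in a few lines. This is an illustration only, not the multiplierz implementation: the function names are hypothetical, and the harder part of the real step (working out each fragment's charge state in the first place) is not shown.

```python
PROTON_MASS = 1.007276  # mass of a proton, in Da

def reduce_to_single_charge(mz, charge):
    """Return the m/z the same ion would have with a single proton charge.

    Hypothetical helper: assumes the fragment's charge is already known.
    """
    neutral_mass = (mz - PROTON_MASS) * charge
    return neutral_mass + PROTON_MASS

def drop_low_mz(peaks, min_mz=140):
    """Keep only (m/z, intensity) peaks at or above min_mz."""
    return [(mz, intensity) for mz, intensity in peaks if mz >= min_mz]

# A doubly-charged fragment at m/z 500.0 moves to roughly m/z 998.99.
reduced = reduce_to_single_charge(500.0, 2)

# Peaks below the cutoff are discarded; the rest are kept unchanged.
kept = drop_low_mz([(120.5, 10.0), (140.0, 20.0), (350.2, 30.0)])
```

A singly-charged peak is left where it is by this formula, so applying it to every fragment with a known charge is safe.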

parse_mgf(mgffile, labelType = (lambda x: x))

This reads an MGF file and produces a dict of dicts; for each entry, the title of that entry is a key to the entry data. The entry data is a dict that has a key for each data line in the entry (so, e.g., 'title', 'pepmass' and 'charge' should be in every entry, as per the MGF format requirements) as well as a 'spectrum' element that gives the mass spectrum as a list of (M/Z, intensity) tuples. labelType is an optional argument that may take a function from the spectrum title to a more convenient key format. For instance, setting labelType = lambda x: standard_title_parse(x)['scan'] would produce a result where the key for each spectrum is the scan number of that spectrum.
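
To make the shape of the result concrete, here is a minimal stand-in parser that builds the same kind of dict of dicts from MGF text. It is a sketch written for this page, not the code in multiplierz.mgf; parse_mgf_sketch and label_type are hypothetical names, and it assumes the common MGF layout of BEGIN IONS / KEY=value lines / peak lines / END IONS.

```python
import io

def parse_mgf_sketch(lines, label_type=lambda title: title):
    """Build {key: entry} from MGF lines; key is label_type(entry title)."""
    entries = {}
    entry = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == 'BEGIN IONS':
            entry = {'spectrum': []}
        elif line == 'END IONS':
            if entry is not None:
                entries[label_type(entry.get('title'))] = entry
            entry = None
        elif entry is not None:
            if '=' in line:
                # Data line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split('=', 1)
                entry[key.lower()] = value
            else:
                # Peak line: m/z and intensity separated by whitespace.
                fields = line.split()
                entry['spectrum'].append((float(fields[0]), float(fields[1])))
    return entries

SAMPLE_MGF = """BEGIN IONS
TITLE=run.RAW|MultiplierzMGF|SCAN:1000|MZ:5555.5
PEPMASS=5555.5
CHARGE=3+
150.1 200.0
300.2 50.0
END IONS
"""

by_title = parse_mgf_sketch(io.StringIO(SAMPLE_MGF))

# A label_type function turns the title into a more convenient key,
# here the scan number pulled out of the title by hand.
by_scan = parse_mgf_sketch(
    io.StringIO(SAMPLE_MGF),
    label_type=lambda title: int(title.split('SCAN:')[1].split('|')[0]),
)
```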

parse_to_generator(mgffile)

Identical to parse_mgf(), except that the output is a generator that reads in spectra as they are requested; this is useful for handling large MGF files that would be slow or cumbersome to load into memory all at once, in cases where random access to spectra is not required.
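
The memory advantage comes from yielding each entry as soon as its END IONS line is read, rather than collecting everything first. The sketch below shows that pattern with a hypothetical iter_mgf_sketch function; it is not the library's code, and whether parse_to_generator() yields bare entries (as here) or something else is an assumption.

```python
import io

def iter_mgf_sketch(handle):
    """Yield one entry dict at a time from an open MGF file-like object."""
    entry = None
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line == 'BEGIN IONS':
            entry = {'spectrum': []}
        elif line == 'END IONS':
            if entry is not None:
                yield entry  # hand the spectrum over before reading further
            entry = None
        elif entry is not None:
            if '=' in line:
                key, value = line.split('=', 1)
                entry[key.lower()] = value
            else:
                fields = line.split()
                entry['spectrum'].append((float(fields[0]), float(fields[1])))

SAMPLE_MGF = """BEGIN IONS
TITLE=scan_1
PEPMASS=500.25
CHARGE=2+
150.1 200.0
END IONS
BEGIN IONS
TITLE=scan_2
PEPMASS=612.3
CHARGE=3+
175.1 80.0
210.4 15.0
END IONS
"""

# Only one entry is held in memory at a time while iterating.
titles = [entry['title'] for entry in iter_mgf_sketch(io.StringIO(SAMPLE_MGF))]
peak_counts = [len(entry['spectrum'])
               for entry in iter_mgf_sketch(io.StringIO(SAMPLE_MGF))]
```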

write_mgf(entries, outputName, header = [])

Writes an MGF file. Each entry must be a dict with a 'spectrum' member and additional members for each data line in the entry (so, the MGF standard would require a 'title', 'pepmass' and 'charge' member for each entry.)
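
As an illustration of the entry layout described above, the following sketch serialises such dicts into MGF text. write_mgf_sketch is a hypothetical stand-in, not the library function; it writes to any file-like object and leaves out the header argument, whose format is not described here.

```python
import io

def write_mgf_sketch(entries, handle):
    """Write entry dicts (data lines plus a 'spectrum' list) as MGF text."""
    for entry in entries:
        handle.write('BEGIN IONS\n')
        for key, value in entry.items():
            if key != 'spectrum':
                # Every non-spectrum member becomes a KEY=value data line.
                handle.write('%s=%s\n' % (key.upper(), value))
        for mz, intensity in entry['spectrum']:
            handle.write('%s %s\n' % (mz, intensity))
        handle.write('END IONS\n')

example_entry = {
    'title': 'scan_1',
    'pepmass': 500.25,
    'charge': '2+',
    'spectrum': [(150.1, 200.0), (300.2, 50.0)],
}

buffer = io.StringIO()
write_mgf_sketch([example_entry], buffer)
mgf_text = buffer.getvalue()
```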


standard_title_write() and standard_title_parse()

An unfortunate limitation of MGF-based workflows is that the MGF format discards many kinds of information that are likely to be useful in downstream post-processing steps. A common approach to retaining information relevant to individual scans is to encode it in the scan title, since in most cases the title is carried through a processing step unchanged and so still identifies the individual scan spectrum in the output. These two functions convert between dicts of spectrum data and a terse but human-readable format that can be used as a spectrum title. Also note that MGF files created by the extract() function above use spectrum titles in this format, so standard_title_parse() can be used by custom post-processing scripts to readily access information from the title.

For example, a spectrum title can be created by
title = standard_title_write('spectrometer_experiment_A1.RAW', mz = 5555.5, scan = 1000, rt = 23.3, charge = 3)
Which produces the string
spectrometer_experiment_A1.RAW|MultiplierzMGF|SCAN:1000|MZ:5555.5|RT:23.3|charge:3
And this can be converted back into an easily accessed dict object:
standard_title_parse(title)
Yields
{'rt': '23.3', 'charge': '3', 'mz': '5555.5', 'file': 'spectrometer_experiment_A1.RAW', 'scan': '1000'}
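
The title format itself is simple enough to handle by hand when multiplierz is not available. The sketch below is inferred from the example title above rather than taken from the library, so title_write_sketch and title_parse_sketch are hypothetical names and details such as field order or key casing may differ from the real functions.

```python
def title_write_sketch(filename, **fields):
    """Join the file name, a format marker and KEY:value fields with '|'."""
    parts = [filename, 'MultiplierzMGF']
    parts += ['%s:%s' % (key, value) for key, value in fields.items()]
    return '|'.join(parts)

def title_parse_sketch(title):
    """Split a title of the form file|MultiplierzMGF|KEY:value|... into a dict."""
    parts = title.split('|')
    info = {'file': parts[0]}
    for part in parts[2:]:  # parts[1] is the 'MultiplierzMGF' marker
        key, value = part.split(':', 1)
        info[key.lower()] = value
    return info

example_title = ('spectrometer_experiment_A1.RAW|MultiplierzMGF|'
                 'SCAN:1000|MZ:5555.5|RT:23.3|charge:3')
parsed = title_parse_sketch(example_title)

# Round trip through the sketch writer and parser.
rebuilt = title_parse_sketch(title_write_sketch('run.RAW', SCAN=7, MZ=432.1))
```

Note that every value comes back as a string, as in the dict shown above, so numeric fields such as the scan number need an explicit int() or float() before they are used in calculations.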