How to create own Modifications with specific location? #2

sgibb · 2017-07-02T18:45:39Z

This issue is a follow-up to lgatto/MSnbase#167.

@adder asked for user-defined locations of the Modification:

You also have to be able to specify where the variable modificiation is positioned. Eg. if you have a sequence with 2 serines. You can have a Phospo group on the first but not on the second serine.

unimod.org doesn't support specific locations. Instead they have a position argument that could be: "Anywhere", "Any N-term", "Any C-term", "N-term", "C-term", "Protein N-term", "Protein C-term".

library("unimod")
umodName("Acetyl") # or umodId(1)
# - General:
#   Modification version   :          0.0.1
#   Accession number/id    :              1
#   PSI-MS/Interim Name    :         Acetyl
#   Description            :    Acetylation
#   Composition            : H(2) C(2) O(1)
#   Delta Average Mass     :        42.0367
#   Delta Monoisotopic Mass:      42.010565
#   Approved               :           TRUE 
# - Specificity:
#     site       position      classification hidden group
# 1      K       Anywhere            Multiple  FALSE     1
# 2 N-term     Any N-term            Multiple  FALSE     2
# 3      C       Anywhere  Post-translational   TRUE     3
# 4      S       Anywhere  Post-translational   TRUE     4
# 5 N-term Protein N-term  Post-translational  FALSE     5
# 6      T       Anywhere  Post-translational   TRUE     6
# 7      Y       Anywhere Chemical derivative   TRUE     7
# 8      H       Anywhere Chemical derivative   TRUE     8
# - References: use 'references(object)'

I am currently not sure how to fulfill @adder's feature request best. The specificity slot is a data.frame that has a column position of type character. We could add another column, e.g. index as numeric or the user has to supply the position number as character (or we cast it).

I am planning to add a seq2mass(sequence, modificitions) or calculateMass(sequence, modifications) or just mass(sequence, modifications) function that would do the following:

Split the amino acid sequence into single letter character.
Look up the mass of these amino acids in the aminoacid data.frame (see How to store chemical elements and aminoacids? #1).
Check the specificity slot (site and position column) if the mass has to be modified.
Sum the mass up

This function should replace the mass calculation in MSnbase::calculateFragments.

Any suggestions regarding the user-defined modification positions or something else?

The text was updated successfully, but these errors were encountered:

sgibb · 2017-08-06T14:21:18Z

There are attempts to create a nomenclature for proteoforms https://github.com/topdownproteomics/proteoform-nomenclature-standard (tl;dr the modification is written in square brackets behind the amino acid).

e.g. Carbamidomethyl at the first C:

AC[Unimod:4]E

an unspecific (not in a database) mass shift at the second amino acid:

AC[mass:+16]E

modification on n-term:

[mass:+16]-ACE

In our Modification class or for calculateFragments we have to handle at least these five conditions:

global modifications, like: Carbamidomethyl, which replaces all C with C + 57.02146
local modifications, like: Phosphorylation on the second Serin in the sequence
modification on the n/c-term (protein)
modification on the n/c-term (any)
neutral loss

2 and 3 are handled by the current version of the standard.

For 1. and 5. there are some issues (topdownproteomics/ProteoformNomenclatureStandard#21, topdownproteomics/ProteoformNomenclatureStandard#18, but I don't expect that any of them will be integrated because it is a little bit out of scope).

We could extend the mentioned standard, e.g.

for global modifications: use a prefix, e.g. C[Unimod:4]>ACE
use same notation as for 3. (would be only interesting for result presentation of calculateFragments)
use * for ammonia and _ for water loss (would be only interesting for result presentation of calculateFragments), e.g. ACE*

A user could call calculateFragments in the following way:

calculateFragments("[Unimod]+C[4]>CS_E_QU[mass:+16]E_NCE_]")
# the unimod package would collect the mass for C

Which would be equivalent to the current

calculateFragments("CSEQUENCE", modifications=c(C=57.02146, U=16), 
                   neutraLoss=list(water=c("Cterm", "D", "E", "S", "T"),
                                   ammonia=c("K", "N", "Q", "R")))

For convenience we could keep the neutralLoss argument (otherwise the sequence would be really destroyed by many _, * for possible neutral loss positions).

While I think that would a great interface for the user if he only wants to calculate fragments for just a single protein sequence but it would be nearly impossible to do this for batch processing of mzML + mzID files.

In summary it seems not suitable for our intention.

sgibb added the question label Jul 2, 2017

sgibb added this to the First Bioconductor Release milestone Jul 2, 2017

sgibb mentioned this issue Jul 11, 2017

Modification object #6

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to create own Modifications with specific location? #2

How to create own Modifications with specific location? #2

sgibb commented Jul 2, 2017

sgibb commented Aug 6, 2017 •

edited by lgatto

Loading

How to create own Modifications with specific location? #2

How to create own Modifications with specific location? #2

Comments

sgibb commented Jul 2, 2017

sgibb commented Aug 6, 2017 • edited by lgatto Loading

sgibb commented Aug 6, 2017 •

edited by lgatto

Loading