Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to create own Modifications with specific location? #2

Open
sgibb opened this issue Jul 2, 2017 · 1 comment
Open

How to create own Modifications with specific location? #2

sgibb opened this issue Jul 2, 2017 · 1 comment

Comments

@sgibb
Copy link
Member

sgibb commented Jul 2, 2017

This issue is a follow-up to lgatto/MSnbase#167.

@adder asked for user-defined locations of the Modification:

You also have to be able to specify where the variable modificiation is positioned. Eg. if you have a sequence with 2 serines. You can have a Phospo group on the first but not on the second serine.

unimod.org doesn't support specific locations. Instead they have a position argument that could be: "Anywhere", "Any N-term", "Any C-term", "N-term", "C-term", "Protein N-term", "Protein C-term".

library("unimod")
umodName("Acetyl") # or umodId(1)
# - General:
#   Modification version   :          0.0.1
#   Accession number/id    :              1
#   PSI-MS/Interim Name    :         Acetyl
#   Description            :    Acetylation
#   Composition            : H(2) C(2) O(1)
#   Delta Average Mass     :        42.0367
#   Delta Monoisotopic Mass:      42.010565
#   Approved               :           TRUE 
# - Specificity:
#     site       position      classification hidden group
# 1      K       Anywhere            Multiple  FALSE     1
# 2 N-term     Any N-term            Multiple  FALSE     2
# 3      C       Anywhere  Post-translational   TRUE     3
# 4      S       Anywhere  Post-translational   TRUE     4
# 5 N-term Protein N-term  Post-translational  FALSE     5
# 6      T       Anywhere  Post-translational   TRUE     6
# 7      Y       Anywhere Chemical derivative   TRUE     7
# 8      H       Anywhere Chemical derivative   TRUE     8
# - References: use 'references(object)'

I am currently not sure how to fulfill @adder's feature request best. The specificity slot is a data.frame that has a column position of type character. We could add another column, e.g. index as numeric or the user has to supply the position number as character (or we cast it).

I am planning to add a seq2mass(sequence, modificitions) or calculateMass(sequence, modifications) or just mass(sequence, modifications) function that would do the following:

  1. Split the amino acid sequence into single letter character.
  2. Look up the mass of these amino acids in the aminoacid data.frame (see How to store chemical elements and aminoacids? #1).
  3. Check the specificity slot (site and position column) if the mass has to be modified.
  4. Sum the mass up

This function should replace the mass calculation in MSnbase::calculateFragments.

Any suggestions regarding the user-defined modification positions or something else?

@sgibb sgibb added the question label Jul 2, 2017
@sgibb sgibb added this to the First Bioconductor Release milestone Jul 2, 2017
@sgibb
Copy link
Member Author

sgibb commented Aug 6, 2017

There are attempts to create a nomenclature for proteoforms https://github.com/topdownproteomics/proteoform-nomenclature-standard (tl;dr the modification is written in square brackets behind the amino acid).

e.g. Carbamidomethyl at the first C:

AC[Unimod:4]E

an unspecific (not in a database) mass shift at the second amino acid:

AC[mass:+16]E

modification on n-term:

[mass:+16]-ACE

In our Modification class or for calculateFragments we have to handle at least these five conditions:

  1. global modifications, like: Carbamidomethyl, which replaces all C with C + 57.02146
  2. local modifications, like: Phosphorylation on the second Serin in the sequence
  3. modification on the n/c-term (protein)
  4. modification on the n/c-term (any)
  5. neutral loss

2 and 3 are handled by the current version of the standard.

For 1. and 5. there are some issues (topdownproteomics/ProteoformNomenclatureStandard#21, topdownproteomics/ProteoformNomenclatureStandard#18, but I don't expect that any of them will be integrated because it is a little bit out of scope).

We could extend the mentioned standard, e.g.

  1. for global modifications: use a prefix, e.g. C[Unimod:4]>ACE
  2. use same notation as for 3. (would be only interesting for result presentation of calculateFragments)
  3. use * for ammonia and _ for water loss (would be only interesting for result presentation of calculateFragments), e.g. ACE*

A user could call calculateFragments in the following way:

calculateFragments("[Unimod]+C[4]>CS_E_QU[mass:+16]E_NCE_]")
# the unimod package would collect the mass for C

Which would be equivalent to the current

calculateFragments("CSEQUENCE", modifications=c(C=57.02146, U=16), 
                   neutraLoss=list(water=c("Cterm", "D", "E", "S", "T"),
                                   ammonia=c("K", "N", "Q", "R")))

For convenience we could keep the neutralLoss argument (otherwise the sequence would be really destroyed by many _, * for possible neutral loss positions).

While I think that would a great interface for the user if he only wants to calculate fragments for just a single protein sequence but it would be nearly impossible to do this for batch processing of mzML + mzID files.

In summary it seems not suitable for our intention.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant