Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug reading AGCTarget in UVPD dataset #73

Closed
pavel-shliaha opened this issue Feb 21, 2018 · 12 comments
Closed

bug reading AGCTarget in UVPD dataset #73

pavel-shliaha opened this issue Feb 21, 2018 · 12 comments
Assignees
Labels
Milestone

Comments

@pavel-shliaha
Copy link
Collaborator

pavel-shliaha commented Feb 21, 2018

I have a folder that contains just the files with the recently added UVPD data. I see a bizzare bug: the AGC targets are not read very well in UVPD dataset. However the information in csv files appears to be correct

setwd ("M:\\r_scripts\\2017_01_18_top_down_lumos_first_large_experiment\\2017_07_03_CA\\UVPD_only")

CSV <- lapply (list.files(pattern = "csv"),  function(x) read.csv(x, stringsAsFactors = FALSE) )
CSV <- do.call (rbind, CSV)
table (CSV$AGCTarget)

#  analysis results
#  100000  500000 1000000 
 # 72      78      72 

sampleColumns <- c("Mz", "AgcTarget",  "UvpdActivation")
UVPDFull <-  readTopDownFiles(path = getwd(),
                              type = c("a", "b", "c", "x", "y", "z"),
                              adducts = data.frame(mass=c(H, H, -H,  -H, H), 
                                                   name=c("apH", "xpH", "ymH", "zmH", "zpH"), 
                                                   to=c("a", "x", "y", "z", "z")),
                              neutralLoss = NULL,
                              modifications = c("Carbamidomethyl", "Acetyl", "Met-loss"),
                              sampleColumns = sampleColumns,
                              tolerance = 5e-6) 


table (UVPDFull@colData$AgcTarget)


# analysis results
# 100000 500000 
# 129     82
@sgibb
Copy link
Owner

sgibb commented Feb 21, 2018

Nice catch! The problem is not in AgcTarget but in the combination of the information from experiment.csv and scanheadsman output. For the scanheadsman table we get duplicated IDs (it is a bug that topdownr doesn't throw an error). That's why only the first (the lower AgcTarget) rows are combined. The reason is again the FilterString:

Our assumption:

topdownr/R/utils.R

Lines 93 to 102 in de5b876

#' The ScanHeadsman output ocntains some FilterStrings that have wrong IDs (the
#' id from the previous or next run), e.g.:
#'
#' FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
#' FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
#' FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
#' FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
#'
#' pavel-shliaha finds out, that the skip is never larger than 1, never more
#' than 1 in a row and both is possible 1, 2, 2, 4 and 1, 3, 3, 4.

But actually the FilterStrings are:

FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]

So there is no gap to fill. How should we handle this? Just cumsum as always? (But be aware that we have more than twice the conditions in the scanheadsman output than in the experiment.csv.)

@pavel-shliaha
Copy link
Collaborator Author

pavel-shliaha commented Feb 21, 2018

I dont understand the nature of the problem. Also why has this problem appeared now? If there is no gaps then what is the problem?

@sgibb sgibb self-assigned this Feb 22, 2018
@sgibb sgibb added the bug label Feb 22, 2018
@sgibb sgibb added this to the bioc 3.7 milestone Feb 22, 2018
@sgibb
Copy link
Owner

sgibb commented Feb 22, 2018

Sry, please forget everything I wrote yesterday. I always suspect FilterStrings to be the root of all evil. But this time they aren't! Somehow I messed up.

Now I think there is actually no problem in the code but in the files.
Because there are no matches in the scan numbers for AgcTarget == 1e6 (when matching ScanHeadsman output and mzML information).

library("topdownr")
debug(readTopDownFiles)

H <- 1.0078250321
sampleColumns <- c("Mz", "AgcTarget",  "UvpdActivation")
UVPDFull <-  readTopDownFiles(path = getwd(),
                              type = c("a", "b", "c", "x", "y", "z"),
                              adducts = data.frame(mass=c(H, H, -H,  -H, H),
                                                   name=c("apH", "xpH", "ymH", "zmH", "zpH"),
                                                   to=c("a", "x", "y", "z", "z")),
                              neutralLoss = NULL,
                              modifications = c("Carbamidomethyl", "Acetyl", "Met-loss"),
                              sampleColumns = sampleColumns,
                              tolerance = 5e-6)

## hit 12 times [ENTER] until you see
## "debug: if (dropNonInformativeColumns) { ..."
## use [Q] (capitalized Q = Shift Q) to exit if you finished

# scanHeadsman is the combined table of .experiments.csv and .txt files
# (here everything is fine)
# mzmlHeader is the header table of .mzML files
# header is the final combined header (what will become @colData)
# (it is the result of `merge(mzmlHeader, scanHeadsman, by=c("File", "Scan"))`)
table(scanHeadsman$AgcTarget)
#  100000  500000 1000000
#     129     106      96

table(header$AgcTarget)
# 100000 500000
#    129     82

# the combination of scanHeadsman and mzmlHeaders is based on
# filename (column: File) and spectrum scan index (Scan)
unique(scanHeadsman$Scan)
#  [1] 21 22 23 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 24 25 53 54 55 56 57
# [26] 58 59 60 61 62 63 64 65 66 26 27 67 68 69 70 71 72 73 74 28 29 30 31 32 33
# [51] 34 35 36 37 75 76 77 78 79 80

unique(mzmlHeader$Scan)
#  [1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
# [26] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

# AgcTarget == 1e6 should have scan indices above 57, but as seen above there
# just 4 potential scan index matches 57:60 (but the File don't match).
# Most of the AgcTarget == 1e6 have scan indices greater than 60 but these are
# not present in the mzML
unique(scanHeadsman$Scan[scanHeadsman$AgcTarget == 1e6])
# [1] 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 59 75 76 77 78 79 80 57 58

# [Q]

@sgibb sgibb added help wanted and removed bug labels Feb 22, 2018
@pavel-shliaha
Copy link
Collaborator Author

ok, I have a strong suspicion as to what happened. Thank for invetigating

@pavel-shliaha
Copy link
Collaborator Author

pavel-shliaha commented Feb 22, 2018

Sorry, I think its a bug afterall.

I have a suggestion as to why this might happen. The reason for this is that each acquisition contains a series of MS1 scans, which help to monitor spray stability. These scans are not present in mzML (I only deconvolute MS2 spectra), but they are present in the scanHeadsMan ".txt" I have placed these scans in the beginning of the files. In this particular case you can see "a saw" since MS1 are recorded alternating between orbitrap (high signal) and ion trap (low signal)

capture

so that "CA_UVPD_745_.raw" for example contains it contains 80 spectra, of which first 20 is MS1. This means that mzML will contain 60 spectra, while ".txt" will contain 80.

the first MS2 spectrum is line 21 in the ".txt"
.readScanHeadsTable() seems to simply remove the MS1 lines

the first MS2 spectrum is number 1 in the mzML"Scan" column in the header table created by .readMzMl() has the first value of 1

in order to match them you either need to extract the actual scan numbers from the mzML file, or in scansHeadsman ".txt" to remove lines corresponding to MS1 and then renumber the scans so that the first line gets the first number.

@sgibb
Copy link
Owner

sgibb commented Feb 22, 2018

I still don't think it is a bug in topdownr. Of course .readMzMl takes the scan index (called acquisitionNum) from the mzMl file:

topdownr/R/import.R

Lines 184 to 185 in de5b876

hd <- header(fh)
i <- which(hd$msLevel == 2L & hd$acquisitionNum %in% scans)

colnames(hd)[grepl("acquisitionNum", colnames(hd), fixed=TRUE)] <- "Scan"

The problem is the file itself (maybe the exporter is broken), or we call it a bug in proteowizard:

library("mzR")

f <- openMSfile("CA_UVPD_745_.mzML")
head(header(f)[, c("acquisitionNum", "spectrumId")])
#   acquisitionNum       spectrumId
# 1              1 scan=21 file=191
# 2              2 scan=22 file=191
# 3              3 scan=23 file=191
# 4              4 scan=24 file=191
# 5              5 scan=25 file=191
# 6              6 scan=26 file=191
close(f)

library("topdownrdata")
f <- openMSfile(
    file.path(topDownDataPath("myo"),
              "mzml",
              "myo_1211_ETDReagentTarget_1e6_1.mzML.gz"
    )
)
head(header(f)[, c("acquisitionNum", "spectrumId")])
#   acquisitionNum                                  spectrumId
# 1             11 controllerType=0 controllerNumber=1 scan=11
# 2             12 controllerType=0 controllerNumber=1 scan=12
# 3             13 controllerType=0 controllerNumber=1 scan=13
# 4             14 controllerType=0 controllerNumber=1 scan=14
# 5             15 controllerType=0 controllerNumber=1 scan=15
# 6             16 controllerType=0 controllerNumber=1 scan=16
close(f)

proteowizard looks for controllerType=0 controllerNumber=1 if it is a thermo file. If found it looks for scan= if not it just numbers the spectra from 1:n:

https://github.com/sneumann/mzR/blob/ac131c31947e1d1b72aa9e31158d63b3d1db4a79/src/pwiz/data/msdata/MSData.cpp#L552-L582

Do you know why the spectrumId changed? Could you specify some parameters during the export process? If not I guess we need to fill a bug against proteowizard.

@pavel-shliaha
Copy link
Collaborator Author

pavel-shliaha commented Feb 22, 2018

The software that creates mzML is not proteoWizard, but ProteomeDiscoverer (from Thermo), since only PD contains X!tract deisotoping and charge state deconvolution node.

I agree the bug is not with topdownr, but with PD. So thanks a lot to Thermo... Could you please implement the corrections I wrote to you about, I mean

  1. implement an internal check that the nrow() of scansHeadsMan .txt is the same as the number of spectra in the MSnset
  2. if it works then simply change the scan numbers so that they correspond. Change in either scanHeadsman .txt or in mzML header table.

@pavel-shliaha
Copy link
Collaborator Author

I am, however, puzzled the data makes sense, which is not possible if matching between fragmentation conditions and spectra was off. So there must be some sort of matching control somewhere in the code

@sgibb sgibb removed the help wanted label Feb 22, 2018
@sgibb
Copy link
Owner

sgibb commented Feb 22, 2018

The software that creates mzML is not proteoWizard, but ProteomeDiscoverer (from Thermo), since only PD contains X!tract deisotoping and charge state deconvolution node.

I know but the software that imports the mzML is proteowizard (mzR is an R interface to a shrunken proteowizard (nearly everything but mz(X)ML import removed)).

So thanks a lot to Thermo...

It is always the same ...

Could you please implement the corrections I wrote to you about, I mean:

I will slightly change it. Hopefully my approach is a little bit more safe for future modifications:

  1. As you suggested I will implement a check that nrow scanheadsman and the number of spectra in the mzml are identical. If not we will show an error.
  2. Instead of simply changing the numbers manually I will parse the spectrumId information from the mzML that contains the scan= information. This will be somehow thermospecific but because of scanheadsman we are already thermospecific so it is not that bad.
  3. I will throw a warning if acquisitionNum and the information from spectrumId don't match (as in this particular case).

@pavel-shliaha
Copy link
Collaborator Author

perfect. I did not know that R uses proteowizard. I thought it is being imported as an XML file...

@sgibb
Copy link
Owner

sgibb commented Feb 22, 2018

Ok, I implement my 2. and 3. points but can't implement 1.

Most important the fix first:

library("topdownr")

H <- 1.0078250321
sampleColumns <- c("Mz", "AgcTarget",  "UvpdActivation")
UVPDFull <-  readTopDownFiles(path = getwd(),
                              type = c("a", "b", "c", "x", "y", "z"),
                              adducts = data.frame(mass=c(H, H, -H,  -H, H),
                                                   name=c("apH", "xpH", "ymH", "zmH", "zpH"),
                                                   to=c("a", "x", "y", "z", "z")),
                              neutralLoss = NULL,
                              modifications = c("Carbamidomethyl", "Acetyl", "Met-loss"),
                              sampleColumns = sampleColumns,
                              tolerance = 5e-6)
table(UVPDFull$AgcTarget)
#
#  100000  500000 1000000
#     129     106      96

Unfortunately I can't compare the nrow of scanheadsman and mzML because at least for the myoglobin dataset they never match:

library("mzR")
library("topdownrdata")

p <- topDownDataPath("myo")
mzml <- file.path(p, "mzml")
header <- file.path(p, "header")

m <- lapply(list.files(mzml, full.names=TRUE), function(m) {
    f <- openMSfile(m)
    h <- header(f)
    close(f)
    h[h$msLevel == 2,]
})

h <- lapply(list.files(header, full.names=TRUE), function(f) {
    h <- read.csv(f)
    h[h$MSOrder == 2,]
})

nr <- cbind(mzML=sapply(m, nrow), shm=sapply(h, nrow))
cbind(nr, delta=nr[,2]-nr[,1])
#       mzML shm delta
#  [1,]  351 357     6
#  [2,]  345 361    16
#  [3,]  279 286     7
#  [4,]  277 288    11
#  [5,]  276 286    10
#  [6,]  277 288    11
#  [7,]  361 365     4
#  [8,]  370 373     3
#  [9,]  293 296     3
# [10,]  293 295     2
# [11,]  294 296     2
# [12,]  294 298     4
# [13,]  366 369     3
# [14,]  365 373     8
# [15,]  288 294     6
# [16,]  289 294     5
# [17,]  287 294     7
# [18,]  292 296     4
# [19,]  141 150     9
# [20,]  144 151     7

Regarding 1. and your wish to have a "matching control": We have to take something for guaranteed (I thought that acquisitionNum would be this kind of anchor). Everything else is difficult. As I have mentioned in #63, TIC is different (even without our recalculation), BasePeakMz, BasePeakIntensity, lowMz, highMz, etc. they are all different, only retention time and ion injection time are similar (but still different, we could allow a small deviation and use this two columns for a rough match control, everything else seems useless; even ion injection time would be hard because they are very similar between runs, so the best bet would be the retention time). I am wondering why the output of both applications (Scanheadsman vs PD) are so different:

library("mzR")

f <- openMSfile("CA_UVPD_745_.mzML")
m <- header(f)
m <- m[m$msLevel == 2,]
close(f)
head(m)
#   seqNum acquisitionNum msLevel polarity peaksCount totIonCurrent retentionTime
# 1      1              1       2        1         26       9222232      35.15775
# 2      2              2       2        1        104       2555957      51.62017
# 3      3              3       2        1        131       3500345      68.03485
# 4      4              4       2        1        184       6561143      84.43040
# 5      5              5       2        1        174       4993073     101.00283
# 6      6              6       2        1        212       5100819     117.56625
#   basePeakMZ basePeakIntensity collisionEnergy ionisationEnergy    lowMZ
# 1   745.0015         3840828.5               0                0 723.8685
# 2  9657.2824          170899.0               0                0 707.1067
# 3  9665.2971          512793.0               0                0 670.3330
# 4  7436.6716          777629.0               0                0 709.4607
# 5  9668.9597          772262.4               0                0 421.1946
# 6  4995.4678          169933.3               0                0 421.1943
#      highMZ precursorScanNum precursorMZ precursorCharge precursorIntensity
# 1  9655.921                1    745.1802               1           11009877
# 2 13698.287                2    745.1802               1           11009877
# 3 13697.285                3    745.1802               1           11009877
# 4 13697.293                4    745.1802               1           11009877
# 5 14518.856                5    745.1802               1           11009877
# 6 14398.815                6    745.1802               1           11009877
#   mergedScan mergedResultScanNum mergedResultStartScanNum
# 1          0                   0                        0
# 2          0                   0                        0
# 3          0                   0                        0
# 4          0                   0                        0
# 5          0                   0                        0
# 6          0                   0                        0
#   mergedResultEndScanNum injectionTime filterString       spectrumId
# 1                      0       0.00227         <NA> scan=21 file=191
# 2                      0       0.00450         <NA> scan=22 file=191
# 3                      0       0.00440         <NA> scan=23 file=191
# 4                      0       0.00423         <NA> scan=24 file=191
# 5                      0       0.00420         <NA> scan=25 file=191
# 6                      0       0.00409         <NA> scan=26 file=191

h <- read.csv("CA_UVPD_1162_.txt")
h <- h[h$MSOrder == 2,]
head(h)
#    Scan MSOrder                                                  FilterString
# 21   21       2  FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
# 22   22       2  FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
# 23   23       2  FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
# 24   24       2 FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
# 25   25       2 FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
# 26   26       2 FTMS + p NSI Full ms2 [email protected] [100.0000-2000.0000]
#    DataType Analyzer Polarity Ionization Source.CID CompensationVoltage
# 21  Profile     FTMS Positive        NSI        Off                 Off
# 22  Profile     FTMS Positive        NSI        Off                 Off
# 23  Profile     FTMS Positive        NSI        Off                 Off
# 24  Profile     FTMS Positive        NSI        Off                 Off
# 25  Profile     FTMS Positive        NSI        Off                 Off
# 26  Profile     FTMS Positive        NSI        Off                 Off
#    TurboScan Enhanced DependentScan Wideband SupplementalActivation
# 21       Off      Off           Off      Off                    Off
# 22       Off      Off           Off      Off                    Off
# 23       Off      Off           Off      Off                    Off
# 24       Off      Off           Off      Off                    Off
# 25       Off      Off           Off      Off                    Off
# 26       Off      Off           Off      Off                    Off
#    AccurateMass UltraScan ScanType Multiplex RT..min. LowMass HighMass      TIC
# 21           On       Off     Full       Off  0.58991     100     2000 33413028
# 22           On       Off     Full       Off  0.86754     100     2000 17650910
# 23           On       Off     Full       Off  1.14659     100     2000 17578534
# 24           On       Off     Full       Off  1.42526     100     2000 16377225
# 25           On       Off     Full       Off  1.70711     100     2000 16823158
# 26           On       Off     Full       Off  1.98866     100     2000 15512068
#    BasePeakPosition BasePeakIntensity  X Scan.Description          AGC
# 21         1161.915         1070457.6 NA               NA On          
# 22         1161.954          957240.8 NA               NA On          
# 23         1161.915          951840.1 NA               NA On          
# 24         1161.875          653533.7 NA               NA On          
# 25         1161.914          694364.5 NA               NA On          
# 26         1161.874          422254.5 NA               NA On          
#    Micro.Scan.Count Ion.Injection.Time..ms. Elapsed.Scan.Time..sec.
# 21               40                   9.422                  16.657
# 22               40                  14.940                  16.743
# 23               40                  14.612                  16.720
# 24               40                  14.795                  16.911
# 25               40                  14.523                  16.893
# 26               40                  14.150                  17.094
#    Average.Scan.by.Inst Orbitrap.Resolution Access.ID API.Process.Delay
# 21                   No              120000         0                -1
# 22                   No              120000         0                -1
# 23                   No              120000         0                -1
# 24                   No              120000         0                -1
# 25                   No              120000         0                -1
# 26                   No              120000         0                -1
#    Dependency.Type Multi.Inject.Info Master.Scan.Number Monoisotopic.M.Z
# 21               0                NA                  0                0
# 22               0                NA                  0                0
# 23               0                NA                  0                0
# 24               0                NA                  0                0
# 25               0                NA                  0                0
# 26               0                NA                  0                0
#    Charge.State HCD.Energy MS2.Isolation.Width SPS.Masses SPS.Masses.Continued
# 21            1         -1                   1         NA                   NA
# 22            1         -1                   1         NA                   NA
# 23            1         -1                   1         NA                   NA
# 24            1         -1                   1         NA                   NA
# 25            1         -1                   1         NA                   NA
# 26            1         -1                   1         NA                   NA
#    Conversion.Parameter.I Conversion.Parameter.A Conversion.Parameter.B
# 21                      0                      0              211728668
# 22                      0                      0              211728668
# 23                      0                      0              211728668
# 24                      0                      0              211728669
# 25                      0                      0              211728669
# 26                      0                      0              211728669
#    Conversion.Parameter.C Conversion.Parameter.D Conversion.Parameter.E
# 21              -68944402                      0                      0
# 22              -68944402                      0                      0
# 23              -68944402                      0                      0
# 24              -68944402                      0                      0
# 25              -68944402                      0                      0
# 26              -68944402                      0                      0
#    Temperature.Comp...ppm. RF.Comp...ppm. Space.Charge.Comp...ppm.
# 21                   -5.64              0                    -0.41
# 22                   -5.64              0                    -0.41
# 23                   -5.64              0                    -0.41
# 24                   -5.64              0                    -0.41
# 25                   -5.64              0                    -0.41
# 26                   -5.64              0                    -0.41
#    Resolution.Comp...ppm. Number.of.LM.Found LM.Correction..ppm. RawOvFtT
# 21                  -0.29                  0                   0 301264.9
# 22                  -0.29                  0                   0 263915.4
# 23                  -0.29                  0                   0 258677.5
# 24                  -0.29                  0                   0 244920.4
# 25                  -0.29                  0                   0 246894.9
# 26                  -0.29                  0                   0 223337.2
#    Injection.t0 Reagent.Ion.Injection.Time..ms. NrCentroids SelectedMass1
# 21        0.000                               0        1031          1162
# 22        0.019                               0        1833          1162
# 23       -0.020                               0        1879          1162
# 24        0.000                               0        2283          1162
# 25        0.054                               0        2371          1162
# 26       -0.039                               0        2756          1162
#    Activation1 Energy1
# 21        UVPD       5
# 22        UVPD       5
# 23        UVPD       5
# 24        UVPD      10
# 25        UVPD      10
# 26        UVPD      15

I am going to close this issue. Please open a new one or for "matching control" if you have an idea how we could achieve that, otherwise we have to trust the "scan" id.

BTW: we have the same "matching control" problem with the ".experiments.csv" files. These files don't share anything with the scanheadsman or mzml header.

@sgibb
Copy link
Owner

sgibb commented Feb 22, 2018

Fixed in 7327ab5 (wrong issue number in commit message).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants