-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test performance on a large data set #2
Comments
CCLE data setA Performance is evaluated based on some frequently used tasks and compared to a ## con is the connection to the MariaDB database
ccle <- Spectra(con, source = MsqlBackend())
## con_blob is the connection to the MariaDB database with m/z and intensity stored as BLOB
ccle_blob <- Spectra(con_blob, source = MsqlBackend())
ccle_mzr <- Spectra(fls, source = MsBackendMzR())
## Size comparison
print(object.size(ccle), units = "MB")
## 162.5 MB
print(object.size(ccle_mzr), units = "MB")
## 13506 MB Since the Subsetting by retention time rangesystem.time(
ccle_rt <- filterRt(ccle, c(100, 110))
)
## user system elapsed
## 58.140 1.556 136.230
system.time(
ccle_blob_rt <- filterRt(ccle_blob, c(100, 110))
)
## user system elapsed
## 34.038 1.796 118.031
system.time(
ccle_mzr_rt <- filterRt(ccle_mzr, c(100, 110))
)
## user system elapsed
## 0.674 0.180 0.855 Filtering the length(ccle_mzr_rt)
[1] 14791
system.time(
pd <- peaksData(ccle_rt)
)
## user system elapsed
## 2.174 0.030 19.202
system.time(
pd_blob <- peaksData(ccle_blob_rt)
)
## user system elapsed
## 0.230 0.026 1.999
system.time(
pd_mzr <- peaksData(ccle_mzr_rt)
)
## user system elapsed
## 65.183 3.825 69.416 Extracting peaks data from an Get MS2 spectra with precursor m/z matching a predefined valuepmz <- 940.1510
system.time(
ccle_pmz <- filterPrecursorMzValues(ccle, pmz[1], tolerance = 0.01)
)
## user system elapsed
## 0.251 0.047 14.008
system.time(
ccle_blob_pmz <- filterPrecursorMzValues(ccle_blob, pmz[1], tolerance = 0.01)
)
## user system elapsed
## 0.265 0.000 14.707
system.time(
ccle_mzr_pmz <- filterPrecursorMzValues(ccle_mzr, pmz[1], tolerance = 0.01)
)
## user system elapsed
## 3.041 0.146 3.191
length(ccle_mzr_pmz)
[1] 1693 Subsetting is again faster if the data is in memory, but extracting peaks data is faster with the database-backed backend. system.time(
pd <- peaksData(ccle_pmz)
)
## user system elapsed
## 0.233 0.040 6.569
system.time(
pd_blob <- peaksData(ccle_blob_pmz)
)
## user system elapsed
## 0.043 0.003 1.064
system.time(
pd_mzr <- peaksData(ccle_mzr_pmz)
)
## user system elapsed
## 64.462 3.819 69.489 Note that caching e.g. the precursor m/z values locally (in memory) speeds up the filtering operation (but will obviously increase the memory demand): ccle$precursorMz <- precursorMz(ccle)
system.time(
ccle_pmz <- filterPrecursorMzValues(ccle, pmz, tolerance = 0.01)
)
## user system elapsed
## 3.164 0.107 3.274 ConclusionThe |
Update: by adding additional indices to the database table the performance of the filter calls can be increased: peakRAM(
ccle_blob_pmz <- filterPrecursorMzValues(ccle_blob, pmz[1], tolerance = 0.01)
)
## Function_Call
## 1 ccle_blob_pmz<-filterPrecursorMzValues(ccle_blob,pmz[1],tolerance=0.01)
## Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
## 1 0.315 0 650 Thus, the |
HILIC Data SetThis are some performance comparisons using our in-house untargeted metabolomics dataset:
Tests were run on R version . CPU was a quad core 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz. Compare size> print(object.size(chris), units = "MB")
4639.7 Mb
> print(object.size(chris_db), units = "MB")
115.9 Mb The Subset and filteringSubset to 1000 random spectra (using indices) > microbenchmark(
+ chris[idx],
+ chris_db[idx])
Unit: microseconds
expr min lq mean median uq max neval cld
chris[idx] 640.182 666.0300 721.2030 683.676 714.8640 3713.326 100 b
chris_db[idx] 91.875 102.6875 115.2116 114.568 125.5975 177.565 100 a Filter based on MS level > microbenchmark(
+ filterMsLevel(chris, 1L),
+ filterMsLevel(chris_db, 1L),
+ times = 7)
Unit: seconds
expr min lq mean median uq
filterMsLevel(chris, 1L) 2.355874 2.403097 2.550644 2.467742 2.660718
filterMsLevel(chris_db, 1L) 9.629843 9.693403 10.741660 9.796958 9.972104
max neval cld
2.903265 7 a
16.433808 7 b Filter by retention time > microbenchmark(
+ filterRt(chris, rt = c(200, 300)),
+ filterRt(chris_db, rt = c(200, 300)),
+ times = 7)
Unit: milliseconds
expr min lq mean median
filterRt(chris, rt = c(200, 300)) 953.2141 980.3457 1010.402 989.2629
filterRt(chris_db, rt = c(200, 300)) 4749.7759 4808.0915 4829.327 4847.0458
uq max neval cld
1038.790 1092.065 7 a
4853.335 4885.618 7 b Subset by index index is much faster for the Extracting peaks dataIn the next tests we extract the MS data ( > idx <- sample(seq_along(chris), 10000)
> idx_o <- order(idx)
> microbenchmark(
+ peaksData(chris[idx]),
+ peaksData(chris[idx], BPPARAM = SerialParam()),
+ peaksData(chris_db[idx]),
+ peaksData(chris[idx_o]),
+ peaksData(chris[idx_o], BPPARAM = SerialParam()),
+ peaksData(chris_db[idx_o]),
+ times = 7)
Unit: milliseconds
expr min lq
peaksData(chris[idx]) 42190.7919 43377.2902
peaksData(chris[idx], BPPARAM = SerialParam()) 36361.7414 37462.8143
peaksData(chris_db[idx]) 854.4480 872.4723
peaksData(chris[idx_o]) 3066.6946 3124.0278
peaksData(chris[idx_o], BPPARAM = SerialParam()) 5225.6786 5359.7452
peaksData(chris_db[idx_o]) 704.8962 799.1920
mean median uq max neval cld
44075.1709 44183.8126 44922.2583 45552.4946 7 e
38165.4711 37930.3878 38955.9492 40028.6411 7 d
921.7522 909.4374 948.7440 1045.9471 7 a
3210.9612 3191.9740 3278.5428 3412.9186 7 b
5426.6894 5415.2606 5480.3372 5665.7216 7 c
804.3953 815.3419 823.9607 864.2233 7 a Extracting peaks data from spectra of 3 random files. > fls <- sample(unique(chris$dataOrigin), 3)
> idx <- which(chris$dataOrigin %in% fls)
> microbenchmark(
+ peaksData(chris[idx]),
+ peaksData(chris[idx], BPPARAM = SerialParam()),
+ peaksData(chris_db[idx]),
+ times = 13
+ )
Unit: milliseconds
expr min lq mean
peaksData(chris[idx]) 1999.4652 2060.613 2099.8230
peaksData(chris[idx], BPPARAM = SerialParam()) 3204.7828 3253.108 3310.9872
peaksData(chris_db[idx]) 514.2502 533.412 614.6944
median uq max neval cld
2077.436 2143.4068 2227.182 13 b
3301.743 3328.6203 3443.200 13 c
548.369 570.3886 1403.045 13 a
|
You should probably run any benchmarks a few times to account for any potential DB caching. |
that's a good point - the |
Create a backend using various test data sets.
The text was updated successfully, but these errors were encountered: