Improving run time performance #140
Comments
Hi Mehar,
A reproducible example is very important. In the example below I process > 300K variants in seconds, while you report that it takes you 10 minutes. I suspect the issue is that you may have more samples than I realize, but because you've provided no data, I don't know, and that means I can't help you. It doesn't need to be your data; that's why I've provided example data. You could try modifying the code below to reproduce the behaviour you're experiencing.
Also, you are doing things I feel are unnecessary. One of the performance features of vcfR is that it only does things on demand. If your goal is to parse an entire file, you might want to try VariantAnnotation.
library(vcfR)
#>
#> ***** *** vcfR *** *****
#> This is vcfR 1.8.0
#> browseVignettes('vcfR') # Documentation
#> citation('vcfR') # Citation
#> ***** ***** ***** *****
library(pinfsc50)
library(microbenchmark)
vcf <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
my_vcf <- read.vcfR(vcf, verbose = FALSE)
my_vcf
#> ***** Object of Class vcfR *****
#> 18 samples
#> 1 CHROMs
#> 22,031 variants
#> Object size: 22.4 Mb
#> 7.929 percent missing data
#> ***** ***** *****
# "Add" more variants.
my_vcf@fix[, "CHROM"] <- 1
for(i in 1:4){
  my_vcf2 <- my_vcf
  my_vcf2@fix[, "CHROM"] <- i + 1
  my_vcf <- rbind2(my_vcf, my_vcf2)
}
my_vcf
#> ***** Object of Class vcfR *****
#> 18 samples
#> 5 CHROMs
#> 352,496 variants
#> Object size: 90.5 Mb
#> 7.929 percent missing data
#> ***** ***** *****
write.vcf(x = my_vcf, file = "big_data.vcf.gz")
res <- microbenchmark( my_vcf <- read.vcfR("big_data.vcf.gz", verbose = FALSE),
times = 10, unit = "s")
my_vcf
#> ***** Object of Class vcfR *****
#> 18 samples
#> 5 CHROMs
#> 352,496 variants
#> Object size: 90.5 Mb
#> 7.929 percent missing data
#> ***** ***** *****
print(res)
#> Unit: seconds
#> expr min lq
#> my_vcf <- read.vcfR("big_data.vcf.gz", verbose = FALSE) 7.574603 7.594612
#> mean median uq max neval
#> 7.873175 7.634449 7.660916 10.08351 10
Created on 2019-07-09 by the reprex package (v0.3.0)
Hi Brian,
You are correct: my input file has 2500 samples, and the issue could be the sample size. Here is the big data I am trying to read, which is the variant calls for chr1 from the 1000G Phase 3 data.
I couldn't even read the file. Could you suggest how to use vcfR to process such big data?
Hi Mehar, that was the critical information I needed. The issue is that the file is too large for you to read in.
memuse::howbig(nrow = 6468094, ncol = 2513)
#> 121.104 GiB
Created on 2019-07-10 by the reprex package (v0.3.0)
For context, my workstation only has 32 GB of RAM. So even if the language could theoretically create a data structure that large, my system doesn't have enough physical RAM. Also, it's been my experience that R doesn't perform very well if you try to make data structures that are over 1 GB in RAM. R is somewhat unusual as a language in that it was designed to read all of your data into RAM at once. Other languages tend to read in a chunk of your data, perform something, dump the results to an outfile, and read in another chunk. vcfR::read.vcfR has nrows and skip parameters, so you could read in parts of the file. Also keep in mind that because R is a high-level, interpreted language, it doesn't perform very well with really large datasets.
I think you need to put some critical thought into what you're trying to do. Perhaps using something like vcftools may be a more efficient option?
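The suggestion to read the file in parts can be sketched roughly as follows. This is a minimal illustration on the small pinfsc50 example file, not a tested recipe for the 1000G data. In the `read.vcfR` signature the row-count argument is spelled `nrows`. The names `n_variants`, `chunk_size`, `starts`, and `rows_per_chunk` are hypothetical, and the sketch assumes the total number of variants is already known (here it is taken from the printed object summary earlier in the thread).

```r
library(vcfR)
library(pinfsc50)

# Example file shipped with pinfsc50 (18 samples, 22,031 variants).
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")

n_variants <- 22031  # assumption: total known in advance (from the summary above)
chunk_size <- 5000   # hypothetical value; tune to the RAM you can spare

# Number of variants to skip before each chunk: 0, 5000, 10000, ...
starts <- seq(0, n_variants - 1, by = chunk_size)

rows_per_chunk <- vapply(starts, function(skip) {
  # Never request more rows than remain in the file.
  n_rows <- min(chunk_size, n_variants - skip)
  chunk <- read.vcfR(vcf_file, nrows = n_rows, skip = skip, verbose = FALSE)
  # Per-chunk work would go here, with results appended to an output file
  # so that only one chunk is held in memory at a time.
  nrow(chunk@fix)
}, integer(1))

# Total variants seen across all chunks.
sum(rows_per_chunk)
```

Each call still has to scan the compressed file from the start to reach its offset, so this trades run time for memory; whether that trade is acceptable depends on the file and the analysis.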
I am processing a VCF file to split the columns up to the INFO field (columns 1-9) into tab-separated columns. The example input has 250K variants.
To do this I am using the code below:
With 250K variants, the input dataset takes ~10 min to execute the code in STEP1 + STEP3a or STEP1 + STEP3b, and the run time increases with the number of variants in the input VCF file. Is it possible to reduce the run time with some parallel processing methods in R, so that it scales better with the size of the input VCF? I am not familiar with parallel processing methods; in the above code I have used lapply and mclapply for STEP3, neither of which gives a significant improvement.
Any suggestions to improve performance while reading the file in STEP1 and STEP3a/3b?
Since this is a performance issue rather than a bug in the package that would need a reproducible example, I am not able to share any input file.
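For readers with a similar task, one way to keep only columns 1-9 without loading the whole file is to stream it in blocks of lines. The sketch below uses base R connections rather than vcfR, so it is a swapped-in technique and not the code from this issue. The function name `extract_fixed_columns`, the argument `chunk_lines`, and the tiny demo file are all hypothetical, and it assumes a well-formed, tab-delimited VCF.

```r
# Stream a VCF in fixed-size blocks of lines, keeping only columns 1-9
# (CHROM through FORMAT) and dropping the per-sample genotype columns.
extract_fixed_columns <- function(in_file, out_file, chunk_lines = 10000L) {
  con_in <- gzfile(in_file, open = "rt")  # gzfile() also reads plain text files
  on.exit(close(con_in), add = TRUE)
  con_out <- file(out_file, open = "wt")
  on.exit(close(con_out), add = TRUE)

  n_written <- 0L
  repeat {
    lines <- readLines(con_in, n = chunk_lines)
    if (length(lines) == 0L) break            # end of file
    lines <- lines[!startsWith(lines, "##")]  # drop meta lines, keep #CHROM header
    if (length(lines) == 0L) next
    fields <- strsplit(lines, "\t", fixed = TRUE)
    kept <- vapply(fields, function(f) {
      paste(f[seq_len(min(9L, length(f)))], collapse = "\t")
    }, character(1))
    writeLines(kept, con_out)
    n_written <- n_written + sum(!startsWith(lines, "#"))
  }
  n_written  # number of variant rows written
}

# Tiny made-up VCF for demonstration only.
demo_in <- tempfile(fileext = ".vcf")
writeLines(c(
  "##fileformat=VCFv4.2",
  paste(c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "S1", "S2"), collapse = "\t"),
  paste(c("1", "100", ".", "A", "G", "50", "PASS", "DP=10", "GT", "0/1", "1/1"),
        collapse = "\t"),
  paste(c("1", "200", ".", "C", "T", "60", "PASS", "DP=12", "GT", "0/0", "0/1"),
        collapse = "\t")
), demo_in)

demo_out <- tempfile(fileext = ".tsv")
n <- extract_fixed_columns(demo_in, demo_out, chunk_lines = 2L)
result <- read.delim(demo_out, check.names = FALSE)
```

Because only `chunk_lines` lines are held at once, memory use stays roughly constant regardless of how many variants or samples the file contains; the cost is that no vcfR object is built, so vcfR's accessor functions are not available on the output.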