Remote File Support for input files #761
Thanks, this is interesting. I’ll have to think about whether I want this. I agree the code is not that intrusive, but it would require some documentation and may cause support requests. No one has asked for this feature before; would you actually benefit from it? (I don’t have much time right now, please ping me next week if I haven’t gotten back to you by then.)
Yes, I would benefit from this :-) Cutadapt is the first step in our WES and WGS workflow on AWS. Staging hundreds of FASTQ files at once when starting the analysis of a NovaSeq run incurs a significant cost in EBS and EFS elastic throughput. With direct S3 access, the network traffic is spread out more evenly over time. Similar efforts exist for htslib (samtools) and GATK (mainly for Google, though). I'm happy to help with the documentation.
I propose using smart_open with |
That's a good suggestion, I'll try to adapt and update here.
I've made some changes to xopen to support passing open file handles. It's this PR: pycompression/xopen#150. Once that is active, the current cutadapt PR can be re-evaluated. I've tested it, and S3 in/out processing with decent network speed is about as fast as local/local processing with 4 threads: approximately 9M reads/minute for paired data, using:
cutadapt --transport-params '{"max_pool_connections":50 , "buffer_size":64008864}' -q 30 -a AGATCGGAAGAG --minimum-length 18 -e 0.1 -O 3 -n 1 -j 4 -o s3://gvdw-testing-bucket-dev/R1.fastq.gz -p s3://gvdw-testing-bucket-dev/R2.fastq.gz s3://wesss-263124-s--20231204-123244/wesss-263124-s_S71_L001_R1_001.fastq.gz s3://wesss-263124-s--20231204-123244/wesss-263124-s_S71_L001_R2_001.fastq.gz
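For readers unfamiliar with the idea of handing an already-open file handle to the decompression layer, here is a minimal sketch using only the standard library. It is an illustration, not the xopen API from the PR mentioned above: `read_fastq_lines` is a hypothetical helper, stdlib `gzip` stands in for xopen, and the in-memory buffer plays the role of a remote stream.

```python
import gzip
import io


def read_fastq_lines(handle):
    """Decompress gzip data from an already-open binary handle and return text lines.

    Hypothetical helper: stdlib gzip stands in for xopen in this sketch.
    """
    # gzip.open accepts an existing binary file object as well as a path
    with gzip.open(handle, mode="rt") as reader:
        return [line.rstrip("\n") for line in reader]


# Simulate a remote stream with an in-memory, gzip-compressed FASTQ record
record = "@read1\nACGT\n+\nIIII\n"
remote_like = io.BytesIO(gzip.compress(record.encode("ascii")))
lines = read_fastq_lines(remote_like)
```

The point of the design is that the decompression code never needs to know where the bytes come from; any object with a binary `read` method would do.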
I've added support for remote files (s3, gcs, ftp, http(s)) as input files.
It's based on redirecting file opening to the smart_open library, and has minimal impact on the code.
Tests have been added and pass (for https and s3).
I tried the same for output files, but the gzip support of smart_open is very, very slow when tested on S3, so I removed it again.
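As a rough illustration of what redirecting file opening to smart_open can look like, here is a minimal sketch. It is not the code in this PR: `REMOTE_SCHEMES`, `is_remote` and `open_input` are hypothetical names, and the scheme list is an assumption based on the services listed above (with `gs` as the URL scheme for GCS).

```python
from urllib.parse import urlparse

# Schemes treated as remote in this sketch (an assumption, not taken from the PR)
REMOTE_SCHEMES = {"s3", "gs", "ftp", "http", "https"}


def is_remote(path):
    """Return True if the path carries one of the remote URL schemes."""
    return urlparse(path).scheme.lower() in REMOTE_SCHEMES


def open_input(path, transport_params=None):
    """Open a local or remote input file for binary reading (hypothetical helper)."""
    if is_remote(path):
        # Imported lazily so that local-only use does not require smart_open
        import smart_open

        # By default smart_open infers compression from the file extension
        return smart_open.open(path, "rb", transport_params=transport_params)
    return open(path, "rb")
```

Local paths take the ordinary `open` route, so existing behaviour stays unchanged unless a URL with a recognised scheme is given.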