
Remote File Support for input files #761

Open · wants to merge 4 commits into main
Conversation

geertvandeweyer

I've added support for remote files (s3, gcs, ftp, http(s)) as input files.

  • It redirects file opening to the smart_open library and has minimal impact on the code.

  • Tests have been added and pass (for https and s3).

  • I tried the same for output files, but smart_open's gzip support is very slow when tested on S3, so I removed that again.
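The redirect described above can be sketched as a small dispatch helper (hypothetical names and structure, not the PR's actual code; `compression="disable"` assumes a recent smart_open release, where it replaced the older `ignore_ext=True`):

```python
from urllib.parse import urlparse

REMOTE_SCHEMES = {"s3", "gs", "ftp", "http", "https"}

def open_raw(path):
    """Return a binary file object for a local path or remote URL.

    Hypothetical helper: remote URLs go through smart_open with its
    built-in decompression disabled (so a downstream opener such as
    xopen can decompress efficiently); local paths use plain open().
    """
    if urlparse(str(path)).scheme in REMOTE_SCHEMES:
        import smart_open  # third-party dependency, assumed installed
        return smart_open.open(path, "rb", compression="disable")
    return open(path, "rb")
```

Keeping the scheme check in one place means the rest of the code never needs to know whether an input is local or remote.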

@marcelm
Owner

marcelm commented Feb 6, 2024

Thanks, this is interesting. I’ll have to think about whether I want this. I agree the code is not that intrusive, but it would require some documentation and may cause support requests.

No one has asked for this feature before, would you actually benefit from it?

(I don’t have much time right now, please ping me next week if I haven’t gotten back to you by then.)

@geertvandeweyer
Author

Yes, I would benefit from this :-)

Cutadapt is the first step in our WES and WGS workflow on AWS. The massive staging of hundreds of FASTQ files when starting the analysis of a NovaSeq run incurs a significant cost in EBS and EFS elastic throughput. With direct S3 access, the network traffic is spread out more evenly over time.

Similar efforts exist for htslib (samtools) and GATK (mainly for Google, though).

I'm happy to help with the documentation.

@rhpvorderman
Collaborator

I propose using smart_open with ignore_ext and then passing the filehandle to xopen. xopen does not support filehandles yet, but it should be possible to add that. Especially since the latest refactorings have almost halved the codebase, there is room for some additional functionality again. This way there is no need to handle .xz extensions differently, and gzip files will be decompressed very efficiently.
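The handle-passing idea can be illustrated with a plain gzip wrapper over an in-memory stream (a sketch only; in the proposed design, smart_open would produce the raw handle and xopen would do the efficient decompression):

```python
import gzip
import io

# Simulate the raw byte stream a transport layer (e.g. smart_open with
# compression disabled) would return for a remote .fastq.gz object.
payload = gzip.compress(b"@read1\nACGT\n+\nFFFF\n")
raw_handle = io.BytesIO(payload)

# Handle passing: the decompressor receives an already-open binary
# handle, so the transport layer never deals with compression at all.
with gzip.open(raw_handle, "rb") as fh:
    data = fh.read()

print(data)  # the original FASTQ record bytes
```

Because compression is detected from the stream rather than the filename, extensions like .gz or .xz no longer need special-casing in the transport layer.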

@geertvandeweyer
Author

> I propose using smart_open with ignore_ext and then passing the filehandle to xopen. xopen does not support filehandles yet, but it should be possible to add that. Especially since the latest refactorings have almost halved the codebase, there is room for some additional functionality again. This way there is no need to handle .xz extensions differently, and gzip files will be decompressed very efficiently.

That's a good suggestion; I'll try to adapt it and update here.

@geertvandeweyer
Author

I've made some changes to xopen to support passing open filehandles.

It's this PR: pycompression/xopen#150

Once that is in, the current cutadapt PR can be re-evaluated. I've tested it, and S3 in/out processing with decent network speed is about as fast as local/local processing with 4 threads: approximately 9M reads/minute for paired data, using:

cutadapt --transport-params '{"max_pool_connections":50 , "buffer_size":64008864}' -q 30 -a AGATCGGAAGAG --minimum-length 18 -e 0.1 -O 3 -n 1 -j 4 -o s3://gvdw-testing-bucket-dev/R1.fastq.gz -p s3://gvdw-testing-bucket-dev/R2.fastq.gz s3://wesss-263124-s--20231204-123244/wesss-263124-s_S71_L001_R1_001.fastq.gz s3://wesss-263124-s--20231204-123244/wesss-263124-s_S71_L001_R2_001.fastq.gz
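For reference, the `--transport-params` value in the command above is a JSON object that gets forwarded to smart_open's `transport_params` argument, which in the S3 case hands the settings through to the boto3-based backend. Parsing it is straightforward:

```python
import json

# The JSON string as passed on the command line above
raw = '{"max_pool_connections": 50, "buffer_size": 64008864}'
transport_params = json.loads(raw)

# smart_open.open(url, "rb", transport_params=transport_params)
# would pass these settings to the S3 transport (boto3-style keys).
```

Raising `max_pool_connections` and the read buffer size is what makes the multi-threaded S3 throughput quoted above possible.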

3 participants