
Stream file directly to s3 bucket without downloading to local disk first #517

Open
selkamand opened this issue Aug 8, 2022 · 5 comments

selkamand commented Aug 8, 2022

Thanks for the package. Really impressive work!

Question about paws functionality

I was wondering if it's currently possible to take a URL, for example
https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf

and stream the file directly to an S3 bucket without ever downloading it to the local disk.

What I've tried so far
As far as I can tell, put_object doesn't check whether Body is a URL, and so it cannot find the file if the following is run:

svc$put_object(Body = "https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf", Bucket = "bucketname", Key = "ieee_talk.pdf")

The next thing I tried was to use a connection object as the Body of put_object, which obviously failed, since Body must be a string.
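
For illustration, a sketch of that kind of attempt (hypothetical; base R's url() is just one way to open such a connection):

con <- url("https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf", open = "rb")
# Fails: paws expects Body to be a string or raw vector, not a connection
svc$put_object(Body = con, Bucket = "bucketname", Key = "ieee_talk.pdf")
close(con)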

Basically, I'm looking for the paws equivalent of the following (which uses the AWS CLI):

curl <url> | aws s3 cp - s3://bucketname/keyname

This feature is particularly useful when the files you want to store on S3 are very large, since you avoid having to transfer them twice: once from the source to the local machine, then once more from local to S3.

TL;DR
Is it currently possible to copy a remote file using its URL straight into S3 using paws?

@DyfanJones (Member)

Hi @selkamand,

Would something like this solve your issue?

s3 <- paws::s3()
stream <- curl::curl_fetch_memory("https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf")
s3$put_object(Body = stream$content, Bucket = "bucketname", Key = "ieee_talk.pdf")

DyfanJones added the question 🧐❓ Further information is requested label on Aug 8, 2022
@selkamand (Author)

Thanks @DyfanJones

Using curl::curl_fetch_memory does avoid saving to disk, but if I understand correctly, it requires that the target file fit within the RAM of the local machine.

This is a limitation I would hope to avoid, something I really should have specified in the original question (my bad!)

Is there an alternative that would support streaming the remote file into an S3 bucket?


DyfanJones commented Aug 10, 2022

If you don't want to do it in one step (i.e. download the file to memory and then upload), you can do it using the multipart upload method. Something like this:

library(httr2)
library(paws)

Bucket <- "your_bucket"
Key <- "my_file"

# Environment to track part numbers and ETags across callback calls
upload_no <- new.env(parent = emptyenv())
upload_no$i <- 1
upload_no$parts <- list()

s3 <- paws::s3()

upload_id <- s3$create_multipart_upload(
  Bucket = Bucket, Key = Key
)$UploadId

# Streaming callback: uploads each chunk as one part and records its ETag
s3_upload_part <- function(x) {
  etag <- s3$upload_part(
    Body = x,
    Bucket = Bucket,
    Key = Key,
    PartNumber = upload_no$i,
    UploadId = upload_id
  )$ETag
  upload_no$parts[[upload_no$i]] <- list(ETag = etag, PartNumber = upload_no$i)
  upload_no$i <- upload_no$i + 1
  return(TRUE)  # return TRUE so httr2 keeps streaming
}

tryCatch(
  resp <- req_stream(
    request("your url"),
    s3_upload_part,
    buffer_kb = 5 * 1024  # 5 MB chunks: the S3 minimum part size
  ),
  error = function(e) {
    # Clean up the incomplete upload, then re-raise so we don't try to complete it
    s3$abort_multipart_upload(
      Bucket = Bucket,
      Key = Key,
      UploadId = upload_id
    )
    stop(e)
  }
)

s3$complete_multipart_upload(
  Bucket = Bucket,
  Key = Key,
  UploadId = upload_id,
  MultipartUpload = list(Parts = upload_no$parts)
)

Note: the buffer can't be less than 5 MB, since S3 requires every part of a multipart upload except the last to be at least 5 MB.
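
As a quick sanity check (a hedged sketch, reusing the hypothetical Bucket and Key values from above), you can confirm the object landed and check its size:

# Confirm the streamed object now exists in the bucket
head <- s3$head_object(Bucket = Bucket, Key = Key)
head$ContentLength  # size of the uploaded object in bytes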


DyfanJones commented Aug 10, 2022

Note it isn't clean, but that is what you need to do. If you like, you can raise a feature request with s3fs (an R implementation of the s3fs file-system interface, based on the R package fs and using paws under the hood). I believe it would make a nice addition as an s3_file_stream_out method.

@DyfanJones (Member)

This functionality has been added to s3fs. So feel free to use that method or the method above :)

For completeness here is the s3fs code:

remotes::install_github("DyfanJones/s3fs")

library(s3fs)

s3_file_stream_out(
  "https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf",
  "s3://mybucket/ieee_talk.pdf"
)
