
Stream file directly to s3 bucket without downloading to local disk first #517

Open
selkamand opened this issue Aug 8, 2022 · 5 comments

selkamand commented Aug 8, 2022

Thanks for the package. Really impressive work!

Question about paws functionality

I was wondering if it's currently possible to take a URL, for example
https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf

and stream the file directly to an S3 bucket without ever downloading it to the local disk.

What I've tried so far
As far as I can tell, put_object doesn't check whether Body is a URL, and so it cannot find the file if the following is run:

svc$put_object(Body = "https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf", Bucket = "bucketname", Key = "ieee_talk.pdf")

The next thing I tried was to use a connection object as the Body of put_object, which obviously failed, since Body must be a string.
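
For illustration, a sketch of that kind of attempt (hypothetical; base R's url() is just one way to open such a connection):

con <- url("https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf", open = "rb")
# Fails: paws expects Body to be a string or raw vector, not a connection
svc$put_object(Body = con, Bucket = "bucketname", Key = "ieee_talk.pdf")
close(con)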

Basically, I'm looking for the paws equivalent of the following (which uses the AWS CLI):

curl <url> | aws s3 cp - s3://bucketname/keyname

This feature is particularly useful when the files you want to store on S3 are very large, since you avoid having to transfer them twice: once from the source to the local machine, then once more from local to S3.

TL;DR
Is it currently possible to copy a remote file using its URL straight into S3 using paws?

@DyfanJones (Member)

Hi @selkamand,

Would something like this solve your issue?

s3 <- paws::s3()
stream <- curl::curl_fetch_memory("https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf")
s3$put_object(Body = stream$content, Bucket = "bucketname", Key = "ieee_talk.pdf")

DyfanJones added the question 🧐❓ Further information is requested label on Aug 8, 2022
@selkamand (Author)

Thanks @DyfanJones

Using curl::curl_fetch_memory does avoid saving to disk, but if I understand correctly, it requires that the target file fit within the RAM of the local machine.

This is a limitation I would hope to avoid, something I really should have specified in the original question (my bad!)

Is there an alternative that would support streaming the remote file into an S3 bucket?


DyfanJones commented Aug 10, 2022

If you don't want to do it in one step (i.e. download the file to memory and then upload), you can do it using the multipart upload method. Something like this:

library(httr2)
library(paws)

Bucket <- "your_bucket"
Key <- "my_file"

# Environment to track part numbers and ETags across callback calls
upload_no <- new.env(parent = emptyenv())
upload_no$i <- 1
upload_no$parts <- list()

s3 <- paws::s3()

upload_id <- s3$create_multipart_upload(
  Bucket = Bucket, Key = Key
)$UploadId

# Streaming callback: uploads each chunk as one part and records its ETag
s3_upload_part <- function(x) {
  etag <- s3$upload_part(
    Body = x,
    Bucket = Bucket,
    Key = Key,
    PartNumber = upload_no$i,
    UploadId = upload_id
  )$ETag
  upload_no$parts[[upload_no$i]] <- list(ETag = etag, PartNumber = upload_no$i)
  upload_no$i <- upload_no$i + 1
  return(TRUE)  # return TRUE so httr2 keeps streaming
}

tryCatch(
  resp <- req_stream(
    request("your url"),
    s3_upload_part,
    buffer_kb = 5 * 1024  # 5 MB chunks: the S3 minimum part size
  ),
  error = function(e) {
    # Clean up the incomplete upload, then re-raise so we don't try to complete it
    s3$abort_multipart_upload(
      Bucket = Bucket,
      Key = Key,
      UploadId = upload_id
    )
    stop(e)
  }
)

s3$complete_multipart_upload(
  Bucket = Bucket,
  Key = Key,
  UploadId = upload_id,
  MultipartUpload = list(Parts = upload_no$parts)
)

Note: the buffer can't be less than 5 MB, since S3 requires every part of a multipart upload except the last to be at least 5 MB.
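
As a quick sanity check (a hedged sketch, reusing the hypothetical Bucket and Key values from above), you can confirm the object landed and check its size:

# Confirm the streamed object now exists in the bucket
head <- s3$head_object(Bucket = Bucket, Key = Key)
head$ContentLength  # size of the uploaded object in bytes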


DyfanJones commented Aug 10, 2022

Note it isn't clean, but that is what you need to do. If you like, you can raise a feature request with s3fs (an R implementation of the s3fs file-system interface, based on the R package fs and using paws under the hood). I believe it would make a nice addition as an s3_file_stream_out method.

@DyfanJones (Member)

This functionality has been added to s3fs. So feel free to use that method or the method above :)

For completeness here is the s3fs code:

remotes::install_github("DyfanJones/s3fs")

library(s3fs)

s3_file_stream_out(
  "https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf",
  "s3://mybucket/ieee_talk.pdf"
)
