Stream file directly to s3 bucket without downloading to local disk first #517
Comments
Hi @selkamand, would something like this solve your issue?

s3 <- paws::s3()
stream <- curl::curl_fetch_memory("https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf")
s3$put_object(Body = stream$content, Bucket = "bucketname", Key = "ieee_talk.pdf")
Thanks @DyfanJones. Loading the whole file into memory with curl::curl_fetch_memory is a limitation I would hope to avoid - something I really should have specified in the original question (my bad!). Is there an alternative that would support streaming the remote file into the s3 bucket?
If you don't want to download the whole file to memory and upload it in one step, you can use the multipart upload method. Something like this:

library(httr2)
library(paws)
Bucket = "your_bucket"
Key = "my_file"
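# environment used to track the current part number and to collect the ETag of each uploaded part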
upload_no <- new.env(parent = emptyenv())
upload_no$i <- 1
upload_no$parts <- list()
s3 <- paws::s3()
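# start a multipart upload and keep its UploadId for the individual part uploads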
upload_id = s3$create_multipart_upload(
  Bucket = Bucket, Key = Key
)$UploadId
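# callback for req_stream(): each chunk of the download is uploaded as one part,
# and its ETag/PartNumber recorded so the upload can be completed afterwards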
s3_upload_part <- function(x){
  etag <- s3$upload_part(
    Body = x,
    Bucket = Bucket,
    Key = Key,
    PartNumber = upload_no$i,
    UploadId = upload_id
  )$ETag
  upload_no$parts[[upload_no$i]] <- list(ETag = etag, PartNumber = upload_no$i)
  upload_no$i <- upload_no$i + 1
  return(TRUE) # the req_stream() callback must return TRUE to keep streaming
}
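# stream the remote file through the callback; on error, abort the multipart
# upload so no orphaned parts are left behind in the bucket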
tryCatch(
  resp <- request("your url") %>%
    req_stream(s3_upload_part, buffer_kb = 5 * 1024),
  error = function(e){
    s3$abort_multipart_upload(
      Bucket = Bucket,
      Key = Key,
      UploadId = upload_id
    )
  })
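# finish the upload by sending the collected ETag/PartNumber pairs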
s3$complete_multipart_upload(
  Bucket = Bucket,
  Key = Key,
  UploadId = upload_id,
  MultipartUpload = list(Parts = upload_no$parts)
)

Note: the buffer can't be less than 5 MB, because S3 requires every part except the last to be at least 5 MB.
Note it isn't clean, but that is what you need to do. If you like, you can raise a feature request with s3fs (an R implementation of s3fs based on the R package fs and using paws under the hood). I believe it would make a nice addition to that package.
This functionality has been added to s3fs, so feel free to use that method or the method above :) For completeness, here is the s3fs code:

remotes::install_github("DyfanJones/s3fs")
library(s3fs)
s3_file_stream_out(
"https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf",
"s3://mybucket/ieee_talk.pdf"
)
Thanks for the package. Really impressive work!
Question about paws functionality
I was wondering if it's currently possible to take a URL, for example
https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf
and stream the file directly to an s3 bucket without ever downloading it to the local disk.
What I've tried so far
As far as I can tell, put_object doesn't check whether Body is a URL, so it cannot find the file when a URL is passed as Body. The next thing I tried was to use a connection object as the Body of put_object, which obviously failed since Body must be a string. Basically, I'm looking for the paws equivalent of the aws cli pattern sketched below.
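The original snippets weren't preserved in this copy of the issue, so the following is only a minimal sketch of what the two failed attempts and the aws cli equivalent presumably looked like (the bucket name is a placeholder):

library(paws)
s3 <- paws::s3()

# Attempt 1 (hypothetical reconstruction): pass the URL as Body.
# put_object does not fetch URLs, so this cannot find the file.
s3$put_object(
  Body = "https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf",
  Bucket = "mybucket",
  Key = "ieee_talk.pdf"
)

# Attempt 2 (hypothetical reconstruction): pass a connection object as Body,
# which fails because Body must be a string rather than a connection.
con <- url("https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf", open = "rb")
s3$put_object(Body = con, Bucket = "mybucket", Key = "ieee_talk.pdf")
close(con)

# The aws cli pattern being asked about streams stdin straight into the bucket:
# curl -s https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf | aws s3 cp - s3://mybucket/ieee_talk.pdf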
This feature is particularly useful when the files you want to store on S3 are very large, since you avoid transferring them twice - once to the local disk and then once more to s3.
TLDR
Is it currently possible to copy a remote file using its URL straight into s3 using paws?