feat: report download progress #56
a1de6f6
to
b9f8ba8
Compare
@jcfr whenever possible, could you help troubleshoot the pylint error in CI? I am unable to figure out what's causing the issue. Thanks! |
Maybe this article can help troubleshoot pylint error 12: https://stackoverflow.com/questions/76181900/pylint-return-non-zero-code-error-only-if-errors-are-there |
@jcfr sorry for the silly question - but if we are running ruff already, why do we also need to run pylint? |
ruff and pylint are complementary |
When first updating the CI, I added a lot of "exceptions" related to potential issues: Lines 150 to 174 in 1868f8a
Lines 196 to 206 in 1868f8a
To move forward, it would be sensible to:
Note that in some cases, it is legitimate to keep an exception and not follow the recommendation associated with the warning. This should be addressed on a case-by-case basis. |
@fedorov I added the initial disk size as a baseline, and the download progress is now reported as expected. Calculating the directory size may add overhead if a directory contains too many files. I do not know if there is a better way. I do think that if we implement the patient/study/series hierarchy as in #22, we could track only the subfolders we know data is going to be downloaded into, which may lessen the burden of the disk size calculation. I tried replacing threading with multiprocessing, which seems to use Popen. It worked on Ubuntu, but it did not work well on Windows and macOS, so I reverted to threading. I can also confirm that s5cmd does not delete any existing files, so replacing cp with sync has been seamless. Here's a run that did not work on Windows and Mac. |
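For readers following the baseline idea, here is a minimal sketch of measuring a directory's size before the download starts and subtracting it later. The function names `directory_size_bytes` and `downloaded_bytes` are hypothetical and are not taken from this PR; the sketch only assumes that progress is derived from bytes on disk.

```python
import os
import tempfile


def directory_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked data is not counted twice.
            if os.path.islink(file_path):
                continue
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # The file may disappear or be renamed while a download runs.
                pass
    return total


def downloaded_bytes(path, baseline):
    """Bytes added under path since baseline was measured (never negative)."""
    return max(0, directory_size_bytes(path) - baseline)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as download_dir:
        # Pre-existing data that must not count as download progress.
        with open(os.path.join(download_dir, "existing.dcm"), "wb") as f:
            f.write(b"\0" * 100)
        baseline = directory_size_bytes(download_dir)
        # Simulate newly downloaded data.
        with open(os.path.join(download_dir, "new.dcm"), "wb") as f:
            f.write(b"\0" * 250)
        print(downloaded_bytes(download_dir, baseline))  # prints 250
```

The walk visits every file on each call, which is the overhead mentioned above. Restricting `path` to the subfolders that will receive data would shrink that cost.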
@vkt1414 this is the use of |
Thank you! Ubuntu(Linux) has no problem with any approach so far. Either way, I'll give this a try. |
The idea is to use the simplest approach possible to improve readability and reduce maintenance. |
301aeb7
to
62c9d7a
Compare
Sure. I used Popen only this time, and this PR should now be ready for review. |
e40856f
to
67a5be2
Compare
@vkt1414 this PR is still marked as draft - is this right? |
67a5be2
to
b29812e
Compare
I have now marked it as ready for review. |
Changing back to draft, as I think I can optimize the manifest validator and the sync size calculations to run in seconds regardless of the manifest size. I would analyze all series in the manifest at once, instead of one series at a time as currently implemented, by reading the manifest as a dataframe and applying data wrangling operations. |
bf1493a
to
2a9838d
Compare
@vkt1414 I am going over the review. Your summary of how the function is implemented is very helpful for the review! But I really do not think it should be in the docstrings - those should contain the details needed to understand what the function does, not how it does it. I would recommend moving the details of how the function is implemented out of the docstring and into comments within the function body. Please hold off on responding to this and on making any other edits until I am done with my review! |
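To illustrate the split between "what" and "how", here is a small sketch. The function `total_download_size_mb` and its parameters are hypothetical and do not come from this PR: the docstring describes the contract, while the implementation notes live in comments inside the body.

```python
def total_download_size_mb(series_sizes_mb, selected_series):
    """Return the combined size in MB of the selected series.

    Args:
        series_sizes_mb: mapping of series identifier to size in MB.
        selected_series: iterable of series identifiers to include.

    Raises:
        KeyError: if a selected series is missing from the mapping.
    """
    # How: deduplicate first so a series listed twice is counted once,
    # then sum the per-series sizes in a single pass.
    unique_series = set(selected_series)
    return sum(series_sizes_mb[series] for series in unique_series)
```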
Nice work, I think it is very close!
88123a2
to
c08aff5
Compare
As discussed today, there are some further optimizations to be implemented in the validate function. Also, please see this PR for a few comments/style changes: vkt1414#10. I have to say I did not finish the review - I only did the validate function...
c04acac
to
02ee675
Compare
Sounds good. I made the validator as efficient as I possibly could. Made tests more robust now by testing all combinations of optional parameters. Made pytest more verbose in CI to aid in troubleshooting.
Could you just comment any necessary changes directly on the lines? I think it is also probably easier to keep the discussion in place. I also added you and JC as co-authors on this PR already. |
2906d93
to
2e2bf83
Compare
There is no way to track the download sequentially, as the s5cmd run or cp command occupies its thread until it finishes. So download progress is outsourced to a separate process that runs alongside the download process, for both download from selection and download from manifest.
For a manifest or a selection, the download size is calculated first and progress is tracked against that total.
The manifest validator in download from manifest is now offloaded to a dedicated function that can check not only the first line but every line.
When sync dry run is enabled, s5cmd cp is replaced with sync to gracefully avoid downloading the same data again.
Add additional endpoints download_dicom_studies, download_dicom_patients, download_dicom_patients and download_collection, all of them routed to download_from_selection.
Co-Authored-By: Andrey Fedorov <[email protected]>
Co-Authored-By: Jean-Christophe Fillion-Robin <[email protected]>
c81ac0f
to
213b5c7
Compare
530c0f2
to
546ec31
Compare
* propagate s5cmd errors to the user
* instead of raising RuntimeError, inform the user about the error and avoid polluting the console with the stack trace
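A minimal sketch of this error-handling pattern, assuming the download tool is launched as a subprocess and reports problems on stderr. The helper name `run_download_command` is hypothetical, and a Python one-liner stands in for the real s5cmd invocation so the example runs anywhere.

```python
import subprocess
import sys


def run_download_command(command):
    """Run command; return True on success, otherwise report the error and return False."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        print(f"Could not start {command[0]!r}: executable not found.", file=sys.stderr)
        return False
    if result.returncode != 0:
        # Show the tool's own message instead of raising and printing a stack trace.
        print(f"Download failed (exit code {result.returncode}):", file=sys.stderr)
        print(result.stderr.strip(), file=sys.stderr)
        return False
    return True


if __name__ == "__main__":
    # Stand-in for a failing download command (an assumption, not a real s5cmd call).
    failing = [sys.executable, "-c",
               "import sys; sys.stderr.write('ERROR: no such object'); sys.exit(1)"]
    print(run_download_command(failing))
```

Returning a boolean keeps the decision about what to do next with the caller, while the user still sees the message produced by the tool.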
546ec31
to
4d60e6e
Compare
this speeds up test time and exercises the situation where the input manifest contains invalid URLs
b43afb7
to
9dd8f62
Compare
* update the parameter to allow the user to request the use of s5cmd sync instead of the default s5cmd cp
* eliminate the code that would condition the use of s5cmd sync on the progress bar parameter value
9dd8f62
to
531444e
Compare
This PR aims to address #24 and #51
There is no way to track the download sequentially, as the s5cmd run or cp command occupies its thread until it finishes. So download progress tracking is outsourced to a separate thread that runs alongside the download thread, in both the series and the manifest downloader.
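The arrangement described above can be sketched as follows, under stated assumptions: progress is estimated by polling the size of the download directory, the expected total is known up front, and the names `track_progress` and `download_with_progress` are hypothetical rather than the ones used in this PR. A Python one-liner stands in for the s5cmd command.

```python
import os
import subprocess
import sys
import tempfile
import threading


def directory_size_bytes(path):
    """Total size in bytes of the files under path, ignoring files that vanish."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass
    return total


def percent_done(path, baseline, expected_bytes):
    """Percent of expected_bytes written under path since baseline, capped at 100."""
    done = max(0, directory_size_bytes(path) - baseline)
    return min(100.0, 100.0 * done / expected_bytes)


def track_progress(path, baseline, expected_bytes, stop_event, samples, interval):
    """Poll the directory until stop_event is set, recording percent complete."""
    while not stop_event.is_set():
        samples.append(percent_done(path, baseline, expected_bytes))
        stop_event.wait(interval)


def download_with_progress(command, path, expected_bytes, interval=0.05):
    """Run command while a tracker thread samples progress; return (returncode, samples)."""
    if expected_bytes <= 0:
        raise ValueError("expected_bytes must be positive")
    baseline = directory_size_bytes(path)
    stop_event = threading.Event()
    samples = []
    tracker = threading.Thread(
        target=track_progress,
        args=(path, baseline, expected_bytes, stop_event, samples, interval),
        daemon=True,
    )
    tracker.start()
    try:
        # The download blocks here; the tracker keeps sampling in parallel.
        returncode = subprocess.run(command, check=False).returncode
    finally:
        stop_event.set()
        tracker.join()
    # One final measurement after the download has finished.
    samples.append(percent_done(path, baseline, expected_bytes))
    return returncode, samples


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as download_dir:
        target = os.path.join(download_dir, "series.bin")
        # Stand-in for the real download command: writes 1000 bytes to target.
        fake_download = [sys.executable, "-c",
                         "import sys; open(sys.argv[1], 'wb').write(b'x' * 1000)", target]
        code, samples = download_with_progress(fake_download, download_dir, 1000)
        print(code, samples[-1])
```

Using a `threading.Event` both as the stop flag and as the sleep timer lets the tracker exit promptly once the download returns.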
As the index only contains AWS URLs, when a manifest contains GCS URLs, the CRDC series instance UUID is extracted from AWS URLs and queried against the AWS URLs in the index to get the download size. For a manifest download, the download size is calculated first and progress is tracked against that total.
The manifest validator in download from manifest is offloaded to a dedicated function that checks not only the first line but every line, and raises an exception if the manifest has URLs from both GCP and AWS.
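A minimal sketch of such a validator, not the implementation in this PR. It assumes that each non-empty, non-comment manifest line contains one `s3://` or `gs://` URL (for example `cp s3://bucket/uuid/* .`), and the name `validate_manifest_lines` is hypothetical.

```python
import re

# Assumed URL shape: an s3:// or gs:// scheme followed by non-whitespace characters.
URL_PATTERN = re.compile(r"\b(s3|gs)://\S+")


def validate_manifest_lines(lines):
    """Check every manifest line and return the single URL scheme used ('s3' or 'gs')."""
    schemes = set()
    for line_number, raw_line in enumerate(lines, start=1):
        line = raw_line.strip()
        # Blank lines and comments carry no download instruction.
        if not line or line.startswith("#"):
            continue
        match = URL_PATTERN.search(line)
        if match is None:
            raise ValueError(f"line {line_number}: no s3:// or gs:// URL found")
        schemes.add(match.group(1))
    if not schemes:
        raise ValueError("manifest contains no download URLs")
    if len(schemes) > 1:
        raise ValueError("manifest mixes AWS (s3://) and GCP (gs://) URLs")
    return schemes.pop()


if __name__ == "__main__":
    manifest = ["# example manifest", "cp s3://bucket/series-a/* .", "cp s3://bucket/series-b/* ."]
    print(validate_manifest_lines(manifest))  # prints s3
```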
s5cmd cp is replaced with sync to gracefully avoid downloading the same data again.
The get functions now return a message stating that no data was found for the values given for a key.
The queries folder is now removed, as the queries will persist in idc-index-data.