feat: report download progress #56

vkt1414 · 2024-03-24T13:18:53Z

This PR aims to address #24 and #51

there is no way to track download sequentially as s5cmd run or cp command will be locked on a thread. so, download progress is outsourced to a thread and will run simultaneously along side download thread on both series and manifest downloader

as the index only contains aws urls, when a manifest contains gcs urls, crdc series instance uuid is extracted from aws urls and queried against the aws urls in the index to get download size. For manifest download, download size is calculated first and the progress is tracked against as a whole.

manifest validator in download from manifest is offloaded to a dedicated function that will check not only the first line but every line, and if the manifest has urls from both gcp and aws to raise an exception.

s5cmd cp is replaced with sync to gracefully avoid downloading the same data again.

get functions will now return a message that data not found for the values given for a key.

queries folder is now removed as they will persist in idc-index-data

vkt1414 · 2024-03-24T19:10:50Z

@jcfr whenever possible, could you help troubleshoot with the pylint error in CI? I am unable to figure what's causing the issue. Thanks!
https://github.com/ImagingDataCommons/idc-index/actions/runs/8411497038/job/23031159059?pr=56

fedorov · 2024-03-25T18:40:26Z

Maybe this article can help troubleshoot pylint error 12: https://stackoverflow.com/questions/76181900/pylint-return-non-zero-code-error-only-if-errors-are-there

fedorov · 2024-03-25T18:41:46Z

@jcfr sorry for the silly question - but if we are running ruff already, why do we also need to run pylint?

jcfr · 2024-03-25T18:45:08Z

ruff and pylint are complementary

jcfr · 2024-03-25T18:52:25Z

When first updating the CI, I added a lot "exception" related to potential issues:

idc-index/pyproject.toml

Lines 150 to 174 in 1868f8a

    
           # Exceptions below are specific to idc-index 
        
           "B007",     # Loop control variable {name} not used within loop body 
        
           "B904",     # Checks for raise statements in exception handlers that lack a from clause. 
        
           "E722",     # Do not use bare except 
        
           "EM101",    # Exception must not use a string literal, assign to variable first 
        
           "F841",     # Local variable {name} is assigned to but never used 
        
           "G003",     # Logging statement uses + 
        
           "G004",     # Logging statement uses f-string 
        
           "PD011",    # Use .to_numpy() instead of .values 
        
           "PD901",    # Avoid using the generic variable name df for DataFrames 
        
           "PT009",    # Use a regular assert instead of unittest-style {assertion} 
        
           "PTH100",   # os.path.abspath() should be replaced by Path.resolve() 
        
           "PTH103",   # os.makedirs() should be replaced by Path.mkdir(parents=True) 
        
           "PTH107",   # os.remove() should be replaced by Path.unlink() 
        
           "PTH110",   # os.path.exists() should be replaced by Path.exists() 
        
           "PTH118",   # Checks for uses of os.path.join 
        
           "PTH119",   # os.path.basename() should be replaced by Path.name 
        
           "PTH120",   # os.path.dirname() should be replaced by Path.parent 
        
           "PTH123",   # open() should be replaced by Path.open() 
        
           "RET504",   # Unnecessary assignment to {name} before return statement 
        
           "RET506",   # Unnecessary {branch} after raise statement 
        
           "SIM102",   # Use a single if statement instead of nested if statements 
        
           "SIM108",   # Use ternary operator {contents} instead of if-else-block 
        
           "SIM117",   # Use a single with statement with multiple contexts instead of nested with statements 
        
           "T201",     # print found

idc-index/pyproject.toml

Lines 196 to 206 in 1868f8a

    
           # Exceptions below are specific to idc-index 
        
           "invalid-name", 
        
           "missing-class-docstring", 
        
           "missing-function-docstring", 
        
           "logging-fstring-interpolation", 
        
           "logging-not-lazy", 
        
           "no-else-raise", 
        
           "raise-missing-from", 
        
           "undefined-loop-variable", 
        
           "unspecified-encoding", 
        
           "unused-variable",

To move forward, it would be sensible to:

gradually review and address these exceptions if it applies
address warnings associated with new code & feature

Note that in some cases, it is legitimate to have exception and not follow the recommendation associated with the warning. This should be address on a case-by-case basis

vkt1414 · 2024-03-25T21:22:53Z

@fedorov I added the initial disk size as a baseline, and the download progress is now going as expected. Calculating file size may be an overhead if too many files are in a directory. I do not know if there is any better way.

I do think if we implement patient/study/series hierarchy as in #22, we could only track the subfolders we know data is going to be downloaded into. Implementing this hierarchy may lessen some burden on disk size calculation.

I tried replacing threading with multiprocessing which seems to use Popen, it worked only on Ubuntu, but it did not work well on Windows and macOS systems, so I reverted to threading.

I can also confirm that s5cmd does not delete any existing files, so replacing cp with sync has been seamless.

Here's a run that did not work on Windows and Mac.
https://github.com/ImagingDataCommons/idc-index/actions/runs/8426861486

fedorov · 2024-03-29T18:49:18Z

@vkt1414 this is the use of Popen I was referring to at the meeting. What is wrong with this simple approach?

vkt1414 · 2024-03-29T18:53:52Z

@vkt1414 this is the use of Popen I was referring to at the meeting. What is wrong with this simple approach?

Thank you! Ubuntu(Linux) has no problem with any approach so far.

Either way, I'll give this a try.

fedorov · 2024-03-29T19:45:30Z

The idea is to use the simplest approach possible to improve readability and reduce maintenance.

vkt1414 · 2024-04-01T16:08:47Z

The idea is to use the simplest approach possible to improve readability and reduce maintenance.

Sure. I used Popen only this time, and this PR should now be ready for review.

fedorov · 2024-04-20T12:28:54Z

@vkt1414 this PR is still marked as draft - is this right?

idc_index/index.py

vkt1414 · 2024-04-24T22:56:36Z

@vkt1414 this PR is still marked as draft - is this right?

I have now marked it as ready for review now.

vkt1414 · 2024-04-25T06:04:37Z

@vkt1414 this PR is still marked as draft - is this right?

I have now marked it as ready for review now.

Changing back to draft as I think I can optimize the manifest validator and sync size calculations in seconds regardless of the manifest size. I'd be analyzing all series in the manifest, instead of one series at a time as it is currently implemented now. I'd be leveraging reading the manifest as a dataframe, followed by data wrangling operations to accomplish this.

vkt1414 · 2024-04-25T22:13:08Z

here's my attempt at creating the flowchart: (it's an svg file, so we can zoom)

fedorov · 2024-04-26T15:18:03Z

@vkt1414 I am going over the review. Your summary of how the function is implemented is very helpful for the review! But I really do not think it should be in the docstrings - those should contain details needed to understand what the function does. Not how it does it. I would recommend moving details of how the function is implemented into the comments within the body function and outside of the docstring.

Please wait with responding to this and with any other edits until I am done with my review!

fedorov

Nice work, I think it is very close!

idc_index/index.py

fedorov

As discussed today, there are some further optimization to be implemented in the validate function. Also, please see this PR for few comments/style changes: vkt1414#10. I have to say I did not finish the review - I only did the validate function...

vkt1414 · 2024-05-02T14:16:19Z

As discussed today, there are some further optimization to be implemented in the validate function. Also, please see this PR for few comments/style changes: vkt1414#10. I have to say I did not finish the review - I only did the validate function...

Sounds good. I made the validator as efficient as I possibly could. Made tests more robust now by testing all combinations of optional parameters. Made pytest more verbose in CI to aid in troubleshooting.

Also, please see this PR for few comments

Could you just comment any changes necessary on the lines directly? I think it is also easier to probably easier to keep the discussion in place. I also added you and JC as co-authors on this PR already.

there is no way to track download sequentially as s5cmd run or cp command will be locked on a thread. so, download progress is outsourced to a process and will run simultaneously along side download process for both download from selection and manifest For manifest or selection, download size is calculated first and the progress is tracked against as a whole. manifest validator in download from manifest is now offloaded to a dedicated function that can check not only the first line but every line when sync dry run is enabled, s5cmd cp is replaced with sync to gracefully avoid downloading the same data again. add additional endpoints download_dicom_studies, download_dicom_patients, download_dicom_patients and download_collection, all of them routed to download_from_selection Co-Authored-By: Andrey Fedorov <[email protected]> Co-Authored-By: Jean-Christophe Fillion-Robin <[email protected]>

* propagate s5cmd errors to the user * instead of raising RuntimeError, inform user about the error and save them from polluting the console with the stack trace

this speeds up test time and exercises the situation where the input manifest contains invalid URLs

* update the parameter to allow the user to request the use of s5cmd sync instead of the default s5cmd cp * eliminate the code that would condition the use of s5cmd sync on the progress bar parameter value

vkt1414 force-pushed the feat-download-progress branch from a1de6f6 to b9f8ba8 Compare March 24, 2024 19:06

vkt1414 requested review from jcfr and fedorov March 24, 2024 19:09

vkt1414 force-pushed the feat-download-progress branch from 301aeb7 to 62c9d7a Compare April 1, 2024 16:05

vkt1414 force-pushed the feat-download-progress branch from e40856f to 67a5be2 Compare April 5, 2024 12:58

vkt1414 marked this pull request as draft April 5, 2024 13:34

fedorov reviewed Apr 20, 2024

View reviewed changes

idc_index/index.py Outdated Show resolved Hide resolved

fedorov reviewed Apr 20, 2024

View reviewed changes

idc_index/index.py Outdated Show resolved Hide resolved

vkt1414 force-pushed the feat-download-progress branch from 67a5be2 to b29812e Compare April 24, 2024 13:46

vkt1414 marked this pull request as ready for review April 24, 2024 22:56

vkt1414 marked this pull request as draft April 25, 2024 06:04

vkt1414 force-pushed the feat-download-progress branch 2 times, most recently from bf1493a to 2a9838d Compare April 25, 2024 17:14

vkt1414 marked this pull request as ready for review April 25, 2024 22:13

fedorov requested changes Apr 26, 2024

View reviewed changes

vkt1414 force-pushed the feat-download-progress branch from 88123a2 to c08aff5 Compare April 30, 2024 12:50

fedorov requested changes May 2, 2024

View reviewed changes

vkt1414 mentioned this pull request May 2, 2024

style: add comments and small fixes vkt1414/idc-index#10

Closed

vkt1414 force-pushed the feat-download-progress branch from c04acac to 02ee675 Compare May 2, 2024 14:13

vkt1414 linked an issue May 2, 2024 that may be closed by this pull request

Revisit progress reporting #24

Closed

vkt1414 force-pushed the feat-download-progress branch 4 times, most recently from 2906d93 to 2e2bf83 Compare May 2, 2024 16:10

vkt1414 force-pushed the feat-download-progress branch 2 times, most recently from c81ac0f to 213b5c7 Compare May 3, 2024 12:13

fedorov force-pushed the feat-download-progress branch from 530c0f2 to 546ec31 Compare May 3, 2024 17:10

bug: fix error reporting

4d60e6e

* propagate s5cmd errors to the user * instead of raising RuntimeError, inform user about the error and save them from polluting the console with the stack trace

fedorov force-pushed the feat-download-progress branch from 546ec31 to 4d60e6e Compare May 3, 2024 17:38

fedorov added 7 commits May 3, 2024 14:04

test: use smallest samples for testing and add bogus test

5ff6c90

this speeds up test time and exercises the situation where the input manifest contains invalid URLs

ci: disable consider-using-with for specific line

1dc6f7d

doc: add query used to select small test samples

6d740ec

bug: add missing bogus manifest

92e9be2

bug: fix removal of the temp file

fb2b595

style: format SQL and overrule ruff in this regard

654b2c8

style: reformat console output and switch to logger from print

6ca4ba0

fedorov force-pushed the feat-download-progress branch 2 times, most recently from b43afb7 to 9dd8f62 Compare May 3, 2024 21:20

enh: change handling of s5cmd sync

531444e

* update the parameter to allow the user to request the use of s5cmd sync instead of the default s5cmd cp * eliminate the code that would condition the use of s5cmd sync on the progress bar parameter value

fedorov force-pushed the feat-download-progress branch from 9dd8f62 to 531444e Compare May 3, 2024 21:25

fedorov approved these changes May 3, 2024

View reviewed changes

Merge branch 'ImagingDataCommons:main' into feat-download-progress

de34b81

vkt1414 merged commit f008e83 into ImagingDataCommons:main May 3, 2024
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: report download progress #56

feat: report download progress #56

vkt1414 commented Mar 24, 2024 •

edited by jcfr

Loading

vkt1414 commented Mar 24, 2024

fedorov commented Mar 25, 2024

fedorov commented Mar 25, 2024

jcfr commented Mar 25, 2024

jcfr commented Mar 25, 2024

vkt1414 commented Mar 25, 2024 •

edited

Loading

fedorov commented Mar 29, 2024

vkt1414 commented Mar 29, 2024

fedorov commented Mar 29, 2024

vkt1414 commented Apr 1, 2024

fedorov commented Apr 20, 2024

vkt1414 commented Apr 24, 2024

vkt1414 commented Apr 25, 2024

vkt1414 commented Apr 25, 2024

fedorov commented Apr 26, 2024 •

edited

Loading

fedorov left a comment

fedorov left a comment •

edited

Loading

vkt1414 commented May 2, 2024 •

edited

Loading

feat: report download progress #56

feat: report download progress #56

Conversation

vkt1414 commented Mar 24, 2024 • edited by jcfr Loading

vkt1414 commented Mar 24, 2024

fedorov commented Mar 25, 2024

fedorov commented Mar 25, 2024

jcfr commented Mar 25, 2024

jcfr commented Mar 25, 2024

vkt1414 commented Mar 25, 2024 • edited Loading

fedorov commented Mar 29, 2024

vkt1414 commented Mar 29, 2024

fedorov commented Mar 29, 2024

vkt1414 commented Apr 1, 2024

fedorov commented Apr 20, 2024

vkt1414 commented Apr 24, 2024

vkt1414 commented Apr 25, 2024

vkt1414 commented Apr 25, 2024

fedorov commented Apr 26, 2024 • edited Loading

fedorov left a comment

Choose a reason for hiding this comment

fedorov left a comment • edited Loading

Choose a reason for hiding this comment

vkt1414 commented May 2, 2024 • edited Loading

vkt1414 commented Mar 24, 2024 •

edited by jcfr

Loading

vkt1414 commented Mar 25, 2024 •

edited

Loading

fedorov commented Apr 26, 2024 •

edited

Loading

fedorov left a comment •

edited

Loading

vkt1414 commented May 2, 2024 •

edited

Loading