Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: report download progress #56

Merged
merged 11 commits into from
May 3, 2024

Conversation

vkt1414
Copy link
Collaborator

@vkt1414 vkt1414 commented Mar 24, 2024

This PR aims to address #24 and #51

there is no way to track download sequentially as s5cmd run or cp command will be locked on a thread. so, download progress is outsourced to a thread and will run simultaneously along side download thread on both series and manifest downloader

as the index only contains aws urls, when a manifest contains gcs urls, crdc series instance uuid is extracted from aws urls and queried against the aws urls in the index to get download size. For manifest download, download size is calculated first and the progress is tracked against as a whole.

manifest validator in download from manifest is offloaded to a dedicated function that will check not only the first line but every line, and if the manifest has urls from both gcp and aws to raise an exception.

s5cmd cp is replaced with sync to gracefully avoid downloading the same data again.

get functions will now return a message that data not found for the values given for a key.

queries folder is now removed as they will persist in idc-index-data

@vkt1414 vkt1414 force-pushed the feat-download-progress branch from a1de6f6 to b9f8ba8 Compare March 24, 2024 19:06
@vkt1414 vkt1414 requested review from jcfr and fedorov March 24, 2024 19:09
@vkt1414
Copy link
Collaborator Author

vkt1414 commented Mar 24, 2024

@jcfr whenever possible, could you help troubleshoot with the pylint error in CI? I am unable to figure what's causing the issue. Thanks!
https://github.com/ImagingDataCommons/idc-index/actions/runs/8411497038/job/23031159059?pr=56

@fedorov
Copy link
Member

fedorov commented Mar 25, 2024

Maybe this article can help troubleshoot pylint error 12: https://stackoverflow.com/questions/76181900/pylint-return-non-zero-code-error-only-if-errors-are-there

@fedorov
Copy link
Member

fedorov commented Mar 25, 2024

@jcfr sorry for the silly question - but if we are running ruff already, why do we also need to run pylint?

@jcfr
Copy link
Collaborator

jcfr commented Mar 25, 2024

ruff and pylint are complementary

@jcfr
Copy link
Collaborator

jcfr commented Mar 25, 2024

When first updating the CI, I added a lot "exception" related to potential issues:

idc-index/pyproject.toml

Lines 150 to 174 in 1868f8a

# Exceptions below are specific to idc-index
"B007", # Loop control variable {name} not used within loop body
"B904", # Checks for raise statements in exception handlers that lack a from clause.
"E722", # Do not use bare except
"EM101", # Exception must not use a string literal, assign to variable first
"F841", # Local variable {name} is assigned to but never used
"G003", # Logging statement uses +
"G004", # Logging statement uses f-string
"PD011", # Use .to_numpy() instead of .values
"PD901", # Avoid using the generic variable name df for DataFrames
"PT009", # Use a regular assert instead of unittest-style {assertion}
"PTH100", # os.path.abspath() should be replaced by Path.resolve()
"PTH103", # os.makedirs() should be replaced by Path.mkdir(parents=True)
"PTH107", # os.remove() should be replaced by Path.unlink()
"PTH110", # os.path.exists() should be replaced by Path.exists()
"PTH118", # Checks for uses of os.path.join
"PTH119", # os.path.basename() should be replaced by Path.name
"PTH120", # os.path.dirname() should be replaced by Path.parent
"PTH123", # open() should be replaced by Path.open()
"RET504", # Unnecessary assignment to {name} before return statement
"RET506", # Unnecessary {branch} after raise statement
"SIM102", # Use a single if statement instead of nested if statements
"SIM108", # Use ternary operator {contents} instead of if-else-block
"SIM117", # Use a single with statement with multiple contexts instead of nested with statements
"T201", # print found

idc-index/pyproject.toml

Lines 196 to 206 in 1868f8a

# Exceptions below are specific to idc-index
"invalid-name",
"missing-class-docstring",
"missing-function-docstring",
"logging-fstring-interpolation",
"logging-not-lazy",
"no-else-raise",
"raise-missing-from",
"undefined-loop-variable",
"unspecified-encoding",
"unused-variable",

To move forward, it would be sensible to:

  • gradually review and address these exceptions if it applies
  • address warnings associated with new code & feature

Note that in some cases, it is legitimate to have exception and not follow the recommendation associated with the warning. This should be address on a case-by-case basis

@vkt1414
Copy link
Collaborator Author

vkt1414 commented Mar 25, 2024

@fedorov I added the initial disk size as a baseline, and the download progress is now going as expected. Calculating file size may be an overhead if too many files are in a directory. I do not know if there is any better way.

I do think if we implement patient/study/series hierarchy as in #22, we could only track the subfolders we know data is going to be downloaded into. Implementing this hierarchy may lessen some burden on disk size calculation.

I tried replacing threading with multiprocessing which seems to use Popen, it worked only on Ubuntu, but it did not work well on Windows and macOS systems, so I reverted to threading.

I can also confirm that s5cmd does not delete any existing files, so replacing cp with sync has been seamless.

Here's a run that did not work on Windows and Mac.
https://github.com/ImagingDataCommons/idc-index/actions/runs/8426861486

@fedorov
Copy link
Member

fedorov commented Mar 29, 2024

@vkt1414 this is the use of Popen I was referring to at the meeting. What is wrong with this simple approach?

image

@vkt1414
Copy link
Collaborator Author

vkt1414 commented Mar 29, 2024

@vkt1414 this is the use of Popen I was referring to at the meeting. What is wrong with this simple approach?

image

Thank you! Ubuntu(Linux) has no problem with any approach so far.

Either way, I'll give this a try.

@fedorov
Copy link
Member

fedorov commented Mar 29, 2024

The idea is to use the simplest approach possible to improve readability and reduce maintenance.

@vkt1414 vkt1414 force-pushed the feat-download-progress branch from 301aeb7 to 62c9d7a Compare April 1, 2024 16:05
@vkt1414
Copy link
Collaborator Author

vkt1414 commented Apr 1, 2024

The idea is to use the simplest approach possible to improve readability and reduce maintenance.

Sure. I used Popen only this time, and this PR should now be ready for review.

@vkt1414 vkt1414 force-pushed the feat-download-progress branch from e40856f to 67a5be2 Compare April 5, 2024 12:58
@vkt1414 vkt1414 marked this pull request as draft April 5, 2024 13:34
@fedorov
Copy link
Member

fedorov commented Apr 20, 2024

@vkt1414 this PR is still marked as draft - is this right?

idc_index/index.py Outdated Show resolved Hide resolved
idc_index/index.py Outdated Show resolved Hide resolved
@vkt1414 vkt1414 force-pushed the feat-download-progress branch from 67a5be2 to b29812e Compare April 24, 2024 13:46
@vkt1414 vkt1414 marked this pull request as ready for review April 24, 2024 22:56
@vkt1414
Copy link
Collaborator Author

vkt1414 commented Apr 24, 2024

@vkt1414 this PR is still marked as draft - is this right?

I have now marked it as ready for review now.

@vkt1414
Copy link
Collaborator Author

vkt1414 commented Apr 25, 2024

@vkt1414 this PR is still marked as draft - is this right?

I have now marked it as ready for review now.

Changing back to draft as I think I can optimize the manifest validator and sync size calculations in seconds regardless of the manifest size. I'd be analyzing all series in the manifest, instead of one series at a time as it is currently implemented now. I'd be leveraging reading the manifest as a dataframe, followed by data wrangling operations to accomplish this.

@vkt1414 vkt1414 marked this pull request as draft April 25, 2024 06:04
@vkt1414 vkt1414 force-pushed the feat-download-progress branch 2 times, most recently from bf1493a to 2a9838d Compare April 25, 2024 17:14
@vkt1414
Copy link
Collaborator Author

vkt1414 commented Apr 25, 2024

here's my attempt at creating the flowchart: (it's an svg file, so we can zoom)
flowchart

@vkt1414 vkt1414 marked this pull request as ready for review April 25, 2024 22:13
@fedorov
Copy link
Member

fedorov commented Apr 26, 2024

@vkt1414 I am going over the review. Your summary of how the function is implemented is very helpful for the review! But I really do not think it should be in the docstrings - those should contain details needed to understand what the function does. Not how it does it. I would recommend moving details of how the function is implemented into the comments within the body function and outside of the docstring.

Please wait with responding to this and with any other edits until I am done with my review!

Copy link
Member

@fedorov fedorov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, I think it is very close!

idc_index/index.py Show resolved Hide resolved
idc_index/index.py Outdated Show resolved Hide resolved
idc_index/index.py Outdated Show resolved Hide resolved
idc_index/index.py Outdated Show resolved Hide resolved
idc_index/index.py Outdated Show resolved Hide resolved
idc_index/index.py Outdated Show resolved Hide resolved
@vkt1414 vkt1414 force-pushed the feat-download-progress branch from 88123a2 to c08aff5 Compare April 30, 2024 12:50
Copy link
Member

@fedorov fedorov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed today, there are some further optimization to be implemented in the validate function. Also, please see this PR for few comments/style changes: vkt1414#10. I have to say I did not finish the review - I only did the validate function...

@vkt1414
Copy link
Collaborator Author

vkt1414 commented May 2, 2024

As discussed today, there are some further optimization to be implemented in the validate function. Also, please see this PR for few comments/style changes: vkt1414#10. I have to say I did not finish the review - I only did the validate function...

Sounds good. I made the validator as efficient as I possibly could. Made tests more robust now by testing all combinations of optional parameters. Made pytest more verbose in CI to aid in troubleshooting.

Also, please see this PR for few comments

Could you just comment any changes necessary on the lines directly? I think it is also easier to probably easier to keep the discussion in place. I also added you and JC as co-authors on this PR already.

@vkt1414 vkt1414 linked an issue May 2, 2024 that may be closed by this pull request
@vkt1414 vkt1414 force-pushed the feat-download-progress branch 4 times, most recently from 2906d93 to 2e2bf83 Compare May 2, 2024 16:10
there is no way to track download sequentially
as s5cmd run or cp command will be locked on a thread.
so, download progress is outsourced to a process
and will run simultaneously along side download
process for both download from selection and manifest

For manifest or selection, download size
is calculated first and the progress is tracked
against as a whole.

manifest validator in download from manifest is now
offloaded to a dedicated function that can check
not only the first line but every line

when sync dry run is enabled, s5cmd cp is replaced with sync to
gracefully avoid downloading the same data again.

add additional endpoints download_dicom_studies, download_dicom_patients,
download_dicom_patients and download_collection, all of them routed to
download_from_selection

Co-Authored-By: Andrey Fedorov <[email protected]>
Co-Authored-By: Jean-Christophe Fillion-Robin <[email protected]>
@vkt1414 vkt1414 force-pushed the feat-download-progress branch 2 times, most recently from c81ac0f to 213b5c7 Compare May 3, 2024 12:13
@fedorov fedorov force-pushed the feat-download-progress branch from 530c0f2 to 546ec31 Compare May 3, 2024 17:10
* propagate s5cmd errors to the user
* instead of raising RuntimeError, inform user about the error
and save them from polluting the console with the stack trace
@fedorov fedorov force-pushed the feat-download-progress branch from 546ec31 to 4d60e6e Compare May 3, 2024 17:38
@fedorov fedorov force-pushed the feat-download-progress branch 2 times, most recently from b43afb7 to 9dd8f62 Compare May 3, 2024 21:20
* update the parameter to allow the user to request the use of s5cmd sync
instead of the default s5cmd cp
* eliminate the code that would condition the use of s5cmd sync on
the progress bar parameter value
@fedorov fedorov force-pushed the feat-download-progress branch from 9dd8f62 to 531444e Compare May 3, 2024 21:25
@vkt1414 vkt1414 merged commit f008e83 into ImagingDataCommons:main May 3, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Revisit progress reporting
3 participants