Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

list_indices and fetch_index implemented #101

Merged
merged 6 commits into from
Aug 2, 2024

Conversation

DanielaSchacherer
Copy link
Contributor

@DanielaSchacherer DanielaSchacherer commented Jul 22, 2024

I made a suggestion on how to implement

  • list_indices()
  • fetch_index()

There is certainly room for improvement, but we can use it as basis for further discussion.

@fedorov
Copy link
Member

fedorov commented Jul 22, 2024

@DanielaSchacherer to help take care of the CI issues, you can install pre-commit hooks with pre-commit install in the repo folder. This will apply those checks and auto-corrections every time you try to commit locally, so you do not need to wait to see what happens with CI.

@fedorov
Copy link
Member

fedorov commented Jul 22, 2024

Daniela, you didn't need to close the PR - all of those issues can be resolved by subsequent commits to the same branch.

@fedorov
Copy link
Member

fedorov commented Jul 22, 2024

@DanielaSchacherer also here's a useful hint - if you are working on a PR and want to mark it as not ready for review, you can set it to "Draft" status - upper right corner of the page "Convert to draft" link.

image

@DanielaSchacherer DanielaSchacherer marked this pull request as draft July 22, 2024 14:15
@DanielaSchacherer
Copy link
Contributor Author

Okay, sorry. I am not yet very familiar with CI tools in Github.
Very useful tipps! Thanks!

@fedorov
Copy link
Member

fedorov commented Jul 22, 2024

@DanielaSchacherer you will have to bear with one more tip!

As you go over small refinements, you will inevitably be adding commits that add small tweaks but do not bring any value for the code development history purposes (such as in your case last the last 4 commits).

In such cases, good practice is to squash those commits into the one commit. In your case, it would be git rebase -i HEAD~5 (you can read about interactive rebase here, for example https://itexus.com/glossary/git-rebase-interactive/ - and I am sure there are better/more comprehensive articles).

Also, the commit message has to be more informative, and, when PR corresponds to an existing issue, it should link to that issue (you can do it easily using github notation and just mention #97). After you squashed those 5 commits into 1, you can modify the commit message with git commit --amend. You would then need to force-push your branch to github with git push -f.

Finally, we use the convention to have prefix indicating the nature of the contribution. This is a good guide to follow: https://slicer.readthedocs.io/en/latest/developer_guide/contributing.html#how-to-write-commit-messages.

Copy link
Member

@fedorov fedorov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the comments here and the comment above, please add a test!

idc_index/index.py Outdated Show resolved Hide resolved
idc_index/index.py Outdated Show resolved Hide resolved
idc_index/index.py Outdated Show resolved Hide resolved
idc_index/index.py Outdated Show resolved Hide resolved
@fedorov
Copy link
Member

fedorov commented Jul 22, 2024

@DanielaSchacherer I am happy to discuss the above on a call - I understand it can be confusing.

@DanielaSchacherer
Copy link
Contributor Author

DanielaSchacherer commented Jul 24, 2024

Note: I experienced a problem with accessing github assets, as there appears to be an API limit (error 403). That's why one of the test fails.
Also as a note: I changed the code to only consider parquet files, instead of parquet and csv. Is that alright?

@DanielaSchacherer DanielaSchacherer force-pushed the main branch 3 times, most recently from 91b722b to 11777cc Compare July 24, 2024 13:30
@vkt1414
Copy link
Collaborator

vkt1414 commented Jul 24, 2024

Note: I experienced a problem with accessing github assets, as there appears to be an API limit (error 403). That's why one of the test fails.

There is a limit of 60 requests per hour.

https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#primary-rate-limit-for-unauthenticated-users

Also as a note: I changed the code to only consider parquet files, instead of parquet and csv. Is that alright?

CSV is too inefficient to revert to. Currently, all indices are generated in the parquet file format only.

@fedorov
Copy link
Member

fedorov commented Jul 24, 2024

Definitely, no need to consider CSV.

@fedorov
Copy link
Member

fedorov commented Jul 26, 2024

Note: I experienced a problem with accessing github assets, as there appears to be an API limit (error 403). That's why one of the test fails.

I believe it is failing because you are trying to access sm_index, which is indeed not installed. idc-index currently depends on idc-index-data, and specifically on version 18.0.1 of the latter, which does not contain sm_index release attachment. It was added in 18.1.0.

I am also now questioning whether we should install all of the parquet attachments blindly. If we do that, we need to think how to communicate the descriptions. It might be more practical to hard-code installation of the additional indices. It might be easier this way to explain to the users what they are.

@DanielaSchacherer
Copy link
Contributor Author

DanielaSchacherer commented Jul 30, 2024

I believe it is failing because you are trying to access sm_index, which is indeed not installed. idc-index currently depends on idc-index-data, and specifically on version 18.0.1 of the latter, which does not contain sm_index release attachment. It was added in 18.1.0.

I think you are right. Additionally, I encountered some API limits before when using specifically the latest idc-index-data version. I am trying to reproduce it now.

I am also now questioning whether we should install all of the parquet attachments blindly. If we do that, we need to think how to communicate the descriptions. It might be more practical to hard-code installation of the additional indices. It might be easier this way to explain to the users what they are.

I am not concerned so much about whether we download all parquet attachments blindly or on user's request. I am struggling more with how to explain available indices and their use (especially, which one is used when calling a function).

Regarding what you said in Slack:

I don't recall when the decision to download all assets was made.

Did we make that decision at all? I thought we made the decision to download them by request? But as said above, in either case we will need good descriptions on the intended use and I am not sure how to do that.

@fedorov
Copy link
Member

fedorov commented Jul 30, 2024

Did we make that decision at all? I thought we made the decision to download them by request?

Sorry, I was not precise. What I meant is that right now the availability of indices is discovered on the fly by examining all release attachments that match a pattern. An alternative to this could be maintaining a list of known indices within the package, and interacting with GitHub only to retrieve (but not to discover) extra indices. Let's meet to discuss this @DanielaSchacherer.

@fedorov
Copy link
Member

fedorov commented Jul 31, 2024

Looking at the CI logs, it is error 403

INFO     idc_index.index:index.py:183 Fetching the list of indices from idc-index-data 18.1.0 release.
ERROR    idc_index.index:index.py:199 Failed to fetch releases: 403

At this point, I would just switch to the fixed list of external indices. I personally do not see the value in debugging this. I would do that independently from this error.

@DanielaSchacherer
Copy link
Contributor Author

@fedorov I'll give it a last try, but you are probably right.

@fedorov
Copy link
Member

fedorov commented Jul 31, 2024

@DanielaSchacherer I forgot to mention it earlier, but I noticed you are working in the main branch of your fork. This, in my experience, is not a good idea, since usually you would want the main branch of your fork to track main upstream. This also creates challenges for collaborators, since they would need to deal with two main branches - yours and upstream.

Next time, I recommend you first make a dedicated branch for your PR.

This and earlier issues motivated me to add contribution guidelines to the repo - if you see anything missing there, please let me know: https://github.com/ImagingDataCommons/idc-index/blob/main/CONTRIBUTING.md.

@DanielaSchacherer DanielaSchacherer force-pushed the main branch 2 times, most recently from 1d69823 to fb10b15 Compare July 31, 2024 15:43
@fedorov
Copy link
Member

fedorov commented Jul 31, 2024

@DanielaSchacherer I resolved the conflicts locally, and rebased to the current main branch. I have no idea why the author has changed. I also have no idea why the tests are now failing .... To be safe, I copied your main branch here (before rebase and push into your current main) so you can compare the content: https://github.com/fedorov/idc-index/tree/100-main-daniela-backup.

Maybe you can take a look first, and then I will continue with my review? Or you can make a separate from main branch, we can close this PR and open a new one, if we think it will take some more time to iterate on this...

Sorry for not providing the guidance how to set up dev process from the start. This contributed a lot to the current confusion...

@vkt1414
Copy link
Collaborator

vkt1414 commented Jul 31, 2024

I also have no idea why the tests are now failing

The tests are failing because.. the url is not correct.
If we are hardcoding the urls..there is no need to use API anymore..
we need to modify..
asset_endpoint_url to
https://github.com/ImagingDataCommons/idc-index-data/releases/download/{idc_index_data.__version__}

https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/18.0.1

image

@fedorov
Copy link
Member

fedorov commented Aug 1, 2024

I also have no idea why the tests are now failing

What I meant to say is that I could not explain how the tests were succeeding before I resolved the conflict, but are failing now. Perhaps I messed something up while resolving the conflicts, and maybe that is the API URL. I was thinking Daniela would be best to review and confirm. I admit I did not have the time to actually review this PR.

@DanielaSchacherer
Copy link
Contributor Author

@fedorov thank you for setting up the contribution README, that's very helpful and hopefully it will prevent more merge conflicts.

It was a combination of the URL (what Vamsi said) and some confusion about whether indices_overview is a dictionary or dataframe.

@DanielaSchacherer
Copy link
Contributor Author

We are still missing descriptions of the indices. Is that something we should target here or does it make sense to have a separate PR for that?

@fedorov
Copy link
Member

fedorov commented Aug 1, 2024

I will take a shot at that as part of my review!

@fedorov fedorov marked this pull request as ready for review August 1, 2024 16:17
DanielaSchacherer and others added 2 commits August 1, 2024 12:54
* fixed URL for download and handling of indices overview
* fixed pylint bug about iterating dictionaries
* fixed tests for list_indices and fetch_index
indices_overview (pd.DataFrame): DataFrame containing information per index.
"""

return pd.DataFrame.from_dict(self.indices_overview, orient="index")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation to convert to a dataframe here? To me at least, this adds unnecessary complexity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to display it in a readable way to the user.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally would not this function at all. I think it pollutes the API without good reason. I don't think we would want to have a dedicated function that would make a DF for every other dict class variable, right?

If we want a convenience function that would make a DataFrame from a dict, it is better to add a generic helper function that would do this for any dict.


return pd.DataFrame.from_dict(self.indices_overview, orient="index")

def fetch_index(self, index) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this function I would not just fetch the file, but also load it into the class variable named as the index name - consistent with the main index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@fedorov
Copy link
Member

fedorov commented Aug 2, 2024

@DanielaSchacherer I guess I lost track of the original purpose of this PR while reviewing yesterday. Are you planning to update the API to allow download of the instances based on the instance index as part of this PR?

@DanielaSchacherer
Copy link
Contributor Author

@fedorov For that very reason, I would prefer to later make a new PR for this and for now continue the discussion that we started in #97.

In the future, we can add convenience conversion util function that
could serve same purpose without being attached to a specific class
variable
@fedorov fedorov merged commit 26e6d13 into ImagingDataCommons:main Aug 2, 2024
10 checks passed
Comment on lines +209 to +217
if response.status_code == 200:
filepath = os.path.join(
idc_index_data.IDC_INDEX_PARQUET_FILEPATH.parents[0],
f"{index}.parquet",
)
with open(filepath, mode="wb") as file:
file.write(response.content)
setattr(self.__class__, index, pd.read_parquet(filepath))
self.indices_overview[index]["installed"] = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See related topic here: #107

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants