-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
list_indices and fetch_index implemented #101
Conversation
@DanielaSchacherer to help take care of the CI issues, you can install pre-commit hooks with |
Daniela, you didn't need to close the PR - all of those issues can be resolved by subsequent commits to the same branch. |
@DanielaSchacherer also here's a useful hint - if you are working on a PR and want to mark it as not ready for review, you can set it to "Draft" status - upper right corner of the page "Convert to draft" link. |
Okay, sorry. I am not yet very familiar with CI tools in Github. |
@DanielaSchacherer you will have to bear with one more tip! As you go over small refinements, you will inevitably be adding commits that add small tweaks but do not bring any value for the code development history purposes (such as in your case last the last 4 commits). In such cases, good practice is to squash those commits into the one commit. In your case, it would be Also, the commit message has to be more informative, and, when PR corresponds to an existing issue, it should link to that issue (you can do it easily using github notation and just mention #97). After you squashed those 5 commits into 1, you can modify the commit message with Finally, we use the convention to have prefix indicating the nature of the contribution. This is a good guide to follow: https://slicer.readthedocs.io/en/latest/developer_guide/contributing.html#how-to-write-commit-messages. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the comments here and the comment above, please add a test!
@DanielaSchacherer I am happy to discuss the above on a call - I understand it can be confusing. |
Note: I experienced a problem with accessing github assets, as there appears to be an API limit (error 403). That's why one of the test fails. |
91b722b
to
11777cc
Compare
There is a limit of 60 requests per hour.
CSV is too inefficient to revert to. Currently, all indices are generated in the parquet file format only. |
Definitely, no need to consider CSV. |
I believe it is failing because you are trying to access I am also now questioning whether we should install all of the parquet attachments blindly. If we do that, we need to think how to communicate the descriptions. It might be more practical to hard-code installation of the additional indices. It might be easier this way to explain to the users what they are. |
I think you are right. Additionally, I encountered some API limits before when using specifically the latest idc-index-data version. I am trying to reproduce it now.
I am not concerned so much about whether we download all parquet attachments blindly or on user's request. I am struggling more with how to explain available indices and their use (especially, which one is used when calling a function). Regarding what you said in Slack:
Did we make that decision at all? I thought we made the decision to download them by request? But as said above, in either case we will need good descriptions on the intended use and I am not sure how to do that. |
Sorry, I was not precise. What I meant is that right now the availability of indices is discovered on the fly by examining all release attachments that match a pattern. An alternative to this could be maintaining a list of known indices within the package, and interacting with GitHub only to retrieve (but not to discover) extra indices. Let's meet to discuss this @DanielaSchacherer. |
Looking at the CI logs, it is error 403
At this point, I would just switch to the fixed list of external indices. I personally do not see the value in debugging this. I would do that independently from this error. |
@fedorov I'll give it a last try, but you are probably right. |
@DanielaSchacherer I forgot to mention it earlier, but I noticed you are working in the Next time, I recommend you first make a dedicated branch for your PR. This and earlier issues motivated me to add contribution guidelines to the repo - if you see anything missing there, please let me know: https://github.com/ImagingDataCommons/idc-index/blob/main/CONTRIBUTING.md. |
1d69823
to
fb10b15
Compare
… sm-specific ones.
…ained table. Co-authored-by: Daniela Schacherer <[email protected]>
@DanielaSchacherer I resolved the conflicts locally, and rebased to the current Maybe you can take a look first, and then I will continue with my review? Or you can make a separate from Sorry for not providing the guidance how to set up dev process from the start. This contributed a lot to the current confusion... |
The tests are failing because.. the url is not correct. https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/18.0.1 |
What I meant to say is that I could not explain how the tests were succeeding before I resolved the conflict, but are failing now. Perhaps I messed something up while resolving the conflicts, and maybe that is the API URL. I was thinking Daniela would be best to review and confirm. I admit I did not have the time to actually review this PR. |
@fedorov thank you for setting up the contribution README, that's very helpful and hopefully it will prevent more merge conflicts. It was a combination of the URL (what Vamsi said) and some confusion about whether indices_overview is a dictionary or dataframe. |
We are still missing descriptions of the indices. Is that something we should target here or does it make sense to have a separate PR for that? |
I will take a shot at that as part of my review! |
* fixed URL for download and handling of indices overview * fixed pylint bug about iterating dictionaries * fixed tests for list_indices and fetch_index
idc_index/index.py
Outdated
indices_overview (pd.DataFrame): DataFrame containing information per index. | ||
""" | ||
|
||
return pd.DataFrame.from_dict(self.indices_overview, orient="index") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the motivation to convert to a dataframe here? To me at least, this adds unnecessary complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to display it in a readable way to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally would not this function at all. I think it pollutes the API without good reason. I don't think we would want to have a dedicated function that would make a DF for every other dict class variable, right?
If we want a convenience function that would make a DataFrame from a dict, it is better to add a generic helper function that would do this for any dict.
|
||
return pd.DataFrame.from_dict(self.indices_overview, orient="index") | ||
|
||
def fetch_index(self, index) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this function I would not just fetch the file, but also load it into the class variable named as the index name - consistent with the main index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
…ded this in the respective test
@DanielaSchacherer I guess I lost track of the original purpose of this PR while reviewing yesterday. Are you planning to update the API to allow download of the instances based on the instance index as part of this PR? |
In the future, we can add convenience conversion util function that could serve same purpose without being attached to a specific class variable
if response.status_code == 200: | ||
filepath = os.path.join( | ||
idc_index_data.IDC_INDEX_PARQUET_FILEPATH.parents[0], | ||
f"{index}.parquet", | ||
) | ||
with open(filepath, mode="wb") as file: | ||
file.write(response.content) | ||
setattr(self.__class__, index, pd.read_parquet(filepath)) | ||
self.indices_overview[index]["installed"] = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See related topic here: #107
I made a suggestion on how to implement
There is certainly room for improvement, but we can use it as basis for further discussion.