list_indices and fetch_index implemented #101

DanielaSchacherer · 2024-07-22T12:54:13Z

I made a suggestion on how to implement

list_indices()
fetch_index()

There is certainly room for improvement, but we can use it as a basis for further discussion.

fedorov · 2024-07-22T13:41:43Z

@DanielaSchacherer to help take care of the CI issues, you can install pre-commit hooks with pre-commit install in the repo folder. This will apply those checks and auto-corrections every time you try to commit locally, so you do not need to wait to see what happens with CI.

fedorov · 2024-07-22T13:57:21Z

Daniela, you didn't need to close the PR - all of those issues can be resolved by subsequent commits to the same branch.

fedorov · 2024-07-22T14:13:13Z

@DanielaSchacherer also here's a useful hint - if you are working on a PR and want to mark it as not ready for review, you can set it to "Draft" status - upper right corner of the page "Convert to draft" link.

DanielaSchacherer · 2024-07-22T14:15:55Z

Okay, sorry. I am not yet very familiar with CI tools in GitHub.
Very useful tips! Thanks!

fedorov · 2024-07-22T15:59:28Z

@DanielaSchacherer you will have to bear with one more tip!

As you go over small refinements, you will inevitably be adding commits that make small tweaks but bring no value to the code development history (such as, in your case, the last 4 commits).

In such cases, good practice is to squash those commits into one commit. In your case, it would be git rebase -i HEAD~5 (you can read about interactive rebase here, for example: https://itexus.com/glossary/git-rebase-interactive/ - and I am sure there are better/more comprehensive articles).

Also, the commit message has to be more informative, and, when a PR corresponds to an existing issue, it should link to that issue (you can do it easily using GitHub notation and just mention #97). After you have squashed those 5 commits into 1, you can modify the commit message with git commit --amend. You would then need to force-push your branch to GitHub with git push -f.

Finally, we use the convention to have prefix indicating the nature of the contribution. This is a good guide to follow: https://slicer.readthedocs.io/en/latest/developer_guide/contributing.html#how-to-write-commit-messages.

fedorov

In addition to the comments here and the comment above, please add a test!

idc_index/index.py

fedorov · 2024-07-22T17:31:51Z

@DanielaSchacherer I am happy to discuss the above on a call - I understand it can be confusing.

DanielaSchacherer · 2024-07-24T12:34:33Z

Note: I experienced a problem with accessing GitHub assets, as there appears to be an API limit (error 403). That's why one of the tests fails.
Also as a note: I changed the code to only consider parquet files, instead of parquet and csv. Is that alright?

vkt1414 · 2024-07-24T15:29:14Z

Note: I experienced a problem with accessing GitHub assets, as there appears to be an API limit (error 403). That's why one of the tests fails.

There is a limit of 60 requests per hour.

https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#primary-rate-limit-for-unauthenticated-users
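To make a rate-limited 403 easier to recognize in CI logs, a response could be inspected along these lines. This is only a sketch: `describe_rate_limit` is a hypothetical helper, not part of idc-index, and it assumes the headers are available as a mapping with lower-cased names (or a case-insensitive mapping such as the one `requests` provides).

```python
import time
from typing import Mapping, Optional


def describe_rate_limit(
    status_code: int,
    headers: Mapping[str, str],
    now: Optional[float] = None,
) -> Optional[str]:
    """Return a hint if the response looks rate-limited, otherwise None."""
    # GitHub reports the remaining quota and the reset time (epoch seconds)
    # in the x-ratelimit-remaining and x-ratelimit-reset response headers.
    remaining = headers.get("x-ratelimit-remaining")
    reset = headers.get("x-ratelimit-reset")
    if status_code not in (403, 429) or remaining != "0":
        return None
    if reset is None:
        return "Rate limit exhausted; reset time unknown."
    now = time.time() if now is None else now
    wait_seconds = max(0, int(reset) - int(now))
    return f"Rate limit exhausted; retry in about {wait_seconds} seconds."
```

Logging this hint next to the status code would distinguish an exhausted quota from other causes of a 403.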

Also as a note: I changed the code to only consider parquet files, instead of parquet and csv. Is that alright?

CSV is too inefficient to revert to. Currently, all indices are generated in the parquet file format only.

fedorov · 2024-07-24T15:36:44Z

Definitely, no need to consider CSV.

fedorov · 2024-07-26T14:37:01Z

Note: I experienced a problem with accessing GitHub assets, as there appears to be an API limit (error 403). That's why one of the tests fails.

I believe it is failing because you are trying to access sm_index, which is indeed not installed. idc-index currently depends on idc-index-data, and specifically on version 18.0.1 of the latter, which does not contain sm_index release attachment. It was added in 18.1.0.
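A guard for this kind of version mismatch could look roughly like the following. It is a sketch with hypothetical helper names and it assumes purely numeric dotted versions; `packaging.version.Version` would be the more robust choice if pre-release tags ever appear.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '18.1.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def index_available(installed_version: str, introduced_in: str) -> bool:
    """True if the installed data package is at least the release that added the index."""
    return parse_version(installed_version) >= parse_version(introduced_in)


# sm_index was added in 18.1.0, so it is missing from 18.0.1.
print(index_available("18.0.1", "18.1.0"))
```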

I am also now questioning whether we should install all of the parquet attachments blindly. If we do that, we need to think how to communicate the descriptions. It might be more practical to hard-code installation of the additional indices. It might be easier this way to explain to the users what they are.

DanielaSchacherer · 2024-07-30T15:27:53Z

I believe it is failing because you are trying to access sm_index, which is indeed not installed. idc-index currently depends on idc-index-data, and specifically on version 18.0.1 of the latter, which does not contain sm_index release attachment. It was added in 18.1.0.

I think you are right. Additionally, I encountered some API limits before when using specifically the latest idc-index-data version. I am trying to reproduce it now.

I am also now questioning whether we should install all of the parquet attachments blindly. If we do that, we need to think how to communicate the descriptions. It might be more practical to hard-code installation of the additional indices. It might be easier this way to explain to the users what they are.

I am not concerned so much about whether we download all parquet attachments blindly or on user's request. I am struggling more with how to explain available indices and their use (especially, which one is used when calling a function).

Regarding what you said in Slack:

I don't recall when the decision to download all assets was made.

Did we make that decision at all? I thought we made the decision to download them by request? But as said above, in either case we will need good descriptions on the intended use and I am not sure how to do that.

fedorov · 2024-07-30T16:24:14Z

Did we make that decision at all? I thought we made the decision to download them by request?

Sorry, I was not precise. What I meant is that right now the availability of indices is discovered on the fly by examining all release attachments that match a pattern. An alternative to this could be maintaining a list of known indices within the package, and interacting with GitHub only to retrieve (but not to discover) extra indices. Let's meet to discuss this @DanielaSchacherer.
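As a starting point for that discussion, a list of known indices kept inside the package might look something like this. Everything here is illustrative: `KNOWN_INDICES`, `summarize_indices`, the second index name, and the descriptions are placeholders, not the actual table.

```python
from typing import Dict, List

# Hypothetical registry maintained in the package. Only the name "sm_index"
# comes from this thread; the other entry and both descriptions are placeholders.
KNOWN_INDICES: Dict[str, Dict[str, object]] = {
    "sm_index": {
        "description": "Placeholder description for sm_index.",
        "installed": False,
    },
    "another_index": {
        "description": "Placeholder for a hypothetical second index.",
        "installed": True,
    },
}


def summarize_indices(registry: Dict[str, Dict[str, object]]) -> List[str]:
    """Return one readable line per known index, sorted by name."""
    lines = []
    for name in sorted(registry):
        info = registry[name]
        status = "installed" if info["installed"] else "not installed"
        lines.append(f"{name} ({status}): {info['description']}")
    return lines
```

With a table like this, GitHub is contacted only when an index is retrieved, and the descriptions shown to users live next to the names.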

fedorov · 2024-07-31T14:09:32Z

Looking at the CI logs, it is error 403

INFO     idc_index.index:index.py:183 Fetching the list of indices from idc-index-data 18.1.0 release.
ERROR    idc_index.index:index.py:199 Failed to fetch releases: 403

At this point, I would just switch to the fixed list of external indices. I personally do not see the value in debugging this. I would do that independently from this error.

DanielaSchacherer · 2024-07-31T14:14:41Z

@fedorov I'll give it a last try, but you are probably right.

fedorov · 2024-07-31T14:43:36Z

@DanielaSchacherer I forgot to mention it earlier, but I noticed you are working in the main branch of your fork. This, in my experience, is not a good idea, since usually you would want the main branch of your fork to track main upstream. This also creates challenges for collaborators, since they would need to deal with two main branches - yours and upstream.

Next time, I recommend you first make a dedicated branch for your PR.

This and earlier issues motivated me to add contribution guidelines to the repo - if you see anything missing there, please let me know: https://github.com/ImagingDataCommons/idc-index/blob/main/CONTRIBUTING.md.

fedorov · 2024-07-31T18:19:10Z

@DanielaSchacherer I resolved the conflicts locally and rebased onto the current main branch. I have no idea why the author has changed. I also have no idea why the tests are now failing... To be safe, I copied your main branch here (before the rebase and push into your current main) so you can compare the content: https://github.com/fedorov/idc-index/tree/100-main-daniela-backup.

Maybe you can take a look first, and then I will continue with my review? Or you can make a branch separate from main, and we can close this PR and open a new one if we think it will take some more time to iterate on this...

Sorry for not providing guidance on how to set up the dev process from the start. This contributed a lot to the current confusion...

vkt1414 · 2024-07-31T18:29:05Z

I also have no idea why the tests are now failing

The tests are failing because the URL is not correct.
If we are hardcoding the URLs, there is no need to use the API anymore.
We need to change asset_endpoint_url to
https://github.com/ImagingDataCommons/idc-index-data/releases/download/{idc_index_data.__version__}

https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/18.0.1
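For illustration, building the direct download URL without touching the API could be sketched as below. The base URL is the one given above; `asset_url` is a hypothetical helper, and the assumption that each asset is named `<index>.parquet` is mine and should be checked against the actual release attachments.

```python
RELEASE_DOWNLOAD_BASE = (
    "https://github.com/ImagingDataCommons/idc-index-data/releases/download"
)


def asset_url(version: str, index_name: str) -> str:
    """Build a direct release-asset URL; no REST API request is involved."""
    # Assumption: the release attachment is named "<index_name>.parquet".
    return f"{RELEASE_DOWNLOAD_BASE}/{version}/{index_name}.parquet"


print(asset_url("18.1.0", "sm_index"))
```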

fedorov · 2024-08-01T02:32:19Z

I also have no idea why the tests are now failing

What I meant to say is that I could not explain how the tests were succeeding before I resolved the conflict, but are failing now. Perhaps I messed something up while resolving the conflicts, and maybe that is the API URL. I was thinking Daniela would be best to review and confirm. I admit I did not have the time to actually review this PR.

DanielaSchacherer · 2024-08-01T16:07:33Z

@fedorov thank you for setting up the contribution README, that's very helpful and hopefully it will prevent more merge conflicts.

It was a combination of the URL (what Vamsi said) and some confusion about whether indices_overview is a dictionary or dataframe.

DanielaSchacherer · 2024-08-01T16:12:45Z

We are still missing descriptions of the indices. Is that something we should target here or does it make sense to have a separate PR for that?

fedorov · 2024-08-01T16:16:31Z

I will take a shot at that as part of my review!

fedorov · 2024-08-01T16:49:35Z

idc_index/index.py

+            indices_overview (pd.DataFrame): DataFrame containing information per index.
+        """
+
+        return pd.DataFrame.from_dict(self.indices_overview, orient="index")


What is the motivation to convert to a dataframe here? To me at least, this adds unnecessary complexity.

DanielaSchacherer · Aug 2, 2024

Just to display it in a readable way to the user.

fedorov · Aug 2, 2024

I personally would not have this function at all. I think it pollutes the API without good reason. I don't think we would want to have a dedicated function that would make a DF for every other dict class variable, right?

If we want a convenience function that would make a DataFrame from a dict, it is better to add a generic helper function that would do this for any dict.
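A generic helper of that kind could be as small as the sketch below. `mapping_to_rows` is a hypothetical name, and it stops short of pandas on purpose: it only flattens a dict of dicts into row dicts, which `pandas.DataFrame(rows)` accepts directly.

```python
from typing import Any, Dict, List, Mapping


def mapping_to_rows(
    mapping: Mapping[str, Mapping[str, Any]],
    key_column: str = "name",
) -> List[Dict[str, Any]]:
    """Flatten {key: {field: value}} into a list of row dicts."""
    rows = []
    for key, fields in mapping.items():
        # The outer key becomes an ordinary column so no information is lost.
        row: Dict[str, Any] = {key_column: key}
        row.update(fields)
        rows.append(row)
    return rows
```

Because it takes any mapping, it is not tied to one class variable.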

fedorov · 2024-08-01T16:50:27Z

idc_index/index.py

+
+        return pd.DataFrame.from_dict(self.indices_overview, orient="index")
+
+    def fetch_index(self, index) -> None:


In this function I would not just fetch the file, but also load it into the class variable named as the index name - consistent with the main index.
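Roughly what I have in mind, as a sketch rather than a proposal for the exact code: `IndexHolder` is a hypothetical stand-in for the client class, and the download and load steps are injected callables so the example runs without network access or pandas. In the real function they would presumably be an HTTP GET and a parquet reader.

```python
import os
import tempfile
from typing import Any, Callable, Dict


class IndexHolder:
    """Hypothetical stand-in for the client class discussed in this review."""

    def __init__(
        self,
        overview: Dict[str, Dict[str, Any]],
        download: Callable[[str], bytes],
        load: Callable[[str], Any],
        directory: str,
    ) -> None:
        self.indices_overview = overview
        self._download = download
        self._load = load
        self._directory = directory

    def fetch_index(self, index: str) -> None:
        """Download an index, load it, and expose it under its own name."""
        if index not in self.indices_overview:
            raise ValueError(f"Unknown index: {index}")
        filepath = os.path.join(self._directory, f"{index}.parquet")
        with open(filepath, mode="wb") as file:
            file.write(self._download(self.indices_overview[index]["url"]))
        # Set the loaded table as an attribute named after the index
        # (on the instance here; that is a choice made for this sketch).
        setattr(self, index, self._load(filepath))
        self.indices_overview[index]["installed"] = True


def _read_bytes(path: str) -> bytes:
    with open(path, "rb") as handle:
        return handle.read()


holder = IndexHolder(
    {"sm_index": {"url": "placeholder-url", "installed": False}},
    download=lambda url: b"fake parquet bytes",
    load=_read_bytes,
    directory=tempfile.mkdtemp(),
)
holder.fetch_index("sm_index")
```

After the call, `holder.sm_index` holds whatever the loader returned, mirroring how the main index is exposed.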

fedorov · 2024-08-02T14:32:21Z

@DanielaSchacherer I guess I lost track of the original purpose of this PR while reviewing yesterday. Are you planning to update the API to allow download of the instances based on the instance index as part of this PR?

DanielaSchacherer · 2024-08-02T15:31:44Z

@fedorov For that very reason, I would prefer to later make a new PR for this and for now continue the discussion that we started in #97.

fedorov · 2024-08-05T21:05:21Z

idc_index/index.py

+            if response.status_code == 200:
+                filepath = os.path.join(
+                    idc_index_data.IDC_INDEX_PARQUET_FILEPATH.parents[0],
+                    f"{index}.parquet",
+                )
+                with open(filepath, mode="wb") as file:
+                    file.write(response.content)
+                setattr(self.__class__, index, pd.read_parquet(filepath))
+                self.indices_overview[index]["installed"] = True


See related topic here: #107

DanielaSchacherer mentioned this pull request Jul 22, 2024

Add support for download of single instances instead of whole series #97

Closed

DanielaSchacherer closed this Jul 22, 2024

DanielaSchacherer reopened this Jul 22, 2024

DanielaSchacherer marked this pull request as draft July 22, 2024 14:15

fedorov requested changes Jul 22, 2024

DanielaSchacherer force-pushed the main branch from 5146454 to 5b20c9a Compare July 24, 2024 12:32

DanielaSchacherer force-pushed the main branch 3 times, most recently from 91b722b to 11777cc Compare July 24, 2024 13:30

fedorov force-pushed the main branch from ec4b35e to 0c7d405 Compare July 26, 2024 14:45

DanielaSchacherer force-pushed the main branch 2 times, most recently from 1d69823 to fb10b15 Compare July 31, 2024 15:43

fedorov and others added 2 commits July 31, 2024 13:49

ENH added functionality and tests to list available indices and fetch sm-specific ones.

9e1817c

ENH: Replaced dynamic Github release asset access with manually maintained table. Co-authored-by: Daniela Schacherer <[email protected]>

3e5b526

fedorov force-pushed the main branch from fb10b15 to 3e5b526 Compare July 31, 2024 18:07

DanielaSchacherer force-pushed the main branch from 3871627 to 6fc0cdf Compare August 1, 2024 16:08

fedorov marked this pull request as ready for review August 1, 2024 16:17

DanielaSchacherer and others added 2 commits August 1, 2024 12:54

BUG: fixing CI issues

85f4091

* fixed URL for download and handling of indices overview * fixed pylint bug about iterating dictionaries * fixed tests for list_indices and fetch_index

DOC: add brief descriptions for indices

0c36cb1

fedorov force-pushed the main branch from 6fc0cdf to 0c36cb1 Compare August 1, 2024 16:57

fedorov requested changes Aug 1, 2024

ENH: now setting class variable within fetch_index function and included this in the respective test

98375be

STYLE: remove DF conversion function for the index overview

aa021d5

In the future, we can add convenience conversion util function that could serve same purpose without being attached to a specific class variable

fedorov merged commit 26e6d13 into ImagingDataCommons:main Aug 2, 2024
10 checks passed

fedorov reviewed Aug 5, 2024
