Skip to content

Commit

Permalink
bug: fix check for installed index
Browse files Browse the repository at this point in the history
enh: switch to dict for index overview and fix tests

bug fixed

test to fix bug on macOS

test

next test

enh: add contributing guidelines

Based on https://github.com/Slicer/Slicer/blob/main/CONTRIBUTING.md

ENH: improve error reporting for unrecognized items from the manifest

Re #100

Whenever a crdc_series_uuid from the manifest is not matched to those
known to the index, provide error message informing the user of what
could be the reasons. Fixed error checking for unrecognized items in
the validation function. Report unrecognized items independently of
whether validation is requested or not.

ENH: add test of the manifest that has mismatches

replaced github release access with hardcoded indices overview

wip

wip
  • Loading branch information
fedorov authored and DanielaSchacherer committed Jul 31, 2024
1 parent c52a2a1 commit 1d69823
Show file tree
Hide file tree
Showing 5 changed files with 169 additions and 556 deletions.
122 changes: 122 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
# Contributing to idc-index

There are many ways to contribute to idc-index, with varying levels of effort.
Do try to look through the [documentation](idc-index-docs) first if something is
unclear, and let us know how we can do better.

- Ask a question on the [IDC forum][idc-forum]
- Use [idc-index issues][idc-index-issues] to submit a feature request or bug,
or add to the discussion on an existing issue
- Submit a [Pull Request](https://github.com/ImagingDataCommons/idc-index/pulls)
to improve idc-index or its documentation

We encourage a range of Pull Requests, from patches that include passing tests
and documentation, all the way down to half-baked ideas that launch discussions.

## The PR Process, Circle CI, and Related Gotchas

### How to submit a PR ?

If you are new to idc-index development and you don't have push access to the
repository, here are the steps:

1. [Fork and clone](https://docs.github.com/get-started/quickstart/fork-a-repo)
the repository.
2. Create a branch dedicated to the feature/bugfix you plan to implement (do not
use `main` branch - this will complicate further development and
collaboration)
3. [Push](https://docs.github.com/get-started/using-git/pushing-commits-to-a-remote-repository)
the branch to your GitHub fork.
4. Create a
[Pull Request](https://github.com/ImagingDataCommons/idc-index/pulls).

This corresponds to the `Fork & Pull Model` described in the
[GitHub collaborative development](https://docs.github.com/pull-requests/collaborating-with-pull-requests/getting-started/about-collaborative-development-models)
documentation.

When submitting a PR, the developers following the project will be notified.
That said, to engage specific developers, you can add `Cc: @<username>` comment
to notify them of your awesome contributions. Based on the comments posted by
the reviewers, you may have to revisit your patches.

### How to efficiently contribute ?

We encourage all developers to:

- set up pre-commit hooks so that you can remedy various formatting and other
issues early, without waiting for the continuous integration (CI) checks to
complete: `pre-commit install`

- add or update tests. You can see current tests
[here](https://github.com/ImagingDataCommons/idc-index/tree/main/tests). If
you contribute new functionality, adding test(s) covering it is mandatory!

- you can run individual tests from the root repository using the following
command: `python -m unittest -vv tests.idcindex.TestIDCClient.<test_name>`

### How to write commit messages ?

Write your commit messages using the standard prefixes for commit messages:

- `BUG:` Fix for runtime crash or incorrect result
- `COMP:` Compiler error or warning fix
- `DOC:` Documentation change
- `ENH:` New functionality
- `PERF:` Performance improvement
- `STYLE:` No logic impact (indentation, comments)
- `WIP:` Work In Progress not ready for merge

The body of the message should clearly describe the motivation of the commit
(**what**, **why**, and **how**). In order to ease the task of reviewing
commits, the message body should follow the following guidelines:

1. Leave a blank line between the subject and the body. This helps `git log` and
`git rebase` work nicely, and allows to smooth generation of release notes.
2. Try to keep the subject line below 72 characters, ideally 50.
3. Capitalize the subject line.
4. Do not end the subject line with a period.
5. Use the imperative mood in the subject line (e.g.
`BUG: Fix spacing not being considered.`).
6. Wrap the body at 80 characters.
7. Use semantic line feeds to separate different ideas, which improves the
readability.
8. Be concise, but honor the change: if significant alternative solutions were
available, explain why they were discarded.
9. If the commit refers to a topic discussed on the [IDC forum][idc-forum], or
fixes a regression test, provide the link. If it fixes a compiler error,
provide a minimal verbatim message of the compiler error. If the commit
closes an issue, use the
[GitHub issue closing keywords](https://docs.github.com/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue).

Keep in mind that the significant time is invested in reviewing commits and
_pull requests_, so following these guidelines will greatly help the people
doing reviews.

These guidelines are largely inspired by Chris Beam's
[How to Write a Commit Message](https://chris.beams.io/posts/git-commit/) post.

### How to integrate a PR ?

Getting your contributions integrated is relatively straightforward, here is the
checklist:

- All tests pass
- Consensus is reached. This usually means that at least two reviewers approved
the changes (or added a `LGTM` comment) and at least one business day passed
without anyone objecting. `LGTM` is an acronym for _Looks Good to Me_.
- To accommodate developers explicitly asking for more time to test the proposed
changes, integration time can be delayed by few more days.
- If you do NOT have push access, a core developer will integrate your PR. If
you would like to speed up the integration, do not hesitate to add a reminder
comment to the PR

### Automatic testing of pull requests

Every pull request is tested automatically using GitHub Actions each time you
push a commit to it. The GitHub UI will restrict users from merging pull
requests until the CI build has returned with a successful result indicating
that all tests have passed.

[idc-forum]: https://discourse.canceridc.dev
[idc-index-issues]: https://github.com/ImagingDataCommons/idc-index/issues
[idc-index-docs]: https://idc-index.readthedocs.io/
121 changes: 41 additions & 80 deletions idc_index/index.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@

aws_endpoint_url = "https://s3.amazonaws.com"
gcp_endpoint_url = "https://storage.googleapis.com"
asset_endpoint_url = f"https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/{idc_index_data.__version__}"

logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
logger = logging.getLogger(__name__)
Expand Down Expand Up @@ -67,7 +68,24 @@ def __init__(self):
self.collection_summary = self.index.groupby("collection_id").agg(
{"Modality": pd.Series.unique, "series_size_MB": "sum"}
)
self.indices_overview = self.list_indices()

self.indices_overview = pd.DataFrame(
{
"index": {"description": None, "installed": True, "url": None},
"sm_index": {
"description": None,
"installed": True,
"url": os.path.join(asset_endpoint_url, "sm_index.parquet"),
},
"sm_instance_index": {
"description": None,
"installed": True,
"url": os.path.join(
asset_endpoint_url, "sm_instance_index.parquet"
),
},
}
)

# Lookup s5cmd
self.s5cmdPath = shutil.which("s5cmd")
Expand Down Expand Up @@ -172,33 +190,6 @@ def get_idc_version():
idc_version = Version(idc_index_data.__version__).major
return f"v{idc_version}"

@staticmethod
def _get_latest_idc_index_data_release_assets():
"""
Retrieves a list of the latest idc-index-data release assets.
Returns:
release_assets (list): List of tuples (asset_name, asset_url).
"""
release_assets = []
url = f"https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/{idc_index_data.__version__}"
try:
response = requests.get(url, timeout=30)
if response.status_code == 200:
release_data = response.json()
assets = release_data.get("assets", [])
for asset in assets:
release_assets.append(
(asset["name"], asset["browser_download_url"])
)
else:
logger.error(f"Failed to fetch releases: {response.status_code}")

except FileNotFoundError:
logger.error(f"Failed to fetch releases: {response.status_code}")

return release_assets

def list_indices(self):
"""
Lists all available indices including their installation status.
Expand All @@ -207,40 +198,6 @@ def list_indices(self):
indices_overview (pd.DataFrame): DataFrame containing information per index.
"""

if "indices_overview" not in locals():
indices_overview = {}
# Find installed indices
for file in distribution("idc-index-data").files:
if str(file).endswith("index.parquet"):
index_name = os.path.splitext(
str(file).rsplit("/", maxsplit=1)[-1]
)[0]

indices_overview[index_name] = {
"description": None,
"installed": True,
"local_path": os.path.join(
idc_index_data.IDC_INDEX_PARQUET_FILEPATH.parents[0],
f"{index_name}.parquet",
),
}

# Find available indices from idc-index-data
release_assets = self._get_latest_idc_index_data_release_assets()
for asset_name, asset_url in release_assets:
if asset_name.endswith(".parquet"):
asset_name = os.path.splitext(asset_name)[0]
if asset_name not in indices_overview:
indices_overview[asset_name] = {
"description": None,
"installed": False,
"url": asset_url,
}

self.indices_overview = pd.DataFrame.from_dict(
indices_overview, orient="index"
)

return self.indices_overview

def fetch_index(self, index) -> None:
Expand All @@ -251,23 +208,22 @@ def fetch_index(self, index) -> None:
index (str): Name of the index to be downloaded.
"""

if index not in self.indices_overview.index.tolist():
if index not in self.indices_overview.keys():
logger.error(f"Index {index} is not available and can not be fetched.")
elif self.indices_overview.loc[index, "installed"]:
elif self.indices_overview[index]["installed"]:
logger.warning(
f"Index {index} already installed and will not be fetched again."
)
else:
response = requests.get(self.indices_overview.loc[index, "url"], timeout=30)
response = requests.get(self.indices_overview[index]["url"], timeout=30)
if response.status_code == 200:
filepath = os.path.join(
idc_index_data.IDC_INDEX_PARQUET_FILEPATH.parents[0],
f"{index}.parquet",
)
with open(filepath, mode="wb") as file:
file.write(response.content)
self.indices_overview.loc[index, "installed"] = True
self.indices_overview.loc[index, "local_path"] = filepath
self.indices_overview[index]["installed"] = True
else:
logger.error(f"Failed to fetch index: {response.status_code}")

Expand Down Expand Up @@ -668,8 +624,8 @@ def _validate_update_manifest_and_get_download_size(
# create a copy of the index
index_df_copy = self.index

# Extract s3 url and crdc_instance_uuid from the manifest copy commands
# Next, extract crdc_instance_uuid from aws_series_url in the index and
# Extract s3 url and crdc_series_uuid from the manifest copy commands
# Next, extract crdc_series_uuid from aws_series_url in the index and
# try to verify if every series in the manifest is present in the index

# TODO: need to remove the assumption that manifest commands will have 'cp'
Expand Down Expand Up @@ -697,8 +653,9 @@ def _validate_update_manifest_and_get_download_size(
seriesInstanceuid,
s3_url,
series_size_MB,
index_crdc_series_uuid==manifest_crdc_series_uuid AS crdc_series_uuid_match,
index_crdc_series_uuid is not NULL as crdc_series_uuid_match,
s3_url==series_aws_url AS s3_url_match,
manifest_temp.manifest_cp_cmd,
CASE
WHEN s3_url==series_aws_url THEN 'aws'
ELSE
Expand All @@ -717,19 +674,23 @@ def _validate_update_manifest_and_get_download_size(

endpoint_to_use = None

if validate_manifest:
# Check if crdc_instance_uuid is found in the index
if not all(merged_df["crdc_series_uuid_match"]):
missing_manifest_cp_cmds = merged_df.loc[
~merged_df["crdc_series_uuid_match"], "manifest_cp_cmd"
]
missing_manifest_cp_cmds_str = f"The following manifest copy commands do not have any associated series in the index: {missing_manifest_cp_cmds.tolist()}"
raise ValueError(missing_manifest_cp_cmds_str)
# Check if any crdc_series_uuid are not found in the index
if not all(merged_df["crdc_series_uuid_match"]):
missing_manifest_cp_cmds = merged_df.loc[
~merged_df["crdc_series_uuid_match"], "manifest_cp_cmd"
]
logger.error(
"The following manifest copy commands are not recognized as referencing any associated series in the index.\n"
"This means either these commands are invalid, or they may correspond to files available in a release of IDC\n"
f"different from {self.get_idc_version()} used in this version of idc-index. The corresponding files will not be downloaded.\n"
)
logger.error("\n" + "\n".join(missing_manifest_cp_cmds.tolist()))

# Check if there are more than one endpoints
if validate_manifest:
# Check if there is more than one endpoint
if len(merged_df["endpoint"].unique()) > 1:
raise ValueError(
"Either GCS bucket path is invalid or manifest has a mix of GCS and AWS urls. If so, please use urls from one provider only"
"Either GCS bucket path is invalid or manifest has a mix of GCS and AWS urls. "
)

if (
Expand Down
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,7 @@ disallow_incomplete_defs = true

[tool.ruff]
src = ["idc_index"]
extend-exclude = ["./CONTRIBUTING.md"]

[tool.ruff.lint]
extend-select = [
Expand Down
Loading

0 comments on commit 1d69823

Please sign in to comment.