[Schema] update dataset identifier description #184

Closed
odscjen opened this issue Aug 10, 2023 · 6 comments · Fixed by #239

odscjen commented Aug 10, 2023

From a suggestion in GFDRR/rdls-spreadsheet-template#3 (comment): update the description of identifier to recommend use of a project ID rather than a URL.

@duncandewhurst

@matamadio can one project generate more than one dataset?

matamadio commented Aug 15, 2023

> @matamadio can one project generate more than one dataset?

Yes, indeed it can. The project number would be used as a general ID to group related datasets, so it's not unique.
Would the same happen using an HTTP URI?

duncandewhurst commented Aug 16, 2023

Related issue: #53

The current description of id specifies that the identifier should be unique:

> A unique identifier for the dataset. Use of an HTTP URI is recommended.

To conform to that description, publishers using an HTTP URI would need to ensure that it uniquely identifies an individual dataset, e.g. http://www.example.com/projects/1/datasets/1, rather than being a URI that relates to many datasets, such as the URI of the web page for a project (e.g. http://www.example.com/projects/1) or a list of datasets (e.g. http://www.example.com/projects/1/datasets).
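To illustrate the uniqueness requirement, here is a minimal Python sketch that flags any id shared by more than one dataset record. The function name find_shared_ids and the record titles are hypothetical, and the URIs are the example.com placeholders from this comment; it only illustrates the check and is not part of any RDLS tooling.

```python
from collections import Counter


def find_shared_ids(datasets):
    """Return the ids that appear on more than one dataset record."""
    counts = Counter(record["id"] for record in datasets)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical records: two datasets reuse the project-level URI as their id.
datasets = [
    {"id": "http://www.example.com/projects/1", "title": "Dataset A"},
    {"id": "http://www.example.com/projects/1", "title": "Dataset B"},
    {"id": "http://www.example.com/projects/1/datasets/3", "title": "Dataset C"},
]

print(find_shared_ids(datasets))
# prints ['http://www.example.com/projects/1']
```

A project-level URI shows up as shared as soon as a second dataset from the same project is described, which is the situation the description of id is meant to rule out.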

I think this discussion points to a need to author some guidance on how to populate id.

I propose adding the following content to https://rdl-standard.readthedocs.io/en/dev/guides/metadata/#how-to-publish-rdls-metadata and updating the description of id to link to it.

@odscjen @matamadio @stufraser1 please let me know what you think. The final paragraph speaks to the case of creating RDLS metadata using the spreadsheet template before a dataset is added to the World Bank's Data Catalog.

How to assign a dataset identifier

You need to assign a unique identifier (id) to each dataset for which you are publishing RDLS metadata. The preferred approach is to use a persistent HTTP URI in accordance with Data on the Web Best Practices §8.7 Data Identifiers.

If you are authoring RDLS metadata for a dataset that is already uniquely identified by a persistent HTTP URI, you ought to set id to the existing HTTP URI for the dataset.

For example, the GHS-BUILT-S R2023A - GHS built-up surface grid, derived from Sentinel2 composite and Landsat, multitemporal (1975-2030) dataset is identified by the following URI in the publisher's data catalog: http://data.europa.eu/89h/9f06f36f-4b11-47ec-abb0-4f8b7b1d72ea. Therefore, in the RDLS metadata describing the dataset, id is set to the existing URI:

{
  "id": "http://data.europa.eu/89h/9f06f36f-4b11-47ec-abb0-4f8b7b1d72ea",
  "title": "GHS-BUILT-S R2023A - GHS built-up surface grid, derived from Sentinel2 composite and Landsat, multitemporal (1975-2030)"
}

If you are authoring RDLS metadata for a dataset that is not already uniquely identified by a persistent HTTP URI, you ought to generate a persistent HTTP URI for the dataset, for example, by adding the dataset to a data catalog that assigns persistent HTTP URIs.

Otherwise, if you cannot generate a persistent HTTP URI for a dataset, for example, because you are authoring RDLS metadata before adding the dataset to a data catalog, you ought to set id to a globally unique identifier of your choice, such as a version 4 UUID. For more information, see how to generate a universally unique identifier.
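The order of preference above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the helper names is_http_uri and assign_dataset_id are hypothetical, and the code can check that a value looks like an HTTP URI but cannot tell whether the URI is persistent or unique to one dataset, which remains the publisher's responsibility.

```python
import uuid
from typing import Optional
from urllib.parse import urlparse


def is_http_uri(value: str) -> bool:
    """Return True if the value parses as an http or https URI with a host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def assign_dataset_id(existing_uri: Optional[str] = None) -> str:
    """Pick an id: reuse the dataset's existing HTTP URI, else fall back to a UUID.

    Hypothetical helper: persistence and uniqueness of the URI are not checked.
    """
    if existing_uri is not None:
        if not is_http_uri(existing_uri):
            raise ValueError(f"not an HTTP URI: {existing_uri!r}")
        return existing_uri
    # No URI available yet (e.g. the dataset is not in a catalog): version 4 UUID.
    return str(uuid.uuid4())


# The dataset already has a URI in the publisher's catalog, so it is reused.
print(assign_dataset_id("http://data.europa.eu/89h/9f06f36f-4b11-47ec-abb0-4f8b7b1d72ea"))

# No URI yet, so a random UUID is generated (the value differs on every run).
print(assign_dataset_id())
```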

How to generate a universally unique identifier

If you are writing your own software or if you prefer to use the command line, several libraries and tools are available to generate universally unique identifiers (UUIDs), for example:

If you prefer to use a graphical user interface, several web-based tools are available, for example Online UUID Generator.
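As one illustration of the programmatic route, Python's standard library uuid module can generate version 4 UUIDs. The sketch below is a minimal example, not a recommendation of a particular tool; the function names new_dataset_uuid and is_uuid4 are hypothetical.

```python
import uuid


def new_dataset_uuid() -> str:
    """Generate a random (version 4) UUID as a string, suitable for use as id."""
    return str(uuid.uuid4())


def is_uuid4(value: str) -> bool:
    """Return True if the string parses as a UUID whose version field is 4."""
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    return parsed.version == 4


dataset_id = new_dataset_uuid()
print(dataset_id)            # a different value on every run
print(is_uuid4(dataset_id))  # prints True
```

From the command line, the same module can be called with python -c "import uuid; print(uuid.uuid4())".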

matamadio commented Aug 16, 2023

If I understand correctly, in the case of DDH-RDL collection that would mean either:

  • first create the dataset entry (draft) and then assign it the (persistent?) id that is created by catalog.
  • generate a random id

@duncandewhurst

> If I understand correctly, in the case of DDH-RDL collection that would mean either:
>
> * first create the dataset entry (draft) and then assign it the (persistent?) id that is created by catalog.
> * generate a random id

Yep, that's correct.

duncandewhurst commented Sep 4, 2023

Closed by #239

Edit: Update link to point to PR rather than issue.
