Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extension of API {id}/versions and {id}/versions/{versionId} with an optional excludeMetadataBlocks parameter #10778

Merged

Conversation

johannes-darms
Copy link
Contributor

@johannes-darms johannes-darms commented Aug 19, 2024

What this PR does / why we need it:

Extension of API {id}/versions and {id}/versions/{versionId} with an optional excludeMetadataBlocks parameter,
that specifies whether the metadataBlocks should be listed in the output. It defaults to false, preserving backward
compatibility. (Note that for a dataset with a large number of versions and/or metadataBlocks having the metadata blocks
included can dramatically increase the volume of the output).

We have slow response from api/datasets/%s/versions due to the large response body. Most of the information included (metadatablocks) is not needed as we just want to display a dropdown list with all available versions.

Which issue(s) this PR closes:

Suggestions on how to test this: Call the API once with the flag and once without.

Does this PR introduce a user interface change?: No

Is there a release notes update needed for this change?: Maybe, it is a new optional property.

Preview docs at https://dataverse-guide--10778.org.readthedocs.build/en/10778/api/native-api.html#list-versions-of-a-dataset

@coveralls
Copy link

coveralls commented Aug 19, 2024

Coverage Status

coverage: 22.694% (-0.001%) from 22.695%
when pulling bada794 on johannes-darms:feat/10171-versions-smaller-response
into 825ab15 on IQSS:develop.

@pdurbin pdurbin added the Size: 3 A percentage of a sprint. 2.1 hours. label Aug 20, 2024
@pdurbin pdurbin added the Type: Feature a feature request label Oct 9, 2024
@johannes-darms
Copy link
Contributor Author

@GPortas we are experiencing some performance issues with our SPA when a user requests information about dataset versions, particularly those with many versions. Are you experiencing similar problems? We believe that reducing the payload by omitting the metadata would solve our problem. As we can load the metadata with another query if needed. What do you think?

@GPortas
Copy link
Contributor

GPortas commented Oct 11, 2024

@GPortas we are experiencing some performance issues with our SPA when a user requests information about dataset versions, particularly those with many versions. Are you experiencing similar problems? We believe that reducing the payload by omitting the metadata would solve our problem. As we can load the metadata with another query if needed. What do you think?

I'm not sure if we've experienced issues with the metadata blocks, and if we have, they may have been minor, possibly because we don't tend to add complex metadata block configurations in our test datasets.

It's reasonable to think it will improve performance, as additional queries are omitted. This is somewhat similar to what we did with the files, where we added the optional query parameter called excludeFiles.

This makes me wonder if it might be interesting to create a 'reduced information' endpoint instead of continuing to include parameters for excluding properties in the general endpoint.

@pdurbin
Copy link
Member

pdurbin commented Oct 15, 2024

@johannes-darms
Copy link
Contributor Author

@GPortas we are experiencing some performance issues with our SPA when a user requests information about dataset versions, particularly those with many versions. Are you experiencing similar problems? We believe that reducing the payload by omitting the metadata would solve our problem. As we can load the metadata with another query if needed. What do you think?

I'm not sure if we've experienced issues with the metadata blocks, and if we have, they may have been minor, possibly because we don't tend to add complex metadata block configurations in our test datasets.

It's reasonable to think it will improve performance, as additional queries are omitted. This is somewhat similar to what we did with the files, where we added the optional query parameter called excludeFiles.

This makes me wonder if it might be interesting to create a 'reduced information' endpoint instead of continuing to include parameters for excluding properties in the general endpoint.

Our problem is only partly caused by the large complex metadata block, the other cause is the amount of versions (we have a dataset where an update is published every day, the file changes but the metadata is the same). So the payload of this API becomes huge and by omitting the metadata blocks we can mitigate the problem while still getting the necessary information about versions without introducing paging.

I'm not a fan of having different endpoints for more or less the same information. It is more code to maintain and more complicated for the user.

This PR is inspired by the excludeFiles feature and the code is quite similar.

@pdurbin
Copy link
Member

pdurbin commented Oct 16, 2024

@GPortas at some point we should probably test the SPA against a dataset with lots of versions. We should have datasets like this on the performance cluster.

@cmbz cmbz added the FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) label Nov 22, 2024
@pdurbin pdurbin self-assigned this Nov 25, 2024
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johannes-darms overall, looks good. I left some comments. Thanks.

doc/release-notes/10171-exlude-metadatablocks.md Outdated Show resolved Hide resolved
doc/sphinx-guides/source/api/native-api.rst Outdated Show resolved Hide resolved
doc/sphinx-guides/source/api/native-api.rst Outdated Show resolved Hide resolved
@cmbz cmbz added the FY25 Sprint 12 FY25 Sprint 12 (2024-12-04 - 2024-12-18) label Dec 5, 2024
@pdurbin
Copy link
Member

pdurbin commented Dec 16, 2024

@johannes-darms can you please merge the latest from develop? We need this anyway and it will trigger a Jenkins run, which is failing. Also, please consider the suggestions I made in my review. Thanks.

@pdurbin pdurbin added the Status: Needs Input Applied to issues in need of input from someone currently unavailable label Dec 16, 2024
@johannes-darms johannes-darms force-pushed the feat/10171-versions-smaller-response branch from b8d0f3d to 29b30c2 Compare December 18, 2024 13:26
@johannes-darms
Copy link
Contributor Author

@johannes-darms can you please merge the latest from develop? We need this anyway and it will trigger a Jenkins run, which is failing. Also, please consider the suggestions I made in my review. Thanks.

Sorry for the delay. I've merged, adapted the documentation and wrote a simple test. If you need or want more tests I'm happy to write them.

@pdurbin pdurbin removed the Status: Needs Input Applied to issues in need of input from someone currently unavailable label Dec 18, 2024
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johannes-darms thanks for adding tests and addressing all my questions! I didn't run the code myself but tests are passing and code and docs look good. Approved!

@cmbz cmbz added the FY25 Sprint 14 FY25 Sprint 14 (2025-01-02 - 2025-01-15) label Jan 2, 2025
@sekmiller sekmiller self-assigned this Jan 3, 2025
@sekmiller
Copy link
Contributor

@johannes-darms, sorry to do this to you again, but we've recently released version 6.5, so we need you to get the latest from Dev into this branch. Thanks for your patience.

@johannes-darms
Copy link
Contributor Author

@johannes-darms, sorry to do this to you again, but we've recently released version 6.5, so we need you to get the latest from Dev into this branch. Thanks for your patience.

No worries, merged without manual effort :)

@johannes-darms johannes-darms removed their assignment Jan 7, 2025
@johannes-darms
Copy link
Contributor Author

Sadly, the Jenkins build fails and I cannot access the output. Could you forward the information @pdurbin ?

@sekmiller
Copy link
Contributor

@johannes-darms Thanks again for your patience.

@sekmiller sekmiller merged commit 6d6a509 into IQSS:develop Jan 8, 2025
12 checks passed
@pdurbin pdurbin added this to the 6.6 milestone Jan 8, 2025
@pdurbin
Copy link
Member

pdurbin commented Jan 9, 2025

@johannes-darms I know this PR has already been merged but to answer your question, the latest API test run passed: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-10778/8/testReport/

@johannes-darms
Copy link
Contributor Author

@johannes-darms I know this PR has already been merged but to answer your question, the latest API test run passed: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-10778/8/testReport/

Thanks for sharing but I cannot access the jenkins instance.

@pdurbin
Copy link
Member

pdurbin commented Jan 10, 2025

Yes, I know, sorry, that link was a "note to self".

Related:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) FY25 Sprint 12 FY25 Sprint 12 (2024-12-04 - 2024-12-18) FY25 Sprint 14 FY25 Sprint 14 (2025-01-02 - 2025-01-15) Size: 3 A percentage of a sprint. 2.1 hours. Type: Feature a feature request
Projects
Status: Done 🧹
Development

Successfully merging this pull request may close these issues.

Feature Request/Idea: List Versions of a Dataset API reduced response size
6 participants