Retry individual messages/requests when failing with `429`/`Data too large`. #21313
Note: This needs a backport to `6.0` & `6.1`.

Description
Motivation and Context
Before this PR, when OpenSearch nodes ran out of heap space or circuit breakers were tripped during indexing, one of two things could happen:
- `429 Too Many Requests` - The batch was halved and retried. Once the batch size reached zero, retrying was given up.
- `Data too large` error - Failures were treated as permanent errors (like mapping exceptions) and were written to the failure processing collection if available, or simply dropped.

To improve this and avoid potential data loss, this PR changes the behavior as follows:
- For `circuit_breaking_exception`, we now use the retryer that is already used when the target index does not exist or the indexer master has not been discovered yet, retrying indefinitely with an exponential backoff.
- For the `Data too large` exception, we retry the affected messages indefinitely with an exponential backoff as well, just as for blocked indices.

Fixes #21282.
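As a rough illustration of "retry indefinitely with an exponential backoff", here is a minimal, self-contained sketch in plain Java. It is not the retryer this PR actually uses. The class name `BackoffRetry`, the method names, the delay values and the `isRetryable` predicate are all hypothetical and chosen only for this example.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper for illustration only; not the retryer used in this PR.
final class BackoffRetry {
    private BackoffRetry() {}

    // Delay before the next try after the given failed attempt (1-based):
    // the initial delay doubled once per previous attempt, capped at maxMillis.
    static long delayMillis(int attempt, long initialMillis, long maxMillis) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        long delay = initialMillis;
        for (int i = 1; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    // Calls the action until it succeeds. Retryable failures are retried
    // without an attempt limit; anything else is rethrown immediately.
    static <T> T retryIndefinitely(Callable<T> action,
                                   Predicate<Exception> isRetryable,
                                   long initialMillis,
                                   long maxMillis) throws Exception {
        int attempt = 1;
        while (true) {
            try {
                return action.call();
            } catch (Exception e) {
                if (!isRetryable.test(e)) {
                    throw e; // permanent error, e.g. a mapping exception
                }
                Thread.sleep(delayMillis(attempt, initialMillis, maxMillis));
                if (attempt < Integer.MAX_VALUE) {
                    attempt++;
                }
            }
        }
    }
}
```

In this sketch, the caller decides through `isRetryable` which failures count as transient (for example a 429 response or a `Data too large` error), so permanent errors still surface right away instead of being retried forever.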
How Has This Been Tested?
I wrote an integration test that tries to simulate this condition by setting `indices.breaker.total.limit` to a very low limit. The same procedure was used to test it locally as well.
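For anyone wanting to reproduce this locally, one possible way to lower the limit is through the cluster settings API. The PR does not say how the setting was applied, so this is an assumption. The sketch below only builds the HTTP request and does not send it. The class name `BreakerLimit`, the base URL and the value `1%` are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper for illustration only; not part of this PR.
final class BreakerLimit {
    private BreakerLimit() {}

    // JSON body that sets indices.breaker.total.limit as a transient cluster setting.
    static String settingsBody(String limit) {
        return "{\"transient\":{\"indices.breaker.total.limit\":\"" + limit + "\"}}";
    }

    // Builds (but does not send) a PUT request against /_cluster/settings.
    static HttpRequest lowerLimitRequest(String baseUrl, String limit) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(settingsBody(limit)))
                .build();
    }
}
```

The request could then be sent with `java.net.http.HttpClient`, for example against a throwaway test node at `http://localhost:9200`. With such a low limit, subsequent bulk indexing should start tripping the circuit breaker.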