feat(eval): reliability improvement for SWE-Bench eval_infer #6347
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
End-user friendly description of the problem this fixes or functionality that this introduces
Give a summary of what the PR does, explaining any non-trivial design decisions
We didn't handle the AgentRuntime exception directly been throw from inside the eval_infer process_instance function.
This causes some failed instance did not trigger "retry with larger instance", causing the evaluation to get stuck.
Link of any specific issues this addresses
To run this PR locally, use the following command: