feat(eval): reliability improvement for SWE-Bench eval_infer #6347

xingyaoww · 2025-01-18T17:34:16Z

End-user friendly description of the problem this fixes or functionality that this introduces

Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

We didn't handle AgentRuntime exceptions thrown directly from inside the eval_infer process_instance function.

As a result, some failed instances did not trigger "retry with larger instance", causing the evaluation to get stuck.
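
To illustrate the failure mode described above, here is a minimal sketch of catching a runtime exception around a per-instance call and retrying with a larger resource factor. It is not the code in this PR: `AgentRuntimeError`, `run_with_escalation`, and the `resource_factor` parameter are hypothetical names chosen for illustration, and the real retry logic may differ.

```python
class AgentRuntimeError(Exception):
    """Hypothetical stand-in for a runtime failure raised while processing an instance."""


def run_with_escalation(process_instance, instance, resource_factors=(1, 2, 4)):
    """Call process_instance, retrying with the next larger factor on runtime errors.

    process_instance is any callable accepting (instance, resource_factor=...).
    """
    last_error = None
    for factor in resource_factors:
        try:
            return process_instance(instance, resource_factor=factor)
        except AgentRuntimeError as err:
            # Without this handler the exception would propagate past the
            # retry logic and the larger-instance retry would never happen.
            last_error = err
    # Every factor failed: surface a clear error instead of hanging.
    raise RuntimeError(
        f"instance {instance!r} failed at every resource factor"
    ) from last_error
```

The point of the sketch is only that the exception has to be caught at the level where the retry decision is made; otherwise a failed instance is never rescheduled on a larger runtime.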

Link of any specific issues this addresses

To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:30df16b-nikolaik \
  --name openhands-app-30df16b \
  docker.all-hands.dev/all-hands-ai/openhands:30df16b

tofarr

🍰 - Will this lead to more graceful behavior when a runtime fails to start due to an unexpected error?

feat(eval): reliability improvement for SWE-Bench eval_infer

63cb081

xingyaoww requested a review from neubig January 18, 2025 17:34

fix commit

30df16b

xingyaoww marked this pull request as ready for review January 18, 2025 17:36

tofarr reviewed Jan 18, 2025


neubig approved these changes Jan 18, 2025


neubig merged commit 2b04ee2 into main Jan 18, 2025
15 checks passed

neubig deleted the xw/swebench-eval-fix branch January 18, 2025 19:03
