feat: started working on SWE-bench evals #142
base: master
👍 Looks good to me! Reviewed everything up to 96f1ede in 12 seconds
More details
- Looked at 376 lines of code in 6 files
- Skipped 1 file when reviewing
- Skipped posting 4 drafted comments based on config settings
1. gptme/eval/swebench/utils.py:10
- Draft comment: The import statement for `DownloadMode` is repeated. Remove the duplicate import to clean up the code.
- Reason this comment was not posted: Confidence changes required: 50%
  The import statement for `DownloadMode` is repeated, which is unnecessary and can be removed.
2. gptme/eval/swebench/utils.py:46
- Draft comment: The `current_file` variable is initialized but never used. Consider removing it to clean up the code.
- Reason this comment was not posted: Confidence changes required: 50%
  The `get_file_spans_from_patch` function initializes `current_file` but never uses it, which is unnecessary and can be removed.
3. gptme/eval/swebench/utils.py:74
- Draft comment: Using `os.chdir` to change the working directory can have side effects. Consider using a context manager to temporarily change the directory.
- Reason this comment was not posted: Confidence changes required: 50%
  The `setup_github_repo` function changes the current working directory using `os.chdir`, which can have side effects. It's better to use a context manager to temporarily change the directory.
4. gptme/eval/swebench/main.py:86
- Draft comment: The `write_results` function is called but not defined in the provided code. Ensure that it is implemented or imported correctly.
- Reason this comment was not posted: Confidence changes required: 50%
  The `write_results` function is called but not defined in the provided code. Ensure that it is implemented or imported correctly.
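Draft comment 3 suggests a context manager in place of a bare `os.chdir`. A minimal sketch of that idea follows, assuming a hypothetical helper named `working_directory` that is not part of this PR. On Python 3.11 and later, the standard library's `contextlib.chdir` covers the same need.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Union


@contextmanager
def working_directory(path: Union[str, Path]) -> Iterator[Path]:
    """Temporarily switch the process working directory to `path`.

    Hypothetical helper for illustration; not taken from the PR.
    """
    previous = Path.cwd()
    os.chdir(path)
    try:
        yield Path.cwd()
    finally:
        # Always restore the original directory, even if the body raises.
        os.chdir(previous)
```

Repository setup could then run inside `with working_directory(repo_dir): ...` (where `repo_dir` is a placeholder), so the caller's working directory is unchanged afterwards. Note that the working directory is process-wide, so this still is not safe to use from several threads at once.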
Workflow ID: wflow_QDiWSjoiJJC7dGXD
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
Codecov Report
Attention: Patch coverage is
✅ All tests successful. No failed tests found.
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #142      +/-   ##
==========================================
- Coverage   80.63%   77.15%   -3.49%
==========================================
  Files          52       57       +5
  Lines        3145     3287     +142
==========================================
  Hits         2536     2536
- Misses        609      751     +142
👍 Looks good to me! Incremental review on 4e9b48a in 6 seconds
More details
- Looked at 21 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 1 drafted comment based on config settings
1. gptme/eval/swebench/main.py:4
- Draft comment: The import `EvalResult` is unused and can be removed to clean up the code.
- Reason this comment was not posted: Confidence changes required: 50%
  The import statement for `EvalResult` is not used in the code, which is unnecessary and should be removed to keep the code clean.
Workflow ID: wflow_RhT1myKfhHe3YANu
Anthropic announced that Claude 3.5 Sonnet (new), aka Claude "3.6", scores 49% on SWE-bench Verified with a simple harness: https://www.anthropic.com/research/swe-bench-sonnet
I think optimizing for this particular benchmark might become less and less necessary over time, unless you want to squeeze performance out of smaller models. Would be cool to make a proper run and get listed on the SWE-bench leaderboard, though.
I got it kinda working with swe-agent and this dataset, which contains many more issues: https://huggingface.co/datasets/nebius/SWE-bench-extra
Might also integrate https://swe-rex.com/latest/ which seems pretty useful.
My branch is a giant mess atm though 😭
Implemented with gptme, given moatless-tools and aider as reference implementations.
Important
Introduces SWE-bench evaluation framework in `gptme` with new modules for instance loading, repository setup, and evaluation execution, along with CLI support and updated dependencies.
- `gptme/eval/swebench`:
  - `run_swebench_evaluation()` in `evaluate.py` to evaluate instances using an `Agent`.
  - `main.py` for running evaluations with options for model, dataset, split, instance, and verbosity.
  - `utils.py` provides functions for loading instances, setting up repositories, and extracting file spans from patches.
- `gptme-eval-swebench` script entry in `pyproject.toml`.
- `datasets` and `fsspec` as dependencies in `pyproject.toml`.
This description was created by Ellipsis for 4e9b48a. It will automatically update as commits are pushed.
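For readers unfamiliar with the "file spans from patches" step mentioned above, here is a rough sketch of how such spans could be read from a unified diff. It is not the PR's `get_file_spans_from_patch`. The function name `file_spans_from_patch`, the return shape, and the choice to report spans in pre-patch line numbers are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Optional, Tuple

# Matches a hunk header such as "@@ -10,3 +10,4 @@"; a missing count means 1.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def file_spans_from_patch(patch: str) -> Dict[str, List[Tuple[int, int]]]:
    """Map each patched file to (start, end) line spans in the original file.

    Hypothetical sketch: spans come from hunk headers of a unified diff.
    """
    spans: Dict[str, List[Tuple[int, int]]] = {}
    current_file: Optional[str] = None
    old_left = new_left = 0  # lines still expected in the current hunk body
    for line in patch.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk body: count the line against the header totals so
            # removed lines like "--- text" are not mistaken for file headers.
            if line.startswith("-"):
                old_left -= 1
            elif line.startswith("+"):
                new_left -= 1
            elif not line.startswith("\\"):  # "\ No newline at end of file"
                old_left -= 1
                new_left -= 1
            continue
        if line.startswith("--- "):
            path = line[4:].split("\t")[0].strip()
            if path == "/dev/null":  # newly created file: no original lines
                current_file = None
            else:
                current_file = path[2:] if path.startswith("a/") else path
            continue
        match = HUNK_RE.match(line)
        if match:
            start = int(match.group(1))
            old_left = int(match.group(2) or 1)
            new_left = int(match.group(4) or 1)
            if current_file is not None and old_left > 0:
                spans.setdefault(current_file, []).append(
                    (start, start + old_left - 1)
                )
    return spans
```

Spans like these can be compared against the files and line ranges an agent actually edited, which is one way to check whether it localized the issue correctly before running the full test-based evaluation.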