
Edit tool prompt tweaking: only plain-text format is supported #6067

Open
wants to merge 4 commits into main

Conversation

@li-boxuan (Collaborator) commented Jan 6, 2025

End-user friendly description of the problem this fixes or functionality that this introduces

Tweak edit-related prompts to clarify that edit tools are for plain-text format only.

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

Without this PR:

[Two screenshots, 2025-01-05, showing the behavior without this PR]

With this PR:

[Screenshot, 2025-01-05, showing the behavior with this PR]

I verified that the result is indeed in MS Word (doc) format.

As a follow-up, we could create some micro-agents to remind the LLM of Python libraries for working with Office documents, such as python-docx (a minimal sketch follows below).
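
For reference, this is the kind of snippet such a micro-agent could point the LLM toward; a minimal sketch assuming python-docx is installed (the heading text and file name are illustrative only, not part of this PR):

from docx import Document  # pip install python-docx

doc = Document()                       # start a new, empty Word document
doc.add_heading("Quarterly Summary", level=1)
doc.add_paragraph("Generated with python-docx rather than the plain-text edit tools.")
doc.save("summary.docx")               # writes a real .docx (Office Open XML) file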


Link of any specific issues this addresses

This is motivated by a failure I observed when running TheAgentCompany benchmark. Specifically, this one: https://github.com/TheAgentCompany/TheAgentCompany/blob/d456bb0d0c528e3495b6400f156be02695d6731f/workspaces/tasks/ds-answer-numerical-data-question/task.md?plain=1#L9


To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:a9297a0-nikolaik \
  --name openhands-app-a9297a0 \
  docker.all-hands.dev/all-hands-ai/openhands:a9297a0
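
(For context: mounting /var/run/docker.sock lets the app container start the sandbox runtime container on the host, and -p 3000:3000 exposes the web UI at http://localhost:3000.)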

@li-boxuan li-boxuan added the run-eval-m Runs evaluation with 30 instances label Jan 6, 2025
github-actions bot (Contributor) commented Jan 6, 2025

Running evaluation on the PR. Once eval is done, the results will be posted.

@li-boxuan li-boxuan requested review from xingyaoww and neubig and removed request for xingyaoww January 6, 2025 06:45
@mamoodi mamoodi added run-eval-m Runs evaluation with 30 instances and removed run-eval-m Runs evaluation with 30 instances labels Jan 6, 2025
github-actions bot (Contributor) commented Jan 6, 2025

Running evaluation on the PR. Once eval is done, the results will be posted.

@mamoodi (Collaborator) commented Jan 6, 2025

The eval pipeline seems broken right now :(

@mamoodi mamoodi added run-eval-m Runs evaluation with 30 instances and removed run-eval-m Runs evaluation with 30 instances labels Jan 6, 2025
github-actions bot (Contributor) commented Jan 6, 2025

Running evaluation on the PR. Once eval is done, the results will be posted.

@mamoodi (Collaborator) commented Jan 6, 2025

@li-boxuan I got it working to a degree. There's some issue with posting the final comment. Here it is:
evaluation.zip

@li-boxuan (Collaborator, Author) commented:

Here's the report. It seems lower than expected? I'm not sure what the baseline should be, but I thought it would be near 50%.

SWE-bench Report

This folder contains the evaluation results of SWE-bench using the official evaluation Docker containerization.

Summary

  • submitted instances: 30
  • empty patch instances: 2
  • resolved instances: 10
  • unresolved instances: 20
  • error instances: 0

(The Resolved, Unresolved, Error, Empty Patch, and Incomplete instance lists are collapsed in the report.)

@li-boxuan (Collaborator, Author) commented:

I noticed that astropy__astropy-12907 was passing at some point: https://github.com/swe-bench/experiments/blob/203cc8fa5b8322983445b314da7038b8b6653d5f/evaluation/verified/20241029_OpenHands-CodeAct-2.1-sonnet-20241022/results/results.json#L14C6-L14C28

I inspected the evaluation log and I don't think the failure was related to this PR. It's either randomness, or a regression somewhere else.

@mamoodi (Collaborator) commented Jan 7, 2025

Yeah a bit lower. I think 14-15 resolved was the baseline but as you mentioned it might be randomness. Xingyao would know better for sure.

@enyst (Collaborator) commented Jan 7, 2025

You may want to run compare_outputs.py:

# diff=6
----------------------------------------------------------------------------------------------------
# x resolved but y not=4
               instance_id                                           report_x                                           report_y
0   astropy__astropy-12907  {'empty_generation': False, 'resolved': True, ...  {'empty_generation': False, 'resolved': False,...
14    django__django-11001  {'empty_generation': False, 'resolved': True, ...  {'empty_generation': False, 'resolved': False,...
2     django__django-11039  {'empty_generation': False, 'resolved': True, ...  {'empty_generation': False, 'resolved': False,...
4     django__django-11179  {'empty_generation': False, 'resolved': True, ...  {'empty_generation': False, 'resolved': False,...
----------------------------------------------------------------------------------------------------
# y resolved but x not=2
             instance_id                                           report_x                                           report_y
18  django__django-10924  {'empty_generation': False, 'resolved': False,...  {'empty_generation': False, 'resolved': True, ...
19  django__django-11049  {'empty_generation': False, 'resolved': False,...  {'empty_generation': False, 'resolved': True, ...
----------------------------------------------------------------------------------------------------
Repository comparison (x resolved vs y resolved):

astropy:
Difference: 1 instances!
X resolved but Y failed: (1 instances)
  ['astropy__astropy-12907']
Y resolved but X failed: (0 instances)

django:
Difference: 1 instances!
X resolved but Y failed: (3 instances)
  ['django__django-11001', 'django__django-11039', 'django__django-11179']
Y resolved but X failed: (2 instances)
  ['django__django-10924', 'django__django-11049']
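
For reference, here is a standalone sketch of the same kind of comparison. This is not the repository's compare_outputs.py; a report.json keyed by instance_id with a boolean "resolved" field is an assumption:

import json

def resolved_set(report_path):
    # Load a report keyed by instance_id and keep the IDs marked as resolved.
    with open(report_path) as f:
        report = json.load(f)
    return {iid for iid, entry in report.items() if entry.get("resolved")}

x = resolved_set("eval_x/report.json")  # e.g., this PR
y = resolved_set("eval_y/report.json")  # e.g., main

print("x resolved but y not:", sorted(x - y))
print("y resolved but x not:", sorted(y - x))
print("total diff:", len(x ^ y))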

@xingyaoww (Collaborator) commented Jan 7, 2025

@mamoodi
Can we also run an eval on main to compare with this result? 🤔

If they are comparable, this is probably OK.

@mamoodi (Collaborator) commented Jan 7, 2025

@li-boxuan and @xingyaoww for main:
evaluation-main.zip

@xingyaoww (Collaborator) commented:
Evaluation results ready for t-main: t-main_25-01-07-17-31.tar.gz

Summary

  • submitted instances: 29
  • empty patch instances: 1
  • resolved instances: 13
  • unresolved instances: 16
  • error instances: 0

Looks like we do have some degradation? 13 -> 10?

@neubig (Contributor) commented Jan 8, 2025

Hard to say, maybe we can run a bigger eval?

@li-boxuan (Collaborator, Author) commented:
It seems the noise might have come from All-Hands-AI/openhands-aci#48.

Discussion on Slack: https://openhands-ai.slack.com/archives/C080M7BBSSG/p1736346851359599

@xingyaoww (Collaborator) commented:
@mamoodi do we have a larger eval, e.g., 100 instances?

@mamoodi (Collaborator) commented Jan 9, 2025

@xingyaoww I can run one with the pipeline, but it will have to be done manually. Did you want me to go ahead with that? With Sonnet 3.5?

@neubig (Contributor) commented Jan 10, 2025

@mamoodi, yeah, I think it'd be good to run that! We could click "update branch" and then run it on main and this branch.

@mamoodi (Collaborator) commented Jan 14, 2025

Sorry folks, it's been a busy few days. Is this still pending a 100-instance eval?
I still need to run one on main to figure out what the baseline is. I'm planning to run one on main tonight.
Did you all want to run a 100-instance eval on this PR afterwards? Perhaps tomorrow?

@neubig (Contributor) commented Jan 14, 2025

Yeah, sounds good!

@neubig neubig assigned mamoodi and unassigned neubig Jan 18, 2025
@mamoodi (Collaborator) commented Jan 19, 2025

Alright, sorry about the wait, folks. Here are the results.
I will mention that the main result is from Jan 14, and some changes have gone in since then which may have affected it.
I believe @xingyaoww will know the real values for current main.

Eval 100 instances for main:

Summary

  • submitted instances: 100
  • empty patch instances: 12
  • resolved instances: 37
  • unresolved instances: 63
  • error instances: 0

Eval 100 instances for this PR:

Summary

  • submitted instances: 100
  • empty patch instances: 14
  • resolved instances: 43
  • unresolved instances: 57
  • error instances: 0

The eval outputs are too large to upload here, so I will post them on Slack.
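
(For a quick read: that is a resolve rate of 43/100 = 43% for this PR versus 37/100 = 37% on main in this 100-instance run, with the caveat above that the main snapshot is from Jan 14.)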

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-eval-m Runs evaluation with 30 instances
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants