Edit tool prompt tweaking: only plain-text format is supported #6067
base: main
Conversation
Running evaluation on the PR. Once eval is done, the results will be posted.
The eval pipeline seems broken right now :(
Running evaluation on the PR. Once eval is done, the results will be posted.
@li-boxuan got it working to a degree. There was some issue with posting the final comment. Here it is:
I inspected the evaluation log and I don't think the failure was related to this PR. It's either randomness, or a regression somewhere else.
Yeah, a bit lower. I think 14-15 resolved was the baseline, but as you mentioned it might be randomness. Xingyao would know better for sure.
You may want to run
@mamoodi If they are comparable, this is probably ok.
@li-boxuan and @xingyaoww for main:
Evaluation results ready for t-main: t-main_25-01-07-17-31.tar.gz
Looks like we do have some degradation? 13 -> 10?
Hard to say, maybe we can run a bigger eval?
Seems the noise might have come from All-Hands-AI/openhands-aci#48; discussion on Slack: https://openhands-ai.slack.com/archives/C080M7BBSSG/p1736346851359599
@mamoodi do we have a larger eval? e.g., 100 instances?
@xingyaoww I can run one with the pipeline but it will have to be done manually. Did you want me to go ahead with that? With Sonnet 3.5?
@mamoodi, yeah I think it'd be good to run that! We could click "update branch" and then run it on main and this branch.
Sorry folks, it's been a busy few days. Is this still pending a 100-instance eval?
Yeah, sounds good!
Alright. Sorry about the wait, folks. Here are the results. Eval of 100 instances for main:
Eval of 100 instances for this PR:
The evals are too large to upload here, so I will post them on Slack.
End-user friendly description of the problem this fixes or functionality that this introduces
Tweak edit-related prompts to clarify that edit tools are for plain-text format only.
Give a summary of what the PR does, explaining any non-trivial design decisions
Without this PR:
With this PR:
and I verified the result is indeed in MS-doc format.
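To illustrate the spirit of the change (this is a minimal sketch, not the actual diff; the constant name and exact wording here are hypothetical), the tweak amounts to appending a plain-text-only note to the edit tool's description:

```python
# Hypothetical sketch only: the real tool description and its exact
# wording live in the OpenHands prompt definitions and may differ.
STR_REPLACE_EDITOR_DESCRIPTION = (
    "Custom editing tool for viewing, creating, and editing files.\n"
    # The clarification this PR is about: steer the LLM away from trying
    # to author binary formats (.docx, .xlsx, .pdf) with a text editor.
    "Note: this tool only supports plain-text files. To produce binary "
    "formats such as MS Word documents, write a script that uses a "
    "suitable library (e.g., python-docx) instead."
)
```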
As a follow-up, we could create some micro-agents to remind the LLM of Python libraries for office document formats, such as `python-docx` (a minimal sketch below).
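For example, such a micro-agent could point the LLM at a snippet like this (file name and content are made up for illustration; assumes `pip install python-docx`):

```python
from docx import Document  # pip install python-docx

# Build a Word document programmatically instead of asking the
# plain-text edit tool to emit binary .docx content.
doc = Document()
doc.add_heading("Example Report", level=1)
doc.add_paragraph("This file was generated with python-docx.")
doc.save("example_report.docx")
```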
Link of any specific issues this addresses
This is motivated by a failure I observed when running TheAgentCompany benchmark. Specifically, this one: https://github.com/TheAgentCompany/TheAgentCompany/blob/d456bb0d0c528e3495b6400f156be02695d6731f/workspaces/tasks/ds-answer-numerical-data-question/task.md?plain=1#L9
To run this PR locally, use the following command: