New #1 open-source AI Agent on SWE-bench Verified

Refact.ai is the #1 open-source AI Agent on SWE-bench Verified with a 69.8% score

May 15, 2025
by Sergey Vakhreev
5 min read

Our SWE-bench pipeline is open-source now — check it on GitHub.

Refact.ai Agent achieved 69.8% on SWE-bench Verified — autonomously solving 349 out of 500 tasks. This makes Refact.ai a leading open-source AI programming Agent on SWE-bench and places it among the top ranks on the leaderboard.

SWE-bench Verified is a refined version of the original SWE-bench, featuring 500 real-world GitHub issues, selected manually. It provides a more accurate and consistent way to evaluate how well AI agents can handle practical software engineering tasks.

Key elements that made this possible:

  • Extensive guardrails that step in when the model gets stuck or goes off track
  • debug_script() sub-agent that uses pdb to fix bugs and can modify/create new scripts
  • strategic_planning() tool powered by o3 to rethink and refine fixes when needed

The full pipeline we used for SWE-bench Verified is open-source. You can implement the same components and run the benchmark just like we did — reproducing the Refact.ai Agent's approach and score end-to-end.

Read on to see how the Agent is built for SWE-bench, and how the same ideas power real-world workflows in Refact.ai.

Model setup

  • Orchestration model: Claude-3.7
  • Debug sub-agent — debug_script(): Claude-3.7 + o4-mini
  • Planning tool — strategic_planning(): o3
  • pass@1: Each task is not attempted more than once.
  • Temperature: 0 for every Claude model.

For each SWE-bench Verified problem, Refact.ai Agent made one multi-step run aiming to produce a single, correct final solution. Our main goal was to achieve a maximum score in a single attempt.
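For reference, here is a minimal sketch of how this setup could be captured in a run configuration; the `RunConfig` dataclass and its field names are illustrative, not Refact.ai's actual configuration format.

```python
from dataclasses import dataclass

@dataclass
class RunConfig:
    """Illustrative run configuration mirroring the setup above (names are hypothetical)."""
    orchestration_model: str = "claude-3.7"          # main agent loop
    debug_models: tuple = ("claude-3.7", "o4-mini")  # debug_script(): driver + summarizer
    planning_model: str = "o3"                       # strategic_planning()
    attempts_per_task: int = 1                       # pass@1: one multi-step run per task
    claude_temperature: float = 0.0                  # temperature 0 for every Claude model

config = RunConfig()
```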

Simpler, more effective Agent prompt

We revised the Agent prompt from our SWE-bench Lite run, where we ranked first with a 59.7% score. Back then, the prompt was more complex, and after watching how the Agent behaved, we realized that simpler is better.

The new version is shorter and easier to follow. Since Refact.ai is open-source, you can explore it:

You are a fully autonomous agent for coding tasks.
  Your task is to identify and solve the problem from the given PR by directly changing files in the given project.
  You must follow the strategy, step by step in the given order without skipping.
  **Step 1: Explore the Problem**
    - Use `cat()` to open files. Use `search_symbol_definition()`, `search_symbol_usages()` if you know names of symbols.
    - Use `search_pattern()` for search by pattern, `search_semantic()` for a semantic search.
  **Step 2: Reproduce the Problem using `debug_script()`**
    - Find and run all project's existing tests to ensure the fix won't introduce new problems elsewhere.
    - Write a script that reproduces the issue. Cover as many corner cases as possible.
    - Set up necessary environment (e.g., create required folders or additional files) to run the script.
    - Run the script using `shell("python ...")` to verify that the error occurs and the script is correct.
    - After verifying that the script is correct and reproduces the issue, call `debug_script()` to debug it.
  **Step 3: Make a Plan using `strategic_planning()` and fix the Problem**
    - Open all new files mentioned in `debug_script()` report.
    - Call `strategic_planning()` once to think through and brainstorm the solution.
    - Update project files directly without creating patches and diffs.
  **Step 4: Check and Improve Your Work by running tests**
    - Execute the script that reproduces the original issue.
    - Run project's existing tests again to ensure the fix doesn't introduce new problems elsewhere.
  
  **BEST PRACTICES**
    - You must follow the strategy (explore -> reproduce -> solve -> check), step by step in the given order.
    - Before each step explicitly announce your next actions. Make sure they still align with the strategy.
    - Include your thoughts wrapped in  before any action.
    - %CD_INSTRUCTIONS%

Introducing a debugging sub-agent

When developers run into bugs, they investigate the code to figure out what went wrong. For our SWE-bench Verified run, that role was mimicked by debug_script().

debug_script() is a sub-agent inside Refact.ai that uses pdb to debug, modify, and generate scripts. It helps AI Agent gather key issue details:

  • Which files are affected
  • What actually caused the failure
  • And how it might be fixed.

Under the hood, debug_script() is powered by Claude-3.7, with o4-mini for summarizing debug info. We forced Refact.ai Agent to call this tool at least once — and up to three times — during each task.

In practice, this debugging sub-agent was really helpful for digging into the problem source.
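To make the mechanism concrete, here is a minimal sketch of the idea behind a pdb-driven debugging sub-agent: run the reproduction script under pdb with a scripted set of commands, capture the transcript, and hand it to a summarizing model. This is an illustration only; the actual sub-agent lives in the open-source pipeline, and `summarize_with_model()` stands in for the o4-mini call.

```python
import subprocess

def run_under_pdb(script_path: str, pdb_commands: list[str], timeout: int = 120) -> str:
    """Run a reproduction script under pdb, feed it scripted commands via stdin,
    and return the full transcript (stdout + stderr)."""
    proc = subprocess.run(
        ["python", "-m", "pdb", script_path],
        input="\n".join(pdb_commands) + "\nq\n",  # always quit pdb at the end
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout + proc.stderr

def summarize_with_model(transcript: str) -> str:
    """Placeholder for the summarization step (o4-mini in the setup described above)."""
    return transcript[-2000:]  # illustrative: keep only the tail of the transcript

if __name__ == "__main__":
    # "reproduce_issue.py" is a hypothetical reproduction script written in Step 2.
    transcript = run_under_pdb(
        "reproduce_issue.py",
        ["b reproduce_issue.py:10", "c", "p locals()", "bt"],  # breakpoint, continue, inspect
    )
    print(summarize_with_model(transcript))
```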

Check out an example of its report:
1 - Problem Overview  
Issue summary  
An empty string in the Django `TEMPLATES` 'DIRS' setting can be converted into a root-directory path (`/`), causing the `template_changed()` autoreload hook to always fire (since `/` is in every file's `.parents`).  
Expected behaviour  
Empty or blank directory entries in `TEMPLATES['DIRS']` should be silently ignored (not converted into `Path("/")`) so that the autoreloader only triggers when actual template directories change.

2 - Project's Files & Symbols Observed  
File                                                          | Key symbols                                     | Purpose / responsibility  
------------------------------------------------------------- | ----------------------------------------------- | ------------------------------------------  
/testbed/reproduce_autoreload_issue.py                         | empty_string, empty_path, template_changed, to_path, Path, test_file, get_template_directories | Test script to reproduce and inspect how empty strings in template dirs are handled  
…/django/utils/_os.py                                        | to_path(value)                                  | Utility that turns arbitrary values into `pathlib.Path`  
…/python3.X/pathlib.py                                        | Path, Path.__new__, _from_parts, _parse_args, parse_parts, absolute(), resolve(), parents | Standard‐library path‐handling logic, shows how `Path("")` → `"."` → cwd  
…/django/template/autoreload.py                               | template_changed(sender, file_path), get_template_directories() | Hooks into file‐change events to decide when to reload templates  

3 - Debugging Timeline  
1. Invoked the reproduction script under `pdb`.  
2. Stepped to `empty_path = to_path("")`. Confirmed `to_path` simply does `Path("")`.  
3. Stepped through `pathlib.Path` internals:  
 - `Path("")` is parsed as an empty part and becomes `Path(".")` (the current directory).  
 - `empty_path.absolute()` is therefore the absolute path of the working directory (e.g. `/testbed`).  
4. Checked membership:  
 - `/testbed` was **not** in `/tmp/test_file.py`.parents → no reload.  
5. Checked with true root Dir (`Path("/")`):  
 - `/` **is** in every file's `.parents` → reload always fires.  
 - Confirmed that the reported issue only occurs when the working directory equals `/`.  
6. Inspected Django's `template_changed()` hook:  
 ```python
 for template_dir in get_template_directories():
     if template_dir in file_path.parents:
         reset_loaders()
         return True
 ```
7. Examined `get_template_directories()`: it takes each entry of `TEMPLATES['DIRS']` (including empty strings), runs it through `to_path()`, and collects `Path` objects without filtering out empties.
8. Simulated a typical settings entry `DIRS = ['', '/some/path']`:
    • `Path('')` → cwd
    • If the Django server is launched from `/`, cwd=`/` → template_dirs includes `/` → triggers reload on all file changes.
9. Root cause: an empty string ends up as the process's cwd; if that cwd is `/`, it is treated as a template directory and thus matches every changed file.

4 - Lessons & Recommendations

Pitfalls / anti-patterns

• Converting unchecked user input (blank strings) directly to Path without validation.

• Relying on current working directory semantics leads to environment‐dependent bugs.

Codebase insights

• get_template_directories() does not sanitize or skip empty/None entries.

• No existing test covers the case of blank or empty template dir entries.

Suggested fix / next steps

• In get_template_directories(), filter out empty or purely whitespace strings before conversion:

raw_dirs = settings.TEMPLATES[…]['DIRS']
cleaned = [d for d in raw_dirs if isinstance(d, str) and d.strip()]
template_dirs = [to_path(d) for d in cleaned]

• Add a unit test for TEMPLATES['DIRS'] = ["", "/valid"] to ensure blank entries are ignored.

• Document that blank entries are dropped and encourage users to validate environment variables before splitting.

Guardrails to keep AI Agent on track

The more we tried to constrain the model, the more it resisted. Since the goal was to solve each task in one go, we needed ways to make the Agent more reliable.

We added automatic guardrails that kick in when the model gets stuck or makes mistakes. Essentially, these are helper messages, inserted into the chat mid-run as if from a simulated “user” — to nudge Agent back on track. It’s all automated: the script runs static checks on the main model’s (Claude-3.7) messages, and if it detects signs that something is going off track, it sends a message into the chat to help guide the model back in the right direction. These small actions make a big difference in stability.

Extra prompts after sub-agent calls:

After debug_script:

💿 Open all visited files using `cat(file1,file2,file3,…)`!

After strategic_planning():

💿 Now implement the solution above.

Reminders:

- Do not create documents, README.md, or other files which are non-related to fixing the problem.
- Convert generated changes into the `update_textdoc()` or `create_textdoc()` tool calls. Do not create patches (in diff format) or monkey-patches!
- Change the project directly to fix the issue but do not modify existing tests.
- Find and run all project's existing tests to ensure the fix won't introduce new problems elsewhere.
- Create new test files only using `create_textdoc()`.

Guardrails for Agent flow:

💿 Use `debug_script()` instead of `shell()`; dig deeper than previous attempts and set breakpoints inside the project.
💿 Do not call `debug_script()` more than three times.
💿 Call `strategic_planning()` before modifying the project.
💿 If you struggle to find the correct solution, consider using `debug_script()` or `strategic_planning()`.
💿 You cannot call {a_tool_name} while on the previous step — follow the strategy.
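A minimal sketch of how such a check-and-inject guardrail could be wired: after each assistant turn, a static check inspects the chat and returns helper "user" messages to append. The message shape and the specific checks here are assumptions for illustration; the real checks live in the open-source pipeline.

```python
def guardrail_messages(chat: list[dict]) -> list[str]:
    """Inspect the chat so far and return helper 'user' messages to inject, if any.
    Each message in `chat` is assumed to look like {"role": ..., "content": ..., "tool": ...}."""
    reminders = []
    tool_calls = [m.get("tool") for m in chat if m.get("role") == "assistant"]

    # The agent should debug with the dedicated sub-agent, not ad-hoc shell sessions.
    if "shell" in tool_calls and "debug_script" not in tool_calls:
        reminders.append("💿 Use `debug_script()` instead of `shell()`; dig deeper than "
                         "previous attempts and set breakpoints inside the project.")

    # Cap the number of debug_script() calls.
    if tool_calls.count("debug_script") > 3:
        reminders.append("💿 Do not call `debug_script()` more than three times.")

    # Require a plan before edits.
    if "update_textdoc" in tool_calls and "strategic_planning" not in tool_calls:
        reminders.append("💿 Call `strategic_planning()` before modifying the project.")

    return reminders

# Usage: after every assistant turn, append the returned reminders to the chat
# as messages with role "user", then continue the run.
```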

Strategic planning

The strategic_planning() tool comes in at Step 3 of the Agent prompt. It helps the model improve solution quality by reflecting on what went wrong — and what could be done better — based on the debug_script() report. It uses o3-powered reasoning to produce a plan; the Agent then updates project files directly, without generating patches or diffs.

For this tool, we enforce one call per task.

Since the observation layer (search + pdb debugging) was already quite efficient, strategic planning sometimes lagged behind. We tried both o4-mini and o3 and found no obvious differences on a small subset of tasks. That said, both models were prone to overcomplicating simple tasks or failing to identify the real root cause. Claude 3.7 might be a good candidate as a planning model in the future, given how well it did in other parts of the workflow.
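As a hedged sketch of how a single planning call could be routed to a separate reasoning model: the orchestrator gathers the issue, the debug report, and the opened files, makes one call to o3, and treats the returned plan as the solution to implement. The `call_model()` helper and the message layout are assumptions, not the actual Refact.ai implementation.

```python
def call_model(model: str, messages: list[dict]) -> str:
    """Placeholder for a chat-completion call to the given model (o3 here)."""
    raise NotImplementedError

def strategic_planning(issue: str, debug_report: str, opened_files: dict[str, str]) -> str:
    """One-shot planning step: reflect on the debug findings and propose a concrete fix.
    The orchestrator enforces at most one call per task."""
    context = "\n\n".join(f"### {path}\n{text}" for path, text in opened_files.items())
    messages = [
        {"role": "system", "content": "Propose a minimal, correct fix. Do not output diffs; "
                                      "describe exact edits to make directly in the files."},
        {"role": "user", "content": f"Issue:\n{issue}\n\nDebug report:\n{debug_report}\n\n"
                                    f"Relevant files:\n{context}"},
    ]
    return call_model("o3", messages)
```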

Improvements over the SWE-bench Lite strategy

A 59.7% score on SWE-bench Lite was a solid start. We shared the full technical breakdown in our earlier blog post — but even with a SOTA result, that run exposed a few weak spots.

Before tackling SWE-bench Verified, we prioritized addressing the issues we found.

Tools-related updates:

  1. Fixed a few tool-related issues, making the tools more tolerant of the model’s uncertainty when calling them.
  2. Renamed tools — the model often skipped some tools because their names were unclear. New names:
definition()    -> search_symbol_definition()
references()    -> search_symbol_usages()
regex_search()  -> search_pattern()
search()        -> search_semantic()
deep_analysis() -> strategic_planning()
  3. Fixed the AST mechanisms inside refact-lsp that prevented decorated symbols from being parsed.
  4. Resolved an issue where the Agent didn't wait for ast/vecdb to finish indexing the project.
  5. We now mark line numbers in file contents, which adds extra stability with retrieval tools like cat(), search, etc. (see the sketch after this list).
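A minimal sketch of the line-numbering idea from the last item: prefix each line of retrieved file content with its number so the model can reference exact locations reliably. The `numbered()` helper is hypothetical, not Refact.ai's actual implementation.

```python
def numbered(text: str, start: int = 1) -> str:
    """Prefix each line with its 1-based line number, the way file contents
    can be presented to the model by retrieval tools like cat()."""
    width = len(str(start + text.count("\n")))
    return "\n".join(f"{i:>{width}} | {line}"
                     for i, line in enumerate(text.splitlines(), start))

# Example: what a retrieval tool might return for a small file.
print(numbered("def to_path(value):\n    return Path(value)\n"))
```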

Context-related updates:

  1. Reduced the strength of chat compression. Claude 3.7 often tried to cat files already in context; instead of blocking it (which caused loops), we now allow the model to receive them again.
  2. Encouraged the model to open whole files instead of many tiny cat calls to read a file line by line.
  3. When the model opens large files, its accuracy noticeably degrades as the context grows. We continue to adjust this balance.

All of these improvements were in place for the SWE-bench Verified run.

What we tried that did not work

Not every experiment made it to production. Here's what we tested — and what we ended up shipping instead:

Didn't work | What works instead
----------- | ------------------
A separate critique tool that allowed the model to assess its own changes. | Turns out, the model does better when it just runs tests and decides the next steps based on results.
A complex strategic_planning() tool flow with four steps: root-cause analysis → initial solution → critique → refined solution. It overcomplicated simple tasks and lowered success rates. | Now, strategic_planning() only generates a solution — and this works better.
Using a pdb() tool without a dedicated sub-agent. The Claude model preferred shell() over pdb(), so debugging rarely happened. | Introducing the debug_script() sub-agent made it reliable.
Running without sub-agents. As context grew, Claude 3.7 quickly became less accurate and stopped following instructions. | Letting sub-agents do their job.

From benchmark to real product

What makes Refact.ai stand out isn’t just the % of solved benchmark tasks — it’s how our AI Agent gets there. Our goal isn’t to win all leaderboards just for the sake of it, but to build an approach that actually works for real-world programming.

That’s why SWE-bench Verified is also a way to test and improve the actual engineering flow of our product. Many of the updates we made for the run (see: Tools-related updates, Context-related updates) are already shipping in Refact.ai.

The guard mechanisms are another example: in the product, the AI Agent already sends itself these helper messages automatically after calling certain tools. With debug_script(), for instance, it gets the tool output along with a static instruction to open all the related files mentioned. So these guard mechanisms are already part of specific flows, and we're planning more, including chat-wide checks to spot off-track behavior earlier and react to it.

We're also updating the AI Agent prompt used in Refact.ai for VS Code and JetBrains to improve product efficiency for our users. Notably, strategic_planning() isn't (and won't be) called by default in the plugin — it's heavy on coins spent and not always necessary, since the main model is often enough to solve the task. That said, if you think your task needs deeper reasoning, you can still call it manually in chat with @. Just keep in mind it's coin-expensive.

Refact.ai Agent solved SWE-bench Verified fully autonomously — but in real-world use, of course, developers often want more control. That's why Refact.ai offers flexible interaction with manual overrides: you can delegate tasks to the AI Agent while previewing and guiding the process.

That reflects our philosophy: autonomous AI Agent for programming you can trust — and control when you need to.

Final score

Out of 500 tasks in SWE-bench Verified:

  • 🥇 Solved: 349 (69.8% resolve rate)
  • Not solved: 151 (30.2%).

Evaluation results

Total instances | Solved | Not solved | Solved (%) | Not solved (%)
--------------- | ------ | ---------- | ---------- | --------------
500             | 349    | 151        | 69.8%      | 30.2%

Resolved by Repository:

  • astropy/astropy: 9/22 (40.91%)
  • django/django: 165/231 (71.43%)
  • matplotlib/matplotlib: 20/34 (58.82%)
  • mwaskom/seaborn: 0/2 (0.0%)
  • pallets/flask: 1/1 (100.0%)
  • psf/requests: 6/8 (75.0%)
  • pydata/xarray: 18/22 (81.82%)
  • pylint-dev/pylint: 4/10 (40.0%)
  • pytest-dev/pytest: 16/19 (84.21%)
  • scikit-learn/scikit-learn: 28/32 (87.5%)
  • sphinx-doc/sphinx: 28/44 (63.64%)
  • sympy/sympy: 54/75 (72.0%)

Resolved by Time:

  • 2013: 3/3 (100.0%)
  • 2014: 2/2 (100.0%)
  • 2015: 0/1 (0.0%)
  • 2016: 2/2 (100.0%)
  • 2017: 13/16 (81.25%)
  • 2018: 14/24 (58.33%)
  • 2019: 73/98 (74.49%)
  • 2020: 79/108 (73.15%)
  • 2021: 54/86 (62.79%)
  • 2022: 71/102 (69.61%)
  • 2023: 38/58 (65.52%)

Get Refact.ai Agent for your IDE

Refact.ai is an autonomous AI Agent that automates programming tasks — helping developers and IT teams move faster.

With Refact.ai in your IDE, you get:

  • Real automation that boosts productivity by 10x
  • Seamless integration with your codebase, workflow, and dev tools
  • A digital twin that handles your busywork and lets you focus on big things.

Available to everyone: install Refact.ai for VS Code or JetBrains today and feel the real impact in your everyday programming.
