
Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation

Project: carlcrm · Score: 9/10

Prompt for your coding agent

# Research Integration: Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation

## Your Mission

Create a new branch and develop a detailed implementation plan for integrating this research idea into the codebase. Do NOT implement yet — focus on understanding, planning, and identifying risks.

## Branch Setup

```bash
git checkout -b experiment/agent-diff-benchmarking-llm-agents-on-enterprise-a
```

## The Research

**Paper**: [Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation](https://arxiv.org/abs/2602.11224)
**PDF**: https://arxiv.org/pdf/2602.11224

**Core Achievement**:
Introduces a state-diff-based evaluation that captures side effects in external APIs, moving beyond simple success/fail metrics to granular state change verification.
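
To make the idea concrete, here is an illustrative sketch of a state diff (the snapshot format and function name are assumptions for illustration, not the paper's actual interface): each snapshot maps a table name to the set of its row hashes, and the diff reports what an action added or removed per table.

```python
# Illustrative sketch only: snapshot format and names are assumptions, not the paper's API.
from typing import Dict, Set

Snapshot = Dict[str, Set[str]]  # table name -> set of row hashes


def state_diff(before: Snapshot, after: Snapshot) -> Dict[str, Dict[str, int]]:
    """Report, per table, how many rows an action added or removed.

    An updated row shows up as one removed hash plus one added hash.
    """
    diff = {}
    for table in set(before) | set(after):
        old, new = before.get(table, set()), after.get(table, set())
        if old != new:
            diff[table] = {"added": len(new - old), "removed": len(old - new)}
    return diff


# A "Create Company" task would pass only if nothing outside the companies table moved:
# assert set(state_diff(before, after)) == {"companies"}
```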

**Why This Matters for carlcrm**:
This is highly relevant to testing a CRM where the side effect (the DB change) is more important than the HTTP status code. It provides a framework for asserting that a 'Create Company' call only affects the 'companies' table.

**Suggested Integration Approach**:
In carlcrm/tests/conftest.py, implement a fixture that takes a snapshot of all table counts and row hashes before a test, then compares them after the test to ensure no unintended data was modified.
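
A minimal sketch of such a fixture, assuming a SQLite file database and a hypothetical `db_path` fixture that returns the path to the test database (all names here are placeholders, not existing carlcrm code):

```python
# carlcrm/tests/conftest.py — sketch; fixture and helper names are assumptions
import hashlib
import sqlite3

import pytest


def _snapshot(db_path):
    """Map each user table to the set of hashes of its rows."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%'")]
        return {
            table: {
                hashlib.sha256(repr(row).encode()).hexdigest()
                for row in conn.execute(f'SELECT * FROM "{table}"')
            }
            for table in tables
        }
    finally:
        conn.close()


@pytest.fixture
def state_guard(db_path):  # `db_path` is a hypothetical fixture yielding the SQLite file path
    """Fail at teardown if any table other than the declared ones changed during the test."""
    before = _snapshot(db_path)
    expected = set()

    def expect(*tables):
        expected.update(tables)

    yield expect
    after = _snapshot(db_path)
    changed = {t for t in set(before) | set(after) if before.get(t) != after.get(t)}
    unexpected = changed - expected
    # pytest reports a failing teardown as an error on the test that used the fixture.
    assert not unexpected, f"Unintended tables modified: {sorted(unexpected)}"
```

In a test, calling `state_guard("companies")` before exercising the endpoint declares that only the companies table may change; any other modification fails the test at teardown.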

**Estimated Effort**: weekend project

## Goal Context

This addresses the following project goal:
> Implement a comprehensive testing infrastructure using pytest and FastAPI TestClient for a SQLite-backed CRM.
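
For orientation, the simplest version of that stack is a pytest module that drives the app through FastAPI's TestClient; the import path and route below are assumptions about carlcrm's layout, not verified against the repo.

```python
# tests/test_smoke.py — sketch; adjust the import path and route to the real app
from fastapi.testclient import TestClient

from carlcrm.main import app  # hypothetical location of the FastAPI instance

client = TestClient(app)


def test_create_company_returns_201():
    # Assumes a POST /companies endpoint and this payload shape; swap in the real ones.
    response = client.post("/companies", json={"name": "Acme"})
    assert response.status_code == 201
```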

## Model Requirements

**Commercial APIs available**: GPT-4, Claude, Gemini, Claude Opus 4, Gemini 3 Pro, Gemini-3-Flash
**Open-source (local GPU)**: Llama, Qwen, DeepSeek, Mistral, Phi

✅ **You can implement this using API calls only** — no local GPU required.

*The paper evaluates agentic performance using commercial models like Claude Opus 4 and Gemini 3 Pro for task generation and Gemini-3-Flash as an evaluation judge. The techniques focus on prompting and code execution against API replicas, which a practitioner can implement using commercial API calls.*

## Your Task: Create the Integration Plan

### Phase 1: Understand the Codebase Context

1. **Identify the integration surface**: Which files/modules would this touch?
2. **Map dependencies**: What existing code would this interact with?
3. **Find similar patterns**: Is there existing code that does something similar we can learn from?

### Phase 2: Design the Integration

Create a detailed plan covering:

1. **Architecture**: How does this fit into the existing system?
2. **Data flow**: What inputs does it need? What outputs does it produce?
3. **Configuration**: What new settings/parameters are needed?
4. **Testing strategy**: How will we validate this works?

### Phase 3: Premortem — What Could Go Wrong?

**Think about this integration failing 2 weeks from now. Why did it fail?**

Consider:
- **Performance**: Could this slow down critical paths?
- **Complexity**: Are we adding too much complexity for the benefit?
- **Maintenance**: Will this be hard to maintain or debug?
- **Dependencies**: Are we adding risky dependencies?
- **Edge cases**: What inputs or states could break this?
- **Rollback**: If this doesn't work, how easily can we revert?

For each risk, note:
- Likelihood (low/medium/high)
- Impact (low/medium/high)
- Mitigation strategy

### Phase 4: Define Success Criteria

Before implementing, define:

1. **Minimum viable test**: What's the simplest way to prove this works?
2. **Quantitative metrics**: What numbers should improve? By how much?
3. **Qualitative checks**: What should "feel" better?
4. **Failure signals**: What would tell us to abandon this approach?

## Output Format

Create a `PLAN.md` file in the repo root with:

```markdown
# Experiment: [Title]

## Summary
[1-2 sentence summary of what we're trying]

## Integration Points
- [ ] File 1: description of changes
- [ ] File 2: description of changes

## Architecture Decision
[Explain the chosen approach and why]

## Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| ... | ... | ... | ... |

## Success Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Open Questions
- Question 1?
- Question 2?

## Next Steps
1. First implementation step
2. Second implementation step
```

## Important Guidelines

- **Read the paper first** — skim the abstract, intro, and methodology sections
- **Don't over-engineer** — start with the simplest version that could work
- **Preserve optionality** — design so we can easily extend or remove this later
- **Document decisions** — future you will thank present you
- **Ask questions** — if something is unclear, note it rather than assuming

---

*This prompt was generated by DSI (Daily Session Intelligence) to help you systematically explore research ideas.*

How to use this prompt

  1. Copy the prompt above (via the Copy Prompt button)
  2. Open your terminal in the carlcrm repo
  3. Start your coding agent (e.g. Claude Code via the `claude` command, Cursor, etc.)
  4. Paste the prompt and let it create the branch + plan
  5. Review the PLAN.md before implementing