Why Backtesting Agents Is Different
Traditional backtesting is straightforward: define rules, run them against historical data, measure results. For AI trading agents, there's an additional challenge: the LLM's decisions are non-deterministic. Ask the same model the same question twice and you may get different answers. This means a backtest of an LLM agent is inherently noisier than a backtest of a mechanical system.
The solution is a two-phase approach: first, backtest the mechanical strategy rules independently (these are deterministic and testable). Second, validate the LLM-driven decision layer through paper trading and simulated live environments where the model processes data sequentially, as it would in production.
Phase 1: Backtest the Strategy Rules
Before involving any LLM, backtest the core strategy your agent will implement. If your agent runs a momentum strategy, backtest the momentum rules themselves: rank by 12-month returns, buy top decile, rebalance monthly. This gives you a baseline: what should the strategy produce under ideal execution?
Tools for Rule-Based Backtesting
- VectorBT: Among the fastest options for Python. Vectorised operations process years of data in seconds. Best for simple to moderate rule-based strategies.
- Backtrader: Event-driven backtesting with realistic order simulation. More accurate for strategies that depend on fill prices, slippage, and commissions. Slower than VectorBT but more realistic.
- Custom pandas: For simple strategies, a pandas DataFrame with signal columns and a cumulative return calculation is often enough. Use Claude or our prompt templates to generate the code.
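As a minimal sketch of the custom pandas approach, the example below builds a signal column and a cumulative return column for a single asset. It is a simplified trend rule (go long when the trailing return is positive), not the cross-sectional top-decile momentum strategy described earlier, and the function name `backtest_signal` and its `lookback` parameter are illustrative assumptions. It ignores commissions and slippage.

```python
import pandas as pd


def backtest_signal(prices: pd.Series, lookback: int = 3) -> pd.DataFrame:
    """Hypothetical helper: long when the trailing return is positive, else flat."""
    df = pd.DataFrame({"price": prices})
    # Period-over-period return of the asset itself.
    df["ret"] = df["price"].pct_change().fillna(0.0)
    # Signal column: 1 when the return over `lookback` periods is positive.
    df["signal"] = (df["price"].pct_change(lookback) > 0).astype(int)
    # Trade on the NEXT bar so the position never uses same-bar information.
    df["position"] = df["signal"].shift(1).fillna(0)
    df["strategy_ret"] = df["position"] * df["ret"]
    # Cumulative growth of 1 unit of capital for strategy and buy-and-hold.
    df["equity"] = (1 + df["strategy_ret"]).cumprod()
    df["buy_hold"] = (1 + df["ret"]).cumprod()
    return df


if __name__ == "__main__":
    closes = pd.Series([100, 101, 102, 103, 104, 105], dtype=float)
    result = backtest_signal(closes, lookback=2)
    print(result[["price", "signal", "position", "equity", "buy_hold"]])
```

The one-bar shift is the important design choice here: without it, the backtest would act on a signal using the same bar's close that generated it, which overstates results.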
Key Metrics to Track
| Metric | What It Tells You | Red Flag |
|---|---|---|
| Total Return | Absolute performance | Below buy-and-hold over same period |
| Sharpe Ratio | Risk-adjusted return | Below 0.5 for a long-only strategy |
| Max Drawdown | Worst peak-to-trough decline | Greater than 30% for most retail traders |
| Win Rate | Percentage of profitable trades | Context-dependent: trend-following strategies often win only 35–40% of trades |
| Profit Factor | Gross profit / Gross loss | Below 1.2 |
| Number of Trades | Whether the sample is large enough to trust | Fewer than 30 trades: results are mostly noise |
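The metrics in the table can be computed with a few lines of standard-library Python. This is a sketch under stated assumptions: returns are per-period simple returns, the Sharpe ratio is annualised with 252 periods per year and a zero risk-free rate by default, and the function names are illustrative rather than taken from any library.

```python
import math
from statistics import mean, stdev


def total_return(equity):
    """Absolute performance: final equity relative to starting equity."""
    return equity[-1] / equity[0] - 1


def max_drawdown(equity):
    """Worst peak-to-trough decline, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sharpe_ratio(returns, periods_per_year=252, risk_free=0.0):
    """Annualised Sharpe ratio from per-period returns (assumed convention)."""
    excess = [r - risk_free / periods_per_year for r in returns]
    sd = stdev(excess)
    if sd == 0:
        return 0.0
    return mean(excess) / sd * math.sqrt(periods_per_year)


def trade_stats(trade_pnls):
    """Return (win rate, profit factor, number of trades) from per-trade P&L."""
    wins = [p for p in trade_pnls if p > 0]
    losses = [-p for p in trade_pnls if p < 0]
    win_rate = len(wins) / len(trade_pnls)
    profit_factor = sum(wins) / sum(losses) if losses else float("inf")
    return win_rate, profit_factor, len(trade_pnls)


if __name__ == "__main__":
    equity_curve = [100, 120, 90, 110, 130]
    print("Total return:", total_return(equity_curve))
    print("Max drawdown:", max_drawdown(equity_curve))
    print("Trade stats:", trade_stats([10, -5, 20, -5]))
```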
Phase 2: Validate the LLM Decision Layer
Once you've confirmed the strategy rules work historically, validate the LLM's ability to implement them correctly:
- Replay historical data through the agent. Feed your agent past market data one day at a time, as if it were live. Record every LLM response and every trading decision. This is expensive (API calls for every historical day) but reveals whether the LLM consistently follows your strategy rules or drifts.
- Compare agent decisions to rule-based signals. For each day, check: did the agent agree with the mechanical strategy? Where it disagreed, was the LLM's reasoning valid or did it hallucinate a justification?
- Paper trade forward. Run the agent on Alpaca paper trading for a minimum of 30 trading days. This is the most realistic test — the agent processes live data, makes real-time decisions, and handles the practical challenges (API latency, market closures, data gaps) that backtests don't capture.
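The replay and comparison steps above can be sketched as follows. The `decide` callable stands in for whatever wraps your LLM call; it is a hypothetical interface, not a real API, and the action labels ("buy", "hold") are assumptions. The key property is that the callable only ever sees bars up to the current day.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Bar = Tuple[str, float]  # (ISO date, closing price); an assumed minimal format


def replay(bars: Sequence[Bar], decide: Callable[[Sequence[Bar]], str]) -> Dict[str, str]:
    """Feed history one day at a time and record each decision by date."""
    decisions: Dict[str, str] = {}
    for i, (date, _close) in enumerate(bars):
        # Slice ends at today, so future bars are never visible to the agent.
        decisions[date] = decide(bars[: i + 1])
    return decisions


def agreement_report(
    rule_signals: Dict[str, str], agent_decisions: Dict[str, str]
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Return the agreement rate and a list of (date, rule, agent) disagreements."""
    dates = sorted(set(rule_signals) & set(agent_decisions))
    disagreements = [
        (d, rule_signals[d], agent_decisions[d])
        for d in dates
        if rule_signals[d] != agent_decisions[d]
    ]
    rate = 1 - len(disagreements) / len(dates) if dates else 0.0
    return rate, disagreements


def simple_rule(history: Sequence[Bar]) -> str:
    """Toy mechanical rule for illustration: buy after an up day, else hold."""
    if len(history) > 1 and history[-1][1] > history[-2][1]:
        return "buy"
    return "hold"


if __name__ == "__main__":
    bars = [("2024-01-02", 100.0), ("2024-01-03", 101.0), ("2024-01-04", 99.0)]
    rule = replay(bars, simple_rule)
    # In practice the second replay would call the LLM-backed agent instead.
    rate, diffs = agreement_report(rule, replay(bars, simple_rule))
    print(rate, diffs)
```

The disagreement list is the part worth reading by hand: each entry is a day where the recorded LLM reasoning should be checked against the rule.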
The Overfitting Trap
Overfitting is one of the most common and costly mistakes in agent backtesting. It happens when you tune parameters (lookback periods, thresholds, LLM prompts) until the backtest looks profitable, but the performance doesn't generalise to new data. Signs you're overfitting:
- You've run 50+ backtest variations and are cherry-picking the best one
- The strategy works spectacularly on one specific period but poorly on others
- You've added more than 3–4 parameters to "improve" results
- Performance degrades sharply when you shift the test window by even a few months
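Defence: Out-of-sample and walk-forward testing. At minimum, split your data chronologically into training (the first 70%) and test (the final 30%) sets. Optimise on the training data, then validate on test data that the optimiser never saw. Walk-forward testing repeats this across rolling windows: optimise on one window, test on the period immediately after it, then roll both windows forward and repeat. If performance holds on the unseen data, the strategy may be real. If it collapses, you've overfit.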
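A rolling version of the train/test split can be sketched as a generator of index windows. The function name and the choice to advance by one test window at a time are assumptions for illustration; other schemes (for example an expanding training window) are equally valid.

```python
from typing import Iterator, Tuple


def walk_forward_windows(
    n_obs: int, train_size: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_indices, test_indices) pairs that roll forward in time."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        # The test window begins immediately after the training window ends.
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        # Advance by one test window so test periods never overlap.
        start += test_size


if __name__ == "__main__":
    for train, test in walk_forward_windows(n_obs=100, train_size=70, test_size=10):
        print(f"train {train.start}-{train.stop - 1}, test {test.start}-{test.stop - 1}")
```

In each window, parameters would be optimised on the `train` indices only and scored on the `test` indices; the out-of-sample results from all windows are then judged together.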
The Validation Ladder
A disciplined validation process follows a progression in which each stage must pass before advancing to the next:
- Backtest mechanical rules on 10+ years of data → Sharpe > 0.5, drawdown acceptable
- Walk-forward validation on unseen data → Performance holds within 20% of training results
- LLM replay test on 6 months of historical data → Agent follows strategy rules 90%+ of the time
- Paper trading for 30+ days → Real-time performance matches backtest expectations
- Small live capital (1–5% of portfolio) for 60+ days → Confirms execution quality and psychological tolerance
- Scale up gradually → Only after sustained validated performance
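The ladder can be encoded as an ordered list of gates so that a script reports which stage still needs to pass. This is a sketch only: the dictionary keys are hypothetical, "within 20% of training results" is interpreted here as out-of-sample Sharpe of at least 80% of the backtest Sharpe, and the qualitative checks (acceptable drawdown, paper results matching expectations) are left out and still need human judgement.

```python
# Ordered gates; thresholds come from the ladder above, key names are assumed.
GATES = [
    ("backtest", lambda r: r["backtest_sharpe"] > 0.5),
    ("walk_forward", lambda r: r["oos_sharpe"] >= 0.8 * r["backtest_sharpe"]),
    ("llm_replay", lambda r: r["rule_adherence"] >= 0.90),
    ("paper_trading", lambda r: r["paper_days"] >= 30),
    ("small_live", lambda r: r["live_days"] >= 60),
]


def first_unpassed_stage(results: dict) -> str:
    """Return the first stage that failed or has no recorded result yet."""
    for name, check in GATES:
        try:
            if not check(results):
                return name
        except KeyError:
            # A missing metric means the stage has not been run.
            return name
    return "scale_up"


if __name__ == "__main__":
    progress = {"backtest_sharpe": 0.9, "oos_sharpe": 0.8, "rule_adherence": 0.85}
    print(first_unpassed_stage(progress))
```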
Disclaimer: This content is for educational purposes only. Backtested results do not guarantee future performance. All trading involves risk. Not financial advice.