An "agentic" trading agent is just an automation loop: gather data → reason about it (here, with an LLM) → decide → act → review. Backtesting replays that loop over historical data to see how it would have done. It's the cheapest way to kill a bad idea — and the easiest place to fool yourself.
The setup
I wired an LLM from one of the usual platforms to a Python backtest built on a vectorised engine. Each simulated day it received a snapshot of OHLCV data plus a few indicators and was asked for a position decision with a rationale. Nothing exotic: the point was to learn the failure modes, not to find a money printer.
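To make the snapshot-in, decision-out contract concrete, here is a minimal sketch. The field names, the JSON reply format, and the helpers `build_snapshot` and `parse_decision` are illustrative assumptions, not the exact harness described above. The model call itself is left out. The sketch only shows what goes in and how a reply might be validated before it reaches the backtest.

```python
import json

def build_snapshot(bars, sma_window=3):
    """Summarise the latest bar plus a simple moving average of recent closes."""
    closes = [b["close"] for b in bars]
    window = closes[-sma_window:]
    latest = bars[-1]
    return {
        "day": latest["day"],
        "ohlcv": {k: latest[k] for k in ("open", "high", "low", "close", "volume")},
        "sma": sum(window) / len(window),
    }

def parse_decision(reply_text):
    """Validate a JSON reply; fall back to a flat position on anything odd."""
    try:
        reply = json.loads(reply_text)
        position = reply["position"]
        rationale = str(reply["rationale"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"position": 0, "rationale": "unparseable reply"}
    if position not in (-1, 0, 1):
        return {"position": 0, "rationale": "invalid position"}
    return {"position": position, "rationale": rationale}

# Hypothetical bars, for illustration only.
example_bars = [
    {"day": "2024-01-02", "open": 9.5, "high": 10.2, "low": 9.4, "close": 10.0, "volume": 1000},
    {"day": "2024-01-03", "open": 10.0, "high": 11.1, "low": 9.9, "close": 11.0, "volume": 1200},
    {"day": "2024-01-04", "open": 11.0, "high": 12.3, "low": 10.8, "close": 12.0, "volume": 900},
]
snapshot = build_snapshot(example_bars)
decision = parse_decision('{"position": 1, "rationale": "close above SMA"}')
```

Defaulting to flat on a malformed reply is one design choice among several. The important part is that a bad reply never turns into a trade silently.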
Trap 1: look-ahead bias
The single biggest mistake. If any data point the model sees "now" actually includes information from the future — a same-bar close, a restated fundamental, a survivorship-filtered universe — your backtest is fiction. I caught two leaks: using the day's close to decide a trade on that day, and a ticker list that only contained companies that still exist today. Both flattered the results enormously.
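One way to close the same-bar leak is to make the timing explicit: the agent decides on data up to and including day t, and the trade fills at day t+1's open. The sketch below assumes that fill convention. `Bar`, `simulate_fills` and `up_day` are hypothetical names, and the prices are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bar:
    day: str
    open: float
    close: float

def simulate_fills(bars, decide):
    """Decide after day t's close, fill at day t+1's open."""
    fills = []
    for t in range(len(bars) - 1):
        visible = bars[: t + 1]      # history up to and including day t
        position = decide(visible)   # the agent never sees bars[t + 1]
        next_bar = bars[t + 1]
        fills.append((next_bar.day, position, next_bar.open))
    return fills

def up_day(visible):
    """Toy rule standing in for the agent: go long after an up day."""
    last = visible[-1]
    return 1 if last.close > last.open else 0

bars = [
    Bar("2024-01-02", 100.0, 101.0),
    Bar("2024-01-03", 102.0, 103.0),
    Bar("2024-01-04", 103.5, 102.0),
]
fills = simulate_fills(bars, up_day)
# fills == [("2024-01-03", 1, 102.0), ("2024-01-04", 1, 103.5)]
```

The survivorship leak needs a different fix: a point-in-time universe that includes delisted tickers, which no amount of index shifting will provide.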
Trap 2: overfitting to the prompt
With an LLM you don't overfit parameters so much as overfit the prompt. Tweaking wording until the backtest improves is the same curve-fitting sin as optimising a moving-average length on one dataset. The fix is the same: hold out data the prompt never "saw," and use walk-forward testing.
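A walk-forward split can be sketched as a generator of index windows: tune the prompt on one window, evaluate on the next, then roll forward. The function name and the window sizes below are assumptions chosen for illustration.

```python
def walk_forward_windows(n_bars, tune_size, test_size):
    """Yield (tune, test) index ranges that roll forward through time."""
    start = 0
    while start + tune_size + test_size <= n_bars:
        tune = range(start, start + tune_size)
        test = range(start + tune_size, start + tune_size + test_size)
        yield tune, test
        start += test_size  # advance by one test window

windows = list(walk_forward_windows(n_bars=10, tune_size=4, test_size=2))
for tune, test in windows:
    print(f"tune {tune.start}-{tune.stop - 1}, test {test.start}-{test.stop - 1}")
# tune 0-3, test 4-5
# tune 2-5, test 6-7
# tune 4-7, test 8-9
```

Each test window starts strictly after its tuning window ends, and test windows never overlap, so every out-of-sample bar is scored once. The discipline only holds if the prompt is frozen before its test window is looked at.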
Trap 3: costs and slippage
Commissions, spread and slippage quietly eat strategies that look profitable on paper. Adding realistic per-trade costs turned a "promising" result into a flat one. If a strategy only works at zero cost, it doesn't work.
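A simple cost model charges a proportional fee every time the position changes. The sketch below assumes a flat 10 basis points per unit of turnover, covering commission, spread and slippage together. That figure is a placeholder, not a measured cost, and `net_returns` is a hypothetical helper.

```python
def net_returns(asset_returns, positions, cost_per_turnover=0.001):
    """Per-bar strategy returns after a proportional cost on position changes."""
    net = []
    previous = 0.0  # start flat
    for asset_return, position in zip(asset_returns, positions):
        turnover = abs(position - previous)
        net.append(position * asset_return - turnover * cost_per_turnover)
        previous = position
    return net

asset_returns = [0.01, -0.005, 0.002]
positions = [1, 1, 0]  # enter, hold, exit
costed = net_returns(asset_returns, positions)
free = net_returns(asset_returns, positions, cost_per_turnover=0.0)
```

In this toy case the entry and exit each cost 0.001, so the total drops from 0.005 to 0.003. An agent that flips position often pays that toll on every flip, which is how a promising curve goes flat.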
Trap 4: the LLM is non-deterministic
Run the same backtest twice and the agent can make different calls. That's a feature for brainstorming and a problem for evaluation. I pinned the temperature low, which narrows the variation without removing it, logged every decision and its rationale, and ran multiple seeds to look at the distribution of outcomes rather than a single lucky run.
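Looking at a distribution rather than one run can be as simple as the sketch below. `noisy_backtest` is a stand-in that draws random returns so the example runs on its own. In a real harness it would be one full backtest with the model in the loop, and a "seed" may just mean a repeated run if the model API exposes no seed parameter.

```python
import random
import statistics

def noisy_backtest(seed):
    """Stand-in for one full backtest run; returns a total return."""
    rng = random.Random(seed)
    # Hypothetical: 50 decisions, each adding a small random return.
    return sum(rng.gauss(0.0002, 0.01) for _ in range(50))

def summarise_runs(run, seeds):
    """Run the backtest once per seed and describe the spread of outcomes."""
    results = [run(seed) for seed in seeds]
    return {
        "runs": len(results),
        "worst": min(results),
        "median": statistics.median(results),
        "best": max(results),
        "stdev": statistics.stdev(results),
    }

summary = summarise_runs(noisy_backtest, seeds=range(20))
```

If the median is unremarkable and only the best run looks good, the single lucky run was the whole story.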
What actually helped
- Separate data, strategy and execution so each can be tested alone.
- Walk-forward, not single backtest — train/observe on one window, test on the next, roll forward.
- Log the agent's reasoning — it's how you spot hallucinated "signals."
- Paper trade before anything else — see connecting an agent to a broker API.
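Logging the agent's reasoning can be sketched as one JSON line per decision. The record fields and the `prompt_version` tag are assumptions, not the format used in the harness above. An in-memory stream stands in for a log file.

```python
import io
import json
from dataclasses import asdict, dataclass

@dataclass
class Decision:
    day: str
    position: int   # -1 short, 0 flat, 1 long
    rationale: str

def log_decision(stream, decision, prompt_version):
    """Append one decision as a JSON line, tagged with the prompt version."""
    record = {"prompt_version": prompt_version, **asdict(decision)}
    stream.write(json.dumps(record) + "\n")

def load_decisions(stream):
    """Read the log back as a list of dicts for review."""
    return [json.loads(line) for line in stream if line.strip()]

log = io.StringIO()  # swap for an open file in a real run
log_decision(log, Decision("2024-01-03", 1, "close above SMA"), prompt_version="v3")
log_decision(log, Decision("2024-01-04", 0, "earnings tomorrow"), prompt_version="v3")
log.seek(0)
records = load_decisions(log)
```

Tagging each record with the prompt version makes it possible to tell later which wording produced which call. Rationales that cite data missing from the snapshot are the hallucinated "signals" worth searching for.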
The honest takeaway
The LLM was useful for summarising context and generating hypotheses — less so as a standalone signal generator. The value of the exercise wasn't a strategy; it was a reusable, leak-free harness I can trust. Treat a good backtest as a way to reject ideas cheaply, not to predict profit.