Retrace put a VCR inside the agent crash report (2026)

The debugger now needs a witness protection program.

Retrace launched on Product Hunt's July 2, 2026 daily board as a developer tool for debugging AI agents, ranked 16th on that page and described as a way to replay and fork agent runs. 1 Its own Product Hunt page says the product records, replays, forks, and shares AI-agent executions, showing every LLM call, tool invocation, and error, with a free tier for 1,000 traces per month. 2

That is a useful product idea. It is also a little funny. The AI industry spent two years stuffing probabilistic interns into workflows, then discovered it needed a flight recorder for the intern's hallucinated tool use.

The pitch: replay the crime scene

Retrace calls itself "CI for AI agent behavior" and says it records every model and tool call so a developer can re-run a failure, branch from the broken step, and verify a fix before shipping. 3 The core loop is: add one decorator, capture the run, fork from the failed span, change the prompt, tool input, or model, then re-run the fork and get a verdict. 3

The mechanics are clear enough:

What Retrace sells	Why developers will want it	The part with teeth
Trace recording for LLM calls, tool calls, errors, token usage, costs, duration, input/output, and metadata. 4	Agent failures are hard to reproduce from ordinary logs.	The trace is not a harmless status line. It is the run's full nervous system.
Auto-instrumentation for OpenAI, Anthropic, and Google Gemini through the SDK. 5	One decorator is much easier than building a custom observability stack.	Easy instrumentation is also easy oversharing if the agent touches customer data.
Forking from any step, editing the input, and replaying from that branch point. 6	This is a real debugging primitive, closer to `git bisect` than another pretty dashboard.	The replay after the fork still depends on a live model path, so determinism is borrowed, not owned.
CI gates that score runs with an LLM judge and fail a build below a threshold. 7	Teams want agent regressions to break a pull request before users see them.	Now your deploy gate may depend on another model judging the first model's homework.

The strongest thing about Retrace is that it refuses to pretend agent bugs are normal software bugs. A flaky multi-step agent can fail because the model drifted, the prompt changed, the tool returned something weird, the memory had poison in it, or the agent took the scenic route through nonsense. A timeline with span-level replay is genuinely better than staring at one final wrong answer and muttering at the logs.

The catch: the recorder eats the whole conversation

The docs say Retrace captures full request and response data for LLM calls, full input/output fields, token usage, cost, duration, errors, model names, and custom metadata. 4 The recording guide is blunter: "Input and output are stored as-is" and warns users to avoid passing sensitive PII in LLM messages if they share tapes publicly. 5

The privacy policy fills in the shape of that bargain. Retrace says trace data can include function inputs and outputs, LLM prompts and responses, tool call parameters and results, error messages and stack traces, timing data, token counts, cost calculations, model names, and provider information. 8 It also says the original text is stored, not only embeddings, so the product can display, search, and replay traces. 8

That is the product working as designed. It is still the exact part a security review will circle in red.

Retrace says relevant trace content may be transmitted to a third-party AI provider for detection, assistant, replay analysis, and related features, using an API tier configured not to train on customer inputs. 8 It says staff do not read trace content except for requested support or abuse, security, or legal investigations. 8 Fine. But the thing being stored and processed is still the thing your agent saw, said, called, broke, and possibly leaked.

This is where the marketing phrase 「execution replay engine」 stops sounding like developer candy and starts sounding like compliance furniture. If your agent only touches synthetic demos, no problem. If it touches support tickets, customer records, invoices, health workflows, CRM fields, internal code, or partner APIs, the debugging layer becomes another system of record.

The pricing staircase is also the retention staircase

Retrace's free tier includes 1,000 traces per month, 7-day retention, one user, and fork-and-replay as a $5 per month add-on. 3 Starter is $29 per month with 10,000 traces, 30-day retention, 100 fork replays, and 25 prove-the-fix runs. 3 Pro is $99 per month with 50,000 traces, 90-day retention, CI regression gates, and multi-agent detectors; Teams is $399 per month with 500,000 traces, 365-day retention, and up to 10 users. 3

That is normal SaaS packaging. It is also a reminder that the value of the product rises as you keep more agent history around. The same archive that helps you find regressions is the archive that widens your review surface. The higher tier does not only buy capacity. It buys a longer memory of everything your agents did wrong.

Retrace does try to answer the obvious key-handling question. The site says eval gates and server-side replays can run on a user's own model account, starting with Google/Gemini keys powering eval gates and replays, while OpenAI and Anthropic keys are validated and stored with native replay marked as coming. 3 It says those keys are validated on save, encrypted at rest with AES-256-GCM, shown only as the last four characters, and never returned again. 3

Good. Also: congratulations, your agent debugger now has provider credentials, trace history, CI authority, and enough context to become a very interesting target.

The honest flaw is hiding in the forum thread

Retrace's own Product Hunt discussion asks the right uncomfortable question: when a replay diverges, is that a real regression or just provider non-determinism? The post says steps before a fork come from the recording, while everything after the fork runs live against the model, so two runs of the same input rarely match exactly even when nothing broke. 9 It says Retrace currently shows a first-divergence diff and a verdict of improved, regressed, or unchanged. 9

That thread is more useful than half the homepage. It admits the product is not rewinding a deterministic machine. It is replaying part of a recorded mess, then asking a live model to walk forward from the branch. That can still be useful. It just means the verdict is not a unit test in a lab coat. It is a controlled re-run with a model-shaped wobble inside it.

For agent teams, this is probably the right compromise. Perfect determinism is a fantasy unless you snapshot the model, tools, retrieval corpus, memory, environment, credentials, and every upstream side effect. Retrace is selling a practical version: preserve enough context to inspect and fork the failure. The roast is that the practical version has to swallow a lot of private context to be practical.

Verdict

Retrace is worth trying if your AI agents already do real work and occasionally return from the woods holding a broken tool call. The product addresses a real pain: agent failures need replay, diffing, forked experiments, and CI gates, not another dashboard pretending one error message tells the story. But do not wire it into production before deciding what gets redacted, who can publish tapes, how long traces stay alive, which provider sees replay content, and whether a model-judged gate is allowed to block a deploy. Retrace does not kill the black box. It installs a rewind button, a search bar, and a subscription meter inside it.

Retrace put a VCR inside the agent crash report

The pitch: replay the crime scene

The catch: the recorder eats the whole conversation

The pricing staircase is also the retention staircase

The honest flaw is hiding in the forum thread

Verdict

참고 출처

이 채널의 다른 콘텐츠

관련 콘텐츠