AXIS is moving agent testing past pass or fail. Here is what we have recently shipped and
what we are building next. This page is a living document, so expect it to change as
priorities shift and as the community weighs in.
Help shape it
Have a use case we are missing, or want to weigh in on what comes next? Open an issue or
start a discussion on GitHub. The roadmap below reflects our current thinking, not a
commitment to dates or ordering.
Recently shipped
Shipped
Multi-dimensional evals, not just pass or fail
A green check tells you the agent finished. It does not tell you whether it thrashed,
called the wrong tools, or burned tokens getting there. AXIS scores the quality
of the agent experience across four independent dimensions, so a single run shows you not
only whether the agent succeeded but how well it worked with your system.
What we are building next
These are the capabilities we are actively designing and building. Each one extends AXIS from
scoring a single run toward understanding the full surface your agents touch.
In progress
Context relevance scoring
Was the context the agent pulled in actually relevant to the task? Relevance scoring looks
at the docs, skills, and files an agent loaded and measures how much of that material
mattered, so you can spot context that is padding the window without earning its place.
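To make the idea concrete, here is a minimal sketch of one way a relevance measure could be computed. This feature is still in progress and this is not the AXIS implementation: the `LoadedItem` record, the `used` flag, and the token-weighted ratio are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadedItem:
    """One doc, skill, or file an agent loaded (hypothetical record shape)."""
    name: str
    tokens: int  # how much of the context window the item occupied
    used: bool   # assumed to be known: did the run actually rely on it?


def relevance_ratio(items: list[LoadedItem]) -> float:
    """Fraction of loaded tokens that belonged to items the run relied on."""
    total = sum(item.tokens for item in items)
    if total == 0:
        return 0.0  # nothing was loaded, so there is nothing to score
    relevant = sum(item.tokens for item in items if item.used)
    return relevant / total


def padding(items: list[LoadedItem]) -> list[LoadedItem]:
    """Unused items, largest first: the context padding the window."""
    unused = [item for item in items if not item.used]
    return sorted(unused, key=lambda item: item.tokens, reverse=True)
```

In this sketch, a run that loaded a 300-token doc it used and a 700-token doc it ignored would score 0.3, and `padding` would surface the ignored doc first. Deciding whether an item "mattered" is the hard part, and the sketch deliberately leaves it out.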
In progress
Audits
Point AXIS at your docs and other context sources and get back a review of issues and
misalignment: stale instructions, contradictions, gaps, and guidance that steers agents
the wrong way. Audits catch problems in the material itself, before an agent ever reads it.
Planned
Context heatmaps
Across all of your scenarios and the context they exercise, which parts are actually being
tested and which parts are never touched by anything? Heatmaps show coverage at a glance, so
you can find untested surface area and dead context that no scenario relies on.
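As a rough illustration of the coverage question above, the sketch below counts how many scenarios touch each piece of context and lists the pieces nothing touches. Heatmaps are planned, not shipped, so the input shape (a mapping from scenario name to the context items it exercises) and both function names are hypothetical.

```python
from collections import Counter


def coverage_counts(
    scenarios: dict[str, set[str]], all_context: set[str]
) -> dict[str, int]:
    """Count how many scenarios exercise each known context item."""
    counts: Counter[str] = Counter()
    for touched in scenarios.values():
        # Ignore anything a scenario touched that is not a tracked context item.
        counts.update(touched & all_context)
    # Include untouched items explicitly so they show up with a count of zero.
    return {item: counts[item] for item in sorted(all_context)}


def dead_context(counts: dict[str, int]) -> list[str]:
    """Context items that no scenario relies on."""
    return [item for item, n in counts.items() if n == 0]


# Example data (made up for illustration).
scenarios = {
    "login": {"auth.md", "api.md"},
    "search": {"api.md"},
}
all_context = {"auth.md", "api.md", "legacy.md"}
counts = coverage_counts(scenarios, all_context)
```

With this example data, `api.md` is exercised by two scenarios, `auth.md` by one, and `legacy.md` by none, so `dead_context(counts)` reports `legacy.md` as untested surface area.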
Planned
Interactive workflows
Not every task is fully autonomous. Interactive workflows bring a human into the loop,
letting you pause a run for input, approve a step, or steer the agent mid-task, then score
the collaboration the same way AXIS scores everything else.
Planned
Streamlined CI/CD integrations
Run AXIS where your code already lives. Drop-in integrations for your pipeline let you score
agent experience on every pull request, gate merges on a threshold, and track results over
time, so regressions surface before they ship.
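The gating logic described above can be sketched in a few lines. This is an assumption about how a gate might behave, not an AXIS interface: the `merge_gate` function, its parameters, and the 0-to-1 score scale are invented for illustration.

```python
from typing import Optional


def merge_gate(
    score: float,
    threshold: float,
    baseline: Optional[float] = None,
    tolerance: float = 0.0,
) -> tuple[int, list[str]]:
    """Return (exit_code, reasons); exit code 0 allows the merge, 1 blocks it."""
    reasons: list[str] = []
    if score < threshold:
        reasons.append(f"score {score:.2f} is below threshold {threshold:.2f}")
    # Optional regression check against a previously recorded score.
    if baseline is not None and score < baseline - tolerance:
        reasons.append(f"score {score:.2f} regressed from baseline {baseline:.2f}")
    return (1 if reasons else 0), reasons
```

A pipeline step could pass the returned code to `sys.exit` so that a failing score blocks the pull request, and print the reasons so the author sees why.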
Planned
Sharable reports
Publish a report to a link your whole team can open. Share results with stakeholders, compare
runs side by side, and point to a single source of truth for how your service performs for
agents, with no local setup required.
And more to come
This is just what is on deck. We are actively exploring more ways to measure and improve the
agent experience. Tell us what would help most.