Roadmap

AXIS is moving agent testing past pass or fail. Here is what we have recently shipped and what we are building next. This page is a living document, so expect it to change as priorities shift and as the community weighs in.

Help shape it

Have a use case we are missing, or want to weigh in on what comes next? Open an issue or start a discussion on GitHub. The roadmap below reflects our current thinking, not a commitment to dates or ordering.

Recently shipped

Shipped

Multi-dimensional evals, not just pass or fail

A green check tells you the agent finished. It does not tell you whether it thrashed, called the wrong tools, or burned tokens getting there. AXIS scores the quality of the agent experience across four independent dimensions, so a single run shows you not only whether the agent succeeded but how well it worked with your system.

Goal Achievement, Environment, Service, Agent

Read about the scoring framework →

Next up

These are the capabilities we are actively designing and building. Each one extends AXIS from scoring a single run toward understanding the full surface your agents touch.

In progress

Context relevance scoring

Was the context the agent pulled in actually relevant to the task? Relevance scoring looks at the docs, skills, and files an agent loaded and measures how much of that material mattered, so you can spot context that is padding the window without earning its place.

In progress

Audits

Point AXIS at your docs and other context sources and get back a review of issues and misalignment: stale instructions, contradictions, gaps, and guidance that steers agents the wrong way. Audits catch problems in the material itself, before an agent ever reads it.

Planned

Context heatmaps

Across all of your scenarios and the context they exercise, which parts are actually being tested and which parts are never touched by anything? Heatmaps show coverage at a glance, so you can find untested surface area and dead context that no scenario relies on.
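To make the coverage idea concrete, here is a minimal sketch in Python. It is not an AXIS API: the function names, the scenario-to-context mapping, and the file paths are all hypothetical, and it assumes you already know which context items each scenario loads.

```python
def context_coverage(scenarios, context_items):
    """Count how many scenarios exercise each known context item.

    `scenarios` maps a scenario name to the context items it loads.
    Both the shape of this mapping and the item names are assumptions
    made for illustration only.
    """
    counts = {item: 0 for item in context_items}
    for used in scenarios.values():
        # De-duplicate so one scenario counts an item at most once.
        for item in set(used):
            if item in counts:
                counts[item] += 1
    return counts


def untested(coverage):
    """Return the context items that no scenario touches, sorted by name."""
    return sorted(item for item, hits in coverage.items() if hits == 0)
```

Items with a count of zero are the dead context described above, and the counts themselves are what a heatmap would color by.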

Planned

Interactive workflows

Not every task is fully autonomous. Interactive workflows bring a human into the loop, letting you pause a run for input, approve a step, or steer the agent mid-task, then score the collaboration the same way AXIS scores everything else.

Planned

Streamlined CI/CD integrations

Run AXIS where your code already lives. Drop-in integrations for your pipeline let you score agent experience on every pull request, gate merges on a threshold, and track results over time, so regressions surface before they ship.
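As a rough illustration of gating on a threshold, the sketch below checks a set of per-dimension scores against a minimum and returns a process exit code a pipeline step could use. Everything here is an assumption for illustration: the function names, the 0 to 1 score scale, and the idea that scores arrive as a plain dictionary are hypothetical, not the planned integration.

```python
def failing_dimensions(scores, threshold):
    """Return the names of dimensions scoring below the threshold, sorted."""
    return sorted(name for name, value in scores.items() if value < threshold)


def gate(scores, threshold):
    """Return (exit_code, failing) where exit_code 0 means the gate passes.

    `scores` is a hypothetical mapping of dimension name to a score
    on an assumed 0 to 1 scale.
    """
    failing = failing_dimensions(scores, threshold)
    return (1 if failing else 0), failing


# Hypothetical scores for one run, keyed by the four dimensions.
example_scores = {
    "goal_achievement": 0.9,
    "environment": 0.8,
    "service": 0.6,
    "agent": 0.75,
}
exit_code, failing = gate(example_scores, threshold=0.7)
```

A CI step would pass `exit_code` to `sys.exit` so that a non-zero value blocks the merge, and would print `failing` so the pull request shows which dimensions regressed.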

Planned

Shareable reports

Publish a report to a link your whole team can open. Share results with stakeholders, compare runs side by side, and point to a single source of truth for how your service performs for agents, with no local setup required.

And more to come

This is just what is on deck. We are actively exploring more ways to measure and improve the agent experience. Tell us what would help most.

Share an idea on GitHub →