Notes on designing a protocol-native agent eval substrate
Why the wire protocol is the right place to draw the eval boundary
Most agent-eval harnesses I’ve looked at are coupled to a specific framework or vendor SDK. This post is a design sketch of an alternative: treating the wire protocol as the eval boundary and letting the agent’s implementation stay a black box. It is a design document, not a release announcement; the status note at the end is honest about what exists and what doesn’t.
The problem
Evaluating an agent is a different problem than evaluating a model. Model evals run against a fixed prompt/response interface. Agent evals have to account for tool use, multi-turn state, streaming partial results, background tasks that produce deferred callbacks, and the agent’s own internal orchestration. Framework-specific harnesses (tied to LangChain, or LlamaIndex, or a specific vendor SDK) have to be rewritten whenever the agent’s underlying framework changes, and they cannot be used to compare two agents written in different stacks.
The fix, if one wants a framework-independent eval, is to move the eval boundary to the wire protocol rather than to the agent’s internal API. A harness built on that premise would talk to the agent the same way any other caller does: via A2A, across JSON-RPC, REST, WebSocket, or gRPC. The agent’s internal implementation stays a black box. Whatever it does to call tools, talk to other agents, or call a model, happens inside that box and does not leak into the eval code.
What such a harness would look like
A protocol-native harness designed on this substrate would have four pieces:
- A test-case loader. YAML files under cases/, one per test case, containing the input message, the expected output shape, the expected tool-call sequence (if any), and the scoring rubric.
- An A2A client that drives the agent via whichever transport the test case specifies. The same test case could be run against JSON-RPC, REST, WebSocket, or gRPC with a one-line transport selector change, which would be how you verify that the agent behaves consistently across wire protocols.
- A scorer. A pluggable scoring function that consumes the agent’s response and emits a structured score. For grounded QA cases, a four-bucket factuality rubric (correct, wrong, unsupported, refused). For tool-use cases, a check of the expected tool-call sequence against what the agent actually invoked. For streaming cases, both the final output and the intermediate SSE events.
- A report writer. JSON and Markdown outputs, shaped for both CI consumption and human review.
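The pluggable scorer is the piece with the clearest interface. A minimal sketch of what it might look like in Rust, using only the standard library; every name here (Scorer, Score, Factuality) is hypothetical, not from any existing crate:

```rust
use std::collections::BTreeMap;

/// The four-bucket factuality rubric described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Factuality {
    Correct,
    Wrong,
    Unsupported,
    Refused,
}

/// Structured score: one primary bucket plus named numeric axes
/// (e.g. citation-exists, citation-supports).
#[derive(Debug)]
pub struct Score {
    pub factuality: Factuality,
    pub axes: BTreeMap<String, f64>,
}

/// A pluggable scoring function: consumes the captured final answer
/// and emits a structured Score. A fuller version would also take
/// the recorded stream events and cited sources.
pub trait Scorer {
    fn name(&self) -> &str;
    fn score(&self, final_answer: &str) -> Score;
}
```

Keeping the trait this narrow is the point: the harness core only needs `name` (for the report) and `score`; everything rubric-specific lives behind the trait.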
The whole thing would run as a single binary, drop a report into target/eval/<run-id>/, and exit non-zero if any score regressed beyond a configured tolerance. The kind of thing you’d wire into cargo test --release and forget about until it caught something.
Why A2A is the right substrate
Three properties of the protocol make it a better eval substrate than any framework-specific API I’ve looked at:
It’s framework-agnostic. An A2A agent written in Python using the official SDK, an A2A agent written in Rust, and an A2A agent written in Go all speak the same wire format. A harness could talk to all three identically, which would let you run the same eval against your production agent and a candidate replacement agent written in a completely different stack, and get numbers that are directly comparable.
It’s transport-agnostic. Real production agents tend to serve multiple transports (gRPC for service-to-service calls, REST for the web UI, WebSocket for interactive streaming). A harness that only speaks one transport can’t verify that the agent behaves consistently across the others. A2A’s four transports let the same test case cover the full surface.
Streaming is a first-class citizen. A2A defines Server-Sent Events as part of the core spec, so a harness could watch an agent’s intermediate steps (tool calls, partial-result events, agent-to-agent handoffs) rather than just the final response. That is the difference between “this answer is right” and “this answer is right and the agent got there through the expected sequence of steps.” The second is what you need for real debugging.
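Watching the stream does not require much machinery. A minimal SSE frame parser, written against the event-stream format (`event:` and `data:` fields, blank-line dispatch, default event type "message"); this sketch works on a fully buffered body and ignores `id:` and `retry:` fields, where a real client would parse incrementally off the transport:

```rust
/// Splits a raw SSE stream body into (event-type, data) pairs.
/// Per the event-stream format, an event is dispatched at each
/// blank line, and only if its data buffer is non-empty.
pub fn parse_sse(body: &str) -> Vec<(String, String)> {
    let mut events = Vec::new();
    let mut ev = String::from("message"); // spec default event type
    let mut data: Vec<String> = Vec::new();
    for line in body.lines() {
        if line.is_empty() {
            if !data.is_empty() {
                events.push((ev.clone(), data.join("\n")));
            }
            ev = String::from("message");
            data.clear();
        } else if let Some(rest) = line.strip_prefix("event:") {
            ev = rest.trim_start().to_string();
        } else if let Some(rest) = line.strip_prefix("data:") {
            data.push(rest.trim_start().to_string());
        }
    }
    if !data.is_empty() {
        events.push((ev, data.join("\n")));
    }
    events
}
```

The harness would filter the resulting pairs by event type to recover the tool-call and partial-result timeline.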
Tool use as a wire-level signal
From a harness’s perspective, tool use is a sequence of intermediate events on the SSE stream. A scorer could watch the stream, compare the observed tool-call sequence against the expected sequence in the test case, and score the match. Because it is watching the wire protocol, it would not need to know whether the agent internally uses a particular tool-use framework. The test case is written against the observed behavior, not the implementation.
That property means the same eval cases could run against agents written in completely different stacks without the harness changing. The eval contract is the wire format, not the library.
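The sequence check itself can be a subsequence match over the recorded events. A sketch, order-sensitive only (the parallel-group relaxation mentioned later is left out); `ExpectedCall` and `args_contain` mirror the hypothetical test-case format, not any existing schema:

```rust
/// An expected tool call from a test case: a tool name plus
/// substrings that must appear in the serialized arguments.
pub struct ExpectedCall {
    pub name: String,
    pub args_contain: Vec<String>,
}

/// Order-sensitive check of observed (name, serialized-args) pairs
/// against the expected sequence. Extra observed calls between
/// expected ones are allowed; missing or out-of-order expected
/// calls fail the match.
pub fn sequence_matches(expected: &[ExpectedCall], observed: &[(String, String)]) -> bool {
    let mut it = observed.iter();
    expected.iter().all(|exp| {
        // `any` advances the shared iterator, so each expected call
        // must be found strictly after the previous one.
        it.any(|(name, args)| {
            name == &exp.name && exp.args_contain.iter().all(|s| args.contains(s.as_str()))
        })
    })
}
```

Note what the matcher does not inspect: anything about the agent's internals. It only sees names and serialized arguments that crossed the wire.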
A test case, end to end
A concrete case in the format I have in mind would look something like this:
```yaml
case: grounded-qa-001
transport: grpc
input:
  role: user
  content: "What penalty applies under clause 3.2 of contract C-2023-147?"
expected:
  factuality: correct
  cited_sources:
    - document_id: "C-2023-147"
      passage: "clause_3.2"
  tool_calls:
    - name: "retrieve_passage"
      args_contain: ["C-2023-147", "clause"]
    - name: "verify_citation"
max_steps: 6
scorer: refusal-rubric-v1
```
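In memory, the loader would parse each file into a plain struct. A sketch of that shape; the field names mirror the YAML above, and a real loader would derive this via a YAML crate rather than construct it by hand as the dependency-free example here does:

```rust
/// In-memory shape of one loaded test case. All field names are
/// hypothetical and track the YAML sketch in this post.
pub struct EvalCase {
    pub case: String,
    pub transport: String, // "jsonrpc" | "rest" | "websocket" | "grpc"
    pub input_role: String,
    pub input_content: String,
    pub expected_factuality: String,
    /// (tool name, substrings the serialized args must contain)
    pub expected_tool_calls: Vec<(String, Vec<String>)>,
    pub max_steps: u32,
    pub scorer: String,
}
```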
When a harness runs that case it would:
- Open a gRPC A2A client against the agent under test.
- Send the input message.
- Watch the SSE stream for intermediate tool-call events and record them.
- Capture the final answer and its citations.
- Run the refusal-rubric-v1 scorer against the captured answer, emitting a four-bucket factuality score plus citation-exists and citation-supports axes.
- Separately verify the tool-call sequence against the expected sequence (order-insensitive for steps marked as parallel, order-sensitive otherwise).
- Write the combined score to the run report.
A pipeline change that moved factuality from correct to unsupported on that case would be a regression. A pipeline change that dropped the verify_citation tool call from the sequence would also be a regression. Both axes are independent and both have to be checked.
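The two-axis rule can be stated directly in code. A sketch; the specific policy here (any move away from Correct counts as a regression, as does a previously matching tool sequence that stops matching) is my assumption, not something the post's rubric pins down:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Factuality {
    Correct,
    Wrong,
    Unsupported,
    Refused,
}

/// Result of one case on one run: the factuality bucket and
/// whether the observed tool-call sequence matched the expectation.
pub struct CaseResult {
    pub factuality: Factuality,
    pub tool_sequence_ok: bool,
}

/// Checks both regression axes independently, as the post requires:
/// neither axis can mask a regression on the other.
pub fn is_regression(before: &CaseResult, after: &CaseResult) -> bool {
    let factuality_regressed =
        before.factuality == Factuality::Correct && after.factuality != Factuality::Correct;
    let sequence_regressed = before.tool_sequence_ok && !after.tool_sequence_ok;
    factuality_regressed || sequence_regressed
}
```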
Why this substrate
The broader point is that a public eval suite for A2A-speaking agents, covering the general primitives (grounded QA, tool selection, refusal behavior, multi-agent handoffs) against a public corpus, is the kind of artifact the agent ecosystem is still missing. Model-eval suites exist; agent-eval suites at the same resolution mostly don’t. The wire protocol is the right place to start building one.
Status
This post is a design sketch, not a release. The architectural claim — that the wire protocol is the right eval boundary — is the load-bearing part and it stands on its own. A reference implementation is a follow-on I’d want to scope carefully against a real use case before writing, rather than building it speculatively to back up the post. If and when it exists, this page will link to it.