RunArtifacts — the structured record of everything the agent did — and produce a pass/fail result.
There are two types:
- Programmatic verifiers: inspect the environment state directly. For example, “Did the agent send an email to the correct recipient?” is checked by querying the email tool server’s state and by reviewing the state diffs (before/after environment snapshots).
- Rubric-based reward models: reward models that evaluate the agent’s actions against a rubric. Useful for subjective criteria like “Did the agent communicate professionally?”
RunArtifacts interface, so they work with any agent implementation.
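
As a rough illustration of the two verifier types described above, the sketch below shows how each might consume a shared run record and return a pass/fail result. Everything here is an assumption made for the example: the field names on `RunArtifacts` (`actions`, `state_before`, `state_after`), the `VerifierResult` type, the `sent_emails` state key, and the `score_fn` callable standing in for a reward model are all hypothetical, not the actual interface.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol


@dataclass
class RunArtifacts:
    """Hypothetical stand-in for the structured record of a run."""
    actions: list[dict[str, Any]] = field(default_factory=list)
    state_before: dict[str, Any] = field(default_factory=dict)
    state_after: dict[str, Any] = field(default_factory=dict)


@dataclass
class VerifierResult:
    """Hypothetical pass/fail result with a short explanation."""
    passed: bool
    reason: str = ""


class Verifier(Protocol):
    """Any object with this method can act as a verifier."""
    def verify(self, artifacts: RunArtifacts) -> VerifierResult: ...


def state_diff(before: dict[str, Any], after: dict[str, Any]) -> dict[str, Any]:
    """Return the keys that were added or changed between two snapshots."""
    return {k: v for k, v in after.items() if k not in before or before[k] != v}


@dataclass
class EmailRecipientVerifier:
    """Programmatic check: was a new email sent to the expected recipient?"""
    expected_recipient: str

    def verify(self, artifacts: RunArtifacts) -> VerifierResult:
        # Assumed layout: the email tool server exposes a "sent_emails" list.
        before = artifacts.state_before.get("sent_emails", [])
        after = artifacts.state_after.get("sent_emails", [])
        new_emails = [e for e in after if e not in before]
        if any(e.get("to") == self.expected_recipient for e in new_emails):
            return VerifierResult(True, "email sent to expected recipient")
        return VerifierResult(False, "no new email to expected recipient")


@dataclass
class RubricVerifier:
    """Rubric-based check: score_fn is a placeholder for a reward model
    that returns a score in [0, 1] for the run against the rubric."""
    rubric: str
    score_fn: Callable[[str, RunArtifacts], float]
    threshold: float = 0.5

    def verify(self, artifacts: RunArtifacts) -> VerifierResult:
        score = self.score_fn(self.rubric, artifacts)
        return VerifierResult(score >= self.threshold, f"score={score:.2f}")


if __name__ == "__main__":
    artifacts = RunArtifacts(
        state_before={"sent_emails": []},
        state_after={"sent_emails": [{"to": "alice@example.com", "body": "Hi"}]},
    )
    verifiers: list[Verifier] = [
        EmailRecipientVerifier("alice@example.com"),
        # A constant score stands in for a real reward model call.
        RubricVerifier("Did the agent communicate professionally?", lambda r, a: 0.8),
    ]
    for v in verifiers:
        print(type(v).__name__, v.verify(artifacts))
```

Because both classes depend only on the run record and not on how the agent produced it, the same verifiers could in principle be reused across agent implementations, which is the property the text attributes to the shared interface.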
