Skill

Ai Agent Evaluator

AI-powered agent evaluation and benchmarking assistant — design evaluation suites, run structured assessments (task completion rate, latency, safety, reasoni...

Verified: 2026-05-15 (clawhub-ingest-2026-05-15)

When to use Ai Agent Evaluator

Choose if

You're standing up an evaluation discipline for an AI agent and want methodology guidance: how to design eval suites, which benchmarks map to which use case, how to read failure modes from logs, and how to plan red-team adversarial tests. Bilingual (EN / 中文). Pair it with an execution platform (DeepEval, PromptFoo, Braintrust, LangSmith); the skill references these but does not embed them.
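For orientation, this is a minimal Python sketch of the suite-level summary that an evaluation methodology typically aims at, once an execution platform has produced per-run results. The `RunRecord` schema, its field names, the `summarize` helper, and the sample task IDs are all hypothetical and are not taken from SKILL.md; the skill itself ships no code.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RunRecord:
    """One agent run on one eval task (hypothetical schema, not from SKILL.md)."""
    task_id: str
    completed: bool     # did the agent finish the task successfully?
    latency_s: float    # wall-clock seconds for the run
    safety_flag: bool   # True if a reviewer or filter flagged the output


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate per-run records into suite-level metrics."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    n = len(runs)
    return {
        "task_completion_rate": sum(r.completed for r in runs) / n,
        "median_latency_s": median(r.latency_s for r in runs),
        "safety_flag_rate": sum(r.safety_flag for r in runs) / n,
    }


# Illustrative records only; real records would come from an eval runner.
runs = [
    RunRecord("book-flight", True, 2.0, False),
    RunRecord("refund-order", False, 6.0, False),
    RunRecord("prompt-injection-01", True, 4.0, True),
    RunRecord("summarize-ticket", True, 3.0, False),
]
summary = summarize(runs)
print(summary)
```

Keeping the per-run records separate from the aggregate makes it easier to trace a poor completion rate or a safety flag back to the individual tasks behind it, which is where failure-mode analysis starts.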

Avoid if

You want a runnable eval harness rather than methodology: SKILL.md states that the skill provides "evaluation methodology and guidance, not direct code execution". Also avoid using it as the sole basis for production safety sign-off: the skill notes that safety evaluations require human security team involvement and that results must be "reviewed by qualified ML engineers before deployment decisions".

Risk Flags

  • LOW (scope): Methodology-only skill. SKILL.md states it provides "evaluation methodology and guidance, not direct code execution"; agents needing an actual eval runner must use DeepEval, PromptFoo, Braintrust, LangSmith, or equivalents.
  • LOW (data_quality): SKILL.md notes benchmark scores are "time-sensitive" and recommends "always check latest published leaderboards"; safety evaluations require human security team involvement, and results must be reviewed by qualified ML engineers before deployment decisions.

Cost

Type: Unknown

Distribution

ClawHub: ai-agent-evaluator
License: MIT-0