Skill
prompt-eval
Evaluate and optimize any AI prompt (`prompt_a`) with a 6-step pipeline: test plan, ~50 test cases, prompt execution, evaluator prompt (`prompt_b`), automate...
When to use prompt-eval
Choose if
You need a structured, evidence-backed prompt evaluation harness with ~50 test cases, a separate evaluator prompt (`prompt_b`), and pass/fail gates on P0 score count, core TP delta, and overall pass rate. Best when the prompt is heading to production and you want bad-case patterns grouped by root cause, not flat per-case failure lists.
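To make the three gates concrete, here is a minimal sketch of how such a check could be combined into one pass/fail decision. It is not the skill's implementation: the class names, field names, and every threshold value are hypothetical, and since this listing does not define "P0 score count" or "core TP delta", the sketch assumes the first is a count of P0-level failures and treats the second as an opaque number that must not fall below a floor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    # Hypothetical placeholder values, not taken from the skill.
    max_p0_failures: int = 0
    min_core_tp_delta: float = 0.0
    min_pass_rate: float = 0.9


@dataclass(frozen=True)
class RunSummary:
    p0_failures: int      # assumed meaning of "P0 score count"
    core_tp_delta: float  # treated as an opaque metric
    passed: int
    total: int


def check_gates(summary, thresholds=GateThresholds()):
    """Return (overall_ok, per_gate_results) for one evaluation run."""
    pass_rate = summary.passed / summary.total if summary.total else 0.0
    results = {
        "p0": summary.p0_failures <= thresholds.max_p0_failures,
        "core_tp_delta": summary.core_tp_delta >= thresholds.min_core_tp_delta,
        "pass_rate": pass_rate >= thresholds.min_pass_rate,
    }
    # The run passes only if every individual gate passes.
    return all(results.values()), results


if __name__ == "__main__":
    ok, detail = check_gates(RunSummary(p0_failures=0, core_tp_delta=0.05, passed=46, total=50))
    print(ok, detail)
```

Returning the per-gate results alongside the overall verdict keeps a failed run explainable, which fits the listing's emphasis on evidence-backed evaluation.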
Avoid if
You only need a one-shot prompt rewrite or a quick A/B comparison. This skill runs a 6-stage pipeline with confirmations between stages and writes artifacts to `./prompt-eval-results/`. Also avoid it for prompts that handle secrets unless you can apply the README's recommended redaction and retention policy.
Risk Flags
- MEDIUM scope: The README requires the skill to treat `prompt_a`, test case payloads, and model outputs as untrusted. If adversarial instructions are not confined to test-case fields, the run is unsafe. Operators must enforce the confirmations between stages.
- LOW scope: The README states that the optimization loop runs at most one additional iteration if the validation gates fail on the first pass. Agents that need deeper iteration must re-invoke the skill.
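The iteration cap in the LOW flag can be pictured as a bounded retry loop. The sketch below only illustrates that control flow under stated assumptions: `evaluate` and `optimize` are hypothetical callables standing in for the skill's validation gates and optimization step, and only the limit of one additional iteration comes from this listing.

```python
from typing import Callable, Tuple

# From the listing: at most one additional iteration after a failed first pass.
MAX_EXTRA_ITERATIONS = 1


def run_with_bounded_retry(
    evaluate: Callable[[str], bool],   # hypothetical: True if all gates pass
    optimize: Callable[[str], str],    # hypothetical: returns a revised prompt
    prompt: str,
) -> Tuple[str, bool, int]:
    """Evaluate once, then re-optimize and re-evaluate up to the cap."""
    passed = evaluate(prompt)
    extra = 0
    while not passed and extra < MAX_EXTRA_ITERATIONS:
        prompt = optimize(prompt)
        passed = evaluate(prompt)
        extra += 1
    # Total evaluations performed = first pass + extra iterations.
    return prompt, passed, 1 + extra


if __name__ == "__main__":
    result = run_with_bounded_retry(lambda p: False, lambda p: p + "!", "draft")
    print(result)
```

A caller that gets back `passed == False` after the capped loop would need to start a fresh run, which matches the note that deeper iteration requires re-invoking the skill.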
Cost
Type: Free
Distribution
- ClawHub: prompt-eval
License
- MIT-0