How to Test If Your AI Prompts Actually Work (Not Just Feel Like They Work)
March 18, 2026 · 8 min read · Prompt Optimizer Team
Most developers evaluate their prompts the same way they evaluate a hairstyle: look at the output, think "yeah, that seems good," and move on. The problem is that "seems good" is not a testing methodology.
Prompt quality is fuzzy, context-dependent, and easy to fool yourself about — especially when you wrote the prompt yourself. This guide is about replacing intuition with reproducible, structured evaluation.
Why Vibe-Checking Fails at Scale
When you're writing one prompt for one use case, intuition is fine. But most real-world prompt engineering involves:
- prompts that run thousands of times with variable inputs,
- optimization passes that subtly break constraints,
- team members editing prompts without understanding the original intent, and
- model upgrades that change output behavior on prompts that weren't explicitly tested.
The Two-Phase Assertion Pipeline
Phase 1 — Deterministic assertions (the gatekeeper). Before an LLM evaluator touches the output, run it through synchronous, zero-cost checks. If it fails a hard constraint, the pipeline short-circuits. No API call, no latency, no cost.
Phase 2 — LLM-graded assertions (the nuance). If and only if the output passes Phase 1, route it to a cheap, context-aware model with a strict grading rubric. Score for tone, factuality, and clarity on a 0.0–1.0 scale.
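The two phases can be sketched in a few lines of Python. This is a minimal illustration, not a real implementation: `fake_grader` is a hypothetical stand-in for a call to a cheap grading model, and the checks mirror the assertion types discussed below.

```python
from typing import Callable, Optional

def run_pipeline(output: str,
                 deterministic_checks: list[Callable[[str], bool]],
                 llm_grader: Callable[[str], float]) -> Optional[float]:
    """Phase 1: zero-cost gatekeeper checks. Phase 2: LLM-graded score."""
    for check in deterministic_checks:
        if not check(output):
            return None  # short-circuit: no API call, no latency, no cost
    return llm_grader(output)  # nuanced 0.0-1.0 quality score

checks = [
    lambda o: len(o) <= 300,               # length-max
    lambda o: "Acme" in o,                 # contains company name
    lambda o: "synergy" not in o.lower(),  # banned word
]
fake_grader = lambda o: 0.9  # stand-in for a real grading-model call

print(run_pipeline("Acme ships faster now.", checks, fake_grader))  # 0.9
print(run_pipeline("synergy everywhere", checks, fake_grader))      # None
```

The key design point is the short-circuit: an output that fails a hard constraint never reaches the paid evaluator.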
Here's what a full assertion suite looks like for a LinkedIn post generation prompt:
```json
[
  { "type": "length-max", "value": 300 },
  { "type": "contains", "value": "{{company_name}}" },
  { "type": "not-contains", "value": "synergy" },
  { "type": "llm-rubric", "value": "Opens with a hook that creates curiosity. Score 0.0-1.0." },
  { "type": "constraint-preservation", "value": "Professional tone maintained throughout" }
]
```

Run this against 20 sample outputs. Now you have a pass rate, not a gut feeling.
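A pass rate is just the fraction of outputs that satisfy every assertion. Here is a hedged sketch of a deterministic-phase runner for the suite above; the `check` function and the `{{placeholder}}` resolution are illustrative assumptions, not a documented API.

```python
def check(assertion: dict, output: str, ctx: dict) -> bool:
    """Evaluate one deterministic assertion against one output."""
    t, v = assertion["type"], assertion["value"]
    if t == "length-max":
        return len(output) <= v
    if t == "contains":
        for key, val in ctx.items():          # resolve {{placeholders}}
            v = v.replace("{{" + key + "}}", val)
        return v in output
    if t == "not-contains":
        return v not in output.lower()
    return True  # LLM-graded types are handled in Phase 2

suite = [
    {"type": "length-max", "value": 300},
    {"type": "contains", "value": "{{company_name}}"},
    {"type": "not-contains", "value": "synergy"},
]
outputs = ["Acme just shipped...", "We leverage synergy at Acme..."]
ctx = {"company_name": "Acme"}

passed = sum(all(check(a, o, ctx) for a in suite) for o in outputs)
print(f"pass rate: {passed}/{len(outputs)}")  # pass rate: 1/2
```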
Constraint Preservation: The Most Underrated Check
You start with a prompt that includes "Respond in bullet points only. Maximum 5 bullets." You optimize it. The new version is more precise — but dropped the bullet point instruction. Your prompt now generates prose. The constraint was lost silently. Constraint preservation checking catches this before it reaches production.
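A constraint-preservation check can be as simple as listing the hard constraints the original prompt encodes and verifying each one survives the optimized version. The sketch below uses keyword matching as a deliberate simplification; a production check would typically use an LLM to catch semantically equivalent rephrasings.

```python
# Constraints extracted from the original prompt (illustrative assumption).
REQUIRED_CONSTRAINTS = {
    "bullet points": "output format",
    "maximum 5": "bullet count limit",
}

def lost_constraints(optimized_prompt: str) -> list[str]:
    """Return descriptions of constraints missing from the optimized prompt."""
    p = optimized_prompt.lower()
    return [desc for phrase, desc in REQUIRED_CONSTRAINTS.items()
            if phrase not in p]

optimized = "Summarize the report precisely and concisely."
print(lost_constraints(optimized))  # ['output format', 'bullet count limit']
```

A non-empty result blocks the deploy: the optimizer made the prompt "better" while silently dropping a hard requirement.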
Semantic Drift: Did the Meaning Survive?
Semantic drift is computed as one minus the cosine similarity between the embeddings of your original and optimized prompts. Score 0 = identical meaning. Score 1 = completely different.
| Drift Score | Interpretation | Action |
|---|---|---|
| 0.0 – 0.15 | Paraphrasing only | Safe to ship |
| 0.15 – 0.35 | Meaningful rewording | Review recommended |
| 0.35+ | May be asking something different | Investigate before deploying |
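The drift score and the thresholds in the table above can be sketched as follows. The toy three-dimensional vectors stand in for real prompt embeddings, which would come from an embedding model; everything else is plain arithmetic.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def drift(emb_original: list[float], emb_optimized: list[float]) -> float:
    """0 = identical meaning, 1 = completely different."""
    return 1.0 - cosine(emb_original, emb_optimized)

def classify(score: float) -> str:
    if score < 0.15:
        return "safe to ship"
    if score < 0.35:
        return "review recommended"
    return "investigate before deploying"

original = [0.9, 0.1, 0.3]      # toy embedding of the original prompt
optimized = [0.85, 0.15, 0.35]  # toy embedding of the optimized prompt
score = drift(original, optimized)
print(round(score, 3), "->", classify(score))  # 0.004 -> safe to ship
```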
A Practical Testing Workflow
- Write your baseline prompt and generate 20 sample outputs with varied inputs.
- Define your assertions. Start deterministic, add LLM-rubric assertions for quality.
- Establish your pass rate baseline. Any future run that scores below it is a regression.
- Run your optimization. Check assertion pass rate, constraint preservation, semantic drift.
- Only ship if all three signals are green.
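The final gate in the workflow above condenses to a single predicate over the three signals. The threshold values here are illustrative assumptions matching the earlier drift table, not fixed recommendations.

```python
def should_ship(pass_rate: float, baseline_pass_rate: float,
                constraints_lost: int, drift_score: float) -> bool:
    """Ship only if all three signals are green."""
    return (pass_rate >= baseline_pass_rate  # no regression vs. baseline
            and constraints_lost == 0        # every hard constraint survived
            and drift_score < 0.15)          # meaning preserved

print(should_ship(0.95, 0.90, 0, 0.08))  # True
print(should_ship(0.95, 0.90, 1, 0.08))  # False: a constraint was dropped
```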
Try assertion-based evaluation on your prompts
No dataset required. Paste a prompt, define assertions, get a pass rate in seconds.
Try quick-evaluate