Your developers now write tests through Copilot, Cursor, and Claude Code. AI agents are fast, and they make the same mistakes consistently. These tools target the exact moments where AI-generated test artifacts fail in predictable, repeatable ways. None of them tries to replace a tester.
AI agents generate tests faster than humans can review them. The Sentinel family is the gate that catches their recurring mistakes at PR time. The Detective family diagnoses what slipped through.
Browser-based tools for the conversation that happens before a line of code is written, where AI-assisted teams most often skip the acceptance criteria (AC) review.
AC Classifier
Paste a user story. Get acceptance criteria bucketed into positive, negative, and edge case, each tagged with a test strategy: unit, integration, UI, or manual. For when the team goes silent in the three amigos session.
Paste an oversized story. Get back vertical slices — each with its own dev scope, QA scope, and a verdict on whether it’s independently releasable. Stops teams from creating sub-tasks instead of properly splitting work.
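To make the bucket-and-strategy idea concrete, here is a minimal sketch of how classified acceptance criteria could be represented. The class names, the `group_by_bucket` helper, and the sample criteria are all hypothetical and written by hand for illustration; this is not the tool's actual output format.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    EDGE_CASE = "edge case"


class Strategy(Enum):
    UNIT = "unit"
    INTEGRATION = "integration"
    UI = "UI"
    MANUAL = "manual"


@dataclass(frozen=True)
class AcceptanceCriterion:
    text: str
    bucket: Bucket
    strategy: Strategy


def group_by_bucket(criteria):
    """Return {Bucket: [AcceptanceCriterion, ...]} with every bucket present."""
    grouped = {bucket: [] for bucket in Bucket}
    for criterion in criteria:
        grouped[criterion.bucket].append(criterion)
    return grouped


# Hypothetical criteria for a "reset password" story (assumed, not tool output).
criteria = [
    AcceptanceCriterion("Valid email receives a reset link", Bucket.POSITIVE, Strategy.INTEGRATION),
    AcceptanceCriterion("Unknown email shows a generic message", Bucket.NEGATIVE, Strategy.UI),
    AcceptanceCriterion("Reset link expires after use", Bucket.EDGE_CASE, Strategy.UNIT),
]

grouped = group_by_bucket(criteria)
# Buckets left empty are the gaps worth raising in the three amigos session.
empty_buckets = [bucket.value for bucket, items in grouped.items() if not items]
```

Keeping every bucket in the result, even when empty, makes a missing negative or edge case visible instead of silently absent.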
GitHub Actions bots that score AI-generated test artifacts against a published rubric, post structured PR comments, and block merge below a threshold your team sets.
A GitHub Actions bot that reviews BDD .feature files in pull requests. Scores each scenario against a published rubric, suggests rewrites for widget-level steps, and blocks merge when the score falls below a threshold you set.
A GitHub Actions bot that reviews Karate .feature files in pull requests. Catches schema-blind matchers, hard-coded auth, missing negative coverage, and embedded JS smells — the failure modes AI agents produce when generating API contract tests.
A GitHub Actions bot that reviews Playwright .spec.ts files in pull requests. Detects brittle selectors, missing assertions, and test patterns that produce false passes when AI generates the test.
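The score-and-gate mechanism these bots share can be sketched roughly as follows, under stated assumptions. The rule IDs, penalties, regular expressions, the `example-0.1` version label, and the default threshold of 80 are all invented for illustration; they are not the published rubric, and a real reviewer would likely parse the spec rather than pattern-match it.

```python
import re
from dataclasses import dataclass

RUBRIC_VERSION = "example-0.1"  # hypothetical version label


@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    pattern: re.Pattern  # each match counts as one violation
    penalty: int         # points deducted per violation


# Hypothetical rules for Playwright specs; not the published rubric.
RULES = [
    Rule("R1", "XPath selector", re.compile(r"locator\(\s*['\"](?://|xpath=)"), 15),
    Rule("R2", "positional CSS selector", re.compile(r":nth-child\("), 10),
    Rule("R3", "fixed sleep instead of a condition", re.compile(r"waitForTimeout\("), 10),
]


def review(spec_source: str, threshold: int = 80) -> dict:
    """Score one spec file from 100 downward and decide whether it passes the gate."""
    findings = []
    score = 100
    for rule in RULES:
        hits = len(rule.pattern.findall(spec_source))
        if hits:
            score -= rule.penalty * hits
            findings.append({"rule": rule.rule_id, "description": rule.description, "hits": hits})
    # A test with no expect() call can pass without checking anything.
    if "expect(" not in spec_source:
        score -= 30
        findings.append({"rule": "R4", "description": "no assertion", "hits": 1})
    score = max(score, 0)
    return {
        "rubric": RUBRIC_VERSION,
        "score": score,
        "passed": score >= threshold,
        "findings": findings,
    }


SAMPLE = """
test('checkout', async ({ page }) => {
  await page.locator('//div[3]/button').click();
  await page.waitForTimeout(3000);
});
"""

result = review(SAMPLE)
# result["score"] is 45: 100 - 15 (R1) - 10 (R3) - 30 (R4)
```

In a CI job, exiting with a non-zero status when `passed` is false fails the check; blocking the merge additionally depends on the repository marking that check as required.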
Diagnostic tools that ingest CI run history and reason over test failures — generating auditable hypotheses rather than asking you to read 4,000 lines of log output.
Ingests your CI run history and identifies tests that pass and fail non-deterministically. For each flaky test, generates a ranked hypothesis on the root cause — timing, shared state, environment dependency, or test ordering.
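A minimal sketch of the detection half of that idea, assuming run history is available as `(commit_sha, test_name, passed)` tuples: a test is treated as flaky when the same commit produced both a pass and a failure. The tuple shape, the `find_flaky` name, and the sample data are assumptions for illustration; ranking root-cause hypotheses is out of scope here.

```python
from collections import defaultdict


def find_flaky(runs):
    """Return [(test_name, mixed_commit_count), ...], most suspicious first.

    runs: iterable of (commit_sha, test_name, passed) tuples.
    """
    outcomes = defaultdict(set)
    for sha, test, passed in runs:
        outcomes[(sha, test)].add(passed)

    flaky = defaultdict(int)
    for (_sha, test), seen in outcomes.items():
        # Same code, different results: the outcome is non-deterministic.
        if seen == {True, False}:
            flaky[test] += 1

    # Rank by number of commits with mixed outcomes, then by name for stable output.
    return sorted(flaky.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical run history (assumed data, not real CI output).
runs = [
    ("a1", "test_login", True),
    ("a1", "test_login", False),
    ("a1", "test_cart", False),
    ("a1", "test_cart", False),
    ("b2", "test_login", False),
    ("b2", "test_login", True),
    ("b2", "test_search", True),
]

flaky_tests = find_flaky(runs)
# test_cart fails on every run of the same commit, so it is broken, not flaky.
```

Grouping by commit is the design choice that matters: a test that starts failing after a code change is a regression, while mixed results on identical code point to timing, shared state, environment, or ordering.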
A Copilot Review comment tells you a test could be better. A Sentinel tells you which rule it broke, by how much, and blocks the merge until the score clears your threshold.
| | Generalist LLM reviewer | Sentinel family |
|---|---|---|
| Rubric | Implicit, lives in the model | Published, versioned, debatable |
| Calibration | Generic “good code” | AI-mistake failure modes |
| Output | Advisory comment | Structured score + gate |
| Determinism | Drifts with model updates | Pinned rubric, reproducible |
| Governance | “The LLM thought so” | “Rule 7 of rubric v1.4” |