QA Toolkit

Quality gates for the era of AI-generated tests.

Your developers now write tests through Copilot, Cursor, and Claude Code. AI agents are fast — and they make the same mistakes consistently. These tools target the exact moments where AI-generated test artifacts fail in predictable, repeatable ways. None of them try to replace a tester.

6 tools · 2 live · 4 in design · Built by Fayaz Mohammed
AI agents generate tests faster than humans can review them. The Sentinel family is the gate that catches their recurring mistakes at PR time; the Detective family diagnoses what slipped through.
family · 01 · upstream

Before code is written

Browser-based tools for the conversation that happens before a line of code is written, which is where AI-assisted teams most often skip the acceptance criteria (AC) review.

01

AC Classifier

Paste a user story. Get acceptance criteria bucketed into positive, negative, and edge case — each tagged with a test strategy: unit, integration, UI, or manual. For when the team goes silent in the three amigos session.

Live · Web tool · Claude API
Try it
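
As a rough illustration of the output shape described for the AC Classifier, the buckets and strategy tags can be modelled as plain data. This is a sketch only: the class and function names below are hypothetical and are not taken from the tool, and the actual classification is done by the model, not by this code.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    EDGE_CASE = "edge case"


class Strategy(Enum):
    UNIT = "unit"
    INTEGRATION = "integration"
    UI = "UI"
    MANUAL = "manual"


@dataclass(frozen=True)
class AcceptanceCriterion:
    """One classified criterion (hypothetical shape, for illustration)."""
    text: str
    bucket: Bucket
    strategy: Strategy


def group_by_bucket(criteria):
    """Group criteria by bucket, keeping every bucket present even when empty."""
    grouped = {bucket: [] for bucket in Bucket}
    for criterion in criteria:
        grouped[criterion.bucket].append(criterion)
    return grouped
```

Keeping empty buckets in the result makes a gap visible: a story with no negative criteria shows up as an empty list instead of silently disappearing.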
02

Story Splitter

Paste an oversized story. Get back vertical slices — each with its own dev scope, QA scope, and a verdict on whether it’s independently releasable. Stops teams from creating sub-tasks instead of properly splitting work.

Live · Web tool · Claude API
Try it
family · 02 · sentinel

At the pull request

GitHub Actions bots that score AI-generated test artifacts against a published rubric, post structured PR comments, and block merge below a threshold your team sets.
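
The score-and-gate loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Sentinel implementation: the `Finding` shape, the penalty-based scoring out of 100, and the rule IDs are all hypothetical.

```python
from dataclasses import dataclass

MAX_SCORE = 100  # assumption: scores are expressed out of 100


@dataclass(frozen=True)
class Finding:
    rule_id: str   # hypothetical rubric rule identifier, e.g. "R7"
    penalty: int   # points deducted for this violation
    message: str


def score(findings):
    """Start from MAX_SCORE and subtract each penalty, never going below 0."""
    return max(0, MAX_SCORE - sum(f.penalty for f in findings))


def gate(findings, threshold, rubric_version):
    """Return (passed, comment). The merge is blocked when score < threshold."""
    total = score(findings)
    passed = total >= threshold
    lines = [
        f"Rubric {rubric_version}: score {total}/{MAX_SCORE} (threshold {threshold})"
    ]
    lines += [f"- {f.rule_id} (-{f.penalty}): {f.message}" for f in findings]
    lines.append("Result: PASS" if passed else "Result: BLOCKED")
    return passed, "\n".join(lines)
```

In a GitHub Actions job, one common way to enforce such a gate is to exit with a non-zero status when `passed` is false and to mark that check as required in branch protection.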

03

QA Sentinel

A GitHub Actions bot that reviews BDD .feature files in pull requests. Scores each scenario against a published rubric, suggests rewrites for widget-level steps, and blocks merge when the score falls below a threshold you set.

In design · GitHub Actions · Python · Claude API
Read the rubric
04

Karate Reviewer

A GitHub Actions bot that reviews Karate .feature files in pull requests. Catches schema-blind matchers, hard-coded auth, missing negative coverage, and embedded JS smells — the failure modes AI agents produce when generating API contract tests.

In design · GitHub Actions · Java / JS · Claude API
Coming soon
05

Spec Sentinel

A GitHub Actions bot that reviews Playwright .spec.ts files in pull requests. Detects brittle selectors, missing assertions, and test patterns that produce false passes when AI generates the test.

In design · GitHub Actions · TypeScript · Claude API
Coming soon
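
To make two of the Spec Sentinel checks concrete, here is a rough text-based heuristic for brittle selectors and for tests that contain no `expect(` call. It is a sketch under loud assumptions: the regular expressions and finding names are invented for illustration, it only recognises tests that start with `test(` at the beginning of a line, and a real reviewer would more likely parse the TypeScript than scan it with regexes.

```python
import re

# Assumption: XPath locators and positional CSS pseudo-classes count as brittle.
BRITTLE_SELECTOR = re.compile(
    r"""locator\(\s*['"`](?://|xpath=)|:nth-child\(|:nth-of-type\("""
)
TEST_START = re.compile(r"^[ \t]*test\(", re.MULTILINE)
TITLE = re.compile(r"""\s*(['"`])(.*?)\1""")


def review_spec(source):
    """Return a list of (kind, detail) findings for one spec file's text."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if BRITTLE_SELECTOR.search(line):
            findings.append(("brittle-selector", f"line {lineno}"))
    # Everything between one test( and the next is treated as that test's body.
    for body in TEST_START.split(source)[1:]:
        if "expect(" not in body:
            match = TITLE.match(body)
            title = match.group(2) if match else "<untitled>"
            findings.append(("missing-assertion", title))
    return findings


SAMPLE = """\
import { test, expect } from '@playwright/test';

test('adds item to cart', async ({ page }) => {
  await page.goto('/shop');
  await page.locator('div > ul > li:nth-child(3) button').click();
});

test('shows cart total', async ({ page }) => {
  await page.goto('/cart');
  await expect(page.getByTestId('total')).toHaveText('$10');
});
"""

for kind, detail in review_spec(SAMPLE):
    print(kind, detail)
```

The first sample test clicks through a positional selector and never asserts anything, which is the kind of false pass the description above refers to; the second test uses a test ID and an assertion and produces no findings.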
family · 03 · detective

After tests run

Diagnostic tools that ingest CI run history and reason over test failures — generating auditable hypotheses rather than asking you to read 4,000 lines of log output.

06

Flake Detective

Ingests your CI run history and identifies tests that pass and fail non-deterministically. For each flaky test, generates a ranked hypothesis on the root cause — timing, shared state, environment dependency, or test ordering.

In design · CLI · GitHub Action · Claude API
Coming soon
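
The detection step that Flake Detective describes can be sketched as follows. The working assumption here, which is ours and not stated by the tool, is that a test counts as flaky when it both passes and fails on the same commit, since a failure that only appears on a different commit may be a real regression. The `Run` record and `find_flaky` function are hypothetical names, and the hypothesis-ranking step is not covered.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One test result from CI history (hypothetical shape)."""
    test_id: str
    commit: str
    passed: bool


def find_flaky(runs):
    """Map each flaky test to the sorted commits where it both passed and failed."""
    outcomes = defaultdict(set)
    for run in runs:
        outcomes[(run.test_id, run.commit)].add(run.passed)

    flaky = defaultdict(set)
    for (test_id, commit), seen in outcomes.items():
        if seen == {True, False}:  # same code, different results
            flaky[test_id].add(commit)
    return {test_id: sorted(commits) for test_id, commits in flaky.items()}
```

Grouping by commit keeps the code under test constant, so any difference in outcome has to come from something else, such as timing, shared state, environment, or test ordering.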
how a sentinel differs

Not a smarter Copilot Review.

A Copilot Review comment tells you a test could be better. A Sentinel tells you which rule it broke, by how much, and blocks the merge until the score clears your threshold.

Aspect | Generalist LLM reviewer | Sentinel family
Rubric | Implicit, lives in the model | Published, versioned, debatable
Calibration | Generic “good code” | AI-mistake failure modes
Output | Advisory comment | Structured score + gate
Determinism | Drifts with model updates | Pinned rubric, reproducible
Governance | “The LLM thought so” | “Rule 7 of rubric v1.4”