QA Toolkit

Quality gates for the era of AI-generated tests.

Your developers now write tests through Copilot, Cursor, and Claude Code. AI agents are fast — and they make the same mistakes consistently. These tools target the exact moments where AI-generated test artifacts fail in predictable, repeatable ways. None of them try to replace a tester.

8 tools · 4 live · 4 in design · Built by Fayaz Mohammed

AI agents generate tests faster than humans can review them. The Sentinel family is the gate that catches their recurring mistakes at PR time. The Detective family diagnoses what slipped through.

family · 01 · upstream

Before code is written

Browser-based tools for the conversation that happens before a line of code is written — where AI-assisted teams most often skip the AC review.

AC Classifier

Paste a user story. Get acceptance criteria bucketed into positive, negative, and edge case — each tagged with a test strategy: unit, integration, UI, or manual. For when the team goes silent in the three amigos session.

Live · Web tool · Claude API

Try it

Story Splitter

Paste an oversized story. Get back vertical slices — each with its own dev scope, QA scope, and a verdict on whether it’s independently releasable. Stops teams from creating sub-tasks instead of properly splitting work.

Live · Web tool · Claude API

Try it

family · 02 · sentinel

At the pull request

GitHub Actions bots that score AI-generated test artifacts against a published rubric, post structured PR comments, and block merge below a threshold your team sets.

QA Sentinel

A GitHub Actions bot that reviews BDD .feature files in pull requests. Scores each scenario against a published rubric, suggests rewrites for widget-level steps, and blocks merge when the score falls below a threshold you set.

In design · GitHub Actions · Python · Claude API

Read the rubric
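To make the score-and-gate idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the two rules, their penalty weights, the `WIDGET_STEP` regex, and the default threshold of 70 are hypothetical and are not the published rubric. The bot described above is in design and is listed as using the Claude API, so a regex pass like this only shows the shape of the output (a per-scenario score, named violations, and a pass/fail gate).

```python
import re
from dataclasses import dataclass

# Hypothetical rule: steps that name UI widgets instead of user intent.
WIDGET_STEP = re.compile(
    r"\b(click|tap|press|type|select)\b.*\b(button|field|dropdown|checkbox|link)\b",
    re.IGNORECASE,
)
STEP_KEYWORDS = ("Given", "When", "Then", "And", "But")


@dataclass
class ScenarioScore:
    name: str
    score: int              # 0-100, higher is better
    violations: list[str]   # human-readable rule hits


def parse_scenarios(feature_text: str) -> dict[str, list[str]]:
    """Collect the step lines under each 'Scenario:' heading."""
    scenarios: dict[str, list[str]] = {}
    current = None
    for line in feature_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Scenario:"):
            current = stripped[len("Scenario:"):].strip()
            scenarios[current] = []
        elif current is not None and stripped.startswith(STEP_KEYWORDS):
            scenarios[current].append(stripped)
    return scenarios


def score_scenario(name: str, steps: list[str]) -> ScenarioScore:
    """Apply the hypothetical rules; penalties are illustrative weights."""
    violations = []
    score = 100
    for step in steps:
        if WIDGET_STEP.search(step):
            violations.append(f"widget-level step: {step}")
            score -= 20
    if not any(step.startswith("Then") for step in steps):
        violations.append("no Then step (nothing is asserted)")
        score -= 40
    return ScenarioScore(name, max(score, 0), violations)


def gate(feature_text: str, threshold: int = 70) -> tuple[bool, list[ScenarioScore]]:
    """Return (merge_allowed, per-scenario scores)."""
    results = [score_scenario(n, s) for n, s in parse_scenarios(feature_text).items()]
    return all(r.score >= threshold for r in results), results
```

In a GitHub Actions job, a wrapper could call `gate()` on each changed .feature file and exit non-zero when it returns `False`, which is what lets a required status check block the merge. The parser is deliberately simplified; it skips `Scenario Outline` and `Background`, for example.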

Karate Reviewer

A GitHub Actions bot that reviews Karate .feature files in pull requests. Catches schema-blind matchers, hard-coded auth, missing negative coverage, and embedded JS smells — the failure modes AI agents produce when generating API contract tests.

In design · GitHub Actions · Java / JS · Claude API

Coming soon
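As a rough illustration of three of the failure modes named above, the sketch below scans Karate feature text with regular expressions. The patterns and finding labels are assumptions made for this example, not the reviewer's actual rules; the tool itself is still in design. The Karate syntax it looks for (`header Authorization = ...`, `status 200`, and fuzzy markers such as `#notnull`) is standard Karate, but what counts as a smell here is a hypothetical reading of the description.

```python
import re

# A literal token after Bearer/Basic inside the quotes; 'Bearer ' + token is not flagged.
HARDCODED_AUTH = re.compile(
    r"header\s+Authorization\s*=\s*['\"](Bearer|Basic)\s+[^'\"#]+['\"]",
    re.IGNORECASE,
)
# Status assertions such as "Then status 200" or "* status 404".
STATUS = re.compile(r"^\s*(?:Then|And|\*)\s+status\s+(\d{3})\b", re.MULTILINE)
# Whole-response matches that accept any non-null payload regardless of shape.
SCHEMA_BLIND = re.compile(
    r"match\s+response\s*==\s*['\"]#(notnull|present|ignore)['\"]"
)


def review_karate(feature_text: str) -> list[str]:
    """Return the (hypothetical) smell labels found in one feature file."""
    findings = []
    if HARDCODED_AUTH.search(feature_text):
        findings.append("hard-coded auth")
    codes = [int(code) for code in STATUS.findall(feature_text)]
    # Only success codes asserted anywhere: no 4xx/5xx path is exercised.
    if codes and all(code < 400 for code in codes):
        findings.append("missing negative coverage")
    if SCHEMA_BLIND.search(feature_text):
        findings.append("schema-blind matcher")
    return findings
```

A file that only asserts `status 200`, sends a literal bearer token, and matches the response against `'#notnull'` would collect all three labels. Embedded JS smells are left out because they are hard to detect reliably with patterns this simple.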

Spec Sentinel

A GitHub Actions bot that reviews Playwright .spec.ts files in pull requests. Detects brittle selectors, missing assertions, and test patterns that produce false passes when AI generates the test.

In design · GitHub Actions · TypeScript · Claude API

Coming soon
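The two most mechanical checks mentioned above, missing assertions and brittle selectors, can be approximated as follows. This is a sketch under stated assumptions: it treats a spec file as plain text, splits it at each `test('title', ...)` call, and uses a hypothetical definition of "brittle" (XPath, `:nth-child`, or a CSS chain with two or more child combinators). A production reviewer would more likely parse the TypeScript properly; nothing here is taken from the tool, which is still in design.

```python
import re

# Start of a test block: test('title' ...) or test.only('title' ...).
TEST_START = re.compile(r"\btest(?:\.only)?\(\s*(['\"`])(.+?)\1")
# First string argument of page.locator(...).
LOCATOR = re.compile(r"\.locator\(\s*(['\"`])(.+?)\1")
# Hypothetical notion of "brittle": XPath, nth-child, or deep child chains.
BRITTLE = re.compile(r"(xpath=|//\w+\[|:nth-child\(|\s>\s.+\s>\s)")


def review_spec(source: str) -> dict[str, list[str]]:
    """Map each test title to the issues found in its (approximate) body."""
    findings: dict[str, list[str]] = {}
    starts = list(TEST_START.finditer(source))
    for i, match in enumerate(starts):
        # The body runs until the next test( call or the end of the file.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(source)
        body = source[match.end():end]
        issues = []
        if "expect(" not in body:
            issues.append("no assertion")
        for locator in LOCATOR.finditer(body):
            selector = locator.group(2)
            if BRITTLE.search(selector):
                issues.append(f"brittle selector: {selector}")
        if issues:
            findings[match.group(2)] = issues
    return findings
```

A test with no `expect(` call is the classic false pass: it goes green as long as no step throws, even if the page shows the wrong result.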

family · 03 · detective

After tests run

Diagnostic tools that ingest CI run history and reason over test failures — generating auditable hypotheses rather than asking you to read 4,000 lines of log output.

Flake Detective

Ingests your CI run history and identifies tests that pass and fail non-deterministically. For each flaky test, generates ranked hypotheses about the root cause: timing, shared state, environment dependency, or test ordering.

In design · CLI · GitHub Action · Claude API

Coming soon
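The detection half of this idea can be sketched without any model at all. The example below assumes a simplified run record (test name, commit, pass/fail) and uses one common heuristic: a test is flaky if it has both passed and failed on the same commit, since the code under test did not change between those runs. The `Run` type and the `flake_rate` definition are assumptions for illustration; root-cause hypotheses are out of scope for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    test: str      # test identifier
    commit: str    # commit SHA the CI run executed against
    passed: bool


def find_flaky(runs: list[Run]) -> dict[str, float]:
    """Return {test: flake_rate}, highest rate first.

    flake_rate = commits where the test both passed and failed
                 / commits the test ran on.
    """
    outcomes: dict[str, dict[str, set[bool]]] = defaultdict(lambda: defaultdict(set))
    for run in runs:
        outcomes[run.test][run.commit].add(run.passed)

    flaky = {}
    for test, by_commit in outcomes.items():
        mixed = sum(1 for seen in by_commit.values() if len(seen) == 2)
        if mixed:
            flaky[test] = mixed / len(by_commit)
    return dict(sorted(flaky.items(), key=lambda item: item[1], reverse=True))
```

A test that fails on every run of a commit is not reported: that pattern points to a real regression rather than non-determinism.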

family · 04 · visibility

Making QA work seen

Tools for QA engineers who need to quantify their impact and communicate it upward — to leadership, to peers, to anyone who influences what happens to their career.

QA Score

Five questions across five visibility dimensions. Takes two minutes. Tells you exactly which gap is holding your career back — impact documentation, stakeholder visibility, strategic alignment, leadership language, or upward communication.

Live · Web tool · Free

Take it free

QA Impact Report

Paste your QA metrics and context. Get a leadership-ready impact report — framed in the language engineering leaders and product managers actually respond to. Stops the "what does QA actually do?" question before it starts.

Live · Web tool · Claude API

Generate report

how a sentinel differs

Not a smarter Copilot Review.

A Copilot Review comment tells you a test could be better. A Sentinel tells you which rule it broke, by how much, and blocks the merge until the score clears your threshold.

Dimension	Generalist LLM reviewer	Sentinel family
Rubric	Implicit, lives in the model	Published, versioned, debatable
Calibration	Generic “good code”	AI-mistake failure modes
Output	Advisory comment	Structured score + gate
Determinism	Drifts with model updates	Pinned rubric, reproducible
Governance	“The LLM thought so”	“Rule 7 of rubric v1.4”