Quality checks

The check comes first. muster comes second.

Here’s the thing: if you’re running an invoice processing agent in production and you’re not checking whether the totals add up — that’s a problem that exists before muster enters the picture. Any responsible engineering team validates their agent’s outputs. Not for an observability tool. For themselves. For their users. For their risk team.

# This code belongs in your codebase regardless of muster
line_items_sum = sum(item["total"] for item in result["line_items"])
subtotal_ok = abs(line_items_sum - result["stated_subtotal"]) < 0.01
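
The snippet above compares floats with a small tolerance, which is usually enough. If your extracted amounts are currency values, a variant using Decimal avoids binary rounding surprises. The following is a minimal sketch, assuming `result` is a dict shaped like the one above; the helper name `subtotal_matches` is hypothetical.

```python
from decimal import Decimal


def subtotal_matches(result: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True if the line-item totals sum to the stated subtotal."""
    # Convert via str() so a float such as 0.1 becomes exactly Decimal("0.1").
    line_items_sum = sum(
        (Decimal(str(item["total"])) for item in result["line_items"]),
        Decimal("0"),
    )
    stated = Decimal(str(result["stated_subtotal"]))
    return abs(line_items_sum - stated) < tolerance


# Illustrative data only.
result = {
    "line_items": [{"total": 0.1}, {"total": 0.2}, {"total": 19.99}],
    "stated_subtotal": 20.29,
}
subtotal_ok = subtotal_matches(result)
```

The tolerance matches the 0.01 used above; tighten or loosen it to suit your currency and rounding rules.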

Without muster, that check runs, you might log the result somewhere, and you trawl through logs by hand every few days to see if anything looks off. muster changes what you do with the result, not what the check is.

# Same check. One extra call to tell muster about it.
muster_emit(job_id, checks=[{
    "check_id": "subtotal_arithmetic",
    "severity": "HIGH",
    "passed": subtotal_ok,
}])

Now instead of reading logs, you get: trend charts, degradation alerts, fleet-wide pass rates, anomaly detection, and benchmarks against peers — across every agent you operate.

The unit test analogy

You don’t write unit tests for your CI system. You write them because untested code is risky. The CI system just runs them automatically and tells you when something breaks. Same here. You don’t write quality checks for muster. You write them because unmonitored AI agents are risky. muster just aggregates the results and tells you when something degrades.

What good output validation looks like

For an invoice processing agent

result = llm.extract(invoice_pdf_text)

checks = [
    # Did we get anything?
    {"check_id": "output_not_empty",       "severity": "HIGH",   "passed": bool(result)},
    # Do the line items add up?
    {"check_id": "subtotal_arithmetic",    "severity": "HIGH",   "passed": line_sum_ok,
     "expected": str(stated_subtotal),     "actual": str(computed_sum)},
    # Does subtotal + tax = grand total?
    {"check_id": "grand_total_arithmetic", "severity": "HIGH",   "passed": grand_total_ok},
    # Is the date format valid?
    {"check_id": "invoice_date_valid",     "severity": "MEDIUM", "passed": date_ok},
]
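
The list above refers to `line_sum_ok`, `grand_total_ok`, `date_ok`, `stated_subtotal` and `computed_sum` without showing where they come from. One way to compute them might look like the sketch below. The field names `tax`, `grand_total` and `invoice_date`, the ISO date format, and the helper names are assumptions for illustration; adapt them to whatever your extraction step actually returns.

```python
from datetime import date


def _close(a, b, tol=0.01):
    # A missing value counts as a failed comparison, not a pass.
    return a is not None and b is not None and abs(a - b) < tol


def invoice_checks(result: dict) -> list[dict]:
    """Build the check results for one extracted invoice."""
    items = result.get("line_items") or []
    computed_sum = sum(item["total"] for item in items) if items else None
    stated_subtotal = result.get("stated_subtotal")
    tax = result.get("tax")                  # hypothetical field name
    grand_total = result.get("grand_total")  # hypothetical field name

    line_sum_ok = _close(computed_sum, stated_subtotal)
    expected_grand = (
        stated_subtotal + tax
        if stated_subtotal is not None and tax is not None
        else None
    )
    grand_total_ok = _close(expected_grand, grand_total)

    try:
        # Assumes ISO dates (YYYY-MM-DD); swap in your own parser if needed.
        date.fromisoformat(result.get("invoice_date", ""))
        date_ok = True
    except (TypeError, ValueError):
        date_ok = False

    return [
        {"check_id": "output_not_empty", "severity": "HIGH", "passed": bool(result)},
        {"check_id": "subtotal_arithmetic", "severity": "HIGH", "passed": line_sum_ok,
         "expected": str(stated_subtotal), "actual": str(computed_sum)},
        {"check_id": "grand_total_arithmetic", "severity": "HIGH", "passed": grand_total_ok},
        {"check_id": "invoice_date_valid", "severity": "MEDIUM", "passed": date_ok},
    ]


# Illustrative data only.
sample = {
    "line_items": [{"total": 40.0}, {"total": 60.0}],
    "stated_subtotal": 100.0,
    "tax": 20.0,
    "grand_total": 120.0,
    "invoice_date": "2024-03-15",
}
checks = invoice_checks(sample)
```

Treating a missing field as a failure is a deliberate choice here: an extraction that silently drops the subtotal should show up as a failed check rather than a vacuous pass.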

For a decision-making agent (loan approval, fraud flag)

# Assumes agent.decide returns the decision label and its free-text rationale
decision, decision_rationale = agent.decide(application)

checks = [
    # Is the decision one of the expected values?
    {"check_id": "decision_is_valid_enum",  "severity": "HIGH",
     "passed": decision in {"APPROVE", "REJECT", "ESCALATE"}},
    # Did it explain itself?
    {"check_id": "decision_has_rationale",  "severity": "HIGH",
     "passed": len(decision_rationale) > 50},
    # Did it not refuse to answer? (Check the free text, not the enum value.)
    {"check_id": "no_refusal_in_output",    "severity": "HIGH",
     "passed": "cannot" not in decision_rationale.lower()
               and "unable" not in decision_rationale.lower()},
]
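
A plain substring test for "cannot" or "unable" can misfire on a legitimate rationale such as "the applicant cannot demonstrate income". A slightly stricter variant matches first-person refusal phrasing only. The sketch below is one possible approach, not a complete refusal detector; the phrase list and the function name `decision_checks` are hypothetical, and it assumes the decision label and the rationale text are available as separate strings.

```python
import re

VALID_DECISIONS = {"APPROVE", "REJECT", "ESCALATE"}

# Hypothetical phrase list; tune it to the refusals your model actually produces.
REFUSAL_PATTERN = re.compile(
    r"\b(i|we)\s+(cannot|can't|am unable to|are unable to|won't)\b|\bas an ai\b",
    re.IGNORECASE,
)


def decision_checks(decision: str, rationale: str) -> list[dict]:
    """Build the check results for one decision and its rationale."""
    return [
        {"check_id": "decision_is_valid_enum", "severity": "HIGH",
         "passed": decision in VALID_DECISIONS},
        {"check_id": "decision_has_rationale", "severity": "HIGH",
         "passed": len(rationale.strip()) > 50},
        # Only first-person refusal phrasing fails this check.
        {"check_id": "no_refusal_in_output", "severity": "HIGH",
         "passed": REFUSAL_PATTERN.search(rationale) is None},
    ]


# Illustrative inputs only.
ok = decision_checks(
    "REJECT",
    "The applicant cannot demonstrate stable income over the past 24 months.",
)
refused = decision_checks("ESCALATE", "I cannot help with that request.")
```

In this example the first rationale passes the refusal check even though it contains "cannot", while the second fails it.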

For a document summarisation agent

summary = agent.summarise(document)

checks = [
    {"check_id": "output_not_empty",     "severity": "HIGH",   "passed": bool(summary)},
    # Is the summary meaningfully shorter than the source?
    {"check_id": "compression_achieved", "severity": "MEDIUM",
     "passed": len(summary) < len(document) * 0.4},
    # Does it avoid hallucinated citations?
    {"check_id": "no_fabricated_urls",   "severity": "HIGH",
     "passed": not contains_urls_not_in_source(summary, document)},
]
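
The `contains_urls_not_in_source` helper used above is not defined in this page. A minimal sketch of what it might do, assuming an exact string match between URLs in the summary and URLs in the source is good enough for your case:

```python
import re

# Matches http(s) URLs up to whitespace or a common closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def extract_urls(text: str) -> set[str]:
    """Return the set of URLs found in text."""
    # Strip trailing punctuation that often follows a URL in prose.
    return {match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)}


def contains_urls_not_in_source(summary: str, document: str) -> bool:
    """True if the summary cites any URL that does not appear in the document."""
    return bool(extract_urls(summary) - extract_urls(document))


# Illustrative inputs only.
document = "See https://example.com/report for details."
faithful = "The report (https://example.com/report) covers Q3."
fabricated = "More at https://example.org/made-up."
```

Because the comparison is exact, a URL that differs only by a trailing slash or letter case is treated as fabricated; normalise URLs first if that is too strict.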

What muster adds

Once these checks are emitting to muster, you get things that are impractical to build and maintain yourself across a fleet of 20+ agents:

Without muster	With muster
Log files per agent	Fleet-wide heatmap — all agents, all checks, one view
Manual review to spot degradation	Automatic alerts when pass rate drops by >15%
No idea which check is failing most	Sorted by worst-performing check across all agents
No external reference point	Benchmark comparisons against similar agents (opt-in)
Ops team reads logs reactively	Finance team catches invoice errors before payment runs
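
To make the "pass rate drops" row concrete, the comparison behind such an alert can be sketched as follows. This is an illustration of the idea only, not muster's actual algorithm; it assumes the 15% threshold means 15 percentage points between a baseline window and a recent window, and the function names are hypothetical.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of results that passed; 0.0 for an empty window."""
    return sum(results) / len(results) if results else 0.0


def degraded(baseline: list[bool], recent: list[bool], max_drop: float = 0.15) -> bool:
    """True if the recent pass rate is more than max_drop below the baseline."""
    if not baseline or not recent:
        # Not enough data to compare, so do not raise an alert.
        return False
    return pass_rate(baseline) - pass_rate(recent) > max_drop


# Illustrative windows: 19/20 passing last week, 15/20 passing this week.
baseline = [True] * 19 + [False]
recent = [True] * 15 + [False] * 5
alert = degraded(baseline, recent)
```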

The key principle

Your check logic is yours. muster receives only the result you send: pass or fail, severity, and any expected and actual values you choose to include. It never sees your prompts, your model, or your underlying data. That means you can write checks for anything your business cares about, and muster tracks them all without knowing anything proprietary about your agent.
