Documentation Index
Fetch the complete documentation index at: https://docs.getmuster.io/llms.txt
Use this file to discover all available pages before exploring further.
The check comes first. muster comes second.
Here’s the thing: if you’re running an invoice processing agent in production and you’re not checking whether the totals add up — that’s a problem that exists before muster enters the picture.
Any responsible engineering team validates their agent’s outputs. Not for an observability tool. For themselves. For their users. For their risk team.
# This code belongs in your codebase regardless of muster
line_items_sum = sum(item["total"] for item in result["line_items"])
subtotal_ok = abs(line_items_sum - result["stated_subtotal"]) < 0.01
Without muster, that check runs, you maybe log it somewhere, and you manually trawl through logs every few days to see if anything looks off.
muster changes what you do with the result — not what the check is.
# Same check. One extra line to tell muster about it.
muster_emit(job_id, checks=[{
"check_id": "subtotal_arithmetic",
"severity": "HIGH",
"passed": subtotal_ok,
}])
Now instead of reading logs, you get: trend charts, degradation alerts, fleet-wide pass rates, anomaly detection, and benchmarks against peers — across every agent you operate.
The unit test analogy
You don’t write unit tests for your CI system. You write them because untested code is risky. The CI system just runs them automatically and tells you when something breaks.
Same here. You don’t write quality checks for muster. You write them because unmonitored AI agents are risky. muster just aggregates the results and tells you when something degrades.
What good output validation looks like
For an invoice processing agent
result = llm.extract(invoice_pdf_text)
checks = [
# Did we get anything?
{"check_id": "output_not_empty", "severity": "HIGH", "passed": bool(result)},
# Do the line items add up?
{"check_id": "subtotal_arithmetic", "severity": "HIGH", "passed": line_sum_ok,
"expected": str(stated_subtotal), "actual": str(computed_sum)},
# Does subtotal + tax = grand total?
{"check_id": "grand_total_arithmetic", "severity": "HIGH", "passed": grand_total_ok},
# Is the date format valid?
{"check_id": "invoice_date_valid", "severity": "MEDIUM", "passed": date_ok},
]
For a decision-making agent (loan approval, fraud flag)
decision = agent.decide(application)
checks = [
# Is the decision one of the expected values?
{"check_id": "decision_is_valid_enum", "severity": "HIGH",
"passed": decision in {"APPROVE", "REJECT", "ESCALATE"}},
# Did it explain itself?
{"check_id": "decision_has_rationale", "severity": "HIGH",
"passed": len(decision_rationale) > 50},
# Did it not refuse to answer?
{"check_id": "no_refusal_in_output", "severity": "HIGH",
"passed": "cannot" not in decision.lower() and "unable" not in decision.lower()},
]
For a document summarisation agent
summary = agent.summarise(document)
checks = [
{"check_id": "output_not_empty", "severity": "HIGH", "passed": bool(summary)},
# Is the summary meaningfully shorter than the source?
{"check_id": "compression_achieved", "severity": "MEDIUM",
"passed": len(summary) < len(document) * 0.4},
# Does it avoid hallucinated citations?
{"check_id": "no_fabricated_urls", "severity": "HIGH",
"passed": not contains_urls_not_in_source(summary, document)},
]
What muster adds
Once these checks are emitting to muster, you get things that are impossible to build yourself across a fleet of 20+ agents:
| Without muster | With muster |
|---|
| Log files per agent | Fleet-wide heatmap — all agents, all checks, one view |
| Manual review to spot degradation | Automatic alerts when pass rate drops by >15% |
| No idea which check is failing most | Sorted by worst-performing check across all agents |
| No external reference point | Benchmark comparisons against similar agents (opt-in) |
| Ops team reads logs reactively | Finance team catches invoice errors before payment runs |
The key principle
Your check logic is yours. muster receives the result — pass or fail, expected vs actual, severity. It never sees your prompts, your model, your data. It just sees whether the check passed.
That means you can write checks for anything your business cares about — and muster tracks them all without knowing anything proprietary about your agent.