Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.getmuster.io/llms.txt

Use this file to discover all available pages before exploring further.

The check comes first. muster comes second.

Here’s the thing: if you’re running an invoice processing agent in production and you’re not checking whether the totals add up — that’s a problem that exists before muster enters the picture. Any responsible engineering team validates their agent’s outputs. Not for an observability tool. For themselves. For their users. For their risk team.
# This code belongs in your codebase regardless of muster
line_items_sum = sum(item["total"] for item in result["line_items"])
subtotal_ok = abs(line_items_sum - result["stated_subtotal"]) < 0.01
Without muster, that check runs, you maybe log it somewhere, and you manually trawl through logs every few days to see if anything looks off. muster changes what you do with the result — not what the check is.
# Same check. One extra line to tell muster about it.
muster_emit(job_id, checks=[{
    "check_id": "subtotal_arithmetic",
    "severity": "HIGH",
    "passed": subtotal_ok,
}])
Now instead of reading logs, you get: trend charts, degradation alerts, fleet-wide pass rates, anomaly detection, and benchmarks against peers — across every agent you operate.

The unit test analogy

You don’t write unit tests for your CI system. You write them because untested code is risky. The CI system just runs them automatically and tells you when something breaks. Same here. You don’t write quality checks for muster. You write them because unmonitored AI agents are risky. muster just aggregates the results and tells you when something degrades.

What good output validation looks like

For an invoice processing agent

result = llm.extract(invoice_pdf_text)

checks = [
    # Did we get anything?
    {"check_id": "output_not_empty",       "severity": "HIGH",   "passed": bool(result)},
    # Do the line items add up?
    {"check_id": "subtotal_arithmetic",    "severity": "HIGH",   "passed": line_sum_ok,
     "expected": str(stated_subtotal),     "actual": str(computed_sum)},
    # Does subtotal + tax = grand total?
    {"check_id": "grand_total_arithmetic", "severity": "HIGH",   "passed": grand_total_ok},
    # Is the date format valid?
    {"check_id": "invoice_date_valid",     "severity": "MEDIUM", "passed": date_ok},
]

For a decision-making agent (loan approval, fraud flag)

decision = agent.decide(application)

checks = [
    # Is the decision one of the expected values?
    {"check_id": "decision_is_valid_enum",  "severity": "HIGH",
     "passed": decision in {"APPROVE", "REJECT", "ESCALATE"}},
    # Did it explain itself?
    {"check_id": "decision_has_rationale",  "severity": "HIGH",
     "passed": len(decision_rationale) > 50},
    # Did it not refuse to answer?
    {"check_id": "no_refusal_in_output",    "severity": "HIGH",
     "passed": "cannot" not in decision.lower() and "unable" not in decision.lower()},
]

For a document summarisation agent

summary = agent.summarise(document)

checks = [
    {"check_id": "output_not_empty",     "severity": "HIGH",   "passed": bool(summary)},
    # Is the summary meaningfully shorter than the source?
    {"check_id": "compression_achieved", "severity": "MEDIUM",
     "passed": len(summary) < len(document) * 0.4},
    # Does it avoid hallucinated citations?
    {"check_id": "no_fabricated_urls",   "severity": "HIGH",
     "passed": not contains_urls_not_in_source(summary, document)},
]

What muster adds

Once these checks are emitting to muster, you get things that are impossible to build yourself across a fleet of 20+ agents:
Without musterWith muster
Log files per agentFleet-wide heatmap — all agents, all checks, one view
Manual review to spot degradationAutomatic alerts when pass rate drops by >15%
No idea which check is failing mostSorted by worst-performing check across all agents
No external reference pointBenchmark comparisons against similar agents (opt-in)
Ops team reads logs reactivelyFinance team catches invoice errors before payment runs

The key principle

Your check logic is yours. muster receives the result — pass or fail, expected vs actual, severity. It never sees your prompts, your model, your data. It just sees whether the check passed. That means you can write checks for anything your business cares about — and muster tracks them all without knowing anything proprietary about your agent.