System Observability
Quick Reference
- Who: Jordan the MLOps Specialist
- Where: `observability/orchestrator.log` and SQLite
- Time: ~5-10 minutes for investigation
- Key Tool: `grep` and SQL queries
Prerequisites
- [ ] Access to the `storage/` and `observability/` directories
- [ ] Basic knowledge of SQL
Step-by-Step Guide
Step 1: Trace a Workflow
Each mission is assigned a unique `trace_id` (e.g., `tr_a1b2c3d4`). Search for all events related to a specific trace in the log:

```bash
grep "tr_a1b2c3d4" observability/orchestrator.log
```

Look for the `event_type` field to identify the lifecycle stage:
- `WORKFLOW_START`
- `AGENT_ACTION`
- `FINAL_DECISION`
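When a trace produces many events, grouping them by lifecycle stage can be easier than reading raw `grep` output. The following is a minimal sketch, assuming each log line is a single JSON object with `trace_id` and `event_type` keys; the function name `events_for_trace` is hypothetical and not part of the system.

```python
import json
from collections import defaultdict


def events_for_trace(lines, trace_id):
    """Group JSON log records for one trace by event_type.

    Assumes one JSON object per line (an assumption about the log format).
    """
    grouped = defaultdict(list)
    for line in lines:
        if trace_id not in line:  # cheap pre-filter, same idea as grep
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            grouped[record.get("event_type", "UNKNOWN")].append(record)
    return dict(grouped)
```

To use it, open `observability/orchestrator.log` and pass the file object as `lines`; a complete trace would be expected to contain at least one `WORKFLOW_START` and one `FINAL_DECISION` entry.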
Step 2: Audit Agent Reliability
The system tracks how often agents are overridden or vetoed.
Query the `agent_opinions` table to see current reliability scores (this query reports the average confidence per agent):

```bash
python3 -c "import sqlite3; conn = sqlite3.connect('storage/memory.db'); print(conn.execute('SELECT agent_name, AVG(confidence) FROM agent_opinions GROUP BY agent_name').fetchall())"
```
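For repeated audits, the same query may be easier to maintain as a small script. This sketch assumes only the `agent_opinions` table and the `agent_name` and `confidence` columns used in the query above; the function name `average_confidence` and the ascending sort are illustrative choices, not part of the system.

```python
import sqlite3


def average_confidence(conn: sqlite3.Connection):
    """Return (agent_name, average confidence) rows, lowest average first."""
    return conn.execute(
        "SELECT agent_name, AVG(confidence) AS avg_confidence "
        "FROM agent_opinions "
        "GROUP BY agent_name "
        "ORDER BY avg_confidence ASC"
    ).fetchall()
```

Call it with `sqlite3.connect("storage/memory.db")` and close the connection afterwards. Sorting lowest first puts the agents most likely to need attention at the top of the output.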
Step 3: Inspect Model Failover
- If you suspect high latency or model errors, check for fallback events.
- Search for "Switching to fallback model" in the logs.
INFO: Model routing includes a 5-minute cooldown period for unhealthy models (agents/model_router.py:15).
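The fallback search can also be scripted so that each match keeps its line number for later inspection. This is a sketch that relies only on the message text quoted above; `find_fallback_events` is a hypothetical helper name.

```python
FALLBACK_MARKER = "Switching to fallback model"


def find_fallback_events(lines, marker=FALLBACK_MARKER):
    """Return (line_number, line) pairs for log lines containing the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if marker in line
    ]
```

Pass an open handle on `observability/orchestrator.log` as `lines`. Many matches within a short window may indicate that a model is repeatedly being marked unhealthy.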
Expected Results
- ✅ Every agent action is timestamped and logged with a trace ID.
- ✅ Failures are captured with full stack traces or error payloads.
- ✅ Multi-pass JSON parsing attempts are logged for debugging.
Troubleshooting
🔴 Error: JSON logs are unreadable
Cause: Log rotation or corrupted writes.
Solution:
- Use a JSON formatter like `jq` to prettify the output:

  ```bash
  tail -f observability/orchestrator.log | jq .
  ```
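If corrupted writes are the cause, `jq` may stop at the first malformed line. A tolerant reader can separate the parseable records from the damaged ones. The sketch below assumes one JSON object per log line; `split_valid_and_corrupt` is a hypothetical name.

```python
import json


def split_valid_and_corrupt(lines):
    """Separate parseable JSON log lines from unparseable ones.

    Returns (records, corrupt), where corrupt holds (line_number, text) pairs.
    """
    records, corrupt = [], []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        try:
            records.append(json.loads(text))
        except json.JSONDecodeError:
            corrupt.append((number, text))
    return records, corrupt
```

The line numbers in `corrupt` show where the damaged entries sit in the file, which can help tell an isolated bad write from damage around a rotation boundary.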
FAQ
Q: Where can I see the full reasoning trace?
A: It is stored in the decisions table as a JSON blob in the reasoning_trace column. (memory/manager.py:145)
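As an illustration, the stored traces could be decoded with a few lines of Python. This sketch assumes only the `decisions` table and `reasoning_trace` column named above, and that the column holds JSON text; `load_reasoning_traces` is a hypothetical helper, and no other columns of the table are assumed.

```python
import json
import sqlite3


def load_reasoning_traces(conn: sqlite3.Connection, limit: int = 5):
    """Return up to `limit` decoded reasoning_trace values from decisions."""
    rows = conn.execute(
        "SELECT reasoning_trace FROM decisions LIMIT ?", (limit,)
    ).fetchall()
    # Skip NULL entries; decode each JSON value into Python objects.
    return [json.loads(blob) for (blob,) in rows if blob is not None]
```

Call it with `sqlite3.connect("storage/memory.db")`, assuming the `decisions` table lives in the same database as `agent_opinions`.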