Debug a Production Incident
From alert to root cause in under 30 seconds.
Scenario: a user reports that checkout failed. You have no idea why.
Step 1 — Find the failing request
Stream live requests and spot the failure:
$ flux tail
POST /checkout 201 120ms req:4f9a3b2c
POST /checkout 500 44ms req:550e8400
└─ Error: Stripe API timeout
Copy the request ID from the failing request's line: 550e8400 (the value after req:).
If the incident happened in the past, filter by time:
$ flux tail --since 1h --filter status=500
POST /checkout 500 44ms req:550e8400 14:22:01
POST /checkout 500 51ms req:7a8b9c0d 14:22:44
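When many requests fail at once, it can help to collect the failing request IDs programmatically. The sketch below parses lines shaped like the sample output above. The line format is inferred from those samples rather than from a documented spec, and `parseTailLine` and `failingRequestIds` are hypothetical helper names, not part of flux.

```typescript
// Hypothetical helpers: the line format is assumed from the sample
// `flux tail` output above, not from a documented specification.
interface TailEntry {
  method: string;
  path: string;
  status: number;
  durationMs: number;
  requestId: string;
}

// METHOD PATH STATUS DURATIONms req:ID [optional trailing fields]
const TAIL_LINE = /^(\S+)\s+(\S+)\s+(\d{3})\s+(\d+)ms\s+req:(\S+)/;

function parseTailLine(line: string): TailEntry | null {
  const m = TAIL_LINE.exec(line.trim());
  if (m === null) return null; // e.g. an indented error detail line
  return {
    method: m[1],
    path: m[2],
    status: Number(m[3]),
    durationMs: Number(m[4]),
    requestId: m[5],
  };
}

// Returns the IDs of every request that ended with a 5xx status.
function failingRequestIds(output: string): string[] {
  return output
    .split("\n")
    .map((line) => parseTailLine(line))
    .filter((e): e is TailEntry => e !== null && e.status >= 500)
    .map((e) => e.requestId);
}

const sample = [
  "POST /checkout 201 120ms req:4f9a3b2c",
  "POST /checkout 500 44ms req:550e8400 14:22:01",
  "POST /checkout 500 51ms req:7a8b9c0d 14:22:44",
].join("\n");

console.log(failingRequestIds(sample)); // [ '550e8400', '7a8b9c0d' ]
```

Each returned ID can then be passed to the commands in the following steps.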
Step 2 — Get the root cause
$ flux why 550e8400
ROOT CAUSE Stripe API timeout after 10s
LOCATION payments/create.ts:42
DATA CHANGES users.id=42 plan: free → null (rolled back)
SUGGESTION → Add 5s timeout + idempotency key retry
flux why reads the execution record and returns the first error in the span tree along with the exact file and line, any DB mutations that were made (including rollbacks), and a suggested fix.
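The suggested fix in the sample output (a 5s timeout plus a retry with an idempotency key) could look roughly like the sketch below. It is a minimal illustration under assumptions: `charge` stands in for whatever payment call lives in payments/create.ts, and its signature, `withTimeout`, and `chargeWithRetry` are all hypothetical names, not flux or Stripe APIs.

```typescript
// Hypothetical signature: a payment call that accepts an idempotency key.
type Charge<T> = (idempotencyKey: string) => Promise<T>;

function timeoutError(ms: number): Error {
  const err = new Error(`timed out after ${ms}ms`);
  err.name = "TimeoutError";
  return err;
}

// Rejects if `work` has not settled within `ms`. The underlying call is NOT
// cancelled and may still complete, which is why the retry below must reuse
// the same idempotency key.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(timeoutError(ms)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function chargeWithRetry<T>(
  charge: Charge<T>,
  idempotencyKey: string,
  timeoutMs = 5000,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Same key on every attempt, so a provider that deduplicates on the
      // key will not charge twice if an earlier attempt actually succeeded.
      return await withTimeout(charge(idempotencyKey), timeoutMs);
    } catch (err) {
      const isTimeout = err instanceof Error && err.name === "TimeoutError";
      if (!isTimeout) throw err; // only timeouts are retried in this sketch
      lastError = err;
    }
  }
  throw lastError;
}
```

Retrying only on timeouts is a deliberate choice here: other failures, such as a declined card, would normally fail the same way on every attempt.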
Step 3 (optional) — Inspect the full execution
If you need more context — latencies, upstream spans, or the full input/output at each step:
$ flux trace 550e8400
gateway 2ms
└─ create_order 8ms
   ├─ db.insert(orders) 4ms
   ├─ stripe.charge 180ms ← timeout here
   └─ send_slack — skipped (upstream error)
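To make the idea of "the first error in the span tree" concrete, the trace above can be modeled as a small tree and searched. This is only one plausible reading, not flux's actual implementation: the `Span` shape and `firstError` are hypothetical, and the search checks children before their parent so that an error propagated upward does not mask the span where it originated.

```typescript
// Hypothetical model of a trace; field names are illustrative only.
interface Span {
  name: string;
  durationMs?: number;
  error?: string;
  children: Span[];
}

// Children first, in execution order, then the span itself: returns the
// earliest, deepest span that recorded an error, or null if none did.
function firstError(span: Span): Span | null {
  for (const child of span.children) {
    const hit = firstError(child);
    if (hit !== null) return hit;
  }
  return span.error !== undefined ? span : null;
}

// Mirrors the sample trace output above.
const trace: Span = {
  name: "gateway",
  durationMs: 2,
  children: [
    {
      name: "create_order",
      durationMs: 8,
      children: [
        { name: "db.insert(orders)", durationMs: 4, children: [] },
        { name: "stripe.charge", durationMs: 180, error: "Stripe API timeout", children: [] },
        { name: "send_slack", children: [] }, // skipped, so no duration
      ],
    },
  ],
};

console.log(firstError(trace)?.name); // stripe.charge
```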
Step through interactively:
$ flux trace debug 550e8400
Step 1/4 gateway
──────────────────────────────────
Input: POST /checkout { ... }
Output: { tenant_id: "t_123" }
Time: 2ms
↓ next ↑ prev e expand q quit
What to do next
Once you understand the root cause:
- Fix the code locally
- Replay the incident against your fix before deploying
- Deploy with flux deploy
- Monitor with flux tail to confirm the fix holds