Debug a Production Incident

From alert to root cause in under 30 seconds.

Scenario: a user reports that checkout failed. You have no idea why.

Step 1 — Find the failing request

Stream live requests and spot the failure:

$ flux tail

  POST /checkout    201  120ms  req:4f9a3b2c
  POST /checkout    500   44ms  req:550e8400
     └─ Error: Stripe API timeout

Copy the request ID from the failing request's line (the value after req:): 550e8400.

If the incident happened in the past, filter by time:

$ flux tail --since 1h --filter status=500

  POST /checkout  500  44ms  req:550e8400  14:22:01
  POST /checkout  500  51ms  req:7a8b9c0d  14:22:44

Step 2 — Get the root cause

$ flux why 550e8400

  ROOT CAUSE   Stripe API timeout after 10s
  LOCATION     payments/create.ts:42
  DATA CHANGES users.id=42  plan: free → null  (rolled back)
  SUGGESTION   → Add 5s timeout + idempotency key retry

flux why reads the request's execution record and reports the first error in the span tree, the exact file and line where it occurred, any DB mutations the request made (including ones that were rolled back), and a suggested fix.
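
To make "first error in the span tree" concrete, here is a minimal sketch of one plausible way to locate it: walk the tree depth-first and return the deepest failing span that ran earliest. The Span shape, the firstError function, and the sample data are assumptions for illustration only; they are not flux's actual record format or implementation.

```typescript
// Hypothetical span shape; flux's real execution record may look different.
interface Span {
  name: string;
  durationMs: number;
  error?: string;
  children?: Span[];
}

// Depth-first walk that checks children (in execution order) before the
// span itself, so a failing leaf is reported ahead of the parent that
// merely propagated its error.
function firstError(span: Span): Span | undefined {
  for (const child of span.children ?? []) {
    const hit = firstError(child);
    if (hit !== undefined) return hit;
  }
  return span.error !== undefined ? span : undefined;
}

// Sample tree modeled on the checkout scenario in this guide.
const trace: Span = {
  name: "gateway",
  durationMs: 2,
  children: [
    {
      name: "create_order",
      durationMs: 8,
      children: [
        { name: "db.insert(orders)", durationMs: 4 },
        { name: "stripe.charge", durationMs: 180, error: "Stripe API timeout" },
      ],
    },
  ],
};

const culprit = firstError(trace);
console.log(culprit?.name, culprit?.error); // stripe.charge Stripe API timeout
```

Checking children before the parent is a design choice in this sketch: it favors the innermost failure, which is usually closer to the root cause than the enclosing span.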

Step 3 (optional) — Inspect the full execution

If you need more context, such as latencies, upstream spans, or the full input/output at each step, print the full trace:

$ flux trace 550e8400

  gateway                   2ms
  └─ create_order           8ms
     ├─ db.insert(orders)   4ms
     ├─ stripe.charge     180ms  ← timeout here
     └─ send_slack          — skipped (upstream error)

Step through interactively:

$ flux trace debug 550e8400

  Step 1/4  gateway
  ──────────────────────────────────
  Input:   POST /checkout  { ... }
  Output:  { tenant_id: "t_123" }
  Time:    2ms

  ↓ next  ↑ prev  e expand  q quit

What to do next

Once you understand the root cause:

  • Fix the code locally
  • Replay the incident against your fix before deploying
  • Deploy with flux deploy
  • Monitor with flux tail to confirm the fix holds
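
For the Stripe timeout in this scenario, the suggested fix (a 5s timeout plus a retry that reuses an idempotency key) could look roughly like the sketch below. All names here (withTimeout, chargeWithRetry, the attempt callback) are hypothetical; nothing in it comes from flux or the Stripe SDK, and the real call in payments/create.ts would be passed in as the attempt function.

```typescript
// Hypothetical helpers: not part of flux or any payment SDK.
type Attempt<T> = (idempotencyKey: string) => Promise<T>;

function timeoutError(ms: number): Error {
  const err = new Error(`timed out after ${ms}ms`);
  err.name = "TimeoutError";
  return err;
}

function isTimeout(err: unknown): boolean {
  return err instanceof Error && err.name === "TimeoutError";
}

// Rejects if `work` has not settled within `ms` milliseconds.
// This stops waiting; it does not cancel the underlying request.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(timeoutError(ms)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Retries only on timeout and sends the same idempotency key on every
// attempt, so the provider can deduplicate a charge that actually
// succeeded after the caller stopped waiting.
async function chargeWithRetry<T>(
  attempt: Attempt<T>,
  idempotencyKey: string,
  timeoutMs: number = 5000,
  maxAttempts: number = 3,
): Promise<T> {
  let lastError: unknown = timeoutError(timeoutMs);
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await withTimeout(attempt(idempotencyKey), timeoutMs);
    } catch (err) {
      if (!isTimeout(err)) throw err; // real failures are not retried
      lastError = err;
    }
  }
  throw lastError;
}
```

The key is taken as a parameter instead of being generated inside the helper so that it can be derived from something stable, such as the order ID, and stay the same across process restarts as well as retries.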
