Incident Response Plan

Created: 2026-04-17
Owner: Joe Etherage
Status: Active (minimum viable plan for Gate 1/Gate 2)
Linear: NQU-420

How You'll Know Something Is Wrong

Nquiry has two alerting channels. Both send to joe.etherage@gmail.com.

CloudWatch Alarms → SNS → Email:

RDS CPU > 80% for 5 min
RDS free storage < 5 GB
RDS connections > 80% of max
Redis CPU and memory alarms
Bedrock throttling > 10 requests / 5 min
Bedrock p90 latency above threshold for 15+ min
Bedrock server errors above threshold in 5 min
Audit logging failure (compliance-critical)
Auth failure spike (possible brute force)
Rate limit spike (possible DoS)
Synthetics canary failure (app.nquiry.ai unreachable)
Pending NQU-601: ALB HealthyHostCount < 1, ALB 5xx spike, ECS CPU/memory

Sentry → Email:

Unhandled JavaScript exceptions (client and server)
API route errors

Not covered yet (acceptable for Gate 1/2):

No PagerDuty or phone alerting — email only
No on-call rotation — Joe is the sole responder
No formal SLA — F&F users understand it's early

Severity Levels

SEV-1 — App is down or data is at risk. Users cannot access app.nquiry.ai, or data integrity is compromised. Drop everything.

SEV-2 — Core feature is broken. Users can log in but can't generate analyses, upload evidence, or export reports. Fix within hours.

SEV-3 — Degraded experience. Slow responses, intermittent errors, non-critical feature broken. Fix within 1 business day.

Failure Mode Runbooks

1. ECS Task Down (App Unreachable)

Detection: Synthetics canary alarm, ALB HealthyHostCount < 1 (after NQU-601), or user report that app.nquiry.ai returns 502/503.

Severity: SEV-1

First response (5 minutes):

Check ECS console — is the task running?

aws ecs describe-services \
  --cluster invapp-dev-cluster \
  --services invapp-dev-app-service \
  --query 'services[0].{desired:desiredCount,running:runningCount,status:status}'

If task count is 0 or task is in STOPPED state, check why:

aws ecs describe-services \
  --cluster invapp-dev-cluster \
  --services invapp-dev-app-service \
  --query 'services[0].events[:5]'

Check stopped task reason:

aws ecs list-tasks --cluster invapp-dev-cluster --service-name invapp-dev-app-service --desired-status STOPPED
aws ecs describe-tasks --cluster invapp-dev-cluster --tasks <task-id> \
  --query 'tasks[0].{stopped:stoppedReason,stopCode:stopCode,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}'

Common causes and fixes:

Cause	Fix
OOM kill (container exceeded memory)	Task will auto-restart. If recurring, increase memory in ECS module.
Failed health check	Check `/api/health` — if the app is crashing on startup, check recent deploy. Roll back (see below).
Bad deploy	Roll back to previous task definition revision.
ECR image pull failure	Check ECR — image may have been deleted. Redeploy from last known-good SHA.

Rollback — Option A (fast, ~3 minutes):

# Find current task def revision
aws ecs describe-services --cluster invapp-dev-cluster --services invapp-dev-app-service \
  --query 'services[0].taskDefinition'

# List recent revisions
aws ecs list-task-definitions --family-prefix invapp-dev-app --sort DESC --max-items 5

# Roll back to previous revision
aws ecs update-service \
  --cluster invapp-dev-cluster \
  --service invapp-dev-app-service \
  --task-definition invapp-dev-app:<previous-revision> \
  --force-new-deployment

# Wait for stabilization
aws ecs wait services-stable --cluster invapp-dev-cluster --services invapp-dev-app-service

# Smoke test
curl -s -o /dev/null -w "%{http_code}" https://app.nquiry.ai/api/health
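The `<previous-revision>` placeholder in Option A is usually the current revision number minus one. A minimal sketch of that arithmetic follows. `previous_task_def` is a hypothetical helper, not something in the repo, and it assumes the prior revision is still registered and was the last known-good deploy.

```shell
# Print the task definition one revision below the given one.
# Accepts either "family:revision" or a full task definition ARN.
previous_task_def() {
  current="${1##*/}"          # strip any ARN prefix up to the last "/"
  family="${current%:*}"      # text before the last ":"
  revision="${current##*:}"   # text after the last ":"
  case "$revision" in
    ''|*[!0-9]*) echo "not a numeric revision: $1" >&2; return 1 ;;
  esac
  if [ "$revision" -le 1 ]; then
    echo "no earlier revision than $current" >&2
    return 1
  fi
  echo "${family}:$((revision - 1))"
}
```

Feed it the `services[0].taskDefinition` value (queried with `--output text` so the result is unquoted), then check the printed name against the `list-task-definitions` output before passing it to `--task-definition`.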

Rollback — Option B (clean, ~10 minutes):

git revert <bad-commit-sha>
git push origin main
# CI rebuilds and redeploys

2. RDS Database Issues

Detection: RDS CPU alarm, RDS connections alarm, RDS storage alarm, or application errors referencing database connection failures.

Severity: SEV-1 (connection failure) or SEV-2 (high CPU / approaching storage limit)

First response:

Check RDS status in AWS console or CLI:

aws rds describe-db-instances \
  --db-instance-identifier invapp-dev-db \
  --query 'DBInstances[0].{status:DBInstanceStatus,class:DBInstanceClass,storageGiB:AllocatedStorage,performanceInsights:PerformanceInsightsEnabled}'

Check current connections vs. max:

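# Via bastion tunnel (see deployment-flow.md for setup)
# Shows client sessions (total and currently active) next to the configured max_connections.
psql -h localhost -p 5433 -U app_admin -d investigation_app \
  -c "SELECT count(*) AS client_connections, count(*) FILTER (WHERE state = 'active') AS active, current_setting('max_connections') AS max_connections FROM pg_stat_activity WHERE backend_type = 'client backend';"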
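The connections alarm fires when usage passes 80% of the maximum. The sketch below applies the same comparison to numbers read off a query against `pg_stat_activity`. `connections_over_threshold` is a hypothetical helper name, and the 80% default simply mirrors the alarm described in this plan.

```shell
# Succeed (exit 0) when connections exceed the given percentage of max_connections.
# Usage: connections_over_threshold <connections> <max_connections> [threshold_pct]
connections_over_threshold() {
  connections="$1"
  max="$2"
  threshold_pct="${3:-80}"
  # Integer math only: connections/max > pct/100  <=>  connections*100 > max*pct
  [ $((connections * 100)) -gt $((max * threshold_pct)) ]
}
```

For example, `connections_over_threshold 85 100 && echo "over the alarm line"` prints the message, while 80 of 100 does not, because the comparison is strictly greater than.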

Common causes and fixes:

Cause	Fix
Connection pool exhaustion	Restart ECS task (force new deployment). Check for connection leaks in recent code.
High CPU from expensive query	Check `pg_stat_activity` for long-running queries. `SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 5;` Cancel with `SELECT pg_cancel_backend(<pid>);`
Storage approaching limit	Increase allocated storage in Terraform (`rds_allocated_storage`). This is a non-destructive operation.
RDS instance down	Check AWS Health Dashboard. If maintenance event, wait. If unexpected, check CloudTrail for changes.

Nuclear option — restore from snapshot:

# List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier invapp-dev-db \
  --query 'DBSnapshots[*].{id:DBSnapshotIdentifier,time:SnapshotCreateTime}' \
  --output table

# Restore to a new instance (does NOT replace the existing one)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier invapp-dev-db-restored \
  --db-snapshot-identifier <snapshot-id>

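After restoring, update the ECS task definition to point to the new RDS endpoint, or rename instances. Unless you pass --vpc-security-group-ids to the restore command, the new instance gets the VPC's default security group, so confirm the app can reach it before switching over. This is a last resort; expect 15-30 minutes.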

3. Bedrock Throttling / AI Analysis Failures

Detection: Bedrock throttling alarm, Bedrock error alarm, Sentry errors from analysis generation routes, or user reports that "analysis is stuck."

Severity: SEV-2

First response:

Check Bedrock CloudWatch metrics:

# -v-1H is BSD/macOS date syntax; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S
aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name InvocationThrottles \
  --dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-6 \
  --start-time "$(date -u -v-1H +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 300 --statistics Sum

Check current quota:

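# L-ABCD1234 is a placeholder; list the real quota codes first:
aws service-quotas list-service-quotas --service-code bedrock \
  --query 'Quotas[].{code:QuotaCode,name:QuotaName}' --output table
aws service-quotas get-service-quota \
  --service-code bedrock \
  --quota-code L-ABCD1234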

Check the in-app shedder + fallback state (NQU-783 PR 1):

# Hit the admin queue-stats endpoint with a valid admin session token:
curl -H "Authorization: Bearer ${ADMIN_TOKEN}" https://app.nquiry.ai/api/admin/queue-stats | jq '.shedder'
# Expected during incident: { "hot": true, "recentThrottles": N>3, "fallbackRegion": "us-west-2", "disabled": false }
# Search ECS logs for [AI:fallback] (region fallback fired) and [AI:Shed] (quality calls shed)

Built-in resilience (NQU-783 PR 1):

The app now ships two automatic resilience mechanisms before any operator action is needed:

Multi-region fallback. lib/ai/client.ts retries on us-west-2 after the primary region (us-east-1) exhausts retries with a retryable error (throttle, 5xx, connection). Look for [AI:fallback] log entries — these mean fallback fired, which is expected behavior under a us-east-1-only incident, not a code bug.
Throttle shedder. Each RateLimitError / ThrottlingException increments a 60-second sliding-window counter shared across all Bedrock limiters. When the count exceeds 3, quality-check calls (lib/ai/quality/*) reject pre-flight with QualityShedError so primary analysis traffic keeps its quota. Look for [AI:Throttle] (a throttle was recorded) and [AI:Shed] (a quality call was shed). Primary calls never shed.
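When reading ECS logs during a Bedrock incident, a quick tally of the three markers named above shows which mechanism is firing. The sketch below is one way to get that tally. `count_ai_markers` is a hypothetical helper, and it assumes the markers appear verbatim in the log lines.

```shell
# Read log text on stdin and print how many lines carry each AI resilience marker.
count_ai_markers() {
  logs="$(cat)"
  for marker in '[AI:fallback]' '[AI:Throttle]' '[AI:Shed]'; do
    # grep -F matches the marker literally, so "[" and "]" are not a bracket expression.
    # grep exits non-zero when the count is 0; "|| true" keeps that from aborting under set -e.
    count="$(printf '%s\n' "$logs" | grep -cF -- "$marker" || true)"
    echo "$marker $count"
  done
}
```

Typical use would be piping recent logs into it, for example `aws logs tail <log-group> --since 1h | count_ai_markers`, where `<log-group>` is a placeholder for the service's CloudWatch log group. A non-zero `[AI:fallback]` count with zero `[AI:Shed]` suggests a primary-region problem rather than quota pressure.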

Common causes and fixes:

Cause	Fix
Burst usage exceeding RPM quota	Wait — the throttle shedder will drop quality calls automatically; primary traffic continues. If shedder.hot stays true for >5 min, request a quota increase via AWS console. Current limit: 50 RPM per region.
us-east-1 regional outage	The fallback to us-west-2 should fire automatically (look for `[AI:fallback]` log lines). If those are present and the incident persists, both regions are likely degraded — check AWS Health Dashboard. No code fix needed.
us-west-2 regional outage	No user impact unless us-east-1 is also down. Confirm by checking that `[AI:fallback]` log lines are NOT firing.
Shedder firing false positives	Emergency disable via env var: `BEDROCK_SHEDDER_DISABLED=true` (jq upsert into the live task def via the CI deploy pattern). File a Linear bug to investigate.
Model deprecated or changed	Check Bedrock model catalog. Update `BEDROCK_PRIMARY_MODEL` / `BEDROCK_QUALITY_MODEL` env vars. Run the full pre-flight checklist in `.claude/rules/model-upgrade.md` before swapping.

User communication: If Bedrock is degraded in BOTH regions, analyses will fail with an error message. Users can retry later. No data is lost — the inquiry and evidence remain intact. If only quality checks are affected (shedder active, primary OK), the analysis ships without a quality score; the user sees a slightly degraded UX but the analysis itself is correct.

4. CloudFront / ALB Errors (502, 503, 504)

Detection: ALB 5xx alarm (after NQU-601), synthetics canary, or user reports of error pages.

Severity: SEV-1 (sustained) or SEV-3 (intermittent)

First response:

Determine which layer is failing:

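# Hit the health endpoint. If app.nquiry.ai is fronted by CloudFront, the first request
# goes through CloudFront; to bypass it, target the ALB's own DNS name instead
# (placeholder: <alb-dns-name>; -k because the certificate is issued for app.nquiry.ai).
curl -s -o /dev/null -w "%{http_code}\n" https://app.nquiry.ai/api/health
curl -sk -o /dev/null -w "%{http_code}\n" https://<alb-dns-name>/api/health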

# Check ECS task health
aws ecs describe-services --cluster invapp-dev-cluster --services invapp-dev-app-service \
  --query 'services[0].{running:runningCount,desired:desiredCount}'

If ALB returns 502: the ECS task is not responding. See Runbook #1 (ECS Task Down).
If ALB returns 504 (timeout): the app is running but slow. Check RDS CPU, Bedrock latency, or look for a long-running request.
If CloudFront returns an error but ALB is healthy: check CloudFront distribution status in console. WAF may be blocking requests — check WAF logs.
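The three branches above amount to a small lookup from the status code seen at the ALB to the next step. A sketch under that reading follows. `triage_alb_status` is a hypothetical helper, and the catch-all branch is an assumption for codes this runbook does not cover.

```shell
# Map the HTTP status observed at the ALB to the next runbook step.
triage_alb_status() {
  case "$1" in
    2??) echo "ALB healthy: check CloudFront distribution status and WAF logs" ;;
    502) echo "ECS task not responding: see Runbook 1 (ECS Task Down)" ;;
    504) echo "App running but slow: check RDS CPU, Bedrock latency, long-running requests" ;;
    *)   echo "Status $1 not covered here: check ECS service events and ALB target health" ;;
  esac
}
```

It can be chained with the health check, for example `triage_alb_status "$(curl -s -o /dev/null -w '%{http_code}' https://app.nquiry.ai/api/health)"`.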

WAF false positive: WAF is in count mode for bot control (not blocking), but rate limiting is active. If a legitimate user is being rate-limited:

# Check WAF sampled requests in console
# AWS Console → WAF & Shield → Web ACLs → invapp-dev-waf → Sampled requests

5. Cognito Authentication Failures

Detection: Auth failure spike alarm, user reports they can't log in, Sentry errors from auth routes.

Severity: SEV-1 (nobody can log in) or SEV-3 (one user locked out)

First response:

Check if Cognito itself is up:

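# A successful response means the Cognito API is reachable; the pool's Status field is deprecated and may come back empty.
aws cognito-idp describe-user-pool \
  --user-pool-id <pool-id> \
  --query 'UserPool.{id:Id,name:Name}'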

If a single user is locked out:

aws cognito-idp admin-get-user \
  --user-pool-id <pool-id> \
  --username <email> \
  --query '{status:UserStatus,enabled:Enabled}'

If user is disabled or locked (Advanced Security risk-based auth):

aws cognito-idp admin-enable-user \
  --user-pool-id <pool-id> \
  --username <email>

Common causes and fixes:

Cause	Fix
Cognito Advanced Security blocking legitimate user	Check risk evaluation in Cognito console. Disable adaptive authentication temporarily if needed.
User forgot password	Direct them to the forgot-password flow in the app.
Cognito regional outage	Check AWS Health Dashboard. No workaround — auth is centralized.
MFA device lost	Admin can reset MFA: `aws cognito-idp admin-set-user-mfa-preference --user-pool-id <id> --username <email> --software-token-mfa-settings Enabled=false`

Communication Templates

For F&F Users (SEV-1 or extended SEV-2)

Subject: Nquiry — temporary service interruption

Hi [name],

Nquiry is currently experiencing [brief description — e.g., "difficulty connecting to our AI analysis service"]. Your data is safe and no action is needed on your end.

I'm working on a fix and expect to have things back to normal within [timeframe]. I'll follow up when it's resolved.

Sorry for the interruption, and thanks for your patience while we work through these early days.

— Joe

Resolution Follow-Up

Hi [name],

The issue from [date/time] has been resolved. [One sentence on what happened and what was done.]

If you notice anything still off, just let me know.

— Joe

Escalation Path

Level	Who	When
L1	Joe (email alerts)	All incidents — first responder
L2	AWS Support (case)	Infrastructure issue Joe can't resolve in 30 min. Use AWS console → Support → Create case. Current plan: Basic (free), which covers only account, billing, and quota cases; technical cases require a paid plan.
L3	AWS Premium Support	If/when upgraded to Business support ($100/mo). Provides 24/7 access and < 1hr response for production-down.

Note: Upgrading to AWS Business Support before Gate 2 is worth considering. $100/mo buys < 1hr response time for production-system-down events and access to Trusted Advisor checks.

Post-Incident

After any SEV-1 or recurring SEV-2:

What happened? One paragraph.
What was the impact? Duration, users affected.
What was the root cause?
What prevented faster detection/resolution?
What changes will prevent recurrence? File Linear issues.

Keep it simple — a Linear comment on the relevant issue is sufficient for now. No formal post-mortem template needed at this scale.

Alarm Anchors (NQU-678)

Every CloudWatch alarm under infrastructure/terraform/modules/{monitoring,cloudwatch-dashboard,synthetics,elasticache} carries a Runbook: reference pointing here. Each subsection below is a 30-second triage card: what fired, three checks to do first, when to escalate. If a check produces a clear answer, follow that thread; if all three look fine, escalate.

Severity column legend: C = critical / page; W = warning / next business hour; I = informational / no action.

Application logic & background jobs

audit-failure

Severity: C (compliance trail at risk).

What fired: lib/shared/audit.ts emitted [AUDIT FAILURE]; at least one audit row failed to write in the last 5 min.

First look:

GET /api/admin/audit-health: what action types are failing? Recent error message?
CloudWatch logs filter [AUDIT FAILURE] in the last hour: clustered around one route or random?
Is RDS healthy (see #rds-cpu-high, #rds-connections-high)?

Escalation: Joe → if the table itself is unwriteable (FK constraint or column mismatch), suspect a recent migration; check _migrations for last applied row.

auth-failure-spike

Severity: W (possible brute force or password-reset campaign).

What fired: > N auth.login_failed audit rows in 5 min (threshold in terraform var auth_failure_threshold).

First look:

CloudWatch logs filter auth.login_failed: same email repeating, or many distinct emails?
WAF console → recent rate-limit blocks → confirm WAF is doing its job for Cognito IPs.
Slack/email user reports of locked-out accounts?

Escalation: Joe → if pattern is credential-stuffing, enable Cognito advanced security (adaptive auth) or temporarily lower the WAF rate-limit on /api/auth/*.

rate-limit-spike

Severity: W (possible DoS or misconfigured client).

What fired: > N rate-limit hits in 5 min (terraform var rate_limit_hits_threshold).

First look:

CloudWatch logs filter for the rate-limit log line: what user IDs / IPs are repeating?
One user hitting it = misconfigured client (e.g. a polling loop without backoff). Many distinct = abuse.
Confirm the rate limiter (Redis-backed) is actually keying on the right field (RATE_LIMITS config).

Escalation: Joe → if abuse, block at WAF; if client bug, contact the user and tighten the limit if structural.
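Telling "one user repeating" from "many distinct" is a frequency count over whatever identifier the log line carries. A minimal sketch follows. `top_offenders` is a hypothetical helper, and it assumes the identifiers (user IDs, IPs, or emails) have already been extracted one per line and contain no spaces.

```shell
# Read one identifier per line on stdin and print "count identifier" for the
# most frequent ones, highest first. Optional argument: how many to show (default 5).
top_offenders() {
  sort | uniq -c | sort -rn | head -n "${1:-5}" | awk '{print $1, $2}'
}
```

A single identifier with a count far above the rest points to a misconfigured client; a long flat list of distinct identifiers points to abuse. The same tally works for the auth-failure-spike card with emails as the identifier.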

webauthn-verify-failures

Severity: W (passkey auth regression risk).

What fired: WebAuthn verify endpoint failures spiking. Pre-existing context: NQU-655 documented Cognito DeviceConfiguration as a recurring root cause.

First look:

CloudWatch logs for auth.webauthn.authenticate.verify failures: is an error message recurring?
Cognito user-pool DeviceConfiguration setting: has anyone re-enabled it?
Is one user repeating, or many distinct users? One = stale credential; many = systemic.

Escalation: Joe → if many users, revert the most recent auth-related deploy.

quality-check-failure

Severity: W (analyses still complete, but quality scores are missing).

What fired: Background quality check (lib/ai/quality/) failed for at least one analysis in the last 5 min, typically a Bedrock timeout or a JSON-parse error in the merged faithfulness+contradiction prompt (NQU-608).

First look:

CloudWatch logs filter [QUALITY_CHECK_FAILURE]: is an error message recurring?
Bedrock side: see #bedrock-throttling, #bedrock-errors, #bedrock-latency.
Inspect the failing analysis row's generation_status: it should be complete or quality_unavailable, never stuck on checking.

Escalation: Joe → if Bedrock-side, retry will self-heal; if parse error, file a Linear ticket and disable the failing check via env until fixed.

usage-recording-failure

Severity: C if sustained (billing data may be incomplete).

What fired: recordAIUsage call failed in the last 5 min; quota counters and downstream invoicing are at risk.

First look:

CloudWatch logs for the failure line: DB error or constraint violation?
Is RDS healthy (see #rds-*)?
Spot-check a recent quota-using user via /api/admin/users: does their ai_usage count match expected?

Escalation: Joe → if sustained > 30 min, freeze any pending invoices until backfill is verified; check _migrations for recent ai_usage schema changes (see NQU-667).

retention-cron-missing

Severity: W (daily retention sweep didn't run).

What fired: No RetentionCronLastRun log line in the last 26 hours. Either the GitHub Actions workflow didn't trigger or it died before emitting.

First look:

GitHub Actions → Retention Cron workflow → did the latest scheduled run start? If not, suspect a GitHub Actions outage.
If it started but failed, the run logs show the curl response: Bearer token mismatch, app down, or RDS unreachable.
Trigger a manual run via the workflow_dispatch button to confirm the path works.

Escalation: Joe → if cron has been missing for > 48 h, a manual psql cleanup may be warranted before stale data accumulates.

embedding-pending-over-24h

Severity: W (embedding pipeline stuck).

What fired: Embedding sweeper (/api/admin/cron/embedding-sweeper) reports rows pending > 24 h after a sweep. Either the sweeper isn't running or the embedding pipeline itself is broken (NQU-669).

First look:

GitHub Actions → Embedding Sweeper Cron workflow → recent runs failing?
Manual trigger of the sweeper: does it complete with pendingOver24hAfterSweep: 0?
Inspect a failing row in embedding_source_status: what's the error_message?

Escalation: Joe → if Bedrock Titan embeddings are throttling, temporarily reduce MAX_REPROCESS_PER_RUN and run sweeper hourly from a workflow_dispatch.

Compute (ECS / Amplify / ALB)

ecs-task-deficit

Severity: C (service running fewer tasks than desired for 10 min).

What fired: ECS service health below desired count. Task is dying or failing to start.

First look:

ECS Console → service → events tab: what's the last task-stopped reason?
See #ecs-oom-or-exit if the reason is OOM / non-zero exit.
See #alb-healthy-hosts-low if no targets are registering.

Escalation: Joe → roll back to previous task definition (see "ECS Task Down" runbook section above).

ecs-oom-or-exit

Severity: C (container crashed).

What fired: ECS task logged OutOfMemoryError or a non-zero exit code in the last 5 min.

First look:

CloudWatch logs filter OutOfMemoryError or exit code: which container, which time?
Memory metric (see #ecs-memory-high): sustained pressure or a sudden spike?
Recent deploy? Roll back if so.

Escalation: Joe → if memory pressure is sustained, increase task memory and re-deploy; if exit code, check Sentry for an unhandled exception around the same timestamp.

ecs-cpu-high

Severity: W (service CPU > threshold for 10 min).

What fired: ECS service CPU above ecs_cpu_threshold (default 75%) for 10 minutes.

First look:

ALB request rate: has traffic spiked? If yes, this is just load.
CPU per task: one task pinned at 100%? Could be a stuck request loop.
Recent deploy with new sync work on the request path?

Escalation: Joe → if traffic-driven, scale up; if a hot loop, identify via CPU profiler and roll back the suspect change.

ecs-memory-high

Severity: W (container approaching OOM).

What fired: ECS service memory above ecs_memory_threshold (default 80%) for 10 minutes.

First look:

Memory metric trend: gradual climb (leak) or step (legitimate increased usage)?
CloudWatch logs for any high-memory operation indicators (large file processing, long contexts).
Recent change to context budgets in lib/ai/context-budget.ts?

Escalation: Joe → restart task to buy time; investigate leak via heap snapshot if pattern is leak-shaped.

amplify-5xx-high

Severity: C (Amplify 5xx error rate > 5%).

What fired: Amplify 5xx error rate exceeds 5% for 5 minutes.

First look:

Amplify console → app → recent deploys.
Sentry → recent unhandled exceptions clustering by route.
ALB target health (see #alb-healthy-hosts-low): is the underlying target serving?

Escalation: Joe → roll back the most recent Amplify deployment if it correlates.

alb-target-5xx-high

Severity: C (> N% of ALB target responses are 5xx).

What fired: ALB target 5xx error rate exceeds alb_5xx_error_rate_threshold_pct% over 5 minutes (with > 100 reqs in the window).

First look:

ECS task health: see #ecs-task-deficit.
Sentry exceptions correlated to the time window.
Recent deploy?

Escalation: Joe → roll back the deploy or scale tasks; if root cause is in DB, see #rds-*.
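The alb-target-5xx-high condition combines an error-rate threshold with a minimum request volume, so a handful of failures on a quiet window does not page anyone. A sketch of that evaluation follows. `alb_5xx_alarm` is a hypothetical helper, and the threshold is passed in because the actual value of alb_5xx_error_rate_threshold_pct is not stated here.

```shell
# Succeed (exit 0) when the 5xx rate exceeds threshold_pct AND the window saw > 100 requests.
# Usage: alb_5xx_alarm <5xx_count> <request_count> <threshold_pct>
alb_5xx_alarm() {
  errors="$1"
  requests="$2"
  threshold_pct="$3"
  # Volume guard: ignore windows with 100 requests or fewer.
  [ "$requests" -gt 100 ] || return 1
  # Integer math: errors/requests > pct/100  <=>  errors*100 > requests*pct
  [ $((errors * 100)) -gt $((requests * threshold_pct)) ]
}
```

With a 2% threshold, 10 errors in 200 requests trips the check, while the same 10 errors in exactly 100 requests does not because of the volume guard.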

alb-target-latency-p95-high

Severity: W (sustained p95 latency above threshold).

What fired: ALB target p95 response time above alb_latency_p95_threshold_seconds for 3 consecutive 5-min periods.

First look:

Slowest endpoints: CloudWatch logs Slow query lines (development) or AWS X-Ray traces (when enabled).
DB load (see #rds-cpu-high, #rds-connections-high).
Bedrock latency (see #bedrock-latency): analysis routes can dominate p95 on small fleets.

Escalation: Joe → if AI-driven, that's expected during heavy generation periods; if DB-driven, see #rds-*.

alb-healthy-hosts-low

Severity: C (app unreachable).

What fired: No healthy targets registered behind the ALB.

First look:

ECS service → tasks tab → are any tasks RUNNING? See #ecs-task-deficit.
ALB target group health checks: what's the failure reason (e.g. timeout, 5xx from /health)?
Recent task-def change to health-check path / port?

Escalation: Joe → roll back via "ECS Task Down" runbook; if stuck, AWS Support L2.

synthetics-canary-failure

Severity: C in prod, W in staging (health check failing).

What fired: Synthetics canary against app.nquiry.ai (or staging URL) is failing.

First look:

Manually curl https://app.nquiry.ai/api/health and note the status.
ALB / ECS health (see #alb-healthy-hosts-low, #ecs-task-deficit).
DNS / Route 53 health checks: anything stale?

Escalation: Joe → SEV-1 if production. Use the F&F user comms template above if outage exceeds 15 min.

Data layer (RDS / Redis)

rds-cpu-high

Severity: W → C if sustained > 30 min.

What fired: RDS CPU > 80% for 5 min.

First look:

RDS console → Performance Insights → top SQL by CPU.
New slow query? Check pg_stat_statements (or CloudWatch RDS Enhanced Monitoring).
Connection storm (see #rds-connections-high)?

Escalation: Joe → if a single query is hot, kill the offender (pg_terminate_backend(pid)) and file a Linear ticket; if sustained from legitimate load, scale instance class.

rds-storage-low

Severity: C (< 5 GB free).

What fired: RDS free storage below 5 GB.

First look:

Has retention cron (#retention-cron-missing) run recently? Failed retention is a likely cause.
WAL bloat: SELECT slot_name, active, restart_lsn FROM pg_replication_slots for inactive slots holding WAL.
Largest tables: is one table growing unexpectedly? SELECT pg_size_pretty(pg_total_relation_size('<table>')).

Escalation: Joe → emergency: aws rds modify-db-instance --db-instance-identifier invapp-dev-db --allocated-storage <bigger> --apply-immediately; long-term: enable Storage Autoscaling and set max ceiling.

rds-connections-high

Severity: W.

What fired: RDS connections exceed 80% of max (rds_max_connections).

First look:

SELECT state, count(*) FROM pg_stat_activity GROUP BY state: a pile of idle in transaction sessions points to a connection leak.
Pool sizing in lib/db/pool.ts (max: 10) × Lambda concurrency: at scale, may need PgBouncer.
Did a recent deploy add long-held transactions?

Escalation: Joe → kill leaked sessions; long-term, deploy RDS Proxy or PgBouncer.

redis-cpu-high

Severity: W.

What fired: ElastiCache Redis CPU sustained high.

First look:

Redis SLOWLOG (SLOWLOG GET 50): is one command pattern dominating?
Recent change to rate-limit window or to a new Redis-backed feature?
Key-count growth (INFO keyspace): runaway key creation?

Escalation: Joe → scale Redis instance; if rate-limit-driven, lower request volume at WAF.

redis-memory-high

Severity: W → C if approaching maxmemory.

What fired: Redis memory usage high.

First look:

INFO memory: used_memory vs maxmemory, fragmentation ratio.
redis-cli --bigkeys scan: any pathological keys?
Eviction policy on the cluster: is it dropping things silently?

Escalation: Joe → scale up; if rate-limit only, switch to LRU eviction safely.

External AI (Bedrock)

bedrock-throttling

Severity: W (Bedrock 429s spiking).

What fired: Bedrock throttling exceeds 10 requests in 5 minutes.

First look:

Which model? (Claude vs Titan embed vs other.) The terraform alarm includes ${bedrock_model_id}.
AI queue stats: getAIQueueStats (lib/ai/client.ts). Is pending depth growing?
Per-org concurrency cap (lib/ai/concurrency.ts): is one org saturating?

Escalation: Joe → request a quota increase via AWS console; reduce MAX_CONCURRENT_GENERATIONS_PER_ORG if a single org is the problem.

bedrock-latency

Severity: W (slow AI responses).

What fired: Bedrock p90 invocation latency above bedrock_latency_threshold_ms for 15+ minutes.

First look:

AWS Service Health Dashboard: Bedrock incident in this region?
Has the analysisType mix shifted toward larger contexts? Check token_usage in recent analysis rows.
Throttling adjacent (see #bedrock-throttling)?

Escalation: Joe → no immediate fix; communicate slowness to active users; consider lower-tier model fallback if sustained.

bedrock-errors

Severity: W → C if sustained.

What fired: Bedrock server errors above bedrock_error_threshold in 5 min.

First look:

Sentry exceptions tagged with the Bedrock client.
AWS Service Health.
Did we recently change model_id or version? Bedrock occasionally retires preview models.

Escalation: Joe → swap to a known-good model via env override; AWS Support L2 if Bedrock-side.

bedrock-token-spike

Severity: C (runaway cost risk).

What fired: Bedrock output tokens > bedrock_token_spike_threshold in 1 hour.

First look:

analysis rows from the last hour: any with abnormally large token_usage.output_tokens?
Did a new prompt template ship with max_tokens cranked up? Check prompt_template table.
One org dominating? Cross-reference with /api/admin/jobs/status.

Escalation: Joe → if a prompt template is the cause, set is_active = false on it via DB; if user behavior, lower per-user rate limit immediately.

Milestones (informational)

first-paid-invoice

Severity: I (milestone marker, not an incident).

What fired: Stripe webhook just processed the first non-zero invoice.paid event. Idempotent; fires once ever.

First look: None required. Open the celebration channel.

Escalation: None.
