Incident Response Plan
Created: 2026-04-17
Owner: Joe Etherage
Status: Active — minimum viable plan for Gate 1/Gate 2
Linear: NQU-420
How You'll Know Something Is Wrong
Nquiry has two alerting channels. Both send to joe.etherage@gmail.com.
CloudWatch Alarms → SNS → Email:
- RDS CPU > 80% for 5 min
- RDS free storage < 5 GB
- RDS connections > 80% of max
- Redis CPU and memory alarms
- Bedrock throttling > 10 requests / 5 min
- Bedrock latency p90 sustained high
- Bedrock server errors elevated
- Audit logging failure (compliance-critical)
- Auth failure spike (possible brute force)
- Rate limit spike (possible DoS)
- Synthetics canary failure (app.nquiry.ai unreachable)
- Pending NQU-601: ALB HealthyHostCount < 1, ALB 5xx spike, ECS CPU/memory
Sentry → Email:
- Unhandled JavaScript exceptions (client and server)
- API route errors
Not covered yet (acceptable for Gate 1/2):
- No PagerDuty or phone alerting — email only
- No on-call rotation — Joe is the sole responder
- No formal SLA — F&F users understand it's early
Severity Levels
SEV-1 — App is down or data is at risk. Users cannot access app.nquiry.ai, or data integrity is compromised. Drop everything.
SEV-2 — Core feature is broken. Users can log in but can't generate analyses, upload evidence, or export reports. Fix within hours.
SEV-3 — Degraded experience. Slow responses, intermittent errors, non-critical feature broken. Fix within 1 business day.
Failure Mode Runbooks
1. ECS Task Down (App Unreachable)
Detection: Synthetics canary alarm, ALB HealthyHostCount < 1 (after NQU-601), or user report that app.nquiry.ai returns 502/503.
Severity: SEV-1
First response (5 minutes):
- Check ECS console — is the task running?
aws ecs describe-services \
--cluster invapp-dev-cluster \
--services invapp-dev-app-service \
--query 'services[0].{desired:desiredCount,running:runningCount,status:status}'
- If task count is 0 or task is in STOPPED state, check why:
aws ecs describe-services \
--cluster invapp-dev-cluster \
--services invapp-dev-app-service \
--query 'services[0].events[:5]'
- Check stopped task reason:
aws ecs list-tasks --cluster invapp-dev-cluster --service-name invapp-dev-app-service --desired-status STOPPED
aws ecs describe-tasks --cluster invapp-dev-cluster --tasks <task-id> --query 'tasks[0].stoppedReason'
Common causes and fixes:
| Cause | Fix |
|---|---|
| OOM kill (container exceeded memory) | Task will auto-restart. If recurring, increase memory in ECS module. |
| Failed health check | Check /api/health — if the app is crashing on startup, check recent deploy. Roll back (see below). |
| Bad deploy | Roll back to previous task definition revision. |
| ECR image pull failure | Check ECR — image may have been deleted. Redeploy from last known-good SHA. |
Rollback — Option A (fast, ~3 minutes):
# Find current task def revision
aws ecs describe-services --cluster invapp-dev-cluster --services invapp-dev-app-service \
--query 'services[0].taskDefinition'
# List recent revisions
aws ecs list-task-definitions --family-prefix invapp-dev-app --sort DESC --max-items 5
# Roll back to previous revision
aws ecs update-service \
--cluster invapp-dev-cluster \
--service invapp-dev-app-service \
--task-definition invapp-dev-app:<previous-revision> \
--force-new-deployment
# Wait for stabilization
aws ecs wait services-stable --cluster invapp-dev-cluster --services invapp-dev-app-service
# Smoke test
curl -s -o /dev/null -w "%{http_code}" https://app.nquiry.ai/api/health
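The "previous revision" in Option A can be derived from the current task definition string. The helper below is a minimal sketch: the name previous_revision is hypothetical, and it assumes the revision immediately before the current one is the last known-good one and is still registered, which should be confirmed against the list-task-definitions output first.

```shell
# previous_revision TASKDEF
# Accepts a task definition ARN or a "family:revision" string and prints
# "family:<revision - 1>". Hypothetical helper, not part of the AWS CLI.
previous_revision() {
  taskdef=${1##*/}          # drop everything up to "task-definition/"
  family=${taskdef%:*}      # e.g. invapp-dev-app
  revision=${taskdef##*:}   # e.g. 42
  case $revision in
    ''|*[!0-9]*) echo "not a numeric revision: $1" >&2; return 1 ;;
  esac
  if [ "$revision" -le 1 ]; then
    echo "no earlier revision for $family" >&2
    return 1
  fi
  echo "$family:$((revision - 1))"
}
```

For example, feeding it the value returned by the describe-services query above yields a string that can be passed to --task-definition in the update-service call.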
Rollback — Option B (clean, ~10 minutes):
git revert <bad-commit-sha>
git push origin main
# CI rebuilds and redeploys
2. RDS Database Issues
Detection: RDS CPU alarm, RDS connections alarm, RDS storage alarm, or application errors referencing database connection failures.
Severity: SEV-1 (connection failure) or SEV-2 (high CPU / approaching storage limit)
First response:
- Check RDS status in AWS console or CLI:
aws rds describe-db-instances \
--db-instance-identifier invapp-dev-db \
--query 'DBInstances[0].{status:DBInstanceStatus,performanceInsights:PerformanceInsightsEnabled,storage:AllocatedStorage}'
- Check current connections vs. max:
# Via bastion tunnel (see deployment-flow.md for setup)
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT count(*) AS total_connections, count(*) FILTER (WHERE state = 'active') AS active_connections, current_setting('max_connections') AS max_connections FROM pg_stat_activity;"
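Once a connection count and max_connections are in hand, the comparison against the 80% alarm threshold is simple arithmetic. The sketch below uses hypothetical function names and integer math only; the 80 default mirrors the alarm listed at the top of this document.

```shell
# connection_pct CURRENT MAX
# Prints the integer percentage of max_connections in use (rounded down).
connection_pct() {
  echo $(( $1 * 100 / $2 ))
}

# connections_alarming CURRENT MAX [THRESHOLD_PCT]
# Exits 0 when usage is strictly above the threshold (default 80).
# Cross-multiplies instead of dividing so that 80.5% is not rounded down to 80.
connections_alarming() {
  threshold=${3:-80}
  [ $(( $1 * 100 )) -gt $(( $2 * threshold )) ]
}
```

Usage: connections_alarming 85 100 && echo "above alarm threshold".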
Common causes and fixes:
| Cause | Fix |
|---|---|
| Connection pool exhaustion | Restart ECS task (force new deployment). Check for connection leaks in recent code. |
| High CPU from expensive query | Check pg_stat_activity for long-running queries. SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 5; Cancel with SELECT pg_cancel_backend(<pid>); |
| Storage approaching limit | Increase allocated storage in Terraform (rds_allocated_storage). This is a non-destructive operation. |
| RDS instance down | Check AWS Health Dashboard. If maintenance event, wait. If unexpected, check CloudTrail for changes. |
Nuclear option — restore from snapshot:
# List available snapshots
aws rds describe-db-snapshots \
--db-instance-identifier invapp-dev-db \
--query 'DBSnapshots[*].{id:DBSnapshotIdentifier,time:SnapshotCreateTime}' \
--output table
# Restore to a new instance (does NOT replace the existing one)
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier invapp-dev-db-restored \
--db-snapshot-identifier <snapshot-id>
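After restoring, update the ECS task definition to point to the new RDS endpoint, or rename instances. Unless --vpc-security-group-ids is passed to the restore command, the new instance is attached to the VPC's default security group, so confirm the app can reach it before cutting over. This is a last resort — expect 15-30 minutes.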
3. Bedrock Throttling / AI Analysis Failures
Detection: Bedrock throttling alarm, Bedrock error alarm, Sentry errors from analysis generation routes, or user reports that "analysis is stuck."
Severity: SEV-2
First response:
- Check Bedrock CloudWatch metrics:
# date -v is BSD/macOS syntax; with GNU date use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S
aws cloudwatch get-metric-statistics \
--namespace AWS/Bedrock \
--metric-name InvocationThrottles \
--dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-6 \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 --statistics Sum
- Check current quota:
aws service-quotas get-service-quota \
--service-code bedrock \
--quota-code L-ABCD1234 # Placeholder: check the actual quota code in the console
- Check the in-app shedder + fallback state (NQU-783 PR 1):
# Hit the admin queue-stats endpoint with a valid admin session token:
curl -H "Authorization: Bearer ${ADMIN_TOKEN}" https://app.nquiry.ai/api/admin/queue-stats | jq '.shedder'
# Expected during incident: { "hot": true, "recentThrottles": N>3, "fallbackRegion": "us-west-2", "disabled": false }
# Search ECS logs for [AI:fallback] (region fallback fired) and [AI:Shed] (quality calls shed)
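The CloudWatch query in the first step builds its one-hour window with date -v, which only exists in BSD/macOS date. A portable alternative is to do the subtraction on epoch seconds and format the result. The function name below is hypothetical; the sketch tries the GNU form first and falls back to the BSD form.

```shell
# iso_utc_from_epoch EPOCH_SECONDS
# Prints a UTC timestamp in the format get-metric-statistics accepts.
# GNU date understands -d @EPOCH; BSD/macOS date uses -r EPOCH instead.
iso_utc_from_epoch() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%S 2>/dev/null ||
    date -u -r "$1" +%Y-%m-%dT%H:%M:%S
}

# Window covering the last hour, usable as --start-time / --end-time.
now_epoch=$(date +%s)
start_time=$(iso_utc_from_epoch $(( now_epoch - 3600 )))
end_time=$(iso_utc_from_epoch "$now_epoch")
```

The two variables can then replace the inline $(date ...) substitutions in the get-metric-statistics call.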
Built-in resilience (NQU-783 PR 1):
The app now ships two automatic resilience mechanisms that act before any operator action is needed:
- Multi-region fallback. lib/ai/client.ts retries on us-west-2 after the primary region (us-east-1) exhausts retries with a retryable error (throttle, 5xx, connection). Look for [AI:fallback] log entries — these mean fallback fired, which is expected behavior under a us-east-1-only incident, not a code bug.
- Throttle shedder. Each RateLimitError / ThrottlingException increments a 60-second sliding-window counter shared across all Bedrock limiters. When the count exceeds 3, quality-check calls (lib/ai/quality/*) reject pre-flight with QualityShedError so primary analysis traffic keeps its quota. Look for [AI:Throttle] (a throttle was recorded) and [AI:Shed] (a quality call was shed). Primary calls never shed.
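To make the shedder rule concrete when reading [AI:Throttle] and [AI:Shed] log lines, here is a minimal shell sketch of a 60-second sliding-window counter with a shed threshold of 3. It is an illustration of the counting rule only, not the code in lib/ai/client.ts; every function and variable name is hypothetical, and timestamps are passed in explicitly so the behavior is easy to trace.

```shell
WINDOW_SECONDS=60    # window length stated in the runbook
SHED_THRESHOLD=3     # quality calls shed when the count exceeds this
throttle_times=""    # space-separated epoch seconds of recorded throttles

# record_throttle NOW: drop entries older than the window, then append NOW.
record_throttle() {
  now=$1
  kept=""
  for t in $throttle_times; do
    if [ $(( now - t )) -lt "$WINDOW_SECONDS" ]; then kept="$kept $t"; fi
  done
  throttle_times="$kept $now"
}

# recent_throttles NOW: print how many recorded throttles are inside the window.
recent_throttles() {
  now=$1
  count=0
  for t in $throttle_times; do
    if [ $(( now - t )) -lt "$WINDOW_SECONDS" ]; then count=$(( count + 1 )); fi
  done
  echo "$count"
}

# should_shed KIND NOW: exit 0 only for quality calls while the count exceeds
# the threshold. Primary calls are never shed.
should_shed() {
  kind=$1
  at=$2
  [ "$kind" = "quality" ] && [ "$(recent_throttles "$at")" -gt "$SHED_THRESHOLD" ]
}
```

With throttles recorded at seconds 100, 110, 120, and 130, a quality call at second 131 is shed (4 exceeds 3), a primary call at the same moment is not, and by second 200 all four entries have aged out of the window.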
Common causes and fixes:
| Cause | Fix |
|---|---|
| Burst usage exceeding RPM quota | Wait — the throttle shedder will drop quality calls automatically; primary traffic continues. If shedder.hot stays true for >5 min, request a quota increase via AWS console. Current limit: 50 RPM per region. |
| us-east-1 regional outage | The fallback to us-west-2 should fire automatically (look for [AI:fallback] log lines). If those are present and the incident persists, both regions are likely degraded — check AWS Health Dashboard. No code fix needed. |
| us-west-2 regional outage | No user impact unless us-east-1 is also down. Confirm by checking that [AI:fallback] log lines are NOT firing. |
| Shedder firing false positives | Emergency disable via env var: BEDROCK_SHEDDER_DISABLED=true (jq upsert into the live task def via the CI deploy pattern). File a Linear bug to investigate. |
| Model deprecated or changed | Check Bedrock model catalog. Update BEDROCK_PRIMARY_MODEL / BEDROCK_QUALITY_MODEL env vars. Run the full pre-flight checklist in .claude/rules/model-upgrade.md before swapping. |
User communication: If Bedrock is degraded in BOTH regions, analyses will fail with an error message. Users can retry later. No data is lost — the inquiry and evidence remain intact. If only quality checks are affected (shedder active, primary OK), the analysis ships without a quality score; the user sees a slightly degraded UX but the analysis itself is correct.
4. CloudFront / ALB Errors (502, 503, 504)
Detection: ALB 5xx alarm (after NQU-601), synthetics canary, or user reports of error pages.
Severity: SEV-1 (sustained) or SEV-3 (intermittent)
First response:
- Determine which layer is failing:
# Intended as a direct-to-ALB check (bypassing CloudFront). If app.nquiry.ai resolves to CloudFront, use the ALB's own DNS name here instead.
curl -s -o /dev/null -w "%{http_code}" https://app.nquiry.ai/api/health
# Check ECS task health
aws ecs describe-services --cluster invapp-dev-cluster --services invapp-dev-app-service \
--query 'services[0].{running:runningCount,desired:desiredCount}'
- If ALB returns 502: the ECS task is not responding. See Runbook #1 (ECS Task Down).
- If ALB returns 504 (timeout): the app is running but slow. Check RDS CPU, Bedrock latency, or look for a long-running request.
- If CloudFront returns an error but ALB is healthy: check CloudFront distribution status in console. WAF may be blocking requests — check WAF logs.
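The status-code branching above can be captured in a small lookup so the curl result maps straight to a next step. This is a sketch with a hypothetical function name and hypothetical output labels; the 502 and 504 branches follow the steps above, the 503 branch follows the detection note in Runbook 1, and the 000 case (what curl prints when it receives no HTTP response at all) is an added assumption.

```shell
# next_step_for_status HTTP_CODE
# Maps the code printed by curl -w "%{http_code}" to a triage label.
next_step_for_status() {
  case $1 in
    2??)     echo "healthy-check-cloudfront-waf" ;;  # origin fine; look at CloudFront / WAF
    502|503) echo "runbook-1-ecs-task-down" ;;       # task not responding
    504)     echo "check-rds-bedrock-latency" ;;     # app up but slow
    000)     echo "no-http-response" ;;              # DNS, TLS, or network failure
    *)       echo "unclassified" ;;                  # inspect ALB and app logs
  esac
}
```

Usage: next_step_for_status "$(curl -s -o /dev/null -w "%{http_code}" https://app.nquiry.ai/api/health)".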
WAF false positive: WAF is in count mode for bot control (not blocking), but rate limiting is active. If a legitimate user is being rate-limited:
# Check WAF sampled requests in console
# AWS Console → WAF & Shield → Web ACLs → invapp-dev-waf → Sampled requests
5. Cognito Authentication Failures
Detection: Auth failure spike alarm, user reports they can't log in, Sentry errors from auth routes.
Severity: SEV-1 (nobody can log in) or SEV-3 (one user locked out)
First response:
- Check if Cognito itself is up:
aws cognito-idp describe-user-pool \
--user-pool-id <pool-id> \
--query 'UserPool.Status'
- If a single user is locked out:
aws cognito-idp admin-get-user \
--user-pool-id <pool-id> \
--username <email> \
--query '{status:UserStatus,enabled:Enabled}'
- If user is disabled or locked (Advanced Security risk-based auth):
aws cognito-idp admin-enable-user \
--user-pool-id <pool-id> \
--username <email>
Common causes and fixes:
| Cause | Fix |
|---|---|
| Cognito Advanced Security blocking legitimate user | Check risk evaluation in Cognito console. Disable adaptive authentication temporarily if needed. |
| User forgot password | Direct them to the forgot-password flow in the app. |
| Cognito regional outage | Check AWS Health Dashboard. No workaround — auth is centralized. |
| MFA device lost | Admin can reset MFA: aws cognito-idp admin-set-user-mfa-preference --user-pool-id <id> --username <email> --software-token-mfa-settings Enabled=false |
Communication Templates
For F&F Users (SEV-1 or extended SEV-2)
Subject: Nquiry — temporary service interruption
Hi [name],
Nquiry is currently experiencing [brief description — e.g., "difficulty connecting to our AI analysis service"]. Your data is safe and no action is needed on your end.
I'm working on a fix and expect to have things back to normal within [timeframe]. I'll follow up when it's resolved.
Sorry for the interruption, and thanks for your patience while we work through these early days.
— Joe
Resolution Follow-Up
Hi [name],
The issue from [date/time] has been resolved. [One sentence on what happened and what was done.]
If you notice anything still off, just let me know.
— Joe
Escalation Path
| Level | Who | When |
|---|---|---|
| L1 | Joe (email alerts) | All incidents — first responder |
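| L2 | AWS Support (case) | Infrastructure issue Joe can't resolve in 30 min. Use AWS console → Support → Create case. Current plan: Basic (free), which only allows account, billing, and service-quota cases; technical cases need a paid support plan. |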
| L3 | AWS Premium Support | If/when upgraded to Business support ($100/mo). Provides 24/7 access and < 1hr response for production-down. |
Note: Upgrading to AWS Business Support before Gate 2 is worth considering. $100/mo buys < 1hr response time for production-system-down events and access to Trusted Advisor checks.
Post-Incident
After any SEV-1 or recurring SEV-2:
- What happened? One paragraph.
- What was the impact? Duration, users affected.
- What was the root cause?
- What prevented faster detection/resolution?
- What changes will prevent recurrence? File Linear issues.
Keep it simple — a Linear comment on the relevant issue is sufficient for now. No formal post-mortem template needed at this scale.
Alarm Anchors (NQU-678)
Every CloudWatch alarm under infrastructure/terraform/modules/{monitoring,cloudwatch-dashboard,synthetics,elasticache} carries a Runbook: reference pointing here. Each subsection below is a 30-second triage card: what fired, three checks to do first, when to escalate. If a check produces a clear answer, follow that thread; if all three look fine, escalate.
Severity column legend: C = critical / page; W = warning / next business hour; I = informational / no action.
Application logic & background jobs
audit-failure
Severity: C — compliance trail at risk.
What fired: lib/shared/audit.ts emitted [AUDIT FAILURE] — at least one audit row failed to write in the last 5 min.
First look:
- GET /api/admin/audit-health — what action types are failing? Recent error message?
- CloudWatch logs filter [AUDIT FAILURE] in the last hour — clustered around one route or random?
- Is RDS healthy (see #rds-cpu-high, #rds-connections-high)?
Escalation: Joe → if the table itself is unwriteable (FK constraint or column mismatch), suspect a recent migration; check _migrations for last applied row.
auth-failure-spike
Severity: W — possible brute force or password-reset campaign.
What fired: > N auth.login_failed audit rows in 5 min (threshold in terraform var auth_failure_threshold).
First look:
- CloudWatch logs filter auth.login_failed — same email repeating, or many distinct emails?
- WAF console → recent rate-limit blocks → confirm WAF is doing its job for Cognito IPs.
- Slack/email user reports of locked-out accounts?
Escalation: Joe → if pattern is credential-stuffing, enable Cognito advanced security (adaptive auth) or temporarily lower the WAF rate-limit on /api/auth/*.
rate-limit-spike
Severity: W — possible DoS or misconfigured client.
What fired: > N rate-limit hits in 5 min (terraform var rate_limit_hits_threshold).
First look:
- CloudWatch logs filter for the rate-limit log line — what user IDs / IPs are repeating?
- One user hitting it = misconfigured client (e.g. a polling loop without backoff). Many distinct = abuse.
- Confirm the rate limiter (Redis-backed) is actually keying on the right field — RATE_LIMITS config.
Escalation: Joe → if abuse, block at WAF; if client bug, contact the user and tighten the limit if structural.
webauthn-verify-failures
Severity: W — passkey auth regression risk.
What fired: WebAuthn verify endpoint failures spiking. Pre-existing context: NQU-655 documented Cognito DeviceConfiguration as a recurring root cause.
First look:
- CloudWatch logs for auth.webauthn.authenticate.verify failures — error message recurring?
- Cognito user-pool DeviceConfiguration setting — has anyone re-enabled it?
- Is one user repeating, or many distinct users? One = stale credential; many = systemic.
Escalation: Joe → if many users, revert the most recent auth-related deploy.
quality-check-failure
Severity: W — analyses still complete, but quality scores missing.
What fired: Background quality check (lib/ai/quality/) failed for at least one analysis in the last 5 min — typically a Bedrock timeout or a JSON-parse error in the merged faithfulness+contradiction prompt (NQU-608).
First look:
- CloudWatch logs filter [QUALITY_CHECK_FAILURE] — error message recurring?
- Bedrock side: see #bedrock-throttling, #bedrock-errors, #bedrock-latency.
- Inspect failing analysis row's generation_status — should be complete or quality_unavailable, never stuck on checking.
Escalation: Joe → if Bedrock-side, retry will self-heal; if parse error, file a Linear ticket and disable the failing check via env until fixed.
usage-recording-failure
Severity: C if sustained — billing data may be incomplete.
What fired: recordAIUsage call failed in the last 5 min — quota counters and downstream invoicing are at risk.
First look:
- CloudWatch logs for the failure line — DB error or constraint violation?
- Is RDS healthy (see #rds-*)?
- Spot-check a recent quota-using user via /api/admin/users — does their ai_usage count match expected?
Escalation: Joe → if sustained > 30 min, freeze any pending invoices until backfill is verified; check _migrations for recent ai_usage schema changes (see NQU-667).
retention-cron-missing
Severity: W — daily retention sweep didn't run.
What fired: No RetentionCronLastRun log line in the last 26 hours. Either the GitHub Actions workflow didn't trigger or it died before emitting.
First look:
- GitHub Actions → Retention Cron workflow → did the latest scheduled run start? If not, GitHub Actions outage.
- If it started but failed, the run logs show the curl response — Bearer token mismatch, app down, or RDS unreachable.
- Trigger a manual run via the workflow_dispatch button to confirm the path works.
Escalation: Joe → if cron has been missing for > 48 h, a manual psql cleanup may be warranted before stale data accumulates.
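The 26-hour alarm window and the 48-hour escalation point in this card are both staleness checks on the timestamp of the last RetentionCronLastRun line. A minimal sketch, with a hypothetical function name and epoch seconds supplied by the caller:

```shell
# cron_is_stale LAST_RUN_EPOCH NOW_EPOCH [MAX_AGE_HOURS]
# Exits 0 when the last run is older than MAX_AGE_HOURS (default 26,
# the alarm window; pass 48 for the escalation threshold).
cron_is_stale() {
  max_age_hours=${3:-26}
  [ $(( $2 - $1 )) -gt $(( max_age_hours * 3600 )) ]
}
```

Usage: cron_is_stale "$last_run" "$(date +%s)" 48 && echo "consider manual cleanup", where last_run is an assumed variable holding the epoch time parsed from the most recent log line.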
embedding-pending-over-24h
Severity: W — embedding pipeline stuck.
What fired: Embedding sweeper (/api/admin/cron/embedding-sweeper) reports rows pending > 24 h after a sweep. Either the sweeper isn't running or the embedding pipeline itself is broken (NQU-669).
First look:
- GitHub Actions → Embedding Sweeper Cron workflow → recent runs failing?
- Manual trigger of the sweeper — does it complete with pendingOver24hAfterSweep: 0?
- Inspect a failing row in embedding_source_status — what's the error_message?
Escalation: Joe → if Bedrock Titan embeddings are throttling, temporarily reduce MAX_REPROCESS_PER_RUN and run sweeper hourly from a workflow_dispatch.
Compute (ECS / Amplify / ALB)
ecs-task-deficit
Severity: C — service running fewer tasks than desired for 10 min.
What fired: ECS service health below desired count. Task is dying or failing to start.
First look:
- ECS Console → service → events tab — what's the last task-stopped reason?
- See #ecs-oom-or-exit if the reason is OOM / non-zero exit.
- See #alb-healthy-hosts-low if no targets are registering.
Escalation: Joe → roll back to previous task definition (see "ECS Task Down" runbook section above).
ecs-oom-or-exit
Severity: C — container crashed.
What fired: ECS task logged OutOfMemoryError or a non-zero exit code in the last 5 min.
First look:
- CloudWatch logs filter OutOfMemoryError or exit code — which container, which time?
- Memory metric (see #ecs-memory-high) — sustained pressure or a sudden spike?
- Recent deploy? Roll back if so.
Escalation: Joe → if memory pressure is sustained, increase task memory and re-deploy; if exit code, check Sentry for an unhandled exception around the same timestamp.
ecs-cpu-high
Severity: W — service CPU > threshold for 10 min.
What fired: ECS service CPU above ecs_cpu_threshold (default 75%) for 10 minutes.
First look:
- ALB request rate — has traffic spiked? If yes, this is just load.
- CPU per task — one task pinned at 100%? Could be a stuck request loop.
- Recent deploy with new sync work on the request path?
Escalation: Joe → if traffic-driven, scale up; if a hot loop, identify via CPU profiler and roll back the suspect change.
ecs-memory-high
Severity: W — container approaching OOM.
What fired: ECS service memory above ecs_memory_threshold (default 80%) for 10 minutes.
First look:
- Memory metric trend — gradual climb (leak) or step (legitimate increased usage)?
- CloudWatch logs for any high-memory operation indicators (large file processing, long contexts).
- Recent change to context budgets in lib/ai/context-budget.ts?
Escalation: Joe → restart task to buy time; investigate leak via heap snapshot if pattern is leak-shaped.
amplify-5xx-high
Severity: C — Amplify 5xx error rate > 5%.
What fired: Amplify 5xx error rate exceeds 5% for 5 minutes.
First look:
- Amplify console → app → recent deploys.
- Sentry → recent unhandled exceptions clustering by route.
- ALB target health (see #alb-healthy-hosts-low) — is the underlying target serving?
Escalation: Joe → roll back the most recent Amplify deployment if it correlates.
alb-target-5xx-high
Severity: C — > N% of ALB targets returning 5xx.
What fired: ALB target 5xx error rate exceeds alb_5xx_error_rate_threshold_pct% over 5 minutes (with > 100 reqs in the window).
First look:
- ECS task health — see #ecs-task-deficit.
- Sentry exceptions correlated to the time window.
- Recent deploy?
Escalation: Joe → roll back the deploy or scale tasks; if root cause is in DB, see #rds-*.
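The alarm condition in this card combines an error-rate threshold with a minimum request count, so a handful of errors on a quiet service does not page. The sketch below shows that rule with a hypothetical function name; the threshold percentage is passed in because the value of alb_5xx_error_rate_threshold_pct is not stated here, and the 5 used in the usage line is illustrative only.

```shell
# alb_5xx_breach ERRORS REQUESTS THRESHOLD_PCT [MIN_REQUESTS]
# Exits 0 when REQUESTS is above MIN_REQUESTS (default 100) and the 5xx share
# is strictly above THRESHOLD_PCT. Integer math via cross-multiplication.
alb_5xx_breach() {
  errors=$1
  requests=$2
  threshold_pct=$3
  min_requests=${4:-100}
  [ "$requests" -gt "$min_requests" ] &&
    [ $(( errors * 100 )) -gt $(( requests * threshold_pct )) ]
}
```

Usage: alb_5xx_breach 12 180 5 && echo "5xx rate above threshold". With only 50 requests in the window the check stays quiet even if most of them failed.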
alb-target-latency-p95-high
Severity: W — sustained p95 latency above threshold.
What fired: ALB target p95 response time above alb_latency_p95_threshold_seconds for 3 consecutive 5-min periods.
First look:
- Slowest endpoints — CloudWatch logs Slow query lines (development) or AWS X-Ray traces (when enabled).
- DB load (see #rds-cpu-high, #rds-connections-high).
- Bedrock latency (see #bedrock-latency) — analysis routes can dominate p95 on small fleets.
Escalation: Joe → if AI-driven, that's expected during heavy generation periods; if DB-driven, see #rds-*.
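"Three consecutive 5-minute periods" means a single slow datapoint does not fire this alarm; only an unbroken run of breaches does. A minimal sketch of that evaluation, with a hypothetical function name and p95 values given in integer milliseconds (the 2000 ms threshold in the usage line is illustrative, not the configured value):

```shell
# sustained_breach THRESHOLD NEEDED VALUE...
# Values are ordered oldest to newest. Exits 0 when the most recent NEEDED
# values are all strictly above THRESHOLD.
sustained_breach() {
  threshold=$1
  needed=$2
  shift 2
  streak=0
  for value in "$@"; do
    if [ "$value" -gt "$threshold" ]; then
      streak=$(( streak + 1 ))
    else
      streak=0    # any datapoint at or below the threshold resets the run
    fi
  done
  [ "$streak" -ge "$needed" ]
}
```

Usage: sustained_breach 2000 3 1500 2100 2500 2200 succeeds, because the last three periods all exceed 2000 ms.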
alb-healthy-hosts-low
Severity: C — app unreachable.
What fired: No healthy targets registered behind the ALB.
First look:
- ECS service → tasks tab → are any tasks RUNNING? See #ecs-task-deficit.
- ALB target group health checks — what's the failure reason (e.g. timeout, 5xx from /health)?
- Recent task-def change to health-check path / port?
Escalation: Joe → roll back via "ECS Task Down" runbook; if stuck, AWS Support L2.
synthetics-canary-failure
Severity: C in prod, W in staging — health check failing.
What fired: Synthetics canary against app.nquiry.ai (or staging URL) is failing.
First look:
- Manually curl https://app.nquiry.ai/api/health — what status?
- ALB / ECS health (see #alb-healthy-hosts-low, #ecs-task-deficit).
- DNS / Route 53 health checks — anything stale?
Escalation: Joe → SEV-1 if production. Use the F&F user comms template above if outage exceeds 15 min.
Data layer (RDS / Redis)
rds-cpu-high
Severity: W → C if sustained > 30 min.
What fired: RDS CPU > 80% for 5 min.
First look:
- RDS console → Performance Insights → top SQL by CPU.
- New slow query? Check pg_stat_statements (or CloudWatch RDS Enhanced Monitoring).
- Connection storm (see #rds-connections-high)?
Escalation: Joe → if a single query is hot, kill the offender (pg_terminate_backend(pid)) and file a Linear ticket; if sustained from legitimate load, scale instance class.
rds-storage-low
Severity: C — < 5 GB free.
What fired: RDS free storage below 5 GB.
First look:
- Has retention cron (#retention-cron-missing) run recently? Failed retention is a likely cause.
- WAL bloat — SELECT slot_name, active, restart_lsn FROM pg_replication_slots for inactive slots holding WAL.
- Largest tables — is one table growing unexpectedly? pg_size_pretty(pg_total_relation_size('<table>')).
Escalation: Joe → emergency: aws rds modify-db-instance --db-instance-identifier invapp-dev-db --allocated-storage <bigger>; long-term: enable Storage Autoscaling and set max ceiling.
rds-connections-high
Severity: W.
What fired: RDS connections exceed 80% of max (rds_max_connections).
First look:
- SELECT state, count(*) FROM pg_stat_activity GROUP BY state — idle in transaction = connection leak.
- Pool sizing in lib/db/pool.ts (max: 10) × Lambda concurrency — at scale, may need PgBouncer.
- Recent deploy add long-held transactions?
Escalation: Joe → kill leaked sessions; long-term, deploy RDS Proxy or PgBouncer.
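The pool-sizing check is a multiplication: each running instance can hold up to the pool maximum, so worst-case demand is pool size times instance count. The sketch below uses hypothetical function names; the pool size of 10 and the 80% alarm level come from this card, while the max_connections value of 100 in the usage note is purely illustrative.

```shell
# worst_case_connections POOL_MAX INSTANCES
# Prints the connections opened if every instance fills its pool.
worst_case_connections() {
  echo $(( $1 * $2 ))
}

# max_safe_instances POOL_MAX DB_MAX_CONNECTIONS [ALARM_PCT]
# Prints how many instances fit below the alarm level (default 80% of max),
# rounded down.
max_safe_instances() {
  alarm_pct=${3:-80}
  echo $(( $2 * alarm_pct / 100 / $1 ))
}
```

With a pool of 10 and an assumed max_connections of 100, max_safe_instances 10 100 prints 8: a ninth concurrent instance could push the database past the alarm threshold, which is the point where a pooler such as PgBouncer or RDS Proxy starts to pay off.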
redis-cpu-high
Severity: W.
What fired: ElastiCache Redis CPU sustained high.
First look:
- Redis SLOWLOG (SLOWLOG GET 50) — is one command pattern dominating?
- Recent change to rate-limit window or to a new Redis-backed feature?
- Key-count growth — INFO keyspace — runaway key creation?
Escalation: Joe → scale Redis instance; if rate-limit-driven, lower request volume at WAF.
redis-memory-high
Severity: W → C if approaching maxmemory.
What fired: Redis memory usage high.
First look:
- INFO memory — used_memory vs maxmemory, fragmentation ratio.
- redis-cli --bigkeys scan — any pathological keys?
- Eviction policy on the cluster — is it dropping things silently?
Escalation: Joe → scale up; if rate-limit only, switch to LRU eviction safely.
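The used_memory versus maxmemory comparison can be read straight from INFO memory output. The filter below is a sketch with a hypothetical function name; it reads the INFO text on stdin, strips the carriage returns Redis emits, and reports "unlimited" when maxmemory is 0, which Redis uses to mean no limit is configured.

```shell
# redis_memory_pct
# Reads `INFO memory` output on stdin and prints used_memory as an integer
# percentage of maxmemory, or "unlimited" when maxmemory is 0 or absent.
redis_memory_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%d\n", used * 100 / max
    }'
}
```

Usage: redis-cli -h <redis-host> INFO memory | redis_memory_pct, where <redis-host> is the cluster endpoint.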
External AI (Bedrock)
bedrock-throttling
Severity: W — Bedrock 429s spiking.
What fired: Bedrock throttling exceeds 10 requests in 5 minutes.
First look:
- Which model? (Claude vs Titan embed vs other.) The terraform alarm includes ${bedrock_model_id}.
- AI queue stats: getAIQueueStats (lib/ai/client.ts) — pending depth growing?
- Per-org concurrency cap (lib/ai/concurrency.ts) — is one org saturating?
Escalation: Joe → request a quota increase via AWS console; reduce MAX_CONCURRENT_GENERATIONS_PER_ORG if a single org is the problem.
bedrock-latency
Severity: W — slow AI responses.
What fired: Bedrock p90 invocation latency above bedrock_latency_threshold_ms for 15+ minutes.
First look:
- AWS Service Health Dashboard — Bedrock incident in this region?
- Has the analysisType mix shifted toward larger contexts? Check token_usage in recent analysis rows.
- Throttling adjacent (see #bedrock-throttling)?
Escalation: Joe → no immediate fix; communicate slowness to active users; consider lower-tier model fallback if sustained.
bedrock-errors
Severity: W → C if sustained.
What fired: Bedrock server errors above bedrock_error_threshold in 5 min.
First look:
- Sentry exceptions tagged with the Bedrock client.
- AWS Service Health.
- Did we recently change model_id or version? Bedrock occasionally retires preview models.
Escalation: Joe → swap to a known-good model via env override; AWS Support L2 if Bedrock-side.
bedrock-token-spike
Severity: C — runaway cost risk.
What fired: Bedrock output tokens > bedrock_token_spike_threshold in 1 hour.
First look:
- analysis rows from the last hour — any with abnormally large token_usage.output_tokens?
- Did a new prompt template ship with max_tokens cranked up? Check prompt_template table.
- One org dominating? Cross-reference with /api/admin/jobs/status.
Escalation: Joe → if a prompt template is the cause, set is_active = false on it via DB; if user behavior, lower per-user rate limit immediately.
Milestones (informational)
first-paid-invoice
Severity: I — milestone marker, not an incident.
What fired: Stripe webhook just processed the first non-zero invoice.paid event. Idempotent — fires once ever.
First look: None required. Open the celebration channel.
Escalation: None.