Incident Response Plan

Created: 2026-04-17
Owner: Joe Etherage
Status: Active (minimum viable plan for Gate 1/Gate 2)
Linear: NQU-420


How You'll Know Something Is Wrong

Nquiry has two alerting channels. Both send to joe.etherage@gmail.com.

CloudWatch Alarms → SNS → Email:

  • RDS CPU > 80% for 5 min
  • RDS free storage < 5 GB
  • RDS connections > 80% of max
  • Redis CPU and memory alarms
  • Bedrock throttling > 10 requests / 5 min
  • Bedrock latency p90 sustained high
  • Bedrock server errors elevated
  • Audit logging failure (compliance-critical)
  • Auth failure spike (possible brute force)
  • Rate limit spike (possible DoS)
  • Synthetics canary failure (app.nquiry.ai unreachable)
  • Pending NQU-601: ALB HealthyHostCount < 1, ALB 5xx spike, ECS CPU/memory

Sentry → Email:

  • Unhandled JavaScript exceptions (client and server)
  • API route errors

Not covered yet (acceptable for Gate 1/2):

  • No PagerDuty or phone alerting — email only
  • No on-call rotation — Joe is the sole responder
  • No formal SLA: friends-and-family (F&F) users understand it's early

Severity Levels

SEV-1 — App is down or data is at risk. Users cannot access app.nquiry.ai, or data integrity is compromised. Drop everything.

SEV-2 — Core feature is broken. Users can log in but can't generate analyses, upload evidence, or export reports. Fix within hours.

SEV-3 — Degraded experience. Slow responses, intermittent errors, non-critical feature broken. Fix within 1 business day.


Failure Mode Runbooks

1. ECS Task Down (App Unreachable)

Detection: Synthetics canary alarm, ALB HealthyHostCount < 1 (after NQU-601), or user report that app.nquiry.ai returns 502/503.

Severity: SEV-1

First response (5 minutes):

  1. Check ECS console — is the task running?

    aws ecs describe-services \
    --cluster invapp-dev-cluster \
    --services invapp-dev-app-service \
    --query 'services[0].{desired:desiredCount,running:runningCount,status:status}'
  2. If task count is 0 or task is in STOPPED state, check why:

    aws ecs describe-services \
    --cluster invapp-dev-cluster \
    --services invapp-dev-app-service \
    --query 'services[0].events[:5]'
  3. Check stopped task reason:

    aws ecs list-tasks --cluster invapp-dev-cluster --service-name invapp-dev-app-service --desired-status STOPPED
    aws ecs describe-tasks --cluster invapp-dev-cluster --tasks <task-id> --query 'tasks[0].stoppedReason'
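
Steps 2 and 3 can be chained into one helper. This is a minimal sketch, assuming the AWS CLI is configured with access to the cluster; the function name latest_stopped_reason is hypothetical and not part of the repo. list-tasks does not guarantee ordering, so the helper reads the first STOPPED task it gets back rather than the most recent one.

```shell
# Print the stop reason of the first STOPPED task returned for the service.
# Defaults are the cluster and service names used throughout this runbook.
latest_stopped_reason() {
  cluster="${1:-invapp-dev-cluster}"
  service="${2:-invapp-dev-app-service}"
  task_arn=$(aws ecs list-tasks --cluster "$cluster" --service-name "$service" \
    --desired-status STOPPED --query 'taskArns[0]' --output text)
  # With --output text the CLI prints "None" when the query matches nothing.
  if [ -z "$task_arn" ] || [ "$task_arn" = "None" ]; then
    echo "no stopped tasks found"
    return 0
  fi
  aws ecs describe-tasks --cluster "$cluster" --tasks "$task_arn" \
    --query 'tasks[0].stoppedReason' --output text
}
```

Calling latest_stopped_reason with no arguments uses the dev cluster and service; pass two arguments to point it elsewhere.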

Common causes and fixes:

Cause | Fix
OOM kill (container exceeded memory) | Task will auto-restart. If recurring, increase memory in ECS module.
Failed health check | Check /api/health. If the app is crashing on startup, check the recent deploy and roll back (see below).
Bad deploy | Roll back to previous task definition revision.
ECR image pull failure | Check ECR; the image may have been deleted. Redeploy from last known-good SHA.

Rollback — Option A (fast, ~3 minutes):

# Find current task def revision
aws ecs describe-services --cluster invapp-dev-cluster --services invapp-dev-app-service \
--query 'services[0].taskDefinition'

# List recent revisions
aws ecs list-task-definitions --family-prefix invapp-dev-app --sort DESC --max-items 5

# Roll back to previous revision
aws ecs update-service \
--cluster invapp-dev-cluster \
--service invapp-dev-app-service \
--task-definition invapp-dev-app:<previous-revision> \
--force-new-deployment

# Wait for stabilization
aws ecs wait services-stable --cluster invapp-dev-cluster --services invapp-dev-app-service

# Smoke test
curl -s -o /dev/null -w "%{http_code}" https://app.nquiry.ai/api/health
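
The <previous-revision> placeholder in Option A can be derived from the current task definition ARN. This is a sketch under one assumption: that the revision numbered one below the current one is the last known-good deploy and is still ACTIVE. Confirm that against the list-task-definitions output before using it. The function name previous_task_def is hypothetical.

```shell
# Given a task definition ARN or a family:revision string,
# print family:(revision - 1).
previous_task_def() {
  ref="${1##*/}"          # drop the ARN prefix up to the last "/"
  family="${ref%:*}"      # text before the final ":"
  revision="${ref##*:}"   # text after the final ":"
  case "$revision" in
    ''|*[!0-9]*) echo "cannot parse revision from: $1" >&2; return 1 ;;
  esac
  if [ "$revision" -le 1 ]; then
    echo "no earlier revision than $ref" >&2
    return 1
  fi
  echo "$family:$((revision - 1))"
}

# Usage (not executed here):
#   current=$(aws ecs describe-services --cluster invapp-dev-cluster \
#     --services invapp-dev-app-service \
#     --query 'services[0].taskDefinition' --output text)
#   aws ecs update-service --cluster invapp-dev-cluster \
#     --service invapp-dev-app-service \
#     --task-definition "$(previous_task_def "$current")" --force-new-deployment
```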

Rollback — Option B (clean, ~10 minutes):

git revert <bad-commit-sha>
git push origin main
# CI rebuilds and redeploys

2. RDS Database Issues

Detection: RDS CPU alarm, RDS connections alarm, RDS storage alarm, or application errors referencing database connection failures.

Severity: SEV-1 (connection failure) or SEV-2 (high CPU / approaching storage limit)

First response:

  1. Check RDS status in AWS console or CLI:

    aws rds describe-db-instances \
    --db-instance-identifier invapp-dev-db \
    --query 'DBInstances[0].{status:DBInstanceStatus,performanceInsights:PerformanceInsightsEnabled,allocatedStorageGiB:AllocatedStorage}'
  2. Check current connections vs. max:

    # Via bastion tunnel (see deployment-flow.md for setup)
    psql -h localhost -p 5433 -U app_admin -d investigation_app \
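    -c "SELECT count(*) AS total_connections, count(*) FILTER (WHERE state = 'active') AS active_connections, current_setting('max_connections')::int AS max_connections FROM pg_stat_activity;"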
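
The connections alarm fires above 80% of max, so a connection count and the max_connections setting can be turned into a pass/fail check. A minimal sketch with hypothetical helper names; the two numbers are whatever the psql check returns.

```shell
# Print used connections as a whole-number percentage of max (truncated).
connection_pct() {
  used="$1"; max="$2"
  if [ "$max" -le 0 ]; then
    echo "max must be positive" >&2
    return 1
  fi
  echo $(( used * 100 / max ))
}

# Exit 0 when usage is strictly above the 80% alarm line.
# Compares used*100 with max*80 so integer truncation cannot hide a breach.
over_connection_alarm() {
  [ $(( $1 * 100 )) -gt $(( $2 * 80 )) ]
}
```

For example, connection_pct 85 100 prints 85, and over_connection_alarm 85 100 succeeds while over_connection_alarm 80 100 does not.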

Common causes and fixes:

Cause | Fix
Connection pool exhaustion | Restart ECS task (force new deployment). Check for connection leaks in recent code.
High CPU from expensive query | Check pg_stat_activity for long-running queries: SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 5; Cancel with SELECT pg_cancel_backend(<pid>);
Storage approaching limit | Increase allocated storage in Terraform (rds_allocated_storage). This is a non-destructive operation.
RDS instance down | Check AWS Health Dashboard. If maintenance event, wait. If unexpected, check CloudTrail for changes.

Nuclear option — restore from snapshot:

# List available snapshots
aws rds describe-db-snapshots \
--db-instance-identifier invapp-dev-db \
--query 'DBSnapshots[*].{id:DBSnapshotIdentifier,time:SnapshotCreateTime}' \
--output table

# Restore to a new instance (does NOT replace the existing one)
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier invapp-dev-db-restored \
--db-snapshot-identifier <snapshot-id>

After restoring, update the ECS task definition to point to the new RDS endpoint, or rename instances. This is a last resort — expect 15-30 minutes.
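
Before repointing the app, the restored instance has to reach the available state and its endpoint has to be looked up. A minimal sketch, assuming the instance identifier from the restore command above; the function name restored_endpoint is hypothetical.

```shell
# Wait for the restored instance, then print its endpoint hostname.
restored_endpoint() {
  instance="${1:-invapp-dev-db-restored}"
  # The waiter polls until the instance is "available" and fails on timeout.
  aws rds wait db-instance-available \
    --db-instance-identifier "$instance" || return 1
  aws rds describe-db-instances --db-instance-identifier "$instance" \
    --query 'DBInstances[0].Endpoint.Address' --output text
}
```

Unless the restore command specified them, the new instance may come up with default security group and parameter group settings, so check those before sending traffic to the printed endpoint.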


3. Bedrock Throttling / AI Analysis Failures

Detection: Bedrock throttling alarm, Bedrock error alarm, Sentry errors from analysis generation routes, or user reports that "analysis is stuck."

Severity: SEV-2

First response:

  1. Check Bedrock CloudWatch metrics:

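    # date -v-1H is BSD/macOS syntax; on Linux (GNU date) use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S
    aws cloudwatch get-metric-statistics \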
    --namespace AWS/Bedrock \
    --metric-name InvocationThrottles \
    --dimensions Name=ModelId,Value=us.anthropic.claude-sonnet-4-6 \
    --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 300 --statistics Sum
  2. Check current quota:

    aws service-quotas get-service-quota \
    --service-code bedrock \
    --quota-code L-ABCD1234 # Check actual quota code in console
  3. Check the in-app shedder + fallback state (NQU-783 PR 1):

    # Hit the admin queue-stats endpoint with a valid admin session token:
    curl -H "Authorization: Bearer ${ADMIN_TOKEN}" https://app.nquiry.ai/api/admin/queue-stats | jq '.shedder'
    # Expected during incident: { "hot": true, "recentThrottles": N>3, "fallbackRegion": "us-west-2", "disabled": false }
    # Search ECS logs for [AI:fallback] (region fallback fired) and [AI:Shed] (quality calls shed)

Built-in resilience (NQU-783 PR 1):

The app now ships two automatic resilience mechanisms before any operator action is needed:

  • Multi-region fallback. lib/ai/client.ts retries on us-west-2 after the primary region (us-east-1) exhausts retries with a retryable error (throttle, 5xx, connection). Look for [AI:fallback] log entries — these mean fallback fired, which is expected behavior under a us-east-1-only incident, not a code bug.
  • Throttle shedder. Each RateLimitError / ThrottlingException increments a 60-second sliding-window counter shared across all Bedrock limiters. When the count exceeds 3, quality-check calls (lib/ai/quality/*) reject pre-flight with QualityShedError so primary analysis traffic keeps its quota. Look for [AI:Throttle] (a throttle was recorded) and [AI:Shed] (a quality call was shed). Primary calls never shed.
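
The sliding-window rule behind the shedder can be illustrated in a few lines. This is a sketch of the technique only, written in shell to match the rest of this runbook; the real logic lives in the app's TypeScript code, and every name below is hypothetical. It keeps throttle timestamps from the last 60 seconds and reports "hot" once more than 3 remain in the window.

```shell
WINDOW_SECONDS=60   # window length from the description above
HOT_THRESHOLD=3     # quality calls are shed once the count exceeds this
THROTTLE_TIMES=""   # space-separated epoch seconds of recent throttles

# Drop timestamps that have aged out of the window ending at $1.
prune_throttles() {
  now="$1"; kept=""
  for t in $THROTTLE_TIMES; do
    if [ $((now - t)) -lt "$WINDOW_SECONDS" ]; then
      kept="$kept $t"
    fi
  done
  THROTTLE_TIMES="$kept"
}

# Record one throttle observed at epoch second $1.
record_throttle() {
  prune_throttles "$1"
  THROTTLE_TIMES="$THROTTLE_TIMES $1"
}

# Exit 0 ("hot") when more than HOT_THRESHOLD throttles are in the window.
shedder_is_hot() {
  prune_throttles "$1"
  set -- $THROTTLE_TIMES   # re-split the survivors so $# is their count
  [ "$#" -gt "$HOT_THRESHOLD" ]
}
```

Three throttles inside a minute leave the window cold; a fourth turns it hot, and it cools again on its own as the oldest entries age out.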

Common causes and fixes:

Cause | Fix
Burst usage exceeding RPM quota | Wait: the throttle shedder will drop quality calls automatically; primary traffic continues. If shedder.hot stays true for >5 min, request a quota increase via AWS console. Current limit: 50 RPM per region.
us-east-1 regional outage | The fallback to us-west-2 should fire automatically (look for [AI:fallback] log lines). If those are present and the incident persists, both regions are likely degraded; check AWS Health Dashboard. No code fix needed.
us-west-2 regional outage | No user impact unless us-east-1 is also down. Confirm by checking that [AI:fallback] log lines are NOT firing.
Shedder firing false positives | Emergency disable via env var: BEDROCK_SHEDDER_DISABLED=true (jq upsert into the live task def via the CI deploy pattern). File a Linear bug to investigate.
Model deprecated or changed | Check Bedrock model catalog. Update BEDROCK_PRIMARY_MODEL / BEDROCK_QUALITY_MODEL env vars. Run the full pre-flight checklist in .claude/rules/model-upgrade.md before swapping.
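
The "jq upsert" for the shedder kill switch might look like the sketch below. It is a guess at the pattern, not a copy of the CI deploy script: it assumes the app is the first container in the task definition, and the function name upsert_task_env is hypothetical. It reads task definition JSON on stdin, replaces or appends one environment variable, and removes the read-only fields that aws ecs register-task-definition does not accept as input.

```shell
# Upsert environment variable $1=$2 into the first container definition.
upsert_task_env() {
  jq --arg name "$1" --arg value "$2" '
    .containerDefinitions[0].environment =
      ((.containerDefinitions[0].environment // [])
        | map(select(.name != $name))
        + [{name: $name, value: $value}])
    | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
          .compatibilities, .registeredAt, .registeredBy)
  '
}

# Usage (not executed here):
#   aws ecs describe-task-definition --task-definition invapp-dev-app \
#     --query taskDefinition \
#     | upsert_task_env BEDROCK_SHEDDER_DISABLED true > new-task-def.json
#   aws ecs register-task-definition --cli-input-json file://new-task-def.json
```

After registering the new revision, point the service at it with update-service as in the rollback steps of Runbook #1.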

User communication: If Bedrock is degraded in BOTH regions, analyses will fail with an error message. Users can retry later. No data is lost — the inquiry and evidence remain intact. If only quality checks are affected (shedder active, primary OK), the analysis ships without a quality score; the user sees a slightly degraded UX but the analysis itself is correct.


4. CloudFront / ALB Errors (502, 503, 504)

Detection: ALB 5xx alarm (after NQU-601), synthetics canary, or user reports of error pages.

Severity: SEV-1 (sustained) or SEV-3 (intermittent)

First response:

  1. Determine which layer is failing:

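    # Public hostname. If app.nquiry.ai resolves to CloudFront, this request does not bypass it; repeat against the ALB DNS name to test the ALB directly.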
    curl -s -o /dev/null -w "%{http_code}" https://app.nquiry.ai/api/health

    # Check ECS task health
    aws ecs describe-services --cluster invapp-dev-cluster --services invapp-dev-app-service \
    --query 'services[0].{running:runningCount,desired:desiredCount}'
  2. If ALB returns 502: the ECS task is not responding. See Runbook #1 (ECS Task Down).

  3. If ALB returns 504 (timeout): the app is running but slow. Check RDS CPU, Bedrock latency, or look for a long-running request.

  4. If CloudFront returns an error but ALB is healthy: check CloudFront distribution status in console. WAF may be blocking requests — check WAF logs.
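
The branching in steps 2 to 4 can be written as a lookup on the status code that curl prints. A minimal sketch with a hypothetical function name; the 503 branch follows the 502/503 detection rule in Runbook #1, and the 000 branch relies on curl printing 000 for %{http_code} when no HTTP response arrives.

```shell
# Map an HTTP status code from the health check to the next runbook step.
next_step_for_status() {
  case "$1" in
    200)     echo "healthy: no action at this layer" ;;
    502|503) echo "ECS task not responding: see Runbook #1 (ECS Task Down)" ;;
    504)     echo "app running but slow: check RDS CPU and Bedrock latency" ;;
    000)     echo "no HTTP response: check DNS, TLS, and CloudFront status" ;;
    *)       echo "unexpected status $1: check CloudFront and WAF logs" ;;
  esac
}

# Usage (not executed here):
#   next_step_for_status "$(curl -s -o /dev/null -w '%{http_code}' \
#     https://app.nquiry.ai/api/health)"
```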

WAF false positive: WAF is in count mode for bot control (not blocking), but rate limiting is active. If a legitimate user is being rate-limited:

# Check WAF sampled requests in console
# AWS Console → WAF & Shield → Web ACLs → invapp-dev-waf → Sampled requests

5. Cognito Authentication Failures

Detection: Auth failure spike alarm, user reports they can't log in, Sentry errors from auth routes.

Severity: SEV-1 (nobody can log in) or SEV-3 (one user locked out)

First response:

  1. Check if Cognito itself is up:

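    # A successful response means the Cognito API is reachable.
    # UserPool.Status is documented as no longer used, so query stable fields instead.
    aws cognito-idp describe-user-pool \
    --user-pool-id <pool-id> \
    --query 'UserPool.{id:Id,name:Name}'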
  2. If a single user is locked out:

    aws cognito-idp admin-get-user \
    --user-pool-id <pool-id> \
    --username <email> \
    --query '{status:UserStatus,enabled:Enabled}'
  3. If user is disabled or locked (Advanced Security risk-based auth):

    aws cognito-idp admin-enable-user \
    --user-pool-id <pool-id> \
    --username <email>

Common causes and fixes:

Cause | Fix
Cognito Advanced Security blocking legitimate user | Check risk evaluation in Cognito console. Disable adaptive authentication temporarily if needed.
User forgot password | Direct them to the forgot-password flow in the app.
Cognito regional outage | Check AWS Health Dashboard. No workaround: auth is centralized.
MFA device lost | Admin can reset MFA: aws cognito-idp admin-set-user-mfa-preference --user-pool-id <pool-id> --username <email> --software-token-mfa-settings Enabled=false

Communication Templates

For F&F Users (SEV-1 or extended SEV-2)

Subject: Nquiry — temporary service interruption

Hi [name],

Nquiry is currently experiencing [brief description — e.g., "difficulty connecting to our AI analysis service"]. Your data is safe and no action is needed on your end.

I'm working on a fix and expect to have things back to normal within [timeframe]. I'll follow up when it's resolved.

Sorry for the interruption, and thanks for your patience while we work through these early days.

— Joe

Resolution Follow-Up

Hi [name],

The issue from [date/time] has been resolved. [One sentence on what happened and what was done.]

If you notice anything still off, just let me know.

— Joe


Escalation Path

Level | Who | When
L1 | Joe (email alerts) | All incidents; first responder.
L2 | AWS Support (case) | Infrastructure issue Joe can't resolve in 30 min. Use AWS console → Support → Create case. Current plan: Basic (free), which covers account and billing cases but not technical support cases, so this level is limited until the plan is upgraded.
L3 | AWS Premium Support | If/when upgraded to Business support ($100/mo). Provides 24/7 access and < 1hr response for production-down.

Note: Upgrading to AWS Business Support before Gate 2 is worth considering. $100/mo buys < 1hr response time for production-system-down events and access to Trusted Advisor checks.


Post-Incident

After any SEV-1 or recurring SEV-2:

  1. What happened? One paragraph.
  2. What was the impact? Duration, users affected.
  3. What was the root cause?
  4. What prevented faster detection/resolution?
  5. What changes will prevent recurrence? File Linear issues.

Keep it simple — a Linear comment on the relevant issue is sufficient for now. No formal post-mortem template needed at this scale.


Alarm Anchors (NQU-678)

Every CloudWatch alarm under infrastructure/terraform/modules/{monitoring,cloudwatch-dashboard,synthetics,elasticache} carries a Runbook: reference pointing here. Each subsection below is a 30-second triage card: what fired, three checks to do first, when to escalate. If a check produces a clear answer, follow that thread; if all three look fine, escalate.

Severity column legend: C = critical / page; W = warning / next business hour; I = informational / no action.

Application logic & background jobs

audit-failure

Severity: C (compliance trail at risk).
What fired: lib/shared/audit.ts emitted [AUDIT FAILURE]; at least one audit row failed to write in the last 5 min.
First look:

  • GET /api/admin/audit-health: which action types are failing? Recent error message?
  • CloudWatch logs filter [AUDIT FAILURE] in the last hour: clustered around one route or random?
  • Is RDS healthy (see #rds-cpu-high, #rds-connections-high)?

Escalation: Joe → if the table itself is unwriteable (FK constraint or column mismatch), suspect a recent migration; check _migrations for the last applied row.

auth-failure-spike

Severity: W (possible brute force or password-reset campaign).
What fired: > N auth.login_failed audit rows in 5 min (threshold in terraform var auth_failure_threshold).
First look:

  • CloudWatch logs filter auth.login_failed: same email repeating, or many distinct emails?
  • WAF console → recent rate-limit blocks → confirm WAF is doing its job for Cognito IPs.
  • Slack/email user reports of locked-out accounts?

Escalation: Joe → if the pattern is credential stuffing, enable Cognito advanced security (adaptive auth) or temporarily lower the WAF rate limit on /api/auth/*.

rate-limit-spike

Severity: W (possible DoS or misconfigured client).
What fired: > N rate-limit hits in 5 min (terraform var rate_limit_hits_threshold).
First look:

  • CloudWatch logs filter for the rate-limit log line: which user IDs / IPs are repeating?
  • One user hitting it = misconfigured client (e.g. a polling loop without backoff). Many distinct = abuse.
  • Confirm the rate limiter (Redis-backed) is actually keying on the right field (RATE_LIMITS config).

Escalation: Joe → if abuse, block at WAF; if client bug, contact the user and tighten the limit if structural.

webauthn-verify-failures

Severity: W (passkey auth regression risk).
What fired: WebAuthn verify endpoint failures spiking. Pre-existing context: NQU-655 documented Cognito DeviceConfiguration as a recurring root cause.
First look:

  • CloudWatch logs for auth.webauthn.authenticate.verify failures: is one error message recurring?
  • Cognito user-pool DeviceConfiguration setting: has anyone re-enabled it?
  • Is one user repeating, or many distinct users? One = stale credential; many = systemic.

Escalation: Joe → if many users, revert the most recent auth-related deploy.

quality-check-failure

Severity: W (analyses still complete, but quality scores are missing).
What fired: Background quality check (lib/ai/quality/) failed for at least one analysis in the last 5 min, typically a Bedrock timeout or a JSON-parse error in the merged faithfulness+contradiction prompt (NQU-608).
First look:

  • CloudWatch logs filter [QUALITY_CHECK_FAILURE]: is one error message recurring?
  • Bedrock side: see #bedrock-throttling, #bedrock-errors, #bedrock-latency.
  • Inspect the failing analysis row's generation_status: it should be complete or quality_unavailable, never stuck on checking.

Escalation: Joe → if Bedrock-side, retry will self-heal; if parse error, file a Linear ticket and disable the failing check via env until fixed.

usage-recording-failure

Severity: C if sustained (billing data may be incomplete).
What fired: recordAIUsage call failed in the last 5 min; quota counters and downstream invoicing are at risk.
First look:

  • CloudWatch logs for the failure line: DB error or constraint violation?
  • Is RDS healthy (see #rds-*)?
  • Spot-check a recent quota-using user via /api/admin/users: does their ai_usage count match expected?

Escalation: Joe → if sustained > 30 min, freeze any pending invoices until backfill is verified; check _migrations for recent ai_usage schema changes (see NQU-667).

retention-cron-missing

Severity: W (daily retention sweep didn't run).
What fired: No RetentionCronLastRun log line in the last 26 hours. Either the GitHub Actions workflow didn't trigger or it died before emitting.
First look:

  • GitHub Actions → Retention Cron workflow → did the latest scheduled run start? If not, suspect a GitHub Actions outage.
  • If it started but failed, the run logs show the curl response: Bearer token mismatch, app down, or RDS unreachable.
  • Trigger a manual run via the workflow_dispatch button to confirm the path works.

Escalation: Joe → if the cron has been missing for > 48 h, a manual psql cleanup may be warranted before stale data accumulates.

embedding-pending-over-24h

Severity: W (embedding pipeline stuck).
What fired: Embedding sweeper (/api/admin/cron/embedding-sweeper) reports rows pending > 24 h after a sweep. Either the sweeper isn't running or the embedding pipeline itself is broken (NQU-669).
First look:

  • GitHub Actions → Embedding Sweeper Cron workflow → recent runs failing?
  • Manual trigger of the sweeper: does it complete with pendingOver24hAfterSweep: 0?
  • Inspect a failing row in embedding_source_status: what's the error_message?

Escalation: Joe → if Bedrock Titan embeddings are throttling, temporarily reduce MAX_REPROCESS_PER_RUN and run the sweeper hourly from a workflow_dispatch.

Compute (ECS / Amplify / ALB)

ecs-task-deficit

Severity: C (service running fewer tasks than desired for 10 min).
What fired: ECS service health below desired count. Task is dying or failing to start.
First look:

  • ECS Console → service → events tab: what's the last task-stopped reason?
  • See #ecs-oom-or-exit if the reason is OOM / non-zero exit.
  • See #alb-healthy-hosts-low if no targets are registering.

Escalation: Joe → roll back to previous task definition (see "ECS Task Down" runbook section above).

ecs-oom-or-exit

Severity: C (container crashed).
What fired: ECS task logged OutOfMemoryError or a non-zero exit code in the last 5 min.
First look:

  • CloudWatch logs filter OutOfMemoryError or exit code: which container, at what time?
  • Memory metric (see #ecs-memory-high): sustained pressure or a sudden spike?
  • Recent deploy? Roll back if so.

Escalation: Joe → if memory pressure is sustained, increase task memory and re-deploy; if exit code, check Sentry for an unhandled exception around the same timestamp.

ecs-cpu-high

Severity: W (service CPU > threshold for 10 min).
What fired: ECS service CPU above ecs_cpu_threshold (default 75%) for 10 minutes.
First look:

  • ALB request rate: has traffic spiked? If yes, this is just load.
  • CPU per task: one task pinned at 100%? Could be a stuck request loop.
  • Recent deploy with new sync work on the request path?

Escalation: Joe → if traffic-driven, scale up; if a hot loop, identify via CPU profiler and roll back the suspect change.

ecs-memory-high

Severity: W (container approaching OOM).
What fired: ECS service memory above ecs_memory_threshold (default 80%) for 10 minutes.
First look:

  • Memory metric trend: gradual climb (leak) or step (legitimate increased usage)?
  • CloudWatch logs for any high-memory operation indicators (large file processing, long contexts).
  • Recent change to context budgets in lib/ai/context-budget.ts?

Escalation: Joe → restart task to buy time; investigate leak via heap snapshot if pattern is leak-shaped.

amplify-5xx-high

Severity: C (Amplify 5xx error rate > 5%).
What fired: Amplify 5xx error rate exceeds 5% for 5 minutes.
First look:

  • Amplify console → app → recent deploys.
  • Sentry → recent unhandled exceptions clustering by route.
  • ALB target health (see #alb-healthy-hosts-low): is the underlying target serving?

Escalation: Joe → roll back the most recent Amplify deployment if it correlates.

alb-target-5xx-high

Severity: C (> N% of ALB target responses are 5xx).
What fired: ALB target 5xx error rate exceeds alb_5xx_error_rate_threshold_pct% over 5 minutes (with > 100 reqs in the window).
First look:

  • ECS task health: see #ecs-task-deficit.
  • Sentry exceptions correlated to the time window.
  • Recent deploy?

Escalation: Joe → roll back the deploy or scale tasks; if root cause is in DB, see #rds-*.

alb-target-latency-p95-high

Severity: W (sustained p95 latency above threshold).
What fired: ALB target p95 response time above alb_latency_p95_threshold_seconds for 3 consecutive 5-min periods.
First look:

  • Slowest endpoints: CloudWatch logs Slow query lines (development) or AWS X-Ray traces (when enabled).
  • DB load (see #rds-cpu-high, #rds-connections-high).
  • Bedrock latency (see #bedrock-latency): analysis routes can dominate p95 on small fleets.

Escalation: Joe → if AI-driven, that's expected during heavy generation periods; if DB-driven, see #rds-*.

alb-healthy-hosts-low

Severity: C (app unreachable).
What fired: No healthy targets registered behind the ALB.
First look:

  • ECS service → tasks tab → are any tasks RUNNING? See #ecs-task-deficit.
  • ALB target group health checks: what's the failure reason (e.g. timeout, 5xx from /api/health)?
  • Recent task-def change to health-check path / port?

Escalation: Joe → roll back via "ECS Task Down" runbook; if stuck, AWS Support L2.

synthetics-canary-failure

Severity: C in prod, W in staging (health check failing).
What fired: Synthetics canary against app.nquiry.ai (or staging URL) is failing.
First look:

  • Manually curl https://app.nquiry.ai/api/health and note the status.
  • ALB / ECS health (see #alb-healthy-hosts-low, #ecs-task-deficit).
  • DNS / Route 53 health checks: anything stale?

Escalation: Joe → SEV-1 if production. Use the F&F user comms template above if outage exceeds 15 min.

Data layer (RDS / Redis)

rds-cpu-high

Severity: W → C if sustained > 30 min.
What fired: RDS CPU > 80% for 5 min.
First look:

  • RDS console → Performance Insights → top SQL by CPU.
  • New slow query? Check pg_stat_statements (or CloudWatch RDS Enhanced Monitoring).
  • Connection storm (see #rds-connections-high)?

Escalation: Joe → if a single query is hot, kill the offender (pg_terminate_backend(pid)) and file a Linear ticket; if sustained from legitimate load, scale instance class.

rds-storage-low

Severity: C (< 5 GB free).
What fired: RDS free storage below 5 GB.
First look:

  • Has retention cron (#retention-cron-missing) run recently? Failed retention is a likely cause.
  • WAL bloat: SELECT slot_name, active, restart_lsn FROM pg_replication_slots for inactive slots holding WAL.
  • Largest tables: is one table growing unexpectedly? pg_size_pretty(pg_total_relation_size('<table>')).

Escalation: Joe → emergency: aws rds modify-db-instance --db-instance-identifier invapp-dev-db --allocated-storage <bigger> --apply-immediately; long-term: enable Storage Autoscaling and set max ceiling.

rds-connections-high

Severity: W.
What fired: RDS connections exceed 80% of max (rds_max_connections).
First look:

  • SELECT state, count(*) FROM pg_stat_activity GROUP BY state: many "idle in transaction" rows point to a connection leak.
  • Pool sizing in lib/db/pool.ts (max: 10) × Lambda concurrency: at scale, may need PgBouncer.
  • Did a recent deploy add long-held transactions?

Escalation: Joe → kill leaked sessions; long-term, deploy RDS Proxy or PgBouncer.
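
The pool-sizing check is simple arithmetic: pool max multiplied by the number of concurrent app instances has to stay under the alarm line. A minimal sketch, assuming every instance can fill its pool; the function name is hypothetical, and the defaults are the pool max of 10 and the 80% threshold quoted in this card.

```shell
# Largest number of app instances whose full pools stay at or under the
# alarm percentage of max_connections. Integer division rounds down.
max_instances_under_alarm() {
  max_connections="$1"
  pool_max="${2:-10}"
  alarm_pct="${3:-80}"
  echo $(( max_connections * alarm_pct / 100 / pool_max ))
}
```

With max_connections at 100 and the defaults, the result is 8: eight instances with 10 connections each sit exactly at the 80% line, and a ninth would cross it.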

redis-cpu-high

Severity: W.
What fired: ElastiCache Redis CPU sustained high.
First look:

  • Redis SLOWLOG (SLOWLOG GET 50): is one command pattern dominating?
  • Recent change to rate-limit window or to a new Redis-backed feature?
  • Key-count growth (INFO keyspace): runaway key creation?

Escalation: Joe → scale Redis instance; if rate-limit-driven, lower request volume at WAF.

redis-memory-high

Severity: W → C if approaching maxmemory.
What fired: Redis memory usage high.
First look:

  • INFO memory: used_memory vs maxmemory, fragmentation ratio.
  • redis-cli --bigkeys scan: any pathological keys?
  • Eviction policy on the cluster: is it dropping things silently?

Escalation: Joe → scale up; if rate-limit only, switch to LRU eviction safely.

External AI (Bedrock)

bedrock-throttling

Severity: W (Bedrock 429s spiking).
What fired: Bedrock throttling exceeds 10 requests in 5 minutes.
First look:

  • Which model? (Claude vs Titan embed vs other.) The terraform alarm includes ${bedrock_model_id}.
  • AI queue stats: getAIQueueStats (lib/ai/client.ts). Is pending depth growing?
  • Per-org concurrency cap (lib/ai/concurrency.ts): is one org saturating?

Escalation: Joe → request a quota increase via AWS console; reduce MAX_CONCURRENT_GENERATIONS_PER_ORG if a single org is the problem.

bedrock-latency

Severity: W (slow AI responses).
What fired: Bedrock p90 invocation latency above bedrock_latency_threshold_ms for 15+ minutes.
First look:

  • AWS Service Health Dashboard: Bedrock incident in this region?
  • Has the analysisType mix shifted toward larger contexts? Check token_usage in recent analysis rows.
  • Throttling adjacent (see #bedrock-throttling)?

Escalation: Joe → no immediate fix; communicate slowness to active users; consider lower-tier model fallback if sustained.

bedrock-errors

Severity: W → C if sustained.
What fired: Bedrock server errors above bedrock_error_threshold in 5 min.
First look:

  • Sentry exceptions tagged with the Bedrock client.
  • AWS Service Health.
  • Did we recently change model_id or version? Bedrock occasionally retires preview models.

Escalation: Joe → swap to a known-good model via env override; AWS Support L2 if Bedrock-side.

bedrock-token-spike

Severity: C (runaway cost risk).
What fired: Bedrock output tokens > bedrock_token_spike_threshold in 1 hour.
First look:

  • analysis rows from the last hour: any with abnormally large token_usage.output_tokens?
  • Did a new prompt template ship with max_tokens cranked up? Check prompt_template table.
  • One org dominating? Cross-reference with /api/admin/jobs/status.

Escalation: Joe → if a prompt template is the cause, set is_active = false on it via DB; if user behavior, lower per-user rate limit immediately.

Milestones (informational)

first-paid-invoice

Severity: I (milestone marker, not an incident).
What fired: Stripe webhook just processed the first non-zero invoice.paid event. Idempotent: fires once ever.
First look: None required. Open the celebration channel.
Escalation: None.