
Operations Runbook

Day-to-day operational procedures for Nquiry production infrastructure. For incident response (failure modes, rollback, escalation), see incident-response.md.

Environment: A single AWS environment (invapp-dev-* prefix) serves production traffic at app.nquiry.ai. Renaming resources to invapp-prod-* is deferred to NQU-644 (Terraform parameterization).


1. Health Checks

Quick Status Check

# Application health
curl -s https://app.nquiry.ai/api/health | jq .

# ECS service status
aws ecs describe-services \
--cluster invapp-dev-cluster \
--services invapp-dev-app-service \
--query 'services[0].{desired:desiredCount,running:runningCount,status:status,deployments:deployments[*].{status:status,running:runningCount,desired:desiredCount}}'

# RDS status
# Free storage is not returned here; FreeStorageSpace is a CloudWatch metric in the AWS/RDS namespace
aws rds describe-db-instances \
--db-instance-identifier invapp-dev-postgres \
--query 'DBInstances[0].{status:DBInstanceStatus,storage:AllocatedStorage,endpoint:Endpoint.Address}'

# Redis status
aws elasticache describe-replication-groups \
--query 'ReplicationGroups[?starts_with(ReplicationGroupId,`invapp-dev`)].{id:ReplicationGroupId,status:Status,nodes:NodeGroups[0].NodeGroupMembers[*].{endpoint:ReadEndpoint,role:CurrentRole}}'
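
For scripted checks, the ECS counts can be reduced to a single word. The sketch below is one possible approach and not part of the existing tooling: ecs_state and check_ecs are hypothetical helper names, and the three labels (stable, degraded, down) are an assumed convention.

```shell
# Hypothetical helper: classify an ECS service from its desired and running task counts.
ecs_state() {
  desired=$1
  running=$2
  if [ "$desired" -gt 0 ] && [ "$running" -ge "$desired" ]; then
    echo "stable"
  elif [ "$running" -gt 0 ]; then
    echo "degraded"
  else
    echo "down"
  fi
}

# Fetch desiredCount and runningCount as tab-separated text and classify them.
# Assumes the service exists; a missing service makes the CLI print "None".
check_ecs() {
  counts=$(aws ecs describe-services \
    --cluster invapp-dev-cluster \
    --services invapp-dev-app-service \
    --query 'services[0].[desiredCount,runningCount]' \
    --output text) || return 2
  # Word splitting is intentional: "2<TAB>2" becomes two arguments.
  # shellcheck disable=SC2086
  ecs_state $counts
}
```

Running check_ecs prints one of the three labels, which makes it easy to gate a follow-up step on "stable".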

CloudWatch Alarm Status

# All alarms in ALARM state
aws cloudwatch describe-alarms \
--state-value ALARM \
--query 'MetricAlarms[*].{name:AlarmName,state:StateValue,reason:StateReason}' \
--output table

Synthetics Canary Status

# Check canary state (two canaries: staging-health + prod-staging-health)
# get-canary reports the canary's state, not run results; use get-canary-runs for those
aws synthetics get-canary \
--name invapp-dev-staging-health \
--query 'Canary.{state:Status.State,reasonCode:Status.StateReasonCode}'

2. Log Access

Application Logs (CloudWatch)

# Recent logs (last 30 minutes)
# -v-30M is BSD/macOS date syntax; on GNU/Linux use: date -u -d '30 minutes ago' +%s
aws logs filter-log-events \
--log-group-name /ecs/invapp-dev-app \
--start-time $(date -u -v-30M +%s)000 \
--filter-pattern "ERROR" \
--query 'events[*].{time:timestamp,message:message}' \
--output table

# Tail logs live
aws logs tail /ecs/invapp-dev-app --follow --since 5m
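
The --start-time value for filter-log-events is an epoch timestamp in milliseconds. A minimal sketch of a helper that computes it with shell arithmetic only, so it behaves the same with BSD and GNU date; minutes_ago_ms is a hypothetical name, not an existing script:

```shell
# Hypothetical helper: print the epoch time N minutes ago in milliseconds,
# the unit that `aws logs filter-log-events --start-time` expects.
minutes_ago_ms() {
  minutes=$1
  now=$(date -u +%s)
  echo "$(( (now - minutes * 60) * 1000 ))"
}
```

It would slot into the command above as --start-time "$(minutes_ago_ms 30)".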

Sentry

Use the Sentry dashboard to review unhandled exceptions. The DSN is configured via the NEXT_PUBLIC_SENTRY_DSN environment variable. Sentry alerts are delivered by email.

Audit Logs (Application-Level)

Audit logs are stored in the audit_log table. Access them through the bastion tunnel (see Bastion Access under Database Operations):

# Recent audit entries
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT created_at, user_id, action, resource_type, success
FROM audit_log
ORDER BY created_at DESC
LIMIT 20;"

3. Database Operations

Bastion Access

The bastion host provides tunneled access to RDS through SSM Session Manager port forwarding (no SSH keys needed):

# Start SSM session with port forwarding
aws ssm start-session \
--target <bastion-instance-id> \
--document-name AWS-StartPortForwardingSessionToRemoteHost \
--parameters '{"host":["<rds-endpoint>"],"portNumber":["5432"],"localPortNumber":["5433"]}'

# Connect via local tunnel
psql -h localhost -p 5433 -U app_admin -d investigation_app
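
The <rds-endpoint> placeholder can be looked up rather than pasted in. The following is a sketch under stated assumptions: port_forward_params and open_db_tunnel are hypothetical helper names, and the bastion instance ID is still passed in by hand because this runbook does not say how the bastion is tagged.

```shell
# Hypothetical helper: build the --parameters JSON for
# AWS-StartPortForwardingSessionToRemoteHost.
port_forward_params() {
  host=$1
  remote_port=${2:-5432}
  local_port=${3:-5433}
  printf '{"host":["%s"],"portNumber":["%s"],"localPortNumber":["%s"]}\n' \
    "$host" "$remote_port" "$local_port"
}

# Look up the RDS endpoint address, then open the tunnel.
# $1 is the bastion instance ID, which this sketch does not try to discover.
open_db_tunnel() {
  bastion_id=$1
  rds_host=$(aws rds describe-db-instances \
    --db-instance-identifier invapp-dev-postgres \
    --query 'DBInstances[0].Endpoint.Address' \
    --output text) || return 1
  aws ssm start-session \
    --target "$bastion_id" \
    --document-name AWS-StartPortForwardingSessionToRemoteHost \
    --parameters "$(port_forward_params "$rds_host")"
}
```

The defaults (5432 remote, 5433 local) match the ports used throughout this runbook, so the psql commands work unchanged once the tunnel is open.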

Running Migrations

# Via ECS exec (preferred — runs in the application context)
aws ecs execute-command \
--cluster invapp-dev-cluster \
--task <task-id> \
--container app \
--interactive \
--command "npm run db:migrate"

# Check migration status
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT filename, executed_at FROM _migrations ORDER BY executed_at DESC LIMIT 10;"
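
The <task-id> in the execute-command call has to come from a running task. One way to find it, sketched with hypothetical helper names (task_id_from_arn, first_running_task):

```shell
# Hypothetical helper: the task ID is the last path segment of a task ARN.
task_id_from_arn() {
  echo "${1##*/}"
}

# Print the ID of one running task in the app service.
first_running_task() {
  arn=$(aws ecs list-tasks \
    --cluster invapp-dev-cluster \
    --service-name invapp-dev-app-service \
    --desired-status RUNNING \
    --query 'taskArns[0]' \
    --output text) || return 1
  # The CLI prints "None" when the list is empty.
  [ "$arn" != "None" ] || return 1
  task_id_from_arn "$arn"
}
```

The result can be passed straight to the migration command as --task "$(first_running_task)".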

Database Size and Growth

psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT pg_size_pretty(pg_database_size('investigation_app')) as db_size;"

# Table sizes
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT relname as table, pg_size_pretty(pg_total_relation_size(relid)) as size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 15;"

Manual RDS Snapshot (Pre-Migration)

Required before running migrations in production (per staging-to-production ADR Decision 3):

aws rds create-db-snapshot \
--db-instance-identifier invapp-dev-postgres \
--db-snapshot-identifier "pre-migration-$(date +%Y%m%d-%H%M%S)"
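
create-db-snapshot returns as soon as the snapshot is requested, not when it is complete. A minimal sketch that blocks until the snapshot is available before a migration proceeds; snapshot_id and snapshot_and_wait are hypothetical helper names:

```shell
# Hypothetical helper: build a snapshot identifier in the runbook's naming scheme.
snapshot_id() {
  echo "pre-migration-$(date +%Y%m%d-%H%M%S)"
}

# Create the snapshot and block until RDS reports it as available.
snapshot_and_wait() {
  id=$(snapshot_id)
  aws rds create-db-snapshot \
    --db-instance-identifier invapp-dev-postgres \
    --db-snapshot-identifier "$id" >/dev/null || return 1
  aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$id" || return 1
  echo "$id"
}
```

snapshot_and_wait prints the identifier on success, which is worth recording alongside the migration in case a restore is needed.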

4. Embedding Worker

The embedding worker processes evidence files into vector embeddings (Titan V2) for RAG retrieval.

Check Worker Status

# Item counts by processing status
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT processing_status, count(*)
FROM embedding_source_status
GROUP BY processing_status
ORDER BY count(*) DESC;"

# Oldest pending items (check for stalled processing)
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT source_id, source_type, investigation_id, processing_status,
processing_started_at, created_at
FROM embedding_source_status
WHERE processing_status = 'pending'
ORDER BY created_at ASC
LIMIT 10;"

Diagnosing Stalled Embeddings

If items are stuck in pending or processing for more than 1 hour:

  1. Check whether the items belong to demo/seed organizations or real users (NQU-651 context)
  2. Check ECS task logs for embedding errors
  3. Check the Bedrock Titan V2 quota; throttling can stall the worker

# Check which orgs have pending items (joins via investigation → organization)
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT o.name, count(*) as pending
FROM embedding_source_status ess
JOIN investigation i ON ess.investigation_id = i.investigation_id
JOIN organization o ON i.organization_id = o.organization_id
WHERE ess.processing_status = 'pending'
GROUP BY o.name
ORDER BY pending DESC;"
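
The one-hour threshold can be applied in the query itself instead of by eye. This is a sketch only: stalled_sql is a hypothetical helper, and the 'processing' status value is taken from the prose in this section rather than confirmed against the schema.

```shell
# Hypothetical helper: print SQL listing items still pending or processing
# more than N minutes after creation (default 60).
# Assumption: 'processing' is a valid processing_status value.
stalled_sql() {
  minutes=${1:-60}
  # Accept digits only so the value is safe to interpolate into SQL.
  case $minutes in
    *[!0-9]*) echo "minutes must be a non-negative integer" >&2; return 1 ;;
  esac
  cat <<SQL
SELECT source_id, source_type, processing_status, created_at
FROM embedding_source_status
WHERE processing_status IN ('pending', 'processing')
  AND created_at < now() - interval '${minutes} minutes'
ORDER BY created_at ASC;
SQL
}
```

With the tunnel open, it can be run as psql -h localhost -p 5433 -U app_admin -d investigation_app -c "$(stalled_sql 60)".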

5. User Management (Cognito)

List Users

aws cognito-idp list-users \
--user-pool-id <pool-id> \
--query 'Users[*].{email:Attributes[?Name==`email`].Value|[0],status:UserStatus,created:UserCreateDate,enabled:Enabled}' \
--output table

Disable/Enable User

# Disable
aws cognito-idp admin-disable-user \
--user-pool-id <pool-id> \
--username <email>

# Enable
aws cognito-idp admin-enable-user \
--user-pool-id <pool-id> \
--username <email>

Reset User Password

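# --permanent makes the password final; omit it to issue a temporary password the user must change at next sign-in
aws cognito-idp admin-set-user-password \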
--user-pool-id <pool-id> \
--username <email> \
--password <new-password> \
--permanent

Reset MFA

aws cognito-idp admin-set-user-mfa-preference \
--user-pool-id <pool-id> \
--username <email> \
--software-token-mfa-settings Enabled=false

6. S3 Evidence Storage

Check Storage Usage

# Total bucket size
aws s3 ls s3://investigation-app-dev-760007728097 --recursive --summarize | tail -2

# Per-org storage (top consumers); assumes the first key segment identifies the org
aws s3 ls s3://investigation-app-dev-760007728097/ --recursive \
| awk '{split($4, parts, "/"); sum[parts[1]] += $3} END {for (k in sum) print sum[k], k}' \
| sort -rn | head -10

Archive/Lifecycle Status

S3 lifecycle policy tags investigations closed >90 days for Glacier transition:

# Check lifecycle rules
aws s3api get-bucket-lifecycle-configuration \
--bucket investigation-app-dev-760007728097 \
--query 'Rules[*].{id:ID,status:Status,transitions:Transitions}'

7. Cost Monitoring

Current Month Spend

# End is exclusive and Start must be earlier than End, so this call fails on the 1st of the month
aws ce get-cost-and-usage \
--time-period Start=$(date -u +%Y-%m-01),End=$(date -u +%Y-%m-%d) \
--granularity MONTHLY \
--metrics UnblendedCost \
--group-by Type=DIMENSION,Key=SERVICE \
--query 'ResultsByTime[0].Groups[*].{service:Keys[0],cost:Metrics.UnblendedCost.Amount}' \
--output table
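
Cost Explorer treats End as exclusive, so a period that ends tomorrow both includes today and stays valid on the first day of the month. A minimal sketch of that calculation; ce_period is a hypothetical helper, and the optional epoch argument exists only to make it testable:

```shell
# Hypothetical helper: print a --time-period value running from the first of
# the current month (UTC) to tomorrow (UTC, exclusive).
# $1 is an optional epoch timestamp to use instead of the current time.
ce_period() {
  now=${1:-$(date -u +%s)}
  tomorrow=$((now + 86400))
  if date -u -d "@0" +%s >/dev/null 2>&1; then
    # GNU date: -d @EPOCH
    start=$(date -u -d "@$now" +%Y-%m-01)
    end=$(date -u -d "@$tomorrow" +%Y-%m-%d)
  else
    # BSD/macOS date: -r EPOCH
    start=$(date -u -r "$now" +%Y-%m-01)
    end=$(date -u -r "$tomorrow" +%Y-%m-%d)
  fi
  printf 'Start=%s,End=%s\n' "$start" "$end"
}
```

It would replace the inline date calls as --time-period "$(ce_period)".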

Budget Alerts

Budget alerts configured via Terraform (NQU-588). Alerts fire at 80%, 90%, and 100% of monthly budget.

CloudWatch Free Tier

CloudWatch has a 10-alarm free tier limit. Current alarm count exceeds this (NQU-638 tracking). Expected overage cost: ~$5-10/month.


8. Secrets Rotation

View Current Secrets

# List secrets (names only, not values)
aws secretsmanager list-secrets \
--query 'SecretList[?starts_with(Name,`invapp-dev`)].{name:Name,lastChanged:LastChangedDate}' \
--output table

Rotate Redis Auth Token

Per NQU-535: update the token in both Secrets Manager and ElastiCache. Requires brief Redis restart.

# 1. Update in Secrets Manager. update-secret replaces the entire secret string,
#    so the JSON must include every existing key, not only REDIS_AUTH_TOKEN.
aws secretsmanager update-secret \
--secret-id invapp-dev/app-secrets \
--secret-string '{"REDIS_AUTH_TOKEN":"<new-token>"}'

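# 2. Update ElastiCache. ROTATE adds the new token while the old one stays valid;
#    a later call with --auth-token-update-strategy SET retires the old token.
aws elasticache modify-replication-group \
--replication-group-id invapp-dev-redis \
--auth-token <new-token> \
--auth-token-update-strategy ROTATE \
--apply-immediately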

# 3. Force new ECS deployment to pick up new token
aws ecs update-service \
--cluster invapp-dev-cluster \
--service invapp-dev-app-service \
--force-new-deployment
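
Two parts of the rotation are easy to get wrong by hand: generating a token ElastiCache will accept, and rewriting the secret without dropping its other keys. The sketch below shows one way to handle both. new_redis_token and merge_token are hypothetical helper names, and the secret is assumed to be a flat JSON object as in step 1.

```shell
# Hypothetical helper: 64 hex characters from the kernel RNG. Hex output stays
# within ElastiCache's 16 to 128 character limit and uses no special characters.
new_redis_token() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Read the current secret JSON on stdin and print it with only
# REDIS_AUTH_TOKEN replaced, so every other key is carried over unchanged.
merge_token() {
  jq -c --arg token "$1" '.REDIS_AUTH_TOKEN = $token'
}
```

The current value can be piped through the merge with aws secretsmanager get-secret-value --secret-id invapp-dev/app-secrets --query SecretString --output text | merge_token "$token", and the result passed to update-secret as --secret-string.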

9. Deployment

See deployment-flow.md for the full CI/CD pipeline. Quick reference:

Manual Deployment (Force Redeploy)

# Force new deployment with current image
aws ecs update-service \
--cluster invapp-dev-cluster \
--service invapp-dev-app-service \
--force-new-deployment

# Wait for stabilization
aws ecs wait services-stable \
--cluster invapp-dev-cluster \
--services invapp-dev-app-service

Terraform Apply

cd infrastructure/terraform/environments/dev
terraform init
terraform plan -out=plan.tfplan
terraform apply plan.tfplan

Rule: Always run terraform plan and review its output before running terraform apply.
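
When the plan step runs inside a script, terraform plan -detailed-exitcode distinguishes "no changes" from "changes to review" through its exit status. A small sketch of how that status might be surfaced; plan_result is a hypothetical helper name:

```shell
# Hypothetical helper: translate the exit status of
# `terraform plan -detailed-exitcode` into a short message.
plan_result() {
  case $1 in
    0) echo "no changes" ;;
    1) echo "plan failed" ;;
    2) echo "changes pending review" ;;
    *) echo "unexpected exit status: $1" ;;
  esac
}
```

Typical use would be terraform plan -detailed-exitcode -out=plan.tfplan; plan_result $?, with apply run only after a human has read the plan.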


10. Weekly Operational Checks

These are tracked as recurring Linear issues (NQU-586, NQU-592):

Error & Exception Review (NQU-586, weekly):

  • Check Sentry for new unhandled exceptions
  • Review CloudWatch logs for Bedrock API errors
  • Check embedding_source_status for stuck items
  • Review failed embedding jobs

Architectural Health Check (NQU-592, weekly):

  • Run architectural scan for tech debt
  • Review docs/production-blockers.md
  • Triage docs/inbox.md (items >7 days need decision)
  • Check TODOs in code

Monthly checks:

  • Security review (NQU-526)
  • Backup verification (NQU-527)
  • Dependency updates (NQU-528)
  • Code quality & lint review (NQU-559)
  • Database health & data integrity (NQU-560)

Reference

Resource              Location
Incident response     docs/admin/ops/incident-response.md
Deployment flow       docs/admin/ops/deployment-flow.md
Environment strategy  docs/admin/ops/environment-strategy.md
Migration procedures  docs/admin/ops/migrations.md
Terraform modules     infrastructure/terraform/modules/
CI/CD pipeline        .github/workflows/ci.yml
CloudWatch alarms     infrastructure/terraform/modules/cloudwatch-dashboard/main.tf