Operations Runbook
Day-to-day operational procedures for Nquiry production infrastructure. For incident response (failure modes, rollback, escalation), see incident-response.md.
Environment: Single AWS environment (invapp-dev-* prefix) serving production traffic at app.nquiry.ai. Resource rename to invapp-prod-* deferred to NQU-644 (Terraform parameterization).
1. Health Checks
Quick Status Check
# Application health
curl -s https://app.nquiry.ai/api/health | jq .
# ECS service status
aws ecs describe-services \
--cluster invapp-dev-cluster \
--services invapp-dev-app-service \
--query 'services[0].{desired:desiredCount,running:runningCount,status:status,deployments:deployments[*].{status:status,running:runningCount,desired:desiredCount}}'
# RDS status
aws rds describe-db-instances \
--db-instance-identifier invapp-dev-postgres \
--query 'DBInstances[0].{status:DBInstanceStatus,storageGiB:AllocatedStorage,endpoint:Endpoint.Address}'
# Free storage is not returned by this call; read the CloudWatch metric FreeStorageSpace (AWS/RDS namespace) instead
# Redis status
aws elasticache describe-replication-groups \
--query 'ReplicationGroups[?starts_with(ReplicationGroupId,`invapp-dev`)].{id:ReplicationGroupId,status:Status,nodes:NodeGroups[0].NodeGroupMembers[*].{endpoint:ReadEndpoint.Address,role:CurrentRole}}'
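For scripted checks, the desired/running comparison from the ECS query above can be reduced to an exit code. The following is a minimal sketch, assuming the same cluster and service names; the function names ecs_counts_match and check_ecs_service are illustrative and not part of any existing tooling.

```shell
# Hypothetical helpers; names are illustrative only.

# Succeeds (exit 0) only when at least one task is desired
# and the running count equals the desired count.
ecs_counts_match() {
  local desired="$1" running="$2"
  [ "$desired" -gt 0 ] && [ "$desired" -eq "$running" ]
}

# Fetches "desired<TAB>running" from ECS and compares the two numbers.
check_ecs_service() {
  local counts
  counts=$(aws ecs describe-services \
    --cluster invapp-dev-cluster \
    --services invapp-dev-app-service \
    --query 'services[0].[desiredCount,runningCount]' \
    --output text) || return 2
  # Intentional word splitting: text output is two tab-separated numbers.
  # shellcheck disable=SC2086
  ecs_counts_match $counts
}

# Usage:
# check_ecs_service && echo "ECS OK" || echo "ECS degraded"
```

A non-zero exit also covers the case where the service is missing, since the text output is then not numeric.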
CloudWatch Alarm Status
# All alarms in ALARM state
aws cloudwatch describe-alarms \
--state-value ALARM \
--query 'MetricAlarms[*].{name:AlarmName,state:StateValue,reason:StateReason}' \
--output table
Synthetics Canary Status
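# Check canary state (two canaries: staging-health + prod-staging-health)
aws synthetics get-canary \
--name invapp-dev-staging-health \
--query 'Canary.{state:Status.State,reasonCode:Status.StateReasonCode}'
# Result of the most recent run
aws synthetics get-canary-runs \
--name invapp-dev-staging-health \
--max-results 1 \
--query 'CanaryRuns[0].{state:Status.State,completed:Timeline.Completed}'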
2. Log Access
Application Logs (CloudWatch)
# Recent logs (last 30 minutes)
aws logs filter-log-events \
--log-group-name /ecs/invapp-dev-app \
--start-time $(( ($(date +%s) - 1800) * 1000 )) \
--filter-pattern "ERROR" \
--query 'events[*].{time:timestamp,message:message}' \
--output table
# Tail logs live
aws logs tail /ecs/invapp-dev-app --follow --since 5m
Sentry
Access Sentry dashboard for unhandled exceptions. DSN configured via NEXT_PUBLIC_SENTRY_DSN environment variable. Check email for Sentry alerts.
Audit Logs (Application-Level)
Audit logs are stored in the audit_log table. Query it through the bastion tunnel (see Bastion Access in section 3):
# Recent audit entries
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT created_at, user_id, action, resource_type, success
FROM audit_log
ORDER BY created_at DESC
LIMIT 20;"
3. Database Operations
Bastion Access
The bastion host provides port-forwarded access to RDS through SSM Session Manager (no SSH keys needed):
# Start SSM session with port forwarding
aws ssm start-session \
--target <bastion-instance-id> \
--document-name AWS-StartPortForwardingSessionToRemoteHost \
--parameters '{"host":["<rds-endpoint>"],"portNumber":["5432"],"localPortNumber":["5433"]}'
# Connect via local tunnel
psql -h localhost -p 5433 -U app_admin -d investigation_app
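The JSON passed to --parameters is easy to mistype by hand. As a sketch, a small helper can build it from the endpoint and ports; ssm_port_forward_params, BASTION_ID, and RDS_ENDPOINT are hypothetical names introduced here, and the default ports simply mirror the values used above.

```shell
# Hypothetical helper: prints the --parameters JSON for
# AWS-StartPortForwardingSessionToRemoteHost.
# Arguments: <host> [remote-port, default 5432] [local-port, default 5433]
ssm_port_forward_params() {
  local host="$1" remote_port="${2:-5432}" local_port="${3:-5433}"
  printf '{"host":["%s"],"portNumber":["%s"],"localPortNumber":["%s"]}\n' \
    "$host" "$remote_port" "$local_port"
}

# Usage (BASTION_ID is a placeholder you must set yourself):
# RDS_ENDPOINT=$(aws rds describe-db-instances \
#   --db-instance-identifier invapp-dev-postgres \
#   --query 'DBInstances[0].Endpoint.Address' --output text)
# aws ssm start-session \
#   --target "$BASTION_ID" \
#   --document-name AWS-StartPortForwardingSessionToRemoteHost \
#   --parameters "$(ssm_port_forward_params "$RDS_ENDPOINT")"
```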
Running Migrations
# Via ECS exec (preferred — runs in the application context)
aws ecs execute-command \
--cluster invapp-dev-cluster \
--task <task-id> \
--container app \
--interactive \
--command "npm run db:migrate"
# Check migration status
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT filename, executed_at FROM _migrations ORDER BY executed_at DESC LIMIT 10;"
Database Size and Growth
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT pg_size_pretty(pg_database_size('investigation_app')) as db_size;"
# Table sizes
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT relname as table, pg_size_pretty(pg_total_relation_size(relid)) as size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 15;"
Manual RDS Snapshot (Pre-Migration)
Required before running migrations in production (per staging-to-production ADR Decision 3):
aws rds create-db-snapshot \
--db-instance-identifier invapp-dev-postgres \
--db-snapshot-identifier "pre-migration-$(date +%Y%m%d-%H%M%S)"
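create-db-snapshot returns while the snapshot is still being created, so a migration started right afterwards may run before the backup is usable. One possible way to close that gap is to block on the snapshot-available waiter. This is a sketch only: the function names are hypothetical, and it uses a UTC timestamp rather than local time.

```shell
# Hypothetical helper: builds a snapshot identifier with a UTC timestamp.
pre_migration_snapshot_id() {
  printf 'pre-migration-%s\n' "$(date -u +%Y%m%d-%H%M%S)"
}

# Creates the snapshot, then blocks until RDS reports it as available.
# Prints the snapshot identifier on success.
snapshot_before_migration() {
  local id
  id=$(pre_migration_snapshot_id)
  aws rds create-db-snapshot \
    --db-instance-identifier invapp-dev-postgres \
    --db-snapshot-identifier "$id" >/dev/null || return 1
  aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$id" || return 1
  echo "$id"
}

# Usage:
# snapshot_before_migration && echo "safe to migrate"
```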
4. Embedding Worker
The embedding worker processes evidence files into vector embeddings (Titan V2) for RAG retrieval.
Check Worker Status
# Pending embedding items
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT processing_status, count(*)
FROM embedding_source_status
GROUP BY processing_status
ORDER BY count(*) DESC;"
# Oldest pending items (check for stalled processing)
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT source_id, source_type, investigation_id, processing_status,
processing_started_at, created_at
FROM embedding_source_status
WHERE processing_status = 'pending'
ORDER BY created_at ASC
LIMIT 10;"
Diagnosing Stalled Embeddings
If items are stuck in pending or processing for >1 hour:
- Check if the items are from demo/seed organizations vs. real users (NQU-651 context)
- Check ECS task logs for embedding errors
- Check Bedrock Titan V2 quota — throttling can stall the worker
# Check which orgs have pending items (joins via investigation → organization)
psql -h localhost -p 5433 -U app_admin -d investigation_app \
-c "SELECT o.name, count(*) as pending
FROM embedding_source_status ess
JOIN investigation i ON ess.investigation_id = i.investigation_id
JOIN organization o ON i.organization_id = o.organization_id
WHERE ess.processing_status = 'pending'
GROUP BY o.name
ORDER BY pending DESC;"
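The ">1 hour" rule above can be expressed directly in SQL. The sketch below generates such a query; stalled_embeddings_sql is a hypothetical helper, and it assumes that 'processing' is the literal status value for in-flight items and that processing_started_at is NULL until work begins.

```shell
# Hypothetical helper: prints SQL listing items that have been pending or
# processing for longer than N minutes (default 60).
stalled_embeddings_sql() {
  local minutes="${1:-60}"
  # Accept digits only so the value is safe to embed in SQL.
  case "$minutes" in
    ''|*[!0-9]*) echo "minutes must be a non-negative integer" >&2; return 1 ;;
  esac
  cat <<SQL
SELECT source_id, source_type, investigation_id, processing_status,
       processing_started_at, created_at
FROM embedding_source_status
WHERE processing_status IN ('pending', 'processing')
  AND COALESCE(processing_started_at, created_at) < now() - interval '${minutes} minutes'
ORDER BY created_at ASC;
SQL
}

# Usage (through the bastion tunnel):
# psql -h localhost -p 5433 -U app_admin -d investigation_app \
#   -c "$(stalled_embeddings_sql 60)"
```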
5. User Management (Cognito)
List Users
aws cognito-idp list-users \
--user-pool-id <pool-id> \
--query 'Users[*].{email:Attributes[?Name==`email`].Value|[0],status:UserStatus,created:UserCreateDate,enabled:Enabled}' \
--output table
Disable/Enable User
# Disable
aws cognito-idp admin-disable-user \
--user-pool-id <pool-id> \
--username <email>
# Enable
aws cognito-idp admin-enable-user \
--user-pool-id <pool-id> \
--username <email>
Reset User Password
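# With --permanent the password takes effect as-is; omit it to set a temporary password the user must change at next sign-in
aws cognito-idp admin-set-user-password \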
--user-pool-id <pool-id> \
--username <email> \
--password <new-password> \
--permanent
Reset MFA
aws cognito-idp admin-set-user-mfa-preference \
--user-pool-id <pool-id> \
--username <email> \
--software-token-mfa-settings Enabled=false
6. S3 Evidence Storage
Check Storage Usage
# Total bucket size
aws s3 ls s3://investigation-app-dev-760007728097 --recursive --summarize | tail -2
# Per-org storage (top consumers); assumes the org ID is the first key segment
aws s3 ls s3://investigation-app-dev-760007728097/ --recursive \
| awk '{split($4,p,"/"); sum[p[1]]+=$3} END {for(k in sum) print sum[k], k}' \
| sort -rn | head -10
Archive/Lifecycle Status
S3 lifecycle policy tags investigations closed >90 days for Glacier transition:
# Check lifecycle rules
aws s3api get-bucket-lifecycle-configuration \
--bucket investigation-app-dev-760007728097 \
--query 'Rules[*].{id:ID,status:Status,transitions:Transitions}'
7. Cost Monitoring
Current Month Spend
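# End is exclusive and must be later than Start: the result covers the month through yesterday, and the call fails on the 1st
aws ce get-cost-and-usage \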
--time-period Start=$(date -u +%Y-%m-01),End=$(date -u +%Y-%m-%d) \
--granularity MONTHLY \
--metrics UnblendedCost \
--group-by Type=DIMENSION,Key=SERVICE \
--query 'ResultsByTime[0].Groups[*].{service:Keys[0],cost:Metrics.UnblendedCost.Amount}' \
--output table
Budget Alerts
Budget alerts configured via Terraform (NQU-588). Alerts fire at 80%, 90%, and 100% of monthly budget.
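As a worked example of the threshold arithmetic, the sketch below reports which alert levels a given spend has crossed. It assumes the thresholds apply to actual month-to-date spend; budget_thresholds_crossed is a hypothetical helper, and the amounts in the usage lines are made-up values, not the real budget.

```shell
# Hypothetical helper: prints which alert thresholds (80, 90, 100 percent)
# a spend amount has reached for a given monthly budget.
budget_thresholds_crossed() {
  local spend="$1" budget="$2"
  awk -v s="$spend" -v b="$budget" 'BEGIN {
    if (b <= 0) { print "budget must be > 0" > "/dev/stderr"; exit 1 }
    n = split("80 90 100", t, " ")
    out = ""
    for (i = 1; i <= n; i++)
      if (s * 100 >= t[i] * b) out = out (out == "" ? "" : " ") t[i]
    print (out == "" ? "none" : out)
  }'
}

# Usage (illustrative amounts only):
# budget_thresholds_crossed 85 100    # prints: 80
# budget_thresholds_crossed 100 100   # prints: 80 90 100
```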
CloudWatch Free Tier
CloudWatch has a 10-alarm free tier limit. Current alarm count exceeds this (NQU-638 tracking). Expected overage cost: ~$5-10/month.
8. Secrets Rotation
View Current Secrets
# List secrets (names only, not values)
aws secretsmanager list-secrets \
--query 'SecretList[?starts_with(Name,`invapp-dev`)].{name:Name,lastChanged:LastChangedDate}' \
--output table
Rotate Redis Auth Token
Per NQU-535: update the token in both Secrets Manager and ElastiCache. Requires brief Redis restart.
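# 1. Update in Secrets Manager (update-secret overwrites the whole secret string,
#    so the JSON must include every existing key, not only REDIS_AUTH_TOKEN)
aws secretsmanager update-secret \
--secret-id invapp-dev/app-secrets \
--secret-string '{"REDIS_AUTH_TOKEN":"<new-token>"}'
# 2. Update ElastiCache (ROTATE keeps the old token valid alongside the new one)
aws elasticache modify-replication-group \
--replication-group-id invapp-dev-redis \
--auth-token <new-token> \
--auth-token-update-strategy ROTATE \
--apply-immediately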
# 3. Force new ECS deployment to pick up new token
aws ecs update-service \
--cluster invapp-dev-cluster \
--service invapp-dev-app-service \
--force-new-deployment
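With the ROTATE strategy, ElastiCache keeps accepting the previous token alongside the new one until a follow-up call with the SET strategy retires it. NQU-535 may already cover this step; if not, a sketch of the follow-up might look like the following. The run and finalize_redis_token helpers and the DRY_RUN variable are hypothetical conveniences, not existing tooling.

```shell
# Hypothetical wrapper: prints the command when DRY_RUN=1, runs it otherwise.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Retires the previous token so that only the new one is accepted.
# Run this only after every ECS task has picked up the new token.
finalize_redis_token() {
  local new_token="$1" shown
  [ -n "$new_token" ] || { echo "usage: finalize_redis_token <new-token>" >&2; return 1; }
  shown="$new_token"
  # Never echo the real token in dry-run output.
  if [ "${DRY_RUN:-0}" = "1" ]; then shown="<redacted>"; fi
  run aws elasticache modify-replication-group \
    --replication-group-id invapp-dev-redis \
    --auth-token "$shown" \
    --auth-token-update-strategy SET \
    --apply-immediately
}

# Usage:
# DRY_RUN=1 finalize_redis_token "<new-token>"   # preview the command
# finalize_redis_token "<new-token>"             # apply it
```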
9. Deployment
See deployment-flow.md for the full CI/CD pipeline. Quick reference:
Manual Deployment (Force Redeploy)
# Force new deployment with current image
aws ecs update-service \
--cluster invapp-dev-cluster \
--service invapp-dev-app-service \
--force-new-deployment
# Wait for stabilization
aws ecs wait services-stable \
--cluster invapp-dev-cluster \
--services invapp-dev-app-service
Terraform Apply
cd infrastructure/terraform/environments/dev
terraform init
terraform plan -out=plan.tfplan
terraform apply plan.tfplan
Rule: Always run terraform plan first and review the output. Never run terraform apply without reviewing the plan.
10. Weekly Operational Checks
These are tracked as recurring Linear issues (NQU-586, NQU-592):
Error & Exception Review (NQU-586, weekly):
- Check Sentry for new unhandled exceptions
- Review CloudWatch logs for Bedrock API errors
- Check embedding_source_status for stuck items
- Review failed embedding jobs
Architectural Health Check (NQU-592, weekly):
- Run architectural scan for tech debt
- Review docs/production-blockers.md
- Triage docs/inbox.md (items >7 days need decision)
- Check TODOs in code
Monthly checks:
- Security review (NQU-526)
- Backup verification (NQU-527)
- Dependency updates (NQU-528)
- Code quality & lint review (NQU-559)
- Database health & data integrity (NQU-560)
Reference
| Resource | Location |
|---|---|
| Incident response | docs/admin/ops/incident-response.md |
| Deployment flow | docs/admin/ops/deployment-flow.md |
| Environment strategy | docs/admin/ops/environment-strategy.md |
| Migration procedures | docs/admin/ops/migrations.md |
| Terraform modules | infrastructure/terraform/modules/ |
| CI/CD pipeline | .github/workflows/ci.yml |
| CloudWatch alarms | infrastructure/terraform/modules/cloudwatch-dashboard/main.tf |