Operations Runbook

Day-to-day operational procedures for Nquiry production infrastructure. For incident response (failure modes, rollback, escalation), see incident-response.md.

Environment: Single AWS environment (invapp-dev-* prefix) serving production traffic at app.nquiry.ai. Resource rename to invapp-prod-* deferred to NQU-644 (Terraform parameterization).

1. Health Checks

Quick Status Check

# Application health
curl -s https://app.nquiry.ai/api/health | jq .

# ECS service status
aws ecs describe-services \
  --cluster invapp-dev-cluster \
  --services invapp-dev-app-service \
  --query 'services[0].{desired:desiredCount,running:runningCount,status:status,deployments:deployments[*].{status:status,running:runningCount,desired:desiredCount}}'

# RDS status
# (free storage is not returned by this call; it is the CloudWatch metric FreeStorageSpace in AWS/RDS)
aws rds describe-db-instances \
  --db-instance-identifier invapp-dev-postgres \
  --query 'DBInstances[0].{status:DBInstanceStatus,storage:AllocatedStorage,endpoint:Endpoint.Address}'

# Redis status
aws elasticache describe-replication-groups \
  --query 'ReplicationGroups[?starts_with(ReplicationGroupId,`invapp-dev`)].{id:ReplicationGroupId,status:Status,nodes:NodeGroups[0].NodeGroupMembers[*].{endpoint:ReadEndpoint.Address,role:CurrentRole}}'
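
The ECS check above can be turned into a pass/fail probe, for example for a cron job or a pre-deploy gate. This is a minimal sketch, assuming the cluster and service names used in this runbook; the function name ecs_service_healthy is hypothetical and not part of the repository.

```shell
# Sketch: succeed only when the ECS service runs at least as many tasks as desired.
# ecs_service_healthy is a hypothetical helper name; cluster and service default
# to the names used in this runbook.
ecs_service_healthy() {
  cluster="${1:-invapp-dev-cluster}"
  service="${2:-invapp-dev-app-service}"

  # --output text prints the two counts on one line, tab-separated.
  counts=$(aws ecs describe-services \
    --cluster "$cluster" \
    --services "$service" \
    --query 'services[0].[desiredCount,runningCount]' \
    --output text) || return 2

  desired=$(printf '%s\n' "$counts" | awk '{print $1}')
  running=$(printf '%s\n' "$counts" | awk '{print $2}')
  echo "desired=$desired running=$running"

  # Non-numeric values (for example "None" for a missing service) fail this test.
  [ "$running" -ge "$desired" ] 2>/dev/null
}
```

Usage: ecs_service_healthy || echo "service is below desired capacity"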

CloudWatch Alarm Status

# All alarms in ALARM state
aws cloudwatch describe-alarms \
  --state-value ALARM \
  --query 'MetricAlarms[*].{name:AlarmName,state:StateValue,reason:StateReason}' \
  --output table

Synthetics Canary Status

# Check canary state (two canaries: staging-health + prod-staging-health; repeat per canary)
aws synthetics get-canary \
  --name invapp-dev-staging-health \
  --query 'Canary.{state:Status.State,reasonCode:Status.StateReasonCode}'
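
get-canary reports the canary's lifecycle state, not the result of its most recent run. The sketch below uses aws synthetics get-canaries-last-run to list the last run state of every canary in the current account and region, so no canary names need to be supplied. The function names are hypothetical.

```shell
# Sketch: print name, last-run state and reason for every canary (tab-separated).
# LastRun.Status.State is RUNNING, PASSED or FAILED.
canary_last_runs() {
  aws synthetics get-canaries-last-run \
    --query 'CanariesLastRun[*].[CanaryName,LastRun.Status.State,LastRun.Status.StateReason]' \
    --output text
}

# Print canaries whose last run is not PASSED; return non-zero if any exist.
canary_failures() {
  failed=$(canary_last_runs | awk -F '\t' '$2 != "PASSED" {print $1 ": " $2}')
  if [ -z "$failed" ]; then
    return 0
  fi
  printf '%s\n' "$failed"
  return 1
}
```

Usage: canary_failures || echo "at least one canary is not passing"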

2. Log Access

Application Logs (CloudWatch)

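# Recent ERROR logs (last 30 minutes)
# Note: -v-30M is BSD/macOS date syntax; on GNU/Linux use $(date -u -d '30 minutes ago' +%s)000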
aws logs filter-log-events \
  --log-group-name /ecs/invapp-dev-app \
  --start-time $(date -u -v-30M +%s)000 \
  --filter-pattern "ERROR" \
  --query 'events[*].{time:timestamp,message:message}' \
  --output table

# Tail logs live
aws logs tail /ecs/invapp-dev-app --follow --since 5m
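
The --start-time value for filter-log-events is epoch milliseconds. A sketch of a helper that computes it with plain arithmetic, so it does not depend on BSD or GNU date flags. The function names minutes_ago_ms and recent_errors are hypothetical.

```shell
# Sketch: epoch milliseconds for "N minutes ago", using only `date +%s`.
minutes_ago_ms() {
  now=$(date +%s)
  echo $(( ($now - $1 * 60) * 1000 ))
}

# Print ERROR log messages from the last N minutes (default 30).
recent_errors() {
  aws logs filter-log-events \
    --log-group-name /ecs/invapp-dev-app \
    --start-time "$(minutes_ago_ms "${1:-30}")" \
    --filter-pattern "ERROR" \
    --query 'events[*].message' \
    --output text
}
```

Usage: recent_errors 60 prints the matching messages from the last hour.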

Sentry

Use the Sentry dashboard to review unhandled exceptions. The DSN is configured via the NEXT_PUBLIC_SENTRY_DSN environment variable. Sentry alerts are delivered by email.

Audit Logs (Application-Level)

Audit logs are stored in the audit_log table. Access via bastion:

# Recent audit entries
psql -h localhost -p 5433 -U app_admin -d investigation_app \
  -c "SELECT created_at, user_id, action, resource_type, success
      FROM audit_log
      ORDER BY created_at DESC
      LIMIT 20;"

3. Database Operations

Bastion Access

The bastion host provides port-forwarded access to RDS. The tunnel runs over SSM Session Manager, so no SSH keys are needed:

# Start SSM session with port forwarding
aws ssm start-session \
  --target <bastion-instance-id> \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters '{"host":["<rds-endpoint>"],"portNumber":["5432"],"localPortNumber":["5433"]}'

# Connect via local tunnel
psql -h localhost -p 5433 -U app_admin -d investigation_app
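
The two placeholders in the port-forwarding command can be resolved from the CLI. This is a sketch under one loud assumption: that the bastion instance carries a Name tag of invapp-dev-bastion. That tag value is hypothetical and should be replaced with the real one. The RDS identifier is the one used elsewhere in this runbook.

```shell
# Sketch: look up the bastion instance ID and RDS endpoint, then open the tunnel.
# ASSUMPTION: the bastion has the tag Name=invapp-dev-bastion (hypothetical value).
start_db_tunnel() {
  bastion_tag="${1:-invapp-dev-bastion}"

  bastion_id=$(aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=$bastion_tag" "Name=instance-state-name,Values=running" \
    --query 'Reservations[0].Instances[0].InstanceId' \
    --output text) || return 1

  rds_host=$(aws rds describe-db-instances \
    --db-instance-identifier invapp-dev-postgres \
    --query 'DBInstances[0].Endpoint.Address' \
    --output text) || return 1

  # The CLI prints "None" when the query matches nothing.
  if [ -z "$bastion_id" ] || [ "$bastion_id" = "None" ]; then
    echo "no running bastion found for tag $bastion_tag" >&2
    return 1
  fi

  aws ssm start-session \
    --target "$bastion_id" \
    --document-name AWS-StartPortForwardingSessionToRemoteHost \
    --parameters "{\"host\":[\"$rds_host\"],\"portNumber\":[\"5432\"],\"localPortNumber\":[\"5433\"]}"
}
```

Usage: run start_db_tunnel in one terminal, then connect with psql on localhost:5433 from another.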

Running Migrations

# Via ECS exec (preferred — runs in the application context)
aws ecs execute-command \
  --cluster invapp-dev-cluster \
  --task <task-id> \
  --container app \
  --interactive \
  --command "npm run db:migrate"

# Check migration status
psql -h localhost -p 5433 -U app_admin -d investigation_app \
  -c "SELECT filename, executed_at FROM _migrations ORDER BY executed_at DESC LIMIT 10;"
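
The task ID placeholder for execute-command can be looked up rather than copied from the console. A minimal sketch, assuming any running task of the service is acceptable for running the migration; first_task_id is a hypothetical helper name.

```shell
# Sketch: print the ID of one running task of the app service.
first_task_id() {
  arn=$(aws ecs list-tasks \
    --cluster invapp-dev-cluster \
    --service-name invapp-dev-app-service \
    --desired-status RUNNING \
    --query 'taskArns[0]' \
    --output text) || return 1

  # The CLI prints "None" when no task matches.
  if [ -z "$arn" ] || [ "$arn" = "None" ]; then
    echo "no running task found" >&2
    return 1
  fi

  # The task ID is the last path segment of the task ARN.
  echo "${arn##*/}"
}
```

Usage: pass "$(first_task_id)" as the --task argument of aws ecs execute-command.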

Database Size and Growth

psql -h localhost -p 5433 -U app_admin -d investigation_app \
  -c "SELECT pg_size_pretty(pg_database_size('investigation_app')) as db_size;"

# Table sizes
psql -h localhost -p 5433 -U app_admin -d investigation_app \
  -c "SELECT relname as table, pg_size_pretty(pg_total_relation_size(relid)) as size
      FROM pg_catalog.pg_statio_user_tables
      ORDER BY pg_total_relation_size(relid) DESC
      LIMIT 15;"

Manual RDS Snapshot (Pre-Migration)

Required before running migrations in production (per staging-to-production ADR Decision 3):

aws rds create-db-snapshot \
  --db-instance-identifier invapp-dev-postgres \
  --db-snapshot-identifier "pre-migration-$(date +%Y%m%d-%H%M%S)"
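
create-db-snapshot returns as soon as the snapshot is requested, not when it is complete. A sketch that blocks until the snapshot is available before a migration starts, using the aws rds wait db-snapshot-available waiter; snapshot_and_wait is a hypothetical helper name.

```shell
# Sketch: create the pre-migration snapshot and wait until it is usable.
# Prints the snapshot identifier on success.
snapshot_and_wait() {
  snap_id="pre-migration-$(date +%Y%m%d-%H%M%S)"

  aws rds create-db-snapshot \
    --db-instance-identifier invapp-dev-postgres \
    --db-snapshot-identifier "$snap_id" >/dev/null || return 1

  # Polls until the snapshot status is "available"; fails if the waiter times out.
  aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$snap_id" || return 1

  echo "$snap_id"
}
```

Usage: snapshot_and_wait && echo "safe to run migrations"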

4. Embedding Worker

The embedding worker processes evidence files into vector embeddings (Titan V2) for RAG retrieval.

Check Worker Status

# Pending embedding items
psql -h localhost -p 5433 -U app_admin -d investigation_app \
  -c "SELECT processing_status, count(*)
      FROM embedding_source_status
      GROUP BY processing_status
      ORDER BY count(*) DESC;"

# Oldest pending items (check for stalled processing)
psql -h localhost -p 5433 -U app_admin -d investigation_app \
  -c "SELECT source_id, source_type, investigation_id, processing_status,
             processing_started_at, created_at
      FROM embedding_source_status
      WHERE processing_status = 'pending'
      ORDER BY created_at ASC
      LIMIT 10;"

Diagnosing Stalled Embeddings

If items are stuck in pending or processing for >1 hour:

Check if the items are from demo/seed organizations vs. real users (NQU-651 context)
Check ECS task logs for embedding errors
Check Bedrock Titan V2 quota — throttling can stall the worker

# Check which orgs have pending items (joins via investigation → organization)
psql -h localhost -p 5433 -U app_admin -d investigation_app \
  -c "SELECT o.name, count(*) as pending
      FROM embedding_source_status ess
      JOIN investigation i  ON ess.investigation_id = i.investigation_id
      JOIN organization  o  ON i.organization_id = o.organization_id
      WHERE ess.processing_status = 'pending'
      GROUP BY o.name
      ORDER BY pending DESC;"
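
The one-hour threshold described above can be applied directly in SQL. This sketch assumes the bastion tunnel is open on port 5433 and that in-flight items use the status value processing (inferred from the text above, not confirmed against the schema). stalled_embeddings is a hypothetical helper name.

```shell
# Sketch: list items pending or processing for longer than a threshold (default 1 hour).
# ASSUMPTION: in-flight rows have processing_status = 'processing'.
stalled_embeddings() {
  threshold="${1:-1 hour}"

  # The query is fed on stdin so psql expands the :'threshold' variable
  # (variable interpolation is not performed for -c commands).
  psql -h localhost -p 5433 -U app_admin -d investigation_app \
    -v threshold="$threshold" <<'SQL'
SELECT source_id, source_type, processing_status, processing_started_at, created_at
FROM embedding_source_status
WHERE (processing_status = 'processing'
       AND processing_started_at < now() - :'threshold'::interval)
   OR (processing_status = 'pending'
       AND created_at < now() - :'threshold'::interval)
ORDER BY created_at ASC;
SQL
}
```

Usage: stalled_embeddings "2 hours"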

5. User Management (Cognito)

List Users

aws cognito-idp list-users \
  --user-pool-id <pool-id> \
  --query 'Users[*].{email:Attributes[?Name==`email`].Value|[0],status:UserStatus,created:UserCreateDate,enabled:Enabled}' \
  --output table
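
For a large pool it is usually quicker to look up one user than to list everyone. A sketch using the list-users --filter option with an exact email match; find_user_by_email is a hypothetical helper name and the pool ID is passed in by the caller.

```shell
# Sketch: print username, status and enabled flag for the user with a given email.
find_user_by_email() {
  pool_id="$1"
  email="$2"

  aws cognito-idp list-users \
    --user-pool-id "$pool_id" \
    --filter "email = \"$email\"" \
    --query 'Users[*].[Username,UserStatus,Enabled]' \
    --output text
}
```

Usage: find_user_by_email <pool-id> user@example.com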

Disable/Enable User

# Disable
aws cognito-idp admin-disable-user \
  --user-pool-id <pool-id> \
  --username <email>

# Enable
aws cognito-idp admin-enable-user \
  --user-pool-id <pool-id> \
  --username <email>

Reset User Password

aws cognito-idp admin-set-user-password \
  --user-pool-id <pool-id> \
  --username <email> \
  --password <new-password> \
  --permanent

Reset MFA

aws cognito-idp admin-set-user-mfa-preference \
  --user-pool-id <pool-id> \
  --username <email> \
  --software-token-mfa-settings Enabled=false

6. S3 Evidence Storage

Check Storage Usage

# Total bucket size
aws s3 ls s3://investigation-app-dev-760007728097 --recursive --summarize | tail -2

# Per-org storage (top consumers); assumes each object key starts with an org-level prefix
aws s3 ls s3://investigation-app-dev-760007728097/ --recursive \
  | awk '{split($4, part, "/"); sum[part[1]] += $3} END {for (k in sum) print sum[k], k}' \
  | sort -rn | head -10

Archive/Lifecycle Status

S3 lifecycle policy tags investigations closed >90 days for Glacier transition:

# Check lifecycle rules
aws s3api get-bucket-lifecycle-configuration \
  --bucket investigation-app-dev-760007728097 \
  --query 'Rules[*].{id:ID,status:Status,transitions:Transitions}'

7. Cost Monitoring

Current Month Spend

# End is exclusive and must be later than Start, so this command fails on the 1st of the month
aws ce get-cost-and-usage \
  --time-period Start=$(date -u +%Y-%m-01),End=$(date -u +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[0].Groups[*].{service:Keys[0],cost:Metrics.UnblendedCost.Amount}' \
  --output table
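
Because Cost Explorer treats End as exclusive and requires Start to be earlier than End, the month-to-date period is empty on the first day of a month. A sketch of a helper that builds the --time-period value and refuses to produce an invalid one; month_to_date_period is a hypothetical name.

```shell
# Sketch: print "Start=YYYY-MM-01,End=YYYY-MM-DD" for the current month.
# An optional YYYY-MM-DD argument overrides today's date.
month_to_date_period() {
  today="${1:-$(date -u +%Y-%m-%d)}"
  start="${today%-*}-01"   # strip the day, then append 01

  if [ "$today" = "$start" ]; then
    echo "no completed day in this month yet" >&2
    return 1
  fi
  echo "Start=$start,End=$today"
}
```

Usage: aws ce get-cost-and-usage --time-period "$(month_to_date_period)" with the remaining options unchanged.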

Budget Alerts

Budget alerts configured via Terraform (NQU-588). Alerts fire at 80%, 90%, and 100% of monthly budget.

CloudWatch Free Tier

CloudWatch has a 10-alarm free tier limit. Current alarm count exceeds this (NQU-638 tracking). Expected overage cost: ~$5-10/month.

8. Secrets Rotation

View Current Secrets

# List secrets (names only, not values)
aws secretsmanager list-secrets \
  --query 'SecretList[?starts_with(Name,`invapp-dev`)].{name:Name,lastChanged:LastChangedDate}' \
  --output table

Rotate Redis Auth Token

Per NQU-535: update the token in both Secrets Manager and ElastiCache. Requires brief Redis restart.

# 1. Update in Secrets Manager
#    update-secret replaces the entire secret string, so the JSON must include all other existing keys
aws secretsmanager update-secret \
  --secret-id invapp-dev/app-secrets \
  --secret-string '{"REDIS_AUTH_TOKEN":"<new-token>"}'

# 2. Update ElastiCache (ROTATE keeps the old token valid alongside the new one)
aws elasticache modify-replication-group \
  --replication-group-id invapp-dev-redis \
  --auth-token <new-token> \
  --auth-token-update-strategy ROTATE \
  --apply-immediately

# 3. Force new ECS deployment to pick up new token
aws ecs update-service \
  --cluster invapp-dev-cluster \
  --service invapp-dev-app-service \
  --force-new-deployment
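
Two parts of the rotation are easy to get wrong: keeping the other keys in the secret, and retiring the old token once every client uses the new one. The sketch below assumes jq is installed and that the secret is a flat JSON object; both function names are hypothetical. The SET strategy is the ElastiCache option that makes the given token the only valid one.

```shell
# Sketch: print the current secret JSON with only REDIS_AUTH_TOKEN replaced.
merged_secret_json() {
  new_token="$1"
  aws secretsmanager get-secret-value \
    --secret-id invapp-dev/app-secrets \
    --query SecretString \
    --output text \
    | jq -c --arg token "$new_token" '.REDIS_AUTH_TOKEN = $token'
}

# Sketch: after the new ECS deployment is stable, drop the old token.
finalize_redis_token() {
  aws elasticache modify-replication-group \
    --replication-group-id invapp-dev-redis \
    --auth-token "$1" \
    --auth-token-update-strategy SET \
    --apply-immediately
}

# Example flow (not executed here):
#   token=$(openssl rand -hex 32)
#   aws secretsmanager update-secret --secret-id invapp-dev/app-secrets \
#     --secret-string "$(merged_secret_json "$token")"
```

Run finalize_redis_token only after aws ecs wait services-stable has returned, otherwise tasks still holding the old token lose their connection.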

9. Deployment

See deployment-flow.md for the full CI/CD pipeline. Quick reference:

Manual Deployment (Force Redeploy)

# Force new deployment with current image
aws ecs update-service \
  --cluster invapp-dev-cluster \
  --service invapp-dev-app-service \
  --force-new-deployment

# Wait for stabilization
aws ecs wait services-stable \
  --cluster invapp-dev-cluster \
  --services invapp-dev-app-service

Terraform Apply

cd infrastructure/terraform/environments/dev
terraform init
terraform plan -out=plan.tfplan
terraform apply plan.tfplan

Rule: Never run terraform apply without first running terraform plan and reviewing its output.
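
The review step can be made harder to skip by checking whether the plan contains changes at all. A sketch using terraform plan -detailed-exitcode, which exits 0 for no changes, 1 for an error, and 2 when changes are present; plan_has_changes is a hypothetical helper name.

```shell
# Sketch: write plan.tfplan and report whether it contains changes.
# Returns 0 when changes need review, 1 when there is nothing to apply, 2 on error.
plan_has_changes() {
  rc=0
  terraform plan -detailed-exitcode -out=plan.tfplan >/dev/null || rc=$?

  case "$rc" in
    0) echo "no changes"; return 1 ;;
    2) echo "changes pending review"; return 0 ;;
    *) echo "plan failed" >&2; return 2 ;;
  esac
}
```

Usage: plan_has_changes && terraform show plan.tfplan prints the saved plan for review before terraform apply plan.tfplan.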

10. Weekly Operational Checks

These are tracked as recurring Linear issues (NQU-586, NQU-592):

Error & Exception Review (NQU-586, weekly):

Check Sentry for new unhandled exceptions
Review CloudWatch logs for Bedrock API errors
Check embedding_source_status for stuck items
Review failed embedding jobs

Architectural Health Check (NQU-592, weekly):

Run architectural scan for tech debt
Review docs/production-blockers.md
Triage docs/inbox.md (items >7 days need decision)
Check TODOs in code

Monthly checks:

Security review (NQU-526)
Backup verification (NQU-527)
Dependency updates (NQU-528)
Code quality & lint review (NQU-559)
Database health & data integrity (NQU-560)

Reference

Resource	Location
Incident response	`docs/admin/ops/incident-response.md`
Deployment flow	`docs/admin/ops/deployment-flow.md`
Environment strategy	`docs/admin/ops/environment-strategy.md`
Migration procedures	`docs/admin/ops/migrations.md`
Terraform modules	`infrastructure/terraform/modules/`
CI/CD pipeline	`.github/workflows/ci.yml`
CloudWatch alarms	`infrastructure/terraform/modules/cloudwatch-dashboard/main.tf`
