Test Environment Spin-up Runbook (JE-Vectors-test)
End-to-end procedure for spinning the JE-Vectors-test customer-greenfield environment up on demand, using it for validation / load tests / customer-install repro, and tearing it back down.
Who runs this. JE Vectors release engineering. The same customer-greenfield env doubles as our release-validation rig (see release-validation runbook) and as a scratch env any engineer can spin to reproduce a customer-side issue. This runbook is the spin/tear procedure; choose a workload doc from §Use for what to do in between.
When to run it
- Load tests. NQU-809 Gate 3 load test (10 VUs / 100 analyses) — first user of this runbook.
- Customer-install repro. When a customer-side install hits a wall and the reproduction needs the actual customer-greenfield shape, not the dev env.
- Release validation. The release-validation runbook drives the same env through a full install → upgrade → rollback cycle. If you're cutting a release, use that runbook; this one is for the "stand it up and use it" workflow without the validation gates.
Persistent infra that survives between runs. Route53 zone greenfield.nquir.ai (Z02645003FT7M14Q92EJ3) + ACM cert (d5f34171-428a-4752-ba9e-ee68372fb296) + the terraform state bucket persist between spins. Cost ~$0.50/mo idle. Do not destroy them.
Cost
Heavy resources (RDS, NAT gateway, ECS Fargate tasks, ALB, CloudFront) run only during the activity window. Single-day spin + use + tear ≈ $3–5. Persistent infra (Route53 zone + ACM cert + state bucket) ≈ $0.50/mo idle.
Bedrock + Anthropic API calls billed per-invocation are additional and depend on the workload. The Gate 3 load test estimate is in NQU-809.
Prerequisites
Confirm each item before starting. If any fails, fix it before proceeding.
AWS access
- aws sso login --profile je-vectors-test succeeds.
- AWS_PROFILE=je-vectors-test aws sts get-caller-identity returns account 961381384763.
- You have admin-or-equivalent in the JE-Vectors-test account.
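The account check above can be scripted so that a spin aborts early when the wrong profile is active. The following is a minimal sketch, not part of the repo's tooling; the function name check_account is hypothetical, and the only fact taken from this runbook is the expected account ID.

```shell
# Expected account ID, taken from the prerequisite above.
EXPECTED_ACCOUNT="961381384763"

# check_account ACTUAL_ID   (hypothetical helper)
# Succeeds only when the caller's account matches the test account.
check_account() {
  if [ "$1" = "$EXPECTED_ACCOUNT" ]; then
    echo "OK: running against JE-Vectors-test ($1)"
  else
    echo "ABORT: expected account $EXPECTED_ACCOUNT, got '${1:-<empty>}'" >&2
    return 1
  fi
}

# Usage (requires a valid SSO session):
#   check_account "$(aws sts get-caller-identity --query Account --output text)"
```

Passing the ID in as an argument keeps the comparison separate from the AWS call, so the helper can be exercised without credentials.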
Persistent infra still present
The DNS + ACM bootstrap from NQU-729 Phase A is supposed to survive between spins. Verify:
export AWS_PROFILE=je-vectors-test
aws route53 list-hosted-zones --query 'HostedZones[?Name==`greenfield.nquir.ai.`]'
aws acm describe-certificate \
--certificate-arn arn:aws:acm:us-east-1:961381384763:certificate/d5f34171-428a-4752-ba9e-ee68372fb296 \
--query 'Certificate.{Status:Status,Domain:DomainName}'
Expected: hosted zone present, cert ISSUED for greenfield.nquir.ai. If either is missing, re-bootstrap from the NQU-729 Phase A inline procedure before continuing — it's a ~10-minute one-off.
Test environment hygiene
- No prior customer-greenfield stack exists in JE-Vectors-test:
  cd infrastructure/terraform/environments/customer-greenfield
  terraform init -input=false
  terraform show 2>&1 | grep -q "no state" || echo "STATE PRESENT - tear before spin"
  If state is present from a prior run, run §Tear first.
- RDS snapshot quota in us-east-1 has headroom. A typical spin produces 0–2 snapshots (1 final snapshot if is_ephemeral=false, 0 if is_ephemeral=true; see NQU-802). Check:
  aws rds describe-account-attributes --region us-east-1 \
    --query 'AccountQuotas[?AccountQuotaName==`ManualClusterSnapshots` || AccountQuotaName==`ManualSnapshots`]'
- No leftover orphaned log groups (rare; only after a failed prior destroy):
  aws logs describe-log-groups --region us-east-1 \
    --log-group-name-prefix /aws/vpc/nquiry-greenfield- \
    --query 'logGroups[].logGroupName'
  Expected: empty array. If non-empty, run aws logs delete-log-group --log-group-name <name> on each before continuing.
- No orphan nquiry-greenfield/app-secrets scheduled for deletion (NQU-822: Secrets Manager keeps the name reserved for the deletion window, and terraform apply will fail to recreate the secret until it's force-purged):
  aws secretsmanager describe-secret --secret-id nquiry-greenfield/app-secrets --region us-east-1 \
    --query 'DeletedDate' --output text 2>/dev/null
  Expected: None or a "secret not found" error. If a timestamp prints, force-purge before continuing:
  aws secretsmanager delete-secret --secret-id nquiry-greenfield/app-secrets \
    --force-delete-without-recovery --region us-east-1
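The three possible outcomes of the app-secrets check (no output, None, or a timestamp) are easy to misread in a hurry. As a sketch only, the helper below maps the command's stdout to a label; classify_secret_state and its labels are hypothetical names. Because stderr is discarded in the check, any failed call (including expired credentials) looks the same as a missing secret, so this assumes the AWS access prerequisite has already passed.

```shell
# classify_secret_state DESCRIBE_OUTPUT   (hypothetical helper)
# DESCRIBE_OUTPUT is the stdout of the describe-secret check, stderr discarded.
#   ""     -> the call failed; assumed to mean the secret does not exist
#   "None" -> the secret exists and has no DeletedDate
#   other  -> a DeletedDate timestamp: the name is still reserved
classify_secret_state() {
  case "$1" in
    "")   echo "absent" ;;
    None) echo "active" ;;
    *)    echo "pending-deletion" ;;
  esac
}

# Usage:
#   out=$(aws secretsmanager describe-secret --secret-id nquiry-greenfield/app-secrets \
#           --region us-east-1 --query 'DeletedDate' --output text 2>/dev/null)
#   [ "$(classify_secret_state "$out")" = "pending-deletion" ] && echo "force-purge before apply"
```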
Local toolchain
- terraform --version ≥ 1.0.
- aws --version v2.
- jq --version installed.
- openssl version available (bootstrap-secrets uses it to auto-generate CRON_SECRET).
- docker only required if you're using deploy-image.sh's default Marketplace pull path. NOT required when the task def points directly at public.ecr.aws/l2g7u7c8/invapp-dev-app:<tag> (see Spin §4).
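A quick presence check for the tools listed above might look like the sketch below. missing_tools is a hypothetical helper; it only confirms that each binary is on PATH and does not verify the version constraints (terraform ≥ 1.0, aws v2), which still need a manual look.

```shell
# missing_tools TOOL...   (hypothetical helper)
# Prints each tool that is not on PATH, one per line; fails if any is missing.
missing_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: docker is left out because only the default deploy path needs it.
#   missing_tools terraform aws jq openssl || echo "install the tools listed above"
```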
terraform.tfvars
The customer-greenfield env reads sensitive values from a gitignored terraform.tfvars. After the first NQU-729 bootstrap this file should still exist locally; if you're spinning from a fresh checkout:
cd infrastructure/terraform/environments/customer-greenfield
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars to set:
# db_password (16+ chars)
# redis_auth_token (16-128 chars)
# app_domain = "greenfield.nquir.ai"
# acm_certificate_arn = "arn:aws:acm:us-east-1:961381384763:certificate/d5f34171-428a-4752-ba9e-ee68372fb296"
# route53_zone_id = "Z02645003FT7M14Q92EJ3"
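The length limits noted for db_password and redis_auth_token can be checked before apply rather than discovered partway through it. This is a sketch under assumptions: tfvar_value and check_length are hypothetical helpers, and the parser assumes each value sits on one line in the form key = "value", which may not hold for every tfvars layout.

```shell
# tfvar_value FILE KEY   (hypothetical helper)
# Prints the value of a single-line  KEY = "value"  assignment (assumed layout).
tfvar_value() {
  sed -n "s/^[[:space:]]*${2}[[:space:]]*=[[:space:]]*\"\(.*\)\"[[:space:]]*\$/\1/p" "$1"
}

# check_length NAME VALUE MIN [MAX]   (hypothetical helper)
# Fails when VALUE is shorter than MIN or, if MAX is given, longer than MAX.
check_length() {
  name=$1; value=$2; min=$3; max=${4:-}
  len=${#value}
  if [ "$len" -lt "$min" ]; then
    echo "$name: $len chars, need at least $min" >&2
    return 1
  fi
  if [ -n "$max" ] && [ "$len" -gt "$max" ]; then
    echo "$name: $len chars, limit is $max" >&2
    return 1
  fi
  return 0
}

# Usage, from the env directory:
#   check_length db_password      "$(tfvar_value terraform.tfvars db_password)" 16
#   check_length redis_auth_token "$(tfvar_value terraform.tfvars redis_auth_token)" 16 128
```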
Spin
Wallclock baseline: apply ~15–25 min, secrets bootstrap ~1 min, DB bootstrap ~2 min, image deploy + ECS stabilize ~5 min. Total ~25–30 min the first time the runbook is exercised after landing; subsequent runs should hold ~15 min wallclock once muscle memory and caches are warm.
1. Apply
From the env directory:
export AWS_PROFILE=je-vectors-test
cd infrastructure/terraform/environments/customer-greenfield
../../apply.sh
Type y at the prompt. Capture start time.
Expected: ~167 resources created (count tracks any new module additions; see the most-recent validation results doc for the current baseline). 0 destroyed, 0 replaced.
If apply.sh fails early on the ACM cert ARN or Route53 zone ID, the Prerequisites DNS+ACM check was wrong — fix tfvars and re-run.
2. Bootstrap secrets
Terraform creates the empty Secrets Manager container; bootstrap-secrets.sh populates the AWSCURRENT version so ECS tasks can pull values.
../../bootstrap-secrets.sh
It auto-discovers the required key set from the live task def. With enable_stripe = false in customer-greenfield (see NQU-808), the surface is 6 keys:
| Key | Source |
|---|---|
| DB_PASSWORD | read from terraform.tfvars |
| REDIS_AUTH_TOKEN | read from terraform.tfvars |
| CRON_SECRET | auto-generated (openssl) |
| ANTHROPIC_API_KEY | interactive prompt (required) |
| RESEND_API_KEY | interactive prompt (required) |
| NEXT_PUBLIC_SENTRY_DSN | interactive prompt (optional; blank = placeholder) |
For non-interactive runs (CI / re-spins), supply a values file:
../../bootstrap-secrets.sh --values-file /tmp/test-env-secrets.json
(Keep this file outside the repo — gitignored paths only.)
3. Bootstrap the database schema + run migrations
A fresh RDS comes up empty. The baseline migration assumes auth helpers (auth.uid(), auth.users) that aren't present on a vanilla RDS instance, so npm run db:migrate fails outright without a one-time shim first. bootstrap-db.sh (NQU-821) opens an SSM tunnel to the bastion, applies scripts/ci-bootstrap-db.sql (extensions + auth/storage schema stubs + helper functions), then runs all pending app migrations.
../../bootstrap-db.sh
Wallclock baseline: ~2 min (mostly the migration chain — first run applies all 101 migrations from baseline forward).
Idempotent: re-running on an already-bootstrapped DB skips the migrations that have been recorded in _migrations and re-applies the shim (which uses CREATE IF NOT EXISTS / CREATE OR REPLACE throughout).
Requires psql on PATH (or under /opt/homebrew/opt/libpq/bin/). Install via brew install libpq on macOS if missing.
4. Deploy the image
The task def is created with :latest against an empty customer ECR. Without an image deploy the service spins forever in PENDING.
Default path (Marketplace pull → customer ECR push → ECS rolling deploy): requires Docker locally.
../../deploy-image.sh
This pulls the locked tag from public.ecr.aws/l2g7u7c8/invapp-dev-app:<version>, retags for the customer's private ECR, pushes, and forces a new ECS deployment.
Sidestep (no Docker needed): point the task def directly at the ECR Public image. Used during the v0.1.0 validation run when Docker wasn't available on the engineer's machine. Trade-off: bypasses the customer's private ECR — only acceptable in the JE-Vectors-test env, not for real customer installs.
CLUSTER=$(terraform output -raw ecs_cluster_name)
SERVICE=$(terraform output -raw ecs_service_name)
TASK_FAMILY=$(aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
--query 'services[0].taskDefinition' --output text | sed 's|.*/||;s|:.*||')
VERSION=$(terraform output -raw module_version)
# Re-register the task def with the public image URI.
aws ecs describe-task-definition --task-definition "$TASK_FAMILY" \
--query 'taskDefinition' --output json \
| jq --arg img "public.ecr.aws/l2g7u7c8/invapp-dev-app:$VERSION" \
'.containerDefinitions[0].image = $img
| del(.taskDefinitionArn, .revision, .status, .requiresAttributes, .compatibilities, .registeredAt, .registeredBy)' \
> /tmp/td.json
aws ecs register-task-definition --cli-input-json file:///tmp/td.json
aws ecs update-service --cluster "$CLUSTER" --service "$SERVICE" --force-new-deployment
rm /tmp/td.json
Either path: wait ~3–5 min for ECS to stabilize:
aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
--query 'services[0].deployments[0].{status:status, running:runningCount, desired:desiredCount}'
Expected: {"status": "PRIMARY", "running": 2, "desired": 2}.
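Rather than re-running the status query by hand, the wait can be wrapped in a small polling loop. The sketch below uses two hypothetical helpers, wait_until and deployment_stable; the steady-state condition (PRIMARY with running equal to desired) is taken from the expected output above. The AWS CLI also ships a built-in waiter, aws ecs wait services-stable, which may be simpler if its default timeout suits the run.

```shell
# wait_until TIMEOUT_SECS INTERVAL_SECS COMMAND [ARGS...]   (hypothetical helper)
# Re-runs COMMAND until it succeeds or TIMEOUT_SECS of sleeping have elapsed.
wait_until() {
  timeout=$1; interval=$2; shift 2
  waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# deployment_stable STATUS RUNNING DESIRED   (hypothetical helper)
# True when the deployment is PRIMARY and all desired tasks are running.
deployment_stable() {
  [ "$1" = "PRIMARY" ] && [ "$2" = "$3" ] && [ "$2" -gt 0 ] 2>/dev/null
}

# Usage, with CLUSTER and SERVICE set as in the deploy step:
#   ecs_stable() {
#     set -- $(aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
#       --query 'services[0].deployments[0].[status,runningCount,desiredCount]' --output text)
#     deployment_stable "$@"
#   }
#   wait_until 300 15 ecs_stable
```

Only sleep time counts toward the timeout, so the real wallclock limit is somewhat longer than TIMEOUT_SECS once the API calls themselves are included.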
Smoke test
The CloudFront layer fronts the app, but AWS WAF Bot Control blocks non-residential source IPs (developer laptops, AWS-resident hosts, common cloud egress). Hitting the public URL from a developer machine typically returns 403. That's expected and is not a defect.
For a real smoke test, hit the ALB directly:
# Look up the ALB by name fragment and read its DNS name in a single call.
# Prints "None" if no load balancer matches.
ALB_DNS=$(aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?contains(LoadBalancerName, 'nquiry-greenfield')] | [0].DNSName" \
  --output text)
curl -ksS "https://$ALB_DNS/api/health" -H "Host: greenfield.nquir.ai"
Expected: {"status":"ok","timestamp":"..."} with HTTP 200.
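To turn the expected response into a pass/fail check, the status code and body can be compared explicitly. This is a sketch: health_ok is a hypothetical helper, and the body match assumes the compact JSON shape shown in the expected output, with no whitespace between the key and the value.

```shell
# health_ok HTTP_CODE BODY   (hypothetical helper)
# Succeeds only for HTTP 200 with a body containing "status":"ok".
health_ok() {
  [ "$1" = "200" ] || { echo "unexpected HTTP status: $1" >&2; return 1; }
  case "$2" in
    *'"status":"ok"'*) return 0 ;;
    *) echo "unexpected body: $2" >&2; return 1 ;;
  esac
}

# Usage, with ALB_DNS set as above:
#   code=$(curl -ksS -o /tmp/health.json -w '%{http_code}' \
#            "https://$ALB_DNS/api/health" -H "Host: greenfield.nquir.ai")
#   health_ok "$code" "$(cat /tmp/health.json)" && echo "smoke test passed"
```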
Alarms expected to be in ALARM immediately after a fresh spin:
- nquiry-greenfield-alb-healthy-hosts-low: targets still initial while ECS registers. Clears in ~2 min.
- nquiry-greenfield-retention-cron-missing: cron has never run on a fresh install. Clears on the first scheduled run; on the runbook timescale, treat as expected.
Anything else in ALARM is real — investigate before declaring the env usable:
aws cloudwatch describe-alarms --state-value ALARM --region us-east-1 \
--query 'MetricAlarms[?contains(AlarmName, `nquiry-greenfield`)].AlarmName'
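The "expected vs. real" split can be applied mechanically by filtering the two known fresh-spin alarms out of the query result. The sketch below assumes alarm names arrive one per line on stdin; unexpected_alarms is a hypothetical helper, and the two allowlisted names are the ones this runbook names as expected after a fresh spin.

```shell
# unexpected_alarms   (hypothetical helper)
# Reads alarm names on stdin (one per line), prints those not on the
# fresh-spin allowlist, and fails if any remain.
unexpected_alarms() {
  remaining=$(grep -v -x \
    -e 'nquiry-greenfield-alb-healthy-hosts-low' \
    -e 'nquiry-greenfield-retention-cron-missing' || true)
  [ -z "$remaining" ] && return 0
  printf '%s\n' "$remaining"
  return 1
}

# Usage: text output is tab-separated, so split it into lines first.
#   aws cloudwatch describe-alarms --state-value ALARM --region us-east-1 \
#     --query 'MetricAlarms[?contains(AlarmName, `nquiry-greenfield`)].AlarmName' \
#     --output text | tr '\t' '\n' | unexpected_alarms
```

Whole-line matching (-x) keeps the allowlist from hiding a different alarm whose name merely contains one of the expected names.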
Use
Pick a workload doc for what you actually do with the env:
- Gate 3 load test: NQU-809 + docs/working/nqu-428-load-test-plan.md.
- Customer-install repro: drive ../../apply.sh + ../../bootstrap-secrets.sh + ../../bootstrap-db.sh + ../../deploy-image.sh against the customer's reported version and reproduce the issue against the env's outputs.
- Release validation: release-validation runbook. Same env, different procedure (full install → upgrade → rollback cycle with explicit gates).
Tear
Once NQU-802 is merged, the customer-greenfield env has is_ephemeral = true by default: RDS deletion_protection off, skip_final_snapshot on, CloudTrail + synthetics S3 buckets force_destroy = true. A single terraform destroy should then complete in one pass, and the pre-clean steps below become unnecessary.
Until NQU-802 is merged (or in any env that flips is_ephemeral = false), follow the full pre-clean sequence; terraform destroy will otherwise need 2–3 iterations per the 2026-05-17 validation results.
Pre-clean (only if NQU-802 has not landed yet)
# 1. Disable RDS deletion protection.
aws rds modify-db-instance --db-instance-identifier nquiry-greenfield-postgres \
--no-deletion-protection --apply-immediately
# 2. Empty the versioned CloudTrail bucket (re-run if destroy fails partway and the
# bucket re-fills while CloudTrail is still writing).
BUCKET=$(aws s3api list-buckets --query "Buckets[?starts_with(Name, 'nquiry-greenfield-cloudtrail-')].Name" --output text)
aws s3api delete-objects --bucket "$BUCKET" \
--delete "$(aws s3api list-object-versions --bucket "$BUCKET" \
--query '{Objects: Versions[].{Key: Key, VersionId: VersionId}}')" 2>/dev/null || true
aws s3api delete-objects --bucket "$BUCKET" \
--delete "$(aws s3api list-object-versions --bucket "$BUCKET" \
--query '{Objects: DeleteMarkers[].{Key: Key, VersionId: VersionId}}')" 2>/dev/null || true
# 3. Empty the synthetics canary-artifacts bucket (canary writes continuously).
SYNTH=$(aws s3api list-buckets --query "Buckets[?Name=='nquiry-greenfield-synthetics-artifacts'].Name" --output text)
aws s3 rm "s3://$SYNTH" --recursive || true
# 4. Force-delete the app-secrets Secrets Manager secret so a re-spin can recreate it
# without the "scheduled for deletion" wait window.
aws secretsmanager delete-secret --secret-id nquiry-greenfield/app-secrets \
--force-delete-without-recovery 2>/dev/null || true
# 5. Delete orphaned VPC flow-log group if a prior destroy left one.
aws logs delete-log-group --log-group-name /aws/vpc/nquiry-greenfield-flow-logs 2>/dev/null || true
Destroy
cd infrastructure/terraform/environments/customer-greenfield
terraform destroy
Type yes. Expect ~10–15 min wallclock; RDS deletion is the long pole.
If destroy fails partway, the most common cause (pre-NQU-802) is the CloudTrail bucket re-filling during the destroy itself. Re-run the bucket-empty step (above, item 2) and terraform destroy again. The 2026-05-17 validation needed 3 passes for this reason.
Verify clean
terraform show 2>&1 | grep -q "no state" && echo "STATE EMPTY — clean"
aws ecs list-clusters --region us-east-1 \
--query 'clusterArns[?contains(@, `nquiry-greenfield`)]'
aws rds describe-db-instances --region us-east-1 \
--query 'DBInstances[?contains(DBInstanceIdentifier, `nquiry-greenfield`)].DBInstanceIdentifier'
aws elbv2 describe-load-balancers --region us-east-1 \
--query 'LoadBalancers[?contains(LoadBalancerName, `nquiry-greenfield`)].LoadBalancerName'
aws ec2 describe-vpcs --region us-east-1 \
--query 'Vpcs[?Tags[?Key==`Project` && Value==`nquiry`] && !IsDefault].VpcId'
All four post-state queries should return empty arrays / no matches.
Persistent infra (Route53 zone, ACM cert, state bucket, ECR Public repo) should remain. Do not delete.
Related docs
- Release validation runbook: same env, gated install → upgrade → rollback procedure for cutting releases.
- Customer rollback runbook: customer-facing rollback procedure; NQU-810's "rollback exercise" doubles as a self-test of it.
- apply.sh, bootstrap-secrets.sh, bootstrap-db.sh, deploy-image.sh: the four customer-facing wrappers this runbook drives.
- NQU-645 v0.1.0 validation results: source of the cleanup-blocker checklist and the basis for the wallclock baselines in this runbook.
- NQU-802: is_ephemeral env tightening; once merged, the pre-clean section becomes optional.
- NQU-809: Gate 3 load test; first user of this runbook.