Test Environment Spin-up Runbook (JE-Vectors-test)

End-to-end procedure for spinning the JE-Vectors-test customer-greenfield environment up on demand, using it for validation / load tests / customer-install repro, and tearing it back down.

Who runs this. JE Vectors release engineering. The same customer-greenfield env doubles as our release-validation rig (see release-validation runbook) and as a scratch env any engineer can spin to reproduce a customer-side issue. This runbook is the spin/tear procedure; choose a workload doc from §Use for what to do in between.

When to run it

  • Load tests. NQU-809 Gate 3 load test (10 VUs / 100 analyses) — first user of this runbook.
  • Customer-install repro. When a customer-side install hits a wall and the reproduction needs the actual customer-greenfield shape, not the dev env.
  • Release validation. The release-validation runbook drives the same env through a full install → upgrade → rollback cycle. If you're cutting a release, use that runbook; this one is for the "stand it up and use it" workflow without the validation gates.

Persistent infra that survives between runs. Route53 zone greenfield.nquir.ai (Z02645003FT7M14Q92EJ3) + ACM cert (d5f34171-428a-4752-ba9e-ee68372fb296) + the terraform state bucket persist between spins. Cost ~$0.50/mo idle. Do not destroy them.

Cost

Heavy resources (RDS, NAT gateway, ECS Fargate tasks, ALB, CloudFront) run only during the activity window. Single-day spin + use + tear ≈ $3–5. Persistent infra (Route53 zone + ACM cert + state bucket) ≈ $0.50/mo idle.

Bedrock + Anthropic API calls billed per-invocation are additional and depend on the workload. The Gate 3 load test estimate is in NQU-809.

Prerequisites

Confirm each item before starting. If any fails, fix it before proceeding.

AWS access

  • aws sso login --profile je-vectors-test succeeds.
  • AWS_PROFILE=je-vectors-test aws sts get-caller-identity returns account 961381384763.
  • You have admin-or-equivalent in the JE-Vectors-test account.

Persistent infra still present

The DNS + ACM bootstrap from NQU-729 Phase A is supposed to survive between spins. Verify:

export AWS_PROFILE=je-vectors-test
aws route53 list-hosted-zones --query 'HostedZones[?Name==`greenfield.nquir.ai.`]'
aws acm describe-certificate \
--certificate-arn arn:aws:acm:us-east-1:961381384763:certificate/d5f34171-428a-4752-ba9e-ee68372fb296 \
--query 'Certificate.{Status:Status,Domain:DomainName}'

Expected: hosted zone present, cert ISSUED for greenfield.nquir.ai. If either is missing, re-bootstrap from the NQU-729 Phase A inline procedure before continuing — it's a ~10-minute one-off.
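
The two checks above can be folded into one pass/fail gate before any apply. A minimal sketch, assuming you capture the zone count and cert status into variables first; the function name persistent_infra_ok and the CERT_ARN variable are hypothetical and not part of the repo tooling.

```shell
# Hypothetical helper: decide whether the persistent DNS + ACM bootstrap is intact.
# $1 = number of hosted zones named greenfield.nquir.ai. (0 or more)
# $2 = ACM certificate status string (e.g. ISSUED, PENDING_VALIDATION)
persistent_infra_ok() {
  zone_count=$1
  cert_status=$2
  if [ "$zone_count" -ge 1 ] && [ "$cert_status" = "ISSUED" ]; then
    echo "OK"
    return 0
  fi
  echo "RE-BOOTSTRAP (zones=$zone_count cert=$cert_status)"
  return 1
}

# Example wiring (needs AWS credentials, so it is left commented out):
# zones=$(aws route53 list-hosted-zones \
#   --query 'length(HostedZones[?Name==`greenfield.nquir.ai.`])' --output text)
# status=$(aws acm describe-certificate --certificate-arn "$CERT_ARN" \
#   --query 'Certificate.Status' --output text)
# persistent_infra_ok "$zones" "$status" || exit 1
```

Keeping the decision in a pure function means the gate can be exercised without touching AWS.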

Test environment hygiene

  • No prior customer-greenfield stack exists in JE-Vectors-test:

    cd infrastructure/terraform/environments/customer-greenfield
    terraform init -input=false
    [ -z "$(terraform state list 2>/dev/null)" ] || echo "STATE PRESENT: tear before spin"

    If state is present from a prior run, run §Tear first.

  • RDS snapshot quota in us-east-1 has headroom. A typical spin produces 0–1 snapshots (1 final snapshot if is_ephemeral=false, 0 if is_ephemeral=true; see NQU-802). Check:

    aws rds describe-account-attributes --region us-east-1 \
    --query 'AccountQuotas[?AccountQuotaName==`ManualClusterSnapshots` || AccountQuotaName==`ManualSnapshots`]'

  • No leftover orphaned log groups (rare; only after a failed prior destroy):

    aws logs describe-log-groups --region us-east-1 \
    --log-group-name-prefix /aws/vpc/nquiry-greenfield- \
    --query 'logGroups[].logGroupName'

    Expected: empty array. If non-empty, delete each group with aws logs delete-log-group --log-group-name <name> before continuing.

  • No orphan nquiry-greenfield/app-secrets scheduled for deletion (NQU-822 — Secrets Manager keeps the name reserved for the deletion window, and terraform apply will fail to recreate the secret until it's force-purged):

    aws secretsmanager describe-secret --secret-id nquiry-greenfield/app-secrets --region us-east-1 \
    --query 'DeletedDate' --output text 2>/dev/null

    Expected: None. If the secret does not exist at all, the command prints nothing, because 2>/dev/null suppresses the "secret not found" error. If a timestamp prints, force-purge before continuing:

    aws secretsmanager delete-secret --secret-id nquiry-greenfield/app-secrets \
    --force-delete-without-recovery --region us-east-1
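
The three possible outcomes of the describe-secret check (empty output, None, or a timestamp) can be classified in a small helper so a wrapper script knows when to force-purge. This is a sketch only; secret_needs_purge is a hypothetical name, and the interpretation of each outcome follows the expectation stated above.

```shell
# Hypothetical helper: classify the text output of
#   aws secretsmanager describe-secret ... --query 'DeletedDate' --output text 2>/dev/null
# ""     = secret not found (the error went to /dev/null)
# "None" = secret exists with no deletion scheduled
# other  = a deletion timestamp, so the name is still reserved
secret_needs_purge() {
  case "$1" in
    ""|None) return 1 ;;   # nothing to purge
    *)       return 0 ;;   # pending deletion: force-purge before apply
  esac
}

# Example wiring (needs AWS credentials, so it is left commented out):
# deleted=$(aws secretsmanager describe-secret --secret-id nquiry-greenfield/app-secrets \
#   --region us-east-1 --query 'DeletedDate' --output text 2>/dev/null)
# if secret_needs_purge "$deleted"; then
#   aws secretsmanager delete-secret --secret-id nquiry-greenfield/app-secrets \
#     --force-delete-without-recovery --region us-east-1
# fi
```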

Local toolchain

  • terraform --version ≥ 1.0.
  • aws --version v2.
  • jq --version installed.
  • openssl version available (bootstrap-secrets uses it to auto-generate CRON_SECRET).
  • docker is required only if you use deploy-image.sh's default Marketplace pull path. It is NOT required when the task def points directly at public.ecr.aws/l2g7u7c8/invapp-dev-app:<tag> (see Spin §4).
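
The toolchain list above can be checked in one pass. The sketch below is an illustration rather than repo tooling: missing_tools and version_ge are hypothetical helpers, and the version comparison only looks at major.minor.

```shell
# Print the name of every listed tool that is not on PATH; print nothing if all are present.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Compare dotted versions by major.minor: succeeds if $1 >= $2.
# Both arguments are assumed to contain at least "major.minor".
version_ge() {
  have_major=${1%%.*}; have_rest=${1#*.}; have_minor=${have_rest%%.*}
  want_major=${2%%.*}; want_rest=${2#*.}; want_minor=${want_rest%%.*}
  [ "$have_major" -gt "$want_major" ] ||
    { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Example wiring (left commented out because it depends on the local machine):
# missing=$(missing_tools terraform aws jq openssl)
# [ -z "$missing" ] || { echo "missing: $missing"; exit 1; }
# tf_version=$(terraform version -json | jq -r '.terraform_version')
# version_ge "$tf_version" 1.0 || { echo "terraform $tf_version is older than 1.0"; exit 1; }
```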

terraform.tfvars

The customer-greenfield env reads sensitive values from a gitignored terraform.tfvars. After the first NQU-729 bootstrap this file should still exist locally; if you're spinning from a fresh checkout:

cd infrastructure/terraform/environments/customer-greenfield
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars to set:
# db_password (16+ chars)
# redis_auth_token (16-128 chars)
# app_domain = "greenfield.nquir.ai"
# acm_certificate_arn = "arn:aws:acm:us-east-1:961381384763:certificate/d5f34171-428a-4752-ba9e-ee68372fb296"
# route53_zone_id = "Z02645003FT7M14Q92EJ3"
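
The two length constraints above are easy to get wrong when hand-editing, so a quick pre-apply check can help. A minimal sketch, assuming you pass the two values in as arguments; check_tfvars_lengths is a hypothetical helper and does not parse terraform.tfvars itself.

```shell
# Hypothetical pre-apply check for the two length constraints listed above.
# $1 = db_password, $2 = redis_auth_token. Prints one line per violation.
check_tfvars_lengths() {
  db_len=${#1}
  redis_len=${#2}
  status=0
  if [ "$db_len" -lt 16 ]; then
    echo "db_password too short: $db_len chars (need 16+)"
    status=1
  fi
  if [ "$redis_len" -lt 16 ] || [ "$redis_len" -gt 128 ]; then
    echo "redis_auth_token length $redis_len outside 16-128"
    status=1
  fi
  return $status
}
```

One way to produce values that satisfy both limits is openssl rand -hex 24, which emits 48 hexadecimal characters.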

Spin

Wallclock baseline: apply ~15–25 min, secrets bootstrap ~1 min, DB bootstrap ~2 min, image deploy + ECS stabilize ~5 min. Total ~25–30 min on the first run after this runbook lands; subsequent runs should hold ~15 min wallclock once muscle memory and caches are warm.

1. Apply

From the env directory:

export AWS_PROFILE=je-vectors-test
cd infrastructure/terraform/environments/customer-greenfield
../../apply.sh

Type y at the prompt. Capture start time.

Expected: ~167 resources created (count tracks any new module additions; see the most-recent validation results doc for the current baseline). 0 destroyed, 0 replaced.

If apply.sh fails early on the ACM cert ARN or Route53 zone ID, the values in terraform.tfvars do not match the persistent infra verified in Prerequisites. Fix tfvars and re-run.

2. Bootstrap secrets

Terraform creates the empty Secrets Manager container; bootstrap-secrets.sh populates the AWSCURRENT version so ECS tasks can pull values.

../../bootstrap-secrets.sh

It auto-discovers the required key set from the live task def. With enable_stripe = false in customer-greenfield (see NQU-808), the surface is 6 keys:

Key                      Source
DB_PASSWORD              read from terraform.tfvars
REDIS_AUTH_TOKEN         read from terraform.tfvars
CRON_SECRET              auto-generated (openssl)
ANTHROPIC_API_KEY        interactive prompt (required)
RESEND_API_KEY           interactive prompt (required)
NEXT_PUBLIC_SENTRY_DSN   interactive prompt (optional; blank = placeholder)

For non-interactive runs (CI / re-spins), supply a values file:

../../bootstrap-secrets.sh --values-file /tmp/test-env-secrets.json

(Keep this file outside the repo — gitignored paths only.)
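
To guard against a values file landing inside the checkout by accident, a wrapper can refuse to proceed when the path sits under the repo root. This is a sketch; is_outside_repo is a hypothetical helper, and it does a plain prefix match on absolute paths without resolving symlinks.

```shell
# Hypothetical guard: succeed only if FILE is not located under REPO_ROOT.
# Both arguments should be absolute paths.
is_outside_repo() {
  file=$1
  repo_root=${2%/}   # drop a trailing slash so the prefix match is exact
  case "$file" in
    "$repo_root"/*) return 1 ;;  # inside the repo tree
    *)              return 0 ;;
  esac
}

# Example wiring (left commented out because it depends on the checkout):
# values=/tmp/test-env-secrets.json
# is_outside_repo "$values" "$(git rev-parse --show-toplevel)" || {
#   echo "refusing to use a secrets file inside the repo"; exit 1; }
# chmod 600 "$values"
```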

3. Bootstrap the database schema + run migrations

A fresh RDS comes up empty. The baseline migration assumes auth helpers (auth.uid(), auth.users) that aren't present on a vanilla RDS instance, so npm run db:migrate fails outright without a one-time shim first. bootstrap-db.sh (NQU-821) opens an SSM tunnel to the bastion, applies scripts/ci-bootstrap-db.sql (extensions + auth/storage schema stubs + helper functions), then runs all pending app migrations.

../../bootstrap-db.sh

Wallclock baseline: ~2 min (mostly the migration chain — first run applies all 101 migrations from baseline forward).

Idempotent: re-running on an already-bootstrapped DB skips the migrations that have been recorded in _migrations and re-applies the shim (which uses CREATE IF NOT EXISTS / CREATE OR REPLACE throughout).

Requires psql on PATH (or under /opt/homebrew/opt/libpq/bin/). Install via brew install libpq on macOS if missing.
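
The lookup order described above (PATH first, then the Homebrew libpq directory) can be sketched as follows. find_tool is a hypothetical helper written for illustration; how bootstrap-db.sh actually resolves psql is not shown here.

```shell
# Hypothetical helper: print the path of a tool found on PATH or, failing that,
# in a fallback directory; return non-zero if it is in neither place.
find_tool() {
  name=$1
  fallback_dir=$2
  if command -v "$name" >/dev/null 2>&1; then
    command -v "$name"
  elif [ -x "$fallback_dir/$name" ]; then
    echo "$fallback_dir/$name"
  else
    return 1
  fi
}

# Example wiring (left commented out because it depends on the local machine):
# PSQL=$(find_tool psql /opt/homebrew/opt/libpq/bin) || {
#   echo "psql not found; on macOS try: brew install libpq"; exit 1; }
```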

4. Deploy the image

The task def is created with :latest against an empty customer ECR. Without an image deploy the service spins forever in PENDING.

Default path (Marketplace pull → customer ECR push → ECS rolling deploy): requires Docker locally.

../../deploy-image.sh

This pulls the locked tag from public.ecr.aws/l2g7u7c8/invapp-dev-app:<version>, retags for the customer's private ECR, pushes, and forces a new ECS deployment.

Sidestep (no Docker needed): point the task def directly at the ECR Public image. Used during the v0.1.0 validation run when Docker wasn't available on the engineer's machine. Trade-off: bypasses the customer's private ECR — only acceptable in the JE-Vectors-test env, not for real customer installs.

CLUSTER=$(terraform output -raw ecs_cluster_name)
SERVICE=$(terraform output -raw ecs_service_name)
TASK_FAMILY=$(aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
--query 'services[0].taskDefinition' --output text | sed 's|.*/||;s|:.*||')
VERSION=$(terraform output -raw module_version)

# Re-register the task def with the public image URI.
aws ecs describe-task-definition --task-definition "$TASK_FAMILY" \
--query 'taskDefinition' --output json \
| jq --arg img "public.ecr.aws/l2g7u7c8/invapp-dev-app:$VERSION" \
'.containerDefinitions[0].image = $img
| del(.taskDefinitionArn, .revision, .status, .requiresAttributes, .compatibilities, .registeredAt, .registeredBy)' \
> /tmp/td.json
aws ecs register-task-definition --cli-input-json file:///tmp/td.json

aws ecs update-service --cluster "$CLUSTER" --service "$SERVICE" --force-new-deployment
rm /tmp/td.json

Either path: wait ~3–5 min for ECS to stabilize:

aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
--query 'services[0].deployments[0].{status:status, running:runningCount, desired:desiredCount}'

Expected: {"status": "PRIMARY", "running": 2, "desired": 2}.
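
Rather than re-running the describe-services query by hand, the wait can be scripted as a bounded poll. The sketch below is illustrative: wait_until and deployment_settled are hypothetical helpers, and the 300-second timeout and 15-second interval in the wiring are assumed values, not measured ones.

```shell
# Hypothetical polling helper: run CMD every INTERVAL seconds until it succeeds
# or TIMEOUT seconds have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Hypothetical predicate: succeeds when at least one task runs and running == desired.
# $1 = runningCount, $2 = desiredCount
deployment_settled() {
  [ "$1" -gt 0 ] && [ "$1" -eq "$2" ]
}

# Example wiring (needs the live cluster, so it is left commented out):
# ecs_settled() {
#   set -- $(aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
#     --query 'services[0].deployments[0].[runningCount, desiredCount]' --output text)
#   deployment_settled "$1" "$2"
# }
# wait_until 300 15 ecs_settled || echo "ECS did not stabilize in time"
```

The AWS CLI also ships a built-in waiter, aws ecs wait services-stable --cluster "$CLUSTER" --services "$SERVICE", which may be simpler when no custom timeout is needed.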

Smoke test

The CloudFront layer fronts the app, but AWS WAF Bot Control blocks non-residential source IPs (developer laptops, AWS-resident hosts, common cloud egress). Hitting the public URL from a developer machine typically returns 403. That's expected and is not a defect.

For a real smoke test, hit the ALB directly:

ALB_DNS=$(aws elbv2 describe-load-balancers \
--query "LoadBalancers[?contains(LoadBalancerName, 'nquiry-greenfield')].DNSName | [0]" \
--output text)

curl -ksS "https://$ALB_DNS/api/health" -H "Host: greenfield.nquir.ai"

Expected: {"status":"ok","timestamp":"..."} with HTTP 200.
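
For scripted runs, the pass criterion above (HTTP 200 plus a status of ok in the body) can be checked mechanically. A minimal sketch; health_ok is a hypothetical helper, and it assumes the body is compact JSON with no whitespace around the colon, as in the expected response shown above.

```shell
# Hypothetical helper: judge a health probe from its HTTP status code and body.
# $1 = HTTP status code, $2 = response body
health_ok() {
  [ "$1" = "200" ] || return 1
  case "$2" in
    *'"status":"ok"'*) return 0 ;;
    *)                 return 1 ;;
  esac
}

# Example wiring (needs the live ALB, so it is left commented out):
# resp=$(curl -ksS -w '\n%{http_code}' "https://$ALB_DNS/api/health" \
#   -H "Host: greenfield.nquir.ai")
# code=$(printf '%s\n' "$resp" | tail -n 1)
# body=$(printf '%s\n' "$resp" | sed '$d')
# health_ok "$code" "$body" || { echo "smoke test failed: HTTP $code"; exit 1; }
```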

Alarms expected to be in ALARM immediately after a fresh spin:

  • nquiry-greenfield-alb-healthy-hosts-low: targets are still in the initial state while ECS registers them. Clears in ~2 min.
  • nquiry-greenfield-retention-cron-missing — cron has never run on a fresh install. Clears on the first scheduled run; on the runbook timescale, treat as expected.

Anything else in ALARM is real — investigate before declaring the env usable:

aws cloudwatch describe-alarms --state-value ALARM --region us-east-1 \
--query 'MetricAlarms[?contains(AlarmName, `nquiry-greenfield`)].AlarmName'
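
To separate real alarms from the two expected ones, the alarm list can be passed through a small filter. This is a sketch under the assumption that alarm names arrive one per line; unexpected_alarms is a hypothetical helper.

```shell
# Read alarm names (one per line) on stdin and print only those that are NOT
# in the expected-after-fresh-spin list from this runbook. Blank lines are dropped.
unexpected_alarms() {
  sed '/^$/d' | {
    grep -v -x \
      -e 'nquiry-greenfield-alb-healthy-hosts-low' \
      -e 'nquiry-greenfield-retention-cron-missing' || true
  }
}

# Example wiring (needs AWS credentials, so it is left commented out):
# aws cloudwatch describe-alarms --state-value ALARM --region us-east-1 \
#   --query 'MetricAlarms[?contains(AlarmName, `nquiry-greenfield`)].AlarmName' \
#   --output text | tr '\t' '\n' | unexpected_alarms
```

Empty output from the filter means only the expected alarms are firing.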

Use

Pick a workload doc for what you actually do with the env:

  • Gate 3 load test: NQU-809 + docs/working/nqu-428-load-test-plan.md.
  • Customer-install repro: drive ../../apply.sh + ../../bootstrap-secrets.sh + ../../bootstrap-db.sh + ../../deploy-image.sh against the customer's reported version and reproduce the issue against the env's outputs.
  • Release validation: release-validation runbook — same env, different procedure (full install → upgrade → rollback cycle with explicit gates).

Tear

Once NQU-802 is merged, the customer-greenfield env has is_ephemeral = true by default: RDS deletion_protection off, skip_final_snapshot on, CloudTrail + synthetics S3 buckets force_destroy = true. A single terraform destroy should then complete in one pass, and the pre-clean steps below become unnecessary.

Until NQU-802 is merged (or in any env that flips is_ephemeral = false), follow the full pre-clean sequence — terraform destroy will otherwise need 2–3 iterations per the 2026-05-17 validation results.

Pre-clean (only if NQU-802 has not landed yet)

# 1. Disable RDS deletion protection.
aws rds modify-db-instance --db-instance-identifier nquiry-greenfield-postgres \
--no-deletion-protection --apply-immediately

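# 2. Empty the versioned CloudTrail bucket (re-run if destroy fails partway and the
# bucket re-fills while CloudTrail is still writing). delete-objects accepts at most
# 1000 keys per call and errors are suppressed here, so re-run until the bucket is empty.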
BUCKET=$(aws s3api list-buckets --query "Buckets[?starts_with(Name, 'nquiry-greenfield-cloudtrail-')].Name" --output text)
aws s3api delete-objects --bucket "$BUCKET" \
--delete "$(aws s3api list-object-versions --bucket "$BUCKET" \
--query '{Objects: Versions[].{Key: Key, VersionId: VersionId}}')" 2>/dev/null || true
aws s3api delete-objects --bucket "$BUCKET" \
--delete "$(aws s3api list-object-versions --bucket "$BUCKET" \
--query '{Objects: DeleteMarkers[].{Key: Key, VersionId: VersionId}}')" 2>/dev/null || true

# 3. Empty the synthetics canary-artifacts bucket (canary writes continuously).
SYNTH=$(aws s3api list-buckets --query "Buckets[?Name=='nquiry-greenfield-synthetics-artifacts'].Name" --output text)
aws s3 rm "s3://$SYNTH" --recursive || true

# 4. Force-delete the app-secrets Secrets Manager secret so a re-spin can recreate it
# without the "scheduled for deletion" wait window.
aws secretsmanager delete-secret --secret-id nquiry-greenfield/app-secrets \
--force-delete-without-recovery 2>/dev/null || true

# 5. Delete orphaned VPC flow-log group if a prior destroy left one.
aws logs delete-log-group --log-group-name /aws/vpc/nquiry-greenfield-flow-logs 2>/dev/null || true

Destroy

cd infrastructure/terraform/environments/customer-greenfield
terraform destroy

Type yes. Expect ~10–15 min wallclock; RDS deletion is the long pole.

If destroy fails partway, the most common cause (pre-NQU-802) is the CloudTrail bucket re-filling during the destroy itself. Re-run the bucket-empty step (above, item 2) and terraform destroy again. The 2026-05-17 validation needed 3 passes for this reason.
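
The empty-then-retry cycle described above can be wrapped in a bounded loop. The sketch below is illustrative only: retry_with_cleanup, empty_cloudtrail_bucket, and run_destroy are hypothetical names, and the cap of 3 passes mirrors the 2026-05-17 validation rather than any hard limit.

```shell
# Hypothetical wrapper: run ACTION up to MAX_PASSES times, running CLEANUP
# before each retry. Returns 0 as soon as ACTION succeeds, 1 if all passes fail.
retry_with_cleanup() {
  max_passes=$1
  cleanup=$2
  action=$3
  pass=1
  while [ "$pass" -le "$max_passes" ]; do
    if "$action"; then
      echo "succeeded on pass $pass"
      return 0
    fi
    echo "pass $pass failed" >&2
    if [ "$pass" -lt "$max_passes" ]; then
      "$cleanup"
    fi
    pass=$((pass + 1))
  done
  return 1
}

# Example wiring (destructive, so it is left commented out):
# empty_cloudtrail_bucket() { :; }   # put the pre-clean step 2 commands here
# run_destroy() { terraform destroy -auto-approve; }
# retry_with_cleanup 3 empty_cloudtrail_bucket run_destroy
```

Note that -auto-approve skips the confirmation prompt; drop it if you want to type yes on each pass.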

Verify clean

[ -z "$(terraform state list 2>/dev/null)" ] && echo "STATE EMPTY: clean"
aws ecs list-clusters --region us-east-1 \
--query 'clusterArns[?contains(@, `nquiry-greenfield`)]'
aws rds describe-db-instances --region us-east-1 \
--query 'DBInstances[?contains(DBInstanceIdentifier, `nquiry-greenfield`)].DBInstanceIdentifier'
aws elbv2 describe-load-balancers --region us-east-1 \
--query 'LoadBalancers[?contains(LoadBalancerName, `nquiry-greenfield`)].LoadBalancerName'
aws ec2 describe-vpcs --region us-east-1 \
--query 'Vpcs[?Tags[?Key==`Project` && Value==`nquiry`] && !IsDefault].VpcId'

All four post-state queries should return empty arrays / no matches.
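
When the verification is scripted, each query result can be tested for emptiness instead of being read by eye. A minimal sketch, assuming the CLI is using its JSON output format; is_empty_array is a hypothetical helper.

```shell
# Hypothetical helper: succeed only if a JSON query result is an empty array,
# ignoring whitespace (the CLI may print "[]" on one line or across two).
is_empty_array() {
  compact=$(printf '%s' "$1" | tr -d ' \t\n\r')
  [ "$compact" = "[]" ]
}

# Example wiring (needs AWS credentials, so it is left commented out):
# result=$(aws ecs list-clusters --region us-east-1 --output json \
#   --query 'clusterArns[?contains(@, `nquiry-greenfield`)]')
# is_empty_array "$result" || echo "ECS clusters still present: $result"
```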

Persistent infra (Route53 zone, ACM cert, state bucket, ECR Public repo) should remain. Do not delete.