Release Validation Runbook
End-to-end test procedure for an Nquiry release. Runs the full greenfield install → upgrade → rollback cycle against the JE-Vectors-test AWS account (961381384763) before a release is recommended to customers.
Who runs this. JE Vectors release engineering. Not customer-facing — customers operate the rollback runbook on their own environments. This runbook validates that the mechanisms we ship to customers actually work end-to-end.
When to run it
- Before recommending any release to a customer. This is the explicit gate.
- After any change to infrastructure/terraform/, .github/workflows/release-publish.yml, or the apply.sh wrapper. These touch the customer-deployable surface; regressions here are not caught by unit tests.
- As the closing event for the NQU-645 deployment-automation initiative: running this against v0.1.0 is what flips NQU-645 to Done.
Prerequisites
Before starting, confirm each item. If any fails, fix it before proceeding — the runbook assumes all preflight items are green.
AWS access
- aws sso login --profile je-vectors-test succeeds.
- AWS_PROFILE=je-vectors-test aws sts get-caller-identity returns account 961381384763.
- You have admin-or-equivalent in the JE-Vectors-test account (the test creates and destroys VPCs, RDS, ECS, etc.).
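The account check above can be scripted so a wrong profile fails loudly before anything is created. This is a minimal sketch, not part of the repo: `check_account` is a hypothetical helper name, and it assumes the AWS CLI is on the PATH and an SSO session is already active.

```shell
#!/usr/bin/env bash
# Sketch: fail fast if the active profile does not resolve to the expected account.
# check_account is a hypothetical helper, not something the repo ships.
check_account() {
  local expected="$1" actual
  # --query Account --output text prints only the 12-digit account ID.
  actual="$(aws sts get-caller-identity --query Account --output text)" || return 1
  if [ "$actual" != "$expected" ]; then
    echo "wrong account: got ${actual}, expected ${expected}" >&2
    return 1
  fi
  echo "account ok: ${actual}"
}
```

Example use: `AWS_PROFILE=je-vectors-test check_account 961381384763`.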
Release artifact
- The target Git tag (e.g., v0.1.0) exists on origin/main. List with git tag -l 'v*' --sort=-v:refname | head -5.
- The tag's infrastructure/terraform/modules/nquiry-stack/VERSION matches the tag name. Verify: git show <tag>:infrastructure/terraform/modules/nquiry-stack/VERSION | tr -d '[:space:]'.
- The release-publish.yml workflow has completed successfully for the tag. Check: gh run list --workflow=release-publish.yml --branch <tag> --limit 1.
- The image is pullable from ECR Public. Verify: docker manifest inspect public.ecr.aws/<alias>/nquiry-app:<tag>.
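The VERSION-versus-tag comparison is easy to misread by eye, so it may help to script it. The sketch below assumes the VERSION file holds exactly the tag string (for example `v0.1.0`), which is how "matches the tag name" is read here; `version_matches_tag` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: compare the VERSION file stored in a tag against the tag name itself.
# Assumption: VERSION contains the tag string verbatim (e.g. "v0.1.0").
version_matches_tag() {
  local tag="$1" version
  # Read the file as committed in the tag and strip all whitespace.
  version="$(git show "${tag}:infrastructure/terraform/modules/nquiry-stack/VERSION" | tr -d '[:space:]')"
  if [ "$version" != "$tag" ]; then
    echo "VERSION mismatch: file has '${version}', tag is '${tag}'" >&2
    return 1
  fi
  echo "VERSION matches tag ${tag}"
}
```

Example use, run from the repository root: `version_matches_tag v0.1.0`.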
Test environment hygiene
- No prior customer-greenfield stack exists in JE-Vectors-test. Verify: terraform -chdir=infrastructure/terraform/environments/customer-greenfield show 2>&1 | grep -qi "no state" || terraform -chdir=infrastructure/terraform/environments/customer-greenfield destroy -auto-approve, then confirm the environment is clean.
- RDS snapshot quota in us-east-1 has headroom (the test creates 2+ snapshots). Check: aws rds describe-account-attributes --region us-east-1.
- No leftover ECR images from prior test runs polluting the registry. Optional cleanup: aws ecr batch-delete-image for the customer-greenfield repo.
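The snapshot-quota check can be reduced to a single pass/fail. This sketch assumes the snapshots the test creates count against the `ManualSnapshots` account quota; if they are cluster snapshots, a different quota name applies. `snapshot_headroom` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: confirm the RDS manual-snapshot quota has room for the test's snapshots.
# Assumption: the relevant quota is the one named "ManualSnapshots".
snapshot_headroom() {
  local needed="${1:-2}" region="${2:-us-east-1}" out used max v
  out="$(aws rds describe-account-attributes --region "$region" \
    --query "AccountQuotas[?AccountQuotaName=='ManualSnapshots'].[Used,Max]" \
    --output text)" || return 1
  # Text output is "Used<TAB>Max"; rely on word splitting to separate the two.
  set -- $out
  used="${1:-}"
  max="${2:-}"
  for v in "$used" "$max"; do
    case "$v" in
      ''|*[!0-9]*) echo "could not read snapshot quota" >&2; return 1 ;;
    esac
  done
  if [ $((max - used)) -lt "$needed" ]; then
    echo "insufficient snapshot headroom: ${used}/${max} used, need ${needed}" >&2
    return 1
  fi
  echo "snapshot headroom ok: ${used}/${max} used"
}
```

Example use: `snapshot_headroom 2 us-east-1`.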
Recording
- You have a results capture target ready: create docs/working/nqu-645-validation-<YYYY-MM-DD>.md and paste outputs as you go. The goal is a single artifact future runs can diff against.
Phase A — Preflight checks
These are read-only. If any fails, do not start Phase B — fix the preflight issue first.
export AWS_PROFILE=je-vectors-test
cd infrastructure/terraform/environments/customer-greenfield
- A1: terraform init -input=false succeeds. Confirms the S3 backend in JE-Vectors-test is reachable and the state lock table exists.
- A2: terraform validate returns "Success! The configuration is valid."
- A3: aws bedrock list-foundation-models --region us-east-1 --query 'modelSummaries[?modelLifecycle.status==`ACTIVE` && contains([`anthropic.claude-sonnet-4-6`, `anthropic.claude-haiku-4-5-20251001-v1:0`, `amazon.titan-embed-text-v2:0`, `cohere.rerank-v3-5:0`], modelId)]' returns four entries.
- A4: terraform plan -out=preflight.plan succeeds and shows ~150 resources to be created. No resources shown as "destroy" or "replace"; the env is fresh.
- A5: Discard the plan: rm preflight.plan.
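The A4 expectation (resources to add, nothing to destroy or replace) can be checked against Terraform's one-line plan summary. A replacement is counted under both "to add" and "to destroy", so a destroy count of zero covers both cases. This is a sketch: `plan_is_greenfield` is a hypothetical helper name, and it assumes the plan output was produced with `-no-color` so the summary line contains no escape codes.

```shell
#!/usr/bin/env bash
# Sketch: read `terraform plan -no-color` output on stdin and verify the summary
# line reports something to add and nothing to destroy.
plan_is_greenfield() {
  local summary add destroy
  # The summary looks like: "Plan: 150 to add, 0 to change, 0 to destroy."
  summary="$(grep -E '^Plan: ' | tail -n 1)"
  add="$(printf '%s\n' "$summary" | sed -nE 's/.* ([0-9]+) to add.*/\1/p')"
  destroy="$(printf '%s\n' "$summary" | sed -nE 's/.* ([0-9]+) to destroy.*/\1/p')"
  if [ -z "$add" ] || [ -z "$destroy" ]; then
    echo "no plan summary line found" >&2
    return 1
  fi
  if [ "$add" -eq 0 ] || [ "$destroy" -ne 0 ]; then
    echo "not a greenfield plan: ${add} to add, ${destroy} to destroy" >&2
    return 1
  fi
  echo "greenfield plan: ${add} to add, ${destroy} to destroy"
}
```

Example use: `terraform plan -no-color -out=preflight.plan | tee plan.txt | plan_is_greenfield`. The resource count itself (~150) is still worth eyeballing, since the sketch only checks that it is nonzero.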
Phase B — Greenfield install
Capture timing for each step. The release notes claim 15–25 minutes total apply; verify on your run.
- B1: From the customer-greenfield env dir, run the customer-facing wrapper:

      ../../apply.sh

  At the prompt, type y. Capture the start time.

- B2: Apply completes successfully. Capture the end time and total duration.

- B2.5: Bootstrap the database schema and run migrations (NQU-821):

      ../../bootstrap-db.sh

  This opens an SSM tunnel to the bastion, applies scripts/ci-bootstrap-db.sql (extensions + auth/storage schema stubs that the baseline migration assumes), then runs all 101 app migrations via npm run db:migrate. Skipping this leaves the RDS empty: /api/health will still return 200 (it doesn't touch the DB) but any real app traffic will fail. Wallclock baseline ~2 min. Idempotent on re-runs.

- B3: Inspect outputs:

      terraform output -json onboarding_summary | jq .version

  Confirms module_version and locked_image_uri show the target version.

- B4: Deploy the application image:

      LOCKED_URI=$(terraform output -json onboarding_summary | jq -r .version.locked_image_uri)
      VERSION=$(terraform output -raw module_version)
      ECR_PUBLIC_ALIAS=<from-JE-Vectors-dev-output>
      docker pull "public.ecr.aws/$ECR_PUBLIC_ALIAS/nquiry-app:$VERSION"
      docker tag "public.ecr.aws/$ECR_PUBLIC_ALIAS/nquiry-app:$VERSION" "$LOCKED_URI"
      aws ecr get-login-password --region us-east-1 \
        | docker login --username AWS --password-stdin "${LOCKED_URI%%/*}"
      docker push "$LOCKED_URI"
      aws ecs update-service --force-new-deployment \
        --cluster $(terraform output -raw ecs_cluster_name) \
        --service $(terraform output -raw ecs_service_name)

- B5: Wait for the ECS service to stabilize (~3–5 minutes). Verify:

      aws ecs describe-services \
        --cluster $(terraform output -raw ecs_cluster_name) \
        --services $(terraform output -raw ecs_service_name) \
        --query 'services[0].deployments[0].{status:status, running:runningCount, desired:desiredCount}'

  Expected: {"status": "PRIMARY", "running": 2, "desired": 2}.

- B6: The health endpoint returns 200:

      APP_URL=$(terraform output -raw app_url 2>/dev/null || echo "https://<app_domain-from-tfvars>")
      curl -fsS "$APP_URL/api/health"

  Note: for v0.1.0 with placeholder NEXT_PUBLIC_* values, the app may return a partial-config error; the goal of Phase B is to verify the deploy mechanism, not full application functionality. Capture what /api/health returns and flag any divergence from the release's "What's new" expectations.

- B7: No CloudWatch alarms are in ALARM state:

      aws cloudwatch describe-alarms --state-value ALARM --region us-east-1

  Expected: empty MetricAlarms array.
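Since every step in this phase asks for start time, end time, and duration, a small timing wrapper can capture them consistently and flag a run that exceeds its budget. This is a sketch under stated assumptions: `timed` is a hypothetical helper name, and the 1500-second budget in the usage note is simply the 25-minute upper bound quoted above.

```shell
#!/usr/bin/env bash
# Sketch: run a command, print start/end/duration, and fail if the command fails
# or takes longer than the given budget in seconds.
timed() {
  local label="$1" budget_s="$2" start end elapsed status
  shift 2
  start="$(date +%s)"
  if "$@"; then status=0; else status=$?; fi
  end="$(date +%s)"
  elapsed=$((end - start))
  echo "${label}: start=${start} end=${end} duration=${elapsed}s exit=${status}"
  # Propagate the command's own failure first, then enforce the time budget.
  [ "$status" -eq 0 ] || return "$status"
  [ "$elapsed" -le "$budget_s" ]
}
```

Example use for B1/B2: `timed "B1-B2 apply" 1500 ../../apply.sh`. The wrapper leaves stdin attached, so the interactive prompt still works; paste the printed line into the results file.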
Phase C — Upgrade exercise
Skip Phase C if this is the first-ever release validation (no prior release exists to upgrade from). For v0.1.0 specifically, Phase C is skipped — see §v0.1.0-specific notes.
For releases v0.1.1 and later:
- C1: In your customer-greenfield/main.tf, change the module source ref from the current version to the prior version, then back to the target. (Or, simpler: do the upgrade from a known-prior version.)

  For the canonical test: install the prior release in Phase B, then change the pin to the target release here.

- C2: Re-run ../../apply.sh. Review the plan. Capture the diff: what's changing? Match against the release notes' "What's new" section.

- C3: Type y. Capture upgrade duration.

- C4: Redeploy the image with the new version (same B4 sequence, with the new $VERSION).

- C5: Verify the upgrade landed (B5–B7 checks repeated).

- C6: Inspect onboarding_summary.version: module_version and locked_image_uri should now reflect the target version.
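Because the B5–B7 checks are repeated after the upgrade (and again after rollback), bundling them into one function keeps the repeats identical. The sketch below is an assumption-laden convenience, not a shipped script: `verify_deploy` is a hypothetical name, it treats "running equals desired" as the stabilization signal rather than hard-coding 2, and it passes the region explicitly.

```shell
#!/usr/bin/env bash
# Sketch: re-run the B5 (ECS), B6 (health) and B7 (alarms) checks in one call.
verify_deploy() {
  local cluster="$1" service="$2" app_url="$3" region="${4:-us-east-1}"
  local out status running desired alarms
  # B5: newest deployment should be PRIMARY with running == desired.
  out="$(aws ecs describe-services --region "$region" \
    --cluster "$cluster" --services "$service" \
    --query 'services[0].deployments[0].[status,runningCount,desiredCount]' \
    --output text)" || return 1
  set -- $out   # word splitting separates the three tab-delimited fields
  status="${1:-}"; running="${2:-}"; desired="${3:-}"
  if [ "$status" != "PRIMARY" ] || [ -z "$desired" ] || [ "$running" != "$desired" ]; then
    echo "B5 failed: status=${status} running=${running} desired=${desired}" >&2
    return 1
  fi
  # B6: -f makes curl exit non-zero on HTTP errors.
  if ! curl -fsS "${app_url}/api/health" >/dev/null; then
    echo "B6 failed: ${app_url}/api/health did not return success" >&2
    return 1
  fi
  # B7: count alarms currently in ALARM state.
  alarms="$(aws cloudwatch describe-alarms --state-value ALARM --region "$region" \
    --query 'length(MetricAlarms)' --output text)" || return 1
  if [ "$alarms" != "0" ]; then
    echo "B7 failed: ${alarms} alarm(s) in ALARM state" >&2
    return 1
  fi
  echo "deploy verified: ${running}/${desired} running, health ok, no alarms"
}
```

Example use: `verify_deploy "$(terraform output -raw ecs_cluster_name)" "$(terraform output -raw ecs_service_name)" "$APP_URL"`. For v0.1.0, where a partial-config response from the health endpoint is tolerated, the B6 result still needs a manual read.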
Phase D — Rollback exercise
Skip Phase D for v0.1.0 — same reason as Phase C.
For releases v0.1.1 and later:
- D1: Follow the rollback runbook §"Standard rollback": change the module pin back to the prior release, run apply.sh, redeploy the prior image.

- D2: Verify B5–B7 against the rolled-back state.

- D3: Inspect onboarding_summary.version; it should reflect the prior release.

- D4: Capture rollback duration.

- D5: After rollback completes, re-run the target-version apply (D1 in reverse) to confirm forward-then-back-then-forward-again is symmetric.
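One way to make the forward-back-forward symmetry concrete is to record the reported module version after the upgrade (C6), after the rollback (D3), and after the re-apply (D5), then compare the three readings. This sketch only compares version strings; it does not prove the infrastructure is otherwise identical. `check_roundtrip` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: compare the versions observed at C6, D3 and D5 against what the
# round trip should have produced.
check_roundtrip() {
  local prior="$1" target="$2" after_upgrade="$3" after_rollback="$4" after_reapply="$5"
  if [ "$after_upgrade" != "$target" ]; then
    echo "C6 mismatch: expected ${target}, saw ${after_upgrade}" >&2
    return 1
  fi
  if [ "$after_rollback" != "$prior" ]; then
    echo "D3 mismatch: expected ${prior}, saw ${after_rollback}" >&2
    return 1
  fi
  if [ "$after_reapply" != "$target" ]; then
    echo "D5 mismatch: expected ${target}, saw ${after_reapply}" >&2
    return 1
  fi
  echo "round trip ok: ${prior} -> ${target} -> ${prior} -> ${target}"
}
```

Each observed value can be captured with `terraform output -raw module_version` right after the corresponding apply, for example `UP=$(terraform output -raw module_version)` following C3.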
Phase E — Teardown
- E1: From the customer-greenfield env dir:

      terraform destroy

  Type yes at the prompt. Capture destroy duration (expect 10–15 minutes; RDS deletion is the long pole).

- E2: Confirm clean teardown:

      terraform show 2>&1 | grep -qi "no state"
      aws ecs list-clusters --region us-east-1 | grep nquiry-greenfield   # expected: no match
      aws rds describe-db-instances --region us-east-1 --query 'DBInstances[?contains(DBInstanceIdentifier, `nquiry-greenfield`)]'   # expected: empty

- E3: Any leftover ECR images from this run can stay. They're cheap and useful for the next run's pull-cache.
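The E2 checks rely on reading "no match" and "empty" by eye; the sketch below turns the ECS and RDS lookups into a single exit status. It assumes, as the commands above do, that every resource from this stack carries `nquiry-greenfield` in its name. `teardown_is_clean` is a hypothetical helper name, and it covers only the two resource types E2 names.

```shell
#!/usr/bin/env bash
# Sketch: fail if any ECS cluster or RDS instance matching the stack marker remains.
teardown_is_clean() {
  local region="${1:-us-east-1}" marker="nquiry-greenfield" clusters instances
  # An empty filter result prints nothing in text output mode.
  clusters="$(aws ecs list-clusters --region "$region" \
    --query "clusterArns[?contains(@, '${marker}')]" --output text)" || return 1
  instances="$(aws rds describe-db-instances --region "$region" \
    --query "DBInstances[?contains(DBInstanceIdentifier, '${marker}')].DBInstanceIdentifier" \
    --output text)" || return 1
  if [ -n "$clusters" ] || [ -n "$instances" ]; then
    echo "orphaned resources: clusters=[${clusters}] instances=[${instances}]" >&2
    return 1
  fi
  echo "teardown clean in ${region}"
}
```

Example use: `teardown_is_clean us-east-1`. Other resource types the stack creates (VPCs, for instance) would need their own queries.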
Pass/fail criteria
The release passes validation if and only if:
| Check | Criterion |
|---|---|
| Phase A | All preflight items green |
| Phase B | Apply succeeds within 25 minutes; ECS service stabilizes; /api/health returns 200 (or documented partial-config behavior for v0.1.0); no ALARM state alarms |
| Phase C (if exercised) | Upgrade apply succeeds; service redeploys; new version reflected in outputs |
| Phase D (if exercised) | Rollback apply succeeds; service redeploys to prior image; prior version reflected in outputs; forward-back symmetry holds |
| Phase E | terraform destroy succeeds cleanly; no orphaned resources in the test account |
If any phase fails, do not recommend the release to customers. File the failure as a P0 issue against the release tag and route to the appropriate owner.
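The table's logic (A, B, and E must pass; C and D must pass when exercised and may otherwise be skipped) can be expressed as a small verdict function for the sign-off line. This is a sketch: `release_verdict` and its `PHASE=result` argument format are hypothetical, introduced only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: derive the sign-off verdict from per-phase results.
# Arguments look like: A=pass B=pass C=skip D=skip E=pass
release_verdict() {
  local pair phase result seen=""
  for pair in "$@"; do
    phase="${pair%%=*}"
    result="${pair#*=}"
    case "${phase}:${result}" in
      A:pass|B:pass|C:pass|D:pass|E:pass) seen="${seen} ${phase}" ;;
      C:skip|D:skip) ;;  # only the upgrade and rollback phases may be skipped
      *) echo "BLOCKED: phase ${phase} reported '${result}'"; return 1 ;;
    esac
  done
  # A, B and E are mandatory on every run.
  for phase in A B E; do
    case " ${seen} " in
      *" ${phase} "*) ;;
      *) echo "BLOCKED: phase ${phase} has no passing result"; return 1 ;;
    esac
  done
  echo "READY FOR CUSTOMERS"
}
```

For the v0.1.0 run this would be called as `release_verdict A=pass B=pass C=skip D=skip E=pass`.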
How to capture results
Each run produces a docs/working/nqu-645-validation-<YYYY-MM-DD>.md file with:
- Release under test — tag, commit SHA, run date.
- Per-phase pass/fail with timing. Where a step has multiple sub-checks, list each.
- Diff against prior run (if applicable) — did any duration creep beyond historical baseline?
- Issues found — every red box, even if you fixed it on the spot. The runbook gets stronger over time only if past failures are recorded.
- Sign-off — name of the engineer who ran the test, date, "READY FOR CUSTOMERS" or "BLOCKED — see issues above."
The completed validation file is the evidence artifact that closes NQU-645 for the v0.1.0 run, and is the equivalent gate for every subsequent release.
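To keep results files diffable between runs, it may help to start each one from the same skeleton. The sketch below writes the five sections listed above into a dated file and refuses to overwrite an existing one. `new_validation_report` is a hypothetical helper name, and the plain-text section layout is an assumption rather than an established template.

```shell
#!/usr/bin/env bash
# Sketch: create docs/working/nqu-645-validation-<YYYY-MM-DD>.md with empty sections.
new_validation_report() {
  local tag="$1" dir="${2:-docs/working}" day file
  day="$(date +%Y-%m-%d)"
  file="${dir}/nqu-645-validation-${day}.md"
  if [ -e "$file" ]; then
    echo "refusing to overwrite ${file}" >&2
    return 1
  fi
  mkdir -p "$dir" || return 1
  cat > "$file" <<EOF
Release validation results

Release under test
- Tag: ${tag}
- Commit SHA:
- Run date: ${day}

Per-phase pass/fail with timing
- Phase A:
- Phase B:
- Phase C:
- Phase D:
- Phase E:

Diff against prior run

Issues found

Sign-off
- Engineer:
- Date:
- Verdict:
EOF
  echo "$file"
}
```

Example use, run from the repository root: `new_validation_report v0.1.0`. The function prints the path it created so it can be opened directly.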
v0.1.0-specific notes
The first release validation against v0.1.0 has reduced scope:
- Phase C (upgrade) is skipped: there is no prior release to upgrade from. The mechanism's upgrade path will be exercised by the v0.1.1 validation (whenever that release cuts).
- Phase D (rollback) is skipped: same reason; nothing to roll back across. v0.1.0's "rollback" is terraform destroy (covered in Phase E).
- Phase B step B6 (/api/health): v0.1.0 bakes placeholder NEXT_PUBLIC_* build args, so the application may return Cognito-config errors. The acceptance for v0.1.0 is that the deploy mechanism works end-to-end (ECS service stabilizes, ALB routes traffic, CloudFront fronts it, the container starts and the network path is correct). Application-config correctness is a v0.2.0 acceptance bar.
- Phase E: full destroy is the correct closing action. v0.1.0 is for mechanism validation, not for leaving infrastructure running.
After a successful v0.1.0 run:
- Update docs/licensee/release-notes/v0.1.0.md Known issues: remove "first validation pending" and add the validation date + signing engineer.
- Close NQU-645 to Done with a comment referencing the validation results doc.
- File a follow-up Linear ticket for the v0.1.1 patch release, whose acceptance includes the full Phase C + D exercise (first real upgrade/rollback test).
Related docs
- Customer rollback runbook: referenced by Phase D.
- Release notes index: version history; check that each release notes' "Required customer actions" match what Phase B exercises.
- Customer environment requirements: prerequisites carried forward to customer-side installs.
- apply.sh: the customer wrapper Phase B drives.
- release-publish.yml: the publish workflow whose successful completion is a Phase A preflight item.