Release Validation Runbook

End-to-end test procedure for a Nquiry release. Runs the full greenfield install → upgrade → rollback cycle against the JE-Vectors-test AWS account (961381384763) before a release is recommended to customers.

Who runs this. JE Vectors release engineering. Not customer-facing — customers operate the rollback runbook on their own environments. This runbook validates that the mechanisms we ship to customers actually work end-to-end.

When to run it

Before recommending any release to a customer. This is the explicit gate.
After any change to infrastructure/terraform/, .github/workflows/release-publish.yml, or the apply.sh wrapper. These touch the customer-deployable surface; regressions here are not caught by unit tests.
As the closing event for the NQU-645 deployment-automation initiative — running this against v0.1.0 is what flips NQU-645 to Done.

Prerequisites

Before starting, confirm each item. If any fails, fix it before proceeding — the runbook assumes all preflight items are green.

AWS access

aws sso login --profile je-vectors-test succeeds.
AWS_PROFILE=je-vectors-test aws sts get-caller-identity returns account 961381384763.
You have admin-or-equivalent in the JE-Vectors-test account (the test creates and destroys VPCs, RDS, ECS, etc).

Release artifact

The target Git tag (e.g., v0.1.0) exists on origin/main. List with git tag -l 'v*' --sort=-v:refname | head -5.
The tag's infrastructure/terraform/modules/nquiry-stack/VERSION matches the tag name. Verify: git show <tag>:infrastructure/terraform/modules/nquiry-stack/VERSION | tr -d '[:space:]'.
The release-publish.yml workflow has completed successfully for the tag. Check: gh run list --workflow=release-publish.yml --branch <tag> --limit 1.
The image is pullable from ECR Public. Verify: docker manifest inspect public.ecr.aws/<alias>/nquiry-app:<tag>.

Test environment hygiene

No prior customer-greenfield stack exists in JE-Vectors-test. Verify: terraform -chdir=infrastructure/terraform/environments/customer-greenfield show 2>&1 | grep -q "no state" || terraform -chdir=infrastructure/terraform/environments/customer-greenfield destroy -auto-approve — and confirm clean.
RDS snapshot quota in us-east-1 has headroom (the test creates 2+ snapshots). Check: aws rds describe-account-attributes --region us-east-1.
No leftover ECR images from prior test runs polluting the registry. Optional cleanup: aws ecr batch-delete-image for the customer-greenfield repo.

Recording

You have a results capture target ready: create docs/working/nqu-645-validation-<YYYY-MM-DD>.md and paste outputs as you go. Goal is a single artifact future runs can diff against.

Phase A — Preflight checks

These are read-only. If any fails, do not start Phase B — fix the preflight issue first.

export AWS_PROFILE=je-vectors-test
cd infrastructure/terraform/environments/customer-greenfield

A1: terraform init -input=false succeeds. Confirms the S3 backend in JE-Vectors-test is reachable and the state lock table exists.
A2: terraform validate returns "Success! The configuration is valid."
A3: aws bedrock list-foundation-models --region us-east-1 --query 'modelSummaries[?modelLifecycle.status==ACTIVE && contains([\anthropic.claude-sonnet-4-6`, `anthropic.claude-haiku-4-5-20251001-v1:0`, `amazon.titan-embed-text-v2:0`, `cohere.rerank-v3-5:0`], modelId)]'` returns four entries.
A4: terraform plan -out=preflight.plan succeeds and shows ~150 resources to be created. No resources shown as "destroy" or "replace" — the env is fresh.
A5: Discard the plan: rm preflight.plan.

Phase B — Greenfield install

Capture timing for each step. The release notes claim 15–25 minutes total apply; verify on your run.

B1: From the customer-greenfield env dir, run the customer-facing wrapper:
```
../../apply.sh
```
At the prompt, type y. Capture the start time.
B2: Apply completes successfully. Capture the end time and total duration.
B2.5: Bootstrap the database schema and run migrations (NQU-821):
```
../../bootstrap-db.sh
```
This opens an SSM tunnel to the bastion, applies scripts/ci-bootstrap-db.sql (extensions + auth/storage schema stubs that the baseline migration assumes), then runs all 101 app migrations via npm run db:migrate. Skipping this leaves the RDS empty — /api/health will still return 200 (it doesn't touch the DB) but any real app traffic will fail. Wallclock baseline ~2 min.

Idempotent on re-runs.
B3: Inspect outputs:
```
terraform output -json onboarding_summary | jq .version
```
Confirms module_version and locked_image_uri show the target version.

B4: Deploy the application image:

LOCKED_URI=$(terraform output -json onboarding_summary | jq -r .version.locked_image_uri)
VERSION=$(terraform output -raw module_version)
ECR_PUBLIC_ALIAS=<from-JE-Vectors-dev-output>

docker pull "public.ecr.aws/$ECR_PUBLIC_ALIAS/nquiry-app:$VERSION"
docker tag  "public.ecr.aws/$ECR_PUBLIC_ALIAS/nquiry-app:$VERSION" "$LOCKED_URI"
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin "${LOCKED_URI%%/*}"
docker push "$LOCKED_URI"

aws ecs update-service --force-new-deployment \
  --cluster $(terraform output -raw ecs_cluster_name) \
  --service $(terraform output -raw ecs_service_name)

B5: Wait for the ECS service to stabilize (~3–5 minutes). Verify:

aws ecs describe-services \
  --cluster $(terraform output -raw ecs_cluster_name) \
  --services $(terraform output -raw ecs_service_name) \
  --query 'services[0].deployments[0].{status:status, running:runningCount, desired:desiredCount}'

Expected: {"status": "PRIMARY", "running": 2, "desired": 2}.

B6: The health endpoint returns 200:
```
APP_URL=$(terraform output -raw app_url 2>/dev/null || echo "https://<app_domain-from-tfvars>")
curl -fsS "$APP_URL/api/health"
```
Note: for v0.1.0 with placeholder NEXT_PUBLIC_* values, the app may return a partial-config error — the goal of Phase B is to verify the deploy mechanism, not full application functionality. Capture what /api/health returns and flag any divergence from the release's "What's new" expectations.
B7: No CloudWatch alarms are in ALARM state:
```
aws cloudwatch describe-alarms --state-value ALARM --region us-east-1
```
Expected: empty MetricAlarms array.

Phase C — Upgrade exercise

Skip Phase C if this is the first-ever release validation (no prior release exists to upgrade from). For v0.1.0 specifically, Phase C is skipped — see §v0.1.0-specific notes.

For releases v0.1.1 and later:

C1: In your customer-greenfield/main.tf, change the module source ref from the current version to the prior version, then back to the target. (Or, simpler: do the upgrade from a known-prior version.)

For the canonical test: install the prior release in Phase B, then change the pin to the target release here.
C2: Re-run ../../apply.sh. Review the plan. Capture the diff — what's changing? Match against the release notes' "What's new" section.
C3: Type y. Capture upgrade duration.
C4: Redeploy the image with the new version (same B4 sequence, with the new $VERSION).
C5: Verify the upgrade landed (B5–B7 checks repeated).
C6: Inspect onboarding_summary.version — module_version and locked_image_uri should now reflect the target version.

Phase D — Rollback exercise

Skip Phase D for v0.1.0 — same reason as Phase C.

For releases v0.1.1 and later:

D1: Follow the rollback runbook §"Standard rollback" — change the module pin back to the prior release, run apply.sh, redeploy the prior image.
D2: Verify B5–B7 against the rolled-back state.
D3: Inspect onboarding_summary.version — should reflect the prior release.
D4: Capture rollback duration.
D5: After rollback completes, re-run the target-version apply (D1 in reverse) to confirm forward-then-back-then-forward-again is symmetric.

Phase E — Teardown

E1: From the customer-greenfield env dir:
```
terraform destroy
```
Type yes at the prompt. Capture destroy duration (expect 10–15 minutes; RDS deletion is the long pole).

E2: Confirm clean teardown:

terraform show 2>&1 | grep -q "no state"
aws ecs list-clusters --region us-east-1 | grep nquiry-greenfield   # expected: no match
aws rds describe-db-instances --region us-east-1 --query 'DBInstances[?contains(DBInstanceIdentifier, `nquiry-greenfield`)]' # expected: empty

E3: Any leftover ECR images from this run can stay — they're cheap and useful for the next run's pull-cache.

Pass/fail criteria

The release passes validation if and only if:

Check	Criterion
Phase A	All preflight items green
Phase B	Apply succeeds within 25 minutes; ECS service stabilizes; `/api/health` returns 200 (or documented partial-config behavior for v0.1.0); no `ALARM` state alarms
Phase C (if exercised)	Upgrade apply succeeds; service redeploys; new version reflected in outputs
Phase D (if exercised)	Rollback apply succeeds; service redeploys to prior image; prior version reflected in outputs; forward-back symmetry holds
Phase E	`terraform destroy` succeeds cleanly; no orphaned resources in the test account

If any phase fails, do not recommend the release to customers. File the failure as a P0 issue against the release tag and route to the appropriate owner.

How to capture results

Each run produces a docs/working/nqu-645-validation-<YYYY-MM-DD>.md file with:

Release under test — tag, commit SHA, run date.
Per-phase pass/fail with timing. Where a step has multiple sub-checks, list each.
Diff against prior run (if applicable) — did any duration creep beyond historical baseline?
Issues found — every red box, even if you fixed it on the spot. The runbook gets stronger over time only if past failures are recorded.
Sign-off — name of the engineer who ran the test, date, "READY FOR CUSTOMERS" or "BLOCKED — see issues above."

The completed validation file is the evidence artifact that closes NQU-645 for the v0.1.0 run, and is the equivalent gate for every subsequent release.

v0.1.0-specific notes

The first release validation against v0.1.0 has reduced scope:

Phase C (upgrade) is skipped — there is no prior release to upgrade from. The mechanism's upgrade path will be exercised by the v0.1.1 validation (whenever that release cuts).
Phase D (rollback) is skipped — same reason; nothing to roll back across. v0.1.0's "rollback" is terraform destroy (covered in Phase E).
Phase B step B6 (/api/health) — v0.1.0 bakes placeholder NEXT_PUBLIC_* build args, so the application may return Cognito-config errors. The acceptance for v0.1.0 is that the deploy mechanism works end-to-end (ECS service stabilizes, ALB routes traffic, CloudFront fronts it, the container starts and the network path is correct). Application-config correctness is a v0.2.0 acceptance bar.
Phase E — full destroy is the correct closing action. v0.1.0 is for mechanism validation, not for leaving infrastructure running.

After a successful v0.1.0 run:

Update docs/licensee/release-notes/v0.1.0.md Known issues: remove "first validation pending" and add the validation date + signing engineer.
Close NQU-645 to Done with a comment referencing the validation results doc.
File a follow-up Linear ticket for the v0.1.1 patch release, whose acceptance includes the full Phase C + D exercise (first real upgrade/rollback test).

Customer rollback runbook — referenced by Phase D.
Release notes index — version history; check that each release notes' "Required customer actions" match what Phase B exercises.
Customer environment requirements — prerequisites carried forward to customer-side installs.
apply.sh — the customer wrapper Phase B drives.
release-publish.yml — the publish workflow whose successful completion is a Phase A preflight item.

When to run it​

Prerequisites​

AWS access​

Release artifact​

Test environment hygiene​

Recording​

Phase A — Preflight checks​

Phase B — Greenfield install​

Phase C — Upgrade exercise​

Phase D — Rollback exercise​

Phase E — Teardown​

Pass/fail criteria​

How to capture results​

v0.1.0-specific notes​

Related docs​