Release Validation Runbook

End-to-end test procedure for an Nquiry release. Runs the full greenfield install → upgrade → rollback cycle against the JE-Vectors-test AWS account (961381384763) before a release is recommended to customers.

Who runs this. JE Vectors release engineering. Not customer-facing — customers operate the rollback runbook on their own environments. This runbook validates that the mechanisms we ship to customers actually work end-to-end.

When to run it

  • Before recommending any release to a customer. This is the explicit gate.
  • After any change to infrastructure/terraform/, .github/workflows/release-publish.yml, or the apply.sh wrapper. These touch the customer-deployable surface; regressions here are not caught by unit tests.
  • As the closing event for the NQU-645 deployment-automation initiative — running this against v0.1.0 is what flips NQU-645 to Done.

Prerequisites

Before starting, confirm each item. If any fails, fix it before proceeding — the runbook assumes all preflight items are green.

AWS access

  • aws sso login --profile je-vectors-test succeeds.
  • AWS_PROFILE=je-vectors-test aws sts get-caller-identity returns account 961381384763.
  • You have admin-or-equivalent in the JE-Vectors-test account (the test creates and destroys VPCs, RDS, ECS, etc).
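
The two identity checks above can be folded into a single guard that stops the run when the active profile points at the wrong account. This is a minimal sketch: the function name assert_test_account is hypothetical, and the account ID is passed in as an argument so the comparison itself does not need AWS credentials.

```shell
# Expected account ID for JE-Vectors-test, as stated in this runbook.
EXPECTED_ACCOUNT=961381384763

# assert_test_account ACCOUNT_ID
# Succeeds only when ACCOUNT_ID equals the expected test account.
assert_test_account() {
  if [ "$1" = "$EXPECTED_ACCOUNT" ]; then
    echo "ok: account $1"
  else
    echo "wrong account: got '$1', expected $EXPECTED_ACCOUNT" >&2
    return 1
  fi
}

# Usage against the live profile (requires a valid SSO session):
#   assert_test_account "$(AWS_PROFILE=je-vectors-test aws sts get-caller-identity \
#     --query Account --output text)"
```

Running the guard at the top of every terminal session used for the validation keeps destructive steps such as terraform destroy away from other accounts.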

Release artifact

  • The target Git tag (e.g., v0.1.0) exists on origin/main. List with git tag -l 'v*' --sort=-v:refname | head -5.
  • The tag's infrastructure/terraform/modules/nquiry-stack/VERSION matches the tag name. Verify: git show <tag>:infrastructure/terraform/modules/nquiry-stack/VERSION | tr -d '[:space:]'.
  • The release-publish.yml workflow has completed successfully for the tag. Check: gh run list --workflow=release-publish.yml --branch <tag> --limit 1.
  • The image is pullable from ECR Public. Verify: docker manifest inspect public.ecr.aws/<alias>/nquiry-app:<tag>.
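
The VERSION-versus-tag comparison can be scripted rather than eyeballed. The sketch below assumes the VERSION file holds the tag name verbatim, including the leading "v"; if the file omits it, adjust the comparison. The helper name version_matches_tag is hypothetical.

```shell
# version_matches_tag TAG RAW_VERSION
# Strips all whitespace from RAW_VERSION (as tr -d '[:space:]' does in the
# check above) and compares the result to TAG.
version_matches_tag() {
  tag=$1
  version=$(printf '%s' "$2" | tr -d '[:space:]')
  [ "$version" = "$tag" ]
}

# Usage inside the repository checkout:
#   TAG=v0.1.0
#   version_matches_tag "$TAG" \
#     "$(git show "$TAG":infrastructure/terraform/modules/nquiry-stack/VERSION)" \
#     || echo "VERSION file does not match $TAG" >&2
```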

Test environment hygiene

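  • No prior customer-greenfield stack exists in JE-Vectors-test. Verify: terraform -chdir=infrastructure/terraform/environments/customer-greenfield show 2>&1 | grep -qiE "no state|state file is empty" || terraform -chdir=infrastructure/terraform/environments/customer-greenfield destroy -auto-approve. The grep is case-insensitive because Terraform prints "No state." with a capital N. If the destroy ran, repeat the show command afterwards to confirm the environment is clean.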
  • RDS snapshot quota in us-east-1 has headroom (the test creates 2+ snapshots). Check: aws rds describe-account-attributes --region us-east-1.
  • No leftover ECR images from prior test runs polluting the registry. Optional cleanup: aws ecr batch-delete-image for the customer-greenfield repo.
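
For the snapshot-quota item, the headroom arithmetic can be made explicit. This sketch assumes the relevant quota is the one reported under the AccountQuotaName ManualSnapshots in the describe-account-attributes output; confirm that this matches the snapshot type the test actually creates. The helper name has_snapshot_headroom is hypothetical, and the default of 2 comes from the "2+ snapshots" figure above.

```shell
# has_snapshot_headroom USED MAX [NEEDED]
# Succeeds when at least NEEDED (default 2) snapshot slots remain.
has_snapshot_headroom() {
  used=$1
  max=$2
  needed=${3:-2}
  [ $((max - used)) -ge "$needed" ]
}

# Usage: feed in the Used and Max values from the quota listing, e.g.
#   set -- $(aws rds describe-account-attributes --region us-east-1 \
#     --query "AccountQuotas[?AccountQuotaName=='ManualSnapshots'].[Used,Max]" \
#     --output text)
#   has_snapshot_headroom "$1" "$2" || echo "free up RDS snapshots first" >&2
```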

Recording

  • You have a results capture target ready: create docs/working/nqu-645-validation-<YYYY-MM-DD>.md and paste outputs as you go. Goal is a single artifact future runs can diff against.

Phase A — Preflight checks

These are read-only. If any fails, do not start Phase B — fix the preflight issue first.

    export AWS_PROFILE=je-vectors-test
    cd infrastructure/terraform/environments/customer-greenfield

  • A1: terraform init -input=false succeeds. Confirms the S3 backend in JE-Vectors-test is reachable and the state lock table exists.
  • A2: terraform validate returns "Success! The configuration is valid."
  • A3: aws bedrock list-foundation-models --region us-east-1 --query 'modelSummaries[?modelLifecycle.status==`ACTIVE` && contains([`anthropic.claude-sonnet-4-6`, `anthropic.claude-haiku-4-5-20251001-v1:0`, `amazon.titan-embed-text-v2:0`, `cohere.rerank-v3-5:0`], modelId)]' returns four entries.
  • A4: terraform plan -out=preflight.plan succeeds and shows ~150 resources to be created. No resources shown as "destroy" or "replace" — the env is fresh.
  • A5: Discard the plan: rm preflight.plan.
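
The A4 condition (creates only, nothing changed or destroyed) can be checked mechanically from Terraform's plan summary line. The sketch below parses the "Plan: N to add, N to change, N to destroy." line; the helper name plan_is_create_only is hypothetical. A replacement is counted by Terraform as one add plus one destroy, so a zero destroy count also rules out replacements.

```shell
# plan_is_create_only SUMMARY_LINE
# SUMMARY_LINE is Terraform's "Plan: N to add, N to change, N to destroy." line.
# Succeeds only when the change and destroy counts are both zero.
plan_is_create_only() {
  change=$(printf '%s\n' "$1" | sed -n 's/.* \([0-9][0-9]*\) to change.*/\1/p')
  destroy=$(printf '%s\n' "$1" | sed -n 's/.* \([0-9][0-9]*\) to destroy.*/\1/p')
  [ "$change" = "0" ] && [ "$destroy" = "0" ]
}

# Usage (-no-color keeps ANSI escapes out of the captured line):
#   summary=$(terraform plan -no-color -out=preflight.plan | grep '^Plan:')
#   plan_is_create_only "$summary" || echo "env is not fresh: $summary" >&2
```

The resource count itself (~150) is still worth recording by hand, since a large deviation from the previous run is a signal even when the plan is create-only.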

Phase B — Greenfield install

Capture timing for each step. The release notes claim 15–25 minutes total apply; verify on your run.
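
Timing is easier to capture consistently with a small helper than with a wall clock. This is a sketch only; the helper names format_duration and within_budget are hypothetical, and the 25-minute budget in the usage comes from the release-notes claim above.

```shell
# format_duration SECONDS
# Prints the duration as minutes and seconds, e.g. 1062 -> 17m42s.
format_duration() {
  printf '%dm%02ds\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# within_budget SECONDS MINUTES
# Succeeds when the elapsed time does not exceed the budget in minutes.
within_budget() {
  [ "$1" -le $(( $2 * 60 )) ]
}

# Usage around any step:
#   start=$(date +%s)
#   ../../apply.sh
#   elapsed=$(( $(date +%s) - start ))
#   echo "apply took $(format_duration "$elapsed")"
#   within_budget "$elapsed" 25 || echo "apply exceeded the 25-minute claim" >&2
```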

  • B1: From the customer-greenfield env dir, run the customer-facing wrapper:

    ../../apply.sh

    At the prompt, type y. Capture the start time.

  • B2: Apply completes successfully. Capture the end time and total duration.

  • B2.5: Bootstrap the database schema and run migrations (NQU-821):

    ../../bootstrap-db.sh

    This opens an SSM tunnel to the bastion, applies scripts/ci-bootstrap-db.sql (extensions + auth/storage schema stubs that the baseline migration assumes), then runs all 101 app migrations via npm run db:migrate. Skipping this leaves the RDS database empty: /api/health will still return 200 (it doesn't touch the DB), but any real app traffic will fail. Wall-clock baseline: ~2 min.

    Idempotent on re-runs.

  • B3: Inspect outputs:

    terraform output -json onboarding_summary | jq .version

    Confirms module_version and locked_image_uri show the target version.

  • B4: Deploy the application image:

    LOCKED_URI=$(terraform output -json onboarding_summary | jq -r .version.locked_image_uri)
    VERSION=$(terraform output -raw module_version)
    ECR_PUBLIC_ALIAS=<from-JE-Vectors-dev-output>

    docker pull "public.ecr.aws/$ECR_PUBLIC_ALIAS/nquiry-app:$VERSION"
    docker tag "public.ecr.aws/$ECR_PUBLIC_ALIAS/nquiry-app:$VERSION" "$LOCKED_URI"
    aws ecr get-login-password --region us-east-1 \
      | docker login --username AWS --password-stdin "${LOCKED_URI%%/*}"
    docker push "$LOCKED_URI"

    aws ecs update-service --force-new-deployment \
      --cluster "$(terraform output -raw ecs_cluster_name)" \
      --service "$(terraform output -raw ecs_service_name)"

  • B5: Wait for the ECS service to stabilize (~3–5 minutes). Verify:

    aws ecs describe-services \
      --cluster "$(terraform output -raw ecs_cluster_name)" \
      --services "$(terraform output -raw ecs_service_name)" \
      --query 'services[0].deployments[0].{status:status, running:runningCount, desired:desiredCount}'

    Expected: {"status": "PRIMARY", "running": 2, "desired": 2}.

  • B6: The health endpoint returns 200:

    APP_URL=$(terraform output -raw app_url 2>/dev/null || echo "https://<app_domain-from-tfvars>")
    curl -fsS "$APP_URL/api/health"

    Note: for v0.1.0 with placeholder NEXT_PUBLIC_* values, the app may return a partial-config error — the goal of Phase B is to verify the deploy mechanism, not full application functionality. Capture what /api/health returns and flag any divergence from the release's "What's new" expectations.

  • B7: No CloudWatch alarms are in ALARM state:

    aws cloudwatch describe-alarms --state-value ALARM --region us-east-1

    Expected: empty MetricAlarms array.
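
The B5 wait-and-verify step can be scripted instead of polled by hand. The sketch below pairs the AWS CLI waiter aws ecs wait services-stable with a small predicate over the same three fields B5 inspects. The helper name deployment_settled is hypothetical, and requiring a non-zero desired count is an added assumption (B5 expects 2 of 2).

```shell
# deployment_settled STATUS RUNNING DESIRED
# Succeeds when the deployment is PRIMARY, at least one task is desired,
# and every desired task is running.
deployment_settled() {
  [ "$1" = "PRIMARY" ] && [ "$3" -gt 0 ] && [ "$2" -eq "$3" ]
}

# Usage from the customer-greenfield env dir:
#   CLUSTER=$(terraform output -raw ecs_cluster_name)
#   SERVICE=$(terraform output -raw ecs_service_name)
#   aws ecs wait services-stable --cluster "$CLUSTER" --services "$SERVICE"
#   set -- $(aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
#     --query 'services[0].deployments[0].[status,runningCount,desiredCount]' \
#     --output text)
#   deployment_settled "$1" "$2" "$3" || echo "service not settled: $*" >&2
```

The waiter gives up with a non-zero exit code if the service does not stabilize in time, which is itself a Phase B failure worth recording.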

Phase C — Upgrade exercise

Skip Phase C if this is the first-ever release validation (no prior release exists to upgrade from). For v0.1.0 specifically, Phase C is skipped — see §v0.1.0-specific notes.

For releases v0.1.1 and later:

  • C1: Start from the prior release. For the canonical test, install the prior release in Phase B, then change the module source ref in customer-greenfield/main.tf to the target release here. (If Phase B installed the target release instead, first re-pin to the prior version and apply, then pin back to the target.)

  • C2: Re-run ../../apply.sh. Review the plan. Capture the diff — what's changing? Match against the release notes' "What's new" section.

  • C3: Type y. Capture upgrade duration.

  • C4: Redeploy the image with the new version (same B4 sequence, with the new $VERSION).

  • C5: Verify the upgrade landed (B5–B7 checks repeated).

  • C6: Inspect onboarding_summary.version: module_version and locked_image_uri should now reflect the target version.
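
The pin change in C1 can be done with a scripted substitution so the same edit is repeatable in Phase D. This sketch assumes the module source in main.tf is pinned with a "?ref=<tag>" suffix inside a quoted string; that format is an assumption, so check the actual file before using it. The helper name repin_module is hypothetical, and it prints the rewritten file rather than editing in place.

```shell
# repin_module FILE OLD_TAG NEW_TAG
# Prints FILE with every ?ref=OLD_TAG" replaced by ?ref=NEW_TAG".
# Dots in OLD_TAG are escaped so they match literally, and the closing
# quote is part of the pattern so v0.1.0 cannot match v0.1.01.
repin_module() {
  old=$(printf '%s' "$2" | sed 's/\./\\./g')
  sed "s/?ref=$old\"/?ref=$3\"/" "$1"
}

# Usage:
#   repin_module main.tf v0.1.0 v0.1.1 > main.tf.new && mv main.tf.new main.tf
#   git diff main.tf   # review the pin change before running ../../apply.sh
```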

Phase D — Rollback exercise

Skip Phase D for v0.1.0 — same reason as Phase C.

For releases v0.1.1 and later:

  • D1: Follow the rollback runbook §"Standard rollback" — change the module pin back to the prior release, run apply.sh, redeploy the prior image.

  • D2: Verify B5–B7 against the rolled-back state.

  • D3: Inspect onboarding_summary.version — should reflect the prior release.

  • D4: Capture rollback duration.

  • D5: After rollback completes, re-run the target-version apply (D1 in reverse) to confirm forward-then-back-then-forward-again is symmetric.
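
One way to make the D5 symmetry claim checkable is to record module_version after each of the three applies and compare them at the end. This is a sketch of that comparison only; the helper name rollback_symmetric is hypothetical, and treating version equality as the symmetry criterion is an assumption (plan diffs and durations are still compared by hand).

```shell
# rollback_symmetric TARGET PRIOR AFTER_UPGRADE AFTER_ROLLBACK AFTER_REAPPLY
# Succeeds when the recorded versions read target, prior, target in that order.
rollback_symmetric() {
  [ "$3" = "$1" ] && [ "$4" = "$2" ] && [ "$5" = "$1" ]
}

# Usage: capture the version after each apply, then compare.
#   after_upgrade=$(terraform output -raw module_version)    # end of Phase C
#   after_rollback=$(terraform output -raw module_version)   # after D1
#   after_reapply=$(terraform output -raw module_version)    # after D5
#   rollback_symmetric v0.1.1 v0.1.0 \
#     "$after_upgrade" "$after_rollback" "$after_reapply" \
#     || echo "forward/back/forward versions do not line up" >&2
```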

Phase E — Teardown

  • E1: From the customer-greenfield env dir:

    terraform destroy

    Type yes at the prompt. Capture destroy duration (expect 10–15 minutes; RDS deletion is the long pole).

  • E2: Confirm clean teardown:

    terraform show 2>&1 | grep -qiE "no state|state file is empty" # expected: exit 0
    aws ecs list-clusters --region us-east-1 | grep nquiry-greenfield # expected: no match
    aws rds describe-db-instances --region us-east-1 --query 'DBInstances[?contains(DBInstanceIdentifier, `nquiry-greenfield`)]' # expected: empty

  • E3: Any leftover ECR images from this run can stay — they're cheap and useful for the next run's pull-cache.
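
Matching on the text of terraform show ties the E2 check to Terraform's exact message wording. A sketch of an alternative that counts entries instead, so that any listing (state resources, cluster names) is judged clean when it is empty; the helper name is_empty_listing is hypothetical.

```shell
# is_empty_listing
# Reads stdin and succeeds when it contains no non-blank lines.
is_empty_listing() {
  ! grep -q '[^[:space:]]'
}

# Usage from the customer-greenfield env dir:
#   terraform state list 2>/dev/null | is_empty_listing \
#     || echo "state still tracks resources" >&2
#   aws ecs list-clusters --region us-east-1 | grep nquiry-greenfield | is_empty_listing \
#     || echo "orphaned ECS cluster found" >&2
```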

Pass/fail criteria

The release passes validation if and only if:

Check | Criterion
Phase A | All preflight items green
Phase B | Apply succeeds within 25 minutes; ECS service stabilizes; /api/health returns 200 (or documented partial-config behavior for v0.1.0); no ALARM state alarms
Phase C (if exercised) | Upgrade apply succeeds; service redeploys; new version reflected in outputs
Phase D (if exercised) | Rollback apply succeeds; service redeploys to prior image; prior version reflected in outputs; forward-back symmetry holds
Phase E | terraform destroy succeeds cleanly; no orphaned resources in the test account

If any phase fails, do not recommend the release to customers. File the failure as a P0 issue against the release tag and route to the appropriate owner.
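
The "if and only if" rule can be expressed as a small aggregator over per-phase results. This is a sketch; the helper name verdict and the result keywords pass, fail, and skipped are hypothetical. Only Phases C and D may legitimately be reported as skipped, and the helper does not enforce which phase a result belongs to.

```shell
# verdict RESULT...
# Each RESULT is "pass", "fail", or "skipped".
# Prints the sign-off verdict; anything other than pass/skipped blocks
# the release, as does calling it with no results at all.
verdict() {
  if [ "$#" -eq 0 ]; then
    echo "BLOCKED"
    return 1
  fi
  for result in "$@"; do
    case $result in
      pass|skipped) ;;
      *) echo "BLOCKED"; return 1 ;;
    esac
  done
  echo "READY FOR CUSTOMERS"
}

# Usage for a v0.1.0 run (Phases A, B, C, D, E in order):
#   verdict pass pass skipped skipped pass
```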

How to capture results

Each run produces a docs/working/nqu-645-validation-<YYYY-MM-DD>.md file with:

  1. Release under test — tag, commit SHA, run date.
  2. Per-phase pass/fail with timing. Where a step has multiple sub-checks, list each.
  3. Diff against prior run (if applicable) — did any duration creep beyond historical baseline?
  4. Issues found — every red box, even if you fixed it on the spot. The runbook gets stronger over time only if past failures are recorded.
  5. Sign-off — name of the engineer who ran the test, date, "READY FOR CUSTOMERS" or "BLOCKED — see issues above."

The completed validation file is the evidence artifact that closes NQU-645 for the v0.1.0 run, and is the equivalent gate for every subsequent release.
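
Starting every run from the same skeleton keeps the results files diffable. The sketch below prints a template with the five sections listed above; the helper name results_skeleton and the exact headings are hypothetical, not an established format.

```shell
# results_skeleton TAG COMMIT_SHA RUN_DATE
# Prints an empty results document with the five required sections.
results_skeleton() {
  cat <<EOF
# Validation results for $1

## 1. Release under test
- Tag: $1
- Commit SHA: $2
- Run date: $3

## 2. Per-phase results (pass/fail, duration)
- Phase A:
- Phase B:
- Phase C:
- Phase D:
- Phase E:

## 3. Diff against prior run

## 4. Issues found

## 5. Sign-off
- Engineer:
- Date:
- Verdict:
EOF
}

# Usage from the repository root:
#   today=$(date +%Y-%m-%d)
#   results_skeleton v0.1.0 "$(git rev-parse 'v0.1.0^{commit}')" "$today" \
#     > "docs/working/nqu-645-validation-$today.md"
```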

v0.1.0-specific notes

The first release validation against v0.1.0 has reduced scope:

  • Phase C (upgrade) is skipped — there is no prior release to upgrade from. The mechanism's upgrade path will be exercised by the v0.1.1 validation (whenever that release cuts).
  • Phase D (rollback) is skipped — same reason; nothing to roll back across. v0.1.0's "rollback" is terraform destroy (covered in Phase E).
  • Phase B step B6 (/api/health) — v0.1.0 bakes placeholder NEXT_PUBLIC_* build args, so the application may return Cognito-config errors. The acceptance for v0.1.0 is that the deploy mechanism works end-to-end (ECS service stabilizes, ALB routes traffic, CloudFront fronts it, the container starts and the network path is correct). Application-config correctness is a v0.2.0 acceptance bar.
  • Phase E — full destroy is the correct closing action. v0.1.0 is for mechanism validation, not for leaving infrastructure running.

After a successful v0.1.0 run:

  1. Update docs/licensee/release-notes/v0.1.0.md Known issues: remove "first validation pending" and add the validation date + signing engineer.
  2. Close NQU-645 to Done with a comment referencing the validation results doc.
  3. File a follow-up Linear ticket for the v0.1.1 patch release, whose acceptance includes the full Phase C + D exercise (first real upgrade/rollback test).