Analysis System Master Plan

Created: 2026-02-06 (last reviewed 2026-04-29)
Owner: Joe (product vision) ↔ Claude Code (implementation)
Status: Living document; the single source of truth for analysis system evolution

Executive Summary

Nquiry's analysis system is the core value proposition. This document unifies all analysis-related planning, research, and implementation into a single strategic framework. It provides:

Vision — What the analysis system should become
Current State — Honest assessment of where we are
Architecture — How the pieces fit together
Evolution Process — Systematic methodology for improvement
Workstreams — Concrete initiatives organized by priority

Guiding Principle: Every AI output must be traceable, measurable, and trustworthy. Users are investigation professionals with legal and ethical obligations — our system must support, not undermine, their professional judgment.

Part 1: Vision

The User's Core Question

"Can I trust that the AI found all relevant evidence and analyzed it accurately?"

Everything we build must answer this question affirmatively and demonstrably.

What "Done" Looks Like

For the User:

Generate analysis with confidence in what the AI considered
See gaps before committing to conclusions
Verify every claim traces to evidence
Understand why the AI reached its conclusions
Know when to trust and when to verify

For the System:

Every analysis is reproducible (same inputs → same outputs)
Every claim can be traced to evidence
Quality is measured, not assumed
Prompts evolve based on data, not intuition
Failures are detected, logged, and learned from

For the Business:

Differentiate on rigor, not just features
Build trust that enables enterprise sales
Create defensible audit trails for compliance
Reduce AI costs through intelligent gating

Part 2: Current State Assessment

What Works Well

Component	Status	Notes
Evidence Evaluation Framework	✅ Solid	617-line CIGIE/GAO-based framework
Structured Output Schemas	✅ Solid	Zod validation, TypeScript types
Semantic Search	✅ Functional	pgvector, Titan embeddings, similarity scoring
Prompt Versioning	✅ Functional	DB storage, history table, rollback capability
5 Analysis Types	✅ Functional	Question, Topic, Gap, Error, Summary
7 Report Types	✅ Functional	All sections with framework integration
Test Fixtures	✅ Solid	20 cases (basic, edge, adversarial)

What Needs Work

Component	Issue	Impact
Quality Checks	Disabled due to timeouts	No faithfulness/coverage checks in production
Two-Phase Flow	Not implemented	Users pay for analysis on insufficient evidence
Async Quality	Not implemented	Quality checks have no way to run after analysis completes
Prompt Tuning	Ad hoc	No systematic evaluation methodology
User Feedback	Not captured	Can't learn from accept/edit/reject signals
Cost Tracking	Partial	Token usage logged but not surfaced
Retrieval Audit	Logged but hidden	Users can't see what evidence was considered

Technical Debt

Item	Location	Priority
Faithfulness/coverage calls commented out	`app/api/analysis/generate/route.ts`	HIGH
Quality metrics UI incomplete	Analysis components	HIGH
No prompt A/B testing	Prompt system	MEDIUM
Evidence retrieval log not exposed	UI	MEDIUM
Custom framework upload	Phase 3 deferred	LOW

Part 3: Architecture

System Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                           USER WORKFLOW                                      │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐   │
│  │ Upload  │ →  │Organize │ →  │ Analyze │ →  │ Review  │ →  │ Report  │   │
│  │Evidence │    │Questions│    │         │    │Findings │    │         │   │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘    └─────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         ANALYSIS ENGINE                                      │
│                                                                              │
│  ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐      │
│  │ PHASE 1: ASSESS  │ →  │ PHASE 2: ANALYZE │ →  │ PHASE 3: VERIFY  │      │
│  │ (Fast, Cheap)    │    │ (Claude, Costly) │    │ (Async, Quality) │      │
│  │                  │    │                  │    │                  │      │
│  │ • Evidence count │    │ • Prompt select  │    │ • Faithfulness   │      │
│  │ • Gap detection  │    │ • Context build  │    │ • Coverage       │      │
│  │ • Coverage score │    │ • Claude call    │    │ • Confidence     │      │
│  │ • Proceed gate   │    │ • Output parse   │    │ • Audit log      │      │
│  └──────────────────┘    └──────────────────┘    └──────────────────┘      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         SUPPORTING SYSTEMS                                   │
│                                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │  Embedding  │  │   Prompt    │  │   Quality   │  │   Audit     │        │
│  │  Pipeline   │  │   Manager   │  │   Metrics   │  │   Trail     │        │
│  │             │  │             │  │             │  │             │        │
│  │ • Chunk     │  │ • Version   │  │ • Calculate │  │ • Log all   │        │
│  │ • Embed     │  │ • A/B test  │  │ • Display   │  │ • Trace     │        │
│  │ • Search    │  │ • Evaluate  │  │ • Alert     │  │ • Export    │        │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Data Flow

Evidence Upload
      │
      ▼
┌─────────────────┐
│ Text Extraction │ ← PDF, DOCX, images
└─────────────────┘
      │
      ▼
┌─────────────────┐
│   Chunking      │ ← ~500 token chunks
└─────────────────┘
      │
      ▼
┌─────────────────┐
│   Embedding     │ ← Titan embeddings → pgvector
└─────────────────┘
      │
      ▼
Analysis Request
      │
      ▼
┌─────────────────┐
│ Semantic Search │ ← Query embedding → similarity search
└─────────────────┘
      │
      ▼
┌─────────────────┐
│ Context Builder │ ← Top chunks + manual links + background docs
└─────────────────┘
      │
      ▼
┌─────────────────┐
│ Prompt Assembly │ ← Template + variables + framework
└─────────────────┘
      │
      ▼
┌─────────────────┐
│  Claude (LLM)   │ ← Bedrock API
└─────────────────┘
      │
      ▼
┌─────────────────┐
│ Output Parsing  │ ← JSON extraction + Zod validation
└─────────────────┘
      │
      ▼
┌─────────────────┐
│ Quality Checks  │ ← Faithfulness + Coverage + Confidence
└─────────────────┘
      │
      ▼
Analysis Stored + Displayed
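
The retrieval and context-building steps in the flow above can be sketched in a few lines. This is a minimal in-memory sketch, not the production path: the `Chunk` shape and the function names are hypothetical, and in the system described the similarity search runs in pgvector rather than in application code. Background documents are left out for brevity.

```typescript
// Hypothetical shape; the real `evidence_chunk` columns are not specified in this document.
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

interface ScoredChunk {
  chunk: Chunk;
  similarity: number;
}

// Cosine similarity between two equal-length vectors; returns 0 if either is a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Rank chunks by similarity to the query embedding and keep the top K.
function semanticSearch(query: number[], chunks: Chunk[], topK: number): ScoredChunk[] {
  return chunks
    .map((chunk) => ({ chunk, similarity: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, topK);
}

// Merge manually linked chunks with retrieved chunks, de-duplicated by id.
// Manual links come first so they are never displaced by retrieval results.
function buildContext(retrieved: ScoredChunk[], manualLinks: Chunk[]): Chunk[] {
  const seen = new Set<string>();
  const context: Chunk[] = [];
  for (const chunk of [...manualLinks, ...retrieved.map((r) => r.chunk)]) {
    if (!seen.has(chunk.id)) {
      seen.add(chunk.id);
      context.push(chunk);
    }
  }
  return context;
}
```

Placing manual links ahead of retrieved chunks is a design assumption; it reflects the idea that evidence an investigator linked by hand should always reach the prompt.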

Key Tables

Table	Purpose
`analysis`	Analysis results, quality metrics, prompt traceability
`evidence_chunk`	Embedded text chunks for semantic search
`analysis_evidence_retrieval`	Audit log of what evidence was retrieved
`prompt_template`	Active prompts with versioning
`prompt_template_history`	Prompt version history for rollback
`ai_usage`	Token usage and cost tracking

Part 4: Evolution Process

The Improvement Cycle

Every analysis system improvement follows this cycle:

      ┌─────────────────────────────────────────────────────────┐
      │                                                         │
      ▼                                                         │
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐   │
│ MEASURE  │ →  │ ANALYZE  │ →  │ IMPROVE  │ →  │ VALIDATE │ ──┘
│          │    │          │    │          │    │          │
│ • Metrics│    │ • Root   │    │ • Design │    │ • Test   │
│ • Logs   │    │   cause  │    │ • Build  │    │ • Deploy │
│ • User   │    │ • Pattern│    │ • Prompt │    │ • Monitor│
│   signal │    │   find   │    │   tune   │    │          │
└──────────┘    └──────────┘    └──────────┘    └──────────┘

Measurement Framework

Quantitative Metrics:

Metric	Source	Target
Faithfulness Score	`analysis.faithfulness_score`	≥ 90%
Coverage Score	`analysis.coverage_score`	≥ 85%
Retrieval Quality	`analysis.retrieval_stats`	avg similarity ≥ 0.80
Validation Pass Rate	`analysis.validation_passed`	≥ 99%
Generation Success	`analysis.generation_status`	≤ 1% failures
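
One way to make these targets enforceable is to hold them as data and check a metrics snapshot against them. The sketch below is illustrative: the thresholds are copied from the table above, but `QualitySnapshot`, `TARGETS`, and `missedTargets` are hypothetical names, and all values are assumed to be fractions in the range 0 to 1.

```typescript
// Hypothetical aggregate of the quantitative metrics, expressed as fractions (0..1).
interface QualitySnapshot {
  faithfulness: number;
  coverage: number;
  avgSimilarity: number;
  validationPassRate: number;
  generationFailureRate: number;
}

interface Target {
  metric: keyof QualitySnapshot;
  // "min": value must be >= threshold; "max": value must be <= threshold.
  kind: "min" | "max";
  threshold: number;
}

// Thresholds taken from the Quantitative Metrics table.
const TARGETS: Target[] = [
  { metric: "faithfulness", kind: "min", threshold: 0.9 },
  { metric: "coverage", kind: "min", threshold: 0.85 },
  { metric: "avgSimilarity", kind: "min", threshold: 0.8 },
  { metric: "validationPassRate", kind: "min", threshold: 0.99 },
  { metric: "generationFailureRate", kind: "max", threshold: 0.01 },
];

// Return the names of all metrics that miss their target.
function missedTargets(snapshot: QualitySnapshot): Array<keyof QualitySnapshot> {
  return TARGETS.filter((t) =>
    t.kind === "min" ? snapshot[t.metric] < t.threshold : snapshot[t.metric] > t.threshold
  ).map((t) => t.metric);
}
```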

User Signal Metrics:

Signal	What It Means	How to Capture
Regenerate	User rejected output	`analysis.iteration_number > 1`
Edit	User partially accepted	Track finding edits post-analysis
Accept	User trusted output	No regenerate + finding set
Direction provided	User guiding AI	`analysis` metadata
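
The first three signals in the table above could be derived from a single analysis record roughly as follows. This is a sketch under assumptions: only `iteration_number` is named in this document, so `findingSet` and `findingEdited` are hypothetical fields, and the precedence (regenerate over edit over accept) is one reasonable choice, not a settled rule.

```typescript
type UserSignal = "regenerate" | "edit" | "accept" | "none";

// Hypothetical view of one analysis and what the user did with it.
interface AnalysisActivity {
  iterationNumber: number; // mirrors `analysis.iteration_number`
  findingSet: boolean;     // assumed: user set a finding status
  findingEdited: boolean;  // assumed: user edited the finding after analysis
}

// Classify the strongest signal present on the record.
function classifySignal(a: AnalysisActivity): UserSignal {
  if (a.iterationNumber > 1) return "regenerate"; // user rejected an earlier output
  if (a.findingEdited) return "edit";             // user partially accepted
  if (a.findingSet) return "accept";              // no regenerate + finding set
  return "none";                                  // no signal yet
}
```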

Operational Metrics:

Metric	Source	Target
Latency (P95)	CloudWatch	< 25s
Token Cost	`ai_usage.cost_usd`	Track trend
Error Rate	Sentry / logs	< 0.5%

Prompt Evolution Process

When to Revise a Prompt:

Faithfulness score drops below threshold
User regeneration rate exceeds 20%
New analysis type needed
Framework updates require alignment
User feedback indicates systematic issues

Prompt Change Process:

1. DOCUMENT the issue
   - What's failing? (examples)
   - What metrics show it?
   - What's the root cause hypothesis?

2. DESIGN the change
   - Draft new prompt version
   - Identify affected test fixtures
   - Plan validation approach

3. TEST offline
   - Run against test fixtures
   - Compare output quality
   - Check for regressions

4. DEPLOY as new version
   - Migration creates new version
   - Old version archived to history
   - Traceability maintained

5. MONITOR in production
   - Compare metrics before/after
   - Watch for regressions
   - Collect user signal

Prompt Version Naming:

Major: Breaking change (output schema change)
Minor: Significant improvement (new instructions)
Patch: Wording refinement
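
The naming rule above maps naturally onto a three-part version number. The sketch below assumes versions are stored as `major.minor.patch` strings, which this document does not confirm; `bumpVersion` is a hypothetical helper.

```typescript
type ChangeKind = "major" | "minor" | "patch";

// Compute the next prompt version for a given kind of change.
// Assumes versions look like "1.4.2"; anything else is rejected.
function bumpVersion(version: string, change: ChangeKind): string {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) throw new Error(`invalid version: ${version}`);
  const major = Number(match[1]);
  const minor = Number(match[2]);
  const patch = Number(match[3]);
  switch (change) {
    case "major": // breaking change, e.g. output schema change
      return `${major + 1}.0.0`;
    case "minor": // significant improvement, e.g. new instructions
      return `${major}.${minor + 1}.0`;
    case "patch": // wording refinement
      return `${major}.${minor}.${patch + 1}`;
  }
}
```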

Part 5: Workstreams

Workstream 1: Two-Phase Analysis (PRIORITY)

Problem: Users generate expensive narrative analysis without knowing if evidence is sufficient.

Solution: Split analysis into two phases:

Phase 1: Assessment (Fast, Cheap)

Count evidence per question
Run semantic search, compute coverage
Identify obvious gaps
Score readiness (0-100)
Display to user with recommendations
Gate: User decides whether to proceed

Phase 2: Narrative (User-Initiated)

Only runs after user confirms
Full Claude analysis
Uses Phase 1 results as context
Quality checks run async
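
As a rough illustration of Phase 1, the sketch below scores readiness without any LLM call. Everything here is an assumption made for illustration: the Phase 1 output schema is still to be designed, the scoring formula (share of questions with evidence, scaled by mean retrieval similarity) is a placeholder, and the `proceedAt` threshold of 70 is arbitrary. The result is a recommendation only; the user still decides whether to proceed.

```typescript
// Hypothetical per-question input to the assessment.
interface QuestionEvidence {
  questionId: string;
  evidenceCount: number; // pieces of evidence linked to or retrieved for the question
  avgSimilarity: number; // 0..1, mean similarity of retrieved chunks
}

// Hypothetical Phase 1 output.
interface Assessment {
  readiness: number; // 0-100
  gaps: string[];    // ids of questions with no evidence at all
  recommendation: "proceed" | "review_gaps";
}

function assess(questions: QuestionEvidence[], proceedAt = 70): Assessment {
  const gaps = questions.filter((q) => q.evidenceCount === 0).map((q) => q.questionId);
  const covered = questions.filter((q) => q.evidenceCount > 0);
  const meanSimilarity =
    covered.length === 0
      ? 0
      : covered.reduce((sum, q) => sum + q.avgSimilarity, 0) / covered.length;
  const coverage = questions.length === 0 ? 0 : covered.length / questions.length;
  // Placeholder formula: coverage scaled by retrieval quality, on a 0-100 scale.
  const readiness = Math.round(100 * coverage * meanSimilarity);
  return {
    readiness,
    gaps,
    recommendation: readiness >= proceedAt ? "proceed" : "review_gaps",
  };
}
```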

Implementation Plan:

Step	Task	Effort
1	Design Phase 1 output schema	S
2	Build assessment endpoint (no Claude)	M
3	Create assessment UI component	M
4	Add "Proceed to Analysis" gate	S
5	Refactor generate route for Phase 2	M
6	Update analysis page flow	M
7	Test end-to-end	M

Success Criteria:

Phase 1 completes in < 3 seconds
Users see gaps before committing
30%+ reduction in analysis on insufficient evidence
No change to Phase 2 output quality

Workstream 2: Quality Metrics Integration

Problem: Faithfulness and coverage checks exist but aren't running in production (disabled due to timeouts).

Solution: Run quality checks asynchronously after analysis completes.

Implementation Plan:

Step	Task	Effort
1	Create async quality check job	M
2	Update analysis status: `complete` → `checking` → `verified`	S
3	Run faithfulness checker async	M
4	Run coverage checker async	M
5	Update analysis record with results	S
6	Notify user when checks complete	S
7	Build quality metrics UI panel	M

Success Criteria:

Quality checks run on 100% of analyses
Users see metrics within 60s of analysis
No impact on analysis generation latency
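
The status progression in Step 2 can be modelled as a small state machine that the async job drives. This is a sketch, not the planned implementation: the checker functions are passed in as parameters because their real signatures are not given here, and failure handling (what happens if a check throws) is deliberately left out.

```typescript
type AnalysisStatus = "complete" | "checking" | "verified";

// Allowed transitions: complete -> checking -> verified.
const NEXT: Record<AnalysisStatus, AnalysisStatus | null> = {
  complete: "checking",
  checking: "verified",
  verified: null,
};

function advance(status: AnalysisStatus): AnalysisStatus {
  const next = NEXT[status];
  if (next === null) throw new Error(`no transition from ${status}`);
  return next;
}

interface QualityResult {
  faithfulness: number;
  coverage: number;
}

// Run both checkers concurrently after generation has finished,
// reporting each status change through the `onStatus` callback.
async function runQualityChecks(
  checkFaithfulness: () => Promise<number>,
  checkCoverage: () => Promise<number>,
  onStatus: (status: AnalysisStatus) => void
): Promise<QualityResult> {
  let status: AnalysisStatus = "complete";
  status = advance(status);
  onStatus(status); // "checking"
  const [faithfulness, coverage] = await Promise.all([checkFaithfulness(), checkCoverage()]);
  status = advance(status);
  onStatus(status); // "verified"
  return { faithfulness, coverage };
}
```

Running the two checkers with `Promise.all` keeps total checking time close to the slower of the two, which helps with the 60-second target above.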

Workstream 3: Retrieval Transparency

Problem: Users can't see what evidence the AI considered.

Solution: Expose evidence retrieval log to users.

Implementation Plan:

Step	Task	Effort
1	Design "Evidence Considered" UI	S
2	Query `analysis_evidence_retrieval` for display	S
3	Show chunks with similarity scores	M
4	Highlight which chunks were included	S
5	Add "Why wasn't X included?" explainer	M

Success Criteria:

Users can see all retrieved evidence
Similarity scores displayed
Clear explanation of inclusion/exclusion
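
A "Why wasn't X included?" explainer needs a reason attached to every retrieved chunk. The sketch below shows one possible rule set; the `RetrievedChunk` fields, the similarity threshold, and the chunk limit are all hypothetical, since the actual inclusion rules and the `analysis_evidence_retrieval` columns are not described in this document.

```typescript
// Hypothetical row from the retrieval audit log.
interface RetrievedChunk {
  chunkId: string;
  similarity: number; // 0..1
  manuallyLinked: boolean;
}

interface InclusionDecision {
  chunkId: string;
  included: boolean;
  reason: string;
}

// Decide, in descending similarity order, which chunks enter the context and why.
// Assumed rules: manual links are always included and do not count toward the limit.
function explainInclusion(
  chunks: RetrievedChunk[],
  minSimilarity: number,
  maxChunks: number
): InclusionDecision[] {
  const ranked = [...chunks].sort((a, b) => b.similarity - a.similarity);
  let used = 0;
  return ranked.map((c) => {
    if (c.manuallyLinked) {
      return { chunkId: c.chunkId, included: true, reason: "manually linked by the user" };
    }
    if (c.similarity < minSimilarity) {
      return {
        chunkId: c.chunkId,
        included: false,
        reason: `similarity ${c.similarity.toFixed(2)} is below the ${minSimilarity.toFixed(2)} threshold`,
      };
    }
    if (used >= maxChunks) {
      return { chunkId: c.chunkId, included: false, reason: `context limit of ${maxChunks} chunks reached` };
    }
    used++;
    return {
      chunkId: c.chunkId,
      included: true,
      reason: `similarity ${c.similarity.toFixed(2)} meets the threshold`,
    };
  });
}
```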

Workstream 4: Prompt Evaluation Framework

Problem: Prompt changes are made ad hoc without systematic evaluation.

Solution: Build evaluation infrastructure for prompt quality.

Implementation Plan:

Step	Task	Effort
1	Define evaluation metrics per prompt type	S
2	Create evaluation harness (run prompt against fixtures)	M
3	Implement LLM-as-judge for output quality	M
4	Build comparison report (version A vs B)	M
5	Add evaluation to prompt change process	S

Success Criteria:

Every prompt change has before/after metrics
Regressions detected before deploy
Evaluation runs in CI
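
The comparison report in Step 4 could start as a per-fixture diff of scores between two prompt versions. The sketch below assumes each fixture already has a score between 0 and 1 from some judge (LLM-as-judge or otherwise); `FixtureScore`, `compareVersions`, and the 0.05 tolerance are hypothetical.

```typescript
interface FixtureScore {
  fixtureId: string;
  score: number; // 0..1, produced by whatever judge is in use
}

interface ComparisonReport {
  meanA: number;
  meanB: number;
  regressions: string[]; // fixtures where B is worse than A by more than the tolerance
  improved: string[];    // fixtures where B is better than A by more than the tolerance
}

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Compare version B against baseline version A, fixture by fixture.
function compareVersions(a: FixtureScore[], b: FixtureScore[], tolerance = 0.05): ComparisonReport {
  const scoresB = new Map<string, number>(
    b.map((f): [string, number] => [f.fixtureId, f.score])
  );
  const regressions: string[] = [];
  const improved: string[] = [];
  for (const { fixtureId, score } of a) {
    const other = scoresB.get(fixtureId);
    if (other === undefined) throw new Error(`version B has no result for fixture ${fixtureId}`);
    if (other < score - tolerance) regressions.push(fixtureId);
    else if (other > score + tolerance) improved.push(fixtureId);
  }
  return {
    meanA: mean(a.map((f) => f.score)),
    meanB: mean(b.map((f) => f.score)),
    regressions,
    improved,
  };
}
```

A CI job could fail the build whenever `regressions` is non-empty, which would satisfy the "regressions detected before deploy" criterion even when the mean score improves.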

Workstream 5: User Feedback Loop

Problem: We don't capture whether users accept, edit, or reject AI output.

Solution: Track user actions post-analysis to learn from outcomes.

Implementation Plan:

Step	Task	Effort
1	Add `user_action` tracking to analysis	S
2	Track: viewed, regenerated, edited, accepted	S
3	Log finding changes after analysis	S
4	Build feedback dashboard for analysis	M
5	Use signals to prioritize prompt improvements	-

Success Criteria:

Know acceptance rate per analysis type
Know regeneration reasons
Data-driven prompt priorities

Workstream 6: Output Types Expansion

Current: 5 analysis types, 7 report types

Future Considerations:

Type	Purpose	Priority
Comparative Analysis	Compare evidence across time/subjects	POST-LAUNCH
Timeline Analysis	Chronological event reconstruction	POST-LAUNCH
Witness Credibility	Assess testimonial consistency	POST-LAUNCH
Risk Assessment	Prioritize issues by impact	POST-LAUNCH

Process for Adding New Types:

Define use case and user need
Design output schema
Create prompt with framework integration
Add test fixtures
Build UI components
Test with real evidence
Deploy with monitoring

Part 6: Priority Roadmap

Immediate (Next 2 Weeks)

Initiative	Workstream	Impact
Two-Phase Analysis Design	WS1	Architectural foundation
Async Quality Checks	WS2	Re-enable disabled features
Quality Metrics UI	WS2	User-facing trust indicators

Near-Term (Next Month)

Initiative	Workstream	Impact
Two-Phase Implementation	WS1	Cost savings, better UX
Retrieval Transparency	WS3	User trust
User Feedback Tracking	WS5	Data for improvement

Medium-Term (Next Quarter)

Initiative	Workstream	Impact
Prompt Evaluation Framework	WS4	Quality assurance
Prompt A/B Testing	WS4	Data-driven optimization
Custom Framework Upload	Phase 3	Enterprise feature

Part 7: Reference Documents

This master plan consolidates the documents below; those marked ACTIVE remain in use alongside it:

Document	Status	Notes
`docs/archive/prompt_engineering_tracker.md`	ACTIVE	Prompt implementation status
`docs/claude_ai_work/research/rag-quality-metrics-research.md`	REFERENCE	Original research
`docs/guide/workflow/analysis.md`	ACTIVE	User documentation
`docs/guide/features/quality-metrics.md`	ACTIVE	Quality metrics docs
`docs/reference/evaluation-framework.md`	ACTIVE	Core framework (historical)
CLAUDE.md "AI Quality & Metrics" section	ACTIVE	Development guidelines

Key Files

File	Purpose
`app/api/analysis/generate/route.ts`	Main analysis generation
`lib/ai/analysis-output-schema.ts`	Output types
`lib/ai/analysis-output-validators.ts`	Zod validators
`lib/ai/quality/*.ts`	Quality metric calculators
`lib/embeddings/retrieval.ts`	Semantic search
`lib/embeddings/pipeline.ts`	Embedding processing
`__tests__/fixtures/prompts/`	Test fixtures

Part 8: Success Metrics

North Star

Analysis Acceptance Rate: % of analyses where user sets finding status without regenerating

Target: 80%+ (users trust the output on first try)
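
To make the definition concrete, the rate could be computed overall and per analysis type as sketched below. The `AnalysisOutcome` fields are hypothetical, and the sketch assumes one outcome record per analysis request, with regenerations folded into that record instead of counted separately.

```typescript
// Hypothetical outcome record: one per analysis request.
interface AnalysisOutcome {
  analysisType: string; // e.g. "Question", "Gap"
  findingSet: boolean;  // user set a finding status
  regenerated: boolean; // user regenerated at least once
}

// Share of analyses where the user set a finding status without regenerating.
// Returns null when there is nothing to measure.
function acceptanceRate(outcomes: AnalysisOutcome[]): number | null {
  if (outcomes.length === 0) return null;
  const accepted = outcomes.filter((o) => o.findingSet && !o.regenerated).length;
  return accepted / outcomes.length;
}

// The same rate, broken down by analysis type.
function acceptanceRateByType(outcomes: AnalysisOutcome[]): Map<string, number> {
  const groups = new Map<string, AnalysisOutcome[]>();
  for (const o of outcomes) {
    const group = groups.get(o.analysisType) ?? [];
    group.push(o);
    groups.set(o.analysisType, group);
  }
  const rates = new Map<string, number>();
  groups.forEach((group, type) => {
    const rate = acceptanceRate(group);
    if (rate !== null) rates.set(type, rate);
  });
  return rates;
}
```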

Supporting Metrics

Metric	Current	Target	Timeline
Faithfulness Score (avg)	Unknown (disabled)	≥ 90%	Q2
Coverage Score (avg)	Unknown (disabled)	≥ 85%	Q2
Phase 1 → Phase 2 Conversion	N/A	Track	Q2
Analysis Regeneration Rate	Unknown	< 20%	Q2
Quality Check Coverage	0%	100%	Immediate

Changelog

Date	Change	Author
2026-02-06	Initial creation	Claude Code

This document is the single source of truth for analysis system evolution. Update it as decisions are made and progress occurs.
