
Analysis System Master Plan

Created: 2026-02-06 (last reviewed 2026-04-29)
Owner: Joe (product vision) ↔ Claude Code (implementation)
Status: Living document, the single source of truth for analysis system evolution


Executive Summary

Nquiry's analysis system is the core value proposition. This document unifies all analysis-related planning, research, and implementation into a single strategic framework. It provides:

  1. Vision — What the analysis system should become
  2. Current State — Honest assessment of where we are
  3. Architecture — How the pieces fit together
  4. Evolution Process — Systematic methodology for improvement
  5. Workstreams — Concrete initiatives organized by priority

Guiding Principle: Every AI output must be traceable, measurable, and trustworthy. Users are investigation professionals with legal and ethical obligations — our system must support, not undermine, their professional judgment.


Part 1: Vision

The User's Core Question

"Can I trust that the AI found all relevant evidence and analyzed it accurately?"

Everything we build must answer this question affirmatively and demonstrably.

What "Done" Looks Like

For the User:

  • Generate analysis with confidence in what the AI considered
  • See gaps before committing to conclusions
  • Verify every claim traces to evidence
  • Understand why the AI reached its conclusions
  • Know when to trust and when to verify

For the System:

  • Every analysis is reproducible (same inputs → same outputs)
  • Every claim can be traced to evidence
  • Quality is measured, not assumed
  • Prompts evolve based on data, not intuition
  • Failures are detected, logged, and learned from

For the Business:

  • Differentiate on rigor, not just features
  • Build trust that enables enterprise sales
  • Create defensible audit trails for compliance
  • Reduce AI costs through intelligent gating

Part 2: Current State Assessment

What Works Well

| Component | Status | Notes |
| --- | --- | --- |
| Evidence Evaluation Framework | ✅ Solid | 617-line CIGIE/GAO-based framework |
| Structured Output Schemas | ✅ Solid | Zod validation, TypeScript types |
| Semantic Search | ✅ Functional | pgvector, Titan embeddings, similarity scoring |
| Prompt Versioning | ✅ Functional | DB storage, history table, rollback capability |
| 5 Analysis Types | ✅ Functional | Question, Topic, Gap, Error, Summary |
| 7 Report Types | ✅ Functional | All sections with framework integration |
| Test Fixtures | ✅ Solid | 20 cases (basic, edge, adversarial) |

What Needs Work

| Component | Issue | Impact |
| --- | --- | --- |
| Quality Checks | Disabled for timeout | No faithfulness/coverage in production |
| Two-Phase Flow | Not implemented | Users pay for analysis on insufficient evidence |
| Async Quality | Not implemented | Quality checks could run post-analysis |
| Prompt Tuning | Ad hoc | No systematic evaluation methodology |
| User Feedback | Not captured | Can't learn from accept/edit/reject signals |
| Cost Tracking | Partial | Token usage logged but not surfaced |
| Retrieval Audit | Logged but hidden | Users can't see what evidence was considered |

Technical Debt

| Item | Location | Priority |
| --- | --- | --- |
| Faithfulness/coverage calls commented out | app/api/analysis/generate/route.ts | HIGH |
| Quality metrics UI incomplete | Analysis components | HIGH |
| No prompt A/B testing | Prompt system | MEDIUM |
| Evidence retrieval log not exposed | UI | MEDIUM |
| Custom framework upload | Phase 3 deferred | LOW |

Part 3: Architecture

System Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                                USER WORKFLOW                                │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐        │
│  │ Upload  │ → │Organize │ → │ Analyze │ → │ Review  │ → │ Report  │        │
│  │Evidence │   │Questions│   │         │   │Findings │   │         │        │
│  └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘        │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                               ANALYSIS ENGINE                               │
│                                                                             │
│  ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐         │
│  │ PHASE 1: ASSESS  │ → │ PHASE 2: ANALYZE │ → │ PHASE 3: VERIFY  │         │
│  │ (Fast, Cheap)    │   │ (Claude, Costly) │   │ (Async, Quality) │         │
│  │                  │   │                  │   │                  │         │
│  │ • Evidence count │   │ • Prompt select  │   │ • Faithfulness   │         │
│  │ • Gap detection  │   │ • Context build  │   │ • Coverage       │         │
│  │ • Coverage score │   │ • Claude call    │   │ • Confidence     │         │
│  │ • Proceed gate   │   │ • Output parse   │   │ • Audit log      │         │
│  └──────────────────┘   └──────────────────┘   └──────────────────┘         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                             SUPPORTING SYSTEMS                              │
│                                                                             │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐      │
│  │ Embedding   │   │ Prompt      │   │ Quality     │   │ Audit       │      │
│  │ Pipeline    │   │ Manager     │   │ Metrics     │   │ Trail       │      │
│  │             │   │             │   │             │   │             │      │
│  │ • Chunk     │   │ • Version   │   │ • Calculate │   │ • Log all   │      │
│  │ • Embed     │   │ • A/B test  │   │ • Display   │   │ • Trace     │      │
│  │ • Search    │   │ • Evaluate  │   │ • Alert     │   │ • Export    │      │
│  └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Data Flow

  Evidence Upload
         │
         ▼
┌─────────────────┐
│ Text Extraction │ ← PDF, DOCX, images
└─────────────────┘
         │
         ▼
┌─────────────────┐
│ Chunking        │ ← ~500 token chunks
└─────────────────┘
         │
         ▼
┌─────────────────┐
│ Embedding       │ ← Titan embeddings → pgvector
└─────────────────┘
         │
         ▼
 Analysis Request
         │
         ▼
┌─────────────────┐
│ Semantic Search │ ← Query embedding → similarity search
└─────────────────┘
         │
         ▼
┌─────────────────┐
│ Context Builder │ ← Top chunks + manual links + background docs
└─────────────────┘
         │
         ▼
┌─────────────────┐
│ Prompt Assembly │ ← Template + variables + framework
└─────────────────┘
         │
         ▼
┌─────────────────┐
│ Claude (LLM)    │ ← Bedrock API
└─────────────────┘
         │
         ▼
┌─────────────────┐
│ Output Parsing  │ ← JSON extraction + Zod validation
└─────────────────┘
         │
         ▼
┌─────────────────┐
│ Quality Checks  │ ← Faithfulness + Coverage + Confidence
└─────────────────┘
         │
         ▼
Analysis Stored + Displayed
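
The Semantic Search step ranks stored chunk embeddings against the query embedding. In production that ranking is expected to happen inside pgvector; the in-memory sketch below only illustrates the scoring, assuming cosine similarity is the measure in use. `ScoredChunk`, `cosineSimilarity`, and `topKChunks` are illustrative names, not existing code.

```typescript
// Hypothetical result shape; the real evidence_chunk columns are not shown in this document.
interface ScoredChunk {
  chunkId: string;
  similarity: number;
}

// Cosine similarity of two equal-length vectors; returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query, sort by descending similarity, keep the top k.
function topKChunks(
  query: number[],
  chunks: { chunkId: string; embedding: number[] }[],
  k: number,
): ScoredChunk[] {
  return chunks
    .map((c) => ({ chunkId: c.chunkId, similarity: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}
```

The scores produced here are the same kind of value the Retrieval Quality target (avg similarity ≥ 0.80) would be measured against.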

Key Tables

| Table | Purpose |
| --- | --- |
| analysis | Analysis results, quality metrics, prompt traceability |
| evidence_chunk | Embedded text chunks for semantic search |
| analysis_evidence_retrieval | Audit log of what evidence was retrieved |
| prompt_template | Active prompts with versioning |
| prompt_template_history | Prompt version history for rollback |
| ai_usage | Token usage and cost tracking |

Part 4: Evolution Process

The Improvement Cycle

Every analysis system improvement follows this cycle:

     ┌─────────────────────────────────────────────────────────┐
     │                                                         │
     ▼                                                         │
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐      │
│ MEASURE  │ → │ ANALYZE  │ → │ IMPROVE  │ → │ VALIDATE │ ─────┘
│          │   │          │   │          │   │          │
│ • Metrics│   │ • Root   │   │ • Design │   │ • Test   │
│ • Logs   │   │   cause  │   │ • Build  │   │ • Deploy │
│ • User   │   │ • Pattern│   │ • Prompt │   │ • Monitor│
│   signal │   │   find   │   │   tune   │   │          │
└──────────┘   └──────────┘   └──────────┘   └──────────┘

Measurement Framework

Quantitative Metrics:

| Metric | Source | Target |
| --- | --- | --- |
| Faithfulness Score | analysis.faithfulness_score | ≥ 90% |
| Coverage Score | analysis.coverage_score | ≥ 85% |
| Retrieval Quality | analysis.retrieval_stats | avg similarity ≥ 0.80 |
| Validation Pass Rate | analysis.validation_passed | ≥ 99% |
| Generation Success | analysis.generation_status | ≤ 1% failures |
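
One way to make these targets enforceable is to encode them once and compare each metrics snapshot against them. The following is a minimal sketch, assuming all scores and rates are stored as fractions between 0 and 1; the `QualitySnapshot` shape and the `missedTargets` helper are hypothetical and only the threshold values come from the table above.

```typescript
// Hypothetical aggregate of the quantitative metrics; all values are fractions in [0, 1].
interface QualitySnapshot {
  faithfulnessScore: number;
  coverageScore: number;
  avgSimilarity: number;
  validationPassRate: number;
  generationFailureRate: number;
}

// Thresholds copied from the Quantitative Metrics table.
const QUALITY_TARGETS = {
  faithfulnessScore: 0.9,
  coverageScore: 0.85,
  avgSimilarity: 0.8,
  validationPassRate: 0.99,
  generationFailureRate: 0.01,
} as const;

// Returns the names of the metrics that miss their target (empty array when all pass).
function missedTargets(s: QualitySnapshot): string[] {
  const missed: string[] = [];
  if (s.faithfulnessScore < QUALITY_TARGETS.faithfulnessScore) missed.push("faithfulness");
  if (s.coverageScore < QUALITY_TARGETS.coverageScore) missed.push("coverage");
  if (s.avgSimilarity < QUALITY_TARGETS.avgSimilarity) missed.push("retrieval_quality");
  if (s.validationPassRate < QUALITY_TARGETS.validationPassRate) missed.push("validation_pass_rate");
  // Generation success is expressed as a ceiling on failures, so the comparison flips.
  if (s.generationFailureRate > QUALITY_TARGETS.generationFailureRate) missed.push("generation_success");
  return missed;
}
```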

User Signal Metrics:

| Signal | What It Means | How to Capture |
| --- | --- | --- |
| Regenerate | User rejected output | analysis.iteration_number > 1 |
| Edit | User partially accepted | Track finding edits post-analysis |
| Accept | User trusted output | No regenerate + finding set |
| Direction provided | User guiding AI | analysis metadata |
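
The capture rules above can be read as a small decision function over one analysis record. This sketch assumes a precedence of regenerate, then edit, then accept; that ordering and the `findingSet` and `findingEditedAfterAnalysis` fields are illustrative assumptions. Only `iterationNumber` maps to a column named in this document (`analysis.iteration_number`).

```typescript
type UserSignal = "regenerate" | "edit" | "accept" | "none";

interface AnalysisOutcome {
  iterationNumber: number; // mirrors analysis.iteration_number
  findingSet: boolean; // hypothetical: the user set a finding status
  findingEditedAfterAnalysis: boolean; // hypothetical: the finding text changed post-analysis
}

// Classify the strongest signal present on a single analysis.
function classifySignal(o: AnalysisOutcome): UserSignal {
  if (o.iterationNumber > 1) return "regenerate"; // user rejected the first output
  if (o.findingEditedAfterAnalysis) return "edit"; // user partially accepted
  if (o.findingSet) return "accept"; // no regenerate + finding set
  return "none"; // no usable signal yet
}
```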

Operational Metrics:

| Metric | Source | Target |
| --- | --- | --- |
| Latency (P95) | CloudWatch | < 25s |
| Token Cost | ai_usage.cost_usd | Track trend |
| Error Rate | Sentry / logs | < 0.5% |

Prompt Evolution Process

When to Revise a Prompt:

  1. Faithfulness score drops below threshold
  2. User regeneration rate exceeds 20%
  3. New analysis type needed
  4. Framework updates require alignment
  5. User feedback indicates systematic issues

Prompt Change Process:

1. DOCUMENT the issue
- What's failing? (examples)
- What metrics show it?
- What's the root cause hypothesis?

2. DESIGN the change
- Draft new prompt version
- Identify affected test fixtures
- Plan validation approach

3. TEST offline
- Run against test fixtures
- Compare output quality
- Check for regressions

4. DEPLOY as new version
- Migration creates new version
- Old version archived to history
- Traceability maintained

5. MONITOR in production
- Compare metrics before/after
- Watch for regressions
- Collect user signal

Prompt Version Naming:

  • Major: Breaking change (output schema change)
  • Minor: Significant improvement (new instructions)
  • Patch: Wording refinement
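
The naming scheme maps onto a conventional three-part version number. The sketch below assumes versions are stored as `MAJOR.MINOR.PATCH` strings, which this document does not confirm; `bumpPromptVersion` is a hypothetical helper.

```typescript
type ChangeKind = "major" | "minor" | "patch";

// Compute the next version string for a prompt change of the given kind.
function bumpPromptVersion(current: string, kind: ChangeKind): string {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(current);
  if (!match) {
    throw new Error(`Invalid version: ${current}`);
  }
  const major = Number(match[1]);
  const minor = Number(match[2]);
  const patch = Number(match[3]);
  switch (kind) {
    case "major": // breaking change, e.g. output schema change
      return `${major + 1}.0.0`;
    case "minor": // significant improvement, e.g. new instructions
      return `${major}.${minor + 1}.0`;
    case "patch": // wording refinement
      return `${major}.${minor}.${patch + 1}`;
  }
}
```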

Part 5: Workstreams

Workstream 1: Two-Phase Analysis (PRIORITY)

Problem: Users generate expensive narrative analysis without knowing if evidence is sufficient.

Solution: Split analysis into two phases:

Phase 1: Assessment (Fast, Cheap)

  • Count evidence per question
  • Run semantic search, compute coverage
  • Identify obvious gaps
  • Score readiness (0-100)
  • Display to user with recommendations
  • Gate: User decides whether to proceed
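
The Phase 1 output schema is still to be designed (step 1 of the implementation plan), so the following is only a sketch of what a readiness score could look like. It scores readiness as the share of questions with enough sufficiently similar evidence. The `minEvidence` and `readyAt` defaults are placeholders, the type and function names are hypothetical, and `minSimilarity` borrows the 0.80 retrieval target from the Measurement Framework.

```typescript
interface QuestionAssessment {
  questionId: string;
  evidenceCount: number; // chunks retrieved for this question
  avgSimilarity: number; // mean similarity of those chunks, 0..1
}

interface ReadinessResult {
  score: number; // 0..100
  gaps: string[]; // question IDs with insufficient evidence
  recommendation: "ready" | "review_gaps";
}

// Score readiness without calling Claude; the result is advisory only.
function assessReadiness(
  questions: QuestionAssessment[],
  minEvidence = 3, // placeholder threshold
  minSimilarity = 0.8,
  readyAt = 70, // placeholder threshold
): ReadinessResult {
  if (questions.length === 0) {
    return { score: 0, gaps: [], recommendation: "review_gaps" };
  }
  const gaps = questions
    .filter((q) => q.evidenceCount < minEvidence || q.avgSimilarity < minSimilarity)
    .map((q) => q.questionId);
  const covered = questions.length - gaps.length;
  const score = Math.round((covered / questions.length) * 100);
  return { score, gaps, recommendation: score >= readyAt ? "ready" : "review_gaps" };
}
```

The recommendation does not replace the gate: the user still decides whether to proceed.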

Phase 2: Narrative (User-Initiated)

  • Only runs after user confirms
  • Full Claude analysis
  • Uses Phase 1 results as context
  • Quality checks run async

Implementation Plan:

| Step | Task | Effort |
| --- | --- | --- |
| 1 | Design Phase 1 output schema | S |
| 2 | Build assessment endpoint (no Claude) | M |
| 3 | Create assessment UI component | M |
| 4 | Add "Proceed to Analysis" gate | S |
| 5 | Refactor generate route for Phase 2 | M |
| 6 | Update analysis page flow | M |
| 7 | Test end-to-end | M |

Success Criteria:

  • Phase 1 completes in < 3 seconds
  • Users see gaps before committing
  • 30%+ reduction in analysis on insufficient evidence
  • No change to Phase 2 output quality

Workstream 2: Quality Metrics Integration

Problem: Faithfulness and coverage checks exist but do not run in production; the calls were disabled because of timeouts.

Solution: Run quality checks asynchronously after analysis completes.

Implementation Plan:

| Step | Task | Effort |
| --- | --- | --- |
| 1 | Create async quality check job | M |
| 2 | Update analysis status: complete → checking → verified | S |
| 3 | Run faithfulness checker async | M |
| 4 | Run coverage checker async | M |
| 5 | Update analysis record with results | S |
| 6 | Notify user when checks complete | S |
| 7 | Build quality metrics UI panel | M |

Success Criteria:

  • Quality checks run on 100% of analyses
  • Users see metrics within 60s of analysis
  • No impact on analysis generation latency
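
The status progression and the two async checkers could be wired together roughly as follows. This is a sketch under assumptions: the checker functions are injected with a hypothetical `(analysisId) => Promise<number>` signature, and persistence and user notification are left out.

```typescript
type AnalysisStatus = "complete" | "checking" | "verified";

// Allowed forward transitions: complete -> checking -> verified.
const NEXT_STATUS: Record<AnalysisStatus, AnalysisStatus | null> = {
  complete: "checking",
  checking: "verified",
  verified: null,
};

function advanceStatus(current: AnalysisStatus): AnalysisStatus {
  const next = NEXT_STATUS[current];
  if (next === null) {
    throw new Error(`No transition from status "${current}"`);
  }
  return next;
}

// Hypothetical checker signatures; the real ones live under lib/ai/quality/.
interface QualityCheckers {
  faithfulness: (analysisId: string) => Promise<number>;
  coverage: (analysisId: string) => Promise<number>;
}

// Runs after generation has returned, so it adds no latency to the analysis request.
async function runQualityChecks(
  analysisId: string,
  checkers: QualityCheckers,
): Promise<{ status: AnalysisStatus; faithfulnessScore: number; coverageScore: number }> {
  let status: AnalysisStatus = advanceStatus("complete"); // now "checking"
  // Both checks are independent, so run them concurrently.
  const [faithfulnessScore, coverageScore] = await Promise.all([
    checkers.faithfulness(analysisId),
    checkers.coverage(analysisId),
  ]);
  status = advanceStatus(status); // now "verified"
  return { status, faithfulnessScore, coverageScore };
}
```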

Workstream 3: Retrieval Transparency

Problem: Users can't see what evidence the AI considered.

Solution: Expose evidence retrieval log to users.

Implementation Plan:

| Step | Task | Effort |
| --- | --- | --- |
| 1 | Design "Evidence Considered" UI | S |
| 2 | Query analysis_evidence_retrieval for display | S |
| 3 | Show chunks with similarity scores | M |
| 4 | Highlight which chunks were included | S |
| 5 | Add "Why wasn't X included?" explainer | M |

Success Criteria:

  • Users can see all retrieved evidence
  • Similarity scores displayed
  • Clear explanation of inclusion/exclusion
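
A "Why wasn't X included?" explainer could be a pure function over one retrieval log row. The sketch below assumes two exclusion reasons, a similarity threshold and a context limit; neither is confirmed by this document, and the `RetrievedChunk` fields are hypothetical rather than the actual `analysis_evidence_retrieval` columns.

```typescript
// Hypothetical view of one analysis_evidence_retrieval row.
interface RetrievedChunk {
  chunkId: string;
  similarity: number; // 0..1
  included: boolean; // whether the chunk was placed in the prompt context
  manuallyLinked: boolean; // whether the user linked this evidence by hand
}

// Produce a one-sentence, user-facing reason for inclusion or exclusion.
function explainChunk(c: RetrievedChunk, similarityThreshold: number): string {
  const score = c.similarity.toFixed(2);
  const threshold = similarityThreshold.toFixed(2);
  if (c.included && c.manuallyLinked) {
    return `Included: manually linked by the user (similarity ${score}).`;
  }
  if (c.included) {
    return `Included: similarity ${score} met the ${threshold} threshold.`;
  }
  if (c.similarity < similarityThreshold) {
    return `Excluded: similarity ${score} was below the ${threshold} threshold.`;
  }
  return `Excluded: similarity ${score} met the threshold, but the context limit was reached.`;
}
```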

Workstream 4: Prompt Evaluation Framework

Problem: Prompt changes are made ad hoc without systematic evaluation.

Solution: Build evaluation infrastructure for prompt quality.

Implementation Plan:

| Step | Task | Effort |
| --- | --- | --- |
| 1 | Define evaluation metrics per prompt type | S |
| 2 | Create evaluation harness (run prompt against fixtures) | M |
| 3 | Implement LLM-as-judge for output quality | M |
| 4 | Build comparison report (version A vs B) | M |
| 5 | Add evaluation to prompt change process | S |

Success Criteria:

  • Every prompt change has before/after metrics
  • Regressions detected before deploy
  • Evaluation runs in CI
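
The version A vs B comparison report reduces to comparing per-fixture scores. In this sketch the scores are assumed to be numbers between 0 and 1 (for example from the LLM-as-judge step), and a regression is any fixture whose score drops by more than a tolerance. The 0.05 default and all names here are illustrative.

```typescript
interface FixtureScore {
  fixtureId: string;
  score: number; // 0..1, e.g. produced by an LLM judge
}

interface ComparisonReport {
  meanA: number;
  meanB: number;
  regressions: string[]; // fixtures where B scored notably worse than A
}

function meanScore(scores: FixtureScore[]): number {
  if (scores.length === 0) return 0;
  return scores.reduce((sum, s) => sum + s.score, 0) / scores.length;
}

// Compare the current version (a) with the candidate version (b) on shared fixtures.
function compareVersions(a: FixtureScore[], b: FixtureScore[], tolerance = 0.05): ComparisonReport {
  const scoresB = new Map<string, number>();
  for (const s of b) {
    scoresB.set(s.fixtureId, s.score);
  }
  const regressions: string[] = [];
  for (const s of a) {
    const after = scoresB.get(s.fixtureId);
    // Fixtures missing from B are skipped rather than counted as regressions.
    if (after !== undefined && after < s.score - tolerance) {
      regressions.push(s.fixtureId);
    }
  }
  return { meanA: meanScore(a), meanB: meanScore(b), regressions };
}
```

A CI job could fail the build whenever `regressions` is non-empty, which is one way to satisfy "regressions detected before deploy".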

Workstream 5: User Feedback Loop

Problem: We don't capture whether users accept, edit, or reject AI output.

Solution: Track user actions post-analysis to learn from outcomes.

Implementation Plan:

| Step | Task | Effort |
| --- | --- | --- |
| 1 | Add user_action tracking to analysis | S |
| 2 | Track: viewed, regenerated, edited, accepted | S |
| 3 | Log finding changes after analysis | S |
| 4 | Build feedback dashboard for analysis | M |
| 5 | Use signals to prioritize prompt improvements | - |

Success Criteria:

  • Know acceptance rate per analysis type
  • Know regeneration reasons
  • Data-driven prompt priorities
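
Once actions are tracked, acceptance rate per analysis type is a simple aggregation. This sketch assumes one final action is recorded per analysis; the `ActionRecord` shape is hypothetical, while the action values come from step 2 of the plan above.

```typescript
type UserAction = "viewed" | "regenerated" | "edited" | "accepted";

// Hypothetical record: the last action a user took on one analysis.
interface ActionRecord {
  analysisType: string; // e.g. "Question", "Topic", "Gap", "Error", "Summary"
  finalAction: UserAction;
}

// Fraction of analyses per type whose final action was "accepted".
function acceptanceRateByType(records: ActionRecord[]): Record<string, number> {
  const totals: Record<string, { accepted: number; total: number }> = {};
  for (const r of records) {
    const entry = totals[r.analysisType] ?? { accepted: 0, total: 0 };
    entry.total += 1;
    if (r.finalAction === "accepted") entry.accepted += 1;
    totals[r.analysisType] = entry;
  }
  const rates: Record<string, number> = {};
  for (const key of Object.keys(totals)) {
    rates[key] = totals[key].accepted / totals[key].total;
  }
  return rates;
}
```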

Workstream 6: Output Types Expansion

Current: 5 analysis types, 7 report types

Future Considerations:

| Type | Purpose | Priority |
| --- | --- | --- |
| Comparative Analysis | Compare evidence across time/subjects | POST-LAUNCH |
| Timeline Analysis | Chronological event reconstruction | POST-LAUNCH |
| Witness Credibility | Assess testimonial consistency | POST-LAUNCH |
| Risk Assessment | Prioritize issues by impact | POST-LAUNCH |

Process for Adding New Types:

  1. Define use case and user need
  2. Design output schema
  3. Create prompt with framework integration
  4. Add test fixtures
  5. Build UI components
  6. Test with real evidence
  7. Deploy with monitoring

Part 6: Priority Roadmap

Immediate (Next 2 Weeks)

| Initiative | Workstream | Impact |
| --- | --- | --- |
| Two-Phase Analysis Design | WS1 | Architectural foundation |
| Async Quality Checks | WS2 | Re-enable disabled features |
| Quality Metrics UI | WS2 | User-facing trust indicators |

Near-Term (Next Month)

| Initiative | Workstream | Impact |
| --- | --- | --- |
| Two-Phase Implementation | WS1 | Cost savings, better UX |
| Retrieval Transparency | WS3 | User trust |
| User Feedback Tracking | WS5 | Data for improvement |

Medium-Term (Next Quarter)

| Initiative | Workstream | Impact |
| --- | --- | --- |
| Prompt Evaluation Framework | WS4 | Quality assurance |
| Prompt A/B Testing | WS4 | Data-driven optimization |
| Custom Framework Upload | Phase 3 | Enterprise feature |

Part 7: Reference Documents

This master plan consolidates and supersedes:

| Document | Status | Notes |
| --- | --- | --- |
| docs/archive/prompt_engineering_tracker.md | ACTIVE | Prompt implementation status |
| docs/claude_ai_work/research/rag-quality-metrics-research.md | REFERENCE | Original research |
| docs/guide/workflow/analysis.md | ACTIVE | User documentation |
| docs/guide/features/quality-metrics.md | ACTIVE | Quality metrics docs |
| docs/reference/evaluation-framework.md | ACTIVE | Core framework (historical) |
| CLAUDE.md "AI Quality & Metrics" section | ACTIVE | Development guidelines |

Key Files

| File | Purpose |
| --- | --- |
| app/api/analysis/generate/route.ts | Main analysis generation |
| lib/ai/analysis-output-schema.ts | Output types |
| lib/ai/analysis-output-validators.ts | Zod validators |
| lib/ai/quality/*.ts | Quality metric calculators |
| lib/embeddings/retrieval.ts | Semantic search |
| lib/embeddings/pipeline.ts | Embedding processing |
| __tests__/fixtures/prompts/ | Test fixtures |

Part 8: Success Metrics

North Star

Analysis Acceptance Rate: % of analyses where user sets finding status without regenerating

Target: 80%+ (users trust the output on first try)

Supporting Metrics

| Metric | Current | Target | Timeline |
| --- | --- | --- | --- |
| Faithfulness Score (avg) | Unknown (disabled) | ≥ 90% | Q2 |
| Coverage Score (avg) | Unknown (disabled) | ≥ 85% | Q2 |
| Phase 1 → Phase 2 Conversion | N/A | Track | Q2 |
| Analysis Regeneration Rate | Unknown | < 20% | Q2 |
| Quality Check Coverage | 0% | 100% | Immediate |

Changelog

| Date | Change | Author |
| --- | --- | --- |
| 2026-02-06 | Initial creation | Claude Code |

This document is the single source of truth for analysis system evolution. Update it as decisions are made and progress occurs.