Analysis System Master Plan
Created: 2026-02-06 (last reviewed 2026-04-29)
Owner: Joe (product vision) ↔ Claude Code (implementation)
Status: Living document, the single source of truth for analysis system evolution
Executive Summary
Nquiry's analysis system is the core value proposition. This document unifies all analysis-related planning, research, and implementation into a single strategic framework. It provides:
- Vision — What the analysis system should become
- Current State — Honest assessment of where we are
- Architecture — How the pieces fit together
- Evolution Process — Systematic methodology for improvement
- Workstreams — Concrete initiatives organized by priority
Guiding Principle: Every AI output must be traceable, measurable, and trustworthy. Users are investigation professionals with legal and ethical obligations — our system must support, not undermine, their professional judgment.
Part 1: Vision
The User's Core Question
"Can I trust that the AI found all relevant evidence and analyzed it accurately?"
Everything we build must answer this question affirmatively and demonstrably.
What "Done" Looks Like
For the User:
- Generate analysis with confidence in what the AI considered
- See gaps before committing to conclusions
- Verify every claim traces to evidence
- Understand why the AI reached its conclusions
- Know when to trust and when to verify
For the System:
- Every analysis is reproducible (same inputs → same outputs)
- Every claim can be traced to evidence
- Quality is measured, not assumed
- Prompts evolve based on data, not intuition
- Failures are detected, logged, and learned from
For the Business:
- Differentiate on rigor, not just features
- Build trust that enables enterprise sales
- Create defensible audit trails for compliance
- Reduce AI costs through intelligent gating
Part 2: Current State Assessment
What Works Well
| Component | Status | Notes |
|---|---|---|
| Evidence Evaluation Framework | ✅ Solid | 617-line CIGIE/GAO-based framework |
| Structured Output Schemas | ✅ Solid | Zod validation, TypeScript types |
| Semantic Search | ✅ Functional | pgvector, Titan embeddings, similarity scoring |
| Prompt Versioning | ✅ Functional | DB storage, history table, rollback capability |
| 5 Analysis Types | ✅ Functional | Question, Topic, Gap, Error, Summary |
| 7 Report Types | ✅ Functional | All sections with framework integration |
| Test Fixtures | ✅ Solid | 20 cases (basic, edge, adversarial) |
What Needs Work
| Component | Issue | Impact |
|---|---|---|
| Quality Checks | Disabled because of timeouts | No faithfulness/coverage in production |
| Two-Phase Flow | Not implemented | Users pay for analysis on insufficient evidence |
| Async Quality | Not implemented | Quality checks could run post-analysis |
| Prompt Tuning | Ad hoc | No systematic evaluation methodology |
| User Feedback | Not captured | Can't learn from accept/edit/reject signals |
| Cost Tracking | Partial | Token usage logged but not surfaced |
| Retrieval Audit | Logged but hidden | Users can't see what evidence was considered |
Technical Debt
| Item | Location | Priority |
|---|---|---|
| Faithfulness/coverage calls commented out | app/api/analysis/generate/route.ts | HIGH |
| Quality metrics UI incomplete | Analysis components | HIGH |
| No prompt A/B testing | Prompt system | MEDIUM |
| Evidence retrieval log not exposed | UI | MEDIUM |
| Custom framework upload | Phase 3 deferred | LOW |
Part 3: Architecture
System Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│                                USER WORKFLOW                                │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐        │
│  │ Upload  │ → │Organize │ → │ Analyze │ → │ Review  │ → │ Report  │        │
│  │Evidence │   │Questions│   │         │   │Findings │   │         │        │
│  └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘        │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                               ANALYSIS ENGINE                               │
│                                                                             │
│  ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐         │
│  │ PHASE 1: ASSESS  │ → │ PHASE 2: ANALYZE │ → │ PHASE 3: VERIFY  │         │
│  │ (Fast, Cheap)    │   │ (Claude, Costly) │   │ (Async, Quality) │         │
│  │                  │   │                  │   │                  │         │
│  │ • Evidence count │   │ • Prompt select  │   │ • Faithfulness   │         │
│  │ • Gap detection  │   │ • Context build  │   │ • Coverage       │         │
│  │ • Coverage score │   │ • Claude call    │   │ • Confidence     │         │
│  │ • Proceed gate   │   │ • Output parse   │   │ • Audit log      │         │
│  └──────────────────┘   └──────────────────┘   └──────────────────┘         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                             SUPPORTING SYSTEMS                              │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ Embedding   │  │ Prompt      │  │ Quality     │  │ Audit       │         │
│  │ Pipeline    │  │ Manager     │  │ Metrics     │  │ Trail       │         │
│  │             │  │             │  │             │  │             │         │
│  │ • Chunk     │  │ • Version   │  │ • Calculate │  │ • Log all   │         │
│  │ • Embed     │  │ • A/B test  │  │ • Display   │  │ • Trace     │         │
│  │ • Search    │  │ • Evaluate  │  │ • Alert     │  │ • Export    │         │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Data Flow
Evidence Upload
│
▼
┌─────────────────┐
│ Text Extraction │ ← PDF, DOCX, images
└─────────────────┘
│
▼
┌─────────────────┐
│ Chunking │ ← ~500 token chunks
└─────────────────┘
│
▼
┌─────────────────┐
│ Embedding │ ← Titan embeddings → pgvector
└─────────────────┘
│
▼
Analysis Request
│
▼
┌─────────────────┐
│ Semantic Search │ ← Query embedding → similarity search
└─────────────────┘
│
▼
┌─────────────────┐
│ Context Builder │ ← Top chunks + manual links + background docs
└─────────────────┘
│
▼
┌─────────────────┐
│ Prompt Assembly │ ← Template + variables + framework
└─────────────────┘
│
▼
┌─────────────────┐
│ Claude (LLM) │ ← Bedrock API
└─────────────────┘
│
▼
┌─────────────────┐
│ Output Parsing │ ← JSON extraction + Zod validation
└─────────────────┘
│
▼
┌─────────────────┐
│ Quality Checks │ ← Faithfulness + Coverage + Confidence
└─────────────────┘
│
▼
Analysis Stored + Displayed
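The semantic search step in the flow above can be illustrated with cosine similarity over stored chunk embeddings. This is a minimal in-memory sketch, not the production pgvector query: `StoredChunk`, `cosineSimilarity`, and `topKChunks` are hypothetical names, and the three-dimensional vectors are toy data standing in for Titan embeddings.

```typescript
// Hypothetical in-memory stand-in for rows in evidence_chunk.
interface StoredChunk {
  id: string;
  text: string;
  embedding: number[];
}

interface ScoredChunk {
  chunk: StoredChunk;
  similarity: number;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions do not match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((ai, i) => {
    const bi = b[i] ?? 0;
    dot += ai * bi;
    normA += ai * ai;
    normB += bi * bi;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank chunks by similarity to the query embedding and keep the best k.
function topKChunks(query: number[], chunks: StoredChunk[], k: number): ScoredChunk[] {
  return chunks
    .map((chunk) => ({ chunk, similarity: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}

// Toy data for illustration only.
const toyChunks: StoredChunk[] = [
  { id: "c1", text: "Invoice dated March 3", embedding: [1, 0, 0] },
  { id: "c2", text: "Interview notes", embedding: [0, 1, 0] },
  { id: "c3", text: "Invoice approval email", embedding: [0.9, 0.1, 0] },
];

const retrievedChunks = topKChunks([1, 0, 0], toyChunks, 2);
console.log(retrievedChunks.map((r) => r.chunk.id)); // [ 'c1', 'c3' ]
```

The scored list is also the natural input for the retrieval audit log, since each entry already carries the similarity that justified its rank.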
Key Tables
| Table | Purpose |
|---|---|
| analysis | Analysis results, quality metrics, prompt traceability |
| evidence_chunk | Embedded text chunks for semantic search |
| analysis_evidence_retrieval | Audit log of what evidence was retrieved |
| prompt_template | Active prompts with versioning |
| prompt_template_history | Prompt version history for rollback |
| ai_usage | Token usage and cost tracking |
Part 4: Evolution Process
The Improvement Cycle
Every analysis system improvement follows this cycle:
     ┌─────────────────────────────────────────────────────────┐
     │                                                         │
     ▼                                                         │
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐      │
│ MEASURE  │ → │ ANALYZE  │ → │ IMPROVE  │ → │ VALIDATE │ ─────┘
│          │   │          │   │          │   │          │
│ • Metrics│   │ • Root   │   │ • Design │   │ • Test   │
│ • Logs   │   │   cause  │   │ • Build  │   │ • Deploy │
│ • User   │   │ • Pattern│   │ • Prompt │   │ • Monitor│
│   signal │   │   find   │   │   tune   │   │          │
└──────────┘   └──────────┘   └──────────┘   └──────────┘
Measurement Framework
Quantitative Metrics:
| Metric | Source | Target |
|---|---|---|
| Faithfulness Score | analysis.faithfulness_score | ≥ 90% |
| Coverage Score | analysis.coverage_score | ≥ 85% |
| Retrieval Quality | analysis.retrieval_stats | avg similarity ≥ 0.80 |
| Validation Pass Rate | analysis.validation_passed | ≥ 99% |
| Generation Success | analysis.generation_status | ≤ 1% failures |
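The targets in the table above lend themselves to an automated check. The sketch below assumes scores are stored as fractions between 0 and 1 and that an aggregate snapshot is available; `QualitySnapshot`, `QUALITY_TARGETS`, and `findMissedTargets` are hypothetical names, not existing code.

```typescript
// Hypothetical aggregate of the quantitative metrics; all values are fractions in [0, 1].
interface QualitySnapshot {
  faithfulnessScore: number;
  coverageScore: number;
  avgSimilarity: number;
  validationPassRate: number;
  generationFailureRate: number;
}

// Thresholds copied from the Quantitative Metrics table.
const QUALITY_TARGETS = {
  minFaithfulness: 0.9,
  minCoverage: 0.85,
  minAvgSimilarity: 0.8,
  minValidationPassRate: 0.99,
  maxGenerationFailureRate: 0.01,
};

// Returns the names of the metrics that miss their target.
function findMissedTargets(s: QualitySnapshot): string[] {
  const missed: string[] = [];
  if (s.faithfulnessScore < QUALITY_TARGETS.minFaithfulness) missed.push("faithfulness");
  if (s.coverageScore < QUALITY_TARGETS.minCoverage) missed.push("coverage");
  if (s.avgSimilarity < QUALITY_TARGETS.minAvgSimilarity) missed.push("retrieval_quality");
  if (s.validationPassRate < QUALITY_TARGETS.minValidationPassRate) missed.push("validation_pass_rate");
  if (s.generationFailureRate > QUALITY_TARGETS.maxGenerationFailureRate) missed.push("generation_success");
  return missed;
}

// Illustrative numbers, not real measurements.
const exampleSnapshot: QualitySnapshot = {
  faithfulnessScore: 0.88,
  coverageScore: 0.9,
  avgSimilarity: 0.82,
  validationPassRate: 0.995,
  generationFailureRate: 0.02,
};

console.log(findMissedTargets(exampleSnapshot)); // [ 'faithfulness', 'generation_success' ]
```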
User Signal Metrics:
| Signal | What It Means | How to Capture |
|---|---|---|
| Regenerate | User rejected output | analysis.iteration_number > 1 |
| Edit | User partially accepted | Track finding edits post-analysis |
| Accept | User trusted output | No regeneration and a finding status set |
| Direction provided | User guiding AI | analysis metadata |
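One way to turn the signals in this table into a single label per analysis is sketched below. The field names and the precedence order (regenerate, then edit, then accept) are assumptions made for illustration; only `iteration_number` is named in the table.

```typescript
type UserSignal = "regenerate" | "edit" | "accept" | "none";

// Hypothetical per-analysis outcome record.
interface AnalysisOutcome {
  iterationNumber: number;     // mirrors analysis.iteration_number
  findingSet: boolean;         // assumed flag: the user set a finding status
  findingEditedAfter: boolean; // assumed flag: the finding was edited post-analysis
}

// Assumed precedence: a regeneration outweighs an edit, which outweighs an accept.
function classifySignal(outcome: AnalysisOutcome): UserSignal {
  if (outcome.iterationNumber > 1) return "regenerate";
  if (outcome.findingEditedAfter) return "edit";
  if (outcome.findingSet) return "accept";
  return "none";
}

console.log(classifySignal({ iterationNumber: 1, findingSet: true, findingEditedAfter: false })); // accept
console.log(classifySignal({ iterationNumber: 3, findingSet: true, findingEditedAfter: false })); // regenerate
```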
Operational Metrics:
| Metric | Source | Target |
|---|---|---|
| Latency (P95) | CloudWatch | < 25s |
| Token Cost | ai_usage.cost_usd | Track trend |
| Error Rate | Sentry / logs | < 0.5% |
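For reference, a P95 figure such as the latency target above can be computed with the nearest-rank method. CloudWatch computes percentiles itself, so this sketch is only useful for offline checks against logged samples; `percentile` is a hypothetical helper and the sample values are invented.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("No samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

// Invented latencies in seconds, for illustration only.
const latenciesSeconds = [12, 14, 9, 30, 18, 22, 16, 11, 13, 15];
const p95LatencySeconds = percentile(latenciesSeconds, 95);

console.log(p95LatencySeconds);      // 30
console.log(p95LatencySeconds < 25); // false, so this sample set would miss the target
```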
Prompt Evolution Process
When to Revise a Prompt:
- Faithfulness score drops below threshold
- User regeneration rate exceeds 20%
- New analysis type needed
- Framework updates require alignment
- User feedback indicates systematic issues
Prompt Change Process:
1. DOCUMENT the issue
- What's failing? (examples)
- What metrics show it?
- What's the root cause hypothesis?
2. DESIGN the change
- Draft new prompt version
- Identify affected test fixtures
- Plan validation approach
3. TEST offline
- Run against test fixtures
- Compare output quality
- Check for regressions
4. DEPLOY as new version
- Migration creates new version
- Old version archived to history
- Traceability maintained
5. MONITOR in production
- Compare metrics before/after
- Watch for regressions
- Collect user signal
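Step 4 above (new version becomes active, old version moves to history, rollback stays possible) can be modeled in a few lines. This is an in-memory sketch under assumed names; the real system stores this in prompt_template and prompt_template_history, and `InMemoryPromptStore` and its methods are hypothetical.

```typescript
// Hypothetical shape of one prompt version.
interface PromptVersionRecord {
  key: string;     // assumed identifier, e.g. one per analysis type
  version: string;
  body: string;
}

class InMemoryPromptStore {
  private active = new Map<string, PromptVersionRecord>();
  private archived: PromptVersionRecord[] = [];

  // Deploy: the current version moves to the archive and the new one becomes active.
  publish(next: PromptVersionRecord): void {
    const current = this.active.get(next.key);
    if (current !== undefined) this.archived.push(current);
    this.active.set(next.key, next);
  }

  // Rollback: restore the most recently archived version and archive the one it replaces.
  rollback(key: string): PromptVersionRecord {
    const current = this.active.get(key);
    const previous = [...this.archived].reverse().find((v) => v.key === key);
    if (current === undefined || previous === undefined) {
      throw new Error(`No earlier version to roll back to for ${key}`);
    }
    this.archived = this.archived.filter((v) => v !== previous);
    this.archived.push(current);
    this.active.set(key, previous);
    return previous;
  }

  activeVersion(key: string): string | undefined {
    return this.active.get(key)?.version;
  }

  archivedVersions(key: string): string[] {
    return this.archived.filter((v) => v.key === key).map((v) => v.version);
  }
}

const promptStore = new InMemoryPromptStore();
promptStore.publish({ key: "question-analysis", version: "1.0.0", body: "First draft" });
promptStore.publish({ key: "question-analysis", version: "1.1.0", body: "Adds citation rules" });
console.log(promptStore.activeVersion("question-analysis"));    // 1.1.0
console.log(promptStore.archivedVersions("question-analysis")); // [ '1.0.0' ]
```

Because a rollback archives the version it replaces, no version is ever lost, which keeps every stored analysis traceable to the prompt that produced it.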
Prompt Version Naming:
- Major: Breaking change (output schema change)
- Minor: Significant improvement (new instructions)
- Patch: Wording refinement
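The naming rule maps directly onto a version bump helper. The sketch assumes versions are written as MAJOR.MINOR.PATCH strings, which this plan does not state explicitly; `bumpPromptVersion` is a hypothetical function.

```typescript
type ChangeKind = "major" | "minor" | "patch";

// Assumes a MAJOR.MINOR.PATCH string such as "1.4.2".
function bumpPromptVersion(current: string, change: ChangeKind): string {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(current);
  if (!match) throw new Error(`Not a MAJOR.MINOR.PATCH version: ${current}`);
  const major = Number(match[1]);
  const minor = Number(match[2]);
  const patch = Number(match[3]);
  switch (change) {
    case "major": // breaking change, such as an output schema change
      return `${major + 1}.0.0`;
    case "minor": // significant improvement, such as new instructions
      return `${major}.${minor + 1}.0`;
    case "patch": // wording refinement
      return `${major}.${minor}.${patch + 1}`;
  }
}

console.log(bumpPromptVersion("1.4.2", "major")); // 2.0.0
console.log(bumpPromptVersion("1.4.2", "minor")); // 1.5.0
console.log(bumpPromptVersion("1.4.2", "patch")); // 1.4.3
```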
Part 5: Workstreams
Workstream 1: Two-Phase Analysis (PRIORITY)
Problem: Users generate expensive narrative analysis without knowing if evidence is sufficient.
Solution: Split analysis into two phases:
Phase 1: Assessment (Fast, Cheap)
- Count evidence per question
- Run semantic search, compute coverage
- Identify obvious gaps
- Score readiness (0-100)
- Display to user with recommendations
- Gate: User decides whether to proceed
Phase 2: Narrative (User-Initiated)
- Only runs after user confirms
- Full Claude analysis
- Uses Phase 1 results as context
- Quality checks run async
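A Phase 1 readiness score could be computed without any LLM call, roughly as follows. The scoring formula, the weights, `MIN_EVIDENCE_PER_QUESTION`, and `PROCEED_THRESHOLD` are all assumptions chosen for illustration; the actual Phase 1 output schema is still to be designed (step 1 of the plan below).

```typescript
// Hypothetical per-question input to the assessment.
interface QuestionAssessment {
  questionId: string;
  evidenceCount: number; // evidence items available for the question
  avgSimilarity: number; // mean similarity of retrieved chunks, in [0, 1]
}

interface ReadinessResult {
  score: number;  // 0-100
  gaps: string[]; // IDs of questions with no evidence at all
  recommendation: "proceed" | "review_gaps";
}

// Assumed values, for illustration only.
const MIN_EVIDENCE_PER_QUESTION = 3;
const PROCEED_THRESHOLD = 70;

function assessReadiness(questions: QuestionAssessment[]): ReadinessResult {
  if (questions.length === 0) {
    return { score: 0, gaps: [], recommendation: "review_gaps" };
  }
  const gaps = questions.filter((q) => q.evidenceCount === 0).map((q) => q.questionId);
  // Each question contributes half for evidence volume and half for match quality.
  const perQuestion = questions.map((q) => {
    const countPart = Math.min(q.evidenceCount / MIN_EVIDENCE_PER_QUESTION, 1);
    const similarityPart = q.evidenceCount === 0 ? 0 : Math.max(0, Math.min(q.avgSimilarity, 1));
    return 0.5 * countPart + 0.5 * similarityPart;
  });
  const mean = perQuestion.reduce((sum, value) => sum + value, 0) / perQuestion.length;
  const score = Math.round(mean * 100);
  const recommendation = gaps.length === 0 && score >= PROCEED_THRESHOLD ? "proceed" : "review_gaps";
  return { score, gaps, recommendation };
}

const readiness = assessReadiness([
  { questionId: "q1", evidenceCount: 3, avgSimilarity: 0.8 },
  { questionId: "q2", evidenceCount: 0, avgSimilarity: 0 },
]);
console.log(readiness); // score 45, gaps [ 'q2' ], recommendation 'review_gaps'
```

The recommendation is advisory only. The gate remains a user decision, so the UI would show the score and gaps and still allow the user to proceed.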
Implementation Plan:
| Step | Task | Effort |
|---|---|---|
| 1 | Design Phase 1 output schema | S |
| 2 | Build assessment endpoint (no Claude) | M |
| 3 | Create assessment UI component | M |
| 4 | Add "Proceed to Analysis" gate | S |
| 5 | Refactor generate route for Phase 2 | M |
| 6 | Update analysis page flow | M |
| 7 | Test end-to-end | M |
Success Criteria:
- Phase 1 completes in < 3 seconds
- Users see gaps before committing
- At least 30% fewer analyses run on insufficient evidence
- No change to Phase 2 output quality
Workstream 2: Quality Metrics Integration
Problem: Faithfulness and coverage checks exist but aren't running in production (they were disabled because of timeouts).
Solution: Run quality checks asynchronously after analysis completes.
Implementation Plan:
| Step | Task | Effort |
|---|---|---|
| 1 | Create async quality check job | M |
| 2 | Update analysis status: complete → checking → verified | S |
| 3 | Run faithfulness checker async | M |
| 4 | Run coverage checker async | M |
| 5 | Update analysis record with results | S |
| 6 | Notify user when checks complete | S |
| 7 | Build quality metrics UI panel | M |
Success Criteria:
- Quality checks run on 100% of analyses
- Users see metrics within 60s of analysis
- No impact on analysis generation latency
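The status progression in step 2 (complete → checking → verified) can be guarded with a small transition table, with both checkers run concurrently after the analysis has already been returned to the user. This is a sketch under assumed names: `advanceStatus` and `runQualityChecks` are hypothetical, and the checker functions are passed in rather than imported because their real signatures are not given here.

```typescript
type QualityStatus = "complete" | "checking" | "verified";

// Only forward transitions are allowed.
const ALLOWED_TRANSITIONS: Record<QualityStatus, QualityStatus[]> = {
  complete: ["checking"],
  checking: ["verified"],
  verified: [],
};

function advanceStatus(from: QualityStatus, to: QualityStatus): QualityStatus {
  if (!ALLOWED_TRANSITIONS[from].includes(to)) {
    throw new Error(`Invalid status transition: ${from} -> ${to}`);
  }
  return to;
}

interface QualityScores {
  faithfulness: number;
  coverage: number;
}

// Runs after the analysis is stored, so generation latency is unaffected.
async function runQualityChecks(
  checkFaithfulness: () => Promise<number>,
  checkCoverage: () => Promise<number>,
  onStatus: (status: QualityStatus) => void,
): Promise<QualityScores> {
  let current: QualityStatus = "complete";
  current = advanceStatus(current, "checking");
  onStatus(current);
  // The two checks are independent, so they can run concurrently.
  const [faithfulness, coverage] = await Promise.all([checkFaithfulness(), checkCoverage()]);
  current = advanceStatus(current, "verified");
  onStatus(current);
  return { faithfulness, coverage };
}

// Stub checkers with invented scores, for illustration only.
runQualityChecks(
  async () => 0.93,
  async () => 0.88,
  (status) => console.log("status:", status),
).then((scores) => console.log(scores));
```

A production version would likely also need a failure state and a retry policy, which this plan does not yet define.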
Workstream 3: Retrieval Transparency
Problem: Users can't see what evidence the AI considered.
Solution: Expose evidence retrieval log to users.
Implementation Plan:
| Step | Task | Effort |
|---|---|---|
| 1 | Design "Evidence Considered" UI | S |
| 2 | Query analysis_evidence_retrieval for display | S |
| 3 | Show chunks with similarity scores | M |
| 4 | Highlight which chunks were included | S |
| 5 | Add "Why wasn't X included?" explainer | M |
Success Criteria:
- Users can see all retrieved evidence
- Similarity scores displayed
- Clear explanation of inclusion/exclusion
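The "Why wasn't X included?" explainer could be derived from the retrieval log itself. The sketch below assumes inclusion is decided by a similarity cutoff plus a cap on the number of chunks, with manually linked evidence always included; those rules, the field names, and `explainRetrieval` are assumptions, since the actual selection logic is not described in this plan.

```typescript
// Hypothetical view of one row in analysis_evidence_retrieval.
interface RetrievedChunk {
  chunkId: string;
  similarity: number;
  manuallyLinked: boolean;
}

interface ExplainedChunk extends RetrievedChunk {
  included: boolean;
  reason: string;
}

// Assumed rules: manual links always count; otherwise a cutoff and a top-N cap apply.
function explainRetrieval(
  chunks: RetrievedChunk[],
  minSimilarity: number,
  maxChunks: number,
): ExplainedChunk[] {
  const ranked = [...chunks].sort((a, b) => b.similarity - a.similarity);
  let usedSlots = 0;
  return ranked.map((chunk) => {
    if (chunk.manuallyLinked) {
      return { ...chunk, included: true, reason: "Manually linked by the user" };
    }
    if (chunk.similarity < minSimilarity) {
      const reason = `Similarity ${chunk.similarity.toFixed(2)} is below the ${minSimilarity.toFixed(2)} cutoff`;
      return { ...chunk, included: false, reason };
    }
    if (usedSlots >= maxChunks) {
      return { ...chunk, included: false, reason: `Ranked outside the top ${maxChunks} matches` };
    }
    usedSlots += 1;
    return { ...chunk, included: true, reason: `Similarity ${chunk.similarity.toFixed(2)} is within the top ${maxChunks} matches` };
  });
}

// Invented log rows, for illustration only.
const explainedChunks = explainRetrieval(
  [
    { chunkId: "a", similarity: 0.91, manuallyLinked: false },
    { chunkId: "b", similarity: 0.85, manuallyLinked: false },
    { chunkId: "c", similarity: 0.6, manuallyLinked: false },
    { chunkId: "d", similarity: 0.4, manuallyLinked: true },
  ],
  0.7,
  1,
);
for (const row of explainedChunks) {
  console.log(row.chunkId, row.included ? "included" : "excluded", "-", row.reason);
}
```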
Workstream 4: Prompt Evaluation Framework
Problem: Prompt changes are made ad hoc without systematic evaluation.
Solution: Build evaluation infrastructure for prompt quality.
Implementation Plan:
| Step | Task | Effort |
|---|---|---|
| 1 | Define evaluation metrics per prompt type | S |
| 2 | Create evaluation harness (run prompt against fixtures) | M |
| 3 | Implement LLM-as-judge for output quality | M |
| 4 | Build comparison report (version A vs B) | M |
| 5 | Add evaluation to prompt change process | S |
Success Criteria:
- Every prompt change has before/after metrics
- Regressions detected before deploy
- Evaluation runs in CI
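The version A vs B comparison report in step 4 reduces to a per-fixture diff with a regression flag. In the sketch below, scores are assumed to lie in [0, 1] (for example from an LLM-as-judge pass), the default tolerance of 0.02 is arbitrary, and `comparePromptVersions` and `hasRegression` are hypothetical names.

```typescript
// Hypothetical result of running one prompt version against one fixture.
interface FixtureScore {
  fixtureId: string;
  score: number; // assumed to be in [0, 1]
}

interface ComparisonRow {
  fixtureId: string;
  baseline: number;
  candidate: number;
  delta: number;
  regressed: boolean;
}

// Flags a fixture as regressed when the candidate drops by more than the tolerance.
function comparePromptVersions(
  baseline: FixtureScore[],
  candidate: FixtureScore[],
  tolerance = 0.02,
): ComparisonRow[] {
  const candidateById = new Map(candidate.map((c) => [c.fixtureId, c.score] as const));
  return baseline.map((b) => {
    const candidateScore = candidateById.get(b.fixtureId);
    if (candidateScore === undefined) {
      throw new Error(`Candidate run is missing fixture ${b.fixtureId}`);
    }
    const delta = candidateScore - b.score;
    return {
      fixtureId: b.fixtureId,
      baseline: b.score,
      candidate: candidateScore,
      delta,
      regressed: delta < -tolerance,
    };
  });
}

function hasRegression(rows: ComparisonRow[]): boolean {
  return rows.some((row) => row.regressed);
}

// Invented scores and fixture IDs, for illustration only.
const comparisonRows = comparePromptVersions(
  [
    { fixtureId: "basic-01", score: 0.8 },
    { fixtureId: "edge-03", score: 0.7 },
  ],
  [
    { fixtureId: "basic-01", score: 0.85 },
    { fixtureId: "edge-03", score: 0.6 },
  ],
);
console.log(hasRegression(comparisonRows)); // true, because edge-03 dropped by about 0.1
```

In CI, a true result from `hasRegression` would be the signal to fail the build before a prompt change is deployed.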
Workstream 5: User Feedback Loop
Problem: We don't capture whether users accept, edit, or reject AI output.
Solution: Track user actions post-analysis to learn from outcomes.
Implementation Plan:
| Step | Task | Effort |
|---|---|---|
| 1 | Add user_action tracking to analysis | S |
| 2 | Track: viewed, regenerated, edited, accepted | S |
| 3 | Log finding changes after analysis | S |
| 4 | Build feedback dashboard for analysis | M |
| 5 | Use signals to prioritize prompt improvements | - |
Success Criteria:
- Know acceptance rate per analysis type
- Know regeneration reasons
- Data-driven prompt priorities
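Once the four actions in step 2 are tracked, the acceptance rate per analysis type is a simple aggregation. The record shape and `acceptanceRateByType` are hypothetical, and the sketch assumes one final action is stored per analysis.

```typescript
type UserAction = "viewed" | "regenerated" | "edited" | "accepted";

// Hypothetical tracking record: one final action per analysis.
interface ActionRecord {
  analysisType: string;
  finalAction: UserAction;
}

// Fraction of analyses per type whose final action was "accepted".
function acceptanceRateByType(records: ActionRecord[]): Map<string, number> {
  const totals = new Map<string, { accepted: number; total: number }>();
  for (const record of records) {
    const entry = totals.get(record.analysisType) ?? { accepted: 0, total: 0 };
    entry.total += 1;
    if (record.finalAction === "accepted") entry.accepted += 1;
    totals.set(record.analysisType, entry);
  }
  const rates = new Map<string, number>();
  totals.forEach((value, analysisType) => rates.set(analysisType, value.accepted / value.total));
  return rates;
}

// Invented records, for illustration only.
const acceptanceRates = acceptanceRateByType([
  { analysisType: "Question", finalAction: "accepted" },
  { analysisType: "Question", finalAction: "accepted" },
  { analysisType: "Question", finalAction: "regenerated" },
  { analysisType: "Question", finalAction: "edited" },
  { analysisType: "Gap", finalAction: "accepted" },
]);
console.log(acceptanceRates.get("Question")); // 0.5
console.log(acceptanceRates.get("Gap"));      // 1
```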
Workstream 6: Output Types Expansion
Current: 5 analysis types, 7 report types
Future Considerations:
| Type | Purpose | Priority |
|---|---|---|
| Comparative Analysis | Compare evidence across time/subjects | POST-LAUNCH |
| Timeline Analysis | Chronological event reconstruction | POST-LAUNCH |
| Witness Credibility | Assess testimonial consistency | POST-LAUNCH |
| Risk Assessment | Prioritize issues by impact | POST-LAUNCH |
Process for Adding New Types:
- Define use case and user need
- Design output schema
- Create prompt with framework integration
- Add test fixtures
- Build UI components
- Test with real evidence
- Deploy with monitoring
Part 6: Priority Roadmap
Immediate (Next 2 Weeks)
| Initiative | Workstream | Impact |
|---|---|---|
| Two-Phase Analysis Design | WS1 | Architectural foundation |
| Async Quality Checks | WS2 | Re-enable disabled features |
| Quality Metrics UI | WS2 | User-facing trust indicators |
Near-Term (Next Month)
| Initiative | Workstream | Impact |
|---|---|---|
| Two-Phase Implementation | WS1 | Cost savings, better UX |
| Retrieval Transparency | WS3 | User trust |
| User Feedback Tracking | WS5 | Data for improvement |
Medium-Term (Next Quarter)
| Initiative | Workstream | Impact |
|---|---|---|
| Prompt Evaluation Framework | WS4 | Quality assurance |
| Prompt A/B Testing | WS4 | Data-driven optimization |
| Custom Framework Upload | Phase 3 | Enterprise feature |
Part 7: Reference Documents
This master plan consolidates and supersedes:
| Document | Status | Notes |
|---|---|---|
| docs/archive/prompt_engineering_tracker.md | ACTIVE | Prompt implementation status |
| docs/claude_ai_work/research/rag-quality-metrics-research.md | REFERENCE | Original research |
| docs/guide/workflow/analysis.md | ACTIVE | User documentation |
| docs/guide/features/quality-metrics.md | ACTIVE | Quality metrics docs |
| docs/reference/evaluation-framework.md | ACTIVE | Core framework (historical) |
| CLAUDE.md "AI Quality & Metrics" section | ACTIVE | Development guidelines |
Key Files
| File | Purpose |
|---|---|
| app/api/analysis/generate/route.ts | Main analysis generation |
| lib/ai/analysis-output-schema.ts | Output types |
| lib/ai/analysis-output-validators.ts | Zod validators |
| lib/ai/quality/*.ts | Quality metric calculators |
| lib/embeddings/retrieval.ts | Semantic search |
| lib/embeddings/pipeline.ts | Embedding processing |
| __tests__/fixtures/prompts/ | Test fixtures |
Part 8: Success Metrics
North Star
Analysis Acceptance Rate: the percentage of analyses where the user sets a finding status without regenerating
Target: 80%+ (users trust the output on first try)
Supporting Metrics
| Metric | Current | Target | Timeline |
|---|---|---|---|
| Faithfulness Score (avg) | Unknown (disabled) | ≥ 90% | Q2 |
| Coverage Score (avg) | Unknown (disabled) | ≥ 85% | Q2 |
| Phase 1 → Phase 2 Conversion | N/A | Track | Q2 |
| Analysis Regeneration Rate | Unknown | < 20% | Q2 |
| Quality Check Coverage | 0% | 100% | Immediate |
Changelog
| Date | Change | Author |
|---|---|---|
| 2026-02-06 | Initial creation | Claude Code |
This document is the single source of truth for analysis system evolution. Update it as decisions are made and progress occurs.