Analysis System Features Guide
Last Updated: 2026-03-04
Implemented by: Analysis System Master Plan (6 workstreams)
This document describes the analysis system features that were built, where they appear in the UI, and how to use them. For the strategic vision and architecture, see docs/reference/architecture/analysis-system.md.
Feature 1: Evidence Readiness Assessment (Two-Phase Analysis)
What it does: Before running an AI analysis, the system performs a fast evidence check that shows you what evidence is available. It completes in under 100ms with no AI cost.
Where it appears: In the Generate Analysis dialog on the Analysis page.
How to use it:
- Navigate to your investigation's Analysis page
- Click Generate Analysis
- Select your analysis type (Question, Topic, Summary, Gap, or Error Check)
- If applicable, select the specific question or topic
- Click Generate Analysis — the evidence summary appears automatically
- You'll see:
  - Evidence Items: Total count of evidence in the investigation
  - Linked to Question: How many evidence items are linked to the selected question (for question analyses)
  - Background Docs: Count of background documents, with indicator of whether AI inclusion is enabled
  - Source Type Breakdown: Distribution across content, attachments, and background docs
  - Informational note (blue): Shown when no evidence is directly linked to the question; evidence will still be searched via semantic retrieval
  - Blocking message (red): Shown only when zero evidence exists in the investigation
- Click "Generate Analysis" to proceed
Why it matters: Gives investigators a quick snapshot of what evidence the AI will work with, without false precision from similarity scoring. The previous similarity-based readiness scoring was removed because abstract investigation questions ("Was there fraud?") consistently scored low against concrete evidence (witness statements, financial records), producing misleading "Insufficient" warnings that investigators learned to ignore.
API endpoint: POST /api/analysis/assess
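The blue/red messaging rules described above can be sketched as a small pure function. This is a minimal illustration, not the actual route handler: the `EvidenceSummary` shape, the `assessReadiness` name, and the message strings are all hypothetical, and it assumes that "zero evidence" refers to the evidence item count only (background docs are not counted).

```typescript
// Hypothetical shape of the counts shown in the dialog; field names are illustrative.
interface EvidenceSummary {
  evidenceItems: number;    // total evidence items in the investigation
  linkedToQuestion: number; // items linked to the selected question
  backgroundDocs: number;   // background documents (assumed not to count as evidence here)
}

type Readiness =
  | { level: "blocked"; message: string } // red blocking message
  | { level: "info"; message: string }    // blue informational note
  | { level: "ok" };

function assessReadiness(summary: EvidenceSummary, isQuestionAnalysis: boolean): Readiness {
  // Block only when the investigation has no evidence at all.
  if (summary.evidenceItems === 0) {
    return { level: "blocked", message: "No evidence exists in this investigation." };
  }
  // Inform, but do not block, when nothing is linked to the selected question.
  if (isQuestionAnalysis && summary.linkedToQuestion === 0) {
    return {
      level: "info",
      message: "No evidence is directly linked; evidence will still be searched via semantic retrieval.",
    };
  }
  return { level: "ok" };
}
```

Because the check is pure counting with no similarity scoring, it can stay fast and never produces an "Insufficient" warning for abstract questions.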
Feature 2: Async Quality Checks (Faithfulness & Coverage)
What it does: After an analysis is generated, the system automatically runs two quality checks in the background:
- Faithfulness Check: Verifies that claims in the analysis are supported by the evidence
- Coverage Check: Verifies that the analysis addresses all aspects of the question/topic
These run asynchronously and don't block the initial analysis result.
Where it appears: The analysis detail view in the Analysis page — look for the quality metrics badge and expandable panel.
How it works:
- Generate an analysis (any type) — the initial request returns immediately and generation runs in the background
- The UI polls for completion (every 3 seconds, up to 5 minutes)
- Once the analysis is generated, quality checks start running automatically (status transitions: complete → checking → verified)
- Within ~30-60 seconds, quality scores appear:
  - Quality Confidence Badge in the analysis list: "established", "probable", "possible", "insufficient"
  - Quality Metrics Panel (expandable) showing:
    - Faithfulness score (0-100%)
    - Coverage score (0-100%)
    - Evidence coverage stats (items considered vs. total)
    - Retrieval quality stats
    - Validation status
Quality confidence scoring: The confidence level is computed primarily from faithfulness and coverage scores. Retrieval similarity acts as a softening cap (can lower the level by at most one step) but cannot drag the score to "insufficient" on its own. This means an analysis with strong faithfulness and coverage will show at least "possible" even with low retrieval similarity.
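The softening-cap rule can be illustrated with a short sketch. Everything numeric here is an assumption made for illustration: the thresholds (0.85 / 0.7 / 0.5), the use of the lower of the two scores as the base, and the 0.6 retrieval cutoff are not taken from lib/ai/quality/confidence-calculator.ts. Only the shape of the rule (base level from faithfulness and coverage, retrieval lowers it by at most one step and never to "insufficient") follows the description above.

```typescript
const LEVELS = ["insufficient", "possible", "probable", "established"] as const;
type Confidence = (typeof LEVELS)[number];

// Base level from faithfulness and coverage (scores in 0..1).
// Assumption: the weaker of the two scores decides; thresholds are hypothetical.
function baseLevel(faithfulness: number, coverage: number): number {
  const score = Math.min(faithfulness, coverage);
  if (score >= 0.85) return 3; // established
  if (score >= 0.7) return 2;  // probable
  if (score >= 0.5) return 1;  // possible
  return 0;                    // insufficient
}

function qualityConfidence(
  faithfulness: number,
  coverage: number,
  retrievalSimilarity: number,
): Confidence {
  const base = baseLevel(faithfulness, coverage);
  // Faithfulness and coverage alone can produce "insufficient".
  if (base === 0) return LEVELS[0];
  // Low retrieval similarity lowers the level by at most one step...
  const lowered = retrievalSimilarity < 0.6 ? base - 1 : base;
  // ...but never all the way down to "insufficient".
  return LEVELS[Math.max(lowered, 1)];
}
```

Under these assumed thresholds, an analysis scoring 0.95 on faithfulness and 0.9 on coverage is "established" with good retrieval and "probable" with poor retrieval.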
If quality checks fail: The analysis stays at "complete" status — quality never degrades. You can manually trigger a re-check via the quality-check endpoint.
API endpoints:
- POST /api/analysis/{analysis_id}/quality-check: manually trigger quality checks
- GET /api/analysis/{analysis_id}/quality-status: poll for quality check progress
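A client-side polling loop for the status endpoint might look like the sketch below. It is a hedged illustration: `pollUntilVerified` is a hypothetical helper, the status fetcher is injected rather than tied to a real HTTP call, and reusing the 3-second / 5-minute cadence for quality polling is an assumption (the guide states that cadence only for generation polling).

```typescript
type QualityStatus = "complete" | "checking" | "verified";

// Polls until the status is "verified" or the time budget is used up.
// Returns the last status seen, so a failed check simply leaves "complete".
async function pollUntilVerified(
  fetchStatus: () => Promise<QualityStatus>,
  intervalMs = 3_000,
  timeoutMs = 5 * 60_000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<QualityStatus> {
  const maxAttempts = Math.floor(timeoutMs / intervalMs);
  let last: QualityStatus = "complete";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    last = await fetchStatus();
    if (last === "verified") return last;
    await sleep(intervalMs);
  }
  return last;
}
```

In a real component, `fetchStatus` would wrap a request to the quality-status endpoint and map its response to one of the three statuses; the response shape is not documented here, so that mapping is left out.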
Feature 3: Evidence Retrieval Transparency
What it does: Shows you exactly which evidence chunks the AI considered when generating an analysis, including similarity scores and whether each chunk was included or excluded.
Where it appears: In the analysis detail view, as an expandable "Evidence Considered" panel between "Evidence Cited" and "Analysis Feedback."
How to use it:
- Generate an analysis, then click on it to expand
- Look for the "Evidence Considered" section (loads on demand when you expand it)
- You'll see each evidence chunk with:
  - Rank: Order by relevance
  - Source Title: Which evidence item it came from (resolved from parent evidence item title)
  - Type Badge: "content", "attachment", or "background_doc"
  - Included/Excluded Badge: Whether the chunk was used in the analysis context
  - Similarity Bar: Color-coded relevance score:
    - Green (85%+): Highly relevant
    - Yellow (70-84%): Relevant
    - Orange (60-69%): Somewhat relevant
    - Gray (<60%): Low relevance
  - Exclusion Reason: If excluded, explains why (e.g., "Below similarity threshold")
- Summary stats show total chunks retrieved, included, and excluded
Why it matters: Answers the key trust question: "What evidence did the AI actually look at?" If you see important evidence was excluded, you may want to re-run with different parameters.
API endpoint: GET /api/analysis/{analysis_id}/retrieval
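The similarity bar color bands listed above map naturally onto a small lookup function. This is a sketch under two assumptions: similarity arrives as a 0-1 float, and it is rounded to a whole percent before banding. The `similarityColor` name is hypothetical and not taken from evidence-retrieval-panel.tsx.

```typescript
type BarColor = "green" | "yellow" | "orange" | "gray";

function similarityColor(similarity: number): BarColor {
  // Round to a whole percent so that 0.7 (70.00000000000001 as a float) lands on 70.
  const pct = Math.round(similarity * 100);
  if (pct >= 85) return "green";  // highly relevant
  if (pct >= 70) return "yellow"; // relevant
  if (pct >= 60) return "orange"; // somewhat relevant
  return "gray";                  // low relevance
}
```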
Feature 4: Prompt Evaluation Framework (Developer Tool)
What it does: A CLI tool for systematically testing prompt quality against fixtures. Detects regressions when prompts are changed.
Where it appears: Command line only — not in the UI.
How to use it:
# Run all fixtures in offline mode (no LLM judge, fast)
npx tsx scripts/evaluate-prompts.ts --offline
# Run with LLM-as-judge scoring (requires AI credits)
npx tsx scripts/evaluate-prompts.ts
# Filter by prompt type
npx tsx scripts/evaluate-prompts.ts --offline --prompt-type analysis_question
# Filter by category
npx tsx scripts/evaluate-prompts.ts --offline --category adversarial
# Compare against a baseline
npx tsx scripts/evaluate-prompts.ts --offline --compare __tests__/evaluation-results/baseline.json
# Save results with version label
npx tsx scripts/evaluate-prompts.ts --offline --version v2.1 --output baseline-v2.1.json
What it checks:
- Structural metrics (no LLM needed): JSON parseability, schema validation, citation counts, confidence levels, evidence assessment counts
- Content metrics (LLM judge): Relevance, completeness, accuracy, professional tone
- Expectation checks: Per-fixture pass/fail based on expected outcomes
- Regression detection: Compares two runs and flags any metric that dropped >5%
Test fixtures location: __tests__/fixtures/prompts/{basic,edge,adversarial}/
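The regression check can be sketched as a comparison of two metric maps. This is an illustration only: it assumes each run reduces to a flat map of metric name to score, and it reads ">5%" as a relative drop from the baseline; the actual CLI may use absolute percentage points or a different result format. `findRegressions` is a hypothetical name.

```typescript
type MetricRun = Record<string, number>;

interface Regression {
  metric: string;
  baseline: number;
  current: number;
  dropPct: number; // relative drop from baseline, in percent
}

function findRegressions(
  baseline: MetricRun,
  current: MetricRun,
  thresholdPct = 5,
): Regression[] {
  const regressions: Regression[] = [];
  for (const [metric, base] of Object.entries(baseline)) {
    const now = current[metric];
    // Skip metrics missing from the new run or with a non-positive baseline.
    if (now === undefined || base <= 0) continue;
    const dropPct = ((base - now) / base) * 100;
    if (dropPct > thresholdPct) {
      regressions.push({ metric, baseline: base, current: now, dropPct });
    }
  }
  return regressions;
}
```

For example, a relevance score falling from 0.8 to 0.7 is a 12.5% relative drop and would be flagged, while completeness moving from 0.9 to 0.88 would not.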
Feature 5: Engagement Gates + Conclusion-Based Feedback
What it does: Ensures investigators engage with the analysis before recording their judgment. Uses a gated flow that requires reviewing evidence before providing feedback.
Where it appears: In the analysis detail view, within the conclusion section (purple box).
Engagement gates (must be completed in order):
- Gate A (Detail Expansion): Judgment controls remain disabled until the investigator expands the analysis detail section at least once. A detail_reviewed_at timestamp is persisted.
- Gate B (Citation Spot-Check): After Gate A, judgment remains disabled until the investigator opens at least one citation in the Evidence Side Panel. A citation_checked_at timestamp is persisted.
Gate progress survives page refresh and session changes.
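The gate ordering can be modeled as a tiny state machine over the two persisted timestamps. The sketch below is an assumption-laden illustration: the function names are hypothetical, persistence is left out, and it assumes that opening a citation before Gate A does not count toward Gate B.

```typescript
interface EngagementState {
  detail_reviewed_at: string | null;  // set when Gate A is passed
  citation_checked_at: string | null; // set when Gate B is passed
}

const initialState: EngagementState = {
  detail_reviewed_at: null,
  citation_checked_at: null,
};

// Gate A: the first expansion of the detail section records a timestamp.
function onDetailExpanded(state: EngagementState, now: Date): EngagementState {
  return state.detail_reviewed_at
    ? state
    : { ...state, detail_reviewed_at: now.toISOString() };
}

// Gate B: only counts once Gate A has been passed (assumed ordering rule).
function onCitationOpened(state: EngagementState, now: Date): EngagementState {
  if (!state.detail_reviewed_at || state.citation_checked_at) return state;
  return { ...state, citation_checked_at: now.toISOString() };
}

// Judgment controls are enabled only when both gates are complete.
function judgmentEnabled(state: EngagementState): boolean {
  return state.detail_reviewed_at !== null && state.citation_checked_at !== null;
}
```

Keeping the state as plain timestamps is what lets gate progress survive a refresh: reloading the stored values reproduces the same enabled or disabled result.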
How to use it:
- Open an analysis in the detail view
- Expand the detail section (Gate A unlocks)
- Click on at least one cited evidence item to view it (Gate B unlocks)
- The conclusion section now shows three options:
  - Agree: marks this analysis as trustworthy (maps to accepted)
  - Disagree: prompts for a reason (maps to rejected)
  - Unsure: prompts "What would help you decide?" and pre-populates regeneration feedback (maps to needs_revision)
- To regenerate: Select a reason from the dropdown (Inaccurate conclusions, Missing evidence, Too vague, Wrong focus, Other), add direction text, click "Regenerate"
- Each action is tracked with a timestamp and shown as a badge in the analysis list
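The mapping from the three conclusion buttons to stored statuses can be expressed as a lookup table. The status values and regeneration reasons come from the steps above; the payload shape (`action`, `note`, `recorded_at`) and the `toFeedbackPayload` helper are hypothetical and may not match what the feedback endpoint actually expects.

```typescript
type Conclusion = "agree" | "disagree" | "unsure";
type FeedbackStatus = "accepted" | "rejected" | "needs_revision";

const CONCLUSION_TO_STATUS: Record<Conclusion, FeedbackStatus> = {
  agree: "accepted",
  disagree: "rejected",
  unsure: "needs_revision",
};

// Reasons offered in the regeneration dropdown.
const REGENERATION_REASONS = [
  "Inaccurate conclusions",
  "Missing evidence",
  "Too vague",
  "Wrong focus",
  "Other",
] as const;
type RegenerationReason = (typeof REGENERATION_REASONS)[number];

// Hypothetical payload shape for recording a conclusion.
interface FeedbackPayload {
  action: FeedbackStatus;
  note?: string;       // reason for disagreeing, or "what would help you decide?"
  recorded_at: string; // ISO timestamp
}

function toFeedbackPayload(conclusion: Conclusion, note: string | undefined, now: Date): FeedbackPayload {
  const payload: FeedbackPayload = {
    action: CONCLUSION_TO_STATUS[conclusion],
    recorded_at: now.toISOString(),
  };
  const trimmed = note?.trim();
  if (trimmed) payload.note = trimmed;
  return payload;
}
```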
Feedback status badges appear in the analysis list:
- Green "Agreed"
- Red "Disagreed"
- Blue "Regenerated"
- Amber "Edited"
- Purple "Needs Revision"
- Gray "Viewed"
Metrics dashboard: GET /api/analysis/feedback-metrics?investigation_id={id} returns:
- Acceptance/rejection/regeneration/edit rates
- Breakdown by analysis type
- Top regeneration reasons
- Weekly trend (8 weeks)
API endpoints:
- POST /api/analysis/{analysis_id}/feedback: record an action
- GET /api/analysis/{analysis_id}/feedback: get feedback history
Feature 6: Prompt Editor Version History, Diff & Rollback
What it does: The admin prompt editor now includes full version management — view any historical version of a prompt template, compare it side-by-side with the current version, and roll back to a previous version.
Where it appears: Admin → Prompt Templates → Version History panel for any prompt.
How to use it:
- Navigate to Admin → Prompt Templates
- Select a prompt template
- The Version History list shows up to 20 versions
- View: Click any version number to see its complete system prompt and user prompt template
- Diff: Click "Compare with Current" to see a side-by-side diff with red/green line-level highlighting
- Rollback: Click "Revert to vN" and confirm. This creates a new version with the historical content (non-destructive — the rollback itself is tracked in history)
API endpoints:
- GET /api/admin/prompts/history?prompt_type=...&version=N: retrieve a historical version
- POST /api/admin/prompts/history: execute rollback (creates new version with historical content)
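The non-destructive rollback described above amounts to appending a copy of the historical content as a new version. The following is a minimal in-memory sketch; the `PromptVersion` fields, the `restoredFrom` marker, and the `rollbackTo` name are assumptions for illustration, not the schema used by the history route.

```typescript
// Hypothetical version record; field names are illustrative.
interface PromptVersion {
  version: number;
  systemPrompt: string;
  userPromptTemplate: string;
  restoredFrom?: number; // set when this version was created by a rollback
}

// Returns a new history with the target's content appended as the latest version.
// Existing entries are never modified or removed.
function rollbackTo(history: PromptVersion[], targetVersion: number): PromptVersion[] {
  const target = history.find((v) => v.version === targetVersion);
  if (!target) throw new Error(`Version ${targetVersion} not found`);
  const latest = Math.max(...history.map((v) => v.version));
  const restored: PromptVersion = {
    version: latest + 1,
    systemPrompt: target.systemPrompt,
    userPromptTemplate: target.userPromptTemplate,
    restoredFrom: targetVersion,
  };
  return [...history, restored];
}
```

Because the rollback is itself a new version, reverting a mistaken rollback is just another rollback to the version that preceded it.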
Feature 7: AI Provider Routing (Admin Toggle)
What it does: Allows switching between AWS Bedrock and Anthropic Direct API for AI operations via an admin toggle. No redeployment required — the setting is database-backed with a 30-second cache TTL.
Where it appears: Admin → Settings.
How to use it:
- Navigate to Admin → Settings
- The current AI provider is shown as a card ("AWS Bedrock" or "Anthropic Direct")
- Click to toggle between providers
- The switch takes effect within 30 seconds for all new AI operations
- If the ANTHROPIC_API_KEY environment variable is not set, the toggle to Anthropic will show a validation error
Why it exists: Bedrock quota limits can bottleneck throughput (e.g., 2 RPM on Sonnet). The Anthropic direct API offers higher rate limits. This toggle allows switching without code changes or redeployment.
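The "takes effect within 30 seconds" behavior follows from reading the setting through a short-lived cache. Here is a sketch under stated assumptions: `createProviderReader` and `validateProviderChange` are hypothetical names, the database read is injected as a function, and the clock is injectable so the TTL can be exercised without waiting.

```typescript
type Provider = "bedrock" | "anthropic";

// Wraps a database read in a cache with a 30-second TTL.
function createProviderReader(
  load: () => Promise<Provider>,
  ttlMs = 30_000,
  now: () => number = Date.now,
) {
  let cached: { value: Provider; fetchedAt: number } | null = null;
  return async function getProvider(): Promise<Provider> {
    const t = now();
    // Serve the cached value while it is younger than the TTL.
    if (cached && t - cached.fetchedAt < ttlMs) return cached.value;
    cached = { value: await load(), fetchedAt: t };
    return cached.value;
  };
}

// Returns an error message when the switch should be refused, otherwise null.
function validateProviderChange(
  next: Provider,
  env: Record<string, string | undefined>,
): string | null {
  if (next === "anthropic" && !env.ANTHROPIC_API_KEY) {
    return "ANTHROPIC_API_KEY is not set";
  }
  return null;
}
```

A toggle made at time zero is therefore seen by a given server process no later than 30 seconds afterwards, when its cached entry expires.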
Technical details:
- Separate rate limiters per provider (Bedrock: 1 concurrency, Anthropic: 5 concurrency)
- All changes are audit-logged
- Setting stored in the app_settings table as key/value JSONB
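Separate per-provider rate limiting can be approximated with one concurrency limiter per provider. This sketch is not the limiter in lib/ai/client.ts: `createLimiter` is a hypothetical helper that caps in-flight calls only (it does not model requests per minute), using the concurrency values listed above.

```typescript
// A minimal concurrency limiter: at most `maxConcurrent` tasks run at once.
function createLimiter(maxConcurrent: number) {
  let active = 0;
  const waiting: Array<() => void> = [];
  return async function run<T>(task: () => Promise<T>): Promise<T> {
    if (active >= maxConcurrent) {
      // Wait until a finishing task hands its slot over to us.
      await new Promise<void>((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) next(); // pass the slot directly to the next waiter
      else active--;    // otherwise free the slot
    }
  };
}

// One limiter per provider, using the concurrency values from this guide.
const limiters = {
  bedrock: createLimiter(1),
  anthropic: createLimiter(5),
};
```

Handing the slot directly to the next waiter, rather than decrementing and letting callers race for it, keeps the in-flight count from briefly exceeding the cap.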
API endpoint: GET/PUT /api/admin/settings
Feature 8: Future Output Types (Design Only)
What was created: Design specifications and test fixtures for 4 future analysis types. These are NOT implemented yet — they're ready for when the team decides to build them.
Future types:
| Type | Purpose | Use Case |
|---|---|---|
| Comparative Analysis | Compare evidence across subjects/time periods | Procurement investigations with multiple vendors |
| Timeline Analysis | Reconstruct chronological events | Cybersecurity incidents, fraud timelines |
| Witness Credibility | Assess testimonial consistency | Interview-heavy investigations |
| Risk Assessment | Prioritize issues by impact/likelihood | Compliance audits, HIPAA reviews |
Documentation: docs/reference/future-analysis-types.md
Implementation guide: docs/reference/adding-analysis-types.md
Test fixtures: __tests__/fixtures/prompts/future/f001-f004
Testing Checklist
Use the "Dr. Marcus Chen - Professional Conduct Review" or "The Disappearance of Mr. Davenheim" investigation to test these features.
Feature 1: Evidence Assessment
- Open Analysis page → click "Generate Analysis"
- Select "Question Analysis" → pick a question
- Click "Generate Analysis" — evidence count summary should appear
- Verify: evidence items count, linked to question count, background docs count
- Verify: blue info note appears if no evidence is directly linked
- Click "Generate Analysis" to proceed
- Repeat for Summary type (no question/topic selection needed)
Feature 2: Quality Checks
- Generate an analysis (any type) — should return immediately and poll for completion
- Wait for analysis to complete (~2-3 minutes), then quality checks run
- Look for quality confidence badge in the analysis list (established/probable/possible)
- Click analysis → check for Quality Metrics Panel (expandable) with faithfulness, coverage, evidence coverage
Feature 3: Retrieval Transparency
- Click on any completed analysis
- Look for "Evidence Considered" expandable section
- Verify: chunk list with source titles (not "Unknown Source"), similarity bars, included/excluded badges
Feature 5: Engagement Gates + Feedback
- Click on a completed analysis
- Verify judgment buttons are disabled
- Expand the detail section (Gate A)
- Click a cited evidence item to open it (Gate B)
- Verify Agree/Disagree/Unsure buttons are now enabled in the conclusion section
- Click "Agree" — verify green badge appears
- On another analysis, click "Disagree" — verify prompt for reason, red badge
- Try "Unsure" — verify it pre-populates regeneration feedback
Feature 6: Prompt Version History
- Admin → Prompt Templates → select a prompt with multiple versions
- Click a version number — verify full content view
- Click "Compare with Current" — verify diff with red/green highlighting
- Click "Revert to vN" — verify new version created with historical content
Feature 7: AI Provider Toggle
- Admin → Settings → verify current provider card shown
- Toggle provider → run analysis → verify it completes
- Toggle back → run analysis → verify still works
Key Files Reference
| File | Purpose |
|---|---|
| app/api/analysis/assess/route.ts | Evidence count assessment endpoint |
| app/inquiries/.../evidence-assessment.tsx | Assessment UI component |
| lib/ai/quality/run-quality-checks.ts | Async quality check orchestrator |
| lib/ai/quality/confidence-calculator.ts | Quality confidence scoring algorithm |
| app/api/analysis/{id}/quality-check/route.ts | Manual quality check trigger |
| app/api/analysis/{id}/quality-status/route.ts | Quality check polling |
| app/api/analysis/{id}/retrieval/route.ts | Retrieval transparency data |
| app/inquiries/.../evidence-retrieval-panel.tsx | Retrieval UI component |
| lib/ai/evaluation/ | Prompt evaluation framework |
| scripts/evaluate-prompts.ts | Evaluation CLI runner |
| app/api/analysis/{id}/feedback/route.ts | User feedback tracking |
| hooks/use-analysis-feedback.ts | React feedback hook |
| app/api/admin/prompts/history/route.ts | Prompt version history & rollback |
| app/admin/prompts/prompt-editor.tsx | Prompt editor with diff/rollback UI |
| lib/ai/client.ts | Dual AI provider routing |
| lib/settings/index.ts | App settings DAL (provider toggle) |
| app/api/admin/settings/route.ts | Admin settings API |
| app/admin/settings/page.tsx | AI provider toggle UI |
| components/file-viewer.tsx | In-app file viewer component |
| app/api/storage/view/route.ts | File viewer S3 proxy endpoint |
| app/api/background/extract-text/route.ts | Background doc text extraction |