Evidence Retrieval Pipeline
Last Updated: March 27, 2026 (last reviewed 2026-04-29)
Created for: NQU-482
Overview
Nquiry uses a three-stage hybrid retrieval pipeline to find the most relevant evidence when generating AI analysis. This document explains what each stage does, how they work together, what users see, and how to interpret the results.
The Three Stages
Stage 1: Keyword Search
What it does: Scans all evidence text for exact word matches against the question.
How it works:
- Uses PostgreSQL tsvector/tsquery full-text search
- Finds evidence containing the same terms as the question
- Catches exact-match evidence that semantic search might miss
When it shines:
- Case numbers, policy codes, names, dates, identifiers
- Canonical example (NQU-380): DOC-198 was found exclusively by keyword search
- High-precision retrieval when users search for specific facts
Why it matters: Some evidence can't be found by meaning alone. A case number like "2024-REC-157" or a policy reference like "Section 3.2.1" must be matched exactly. Keyword search is the system's way of ensuring these facts aren't missed.
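The keyword stage can be sketched as a parameterized full-text query. This is an illustration only, not Nquiry's actual query: the table `evidence_chunk`, its columns `id`, `title`, and `search_vector`, and the helper `buildKeywordQuery` are hypothetical names, and pairing `plainto_tsquery` with `ts_rank` is an assumed (though common) PostgreSQL pattern.

```typescript
// Shape accepted by most PostgreSQL clients for parameterized queries.
interface SqlQuery {
  text: string;
  values: (string | number)[];
}

// Builds a ranked full-text query for a question.
// `evidence_chunk` and its columns are hypothetical names for illustration.
function buildKeywordQuery(question: string, limit = 20): SqlQuery {
  const trimmed = question.trim();
  if (trimmed.length === 0) {
    throw new Error("question must not be empty");
  }
  const text = [
    "SELECT id, title,",
    "       ts_rank(search_vector, plainto_tsquery('english', $1)) AS keyword_rank",
    "FROM evidence_chunk",
    "WHERE search_vector @@ plainto_tsquery('english', $1)",
    "ORDER BY keyword_rank DESC",
    "LIMIT $2",
  ].join("\n");
  // The question is passed as a bound parameter, never interpolated into SQL.
  return { text, values: [trimmed, limit] };
}
```

The `english` configuration stems words and drops stop words. For identifiers, a configuration that does not stem (such as `simple`) may be preferable; which one Nquiry uses is not stated here.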
Stage 2: Semantic Search (Vector Embeddings)
What it does: Converts questions and evidence into meaning vectors, then finds the most conceptually similar passages.
How it works:
- Uses Amazon Titan Text Embeddings V2 to generate embeddings, stored and searched with pgvector
- Converts both the question and evidence into dense numerical vectors representing meaning
- Compares vectors to find semantically similar evidence
- Vectors capture conceptual similarity even when words differ
When it shines:
- Related concepts expressed in different ways
- "What policies govern procurement?" matches evidence about "vendor selection procedures"
- Contextual understanding of concepts
- Evidence that's relevant but uses different terminology
Why it matters: Many questions are conceptual. When someone asks about "financial controls," they're looking for evidence about budgets, approvals, audits, and oversight—not just documents with those exact words. Semantic search bridges the language gap.
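To make "compares vectors" concrete, here is a minimal in-memory sketch of ranking passages by cosine similarity. It assumes cosine is the similarity measure (pgvector exposes cosine distance through its `<=>` operator, but the operator Nquiry uses is not stated here). The names `EmbeddedPassage`, `cosineSimilarity`, and `rankBySimilarity` are hypothetical, and real embeddings have far more than two dimensions.

```typescript
interface EmbeddedPassage {
  id: string;
  embedding: number[];
}

interface ScoredPassage {
  id: string;
  similarity: number;
}

// Cosine similarity: dot product divided by the product of vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / Math.sqrt(normA * normB);
}

// Scores every passage against the query vector and keeps the k closest.
function rankBySimilarity(
  query: number[],
  passages: EmbeddedPassage[],
  k: number
): ScoredPassage[] {
  return passages
    .map((p) => ({ id: p.id, similarity: cosineSimilarity(query, p.embedding) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}
```

In production the comparison would run inside the database against an index rather than over an in-memory array; the arithmetic is the same.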
Stage 3: Reranking
What it does: Re-evaluates and reorders the combined results from keyword and semantic search by reading both the question and evidence together.
How it works:
- Uses Cohere Rerank 3.5
- Reads the original question alongside each candidate evidence passage
- Scores relevance based on actual semantic fit, not just word overlap or vector similarity
- Reorders results: promotes keyword-surfaced evidence that semantic search missed, demotes false positives
When it shines:
- Filtering out misleading matches (passages containing the right words but in wrong context)
- Promoting evidence that's conceptually relevant but didn't rank high in stages 1 or 2
- Final quality control before evidence reaches the AI
Why it matters: The first two stages are broad-brush tools. Keyword search finds all mentions of a term; semantic search finds all related concepts. Reranking is the intelligent filter that says, "These results are actually relevant to answering this specific question."
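The reordering step can be sketched independently of the model that produces the scores. In the sketch below the scorer is injected as a plain function, so no reranking API is assumed; `Candidate`, `RelevanceScorer`, and `rerank` are hypothetical names, and the real system would obtain scores from the reranking model rather than a local function.

```typescript
interface Candidate {
  id: string;
  text: string;
}

interface RerankedCandidate extends Candidate {
  relevance: number;
  previousRank: number; // 1-based position before reranking
}

// Scores one (question, passage) pair. A stand-in for the reranking model.
type RelevanceScorer = (question: string, passage: string) => number;

// Scores each candidate against the question, then reorders by relevance.
// Ties keep their earlier relative order.
function rerank(
  question: string,
  candidates: Candidate[],
  score: RelevanceScorer,
  topK: number
): RerankedCandidate[] {
  return candidates
    .map((c, index) => ({
      ...c,
      relevance: score(question, c.text),
      previousRank: index + 1,
    }))
    .sort((a, b) => b.relevance - a.relevance || a.previousRank - b.previousRank)
    .slice(0, topK);
}
```

Keeping `previousRank` on each result makes it possible to tell afterwards whether the reranker moved a passage up or down.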
How the Stages Work Together
Team-Draft Interleaving
Results from keyword and vector search are merged using team-draft interleaving (NQU-462):
- Alternates between picking the next best result from keyword search and semantic search
- Replaces the old ad-hoc "guaranteed keyword top-20" slot reservation
- Scales naturally without hardcoded slot counts
- Ensures both retrieval approaches contribute balanced results
Effect: If keyword search ranks a document #1 and semantic search ranks it #4, the team-draft merge might place it at position #1 or #2, depending on the other results. This keeps keyword-surfaced evidence from being buried when semantic search ranks other passages above it.
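The merge can be sketched as follows. This is a simplified, deterministic variant written for illustration, not the code in retrieval.ts: classic team-draft interleaving flips a coin to decide which team picks first in each round, whereas this sketch always lets keyword search pick first on ties (an assumption), and it does not apply any keyword boost. The names `RankedItem`, `MergedItem`, and `teamDraftInterleave` are hypothetical.

```typescript
interface RankedItem {
  id: string;
}

type Source = "keyword" | "vector";

interface MergedItem {
  id: string;
  pickedBy: Source;
}

// Each "team" takes turns adding its best result not already in the merged list.
function teamDraftInterleave(
  keyword: RankedItem[],
  vector: RankedItem[],
  limit: number
): MergedItem[] {
  const lists = { keyword, vector };
  const cursors = { keyword: 0, vector: 0 };
  const picks = { keyword: 0, vector: 0 };
  const seen = new Set<string>();
  const merged: MergedItem[] = [];

  // Returns the list's next item that has not been merged yet, if any.
  const nextUnseen = (source: Source): RankedItem | undefined => {
    const list = lists[source];
    while (cursors[source] < list.length && seen.has(list[cursors[source]].id)) {
      cursors[source]++;
    }
    return list[cursors[source]];
  };

  while (merged.length < limit) {
    // The team with fewer picks goes next; ties go to keyword (assumption).
    const first: Source = picks.keyword <= picks.vector ? "keyword" : "vector";
    const second: Source = first === "keyword" ? "vector" : "keyword";
    let source = first;
    let item = nextUnseen(first);
    if (item === undefined) {
      // The preferred team is exhausted, so the other team picks instead.
      source = second;
      item = nextUnseen(second);
    }
    if (item === undefined) break; // both lists exhausted
    seen.add(item.id);
    picks[source]++;
    merged.push({ id: item.id, pickedBy: source });
  }
  return merged;
}
```

With keyword results `[A, B, C]` and vector results `[B, D, A]`, the merged order is `A, B, C, D`: each passage appears once, and the two stages contribute two picks each.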
Query-Adaptive Weighting
The system automatically classifies queries and adjusts keyword weight (NQU-462):
| Query Type | Keyword Boost | Detection |
|---|---|---|
| Entity-heavy | 2.5x | Proper nouns, dates, IDs, capitalized multi-word sequences |
| Hybrid | 1.5x | Mix of specific and conceptual terms |
| Conceptual | 1.0x | Abstract queries, no specific identifiers |
Example:
- "What is the procurement policy?" → Conceptual (1.0x boost)
- "Did Jane Smith approve contract 2024-B-781?" → Entity-heavy (2.5x boost)
- "How did the vendor selection process work?" → Hybrid (1.5x boost)
Why it works: The system recognizes that identifiers such as "Jane Smith" and "2024-B-781" are poorly captured by semantic vectors but are critically important to the question. It automatically prioritizes keyword search for these queries without requiring user configuration.
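A rough sketch of how such a classifier could work is shown below. The boost values come from the table above; everything else is an assumption. The two regular expressions and the "count the signals" rule are hypothetical heuristics, not the logic in queryClassifier.ts, and they are too coarse to reproduce every example above (the vendor-selection question has no identifiers, so this sketch labels it conceptual rather than hybrid).

```typescript
type QueryType = "entity-heavy" | "hybrid" | "conceptual";

// Boost values taken from the Query-Adaptive Weighting table.
const KEYWORD_BOOST: Record<QueryType, number> = {
  "entity-heavy": 2.5,
  hybrid: 1.5,
  conceptual: 1.0,
};

// Hypothetical signal: hyphenated token containing a digit, e.g. 2024-B-781.
const IDENTIFIER = /\b(?=[A-Z0-9-]*\d)[A-Z0-9]+(?:-[A-Z0-9]+)+\b/;

// Hypothetical signal: two or more consecutive capitalized words, e.g. Jane Smith.
const PROPER_NOUN_SEQUENCE = /\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b/;

// Counts how many entity signals fire and maps the count to a query type.
function classifyQuery(query: string): QueryType {
  const signals = [IDENTIFIER, PROPER_NOUN_SEQUENCE].filter((re) =>
    re.test(query)
  ).length;
  if (signals >= 2) return "entity-heavy";
  if (signals === 1) return "hybrid";
  return "conceptual";
}

// Multiplier applied to keyword-search scores for this query.
function keywordBoost(query: string): number {
  return KEYWORD_BOOST[classifyQuery(query)];
}
```

Under these heuristics, "Did Jane Smith approve contract 2024-B-781?" fires both signals and gets the 2.5x boost, while "What is the procurement policy?" fires none and stays at 1.0x.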
Continuous Feedback Loop
After each analysis generation, the system measures retrieval quality (NQU-462):
- Compares what evidence was retrieved vs. what the AI actually cited
- Stores results in the retrieval_benchmark_result table
- Tracks retrieval metrics over time:
- R@5, R@10, R@20 (recall at different depths)
- MRR (mean reciprocal rank)
- NDCG@20 (normalized discounted cumulative gain)
Baseline metrics (Davenheim, 5 queries):
- R@5 = 0.166, R@10 = 0.283, R@20 = 0.323
- MRR = 0.400, NDCG@20 = 0.362
This data informs future improvements to the retrieval pipeline.
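For readers unfamiliar with these metrics, the sketch below shows the standard definitions for a single query, using binary relevance. It assumes a known set of relevant evidence IDs per query; how Nquiry derives that set and how it aggregates across queries is not specified here. All function names are hypothetical, and retrieved IDs are assumed to be unique.

```typescript
// Recall@k: share of the relevant evidence that appears in the top k results.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// Reciprocal rank: 1 / position of the first relevant result, or 0 if none.
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  for (let i = 0; i < retrieved.length; i++) {
    if (relevant.has(retrieved[i])) return 1 / (i + 1);
  }
  return 0;
}

// MRR: mean of the reciprocal ranks over a batch of queries.
function meanReciprocalRank(
  runs: { retrieved: string[]; relevant: Set<string> }[]
): number {
  if (runs.length === 0) return 0;
  const total = runs.reduce(
    (sum, r) => sum + reciprocalRank(r.retrieved, r.relevant),
    0
  );
  return total / runs.length;
}

// NDCG@k with binary gains: relevant = 1, everything else = 0.
function ndcgAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const discount = (rank: number) => 1 / Math.log2(rank + 1);
  let dcg = 0;
  retrieved.slice(0, k).forEach((id, i) => {
    if (relevant.has(id)) dcg += discount(i + 1);
  });
  // Ideal ordering puts every relevant item first.
  let idcg = 0;
  for (let rank = 1; rank <= Math.min(relevant.size, k); rank++) {
    idcg += discount(rank);
  }
  return idcg === 0 ? 0 : dcg / idcg;
}
```

For example, if the retrieved order is `a, b, c, d` and the relevant set is `{b, d}`, then R@2 is 0.5, the reciprocal rank is 0.5, and NDCG@4 is about 0.651.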
What Users See: Evidence Considered Panel
When reviewing AI analysis, users see the Evidence Considered panel (NQU-486, redesigned 2026-03-26), which shows:
Header
- Total number of evidence passages retrieved and used
Evidence List
Each passage includes:
- Evidence title and source
- Plain-language retrieval story: "Found by: Keyword + Vector | Rerank confirmed"
- Examples: "Keyword only", "Vector only", "Both | Rerank demoted", "Both | Rerank promoted"
- Similarity score (visible behind info toggle for advanced users)
Why the Plain-Language Story Matters
Rather than showing raw pipeline diagnostics, the Evidence Considered panel tells a story:
"Keyword + Vector | Rerank confirmed"
- This passage matched on keywords AND semantic similarity
- The reranker agreed it was relevant
- High confidence that this evidence should be here
"Keyword only | Rerank promoted"
- Only keyword search found this
- Semantic search didn't catch it
- The reranker elevated it anyway
- Example: Case numbers, specific policy codes
"Vector only | Rerank demoted"
- Semantic search found this
- Keyword search didn't
- The reranker decided it wasn't actually relevant
- The AI chose not to use it in the final analysis
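The labels above could be derived from a small trace record per passage, as in this sketch. The `RetrievalTrace` shape and the rule for deciding promoted, demoted, or confirmed (comparing a passage's position before and after reranking) are assumptions made for illustration; the panel's actual logic may differ. The sketch uses the "Keyword + Vector" wording rather than "Both".

```typescript
type RerankOutcome = "confirmed" | "promoted" | "demoted";

// Hypothetical per-passage trace; field names are illustrative.
interface RetrievalTrace {
  foundByKeyword: boolean;
  foundByVector: boolean;
  rankBeforeRerank: number; // 1-based position in the merged list
  rankAfterRerank: number; // 1-based position after reranking
}

// Assumed rule: moving up is a promotion, moving down a demotion.
function rerankOutcome(trace: RetrievalTrace): RerankOutcome {
  if (trace.rankAfterRerank < trace.rankBeforeRerank) return "promoted";
  if (trace.rankAfterRerank > trace.rankBeforeRerank) return "demoted";
  return "confirmed";
}

// Builds the plain-language story shown next to each passage.
function retrievalStory(trace: RetrievalTrace): string {
  if (!trace.foundByKeyword && !trace.foundByVector) {
    throw new Error("a passage must be found by at least one stage");
  }
  const foundBy =
    trace.foundByKeyword && trace.foundByVector
      ? "Keyword + Vector"
      : trace.foundByKeyword
      ? "Keyword only"
      : "Vector only";
  return `${foundBy} | Rerank ${rerankOutcome(trace)}`;
}
```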
Understanding Similarity Scores
What the Scores Measure
Similarity scores range from 0.0 to 1.0 and represent how closely an evidence passage matches the question in vector space:
| Score Range | Interpretation |
|---|---|
| 0.85–1.0 | Excellent semantic match |
| 0.70–0.85 | Good semantic match |
| 0.50–0.70 | Partial semantic match |
| < 0.50 | Weak semantic match |
Why "Low" Scores Don't Mean Bad Retrieval
Important: A low similarity score does not mean the evidence is irrelevant.
Reasons a low-scoring passage can be critical:
- Keyword-found evidence: Passages with exact-match terms (case numbers, proper nouns) may have low semantic scores because they're highly specific. The reranker confirms they're relevant.
- Terminology mismatch: Evidence using technical jargon or domain-specific language may have lower semantic similarity to a plain-language question, but still be the most relevant source.
- Foundational evidence: Passages that establish facts needed to answer the question may be semantically distant but logically essential.
Example:
- Question: "Did the vendor meet the service level agreement?"
- Evidence passage: "SLA ID-2024-B: Response time < 24 hours, uptime > 99.5%"
- Similarity score: 0.58 (relatively low)
- Relevance: Essential (defines what the SLA requires)
- Rerank decision: Confirmed
- Why score is low: The passage is highly technical and specific, not a semantic match to the plain-language question. But it's exactly what's needed to evaluate the SLA.
What This Means for Users
When reviewing the Evidence Considered panel:
- Don't assume low scores mean poor retrieval. Trust the reranker's decision.
- Look at the retrieval story, not just the score. "Keyword + Vector | Rerank confirmed" is more meaningful than a number.
- High-scoring passages aren't always the most useful. A 0.92 semantic match might be thematically related but not answer the specific question. A keyword-found passage scoring 0.45 might be exactly what's needed.
Quality Assurance & Metrics
Retrieval Quality Assessment
Each analysis includes evidence retrieval metrics to help you assess whether the system found good evidence:
| Rating | Threshold | Interpretation |
|---|---|---|
| Strong Retrieval | Average similarity > 0.85 | Most passages are excellent semantic matches |
| Moderate Retrieval | Average similarity 0.70–0.85 | Mix of good and partial matches |
| Weak Retrieval | Average similarity < 0.70 | Many passages are peripheral to the question |
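The thresholds in the table translate directly into a small rating function. This is a sketch with hypothetical names (`averageSimilarity`, `rateRetrieval`). It assumes the rating is based on a plain arithmetic mean of the passages' similarity scores, and it treats exactly 0.85 and exactly 0.70 as Moderate, following the table's "> 0.85" and "< 0.70" bounds.

```typescript
type RetrievalRating = "Strong" | "Moderate" | "Weak";

// Arithmetic mean of the similarity scores for one analysis.
function averageSimilarity(scores: number[]): number {
  if (scores.length === 0) {
    throw new Error("at least one similarity score is required");
  }
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Maps the average to the rating bands in the table above.
function rateRetrieval(scores: number[]): RetrievalRating {
  const avg = averageSimilarity(scores);
  if (avg > 0.85) return "Strong";
  if (avg >= 0.7) return "Moderate";
  return "Weak";
}
```

Because the rating is an average, a single low-scoring but essential keyword match pulls it down without making the retrieval worse, which is why the rating should be read alongside the retrieval story.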
When Retrieval Quality Matters
- Strong retrieval + high faithfulness = high confidence in the analysis
- Weak retrieval + high faithfulness = AI worked with limited evidence; conclusions are still sound but rely on fewer sources
- Weak retrieval + low faithfulness = potential gap between what evidence exists and what the AI found; consider adding more evidence and regenerating
See AI Quality Metrics for full details on faithfulness, coverage, and overall confidence levels.
Technical Reference for CC
Core Components
| Component | Technology / Location | Purpose |
|---|---|---|
| Keyword search | PostgreSQL tsvector/tsquery | Full-text search on evidence text |
| Vector embeddings | pgvector + Amazon Titan Text Embeddings V2 | Semantic search via dense vectors |
| Reranking | Cohere Rerank 3.5 API | Re-evaluation of combined results |
| Team-draft merge | retrieval.ts | Alternating result merge |
| Query classification | queryClassifier.ts | Detect entity-heavy vs. conceptual queries |
| Feedback loop | retrieval_benchmark_result table | Track retrieval quality over time |
Key Files for Reference
- evidence-management.md — Evidence storage, indexing, and processing
- ai-analysis.md — Full AI analysis workflow, including "Evidence Retrieval" section
- ai-quality-metrics.md — Detailed metrics for evaluating analysis quality
- evidence_evaluation_framework.md — The 10-criterion framework AI uses to assess evidence quality
Implementation Notes
Evidence Processing Pipeline:
- Evidence text is extracted from PDFs, Word documents, and Excel files
- Text is chunked into smaller segments
- Chunks are converted to vector embeddings (async, triggered on upload)
- Full text remains searchable for keyword queries
- Processing status shown in analysis generation feedback
Chunking Strategy:
- Large documents (> 10K words) are split into overlapping chunks
- Overlap preserves context at chunk boundaries
- Chunk size tuned to balance context and specificity
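Overlapping chunking can be illustrated with a word-based sliding window. The sketch below is an assumption about the general technique, not the production chunker: `chunkWords` is a hypothetical helper, splitting on whitespace is a simplification, and the actual chunk size and overlap are not given in this document.

```typescript
// Splits text into chunks of `chunkSize` words, where each chunk repeats
// the last `overlap` words of the previous one.
function chunkWords(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far the window advances each time
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    // Stop once the window has reached the end of the text.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}
```

With a chunk size of 3 and an overlap of 1, the text "a b c d e" becomes two chunks, "a b c" and "c d e". The shared word "c" is what preserves context across the boundary.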
Reranking Workflow:
- Keyword search returns ranked list
- Semantic search returns ranked list
- Team-draft merge interleaves both lists
- Reranker reads each passage + original question
- Reranker returns relevance scores and reordered results
- Top-K passages assembled for AI prompt
User Guidance: When Retrieval Might Need Improvement
Signs of Weak Retrieval
- Analysis has a low "Retrieval Quality" metric (< 0.70 average similarity)
- Relevant evidence is missing from Evidence Considered panel
- Faithfulness score is low despite adding more evidence
- AI mentions gaps that shouldn't exist given available evidence
How to Improve Retrieval
- Add more evidence in the relevant topic area
- Wait for embeddings to process — vector search requires embeddings on all evidence
- Rephrase the question to match evidence terminology:
- If evidence says "vendor selection process," ask "How was the vendor selected?"
- If evidence uses technical jargon, use the same terms in questions
- Link evidence to questions during collection — context helps retrieval
- Regenerate with Request Revision:
- Select "New evidence added"
- Explain what evidence was added
- Let the system retrieve again with fresh data
FAQ
Q: Why is a passage with a 0.45 similarity score included in my analysis?
A: Keyword search found it (exact-match terms), and the reranker confirmed it's relevant to your specific question. Low semantic similarity doesn't mean low relevance—it often means the passage is highly technical or specific to your topic. The system is working as designed.
Q: Can I manually adjust which evidence is retrieved?
A: Not directly. The pipeline is designed to surface what's most relevant. If evidence isn't being found, add more related evidence and regenerate. If the AI isn't using retrieved evidence, check the AI's reasoning in the analysis output.
Q: What happens if I have very little evidence?
A: The system retrieves whatever is available and generates analysis with "Insufficient" confidence. The Evidence Considered panel shows what it found; gaps are flagged in the analysis. Use Gap Analysis to identify what's missing, then collect additional evidence.
Q: Do I need to configure keyword weighting or reranking manually?
A: No. Query-adaptive weighting is automatic based on the question content. Reranking thresholds are set by default. If retrieval isn't working well, focus on evidence quality and collection rather than tuning the pipeline.
Q: How often is the retrieval pipeline updated?
A: The three-stage approach is stable. Improvements (team-draft interleaving, query-adaptive weighting) are deployed as incremental enhancements. Baseline metrics are monitored continuously via the feedback loop.
Version History
- 2026-03-27: Initial document created (NQU-482). Covers three-stage pipeline, team-draft interleaving, query-adaptive weighting, continuous feedback, Evidence Considered panel, similarity scores, and user guidance.
- 2026-03-06: Three-stage pipeline first documented in evidence-management.md (NQU-379).
- 2026-03-27: Major pipeline improvements deployed: team-draft interleaving, query-adaptive weighting, retrieval feedback loop (NQU-462). Evidence Considered panel redesigned (NQU-486).
See Also
- Evidence Management — Uploading, organizing, and processing evidence
- AI Analysis — Full AI analysis workflow
- AI Quality Metrics — Interpreting analysis quality scores
- Evidence Evaluation Framework — The 10-criterion framework for evidence quality