Evidence Retrieval Pipeline

Last Updated: March 27, 2026 (last reviewed 2026-04-29)
Created for: NQU-482

Overview

Nquiry uses a three-stage hybrid retrieval pipeline to find the most relevant evidence when generating AI analysis. This document explains what each stage does, how they work together, what users see, and how to interpret the results.

The Three Stages

Stage 1: Keyword Search

What it does: Scans all evidence text for exact word matches against the question.

How it works:

Uses PostgreSQL tsvector/tsquery full-text search
Finds evidence containing the same terms as the question
Catches exact-match evidence that semantic search might miss

When it shines:

Case numbers, policy codes, names, dates, identifiers
Canonical example: in NQU-380, DOC-198 was found exclusively by keyword search
High-precision retrieval when users search for specific facts

Why it matters: Some evidence can't be found by meaning alone. A case number like "2024-REC-157" or a policy reference like "Section 3.2.1" must be matched exactly. Keyword search is the system's way of ensuring these facts aren't missed.
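As a rough illustration of the tsvector/tsquery approach described above, the sketch below builds a parameterized full-text query. The table name `evidence_chunk` and column `search_vector` are hypothetical placeholders, not the actual Nquiry schema, and the `{ text, values }` shape is simply one convenient way to hand a parameterized query to a PostgreSQL client.

```typescript
// Hypothetical schema: evidence_chunk(id, search_vector tsvector).
// Adjust names to the real tables before using anything like this.
interface SqlQuery {
  text: string;
  values: Array<string | number>;
}

function buildKeywordQuery(question: string, limit: number): SqlQuery {
  const text = [
    "SELECT id, ts_rank(search_vector, query) AS rank",
    // plainto_tsquery normalizes the question into a tsquery.
    "FROM evidence_chunk, plainto_tsquery('english', $1) AS query",
    // @@ is PostgreSQL's full-text match operator.
    "WHERE search_vector @@ query",
    "ORDER BY rank DESC",
    "LIMIT $2",
  ].join("\n");
  // The question is passed as a bound parameter, never concatenated into SQL.
  return { text, values: [question, limit] };
}
```

Note that `plainto_tsquery` requires every term in the question to match, so a production query may need a looser construction for long natural-language questions. That choice is not specified in this document.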

Stage 2: Semantic Search (Vector Embeddings)

What it does: Converts questions and evidence into meaning vectors, then finds the most conceptually similar passages.

How it works:

Generates embeddings with Amazon Titan Text Embeddings V2 and stores and searches them with pgvector
Converts both the question and evidence into dense numerical vectors representing meaning
Compares vectors to find semantically similar evidence
Vectors capture conceptual similarity even when words differ

When it shines:

Related concepts expressed in different ways
"What policies govern procurement?" matches evidence about "vendor selection procedures"
Contextual understanding of concepts
Evidence that's relevant but uses different terminology

Why it matters: Many questions are conceptual. When someone asks about "financial controls," they're looking for evidence about budgets, approvals, audits, and oversight—not just documents with those exact words. Semantic search bridges the language gap.
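To make "compares vectors" concrete, here is a minimal sketch that ranks passages by cosine similarity. The choice of cosine similarity is an assumption (this document does not state which distance measure is used), and the `EmbeddedPassage` type and function names are illustrative only. In practice the comparison would run inside the database rather than in application code.

```typescript
// Illustrative type; real embeddings have many more dimensions.
interface EmbeddedPassage {
  id: string;
  embedding: number[];
}

// Cosine similarity: dot product divided by the product of vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as having no similarity.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns passages ordered from most to least similar to the question.
function rankBySimilarity(
  queryEmbedding: number[],
  passages: EmbeddedPassage[]
): Array<{ id: string; similarity: number }> {
  return passages
    .map((p) => ({ id: p.id, similarity: cosineSimilarity(queryEmbedding, p.embedding) }))
    .sort((x, y) => y.similarity - x.similarity);
}
```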

Stage 3: Reranking

What it does: Re-evaluates and reorders the combined results from keyword and semantic search by reading both the question and evidence together.

How it works:

Uses Cohere Rerank 3.5
Reads the original question alongside each candidate evidence passage
Scores relevance based on actual semantic fit, not just word overlap or vector similarity
Reorders results: promotes keyword-surfaced evidence that semantic search missed, demotes false positives

When it shines:

Filtering out misleading matches (passages containing the right words but in wrong context)
Promoting evidence that's conceptually relevant but didn't rank high in stages 1 or 2
Final quality control before evidence reaches the AI

Why it matters: The first two stages are broad-brush tools. Keyword search finds all mentions of a term; semantic search finds all related concepts. Reranking is the intelligent filter that says, "These results are actually relevant to answering this specific question."

How the Stages Work Together

Team-Draft Interleaving

Results from keyword and vector search are merged using team-draft interleaving (NQU-462):

Alternates between picking the next best result from keyword search and semantic search
Replaces the old ad-hoc "guaranteed keyword top-20" slot reservation
Scales naturally without hardcoded slot counts
Ensures both retrieval approaches contribute balanced results

Effect: If keyword search ranks a document #1 and semantic search ranks it #4, the team-draft merge places it at position #1 or #2, depending on the other results. This keeps a top keyword result from being buried when semantic search ranks it lower.
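The alternating merge can be sketched as follows. This is a simplified illustration, not the code in `retrieval.ts`: it always lets keyword search pick first and alternates deterministically, whereas some formulations of team-draft interleaving randomize which list picks first in each round. The function name and the use of plain string IDs are assumptions.

```typescript
// Merge two ranked lists by alternating picks, skipping duplicates.
function teamDraftInterleave(
  keywordIds: string[],
  semanticIds: string[],
  limit: number
): string[] {
  const teams = [keywordIds, semanticIds];
  const cursors = [0, 0];
  const seen = new Set<string>();
  const merged: string[] = [];
  let turn = 0; // 0 = keyword picks, 1 = semantic picks
  let idleTurns = 0; // consecutive turns where the picking list was exhausted

  while (merged.length < limit && idleTurns < 2) {
    const list = teams[turn];
    // Skip results that have already been drafted.
    while (cursors[turn] < list.length && seen.has(list[cursors[turn]])) {
      cursors[turn]++;
    }
    if (cursors[turn] < list.length) {
      const id = list[cursors[turn]];
      seen.add(id);
      merged.push(id);
      cursors[turn]++;
      idleTurns = 0;
    } else {
      idleTurns++;
    }
    turn = 1 - turn;
  }
  return merged;
}
```

With keyword results `["A", "B", "C"]` and semantic results `["B", "D", "A", "E"]`, the merge yields `["A", "B", "C", "D", "E"]`: each list contributes its best remaining result in turn, and no slot counts are hardcoded.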

Query-Adaptive Weighting

The system automatically classifies queries and adjusts keyword weight (NQU-462):

Query Type	Keyword Boost	Detection
Entity-heavy	2.5x	Proper nouns, dates, IDs, capitalized multi-word sequences
Hybrid	1.5x	Mix of specific and conceptual terms
Conceptual	1.0x	Abstract queries, no specific identifiers

Example:

"What is the procurement policy?" → Conceptual (1.0x boost)
"Did Jane Smith approve contract 2024-B-781?" → Entity-heavy (2.5x boost)
"How did the vendor selection process work?" → Hybrid (1.5x boost)

Why it works: Identifiers such as "Jane Smith" and "2024-B-781" are poorly captured by embedding vectors, yet they are critical to the question. The system automatically prioritizes keyword search for these queries without requiring user configuration.
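A toy version of the classification step might look like the sketch below. The boost values come from the table above; everything else (the regular expressions, the signal-count thresholds, and the function names) is an assumption made for illustration and does not describe `queryClassifier.ts`. The production classifier evidently uses richer signals: it labels "How did the vendor selection process work?" as Hybrid, which this simple signal count would not.

```typescript
type QueryType = "entity-heavy" | "hybrid" | "conceptual";

// Boost values taken from the Query Type table.
const KEYWORD_BOOST: Record<QueryType, number> = {
  "entity-heavy": 2.5,
  hybrid: 1.5,
  conceptual: 1.0,
};

// Counts crude entity signals in a query (assumed heuristic).
function countEntitySignals(query: string): number {
  // Tokens containing a digit, such as dates and IDs like "2024-B-781".
  const idLike = query.match(/\b[\w-]*\d[\w-]*\b/g) || [];
  // Capitalized multi-word sequences, such as "Jane Smith".
  const properNouns = query.match(/\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b/g) || [];
  return idLike.length + properNouns.length;
}

// Assumed thresholds: two or more signals is entity-heavy, one is hybrid.
function classifyQuery(query: string): { type: QueryType; keywordBoost: number } {
  const signals = countEntitySignals(query);
  const type: QueryType =
    signals >= 2 ? "entity-heavy" : signals === 1 ? "hybrid" : "conceptual";
  return { type, keywordBoost: KEYWORD_BOOST[type] };
}
```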

Continuous Feedback Loop

After each analysis generation, the system measures retrieval quality (NQU-462):

Compares what evidence was retrieved vs. what the AI actually cited
Stores results in retrieval_benchmark_result table
Tracks retrieval metrics over time:
- R@5, R@10, R@20 (recall at different depths)
- MRR (mean reciprocal rank)
- NDCG@20 (normalized discounted cumulative gain)

Baseline metrics (Davenheim, 5 queries):

R@5 = 0.166, R@10 = 0.283, R@20 = 0.323
MRR = 0.400, NDCG@20 = 0.362

This data informs future improvements to the retrieval pipeline.
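For readers unfamiliar with these metrics, the sketch below computes them for a single query, treating a passage as relevant if the AI cited it. That binary definition of relevance is an assumption based on the "retrieved vs. cited" comparison above, and the function names are illustrative. MRR is the mean of the per-query reciprocal rank across all benchmark queries.

```typescript
// Fraction of cited passages that appear in the top k retrieved results.
function recallAtK(retrieved: string[], cited: string[], k: number): number {
  if (cited.length === 0) return 0;
  const topK = retrieved.slice(0, k);
  const hits = cited.filter((id) => topK.indexOf(id) !== -1).length;
  return hits / cited.length;
}

// 1 / rank of the first cited passage, or 0 if none was retrieved.
function reciprocalRank(retrieved: string[], cited: string[]): number {
  for (let i = 0; i < retrieved.length; i++) {
    if (cited.indexOf(retrieved[i]) !== -1) return 1 / (i + 1);
  }
  return 0;
}

// NDCG with binary relevance: cited = 1, not cited = 0.
function ndcgAtK(retrieved: string[], cited: string[], k: number): number {
  let dcg = 0;
  retrieved.slice(0, k).forEach((id, i) => {
    // The result at 0-based index i is discounted by log2(i + 2).
    if (cited.indexOf(id) !== -1) dcg += 1 / Math.log2(i + 2);
  });
  // Ideal ordering places every cited passage at the top.
  let idealDcg = 0;
  const idealHits = Math.min(cited.length, k);
  for (let i = 0; i < idealHits; i++) idealDcg += 1 / Math.log2(i + 2);
  return idealDcg === 0 ? 0 : dcg / idealDcg;
}
```

For example, if the retrieved order is `["a", "b", "c", "d"]` and the AI cited `b` and `d`, then R@2 is 0.5 and the reciprocal rank is 0.5, because the first cited passage sits at position 2.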

What Users See: Evidence Considered Panel

When reviewing AI analysis, users see the Evidence Considered panel (NQU-486, redesigned 2026-03-26), which shows:

Header

Total number of evidence passages retrieved and used

Evidence List

Each passage includes:

Evidence title and source
Plain-language retrieval story: "Found by: Keyword + Vector | Rerank confirmed"
- Examples: "Keyword only", "Vector only", "Both | Rerank demoted", "Both | Rerank promoted"
Similarity score (visible behind info toggle for advanced users)

Why the Plain-Language Story Matters

Rather than showing raw pipeline diagnostics, the Evidence Considered panel tells a story:

"Keyword + Vector | Rerank confirmed"

This passage matched on keywords AND semantic similarity
The reranker agreed it was relevant
High confidence that this evidence should be here

"Keyword only | Rerank promoted"

Only keyword search found this
Semantic search didn't catch it
The reranker elevated it anyway
Example: Case numbers, specific policy codes

"Vector only | Rerank demoted"

Semantic search found this
Keyword search didn't
The reranker decided it wasn't actually relevant
The AI chose not to use it in the final analysis
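One way to derive such a label from pipeline data is sketched below. The `RetrievalTrace` shape is hypothetical, and so is the rule for choosing between "promoted", "demoted", and "confirmed": here it is based purely on whether the passage moved up, moved down, or stayed in place after reranking. The actual panel may use a different rule, such as a relevance threshold.

```typescript
// Hypothetical per-passage diagnostics; field names are assumptions.
interface RetrievalTrace {
  foundByKeyword: boolean;
  foundByVector: boolean;
  rankBeforeRerank: number; // 1-based position in the merged list
  rankAfterRerank: number; // 1-based position after reranking
}

function retrievalStory(trace: RetrievalTrace): string {
  if (!trace.foundByKeyword && !trace.foundByVector) {
    throw new Error("A retrieved passage must come from at least one search stage");
  }
  const source =
    trace.foundByKeyword && trace.foundByVector
      ? "Keyword + Vector"
      : trace.foundByKeyword
      ? "Keyword only"
      : "Vector only";
  // Assumed rule: a smaller rank number after reranking means promotion.
  const verdict =
    trace.rankAfterRerank < trace.rankBeforeRerank
      ? "promoted"
      : trace.rankAfterRerank > trace.rankBeforeRerank
      ? "demoted"
      : "confirmed";
  return `${source} | Rerank ${verdict}`;
}
```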

Understanding Similarity Scores

What the Scores Measure

Similarity scores range from 0.0 to 1.0 and represent how closely an evidence passage matches the question in vector space:

Score Range	Interpretation
0.85–1.0	Excellent semantic match
0.70–0.85	Good semantic match
0.50–0.70	Partial semantic match
< 0.50	Weak semantic match
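The table can be expressed as a small lookup. Because the ranges above share their boundary values (0.85 and 0.70 each appear in two rows), this sketch assumes the lower bound of each range is inclusive. The function name is illustrative.

```typescript
// Maps a similarity score to the interpretation in the table above.
// Assumption: boundary values belong to the higher band.
function interpretSimilarity(score: number): string {
  if (score < 0 || score > 1) {
    throw new Error("Similarity scores are expected to fall between 0.0 and 1.0");
  }
  if (score >= 0.85) return "Excellent semantic match";
  if (score >= 0.7) return "Good semantic match";
  if (score >= 0.5) return "Partial semantic match";
  return "Weak semantic match";
}
```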

Why "Low" Scores Don't Mean Bad Retrieval

Important: A low similarity score does not mean the evidence is irrelevant.

Reasons a low-scoring passage can be critical:

Keyword-found evidence: Passages with exact-match terms (case numbers, proper nouns) may have low semantic scores because they're highly specific. The reranker confirms they're relevant.
Terminology mismatch: Evidence using technical jargon or domain-specific language may have lower semantic similarity to a plain-language question, but still be the most relevant source.
Foundational evidence: Passages that establish facts needed to answer the question may be semantically distant but logically essential.

Example:

Question: "Did the vendor meet the service level agreement?"
Evidence passage: "SLA ID-2024-B: Response time < 24 hours, uptime > 99.5%"
Similarity score: 0.58 (relatively low)
Relevance: Essential (defines what the SLA requires)
Rerank decision: Confirmed
Why score is low: The passage is highly technical and specific, not a semantic match to the plain-language question. But it's exactly what's needed to evaluate the SLA.

What This Means for Users

When reviewing the Evidence Considered panel:

Don't assume low scores mean poor retrieval. Trust the reranker's decision.
Look at the retrieval story, not just the score. "Keyword + Vector | Rerank confirmed" is more meaningful than a number.
High-scoring passages aren't always the most useful. A passage with a 0.92 similarity score might be thematically related but not answer the specific question. A keyword-found passage with a 0.45 similarity score might be exactly what's needed.

Quality Assurance & Metrics

Retrieval Quality Assessment

Each analysis includes evidence retrieval metrics to help you assess whether the system found good evidence:

Rating	Threshold	Interpretation
Strong Retrieval	Average similarity > 0.85	Most passages are excellent semantic matches
Moderate Retrieval	Average similarity 0.70–0.85	Mix of good and partial matches
Weak Retrieval	Average similarity < 0.70	Many passages are peripheral to the question
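As a worked illustration of the thresholds above, the following sketch averages a list of similarity scores and assigns a rating. The function name and return shape are assumptions, and treating an empty list as an error is a choice made here, not documented behavior.

```typescript
type RetrievalRating = "Strong Retrieval" | "Moderate Retrieval" | "Weak Retrieval";

// Averages per-passage similarity and applies the thresholds from the table.
function rateRetrieval(similarities: number[]): { average: number; rating: RetrievalRating } {
  if (similarities.length === 0) {
    throw new Error("Cannot rate retrieval without any passages");
  }
  const total = similarities.reduce((sum, s) => sum + s, 0);
  const average = total / similarities.length;
  const rating: RetrievalRating =
    average > 0.85 ? "Strong Retrieval" : average < 0.7 ? "Weak Retrieval" : "Moderate Retrieval";
  return { average, rating };
}
```

For instance, scores of 0.58 and 0.45 average to roughly 0.52 and rate as Weak Retrieval, even though either passage may be essential evidence, which is why the rating should be read alongside the retrieval story.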

When Retrieval Quality Matters

Strong retrieval + high faithfulness = high confidence in the analysis
Weak retrieval + high faithfulness = AI worked with limited evidence; conclusions are still sound but rely on fewer sources
Weak retrieval + low faithfulness = potential gap between what evidence exists and what the AI found; consider adding more evidence and regenerating

See AI Quality Metrics for full details on faithfulness, coverage, and overall confidence levels.

Technical Reference for CC

Core Components

Component	Location	Purpose
Keyword search	PostgreSQL tsvector/tsquery	Full-text search on evidence text
Vector embeddings	pgvector + Amazon Titan Text Embeddings V2	Semantic search via dense vectors
Reranking	Cohere Rerank 3.5 API	Re-evaluation of combined results
Team-draft merge	`retrieval.ts`	Alternating result merge
Query classification	`queryClassifier.ts`	Detect entity-heavy vs. conceptual queries
Feedback loop	`retrieval_benchmark_result` table	Track retrieval quality over time

Key Files for Reference

evidence-management.md — Evidence storage, indexing, and processing
ai-analysis.md — Full AI analysis workflow, including "Evidence Retrieval" section
ai-quality-metrics.md — Detailed metrics for evaluating analysis quality
evidence_evaluation_framework.md — The 10-criterion framework AI uses to assess evidence quality

Implementation Notes

Evidence Processing Pipeline:

Evidence text is extracted from PDFs, Word documents, Excel files
Text is chunked into smaller segments
Chunks are converted to vector embeddings (async, triggered on upload)
Full text remains searchable for keyword queries
Processing status shown in analysis generation feedback

Chunking Strategy:

Large documents (> 10K words) are split into overlapping chunks
Overlap preserves context at chunk boundaries
Chunk size tuned to balance context and specificity
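A minimal sketch of overlapping word-based chunking follows. The chunk size and overlap are parameters here because this document does not state the production values, and splitting on whitespace is an assumption; the real pipeline may chunk by tokens, sentences, or document structure.

```typescript
// Splits text into chunks of chunkSize words, where each chunk repeats
// the last `overlap` words of the previous one to preserve context.
function chunkWords(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    // Stop once this chunk has reached the end of the text.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}
```

Calling `chunkWords("a b c d e f g", 4, 2)` produces `"a b c d"`, `"c d e f"`, and `"e f g"`: each chunk begins with the final two words of the one before it.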

Reranking Workflow:

Keyword search returns ranked list
Semantic search returns ranked list
Team-draft merge interleaves both lists
Reranker reads each passage + original question
Reranker returns relevance scores and reordered results
Top-K passages assembled for AI prompt
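The last three steps of this workflow can be sketched with an injected scoring function. The `Scorer` type stands in for the reranker call and is not Cohere's API; a real call would be asynchronous and batched. The word-overlap scorer below exists only to make the sketch runnable and is far cruder than an actual reranker. All names are hypothetical.

```typescript
interface Candidate {
  id: string;
  text: string;
}

// Placeholder for the reranker: returns a relevance score for one passage.
type Scorer = (question: string, passage: string) => number;

// Scores every merged candidate against the question, then keeps the top K.
function rerankTopK(
  question: string,
  candidates: Candidate[],
  score: Scorer,
  topK: number
): Array<Candidate & { relevance: number }> {
  return candidates
    .map((c) => ({ ...c, relevance: score(question, c.text) }))
    .sort((x, y) => y.relevance - x.relevance)
    .slice(0, topK);
}

// Toy scorer for demonstration only: counts question words found in the passage.
const overlapScorer: Scorer = (question, passage) => {
  const passageWords = passage.toLowerCase().split(/\W+/);
  return question
    .toLowerCase()
    .split(/\W+/)
    .filter((w) => w.length > 0 && passageWords.indexOf(w) !== -1).length;
};
```

Keeping the scorer as a parameter means the merge and top-K logic can be tested without calling an external service.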

User Guidance: When Retrieval Might Need Improvement

Signs of Weak Retrieval

Analysis has low "Retrieval Quality" metric (< 0.70 average similarity)
Relevant evidence is missing from Evidence Considered panel
Faithfulness score is low despite adding more evidence
AI mentions gaps that shouldn't exist given available evidence

How to Improve Retrieval

Add more evidence in the relevant topic area
Wait for embeddings to process — vector search requires embeddings on all evidence
Rephrase the question to match evidence terminology:
- If evidence says "vendor selection process," ask "How was the vendor selected?"
- If evidence uses technical jargon, use the same terms in questions
Link evidence to questions during collection — context helps retrieval
Regenerate with Request Revision:
- Select "New evidence added"
- Explain what evidence was added
- Let the system retrieve again with fresh data

FAQ

Q: Why is a passage with a 0.45 similarity score included in my analysis?

A: Keyword search found it (exact-match terms), and the reranker confirmed it's relevant to your specific question. Low semantic similarity doesn't mean low relevance—it often means the passage is highly technical or specific to your topic. The system is working as designed.

Q: Can I manually adjust which evidence is retrieved?

A: Not directly. The pipeline is designed to surface what's most relevant. If evidence isn't being found, add more related evidence and regenerate. If the AI isn't using retrieved evidence, check the AI's reasoning in the analysis output.

Q: What happens if I have very little evidence?

A: The system retrieves whatever is available and generates analysis with "Insufficient" confidence. The Evidence Considered panel shows what it found; gaps are flagged in the analysis. Use Gap Analysis to identify what's missing, then collect additional evidence.

Q: Do I need to configure keyword weighting or reranking manually?

A: No. Query-adaptive weighting is automatic based on the question content. Reranking thresholds are set by default. If retrieval isn't working well, focus on evidence quality and collection rather than tuning the pipeline.

Q: How often is the retrieval pipeline updated?

A: The three-stage approach is stable. Improvements (team-draft interleaving, query-adaptive weighting) are deployed as incremental enhancements. Baseline metrics are monitored continuously via the feedback loop.

Version History

2026-03-27: Initial document created (NQU-482). Covers three-stage pipeline, team-draft interleaving, query-adaptive weighting, continuous feedback, Evidence Considered panel, similarity scores, and user guidance.
2026-03-27: Major pipeline improvements deployed: team-draft interleaving, query-adaptive weighting, retrieval feedback loop (NQU-462). Evidence Considered panel redesigned (NQU-486).
2026-03-06: Three-stage pipeline first documented in evidence-management.md (NQU-379).
