Evidence Retrieval Pipeline

Last Updated: March 27, 2026 (last reviewed 2026-04-29)
Created for: NQU-482

Overview

Nquiry uses a three-stage hybrid retrieval pipeline to find the most relevant evidence when generating AI analysis. This document explains what each stage does, how they work together, what users see, and how to interpret the results.


The Three Stages

Stage 1: Keyword Search (Full-Text)

What it does: Scans all evidence text for exact word matches against the question.

How it works:

  • Uses PostgreSQL tsvector/tsquery full-text search
  • Finds evidence containing the same terms as the question
  • Catches exact-match evidence that semantic search might miss

When it shines:

  • Case numbers, policy codes, names, dates, identifiers
  • Canonical example: DOC-198 was found exclusively by keyword search (NQU-380)
  • High-precision retrieval when users search for specific facts

Why it matters: Some evidence can't be found by meaning alone. A case number like "2024-REC-157" or a policy reference like "Section 3.2.1" must be matched exactly. Keyword search is the system's way of ensuring these facts aren't missed.
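
A minimal sketch of how a Stage 1 query could be built. It assumes a hypothetical `evidence_chunk` table with a precomputed tsvector column named `body_tsv`; neither name comes from the Nquiry schema. It relies on PostgreSQL's `plainto_tsquery`, the `@@` match operator, and `ts_rank`, and returns parameterized SQL instead of executing it.

```typescript
// Hypothetical names: the `evidence_chunk` table and its `id`, `body`, and
// `body_tsv` columns are assumptions for illustration only.
interface SqlQuery {
  text: string;
  values: (string | number)[];
}

function buildKeywordQuery(question: string, limit: number): SqlQuery {
  // plainto_tsquery normalizes the question and ANDs its terms together;
  // ts_rank orders the matching rows by lexical relevance.
  const text = [
    "SELECT id, body, ts_rank(body_tsv, query) AS keyword_rank",
    "FROM evidence_chunk, plainto_tsquery('english', $1) AS query",
    "WHERE body_tsv @@ query",
    "ORDER BY keyword_rank DESC",
    "LIMIT $2",
  ].join("\n");
  return { text, values: [question, limit] };
}

// Example: the SQL and parameters for an identifier lookup.
const keywordQuery = buildKeywordQuery("contract 2024-B-781", 20);
console.log(keywordQuery.text, keywordQuery.values);
```

Because `plainto_tsquery` requires every term to match, a production query may need a looser query constructor or a different text search configuration for identifiers; this sketch does not claim to reflect that choice.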


Stage 2: Semantic Search (Vector Embeddings)

What it does: Converts questions and evidence into meaning vectors, then finds the most conceptually similar passages.

How it works:

  • Uses Amazon Titan Text Embeddings V2 via pgvector
  • Converts both the question and evidence into dense numerical vectors representing meaning
  • Compares vectors to find semantically similar evidence
  • Vectors capture conceptual similarity even when words differ

When it shines:

  • Related concepts expressed in different ways
  • "What policies govern procurement?" matches evidence about "vendor selection procedures"
  • Contextual understanding of concepts
  • Evidence that's relevant but uses different terminology

Why it matters: Many questions are conceptual. When someone asks about "financial controls," they're looking for evidence about budgets, approvals, audits, and oversight—not just documents with those exact words. Semantic search bridges the language gap.
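
The vector comparison can be illustrated with cosine similarity. This is a sketch under the assumption that cosine similarity is the comparison metric, which this document does not state. With pgvector, the `<=>` operator returns cosine distance, so the similarity is one minus that value.

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("Vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional vectors; real embeddings have far more dimensions.
console.log(cosineSimilarity([1, 2, 3], [2, 4, 6])); // same direction
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // orthogonal
```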


Stage 3: Reranking

What it does: Re-evaluates and reorders the combined results from keyword and semantic search by reading both the question and evidence together.

How it works:

  • Uses Cohere Rerank 3.5
  • Reads the original question alongside each candidate evidence passage
  • Scores relevance based on actual semantic fit, not just word overlap or vector similarity
  • Reorders results: promotes keyword-surfaced evidence that semantic search missed, demotes false positives

When it shines:

  • Filtering out misleading matches (passages containing the right words but in wrong context)
  • Promoting evidence that's conceptually relevant but didn't rank high in stages 1 or 2
  • Final quality control before evidence reaches the AI

Why it matters: The first two stages are broad-brush tools. Keyword search finds all mentions of a term; semantic search finds all related concepts. Reranking is the intelligent filter that says, "These results are actually relevant to answering this specific question."
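
A sketch of the reordering step, with the relevance model passed in as a function. In production the scores come from Cohere Rerank 3.5; the `overlapScorer` below is only a placeholder so the example runs without an API call, and every name in the block is hypothetical.

```typescript
interface Candidate {
  id: string;
  text: string;
}

interface Reranked extends Candidate {
  relevance: number;
  originalRank: number; // 1-based position before reranking
  newRank: number;      // 1-based position after reranking
}

// Any function that reads the question and a passage together and returns
// a relevance score. The real scorer would be the reranking model.
type Scorer = (question: string, passage: string) => number;

function rerank(question: string, candidates: Candidate[], score: Scorer): Reranked[] {
  return candidates
    .map((c, i) => ({ ...c, relevance: score(question, c.text), originalRank: i + 1 }))
    // Highest relevance first; ties keep their original order.
    .sort((a, b) => b.relevance - a.relevance || a.originalRank - b.originalRank)
    .map((c, i) => ({ ...c, newRank: i + 1 }));
}

// Placeholder scorer: the fraction of question terms found in the passage.
const tokenize = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
const overlapScorer: Scorer = (question, passage) => {
  const q = new Set(tokenize(question));
  const p = new Set(tokenize(passage));
  if (q.size === 0) return 0;
  let shared = 0;
  q.forEach((term) => {
    if (p.has(term)) shared += 1;
  });
  return shared / q.size;
};
```

Comparing `originalRank` with `newRank` shows which passages the reranker moved up or down.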


How the Stages Work Together

Team-Draft Interleaving

Results from keyword and vector search are merged using team-draft interleaving (NQU-462):

  • Alternates between picking the next best result from keyword search and semantic search
  • Replaces the old ad-hoc "guaranteed keyword top-20" slot reservation
  • Scales naturally without hardcoded slot counts
  • Ensures both retrieval approaches contribute balanced results

Effect: If keyword search ranks a document #1 and semantic search ranks it #4, the team-draft merge might place it at position #1 or #2, depending on the other results. This keeps a strong keyword hit from being buried when semantic search ranks other passages above it.
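
The merge can be sketched as follows. This follows the usual team-draft formulation, in which the list with fewer picks chooses next and a coin flip breaks ties; whether retrieval.ts randomizes ties or alternates deterministically is not stated here, so the `coinFlip` parameter and all other names are assumptions.

```typescript
type Source = "keyword" | "vector";

interface MergedResult {
  id: string;
  source: Source; // which ranked list contributed this slot
}

function teamDraftInterleave(
  keywordIds: string[],
  vectorIds: string[],
  limit: number,
  coinFlip: () => boolean = () => Math.random() < 0.5,
): MergedResult[] {
  const merged: MergedResult[] = [];
  const seen = new Set<string>();
  let k = 0; // cursor into keywordIds
  let v = 0; // cursor into vectorIds
  let keywordPicks = 0;
  let vectorPicks = 0;

  while (merged.length < limit) {
    // Skip results that were already contributed by the other list.
    while (k < keywordIds.length && seen.has(keywordIds[k])) k++;
    while (v < vectorIds.length && seen.has(vectorIds[v])) v++;
    const keywordLeft = k < keywordIds.length;
    const vectorLeft = v < vectorIds.length;
    if (!keywordLeft && !vectorLeft) break;

    // The list with fewer picks goes next; a coin flip breaks ties.
    const keywordTurn =
      !vectorLeft ||
      (keywordLeft &&
        (keywordPicks < vectorPicks ||
          (keywordPicks === vectorPicks && coinFlip())));

    const id = keywordTurn ? keywordIds[k] : vectorIds[v];
    seen.add(id);
    merged.push({ id, source: keywordTurn ? "keyword" : "vector" });
    if (keywordTurn) keywordPicks++;
    else vectorPicks++;
  }
  return merged;
}
```

No slot counts are hardcoded: each list contributes roughly half of whatever `limit` is requested, and a passage found by both lists appears only once.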

Query-Adaptive Weighting

The system automatically classifies queries and adjusts keyword weight (NQU-462):

Query Type | Keyword Boost | Detection
Entity-heavy | 2.5x | Proper nouns, dates, IDs, capitalized multi-word sequences
Hybrid | 1.5x | Mix of specific and conceptual terms
Conceptual | 1.0x | Abstract queries, no specific identifiers

Example:

  • "What is the procurement policy?" → Conceptual (1.0x boost)
  • "Did Jane Smith approve contract 2024-B-781?" → Entity-heavy (2.5x boost)
  • "How did the vendor selection process work?" → Hybrid (1.5x boost)

Why it works: Names and identifiers such as "Jane Smith" and "2024-B-781" carry little semantic signal, so embedding vectors represent them poorly, yet they are critical to the question. The system automatically prioritizes keyword search for these queries without requiring user configuration.
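
A rough sketch of how such a classifier might work. Only the three boost values come from the table above; the signal-counting rules and thresholds are assumptions, since the actual logic in queryClassifier.ts is not shown in this document. A simple heuristic like this one will not necessarily agree with the production classifier on borderline queries.

```typescript
type QueryType = "entity-heavy" | "hybrid" | "conceptual";

// Boost values taken from the table above.
const KEYWORD_BOOST: Record<QueryType, number> = {
  "entity-heavy": 2.5,
  hybrid: 1.5,
  conceptual: 1.0,
};

// Assumed heuristic: count tokens containing digits (IDs, dates) and runs of
// capitalized words that do not start the sentence (proper nouns).
function countEntitySignals(question: string): number {
  const tokens = question.split(/\s+/).filter((t) => t.length > 0);
  let signals = 0;
  let inProperNounRun = false;
  tokens.forEach((raw, index) => {
    const token = raw.replace(/^\W+|\W+$/g, ""); // trim surrounding punctuation
    const isCapitalized = index > 0 && /^[A-Z][a-z]/.test(token);
    if (/\d/.test(token)) signals += 1;
    if (isCapitalized && !inProperNounRun) signals += 1; // count each run once
    inProperNounRun = isCapitalized;
  });
  return signals;
}

// Assumed thresholds: two or more signals is entity-heavy, one is hybrid.
function classifyQuery(question: string): QueryType {
  const signals = countEntitySignals(question);
  if (signals >= 2) return "entity-heavy";
  if (signals === 1) return "hybrid";
  return "conceptual";
}

const queryType = classifyQuery("Did Jane Smith approve contract 2024-B-781?");
console.log(queryType, KEYWORD_BOOST[queryType]);
```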

Continuous Feedback Loop

After each analysis generation, the system measures retrieval quality (NQU-462):

  • Compares what evidence was retrieved vs. what the AI actually cited
  • Stores results in retrieval_benchmark_result table
  • Tracks retrieval metrics over time:
    • R@5, R@10, R@20 (recall at different depths)
    • MRR (mean reciprocal rank)
    • NDCG@20 (normalized discounted cumulative gain)

Baseline metrics (Davenheim, 5 queries):

  • R@5 = 0.166, R@10 = 0.283, R@20 = 0.323
  • MRR = 0.400, NDCG@20 = 0.362

This data informs future improvements to the retrieval pipeline.
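
The three metric families can be computed per query as sketched below, assuming the passages the AI cited are treated as the relevant set (binary relevance) and that retrieved IDs are unique. Function names are illustrative and are not taken from the codebase.

```typescript
// Fraction of relevant (cited) passages that appear in the top k results.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// 1 / rank of the first relevant passage, or 0 if none was retrieved.
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const index = retrieved.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Binary-relevance NDCG: discounted gain divided by the best achievable gain.
function ndcgAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0 || k <= 0) return 0;
  const discount = (position: number): number => 1 / Math.log2(position + 2); // 0-based
  let dcg = 0;
  retrieved.slice(0, k).forEach((id, i) => {
    if (relevant.has(id)) dcg += discount(i);
  });
  let idcg = 0;
  for (let i = 0; i < Math.min(k, relevant.size); i++) idcg += discount(i);
  return dcg / idcg;
}

// MRR is the mean of the per-query reciprocal ranks.
function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((sum, x) => sum + x, 0) / values.length;
}

const retrievedIds = ["a", "b", "c", "d"];
const citedIds = new Set(["b", "d"]);
console.log(recallAtK(retrievedIds, citedIds, 2), reciprocalRank(retrievedIds, citedIds));
```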


What Users See: Evidence Considered Panel

When reviewing AI analysis, users see the Evidence Considered panel (NQU-486, redesigned 2026-03-26), which shows:

  • Total number of evidence passages retrieved and used

Evidence List

Each passage includes:

  • Evidence title and source
  • Plain-language retrieval story: "Found by: Keyword + Vector | Rerank confirmed"
    • Examples: "Keyword only", "Vector only", "Both | Rerank demoted", "Both | Rerank promoted"
  • Similarity score (visible behind info toggle for advanced users)

Why the Plain-Language Story Matters

Rather than showing raw pipeline diagnostics, the Evidence Considered panel tells a story:

"Keyword + Vector | Rerank confirmed"

  • This passage matched on keywords AND semantic similarity
  • The reranker agreed it was relevant
  • High confidence that this evidence should be here

"Keyword only | Rerank promoted"

  • Only keyword search found this
  • Semantic search didn't catch it
  • The reranker elevated it anyway
  • Example: Case numbers, specific policy codes

"Vector only | Rerank demoted"

  • Semantic search found this
  • Keyword search didn't
  • The reranker decided it wasn't actually relevant
  • The AI chose not to use it in the final analysis
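
The label format can be sketched as a small formatting function. The field names below are hypothetical, and the sketch takes the rerank outcome as an input because this document does not define how "confirmed", "promoted", and "demoted" are derived.

```typescript
type RerankOutcome = "confirmed" | "promoted" | "demoted";

interface RetrievalTrace {
  foundByKeyword: boolean;
  foundByVector: boolean;
  rerank: RerankOutcome;
}

// Builds the plain-language story shown next to each passage.
function retrievalStory(trace: RetrievalTrace): string {
  let source: string;
  if (trace.foundByKeyword && trace.foundByVector) {
    source = "Keyword + Vector";
  } else if (trace.foundByKeyword) {
    source = "Keyword only";
  } else if (trace.foundByVector) {
    source = "Vector only";
  } else {
    throw new Error("A retrieved passage must come from at least one search stage");
  }
  return `${source} | Rerank ${trace.rerank}`;
}

console.log(retrievalStory({ foundByKeyword: true, foundByVector: false, rerank: "promoted" }));
```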

Understanding Similarity Scores

What the Scores Measure

Similarity scores range from 0.0 to 1.0 and represent how closely an evidence passage matches the question in vector space:

Score Range | Interpretation
0.85–1.0 | Excellent semantic match
0.70–0.85 | Good semantic match
0.50–0.70 | Partial semantic match
< 0.50 | Weak semantic match
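
As a lookup, the bands above could be expressed like this. The ranges in the table share their endpoints, so this sketch assumes each band includes its lower bound; that boundary handling is an assumption, not documented behavior.

```typescript
type MatchBand = "Excellent" | "Good" | "Partial" | "Weak";

// Maps a similarity score in [0, 1] to the interpretation band.
// Assumption: lower bounds are inclusive (0.85 counts as Excellent).
function interpretSimilarity(score: number): MatchBand {
  if (score < 0 || score > 1) {
    throw new Error("Similarity score must be between 0.0 and 1.0");
  }
  if (score >= 0.85) return "Excellent";
  if (score >= 0.7) return "Good";
  if (score >= 0.5) return "Partial";
  return "Weak";
}

console.log(interpretSimilarity(0.58)); // falls in the 0.50–0.70 band
```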

Why "Low" Scores Don't Mean Bad Retrieval

Important: A low similarity score does not mean the evidence is irrelevant.

Reasons a low-scoring passage can be critical:

  1. Keyword-found evidence: Passages with exact-match terms (case numbers, proper nouns) may have low semantic scores because they're highly specific. The reranker confirms they're relevant.
  2. Terminology mismatch: Evidence using technical jargon or domain-specific language may have lower semantic similarity to a plain-language question, but still be the most relevant source.
  3. Foundational evidence: Passages that establish facts needed to answer the question may be semantically distant but logically essential.

Example:

  • Question: "Did the vendor meet the service level agreement?"
  • Evidence passage: "SLA ID-2024-B: Response time < 24 hours, uptime > 99.5%"
  • Similarity score: 0.58 (relatively low)
  • Relevance: Essential (defines what the SLA requires)
  • Rerank decision: Confirmed
  • Why score is low: The passage is highly technical and specific, not a semantic match to the plain-language question. But it's exactly what's needed to evaluate the SLA.

What This Means for Users

When reviewing the Evidence Considered panel:

  • Don't assume low scores mean poor retrieval. Trust the reranker's decision.
  • Look at the retrieval story, not just the score. "Keyword + Vector | Rerank confirmed" is more meaningful than a number.
  • High-scoring passages aren't always the most useful. A 0.92 semantic match might be thematically related but not answer the specific question. A 0.45 keyword match might be exactly what's needed.

Quality Assurance & Metrics

Retrieval Quality Assessment

Each analysis includes evidence retrieval metrics to help you assess whether the system found good evidence:

Metric | What It Measures | Threshold
Strong Retrieval | Most passages are excellent semantic matches | Average similarity > 0.85
Moderate Retrieval | Mix of good and partial matches | Average similarity 0.70–0.85
Weak Retrieval | Many passages are peripheral to the question | Average similarity < 0.70
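
A sketch of how the rating could be derived from the per-passage similarity scores, using the thresholds in the table. The function and type names are hypothetical, and returning `null` for an empty result set is an assumption.

```typescript
type RetrievalRating = "Strong" | "Moderate" | "Weak";

interface RetrievalQuality {
  averageSimilarity: number;
  rating: RetrievalRating;
}

// Averages the similarity scores and applies the thresholds from the table:
// above 0.85 is Strong, 0.70 to 0.85 is Moderate, below 0.70 is Weak.
function rateRetrieval(similarityScores: number[]): RetrievalQuality | null {
  if (similarityScores.length === 0) return null; // nothing retrieved, nothing to rate
  const total = similarityScores.reduce((sum, s) => sum + s, 0);
  const averageSimilarity = total / similarityScores.length;
  let rating: RetrievalRating;
  if (averageSimilarity > 0.85) rating = "Strong";
  else if (averageSimilarity >= 0.7) rating = "Moderate";
  else rating = "Weak";
  return { averageSimilarity, rating };
}

console.log(rateRetrieval([0.91, 0.62, 0.74]));
```

Because the rating is an average of vector similarity only, a set dominated by keyword-found passages can rate as Weak even when the reranker confirmed every passage.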

When Retrieval Quality Matters

  • Strong retrieval + high faithfulness = high confidence in the analysis
  • Weak retrieval + high faithfulness = AI worked with limited evidence; conclusions are still sound but rely on fewer sources
  • Weak retrieval + low faithfulness = potential gap between what evidence exists and what the AI found; consider adding more evidence and regenerating

See AI Quality Metrics for full details on faithfulness, coverage, and overall confidence levels.


Technical Reference for CC

Core Components

Component | Location | Purpose
Keyword search | PostgreSQL tsvector/tsquery | Full-text search on evidence text
Vector embeddings | pgvector + AWS Titan Text Embeddings V2 | Semantic search via dense vectors
Reranking | Cohere Rerank 3.5 API | Re-evaluation of combined results
Team-draft merge | retrieval.ts | Alternating result merge
Query classification | queryClassifier.ts | Detect entity-heavy vs. conceptual queries
Feedback loop | retrieval_benchmark_result table | Track retrieval quality over time

Key Files for Reference

  • evidence-management.md — Evidence storage, indexing, and processing
  • ai-analysis.md — Full AI analysis workflow, including "Evidence Retrieval" section
  • ai-quality-metrics.md — Detailed metrics for evaluating analysis quality
  • evidence_evaluation_framework.md — The 10-criterion framework AI uses to assess evidence quality

Implementation Notes

Evidence Processing Pipeline:

  1. Evidence text is extracted from PDFs, Word documents, Excel files
  2. Text is chunked into smaller segments
  3. Chunks are converted to vector embeddings (async, triggered on upload)
  4. Full text remains searchable for keyword queries
  5. Processing status shown in analysis generation feedback

Chunking Strategy:

  • Large documents (> 10K words) are split into overlapping chunks
  • Overlap preserves context at chunk boundaries
  • Chunk size tuned to balance context and specificity
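
Overlapping chunking can be sketched as a sliding window over words. The chunk size and overlap used here are placeholders; the tuned production values are not given in this document, and the function name is hypothetical.

```typescript
// Splits text into word-based chunks where consecutive chunks share
// `overlap` words, so context at a boundary appears in both neighbors.
function chunkWords(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = chunkSize - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    // Stop once a chunk has reached the end of the document.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}

// Tiny example: windows of 4 words that overlap by 2.
console.log(chunkWords("a b c d e f g", 4, 2));
```

A larger overlap preserves more boundary context at the cost of more chunks to embed and store.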

Reranking Workflow:

  1. Keyword search returns ranked list
  2. Semantic search returns ranked list
  3. Team-draft merge interleaves both lists
  4. Reranker reads each passage + original question
  5. Reranker returns relevance scores and reordered results
  6. Top-K passages assembled for AI prompt

User Guidance: When Retrieval Might Need Improvement

Signs of Weak Retrieval

  • Analysis has low "Retrieval Quality" metric (< 0.70 average similarity)
  • Relevant evidence is missing from Evidence Considered panel
  • Faithfulness score is low despite adding more evidence
  • AI mentions gaps that shouldn't exist given available evidence

How to Improve Retrieval

  1. Add more evidence in the relevant topic area
  2. Wait for embeddings to process — vector search requires embeddings on all evidence
  3. Rephrase the question to match evidence terminology:
    • If evidence says "vendor selection process," ask "How was the vendor selected?"
    • If evidence uses technical jargon, use the same terms in questions
  4. Link evidence to questions during collection — context helps retrieval
  5. Regenerate with Request Revision:
    • Select "New evidence added"
    • Explain what evidence was added
    • Let the system retrieve again with fresh data

FAQ

Q: Why is a passage with a 0.45 similarity score included in my analysis?

A: Keyword search found it (exact-match terms), and the reranker confirmed it's relevant to your specific question. Low semantic similarity doesn't mean low relevance—it often means the passage is highly technical or specific to your topic. The system is working as designed.

Q: Can I manually adjust which evidence is retrieved?

A: Not directly. The pipeline is designed to surface what's most relevant. If evidence isn't being found, add more related evidence and regenerate. If the AI isn't using retrieved evidence, check the AI's reasoning in the analysis output.

Q: What happens if I have very little evidence?

A: The system retrieves whatever is available and generates analysis with "Insufficient" confidence. The Evidence Considered panel shows what it found; gaps are flagged in the analysis. Use Gap Analysis to identify what's missing, then collect additional evidence.

Q: Do I need to configure keyword weighting or reranking manually?

A: No. Query-adaptive weighting is automatic based on the question content. Reranking thresholds are set by default. If retrieval isn't working well, focus on evidence quality and collection rather than tuning the pipeline.

Q: How often is the retrieval pipeline updated?

A: The three-stage approach is stable. Improvements (team-draft interleaving, query-adaptive weighting) are deployed as incremental enhancements. Baseline metrics are monitored continuously via the feedback loop.


Version History

  • 2026-03-27: Initial document created (NQU-482). Covers three-stage pipeline, team-draft interleaving, query-adaptive weighting, continuous feedback, Evidence Considered panel, similarity scores, and user guidance.
  • 2026-03-06: Three-stage pipeline first documented in evidence-management.md (NQU-379).
  • 2026-03-27: Major pipeline improvements deployed: team-draft interleaving, query-adaptive weighting, retrieval feedback loop (NQU-462). Evidence Considered panel redesigned (NQU-486).

See Also