AI Analysis Quality Metrics

Last Updated: March 27, 2026 (NQU-401 review)

Overview

Nquiry's quality metrics evaluate the reliability and completeness of each AI-generated analysis. They help users judge how much confidence to place in an AI output and identify areas that may need additional review or evidence.

Every AI analysis is evaluated across four dimensions:

  1. Faithfulness - Are the AI's claims grounded in the evidence?
  2. Coverage - Does the analysis address all aspects of the question?
  3. Retrieval Quality - How relevant was the evidence found?
  4. Overall Confidence - A holistic quality assessment

Quality Confidence Levels

The system assigns one of four confidence levels to each analysis:

| Level | Color | Meaning |
| --- | --- | --- |
| Established | Green | High quality across all metrics. Claims are verified, questions fully addressed, strong evidence retrieval. |
| Probable | Blue | Good quality with minor gaps. Most claims verified, question mostly addressed, adequate evidence. |
| Possible | Amber | Acceptable quality but notable limitations. Some unverified claims or coverage gaps identified. |
| Insufficient | Gray | Below quality thresholds. Evidence too limited to support reliable conclusions. |

How Confidence Is Calculated

The overall confidence is derived from the combination of individual metrics:

Established: faithfulness ≥ 95% AND coverage ≥ 95% AND retrieval = strong AND validation passed
Probable: faithfulness ≥ 85% AND coverage ≥ 85% AND retrieval ≥ moderate
Possible: faithfulness ≥ 70% OR coverage ≥ 70%
Insufficient: Below all thresholds or validation failed
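
The rules above can be sketched as a small decision function. This is a minimal illustration, not Nquiry's actual implementation: the function name, the lowercase retrieval labels, and the handling of missing (null) scores are assumptions. Scores are expressed on a 0-1 scale.

```python
from typing import Optional

# Ordering of retrieval levels, weakest to strongest (labels are illustrative).
RETRIEVAL_RANK = {"weak": 0, "moderate": 1, "strong": 2}


def _at_least(score: Optional[float], threshold: float) -> bool:
    """Assumption: a missing (null) score never satisfies a threshold."""
    return score is not None and score >= threshold


def classify_confidence(
    faithfulness: Optional[float],
    coverage: Optional[float],
    retrieval: str,
    validation_passed: bool,
) -> str:
    """Map the individual metrics to one of the four confidence levels."""
    # Assumption: a failed schema validation overrides every other metric.
    if not validation_passed:
        return "Insufficient"
    rank = RETRIEVAL_RANK[retrieval]
    if (_at_least(faithfulness, 0.95) and _at_least(coverage, 0.95)
            and rank == RETRIEVAL_RANK["strong"]):
        return "Established"
    if (_at_least(faithfulness, 0.85) and _at_least(coverage, 0.85)
            and rank >= RETRIEVAL_RANK["moderate"]):
        return "Probable"
    # Note the OR: one metric at or above 70% is enough for "Possible".
    if _at_least(faithfulness, 0.70) or _at_least(coverage, 0.70):
        return "Possible"
    return "Insufficient"
```

Checking the rules from the top down means each analysis receives the highest level whose conditions it fully satisfies.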

Faithfulness Score

What It Measures

Faithfulness measures whether the factual claims made in the AI's analysis are actually supported by the retrieved evidence. This is critical for ensuring the AI isn't "hallucinating" or making unsupported assertions.

How It Works

  1. The system identifies factual claims in the AI's output
  2. Each claim is checked against the evidence chunks that were retrieved
  3. Claims are marked as "supported" if evidence exists, "unsupported" if not
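
As a rough sketch of the scoring step, the faithfulness score can be read as the share of extracted claims marked "supported". The `Claim` type and `faithfulness_report` function are hypothetical names, and how claims are extracted and verified is outside this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Claim:
    text: str
    supported: bool  # True if retrieved evidence backs the claim


def faithfulness_report(claims: List[Claim]) -> Tuple[Optional[float], List[str]]:
    """Return (score on a 0-1 scale, texts of unsupported claims)."""
    # Assumption: with no claims to check there is no score, rather than a default.
    if not claims:
        return None, []
    unsupported = [c.text for c in claims if not c.supported]
    score = (len(claims) - len(unsupported)) / len(claims)
    return score, unsupported
```

The unsupported list is what a reviewer would work through first when the score is low.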

Score Interpretation

| Score | Interpretation | Action |
| --- | --- | --- |
| 95-100% | Excellent - Nearly all claims verified | High confidence in factual accuracy |
| 85-94% | Good - Most claims verified | Review any flagged unsupported claims |
| 70-84% | Moderate - Notable unsupported claims | Carefully verify conclusions before use |
| <70% | Poor - Many unverified claims | Consider regenerating or manual review |

Example Output

Faithfulness: 92%
11/12 claims verified

Unsupported claims:
- "The vendor has a history of similar billing issues"

What to Do When Faithfulness Is Low

  1. Review unsupported claims - The system lists specific claims that couldn't be verified
  2. Check if evidence exists - The claim might be true but evidence wasn't uploaded
  3. Add missing evidence - Upload documents that support the claims
  4. Regenerate analysis - After adding evidence, generate a new analysis

Coverage Score

What It Measures

Coverage measures whether the AI's analysis addresses all the key aspects of the question or topic being analyzed. A high coverage score means the analysis is comprehensive; a low score indicates gaps.

How It Works

  1. The question/topic is broken down into constituent elements
  2. The AI's response is checked to see which elements are addressed
  3. Elements not addressed are flagged as "gaps"

Score Interpretation

| Score | Interpretation | Action |
| --- | --- | --- |
| 95-100% | Comprehensive - All aspects addressed | Analysis is thorough |
| 85-94% | Good - Minor gaps only | Review identified gaps |
| 70-84% | Partial - Notable gaps | Consider follow-up analysis on gaps |
| <70% | Incomplete - Major gaps | Regenerate with more specific direction |

Example Output

Coverage: 80%
4/5 elements addressed

Gaps:
- No analysis of timeline discrepancies

Understanding Coverage Elements

The system identifies elements based on the question structure:

Example Question: "Did the vendor submit invoices for services that were not performed, and what is the financial impact?"

Elements identified:

  1. Whether invoices were submitted
  2. Whether services were performed
  3. Connection between invoices and services
  4. Financial impact assessment
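
To make the element-based scoring concrete, the sketch below computes a coverage score and gap list for the example question and maps the score to the interpretation bands in the table above. The function names are hypothetical, and which elements count as addressed is invented for illustration.

```python
from typing import Dict, List, Optional, Tuple


def coverage_report(elements: Dict[str, bool]) -> Tuple[Optional[float], List[str]]:
    """elements maps each identified element to whether the analysis addressed it."""
    # Assumption: no identified elements means no score.
    if not elements:
        return None, []
    gaps = [name for name, addressed in elements.items() if not addressed]
    score = (len(elements) - len(gaps)) / len(elements)
    return score, gaps


def interpret_coverage(score: float) -> str:
    """Map a 0-1 coverage score to the interpretation bands."""
    if score >= 0.95:
        return "Comprehensive"
    if score >= 0.85:
        return "Good"
    if score >= 0.70:
        return "Partial"
    return "Incomplete"


# Illustrative only: suppose the analysis skipped the financial impact.
example = {
    "Whether invoices were submitted": True,
    "Whether services were performed": True,
    "Connection between invoices and services": True,
    "Financial impact assessment": False,
}
score, gaps = coverage_report(example)
```

Here three of four elements are addressed, so the score is 0.75 and falls in the "Partial" band.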

What to Do When Coverage Is Low

  1. Review the gaps list - Understand what wasn't addressed
  2. Check evidence availability - Gaps may exist because evidence is missing
  3. Provide direction - Regenerate with specific instructions to address gaps
  4. Split the question - Complex questions may need multiple analyses

Retrieval Quality

What It Measures

Retrieval quality measures how effectively the system found relevant evidence for the analysis. Evidence chunks related to the question are found through a combination of keyword and semantic search.

How It Works

Nquiry uses a hybrid retrieval pipeline (NQU-379, NQU-462) built on three stages: keyword search, semantic search, and reranking. The steps run in this order:

  1. Keyword search (PostgreSQL tsvector) finds evidence by exact terms
  2. Semantic search (Titan Text Embeddings V2 via pgvector) finds evidence by meaning
  3. The keyword and semantic results are merged via team-draft interleaving
  4. Reranking (Cohere Rerank 3.5) re-scores the merged results for query-time relevance
  5. Statistics are calculated on the final retrieved set
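
The merge step can be illustrated with a generic team-draft interleaving routine. This is a textbook-style sketch, not Nquiry's code: the function name and the chunk IDs are hypothetical, and the query-adaptive weighting mentioned in the changelog is not modeled.

```python
import random
from typing import List, Optional


def team_draft_interleave(
    ranking_a: List[str],
    ranking_b: List[str],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Merge two ranked lists of chunk IDs into one list without duplicates."""
    rng = rng or random.Random()
    merged: List[str] = []
    seen = set()
    picks_a = picks_b = 0
    ia = ib = 0
    while True:
        # Skip entries the other list already contributed.
        while ia < len(ranking_a) and ranking_a[ia] in seen:
            ia += 1
        while ib < len(ranking_b) and ranking_b[ib] in seen:
            ib += 1
        a_left = ia < len(ranking_a)
        b_left = ib < len(ranking_b)
        if not a_left and not b_left:
            break
        # The list with fewer picks goes next; ties are broken by a coin flip.
        a_turn = a_left and (
            not b_left
            or picks_a < picks_b
            or (picks_a == picks_b and rng.random() < 0.5)
        )
        if a_turn:
            choice = ranking_a[ia]
            picks_a += 1
        else:
            choice = ranking_b[ib]
            picks_b += 1
        merged.append(choice)
        seen.add(choice)
    return merged


keyword_hits = ["c1", "c2", "c3"]
semantic_hits = ["c2", "c4", "c1"]
merged = team_draft_interleave(keyword_hits, semantic_hits, random.Random(0))
```

Because the two lists alternate picks, neither search method can crowd the other out of the top of the merged list before the reranker sees it.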

Confidence Levels

| Level | Avg Similarity | Meaning |
| --- | --- | --- |
| Strong | > 0.35 | Highly relevant evidence found |
| Moderate | 0.15 - 0.35 | Reasonably relevant evidence |
| Weak | < 0.15 | Limited relevance in available evidence |

Note: Thresholds are calibrated for Amazon Titan Text Embeddings V2 (NQU-481/491). Titan V2 cosine similarities are naturally lower than those of other embedding models; relevant evidence averages about 0.24 similarity. If the embedding model changes, run npm run calibrate-thresholds to recalibrate.
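
A minimal sketch of the level lookup, using the thresholds from the table above. The function and constant names are hypothetical; keeping the cutoffs as named constants reflects the note that they must be recalibrated if the embedding model changes.

```python
# Cutoffs calibrated for Titan V2 (values taken from the table above).
STRONG_ABOVE = 0.35
MODERATE_FROM = 0.15


def retrieval_level(avg_similarity: float) -> str:
    """Map average cosine similarity to a retrieval confidence level."""
    if avg_similarity > STRONG_ABOVE:
        return "Strong"
    if avg_similarity >= MODERATE_FROM:
        return "Moderate"
    return "Weak"
```

With these cutoffs, the typical Titan V2 average of about 0.24 for relevant evidence lands in the Moderate band.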

Detailed Statistics

The system provides detailed retrieval statistics:

| Metric | Description |
| --- | --- |
| Chunks retrieved | Total evidence chunks found above threshold |
| Chunks used | Chunks actually included in AI context (may be limited by token budget) |
| High relevance | Chunks with similarity > 0.85 |
| Medium relevance | Chunks with similarity 0.70 - 0.85 |
| Low relevance | Chunks with similarity < 0.70 |
| Similarity range | Min and max similarity scores |
| Avg similarity | Mean similarity across all retrieved chunks |
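
The statistics in the table can be derived from the list of similarity scores. The sketch below is illustrative: the dictionary keys are hypothetical (the actual layout of the stored statistics is not documented here), and it assumes the relevance bands are counted over all retrieved chunks.

```python
from statistics import mean
from typing import Dict, List

# Relevance band cutoffs from the table above.
HIGH_CUTOFF = 0.85
MEDIUM_CUTOFF = 0.70


def retrieval_stats(similarities: List[float], chunks_used: int) -> Dict[str, object]:
    """Summarize similarity scores for the retrieved chunks."""
    stats: Dict[str, object] = {
        "chunks_retrieved": len(similarities),
        "chunks_used": chunks_used,
        "high_relevance": sum(s > HIGH_CUTOFF for s in similarities),
        "medium_relevance": sum(MEDIUM_CUTOFF <= s <= HIGH_CUTOFF for s in similarities),
        "low_relevance": sum(s < MEDIUM_CUTOFF for s in similarities),
        "similarity_range": None,
        "avg_similarity": None,
    }
    # Range and mean are undefined when nothing was retrieved.
    if similarities:
        stats["similarity_range"] = (min(similarities), max(similarities))
        stats["avg_similarity"] = mean(similarities)
    return stats
```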

Example Output

Retrieval: Strong
Avg similarity: 0.78

Chunks retrieved: 12
Chunks used: 10
High relevance: 3
Medium relevance: 7
Low relevance: 2
Similarity range: 0.65 - 0.89

What Affects Retrieval Quality

Higher scores when:

  • Evidence directly discusses the question topic
  • Evidence uses similar terminology to the question
  • Sufficient evidence has been uploaded and processed

Lower scores when:

  • Evidence is tangentially related
  • Technical jargon differs between question and evidence
  • Limited evidence available for the topic

What to Do When Retrieval Is Weak

  1. Add more relevant evidence - Upload documents that directly address the question
  2. Rephrase the question - Use terminology that matches your evidence
  3. Ensure embeddings are processed - Check that evidence has been indexed
  4. Consider the evidence scope - Some questions may not have strong documentary evidence

Schema Validation

What It Measures

Schema validation checks whether the AI's output conforms to the expected structured format. This ensures the analysis can be properly parsed and displayed.

Validation Status

| Status | Meaning |
| --- | --- |
| Passed | Output matches expected schema |
| Failed | Output structure is invalid or incomplete |
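
As a rough illustration of what a structural check involves, the sketch below parses raw output as JSON and verifies a set of required fields. The schema shown is entirely hypothetical, since the actual expected format is not described in this document.

```python
import json
from typing import List, Tuple

# Hypothetical schema for illustration only; not Nquiry's real output format.
REQUIRED_FIELDS = {"summary": str, "findings": list}


def validate_output(raw: str) -> Tuple[bool, List[str]]:
    """Return (passed, list of problems found)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return False, ["top-level value must be an object"]
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    return not problems, problems
```

Returning the list of problems alongside the pass/fail flag gives a reviewer something concrete to look for in the raw output.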

Why Validation Matters

  • Ensures consistent analysis format
  • Enables structured data extraction
  • Supports report generation
  • Maintains data integrity

What to Do When Validation Fails

  1. Review the raw output - Check for formatting issues
  2. Regenerate the analysis - AI occasionally produces malformed output
  3. Report persistent issues - Contact support if validation consistently fails

Using Quality Metrics in Practice

For Investigators

Before relying on an analysis:

  1. Check the overall confidence level
  2. Review any warnings (unsupported claims, coverage gaps)
  3. Consider retrieval quality - weak retrieval may indicate missing evidence

When quality is low:

  1. Don't discard the analysis entirely - it may still provide useful insights
  2. Treat flagged items as hypotheses requiring verification
  3. Use gaps as a guide for additional evidence collection

For Reports

Quality metrics can be included in reports to provide transparency:

Analysis Quality Assessment
---------------------------
Overall Confidence: Probable
- Faithfulness: 95% (19/20 claims verified)
- Coverage: 100% (all elements addressed)
- Evidence Retrieval: Moderate (avg similarity 0.26)

Note: One claim regarding vendor history could not be
verified against available documentation.
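
A report block in this layout could be produced by a small formatting helper. The sketch below is one possible approach; the function name and parameters are hypothetical and do not reflect Nquiry's report generator.

```python
def render_quality_block(
    confidence: str,
    faithfulness: float,
    verified: int,
    total_claims: int,
    coverage: float,
    retrieval_level: str,
    avg_similarity: float,
) -> str:
    """Format quality metrics as a plain-text report section."""
    title = "Analysis Quality Assessment"
    lines = [
        title,
        "-" * len(title),
        f"Overall Confidence: {confidence}",
        # Scores are stored on a 0-1 scale and shown as whole percentages.
        f"- Faithfulness: {faithfulness:.0%} ({verified}/{total_claims} claims verified)",
        f"- Coverage: {coverage:.0%}",
        f"- Evidence Retrieval: {retrieval_level} (avg similarity {avg_similarity:.2f})",
    ]
    return "\n".join(lines)
```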

For Quality Assurance

Quality metrics enable systematic review:

  • Track metrics across analyses to identify patterns
  • Flag analyses below thresholds for supervisor review
  • Use metrics to prioritize evidence collection efforts
  • Monitor for degradation in AI performance over time
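
Flagging analyses for supervisor review could look like the sketch below. The dictionary keys follow the field names listed under Database Fields; the function name and the 0.85 review thresholds are assumptions chosen for illustration, and each team would set its own.

```python
from typing import Dict, List


def needs_supervisor_review(
    analysis: Dict[str, object],
    min_faithfulness: float = 0.85,
    min_coverage: float = 0.85,
) -> bool:
    """Return True if an analysis should be routed to a supervisor."""
    faithfulness = analysis.get("faithfulness_score")
    coverage = analysis.get("coverage_score")
    # Assumption: a missing (null) score or failed validation always needs review.
    if faithfulness is None or coverage is None:
        return True
    if not analysis.get("validation_passed", False):
        return True
    return faithfulness < min_faithfulness or coverage < min_coverage


def flag_for_review(analyses: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep only the analyses that fall below the review thresholds."""
    return [a for a in analyses if needs_supervisor_review(a)]
```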

Technical Details

Embedding Model

  • Model: Amazon Titan Text Embeddings V2
  • Dimensions: 1024
  • Similarity metric: Cosine similarity

Similarity Thresholds

| Analysis Type | Default Threshold |
| --- | --- |
| Question Analysis | 0.20 |
| Topic Analysis | 0.20 |
| Overall Summary | 0.50 |
| Gap Analysis | 0.50 |

Context Limits

  • Maximum evidence context: 8,000 tokens
  • Chunks are included by similarity rank until limit reached
  • Excluded chunks are logged for audit purposes
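
The similarity thresholds and the context limit can be combined into one selection step, sketched below. The `Chunk` type, the analysis-type keys, and the per-chunk token counts are hypothetical. The sketch also assumes that a chunk exactly at the threshold qualifies and that selection stops at the first chunk that does not fit the budget.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_CONTEXT_TOKENS = 8000

# Default thresholds from the table above; the keys are illustrative.
DEFAULT_THRESHOLDS = {"question": 0.20, "topic": 0.20, "summary": 0.50, "gap": 0.50}


@dataclass
class Chunk:
    chunk_id: str
    similarity: float
    tokens: int


def select_context(
    chunks: List[Chunk],
    analysis_type: str,
    max_tokens: int = MAX_CONTEXT_TOKENS,
) -> Tuple[List[Chunk], List[Chunk]]:
    """Return (included, excluded) chunks for the AI context."""
    threshold = DEFAULT_THRESHOLDS[analysis_type]
    eligible = sorted(
        (c for c in chunks if c.similarity >= threshold),
        key=lambda c: c.similarity,
        reverse=True,
    )
    included: List[Chunk] = []
    excluded: List[Chunk] = []
    used = 0
    limit_reached = False
    for chunk in eligible:
        if not limit_reached and used + chunk.tokens <= max_tokens:
            included.append(chunk)
            used += chunk.tokens
        else:
            # Once the budget is hit, all lower-ranked chunks are excluded.
            limit_reached = True
            excluded.append(chunk)
    return included, excluded
```

The excluded list is what would be written to the audit log.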

Database Fields

Quality metrics are stored in the analysis table:

| Field | Type | Description |
| --- | --- | --- |
| quality_confidence | enum | Overall confidence level |
| faithfulness_score | float | 0-1 score for claim verification |
| faithfulness_details | jsonb | Claim-by-claim breakdown |
| coverage_score | float | 0-1 score for question coverage |
| coverage_details | jsonb | Element-by-element breakdown |
| retrieval_stats | jsonb | Detailed retrieval statistics |
| validation_passed | boolean | Schema validation result |
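
For code that reads these fields, a typed record can make the value ranges explicit. The sketch below is an assumption-laden illustration: the class name is hypothetical, the lowercase enum values are a guess at the stored representation, and scores are optional because failed quality checks return null (NQU-394).

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Assumed stored values for the enum; the real representation may differ.
CONFIDENCE_LEVELS = ("established", "probable", "possible", "insufficient")


@dataclass
class AnalysisQuality:
    quality_confidence: str
    faithfulness_score: Optional[float]  # None when the check failed
    faithfulness_details: Dict[str, Any]
    coverage_score: Optional[float]  # None when the check failed
    coverage_details: Dict[str, Any]
    retrieval_stats: Dict[str, Any]
    validation_passed: bool

    def __post_init__(self) -> None:
        if self.quality_confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence level: {self.quality_confidence}")
        # Scores are stored on a 0-1 scale, not as percentages.
        for name in ("faithfulness_score", "coverage_score"):
            value = getattr(self, name)
            if value is not None and not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1")
```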

Frequently Asked Questions

Why is my faithfulness score low even though the evidence supports the claims?

The faithfulness check uses the retrieved evidence chunks, not all evidence. If relevant evidence wasn't retrieved (due to low similarity), claims may appear unsupported. Try:

  • Ensuring all evidence is processed for embeddings
  • Rephrasing questions to match evidence terminology
  • Adding evidence that directly addresses the claims

Why does coverage show gaps when I think everything was addressed?

Coverage is evaluated against a structured breakdown of the question. The AI may have addressed topics implicitly rather than explicitly. Review the specific gaps listed - sometimes rewording the question or providing direction helps.

What's a "good" retrieval similarity score?

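With the current embedding model (Amazon Titan Text Embeddings V2), average similarity maps to retrieval confidence as follows:

  • Above 0.35: Strong - highly relevant evidence found
  • 0.15-0.35: Moderate - reasonably relevant evidence
  • Below 0.15: Weak - limited relevance in available evidence

Titan V2 similarities are naturally lower than those of other embedding models; relevant evidence averages about 0.24. Scores also vary with evidence type and question complexity. Technical investigations may naturally have lower scores than straightforward document reviews.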

Should I regenerate every analysis with low quality scores?

Not necessarily. Consider:

  • Is the analysis still useful despite limitations?
  • Can you add evidence to address gaps?
  • Would human review suffice for flagged items?

Regeneration helps when: new evidence is available, the question can be rephrased, or you can provide specific direction to address gaps.

How do quality metrics affect the final report?

Quality metrics are informational - they help you understand AI reliability but don't prevent report generation. You can:

  • Include quality summaries in reports for transparency
  • Use metrics to prioritize which analyses need human review
  • Note limitations based on quality findings

Changelog

| Date | Change |
| --- | --- |
| 2026-02-03 | Initial implementation of quality metrics UI |
| 2026-02-03 | Added faithfulness, coverage, and retrieval tracking |
| 2026-02-03 | Lowered similarity threshold to 0.2 for broader retrieval |
| 2026-03-15 | Fixed quality check error path: failed checks now return null scores instead of fabricated defaults (NQU-394). Previously, failed coverage checks returned 1.0 which inflated confidence levels. |
| 2026-03-15 | Audit logging added for quality check results: all quality scores are now logged to the audit trail with the analysis.quality_check action type (NQU-381). |
| 2026-03-24 | Citation repair system deployed (NQU-469): repairCitationIds() matches hallucinated UUIDs to real evidence IDs by title, restoring citation badges across all analyses. |
| 2026-03-26 | Quality checks fixed for non-question analysis types (NQU-474): gap analysis, error check, and summary now have proper text extraction for faithfulness/coverage checking. |
| 2026-03-26 | Chained analyses restored (NQU-474): higher-order analyses (topic, gap, error, summary) now correctly receive prior analysis context. Previously chained_analyses was empty. |
| 2026-03-26 | Quality label renamed: "Insufficient" → "Low Verifiability" (NQU-459), then reverted back to "Insufficient" (product-facts.md §10). Final label: Insufficient. |
| 2026-03-26 | Evidence Considered panel redesigned (NQU-486): plain-language retrieval story replaces raw pipeline diagnostics. |
| 2026-03-26 | Prerequisite check fixed (NQU-474): verified analyses now correctly counted as prerequisites for higher-order analysis types. |
| 2026-03-27 | Retrieval tuning pipeline deployed (NQU-462): benchmark runner, parameter sweep, query-adaptive weighting, team-draft interleaving, continuous retrieval feedback logging. |