AI Analysis Quality Metrics

Last Updated: March 27, 2026 (NQU-401 review)

Overview

Nquiry includes a comprehensive quality metrics system that evaluates the reliability and completeness of AI-generated analyses. These metrics help users understand how much confidence to place in AI outputs and identify areas that may need additional review or evidence.

Every AI analysis is evaluated across four dimensions:

Faithfulness - Are the AI's claims grounded in the evidence?
Coverage - Does the analysis address all aspects of the question?
Retrieval Quality - How relevant was the evidence found?
Overall Confidence - A holistic quality assessment

Quality Confidence Levels

The system assigns one of four confidence levels to each analysis:

Level	Color	Meaning
Established	Green	High quality across all metrics. Claims are verified, questions fully addressed, strong evidence retrieval.
Probable	Blue	Good quality with minor gaps. Most claims verified, question mostly addressed, adequate evidence.
Possible	Amber	Acceptable quality but notable limitations. Some unverified claims or coverage gaps identified.
Insufficient	Gray	Below quality thresholds. Evidence too limited to support reliable conclusions.

How Confidence Is Calculated

The overall confidence is derived from the combination of individual metrics:

Established:  faithfulness ≥ 95% AND coverage ≥ 95% AND retrieval = strong AND validation passed
Probable:     faithfulness ≥ 85% AND coverage ≥ 85% AND retrieval ≥ moderate
Possible:     faithfulness ≥ 70% OR coverage ≥ 70%
Insufficient: below all thresholds OR validation failed
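
The rules above can be read as a first-match cascade. The following is a minimal Python sketch under that reading, not Nquiry's actual implementation. The function name `derive_confidence`, the 0-1 score scale (borrowed from the Database Fields table), the choice to check validation first, and the treatment of a missing (null) score as failing every threshold are all assumptions.

```python
from typing import Optional

# Retrieval levels ordered from weakest to strongest.
RETRIEVAL_RANK = {"weak": 0, "moderate": 1, "strong": 2}


def derive_confidence(
    faithfulness: Optional[float],
    coverage: Optional[float],
    retrieval: str,
    validation_passed: bool,
) -> str:
    """Return a confidence level for scores on a 0-1 scale (hypothetical helper)."""
    # Assumption: a failed schema validation forces "Insufficient".
    if not validation_passed:
        return "Insufficient"
    # Assumption: a missing score (None) is treated as 0 and fails every threshold.
    f = faithfulness if faithfulness is not None else 0.0
    c = coverage if coverage is not None else 0.0
    rank = RETRIEVAL_RANK[retrieval]

    if f >= 0.95 and c >= 0.95 and rank == RETRIEVAL_RANK["strong"]:
        return "Established"
    if f >= 0.85 and c >= 0.85 and rank >= RETRIEVAL_RANK["moderate"]:
        return "Probable"
    if f >= 0.70 or c >= 0.70:
        return "Possible"
    return "Insufficient"
```

Under these assumptions, faithfulness 0.95 with coverage 1.0 and moderate retrieval yields Probable, because Established additionally requires strong retrieval.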

Faithfulness Score

What It Measures

Faithfulness measures whether the factual claims made in the AI's analysis are actually supported by the retrieved evidence. This is critical for ensuring the AI isn't "hallucinating" or making unsupported assertions.

How It Works

The system identifies factual claims in the AI's output
Each claim is checked against the evidence chunks that were retrieved
Claims are marked as "supported" if evidence exists, "unsupported" if not
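
How claims are extracted and checked against evidence is not described here, so the sketch below covers only the final aggregation. It assumes the score is the fraction of claims marked supported, which is consistent with the "11/12 claims verified" style of output; the names `ClaimCheck` and `faithfulness_score` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ClaimCheck:
    text: str
    supported: bool  # True if retrieved evidence backs the claim


def faithfulness_score(checks: List[ClaimCheck]) -> Tuple[Optional[float], List[str]]:
    """Return (score on a 0-1 scale, texts of unsupported claims)."""
    if not checks:
        # Assumption: with no claims to check there is nothing to score.
        return None, []
    unsupported = [c.text for c in checks if not c.supported]
    score = (len(checks) - len(unsupported)) / len(checks)
    return score, unsupported
```

The list of unsupported claim texts is returned alongside the score so that it can be shown to the reviewer, as in the example output.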

Score Interpretation

Score	Interpretation	Action
95-100%	Excellent - Nearly all claims verified	High confidence in factual accuracy
85-94%	Good - Most claims verified	Review any flagged unsupported claims
70-84%	Moderate - Notable unsupported claims	Carefully verify conclusions before use
<70%	Poor - Many unverified claims	Consider regenerating or manual review

Example Output

Faithfulness: 91%
11/12 claims verified

Unsupported claims:
- "The vendor has a history of similar billing issues"

What to Do When Faithfulness Is Low

Review unsupported claims - The system lists specific claims that couldn't be verified
Check if evidence exists - The claim might be true but evidence wasn't uploaded
Add missing evidence - Upload documents that support the claims
Regenerate analysis - After adding evidence, generate a new analysis

Coverage Score

What It Measures

Coverage measures whether the AI's analysis addresses all the key aspects of the question or topic being analyzed. A high coverage score means the analysis is comprehensive; a low score indicates gaps.

How It Works

The question/topic is broken down into constituent elements
The AI's response is checked to see which elements are addressed
Elements not addressed are flagged as "gaps"

Score Interpretation

Score	Interpretation	Action
95-100%	Comprehensive - All aspects addressed	Analysis is thorough
85-94%	Good - Minor gaps only	Review identified gaps
70-84%	Partial - Notable gaps	Consider follow-up analysis on gaps
<70%	Incomplete - Major gaps	Regenerate with more specific direction

Example Output

Coverage: 80%
4/5 elements addressed

Gaps:
- No analysis of timeline discrepancies

Understanding Coverage Elements

The system identifies elements based on the question structure:

Example Question: "Did the vendor submit invoices for services that were not performed, and what is the financial impact?"

Elements identified:

Whether invoices were submitted
Whether services were performed
Connection between invoices and services
Financial impact assessment
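
As an illustration, the four elements above can be held as a mapping from element to whether the analysis addressed it. This is a sketch only: the function name `coverage_score`, the dictionary shape, and the choice of which element is left unaddressed are hypothetical, and the score is assumed to be the fraction of elements addressed.

```python
from typing import Dict, List, Optional, Tuple


def coverage_score(elements: Dict[str, bool]) -> Tuple[Optional[float], List[str]]:
    """Take element -> addressed? and return (score on a 0-1 scale, gaps)."""
    if not elements:
        # Assumption: a question with no identified elements cannot be scored.
        return None, []
    gaps = [name for name, addressed in elements.items() if not addressed]
    return (len(elements) - len(gaps)) / len(elements), gaps


# Hypothetical outcome for the example question: one element left unaddressed.
example = {
    "Whether invoices were submitted": True,
    "Whether services were performed": True,
    "Connection between invoices and services": True,
    "Financial impact assessment": False,
}
score, gaps = coverage_score(example)
```

Here three of four elements are addressed, so the score is 0.75 and the gap list contains the financial impact element.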

What to Do When Coverage Is Low

Review the gaps list - Understand what wasn't addressed
Check evidence availability - Gaps may exist because evidence is missing
Provide direction - Regenerate with specific instructions to address gaps
Split the question - Complex questions may need multiple analyses

Retrieval Quality

What It Measures

Retrieval quality measures how effectively the system found relevant evidence for the analysis. The system combines keyword and semantic search to find evidence chunks that relate to the question being analyzed.

How It Works

Nquiry uses a three-stage hybrid retrieval pipeline (NQU-379, NQU-462):

Keyword search (PostgreSQL tsvector) finds evidence by exact terms
Semantic search (Titan Embeddings V2 via pgvector) finds evidence by meaning
Reranking (Cohere Rerank 3.5) re-scores the merged results for query-time relevance

The keyword and semantic results are merged via team-draft interleaving before reranking. Statistics are calculated on the final retrieved set.
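
The merge step named above, team-draft interleaving, can be sketched as follows. This shows the general technique as it is commonly described: two ranked lists take turns contributing their best remaining item, and a coin flip decides who goes first whenever both have contributed equally. It is not Nquiry's code; the function name and chunk IDs are hypothetical, and the query-adaptive weighting is not modeled.

```python
import random
from typing import List, Optional


def team_draft_interleave(
    keyword_hits: List[str],
    semantic_hits: List[str],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Merge two ranked lists of chunk IDs into one list without duplicates."""
    rng = rng or random.Random()
    merged: List[str] = []
    seen = set()
    picks_a = picks_b = 0  # how many items each "team" has contributed
    ia = ib = 0            # cursor into each ranking

    while True:
        # Skip items that are already in the merged list.
        while ia < len(keyword_hits) and keyword_hits[ia] in seen:
            ia += 1
        while ib < len(semantic_hits) and semantic_hits[ib] in seen:
            ib += 1
        a_left = ia < len(keyword_hits)
        b_left = ib < len(semantic_hits)
        if not (a_left or b_left):
            return merged

        # The team with fewer picks goes next; ties are broken by a coin flip.
        a_turn = picks_a < picks_b or (picks_a == picks_b and rng.random() < 0.5)
        if (a_turn and a_left) or not b_left:
            pick = keyword_hits[ia]
            picks_a += 1
        else:
            pick = semantic_hits[ib]
            picks_b += 1
        merged.append(pick)
        seen.add(pick)
```

Passing a seeded `random.Random` makes the merge reproducible, which is useful when comparing retrieval runs.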

Confidence Levels

Level	Avg Similarity	Meaning
Strong	> 0.35	Highly relevant evidence found
Moderate	0.15 - 0.35	Reasonably relevant evidence
Weak	< 0.15	Limited relevance in available evidence

Note: Thresholds are calibrated for Amazon Titan Text Embeddings V2 (NQU-481/491). Titan V2 cosine similarities are naturally lower than those of other embedding models; relevant evidence averages ~0.24 similarity. If the embedding model changes, run `npm run calibrate-thresholds` to recalibrate.
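
The threshold table can be expressed as a small classifier. This is a sketch: the function name `retrieval_level` is hypothetical, and because the table does not say which level the boundary values 0.15 and 0.35 belong to, the code assumes both fall under Moderate. The thresholds are parameters so that recalibrated values can be passed in.

```python
def retrieval_level(
    avg_similarity: float,
    strong_above: float = 0.35,
    weak_below: float = 0.15,
) -> str:
    """Map an average cosine similarity to a retrieval confidence level."""
    if avg_similarity > strong_above:
        return "Strong"
    if avg_similarity >= weak_below:
        # Assumption: both boundary values count as Moderate.
        return "Moderate"
    return "Weak"
```

With the default thresholds, the ~0.24 average quoted for relevant evidence classifies as Moderate.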

Detailed Statistics

The system provides detailed retrieval statistics:

Metric	Description
Chunks retrieved	Total evidence chunks found above threshold
Chunks used	Chunks actually included in AI context (may be limited by token budget)
High relevance	Chunks with similarity > 0.85
Medium relevance	Chunks with similarity 0.70 - 0.85
Low relevance	Chunks with similarity < 0.70
Similarity range	Min and max similarity scores
Avg similarity	Mean similarity across all retrieved chunks
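
A sketch of how these statistics could be computed from per-chunk similarity scores, using the relevance bands in the table above. The dictionary keys are hypothetical and are not the actual shape of the stored `retrieval_stats` value; the boundary values 0.70 and 0.85 are assumed to fall in the medium band.

```python
from statistics import mean
from typing import Dict, List


def retrieval_stats(similarities: List[float], chunks_used: int) -> Dict[str, object]:
    """Summarize per-chunk similarity scores for one retrieval run."""
    if not similarities:
        raise ValueError("no chunks retrieved")
    return {
        "chunks_retrieved": len(similarities),
        "chunks_used": chunks_used,
        "high_relevance": sum(s > 0.85 for s in similarities),
        # Assumption: 0.70 and 0.85 themselves count as medium relevance.
        "medium_relevance": sum(0.70 <= s <= 0.85 for s in similarities),
        "low_relevance": sum(s < 0.70 for s in similarities),
        "similarity_range": (min(similarities), max(similarities)),
        "avg_similarity": mean(similarities),
    }
```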

Example Output

Retrieval: Strong
Avg similarity: 0.78

Chunks retrieved: 12
Chunks used: 10
High relevance: 3
Medium relevance: 7
Similarity range: 0.65 - 0.89

What Affects Retrieval Quality

Higher scores when:

Evidence directly discusses the question topic
Evidence uses similar terminology to the question
Sufficient evidence has been uploaded and processed

Lower scores when:

Evidence is tangentially related
Technical jargon differs between question and evidence
Limited evidence available for the topic

What to Do When Retrieval Is Weak

Add more relevant evidence - Upload documents that directly address the question
Rephrase the question - Use terminology that matches your evidence
Ensure embeddings are processed - Check that evidence has been indexed
Consider the evidence scope - Some questions may not have strong documentary evidence

Schema Validation

What It Measures

Schema validation checks whether the AI's output conforms to the expected structured format. This ensures the analysis can be properly parsed and displayed.

Validation Status

Status	Meaning
Passed	Output matches expected schema
Failed	Output structure is invalid or incomplete

Why Validation Matters

Ensures consistent analysis format
Enables structured data extraction
Supports report generation
Maintains data integrity

What to Do When Validation Fails

Review the raw output - Check for formatting issues
Regenerate the analysis - AI occasionally produces malformed output
Report persistent issues - Contact support if validation consistently fails

Using Quality Metrics in Practice

For Investigators

Before relying on an analysis:

Check the overall confidence level
Review any warnings (unsupported claims, coverage gaps)
Consider retrieval quality - weak retrieval may indicate missing evidence

When quality is low:

Don't discard the analysis entirely - it may still provide useful insights
Treat flagged items as hypotheses requiring verification
Use gaps as a guide for additional evidence collection

For Reports

Quality metrics can be included in reports to provide transparency:

Analysis Quality Assessment
---------------------------
Overall Confidence: Probable
- Faithfulness: 95% (19/20 claims verified)
- Coverage: 100% (all elements addressed)
- Evidence Retrieval: Moderate (avg similarity 0.24)

Note: One claim regarding vendor history could not be
verified against available documentation.

For Quality Assurance

Quality metrics enable systematic review:

Track metrics across analyses to identify patterns
Flag analyses below thresholds for supervisor review
Use metrics to prioritize evidence collection efforts
Monitor for degradation in AI performance over time

Technical Details

Embedding Model

Model: Amazon Titan Text Embeddings V2
Dimensions: 1024
Similarity metric: Cosine similarity

Similarity Thresholds

Analysis Type	Default Threshold
Question Analysis	0.20
Topic Analysis	0.20
Overall Summary	0.50
Gap Analysis	0.50

Context Limits

Maximum evidence context: 8,000 tokens
Chunks are included in similarity-rank order until the limit is reached
Excluded chunks are logged for audit purposes
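
The context limit can be illustrated with a greedy packing sketch. Two assumptions are made here: token counts are precomputed per chunk, and packing stops at the first chunk that would exceed the budget rather than skipping it and trying smaller ones. The names `Chunk` and `pack_context` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_CONTEXT_TOKENS = 8_000  # maximum evidence context stated above


@dataclass
class Chunk:
    chunk_id: str
    similarity: float
    tokens: int  # assumed to be precomputed by a tokenizer


def pack_context(
    chunks: List[Chunk], budget: int = MAX_CONTEXT_TOKENS
) -> Tuple[List[Chunk], List[Chunk]]:
    """Return (included, excluded) chunks, both in similarity-rank order."""
    ranked = sorted(chunks, key=lambda c: c.similarity, reverse=True)
    included: List[Chunk] = []
    excluded: List[Chunk] = []
    used = 0
    for i, chunk in enumerate(ranked):
        if used + chunk.tokens > budget:
            # Assumption: stop at the first chunk that does not fit.
            excluded = ranked[i:]
            break
        included.append(chunk)
        used += chunk.tokens
    return included, excluded
```

The excluded list is returned rather than discarded so that a caller can record it for the audit log.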

Database Fields

Quality metrics are stored in the analysis table:

Field	Type	Description
`quality_confidence`	enum	Overall confidence level
`faithfulness_score`	float	0-1 score for claim verification
`faithfulness_details`	jsonb	Claim-by-claim breakdown
`coverage_score`	float	0-1 score for question coverage
`coverage_details`	jsonb	Element-by-element breakdown
`retrieval_stats`	jsonb	Detailed retrieval statistics
`validation_passed`	boolean	Schema validation result
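
Because scores are stored on a 0-1 scale and failed quality checks store null rather than a default (NQU-394), display code has to handle missing values. The sketch below mirrors a subset of the fields above in a hypothetical `AnalysisQuality` record; the rounding rule and the "n/a" placeholder are assumptions, not documented behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnalysisQuality:
    quality_confidence: str
    faithfulness_score: Optional[float]  # 0-1, or None if the check failed
    coverage_score: Optional[float]      # 0-1, or None if the check failed
    validation_passed: bool


def as_percent(score: Optional[float]) -> str:
    """Format a 0-1 score as a percentage, or "n/a" when it is missing."""
    if score is None:
        return "n/a"
    # Assumption: round to the nearest whole percent.
    return f"{round(score * 100)}%"


def summary_lines(q: AnalysisQuality) -> List[str]:
    """Build plain-text lines suitable for a report's quality section."""
    return [
        f"Overall Confidence: {q.quality_confidence}",
        f"- Faithfulness: {as_percent(q.faithfulness_score)}",
        f"- Coverage: {as_percent(q.coverage_score)}",
    ]
```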

Frequently Asked Questions

Why is my faithfulness score low even though the evidence supports the claims?

The faithfulness check uses the retrieved evidence chunks, not all evidence. If relevant evidence wasn't retrieved (due to low similarity), claims may appear unsupported. Try:

Ensuring all evidence is processed for embeddings
Rephrasing questions to match evidence terminology
Adding evidence that directly addresses the claims

Why does coverage show gaps when I think everything was addressed?

Coverage is evaluated against a structured breakdown of the question. The AI may have addressed topics implicitly rather than explicitly. Review the specific gaps listed - sometimes rewording the question or providing direction helps.

What's a "good" retrieval similarity score?

0.85+: Excellent - evidence highly relevant
0.70-0.85: Good - evidence reasonably relevant
0.50-0.70: Fair - some relevance, may be tangential
<0.50: Low - limited direct relevance

Scores vary based on evidence type and question complexity. Technical investigations may naturally have lower scores than straightforward document reviews.

Should I regenerate every analysis with low quality scores?

Not necessarily. Consider:

Is the analysis still useful despite limitations?
Can you add evidence to address gaps?
Would human review suffice for flagged items?

Regeneration helps when: new evidence is available, the question can be rephrased, or you can provide specific direction to address gaps.

How do quality metrics affect the final report?

Quality metrics are informational - they help you understand AI reliability but don't prevent report generation. You can:

Include quality summaries in reports for transparency
Use metrics to prioritize which analyses need human review
Note limitations based on quality findings

Changelog

Date	Change
2026-02-03	Initial implementation of quality metrics UI
2026-02-03	Added faithfulness, coverage, and retrieval tracking
2026-02-03	Lowered similarity threshold to 0.2 for broader retrieval
2026-03-15	Fixed quality check error path: failed checks now return null scores instead of fabricated defaults (NQU-394). Previously, failed coverage checks returned 1.0 which inflated confidence levels.
2026-03-15	Audit logging added for quality check results: all quality scores are now logged to the audit trail with the `analysis.quality_check` action type (NQU-381).
2026-03-24	Citation repair system deployed (NQU-469): `repairCitationIds()` matches hallucinated UUIDs to real evidence IDs by title, restoring citation badges across all analyses.
2026-03-26	Quality checks fixed for non-question analysis types (NQU-474): gap analysis, error check, and summary now have proper text extraction for faithfulness/coverage checking.
2026-03-26	Chained analyses restored (NQU-474): higher-order analyses (topic, gap, error, summary) now correctly receive prior analysis context. Previously `chained_analyses` was empty.
2026-03-26	Quality label renamed: "Insufficient" → "Low Verifiability" (NQU-459), then reverted back to "Insufficient" (product-facts.md §10). Final label: Insufficient.
2026-03-26	Evidence Considered panel redesigned (NQU-486): plain-language retrieval story replaces raw pipeline diagnostics.
2026-03-26	Prerequisite check fixed (NQU-474): verified analyses now correctly counted as prerequisites for higher-order analysis types.
2026-03-27	Retrieval tuning pipeline deployed (NQU-462): benchmark runner, parameter sweep, query-adaptive weighting, team-draft interleaving, continuous retrieval feedback logging.
