AI Analysis Quality Metrics
Last Updated: March 27, 2026 (NQU-401 review)
Overview
Nquiry includes a comprehensive quality metrics system that evaluates the reliability and completeness of AI-generated analyses. These metrics help users understand how much confidence to place in AI outputs and identify areas that may need additional review or evidence.
Every AI analysis is evaluated across four dimensions:
- Faithfulness - Are the AI's claims grounded in the evidence?
- Coverage - Does the analysis address all aspects of the question?
- Retrieval Quality - How relevant was the evidence found?
- Overall Confidence - A holistic quality assessment
Quality Confidence Levels
The system assigns one of four confidence levels to each analysis:
| Level | Color | Meaning |
|---|---|---|
| Established | Green | High quality across all metrics. Claims are verified, questions fully addressed, strong evidence retrieval. |
| Probable | Blue | Good quality with minor gaps. Most claims verified, question mostly addressed, adequate evidence. |
| Possible | Amber | Acceptable quality but notable limitations. Some unverified claims or coverage gaps identified. |
| Insufficient | Gray | Below quality thresholds. Evidence too limited to support reliable conclusions. |
How Confidence Is Calculated
The overall confidence is derived from the combination of individual metrics:
Established: faithfulness ≥ 95% AND coverage ≥ 95% AND retrieval = strong AND validation passed
Probable: faithfulness ≥ 85% AND coverage ≥ 85% AND retrieval ≥ moderate
Possible: faithfulness ≥ 70% OR coverage ≥ 70%
Insufficient: Below all thresholds or validation failed
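The rules above can be read as a first-match cascade, checked from the strictest level down. The TypeScript below is a minimal sketch under that assumption. The names `deriveConfidence` and `QualityInputs`, the 0-1 score scale, and the choice to treat a null score (a failed quality check) as 0 are illustrative assumptions, not Nquiry's actual implementation.

```typescript
type RetrievalLevel = "weak" | "moderate" | "strong";
type Confidence = "Established" | "Probable" | "Possible" | "Insufficient";

// Hypothetical input shape; scores are fractions in [0, 1].
interface QualityInputs {
  faithfulness: number | null; // null when the check itself failed
  coverage: number | null;
  retrieval: RetrievalLevel;
  validationPassed: boolean;
}

function deriveConfidence(q: QualityInputs): Confidence {
  // A failed schema validation maps straight to Insufficient.
  if (!q.validationPassed) return "Insufficient";

  // Assumption: a missing score counts as below every threshold.
  const f = q.faithfulness ?? 0;
  const c = q.coverage ?? 0;

  if (f >= 0.95 && c >= 0.95 && q.retrieval === "strong") return "Established";
  if (f >= 0.85 && c >= 0.85 && q.retrieval !== "weak") return "Probable";
  if (f >= 0.7 || c >= 0.7) return "Possible";
  return "Insufficient";
}
```

With these rules, an analysis scoring 95% faithfulness and 100% coverage but only moderate retrieval lands at Probable, because Established additionally requires strong retrieval.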
Faithfulness Score
What It Measures
Faithfulness measures whether the factual claims made in the AI's analysis are actually supported by the retrieved evidence. This is critical for ensuring the AI isn't "hallucinating" or making unsupported assertions.
How It Works
- The system identifies factual claims in the AI's output
- Each claim is checked against the evidence chunks that were retrieved
- Claims are marked as "supported" if evidence exists, "unsupported" if not
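As a rough illustration of the last step, the score can be thought of as the fraction of claims marked "supported". The sketch below assumes that formula; `ClaimCheck` and `faithfulnessScore` are hypothetical names, and claim extraction and evidence matching are out of scope here.

```typescript
// Hypothetical result of checking one claim against the retrieved chunks.
interface ClaimCheck {
  claim: string;
  supported: boolean;
}

function faithfulnessScore(checks: ClaimCheck[]): { score: number | null; unsupported: string[] } {
  // With no claims there is nothing to verify, so no score is reported.
  if (checks.length === 0) return { score: null, unsupported: [] };

  const unsupported = checks.filter((c) => !c.supported).map((c) => c.claim);
  const score = (checks.length - unsupported.length) / checks.length;
  return { score, unsupported };
}
```

The `unsupported` list corresponds to the flagged claims a reviewer would see alongside the score.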
Score Interpretation
| Score | Interpretation | Action |
|---|---|---|
| 95-100% | Excellent - Nearly all claims verified | High confidence in factual accuracy |
| 85-94% | Good - Most claims verified | Review any flagged unsupported claims |
| 70-84% | Moderate - Notable unsupported claims | Carefully verify conclusions before use |
| <70% | Poor - Many unverified claims | Consider regenerating or manual review |
Example Output
Faithfulness: 91%
11/12 claims verified
Unsupported claims:
- "The vendor has a history of similar billing issues"
What to Do When Faithfulness Is Low
- Review unsupported claims - The system lists specific claims that couldn't be verified
- Check if evidence exists - The claim might be true but evidence wasn't uploaded
- Add missing evidence - Upload documents that support the claims
- Regenerate analysis - After adding evidence, generate a new analysis
Coverage Score
What It Measures
Coverage measures whether the AI's analysis addresses all the key aspects of the question or topic being analyzed. A high coverage score means the analysis is comprehensive; a low score indicates gaps.
How It Works
- The question/topic is broken down into constituent elements
- The AI's response is checked to see which elements are addressed
- Elements not addressed are flagged as "gaps"
Score Interpretation
| Score | Interpretation | Action |
|---|---|---|
| 95-100% | Comprehensive - All aspects addressed | Analysis is thorough |
| 85-94% | Good - Minor gaps only | Review identified gaps |
| 70-84% | Partial - Notable gaps | Consider follow-up analysis on gaps |
| <70% | Incomplete - Major gaps | Regenerate with more specific direction |
Example Output
Coverage: 80%
4/5 elements addressed
Gaps:
- No analysis of timeline discrepancies
Understanding Coverage Elements
The system identifies elements based on the question structure:
Example Question: "Did the vendor submit invoices for services that were not performed, and what is the financial impact?"
Elements identified:
- Whether invoices were submitted
- Whether services were performed
- Connection between invoices and services
- Financial impact assessment
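Using the four elements above, a coverage result can be sketched as the share of elements addressed plus the list of gaps. This is an illustrative sketch only: `CoverageElement` and `coverageResult` are hypothetical names, and how the system decides that an element is "addressed" is not modeled.

```typescript
// Hypothetical record for one element of the question breakdown.
interface CoverageElement {
  element: string;
  addressed: boolean;
}

function coverageResult(elements: CoverageElement[]): { score: number | null; gaps: string[] } {
  if (elements.length === 0) return { score: null, gaps: [] };
  const gaps = elements.filter((e) => !e.addressed).map((e) => e.element);
  return { score: (elements.length - gaps.length) / elements.length, gaps };
}

// Example: the analysis covered everything except the financial impact.
const example = coverageResult([
  { element: "Whether invoices were submitted", addressed: true },
  { element: "Whether services were performed", addressed: true },
  { element: "Connection between invoices and services", addressed: true },
  { element: "Financial impact assessment", addressed: false },
]);
console.log(example.score, example.gaps);
```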
What to Do When Coverage Is Low
- Review the gaps list - Understand what wasn't addressed
- Check evidence availability - Gaps may exist because evidence is missing
- Provide direction - Regenerate with specific instructions to address gaps
- Split the question - Complex questions may need multiple analyses
Retrieval Quality
What It Measures
Retrieval quality measures how effectively the system found relevant evidence for the analysis. The system combines keyword and semantic search to find evidence chunks that relate to the question being analyzed.
How It Works
Nquiry uses a three-stage hybrid retrieval pipeline (NQU-379, NQU-462):
- Keyword search (PostgreSQL tsvector) finds evidence by exact terms
- Semantic search (Titan Embeddings V2 via pgvector) finds evidence by meaning
- Reranking (Cohere Rerank 3.5) re-scores merged results for query-time relevance
Keyword and semantic results are combined via team-draft interleaving before reranking. Statistics are calculated on the final retrieved set.
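To make the merge step concrete, here is a minimal sketch of the general team-draft interleaving technique applied to two ranked lists of chunk IDs. It is not Nquiry's implementation: `teamDraftInterleave`, its parameters, and the injectable `coinFlip` are assumptions made for illustration. The team with fewer picks so far chooses next, a coin flip breaks ties, and each pick is that team's highest-ranked item not already taken.

```typescript
function teamDraftInterleave(
  keywordRanked: string[],
  semanticRanked: string[],
  limit: number,
  coinFlip: () => boolean = () => Math.random() < 0.5,
): string[] {
  const merged: string[] = [];
  const taken = new Set<string>();
  let ia = 0; // cursor into keywordRanked
  let ib = 0; // cursor into semanticRanked
  let picksA = 0;
  let picksB = 0;

  // Move a cursor forward past items the other team already picked.
  const skipTaken = (list: string[], i: number): number => {
    while (i < list.length && taken.has(list[i])) i++;
    return i;
  };

  while (merged.length < limit) {
    ia = skipTaken(keywordRanked, ia);
    ib = skipTaken(semanticRanked, ib);
    const aLeft = ia < keywordRanked.length;
    const bLeft = ib < semanticRanked.length;
    if (!aLeft && !bLeft) break;

    // Team with fewer picks goes next; ties are settled by the coin flip.
    const pickA =
      !bLeft || (aLeft && (picksA < picksB || (picksA === picksB && coinFlip())));
    const id = pickA ? keywordRanked[ia] : semanticRanked[ib];
    merged.push(id);
    taken.add(id);
    if (pickA) picksA++;
    else picksB++;
  }
  return merged;
}
```

Passing a fixed `coinFlip` makes the merge deterministic, which is convenient for tests. In this sketch the merged list would then be handed to the reranker.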
Confidence Levels
| Level | Avg Similarity | Meaning |
|---|---|---|
| Strong | > 0.35 | Highly relevant evidence found |
| Moderate | 0.15 - 0.35 | Reasonably relevant evidence |
| Weak | < 0.15 | Limited relevance in available evidence |
Note: Thresholds are calibrated for Amazon Titan Text Embeddings V2 (NQU-481/491). Titan V2 cosine similarities are naturally lower than those of other embedding models; relevant evidence averages ~0.24 similarity. If the embedding model changes, re-run `npm run calibrate-thresholds` to recalibrate.
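The level table above maps directly to a small classifier. The sketch below is illustrative: `retrievalLevel` is a hypothetical name, and the two cut-offs are parameters so they can follow a recalibration. Values exactly on a boundary (0.15 or 0.35) fall into Moderate, matching the table's ranges.

```typescript
type RetrievalConfidence = "Strong" | "Moderate" | "Weak";

function retrievalLevel(
  avgSimilarity: number,
  strongAbove = 0.35, // "Strong" is strictly above this value
  weakBelow = 0.15, // "Weak" is strictly below this value
): RetrievalConfidence {
  if (avgSimilarity > strongAbove) return "Strong";
  if (avgSimilarity < weakBelow) return "Weak";
  return "Moderate";
}
```

With the documented thresholds, the typical Titan V2 average of about 0.24 for relevant evidence classifies as Moderate.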
Detailed Statistics
The system provides detailed retrieval statistics:
| Metric | Description |
|---|---|
| Chunks retrieved | Total evidence chunks found above threshold |
| Chunks used | Chunks actually included in AI context (may be limited by token budget) |
| High relevance | Chunks with similarity > 0.85 |
| Medium relevance | Chunks with similarity 0.70 - 0.85 |
| Low relevance | Chunks with similarity < 0.70 |
| Similarity range | Min and max similarity scores |
| Avg similarity | Mean similarity across all retrieved chunks |
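The statistics in this table are simple aggregates over per-chunk similarity scores. A minimal sketch follows, assuming the input is just an array of similarities. `summarizeRetrieval` and `RetrievalSummary` are hypothetical names, and the relevance cut-offs are taken from the table above but passed as parameters so they can change with calibration.

```typescript
interface RetrievalSummary {
  chunksRetrieved: number;
  highRelevance: number;
  mediumRelevance: number;
  lowRelevance: number;
  minSimilarity: number | null;
  maxSimilarity: number | null;
  avgSimilarity: number | null;
}

function summarizeRetrieval(
  similarities: number[],
  highAbove = 0.85,
  mediumFrom = 0.7,
): RetrievalSummary {
  const n = similarities.length;
  if (n === 0) {
    // Nothing retrieved: counts are zero and the range/mean are undefined.
    return {
      chunksRetrieved: 0, highRelevance: 0, mediumRelevance: 0, lowRelevance: 0,
      minSimilarity: null, maxSimilarity: null, avgSimilarity: null,
    };
  }
  const sum = similarities.reduce((acc, s) => acc + s, 0);
  return {
    chunksRetrieved: n,
    highRelevance: similarities.filter((s) => s > highAbove).length,
    mediumRelevance: similarities.filter((s) => s >= mediumFrom && s <= highAbove).length,
    lowRelevance: similarities.filter((s) => s < mediumFrom).length,
    minSimilarity: Math.min(...similarities),
    maxSimilarity: Math.max(...similarities),
    avgSimilarity: sum / n,
  };
}
```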
Example Output
Retrieval: Strong
Avg similarity: 0.78
Chunks retrieved: 12
Chunks used: 10
High relevance: 3
Medium relevance: 7
Similarity range: 0.65 - 0.89
What Affects Retrieval Quality
Higher scores when:
- Evidence directly discusses the question topic
- Evidence uses similar terminology to the question
- Sufficient evidence has been uploaded and processed
Lower scores when:
- Evidence is tangentially related
- Technical jargon differs between question and evidence
- Limited evidence available for the topic
What to Do When Retrieval Is Weak
- Add more relevant evidence - Upload documents that directly address the question
- Rephrase the question - Use terminology that matches your evidence
- Ensure embeddings are processed - Check that evidence has been indexed
- Consider the evidence scope - Some questions may not have strong documentary evidence
Schema Validation
What It Measures
Schema validation checks whether the AI's output conforms to the expected structured format. This ensures the analysis can be properly parsed and displayed.
Validation Status
| Status | Meaning |
|---|---|
| Passed | Output matches expected schema |
| Failed | Output structure is invalid or incomplete |
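As an illustration of what a pass/fail check of this kind can look like, the sketch below parses raw model output as JSON and verifies a few required fields. The expected shape (a `summary` string plus a `findings` array whose items have a `claim` string and an `evidenceIds` array) is entirely hypothetical, as is the name `validateAnalysis`. Nquiry's actual schema is not described in this document.

```typescript
function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Hypothetical expected shape:
// { summary: string, findings: { claim: string, evidenceIds: string[] }[] }
function validateAnalysis(raw: string): { passed: boolean; errors: string[] } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { passed: false, errors: ["output is not valid JSON"] };
  }
  if (!isRecord(parsed)) {
    return { passed: false, errors: ["top-level value is not an object"] };
  }

  const errors: string[] = [];
  if (typeof parsed.summary !== "string") errors.push("summary must be a string");
  if (!Array.isArray(parsed.findings)) {
    errors.push("findings must be an array");
  } else {
    parsed.findings.forEach((f: unknown, i: number) => {
      if (!isRecord(f) || typeof f.claim !== "string" || !Array.isArray(f.evidenceIds)) {
        errors.push(`findings[${i}] is malformed`);
      }
    });
  }
  return { passed: errors.length === 0, errors };
}
```

Collecting every error rather than stopping at the first makes the "review the raw output" step easier when validation fails.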
Why Validation Matters
- Ensures consistent analysis format
- Enables structured data extraction
- Supports report generation
- Maintains data integrity
What to Do When Validation Fails
- Review the raw output - Check for formatting issues
- Regenerate the analysis - AI occasionally produces malformed output
- Report persistent issues - Contact support if validation consistently fails
Using Quality Metrics in Practice
For Investigators
Before relying on an analysis:
- Check the overall confidence level
- Review any warnings (unsupported claims, coverage gaps)
- Consider retrieval quality - weak retrieval may indicate missing evidence
When quality is low:
- Don't discard the analysis entirely - it may still provide useful insights
- Treat flagged items as hypotheses requiring verification
- Use gaps as a guide for additional evidence collection
For Reports
Quality metrics can be included in reports to provide transparency:
Analysis Quality Assessment
---------------------------
Overall Confidence: Probable
- Faithfulness: 95% (19/20 claims verified)
- Coverage: 100% (all elements addressed)
- Evidence Retrieval: Moderate (avg similarity 0.76)
Note: One claim regarding vendor history could not be
verified against available documentation.
For Quality Assurance
Quality metrics enable systematic review:
- Track metrics across analyses to identify patterns
- Flag analyses below thresholds for supervisor review
- Use metrics to prioritize evidence collection efforts
- Monitor for degradation in AI performance over time
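Flagging analyses for supervisor review can be sketched as a simple filter. Everything here is an assumption for illustration: the `QualitySnapshot` shape, the `flagForReview` name, and the 0.85 cut-off (borrowed from the Probable threshold). A null score, meaning the quality check itself failed, is flagged as well.

```typescript
// Hypothetical per-analysis snapshot; scores are fractions in [0, 1].
interface QualitySnapshot {
  analysisId: string;
  faithfulness: number | null;
  coverage: number | null;
  validationPassed: boolean;
}

function flagForReview(items: QualitySnapshot[], minScore = 0.85): string[] {
  return items
    .filter(
      (a) =>
        !a.validationPassed ||
        a.faithfulness === null ||
        a.coverage === null ||
        a.faithfulness < minScore ||
        a.coverage < minScore,
    )
    .map((a) => a.analysisId);
}
```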
Technical Details
Embedding Model
- Model: Amazon Titan Embeddings V2
- Dimensions: 1024
- Similarity metric: Cosine similarity
Similarity Thresholds
| Analysis Type | Default Threshold |
|---|---|
| Question Analysis | 0.20 |
| Topic Analysis | 0.20 |
| Overall Summary | 0.50 |
| Gap Analysis | 0.50 |
Context Limits
- Maximum evidence context: 8,000 tokens
- Chunks are included by similarity rank until limit reached
- Excluded chunks are logged for audit purposes
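The budgeting rule above can be sketched as a greedy fill by similarity rank. This is a sketch under stated assumptions: `Chunk`, `packContext`, and the per-chunk `tokens` count are hypothetical, and the code assumes packing stops at the first chunk that would exceed the budget rather than skipping it and trying smaller ones.

```typescript
interface Chunk {
  id: string;
  similarity: number;
  tokens: number; // assumed to be pre-computed by a tokenizer
}

function packContext(chunks: Chunk[], maxTokens = 8000): { used: Chunk[]; excluded: Chunk[] } {
  // Highest similarity first; copy so the caller's array is not mutated.
  const ranked = [...chunks].sort((x, y) => y.similarity - x.similarity);
  const used: Chunk[] = [];
  let total = 0;
  let i = 0;
  for (; i < ranked.length; i++) {
    if (total + ranked[i].tokens > maxTokens) break; // budget reached
    used.push(ranked[i]);
    total += ranked[i].tokens;
  }
  // Everything from the cut-off onward is excluded (and would be logged).
  return { used, excluded: ranked.slice(i) };
}
```

The difference between `chunks.length` and `used.length` corresponds to the gap between "Chunks retrieved" and "Chunks used" in the retrieval statistics.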
Database Fields
Quality metrics are stored in the analysis table:
| Field | Type | Description |
|---|---|---|
| `quality_confidence` | enum | Overall confidence level |
| `faithfulness_score` | float | 0-1 score for claim verification |
| `faithfulness_details` | jsonb | Claim-by-claim breakdown |
| `coverage_score` | float | 0-1 score for question coverage |
| `coverage_details` | jsonb | Element-by-element breakdown |
| `retrieval_stats` | jsonb | Detailed retrieval statistics |
| `validation_passed` | boolean | Schema validation result |
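For code that reads these fields, a typed view of the row might look like the sketch below. The field names come from the table above. The enum values (assumed to match the display labels), the nullable scores (following the changelog note that failed checks return null), the `unknown` typing of the jsonb columns, and the helper names are assumptions. Rounding to whole percentages is also an assumption about display.

```typescript
type QualityConfidence = "Established" | "Probable" | "Possible" | "Insufficient";

interface AnalysisQualityRow {
  quality_confidence: QualityConfidence;
  faithfulness_score: number | null; // 0-1
  faithfulness_details: unknown; // jsonb, shape not documented here
  coverage_score: number | null; // 0-1
  coverage_details: unknown;
  retrieval_stats: unknown;
  validation_passed: boolean;
}

// Convert a stored 0-1 score to a display percentage.
function asPercent(score: number | null): string {
  return score === null ? "n/a" : `${Math.round(score * 100)}%`;
}

function qualitySummaryLines(row: AnalysisQualityRow): string[] {
  return [
    `Overall Confidence: ${row.quality_confidence}`,
    `- Faithfulness: ${asPercent(row.faithfulness_score)}`,
    `- Coverage: ${asPercent(row.coverage_score)}`,
    `- Schema validation: ${row.validation_passed ? "Passed" : "Failed"}`,
  ];
}
```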
Frequently Asked Questions
Why is my faithfulness score low even though the evidence supports the claims?
The faithfulness check uses the retrieved evidence chunks, not all evidence. If relevant evidence wasn't retrieved (due to low similarity), claims may appear unsupported. Try:
- Ensuring all evidence is processed for embeddings
- Rephrasing questions to match evidence terminology
- Adding evidence that directly addresses the claims
Why does coverage show gaps when I think everything was addressed?
Coverage is evaluated against a structured breakdown of the question. The AI may have addressed topics implicitly rather than explicitly. Review the specific gaps listed - sometimes rewording the question or providing direction helps.
What's a "good" retrieval similarity score?
- > 0.35: Strong - highly relevant evidence found
- 0.15-0.35: Moderate - reasonably relevant evidence
- < 0.15: Weak - limited relevance in available evidence
These bands are calibrated for Titan Embeddings V2, where relevant evidence averages ~0.24 similarity.
Scores vary based on evidence type and question complexity. Technical investigations may naturally have lower scores than straightforward document reviews.
Should I regenerate every analysis with low quality scores?
Not necessarily. Consider:
- Is the analysis still useful despite limitations?
- Can you add evidence to address gaps?
- Would human review suffice for flagged items?
Regeneration helps when: new evidence is available, the question can be rephrased, or you can provide specific direction to address gaps.
How do quality metrics affect the final report?
Quality metrics are informational - they help you understand AI reliability but don't prevent report generation. You can:
- Include quality summaries in reports for transparency
- Use metrics to prioritize which analyses need human review
- Note limitations based on quality findings
Changelog
| Date | Change |
|---|---|
| 2026-02-03 | Initial implementation of quality metrics UI |
| 2026-02-03 | Added faithfulness, coverage, and retrieval tracking |
| 2026-02-03 | Lowered similarity threshold to 0.2 for broader retrieval |
| 2026-03-15 | Fixed quality check error path: failed checks now return null scores instead of fabricated defaults (NQU-394). Previously, failed coverage checks returned 1.0 which inflated confidence levels. |
| 2026-03-15 | Audit logging added for quality check results: all quality scores are now logged to the audit trail with the `analysis.quality_check` action type (NQU-381). |
| 2026-03-24 | Citation repair system deployed (NQU-469): `repairCitationIds()` matches hallucinated UUIDs to real evidence IDs by title, restoring citation badges across all analyses. |
| 2026-03-26 | Quality checks fixed for non-question analysis types (NQU-474): gap analysis, error check, and summary now have proper text extraction for faithfulness/coverage checking. |
| 2026-03-26 | Chained analyses restored (NQU-474): higher-order analyses (topic, gap, error, summary) now correctly receive prior analysis context. Previously `chained_analyses` was empty. |
| 2026-03-26 | Quality label renamed: "Insufficient" → "Low Verifiability" (NQU-459), then reverted back to "Insufficient" (product-facts.md §10). Final label: Insufficient. |
| 2026-03-26 | Evidence Considered panel redesigned (NQU-486): plain-language retrieval story replaces raw pipeline diagnostics. |
| 2026-03-26 | Prerequisite check fixed (NQU-474): verified analyses now correctly counted as prerequisites for higher-order analysis types. |
| 2026-03-27 | Retrieval tuning pipeline deployed (NQU-462): benchmark runner, parameter sweep, query-adaptive weighting, team-draft interleaving, continuous retrieval feedback logging. |