Evaluation Metrics & Scoring
This section describes the available metrics, their scoring semantics, and how to interpret these signals when validating and optimizing your AI workflows.
Default Evaluation Metrics
Karini AI’s default evaluation uses built-in metrics to measure how well your prompts and agents perform. Each metric is scored on a 0–1 scale and includes additional information such as confidence and reasoning.
Standard Prompt Metrics
These metrics apply whenever the system generates a text response (for both prompts and agents).
| Metric | Applies To | What It Measures | Scale | Best For |
| --- | --- | --- | --- | --- |
| Answer Relevancy | Prompts & Agents | How well the response addresses the user's query and stays on-topic. | 0 = irrelevant, 1 = fully relevant | Detecting off-topic or incomplete answers |
| Answer Faithfulness | Prompts & Agents | Whether the response is factually correct and grounded in the given context. | 0 = unfaithful, 1 = fully faithful | Reducing hallucinations and unsupported statements |
| Answer Similarity | Prompts & Agents | Semantic similarity between the response and the ground-truth answer. | 0 = very different, 1 = semantically identical | Comparing against reference answers |
| Context Sufficiency | Prompts & Agents | Whether the provided context contains enough information to answer the question. | 0 = insufficient, 1 = fully sufficient | Debugging retrieval/RAG or missing context |
| Guideline Adherence | Prompts & Agents | How well the response follows instructions, tone, and formatting guidelines. | 0 = does not follow, 1 = fully adheres | Enforcing style guides, formatting, or policy rules |
Agent-Specific Metrics
These metrics are only used for agentic workflows involving tools and multi-step reasoning.
| Metric | Applies To | What It Measures | Scale | Best For |
| --- | --- | --- | --- | --- |
| Tool Accuracy | Agents only | Correctness of tool selection and parameter usage. | 0 = wrong/poor usage, 1 = perfect usage | Evaluating how well agents use tools/APIs |
| Goal Achievement | Agents only | How well the agent accomplishes the intended objective or task. | 0 = goal not met, 1 = goal fully achieved | Validating multi-step workflow success |
| Execution Efficiency | Agents only | Efficiency of the agent's execution path (steps, redundancy, resource use). | 0 = very inefficient, 1 = highly efficient | Optimizing cost, latency, and unnecessary actions |
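To make the 0–1 scale concrete, the snippet below sketches what a single evaluated agent test case might look like once all applicable metrics have been scored. The dictionary shape, metric keys, and values are illustrative assumptions, not the platform's actual output format.

```python
# Hypothetical scores for one agent test case (illustrative values only).
# Standard prompt metrics and agent-specific metrics share the same 0-1 scale.
agent_test_case_scores = {
    "answer_relevancy": 0.92,      # response stays on-topic
    "answer_faithfulness": 0.88,   # grounded in the retrieved context
    "answer_similarity": 0.81,     # close to the reference answer
    "context_sufficiency": 0.75,   # context mostly covers the question
    "guideline_adherence": 0.95,   # follows tone/format instructions
    "tool_accuracy": 0.67,         # one tool call used a wrong parameter
    "goal_achievement": 0.90,      # task completed
    "execution_efficiency": 0.60,  # redundant steps increased cost/latency
}

# A simple triage rule: flag any metric below a chosen threshold for review.
THRESHOLD = 0.7
flagged = {name: s for name, s in agent_test_case_scores.items() if s < THRESHOLD}
print(flagged)  # {'tool_accuracy': 0.67, 'execution_efficiency': 0.6}
```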
Metric Output Fields
Each metric returns structured information for every test case.
| Field | Description | Range |
| --- | --- | --- |
| Score | Numerical rating assigned by the metric. | 0.0 – 1.0 |
| Confidence | How confident the evaluator is in the assigned score. | 0.0 – 1.0 |
| Reasoning | Text explanation describing why the score was given. | Free text |
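As a rough illustration of these three fields, the sketch below models a single metric result as a small Python dataclass. The field names mirror the table; the class itself is an assumption for illustration, not a type exposed by the platform.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    """One metric's output for a single test case (illustrative shape)."""
    score: float       # 0.0 - 1.0 rating assigned by the metric
    confidence: float  # 0.0 - 1.0 evaluator confidence in that score
    reasoning: str     # free-text explanation for the score

# Example: a hypothetical faithfulness result for one test case (values made up).
result = MetricResult(
    score=0.85,
    confidence=0.92,
    reasoning="All claims are supported by the retrieved context except one minor date.",
)
```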
Confidence Levels
| Confidence | Level | Interpretation |
| --- | --- | --- |
| 0.8 – 1.0 | High | Strong, reliable signal; can usually be trusted. |
| 0.5 – 0.79 | Medium | Useful guidance; review edge or critical cases. |
| 0.0 – 0.49 | Low | Uncertain; manual review recommended. |
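A small helper like the one below (a sketch, not part of any SDK) shows how these bands translate into code, for example when deciding which results to route to manual review.

```python
def confidence_level(confidence: float) -> str:
    """Map a 0-1 confidence value to the High / Medium / Low bands above."""
    if confidence >= 0.8:
        return "High"    # strong, reliable signal
    if confidence >= 0.5:
        return "Medium"  # useful guidance; review edge or critical cases
    return "Low"         # uncertain; manual review recommended

assert confidence_level(0.93) == "High"
assert confidence_level(0.62) == "Medium"
assert confidence_level(0.31) == "Low"
```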
Aggregate Metrics
| Statistic | Description | What It Tells You |
| --- | --- | --- |
| Mean | Average score across all test cases. | Overall performance level. |
| Median | Middle score value across all test cases. | Robust central tendency, less affected by outliers. |
| Standard Deviation | Variation of scores around the mean. | Measures consistency (lower = more stable). |
| Min / Max | Lowest and highest scores observed in the dataset. | Shows best-case and worst-case behavior. |
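These aggregates are standard descriptive statistics over the per-test-case scores. The sketch below computes them with Python's statistics module over a hypothetical list of scores for a single metric.

```python
import statistics

# Hypothetical per-test-case scores for one metric (e.g., Answer Relevancy).
scores = [0.92, 0.85, 0.78, 0.95, 0.40, 0.88]

aggregates = {
    "mean": statistics.mean(scores),      # overall performance level
    "median": statistics.median(scores),  # robust to outliers like 0.40
    "std_dev": statistics.stdev(scores),  # consistency (lower = more stable)
    "min": min(scores),                   # worst-case behavior
    "max": max(scores),                   # best-case behavior
}
print(aggregates)
```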
Creating Custom Metrics
Custom metrics enable you to extend the default evaluation framework with criteria tailored to your specific domain and quality requirements. This allows you to formalize what “good” output means for your organization and have an LLM systematically apply those rules during evaluation.
* **Custom Prompt**: Define an evaluation prompt that receives the model or agent output and the ground truth, and returns a structured score.
* **Domain-Specific**: Encode industry- or use-case-specific guidelines (e.g., regulatory constraints, brand voice, safety rules).
* **Model Selection**: Choose which LLM will act as the evaluator, balancing accuracy, latency, and cost.
* **Flexible Scoring**: Design a custom scoring rubric and criteria to align with internal review or compliance processes.
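The sketch below illustrates the general LLM-as-judge pattern behind a custom metric: an evaluation prompt embeds your rubric, the chosen evaluator model returns a structured result, and that result is parsed into the same score / confidence / reasoning shape described above. The `call_llm` callable, the rubric text, and the prompt wording are assumptions for illustration only; they are not Karini AI's actual API.

```python
import json
from typing import Callable

# Hypothetical rubric for a domain-specific custom metric (illustrative only).
RUBRIC = """Score the response for regulatory-compliance tone on a 0-1 scale:
1.0 = fully compliant wording, 0.5 = minor issues, 0.0 = non-compliant.
Return JSON with keys: score, confidence, reasoning."""

def evaluate_custom_metric(
    output: str,
    ground_truth: str,
    call_llm: Callable[[str], str],  # assumed helper: prompt in, raw model text out
) -> dict:
    """Run one custom-metric evaluation using an LLM-as-judge prompt."""
    prompt = (
        f"{RUBRIC}\n\n"
        f"Ground truth:\n{ground_truth}\n\n"
        f"Response to evaluate:\n{output}\n"
    )
    raw = call_llm(prompt)    # evaluator model chosen for accuracy/latency/cost
    result = json.loads(raw)  # expected: {"score": ..., "confidence": ..., "reasoning": ...}
    result["score"] = max(0.0, min(1.0, float(result["score"])))  # clamp to the 0-1 scale
    return result
```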