Evaluation Metrics & Scoring

This section describes the available metrics, their scoring semantics, and how to interpret these signals when validating and optimizing your AI workflows.

Default Evaluation Metrics

Karini AI’s default evaluation uses built-in metrics to measure how well your prompts and agents perform. Each metric is scored on a 0–1 scale and includes additional information such as confidence and reasoning.

Standard Prompt Metrics

These metrics apply whenever the system generates a text response (for both prompts and agents).

| Metric | Applies To | What It Measures | Score Range | Typical Use Case |
| --- | --- | --- | --- | --- |
| Answer Relevancy | Prompts & Agents | How well the response addresses the user’s query and stays on-topic. | 0 = irrelevant, 1 = fully relevant | Detecting off-topic or incomplete answers |
| Answer Faithfulness | Prompts & Agents | Whether the response is factually correct and grounded in the given context. | 0 = unfaithful, 1 = fully faithful | Reducing hallucinations and unsupported statements |
| Answer Similarity | Prompts & Agents | Semantic similarity between the response and the ground-truth answer. | 0 = very different, 1 = semantically identical | Comparing against reference answers |
| Context Sufficiency | Prompts & Agents | Whether the provided context contains enough information to answer the question. | 0 = insufficient, 1 = fully sufficient | Debugging retrieval/RAG pipelines or missing context |
| Guideline Adherence | Prompts & Agents | How well the response follows instructions, tone, and formatting guidelines. | 0 = does not follow, 1 = fully adheres | Enforcing style guides, formatting, or policy rules |
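
In practice, these scores are most useful as regression gates over a test set. The sketch below assumes per-test-case results are available as plain dictionaries; the field names and thresholds are illustrative stand-ins, not Karini AI’s actual API.

```python
# Hypothetical shape of per-test-case results; real field names may differ.
results = [
    {"metric": "answer_relevancy", "score": 0.91},
    {"metric": "answer_faithfulness", "score": 0.62},
    {"metric": "guideline_adherence", "score": 0.88},
]

# Per-metric quality bars chosen for this example.
THRESHOLDS = {"answer_relevancy": 0.8, "answer_faithfulness": 0.8}

# Flag any metric that falls below its bar.
for r in results:
    floor = THRESHOLDS.get(r["metric"])
    if floor is not None and r["score"] < floor:
        print(f"FAIL: {r['metric']} scored {r['score']:.2f} (< {floor})")
```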

Agent-Specific Metrics

These metrics are only used for agentic workflows involving tools and multi-step reasoning.

| Metric | Applies To | What It Measures | Score Range | Typical Use Case |
| --- | --- | --- | --- | --- |
| Tool Accuracy | Agents only | Correctness of tool selection and parameter usage. | 0 = wrong/poor usage, 1 = perfect usage | Evaluating how well agents use tools/APIs |
| Goal Achievement | Agents only | How well the agent accomplishes the intended objective or task. | 0 = goal not met, 1 = goal fully achieved | Validating multi-step workflow success |
| Execution Efficiency | Agents only | Efficiency of the agent’s execution path (steps, redundancy, resource use). | 0 = very inefficient, 1 = highly efficient | Optimizing cost, latency, and unnecessary actions |

Metric Output Fields

Each metric returns structured information for every test case.

| Field | Description | Range / Type |
| --- | --- | --- |
| Score | Numerical rating assigned by the metric. | 0.0 – 1.0 |
| Confidence | How confident the evaluator is in the assigned score. | 0.0 – 1.0 |
| Reasoning | Text explanation describing why the score was given. | Free text |
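
Concretely, a single metric result has the shape sketched below. This is a hypothetical representation that mirrors the fields in the table above, not the platform’s exact payload.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    """One metric's output for a single test case (fields per the table above)."""
    score: float       # 0.0 - 1.0 numerical rating
    confidence: float  # 0.0 - 1.0 evaluator confidence in the score
    reasoning: str     # free-text explanation of why the score was given

example = MetricResult(
    score=0.85,
    confidence=0.92,
    reasoning="The response directly answers the question and cites the context.",
)
```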

Confidence Levels

| Confidence Range | Level | Interpretation |
| --- | --- | --- |
| 0.8 – 1.0 | High | Strong, reliable signal; can usually be trusted. |
| 0.5 – 0.79 | Medium | Useful guidance; review edge or critical cases. |
| 0.0 – 0.49 | Low | Uncertain; manual review recommended. |
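
These bands translate directly into a triage rule. A minimal sketch, assuming the thresholds above:

```python
def confidence_level(confidence: float) -> str:
    """Map an evaluator confidence score to the bands in the table above."""
    if confidence >= 0.8:
        return "High"
    if confidence >= 0.5:
        return "Medium"
    return "Low"

# High can usually be trusted; Low should be routed to manual review.
assert confidence_level(0.92) == "High"
assert confidence_level(0.65) == "Medium"
assert confidence_level(0.30) == "Low"
```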

Aggregate Metrics

| Metric | Description | Purpose |
| --- | --- | --- |
| Mean | Average score across all test cases. | Overall performance level. |
| Median | Middle score value across all test cases. | Robust central tendency, less affected by outliers. |
| Standard Deviation | Variation of scores around the mean. | Measures consistency (lower = more stable). |
| Min / Max | Lowest and highest scores observed in the dataset. | Shows best-case and worst-case behavior. |
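
These aggregates can be reproduced with Python’s standard library. A sketch over a list of per-test-case scores (the scores themselves are made up for illustration):

```python
import statistics

scores = [0.82, 0.91, 0.45, 0.88, 0.79]  # example per-test-case scores

mean = statistics.mean(scores)      # overall performance level
median = statistics.median(scores)  # robust central tendency
stdev = statistics.stdev(scores)    # consistency (lower = more stable)
lo, hi = min(scores), max(scores)   # worst-case / best-case behavior

print(f"mean={mean:.2f} median={median:.2f} stdev={stdev:.2f} min={lo} max={hi}")
```

Note the effect of the outlier (0.45): the median stays high while the mean drops and the standard deviation grows, which is exactly the inconsistency signal these aggregates are meant to surface.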

Creating Custom Metrics

Custom metrics enable you to extend the default evaluation framework with criteria tailored to your specific domain and quality requirements. This allows you to formalize what “good” output means for your organization and have an LLM systematically apply those rules during evaluation.

  • Custom Prompt: Define an evaluation prompt that receives the model or agent output and the ground truth, and returns a structured score (see the sketch after this list).

  • Domain-Specific: Encode industry- or use case–specific guidelines (e.g., regulatory constraints, brand voice, safety rules).

  • Model Selection: Choose which LLM will act as the evaluator, balancing accuracy, latency, and cost.

  • Flexible Scoring: Design a custom scoring rubric and criteria to align with internal review or compliance processes.
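
Putting these pieces together, a custom metric is essentially an evaluation prompt plus a rubric handed to a judge LLM. The sketch below is purely illustrative; the prompt template, rubric, and `call_evaluator_llm` helper are hypothetical stand-ins, not Karini AI’s actual API.

```python
import json

# Hypothetical rubric: encode your domain-specific definition of "good".
EVAL_PROMPT = """You are a strict compliance reviewer.
Score the RESPONSE against the GROUND TRUTH on a 0.0-1.0 scale using this rubric:
- 1.0: fully correct, on-brand tone, no policy violations
- 0.5: partially correct, or minor tone/format issues
- 0.0: incorrect, off-brand, or violates policy

RESPONSE: {response}
GROUND TRUTH: {ground_truth}

Return JSON: {{"score": <float>, "confidence": <float>, "reasoning": "<text>"}}"""

def call_evaluator_llm(prompt: str) -> str:
    """Hypothetical stand-in: invoke whichever evaluator model you selected."""
    raise NotImplementedError("wire this to your chosen evaluator LLM")

def evaluate(response: str, ground_truth: str) -> dict:
    """Run the custom metric and parse the structured score it returns."""
    prompt = EVAL_PROMPT.format(response=response, ground_truth=ground_truth)
    raw = call_evaluator_llm(prompt)
    return json.loads(raw)  # -> {"score": ..., "confidence": ..., "reasoning": ...}
```

Returning the same score/confidence/reasoning structure as the built-in metrics keeps custom metrics compatible with the aggregate statistics and confidence-based triage described above.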
