Evaluation steps
Karini AI provides a structured evaluation pipeline designed to test prompts and agents against standardized datasets. The system automates dataset validation, variable mapping, output generation, and metric scoring to ensure consistent, repeatable, and data-driven assessments. By following the evaluation workflow, teams can measure performance across large input sets, detect regressions between versions, and make informed improvements before deployment.
Follow the steps below to run and review your evaluation.
Step 1: Select What You Want to Evaluate
Open the Prompt or Agent you want to test in the prompt playground.
Ensure the version you want to evaluate is published.
Navigate to the Evaluation tab for that prompt or agent, as shown below.

Step 2: Upload Your Dataset
Click “Upload dataset” in the Evaluation tab.

Drag and drop your CSV file, or select it from your system.
The system automatically validates:
File type (.csv or supported .txt)
Presence of required columns (input or agent_input, and ground_truth)
A header row in the first line and at least one data row
Column names not exceeding 60 characters
After the dataset is uploaded, it is displayed in the interface as shown below.

Each row in this dataset becomes a single evaluation test case.
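If you want to catch problems before uploading, you can mirror these checks locally. The sketch below is illustrative only, assuming pandas is installed and using eval_dataset.csv as a placeholder file name; it is not part of the Karini AI upload flow.

```python
import pandas as pd

# Local pre-check mirroring the validation rules above.
# "eval_dataset.csv" is a placeholder file name.
df = pd.read_csv("eval_dataset.csv")

# Required columns: "input" or "agent_input", plus "ground_truth".
assert "input" in df.columns or "agent_input" in df.columns, "missing input/agent_input column"
assert "ground_truth" in df.columns, "missing ground_truth column"

# A header row (read by default) and at least one data row.
assert len(df) >= 1, "dataset needs at least one data row"

# Column names must not exceed 60 characters.
assert all(len(col) <= 60 for col in df.columns), "column name longer than 60 characters"

print(f"Looks valid: {len(df)} test cases, columns: {list(df.columns)}")
```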
Step 3: Map Dataset Variables
After upload, the system detects all column names in the CSV.
Columns such as context, instructions, or any custom fields are automatically mapped to corresponding prompt variables (e.g., {context}, {instructions}).
You can review and confirm these mappings in the interface.
This ensures that your prompt or agent receives the right inputs and contextual data for each test case.
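For illustration, the hypothetical template and dataset row below show how matching column names feed prompt variables; the column names, variable names, and values are examples only, not a required format.

```python
# Illustration only: a hypothetical prompt template and dataset row showing how
# CSV columns line up with prompt variables such as {context} and {instructions}.
prompt_template = (
    "Use the context below to answer the question.\n"
    "Instructions: {instructions}\n"
    "Context: {context}\n"
    "Question: {input}"
)

# One CSV row, keyed by column name (ground_truth is used for scoring, not rendering).
row = {
    "input": "What is the refund window?",
    "context": "Refunds are accepted within 30 days of purchase.",
    "instructions": "Answer in one sentence.",
    "ground_truth": "Refunds are accepted within 30 days of purchase.",
}

print(prompt_template.format(**row))
```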
Step 4: Choose Evaluation Strategy
In the Evaluation panel, select one or both of the following:
Default Evaluation – uses built-in metrics (relevancy, faithfulness, similarity, tool accuracy, etc.).
Custom Evaluation – uses your own evaluation prompt and scoring logic.
If using custom evaluation:
Select a custom evaluation prompt from your library.
Choose the LLM model to act as the evaluator.

This step defines how the system will judge each answer and which metrics will be applied.
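A custom evaluation prompt typically asks the evaluator model to compare the generated response with the ground truth and return a score with reasoning. The example below is a hypothetical sketch; the variable names, scoring scale, and output format are illustrative, not a required Karini AI format.

```python
# Hypothetical custom evaluation prompt; placeholder names, scoring scale,
# and output format are examples only.
custom_eval_prompt = """\
You are grading a model response against the expected answer.

Question: {input}
Expected answer: {ground_truth}
Model response: {response}

Score the response from 1 (incorrect) to 5 (fully correct and faithful to the
expected answer), then explain your reasoning in one or two sentences.
Return JSON: {{"score": <integer>, "reasoning": "<text>"}}
"""

print(custom_eval_prompt.format(
    input="What is the refund window?",
    ground_truth="Refunds are accepted within 30 days of purchase.",
    response="You can get a refund within 30 days.",
))
```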
Step 5: Run the Evaluation
Review the configuration:
Selected prompt/agent version
Uploaded dataset
Chosen evaluation strategy (default, custom, or both)
Click “Run” to start the evaluation job.

The system:
Iterates over each dataset row
Generates an output using the selected prompt or agent
Applies the chosen metrics to score each output
Stores scores, confidence, and reasoning per metric per test case
Progress and status (e.g., processing, completed, failed) are shown in real time.
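Conceptually, the job behaves like the loop sketched below. This is illustrative Python, not Karini AI's internal implementation; generate_output and score_with_metric are placeholder stubs standing in for the platform's generation and scoring steps.

```python
import csv

# Placeholder stubs standing in for the platform's generation and scoring steps.
def generate_output(row):
    return f"(model answer for: {row.get('input') or row.get('agent_input')})"

def score_with_metric(metric, row, output):
    return 0.0, 0.0, "stub reasoning"   # score, confidence, reasoning

selected_metrics = ["relevancy", "faithfulness", "similarity"]

results = []
with open("eval_dataset.csv", newline="") as f:
    for row in csv.DictReader(f):                # one test case per dataset row
        output = generate_output(row)            # run the selected prompt or agent
        for metric in selected_metrics:          # apply each chosen metric
            score, confidence, reasoning = score_with_metric(metric, row, output)
            results.append({
                "test_case": row.get("input") or row.get("agent_input"),
                "metric": metric,
                "score": score,
                "confidence": confidence,
                "reasoning": reasoning,
            })
```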
Step 6: Review Aggregate Results
When the evaluation completes, click “View Results”.
In the Results Dashboard, you can view:
Overall performance summary (per metric)
Average scores, distributions, min/max, and standard deviation
This gives you a high-level picture of how well your prompt or agent is performing across the whole dataset.
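The dashboard computes these statistics for you. As a point of reference, the same per-metric aggregates over a hypothetical per-test-case results file (assumed to have metric and score columns) look like this in pandas:

```python
import pandas as pd

# Hypothetical per-test-case results file with "metric" and "score" columns.
results = pd.read_csv("evaluation_results.csv")

# Per-metric average, min/max, standard deviation, and count.
summary = results.groupby("metric")["score"].agg(["mean", "min", "max", "std", "count"])
print(summary)
```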
Step 7: Inspect Individual Test Cases
Drill down into individual test cases to see:
Input (input or agent_input)
Ground truth
Variables (e.g., context)
Model/agent response
Tool usage and execution trace (for agents)
Metric scores, confidence, and detailed reasoning
Focus especially on low-scoring cases to identify failure modes.
This granular view helps you understand why certain cases failed and what to improve.
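If you keep a local copy of per-test-case scores (the same hypothetical evaluation_results.csv as above, with assumed column names), the lowest-scoring cases can be surfaced with a short filter; the 0.5 threshold assumes scores on a 0 to 1 scale and should be adjusted to your metrics.

```python
import pandas as pd

# Hypothetical per-test-case results file; column names are illustrative.
results = pd.read_csv("evaluation_results.csv")

# Lowest-scoring cases first; the threshold assumes a 0 to 1 score scale.
low_scoring = results[results["score"] < 0.5].sort_values("score")
print(low_scoring[["test_case", "metric", "score", "reasoning"]].head(10))
```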
Step 8: Compare Versions and Iterate
Make changes to your prompt or agent based on insights (e.g., adjust instructions, improve tools, refine context retrieval).
Publish a new version.
Re-run evaluation using the same dataset and compare:
Metric deltas
Improvements or regressions
Choose the best-performing version for deployment.
This creates a tight feedback loop and turns evaluation into a continuous optimization process.
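As an illustration, given two hypothetical result exports (results_v1.csv and results_v2.csv, each with metric and score columns), per-metric deltas between versions can be computed like this:

```python
import pandas as pd

# Hypothetical per-test-case result files for two published versions.
v1 = pd.read_csv("results_v1.csv").groupby("metric")["score"].mean().rename("v1")
v2 = pd.read_csv("results_v2.csv").groupby("metric")["score"].mean().rename("v2")

# Positive deltas are improvements, negative deltas are regressions.
comparison = pd.concat([v1, v2, (v2 - v1).rename("delta")], axis=1)
print(comparison)
```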