Evaluation steps
Karini AI provides a structured evaluation pipeline designed to test prompts and agents against standardized datasets. The system automates dataset validation, variable mapping, output generation, and metric scoring to ensure consistent, repeatable, and data-driven assessments. By following the evaluation workflow, teams can measure performance across large input sets, detect regressions between versions, and make informed improvements before deployment.
Follow the steps below to run and review your evaluation.
Step 1: Select What You Want to Evaluate
Open the Prompt or Agent you want to test in the prompt playground.
Ensure the version you want to evaluate is published.
Navigate to the Evaluation tab for that prompt or agent, as shown below.

Step 2: Upload Your Dataset
Click “Upload dataset” in the Evaluation tab.

Drag and drop your CSV file, or select it from your system.
The system automatically validates:
File type (.csv or supported .txt)
Presence of required columns (input or agent_input, and ground_truth)
A header row in the first line and at least one data row
Column names not exceeding 60 characters
After the dataset is uploaded, it is displayed in the interface as shown below.

Each row in this dataset becomes a single evaluation test case.
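If you want to catch problems before uploading, you can mirror these checks locally. The sketch below is illustrative only, assuming pandas is installed and using eval_dataset.csv as a placeholder file name; it is not part of the Karini AI upload flow.

```python
import pandas as pd

# Local pre-check mirroring the validation rules above.
# "eval_dataset.csv" is a placeholder file name.
df = pd.read_csv("eval_dataset.csv")

# Required columns: "input" or "agent_input", plus "ground_truth".
assert "input" in df.columns or "agent_input" in df.columns, "missing input/agent_input column"
assert "ground_truth" in df.columns, "missing ground_truth column"

# A header row (read by default) and at least one data row.
assert len(df) >= 1, "dataset needs at least one data row"

# Column names must not exceed 60 characters.
assert all(len(col) <= 60 for col in df.columns), "column name longer than 60 characters"

print(f"Looks valid: {len(df)} test cases, columns: {list(df.columns)}")
```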
Step 3: Map Dataset Variables
After upload, the system detects all column names in the CSV.
Columns such as context, instructions, or any custom fields are automatically mapped to corresponding prompt variables (e.g., {context}, {instructions}).
You can review and confirm these mappings in the interface.
This ensures that your prompt or agent receives the right inputs and contextual data for each test case.
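For illustration, the hypothetical template and dataset row below show how matching column names feed prompt variables; the column names, variable names, and values are examples only, not a required format.

```python
# Illustration only: a hypothetical prompt template and dataset row showing how
# CSV columns line up with prompt variables such as {context} and {instructions}.
prompt_template = (
    "Use the context below to answer the question.\n"
    "Instructions: {instructions}\n"
    "Context: {context}\n"
    "Question: {input}"
)

# One CSV row, keyed by column name (ground_truth is used for scoring, not rendering).
row = {
    "input": "What is the refund window?",
    "context": "Refunds are accepted within 30 days of purchase.",
    "instructions": "Answer in one sentence.",
    "ground_truth": "Refunds are accepted within 30 days of purchase.",
}

print(prompt_template.format(**row))
```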
Step 4: Choose Evaluation Strategy
In the Evaluation panel, select one or both of the following:
Default Evaluation – uses built-in metrics (relevancy, faithfulness, similarity, tool accuracy, etc.).
Custom Evaluation – uses your own evaluation prompt and scoring logic.
If using custom evaluation:
Select a custom evaluation prompt from your library.
Choose the LLM model to act as the evaluator.

This step defines how the system will judge each answer and which metrics will be applied.
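A custom evaluation prompt typically asks the evaluator model to compare the generated response with the ground truth and return a score with reasoning. The example below is a hypothetical sketch; the variable names, scoring scale, and output format are illustrative, not a required Karini AI format.

```python
# Hypothetical custom evaluation prompt; placeholder names, scoring scale,
# and output format are examples only.
custom_eval_prompt = """\
You are grading a model response against the expected answer.

Question: {input}
Expected answer: {ground_truth}
Model response: {response}

Score the response from 1 (incorrect) to 5 (fully correct and faithful to the
expected answer), then explain your reasoning in one or two sentences.
Return JSON: {{"score": <integer>, "reasoning": "<text>"}}
"""

print(custom_eval_prompt.format(
    input="What is the refund window?",
    ground_truth="Refunds are accepted within 30 days of purchase.",
    response="You can get a refund within 30 days.",
))
```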
Step 5: Run the Evaluation
Review the configuration:
Selected prompt/agent version
Uploaded dataset
Chosen evaluation strategy (default, custom, or both)
Click “Run” to start the evaluation job.

The system:
Iterates over each dataset row
Generates an output using the selected prompt or agent
Applies the chosen metrics to score each output
Stores scores, confidence, and reasoning per metric per test case
Progress and status (e.g., processing, completed, failed) are shown in real time.
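Conceptually, the job behaves like the loop sketched below. This is illustrative Python, not Karini AI's internal implementation; generate_output and score_with_metric are placeholder stubs standing in for the platform's generation and scoring steps.

```python
import csv

# Placeholder stubs standing in for the platform's generation and scoring steps.
def generate_output(row):
    return f"(model answer for: {row.get('input') or row.get('agent_input')})"

def score_with_metric(metric, row, output):
    return 0.0, 0.0, "stub reasoning"   # score, confidence, reasoning

selected_metrics = ["relevancy", "faithfulness", "similarity"]

results = []
with open("eval_dataset.csv", newline="") as f:
    for row in csv.DictReader(f):                # one test case per dataset row
        output = generate_output(row)            # run the selected prompt or agent
        for metric in selected_metrics:          # apply each chosen metric
            score, confidence, reasoning = score_with_metric(metric, row, output)
            results.append({
                "test_case": row.get("input") or row.get("agent_input"),
                "metric": metric,
                "score": score,
                "confidence": confidence,
                "reasoning": reasoning,
            })
```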
Step 6: Review Aggregate Results
When the evaluation completes, click “View Results”.
In the Results Dashboard, you can view:
Overall performance summary (per metric)
Average scores, distributions, min/max, and standard deviation
This gives you a high-level picture of how well your prompt or agent is performing across the whole dataset.
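The dashboard computes these statistics for you. As a point of reference, the same per-metric aggregates over a hypothetical per-test-case results file (assumed to have metric and score columns) look like this in pandas:

```python
import pandas as pd

# Hypothetical per-test-case results file with "metric" and "score" columns.
results = pd.read_csv("evaluation_results.csv")

# Per-metric average, min/max, standard deviation, and count.
summary = results.groupby("metric")["score"].agg(["mean", "min", "max", "std", "count"])
print(summary)
```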
Step 7: Inspect Individual Test Cases
Drill down into individual test cases to see:
Input (input or agent_input)
Ground truth
Variables (e.g., context)
Model/agent response
Tool usage and execution trace (for agents)
Metric scores, confidence, and detailed reasoning
Focus especially on low-scoring cases to identify failure modes.
This granular view helps you understand why certain cases failed and what to improve.
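If you keep a local copy of per-test-case scores (the same hypothetical evaluation_results.csv as above, with assumed column names), the lowest-scoring cases can be surfaced with a short filter; the 0.5 threshold assumes scores on a 0 to 1 scale and should be adjusted to your metrics.

```python
import pandas as pd

# Hypothetical per-test-case results file; column names are illustrative.
results = pd.read_csv("evaluation_results.csv")

# Lowest-scoring cases first; the threshold assumes a 0 to 1 score scale.
low_scoring = results[results["score"] < 0.5].sort_values("score")
print(low_scoring[["test_case", "metric", "score", "reasoning"]].head(10))
```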
Step 8: Compare Versions and Iterate
Make changes to your prompt or agent based on insights (e.g., adjust instructions, improve tools, refine context retrieval).
Publish a new version.
Re-run evaluation using the same dataset and compare:
Metric deltas
Improvements or regressions
Choose the best-performing version for deployment.
This creates a tight feedback loop and turns evaluation into a continuous optimization process.
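As an illustration, given two hypothetical result exports (results_v1.csv and results_v2.csv, each with metric and score columns), per-metric deltas between versions can be computed like this:

```python
import pandas as pd

# Hypothetical per-test-case result files for two published versions.
v1 = pd.read_csv("results_v1.csv").groupby("metric")["score"].mean().rename("v1")
v2 = pd.read_csv("results_v2.csv").groupby("metric")["score"].mean().rename("v2")

# Positive deltas are improvements, negative deltas are regressions.
comparison = pd.concat([v1, v2, (v2 - v1).rename("delta")], axis=1)
print(comparison)
```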