# Datasets

Datasets are the foundation of the recipe building process: the recipe's data processing operations run on them.

Datasets can be added as follows:

1. On the recipe canvas, drag and drop the Dataset element into the recipe.
2. Provide a user-friendly name and description.
3. Choose the dataset type: **text** or **multimodal**.
4. Save the dataset to view it on the dashboard with the latest updates.

### **Default Metadata Extraction After Recipe Processing**

After a dataset has been processed through a recipe or workflow, the metadata feature displays the default attributes extracted from it. Fields such as `source_ref`, `checksum`, `file_type`, and other file-related metadata are extracted automatically. They help identify the dataset's origin and validate its integrity.

<figure><img src="/files/B9v3P1QF0in2Ea6xepiX" alt=""><figcaption></figcaption></figure>
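To illustrate how fields like these support integrity checks, here is a minimal Python sketch. It is not the platform's implementation: the function names are hypothetical, and the use of SHA-256 for `checksum` and of the file extension for `file_type` are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def default_metadata(path: str) -> dict:
    """Build a record with the three fields named above (illustrative only)."""
    p = Path(path)
    data = p.read_bytes()
    return {
        "source_ref": str(p),                          # where the file came from
        "checksum": hashlib.sha256(data).hexdigest(),  # assumption: SHA-256 digest
        "file_type": p.suffix.lstrip(".").lower(),     # assumption: taken from the extension
    }


def is_unchanged(path: str, metadata: dict) -> bool:
    """Return True if the file's current checksum matches the recorded one."""
    return default_metadata(path)["checksum"] == metadata["checksum"]
```

Comparing a freshly computed checksum with the stored one is a common way to detect that a source file changed after it was processed.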

### **Custom Metadata Extraction Using Metadata Extractor Prompt**

This feature extracts custom metadata using a metadata extractor prompt. Users specify the keys they want to extract, which is useful for datasets with custom attributes or unique fields that the default extraction does not cover.
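The idea of a key-driven extractor prompt can be sketched in Python as follows. This is a hedged illustration only: the prompt wording, the JSON reply format, and both function names are assumptions, and the call to the model itself is left out.

```python
import json


def build_extraction_prompt(keys: list[str], document_text: str) -> str:
    """Compose a prompt asking a model to return the requested keys as JSON."""
    key_list = ", ".join(keys)
    return (
        "Extract the following fields from the document and reply with a JSON "
        f"object containing exactly these keys: {key_list}. "
        "Use null for any field that is not present.\n\n"
        f"Document:\n{document_text}"
    )


def parse_extraction(response_text: str, keys: list[str]) -> dict:
    """Keep only the requested keys; keys missing from the reply become None."""
    parsed = json.loads(response_text)
    return {key: parsed.get(key) for key in keys}
```

For example, with the hypothetical keys `["invoice_number", "vendor"]`, a reply that contains only `invoice_number` is parsed into a record where `vendor` is `None` and any unrequested fields are dropped.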

### **ACL Tags**

Access Control List (ACL) tags are shown when the recipe is processed with ACLs enabled. These tags define the permissions for the dataset, so that only authorized users or processes can interact with the data they protect. Displaying the ACL information helps confirm that data governance and security protocols are followed during dataset processing.

<figure><img src="/files/yu8npXnJhtivqp8cRrvw" alt=""><figcaption></figcaption></figure>
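As a rough illustration of how tag-based access control can work, the Python sketch below filters items by comparing their tags with a user's groups. The `acl_tags` field name, the group model, and the deny-by-default rule for untagged items are all assumptions for the example, not the platform's documented behavior.

```python
def can_access(user_groups: set[str], acl_tags: set[str]) -> bool:
    """Allow access when the user shares at least one tag with the item."""
    return bool(user_groups & acl_tags)


def visible_items(items: list[dict], user_groups: set[str]) -> list[dict]:
    """Return only the items the user may see; untagged items are hidden."""
    return [
        item
        for item in items
        if can_access(user_groups, set(item.get("acl_tags", [])))
    ]
```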

### **Embedding model configuration**

This feature specifies the technical details of the embedding model used for text processing.

<figure><img src="/files/AotHV0Swod6H7SKMKW6f" alt=""><figcaption></figcaption></figure>

### Processing task status

<figure><img src="/files/KzFPvkF1oUa4NCTF0Ifp" alt=""><figcaption></figcaption></figure>

The chart summarizes the processing status of tasks. It tracks the success and failure of each stage in the data processing pipeline, giving insight into overall performance.

#### Key Features:

* **Total Items**: Displays the total number of items being processed.
* **Processing Tasks**: Tracks the following stages:
  * **OCR (Optical Character Recognition)**: Converts images or scanned documents into machine-readable text.
  * **PII (Personally Identifiable Information)**: Detects and manages sensitive personal data within the dataset.
  * **Chunking**: Breaks larger pieces of data into smaller, more manageable chunks.
  * **Embeddings**: Transforms data into numerical representations for use in machine learning models.
* **Processing Status**: Indicates the success or failure of each task:
  * **Success**: Green bars indicate tasks that completed successfully.
  * **Error**: Orange bars indicate tasks that encountered errors during processing.
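The per-stage counts behind a chart like this can be sketched in Python as below. The input shape (a list of `(stage, status)` pairs) and the function name are hypothetical; only the stage names and the success/error statuses come from the list above.

```python
from collections import Counter

# Stage names taken from the list above.
STAGES = ("OCR", "PII", "Chunking", "Embeddings")


def summarize(task_results: list[tuple[str, str]]) -> dict:
    """Count 'success' and 'error' outcomes for each processing stage."""
    counts = {stage: Counter() for stage in STAGES}
    for stage, status in task_results:
        # Stages not listed in STAGES are still counted rather than dropped.
        counts.setdefault(stage, Counter())[status] += 1
    return {
        stage: {"success": c["success"], "error": c["error"]}
        for stage, c in counts.items()
    }
```

Each stage maps to one pair of counts, which corresponds to one green and one orange bar in the chart.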

