Datasets

Datasets are the foundation of the recipe building process: the data processing operations in a recipe run on them.

Datasets can be added as follows.

  1. On the recipe canvas, drag and drop the Dataset element into the recipe.

  2. Provide a user-friendly name and description.

  3. Choose the dataset type: text, image, audio, or video.

  4. Save the dataset to view it on the dashboard with the latest updates.

Default Metadata Extraction After Recipe Processing:

The metadata feature displays the default attributes extracted from a dataset after it has been processed through a recipe or workflow. Fields such as source_ref, checksum, and file_type, along with other file-related metadata, are extracted automatically. These fields help identify the dataset's origin and validate its integrity.
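
To illustrate what such default attributes can look like, here is a minimal sketch that derives them for a single local file. The function name extract_default_metadata, the choice of SHA-256 for the checksum, and the use of a MIME type for file_type are assumptions made for this example; the product's actual extraction logic and field formats are not described here and may differ.

```python
import hashlib
import mimetypes
from pathlib import Path


def extract_default_metadata(path: str) -> dict:
    """Return a record with source_ref, checksum, and file_type for one file.

    Hypothetical helper: the real platform may compute these differently.
    """
    file_path = Path(path)
    data = file_path.read_bytes()

    # Guess the type from the file extension; fall back to a generic type.
    guessed_type, _encoding = mimetypes.guess_type(file_path.name)

    return {
        "source_ref": str(file_path),                   # where the file came from
        "checksum": hashlib.sha256(data).hexdigest(),   # assumed algorithm: SHA-256
        "file_type": guessed_type or "application/octet-stream",
    }
```

Comparing the checksum of a stored file against the recorded value is one way to confirm that the file has not changed since it was processed.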

Custom Metadata Extraction Using Metadata Extractor Prompt:

This feature extracts custom metadata by means of a user-supplied prompt. Users specify the keys they want to extract, which is useful for datasets with custom attributes or unique fields that the default extraction does not cover.
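
The idea of a key-driven extraction prompt can be sketched as follows. Both functions, the prompt wording, and the JSON response format are hypothetical and chosen only for illustration; the prompt format that the Metadata Extractor actually expects is not specified here.

```python
import json


def build_metadata_prompt(keys: list[str], document_text: str) -> str:
    """Build an extraction prompt that asks for the given keys (hypothetical format)."""
    key_list = ", ".join(keys)
    return (
        f"Extract the following metadata keys from the document: {key_list}.\n"
        "Return a JSON object with exactly these keys; "
        "use null for any key that is not found.\n\n"
        f"Document:\n{document_text}"
    )


def parse_metadata_response(response: str, keys: list[str]) -> dict:
    """Keep only the requested keys; keys missing from the response become None."""
    parsed = json.loads(response)
    return {key: parsed.get(key) for key in keys}
```

Restricting the parsed result to the requested keys keeps unexpected fields in a model response from leaking into the dataset's metadata.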

ACL Tags:

Access Control List (ACL) tags are shown when the recipe is processed with ACLs enabled. These tags define the permissions and access control for the dataset, ensuring that only authorized users or processes can interact with certain data. The ACL information is displayed to ensure proper data governance and security protocols are followed during dataset processing.
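
As a rough sketch of how ACL tags can gate access, the example below treats each item's tags as a set of group names and grants access when the user belongs to at least one of them. The tag format, the acl_tags field name, and the matching rule are assumptions for illustration; the platform's real ACL model may be richer.

```python
def can_access(item_acl_tags: set[str], user_groups: set[str]) -> bool:
    """Allow access when the user shares at least one group with the item's ACL tags."""
    return bool(item_acl_tags & user_groups)


def filter_accessible(items: list[dict], user_groups: set[str]) -> list[dict]:
    """Return only the items whose (hypothetical) acl_tags field permits this user."""
    return [
        item for item in items
        if can_access(set(item.get("acl_tags", [])), user_groups)
    ]
```

In this sketch an item without any ACL tags is visible to no one, which is a deliberately conservative default.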

Embedding model configuration

This feature specifies the technical details of the embedding model used for text processing.

Processing task status

The chart offers a clear visual summary of the processing status for tasks. It tracks the success and failure of various stages in the data processing pipeline, providing insights into the overall performance.

Key Features:

  • Total Items: Displays the total number of items being processed.

  • Processing Tasks: Tracks the following stages:

    • OCR (Optical Character Recognition): Converts images or scanned documents into machine-readable text.

    • PII (Personally Identifiable Information): Detects and manages sensitive personal data within the dataset.

    • Chunking: Breaks larger pieces of data into smaller, more manageable chunks.

    • Embeddings: Transforms data into numerical representations for use in machine learning models.

  • Processing Status: Indicates the success or failure of each task:

    • Success: Green bars indicate tasks that completed successfully.

    • Error: Orange bars indicate tasks that encountered errors during processing.
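
The counts behind a chart like this can be sketched as a simple tally per stage. The stage identifiers, the status strings, and the per-item record layout below are assumptions made for the example, not the platform's actual data model.

```python
from collections import Counter

# Assumed stage identifiers, mirroring the stages listed above.
STAGES = ("ocr", "pii", "chunking", "embeddings")


def summarize_status(items: list[dict]) -> dict:
    """Count statuses per stage across all items.

    Each item is assumed to map a stage name to "success" or "error";
    a stage that did not run for an item is simply absent.
    """
    counts = {stage: Counter() for stage in STAGES}
    for item in items:
        for stage in STAGES:
            status = item.get(stage)
            if status is not None:
                counts[stage][status] += 1
    return {
        "total_items": len(items),
        "stages": {stage: dict(counter) for stage, counter in counts.items()},
    }
```

A stage whose error count is high relative to the total number of items is a natural place to start troubleshooting.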
