Create Graph RAG recipe

To create a new recipe, go to the Recipe Page, click Add New, select Karini as the runtime option, provide a user-friendly name and a detailed description, and choose Graph RAG as the recipe type.

You will be presented with an option to use the default template with preset components, or manually create the necessary components.

Using Preset Template: By selecting "Yes," a recipe template will be created with the following predefined components:

Additionally, you can use the Generate Descriptor and Upload Descriptor options to customize or import a Graph Descriptor into your workflow.

Manual Configuration: When the "No" option is selected, the canvas will remain empty, providing users with the flexibility to manually configure the recipe.

Recipe Configuration

Configure the following elements by dragging them onto the recipe canvas.

Source

Define your data storage connector. You can select an appropriate data connector type from the list of available connectors. Refer to Connectors for in-depth information.

Note: The Graph RAG implementation currently supports only two types of connectors: Amazon S3 and Amazon S3 Manifest connectors.

Configure the storage paths for the connector and apply necessary filters to restrict the data being included in the source. Enable recursive search if needed to include data from nested directories.
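As a rough illustration of how a storage path, a file filter, and recursive search interact, the sketch below filters a list of S3 object keys. The function name `select_keys`, the suffix-based filter, and the sample keys are hypothetical and chosen for illustration; this is not Karini AI's connector logic.

```python
from typing import Iterable, List, Sequence


def select_keys(
    keys: Iterable[str],
    prefix: str,
    suffixes: Sequence[str] = (),
    recursive: bool = False,
) -> List[str]:
    """Return the keys under `prefix` that pass the filters (illustrative only)."""
    wanted = tuple(s.lower() for s in suffixes)
    selected = []
    for key in keys:
        if not key.startswith(prefix):
            continue  # outside the configured storage path
        remainder = key[len(prefix):]
        if not remainder:
            continue  # the "folder" placeholder object itself
        if not recursive and "/" in remainder:
            continue  # nested directory, skipped unless recursive search is on
        if wanted and not remainder.lower().endswith(wanted):
            continue  # file type excluded by the filter
        selected.append(key)
    return selected


# Hypothetical object keys in a bucket.
keys = [
    "reports/2024/q1.pdf",
    "reports/summary.pdf",
    "reports/notes.txt",
    "images/logo.png",
]
flat = select_keys(keys, "reports/", suffixes=[".pdf"])
nested = select_keys(keys, "reports/", suffixes=[".pdf"], recursive=True)
```

With recursive search off, only files directly under the path are selected; turning it on also picks up files in nested directories.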

You have the option to test your data connector setup by using the "Test" button.

Dataset

Dataset serves as an internal collection of dataset items which are pointers to the data source. For the default template, the dataset configuration comes with default settings which can be modified. For manual configuration, a new dataset configuration must be created.

To create a new dataset, refer to Datasets.

Karini AI provides various options for data preprocessing.

OCR:

For source data that contains PDF or image files, you can perform Optical Character Recognition (OCR) by selecting one of the following options:

  • Unstructured IO with Extract Images: This method is used for extracting images from unstructured data sources. It processes unstructured documents, identifying and extracting images that can be further analyzed or used in different applications.

  • PyMuPDF with Fallback to Amazon Textract: This approach utilizes PyMuPDF to extract text and images from PDF documents. If PyMuPDF fails or is insufficient, the process falls back to Amazon Textract, ensuring a comprehensive extraction by leveraging Amazon's advanced OCR capabilities.

  • Amazon Textract with Extract Table: Amazon Textract is used to extract structured data, such as tables, from documents. This method specifically focuses on identifying and extracting tabular data, making it easier to analyze and use structured information from scanned documents or PDFs.

PII:

If you need to mask Personally Identifiable Information (PII) within your dataset, you can enable the PII Masking option and select the entities that you want masked during data preprocessing. To learn more about the entities, refer to this documentation.

Note: Link your Source element to the Dataset element in the recipe canvas to start building your data ingestion flow.

Metadata Extraction Prompt

The Metadata Extraction Prompt can be configured in two ways: Template-based Configuration and Manual Configuration.

Template-based Configuration

In the Template-based Configuration, the Graph RAG recipe includes a pre-configured Metadata Extraction Prompt that serves as a ready-to-use template. Although the prompt is pre-defined, it must be tested and published before it can be used in production workflows. Use this prompt as a starting point and customize it as needed to match your data extraction requirements.

To test and deploy the default prompt, follow these steps:

  1. Open the Predefined Prompt in Prompt Playground

    Open the default pre-configured prompt in the Prompt Playground by clicking the arrow icon next to the prompt in the recipe canvas. The prompt opens for editing and testing.

Refer to the image below.

  2. Review and Customize the Prompt

    Use the Prompt Playground to review the default prompt configuration, including the instructions, examples, and graph descriptor. You can customize any part of the prompt to align it with your data extraction needs and domain-specific requirements.

  3. Test and Publish the Prompt

    Use the Test & Compare feature to validate the prompt's extraction behavior. Iterate as needed until the results meet your expectations. Once finalized, publish the prompt to make it available for use in the recipe.

  4. Select the Published Prompt

    After publishing, the prompt becomes available within the Graph RAG recipe. When a prompt is added, the associated primary and fallback models, along with their respective parameters, are displayed in the metadata extraction prompt's right panel within the recipe. For more information, refer to the Prompt Versions section.

Manual Configuration

In the Manual Configuration, the extraction prompt must be created, tested, and published using the Prompt Playground. Refer to Prompt Management for detailed instructions.

Note: When creating a new prompt, use the "Graph Metadata Extractor" prompt template as a starting point. Customize it as needed to match your graph structure, data format, or specific metadata requirements.

The Page Range and Segmentation Configuration setting in the Metadata Extraction element allows precise control over the document portions processed during extraction.

Page Range and Segmentation

The Page Range and Segmentation feature provides granular control over the document sections processed during the extraction phase of the pipeline. This functionality allows users to specify exact page ranges and segment sizes, thereby enhancing the efficiency and relevance of metadata extraction.

Page Range

The Page Range input offers flexible selection of pages for extraction using a customizable syntax, ensuring that only the most pertinent sections of a document are processed.

All Pages: Enabling this option directs the system to extract data from the entire document, so no page range needs to be configured manually.

Enable Pages per Segment: This option splits the document into fixed-size blocks of pages prior to extraction. The number of pages per segment is set by the user.

  • Pages per Segment: Specifies the number of pages contained within each segment. For example, setting the value to 10 will divide a 30-page document into 3 segments, each containing 10 pages.
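The arithmetic behind the Pages per Segment example can be sketched as follows. The function name `segment_pages` is hypothetical, and the handling of a shorter final segment (when the page count is not an exact multiple) is an assumption, since the text above only covers the evenly divisible case.

```python
from typing import List, Tuple


def segment_pages(total_pages: int, pages_per_segment: int) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into consecutive (first_page, last_page) blocks."""
    if total_pages < 0 or pages_per_segment < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_segment >= 1")
    return [
        # Assumption: a trailing partial block is kept as a shorter final segment.
        (start, min(start + pages_per_segment - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_segment)
    ]


# The example from the text: a 30-page document with 10 pages per segment.
segments = segment_pages(30, 10)
print(segments)  # [(1, 10), (11, 20), (21, 30)]
```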

If a summarization prompt is used, link the metadata extraction prompt to it; otherwise, link the metadata extraction prompt directly to the knowledge base.

Summarize Metadata

The Summarize Metadata feature can be used to perform metadata consolidation and normalization after the extraction phase in a workflow. It takes multiple JSON outputs, typically generated from individual pages or document segments, and intelligently merges them into a single, unified metadata structure.
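To make the idea of consolidation concrete, the sketch below merges per-segment JSON outputs with simple deterministic rules. It is only an illustration of the input and output shape: in the recipe, the merge is driven by a summarization prompt, and the function name `merge_segment_metadata`, the merge rules, and the sample fields are all hypothetical.

```python
import json
from typing import Any, Dict, List


def merge_segment_metadata(segments: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-segment dicts into one.

    Illustrative rules: list values are unioned in first-seen order,
    scalar values keep the first non-empty value encountered.
    """
    merged: Dict[str, Any] = {}
    for segment in segments:
        for field, value in segment.items():
            if isinstance(value, list):
                bucket = merged.get(field)
                if not isinstance(bucket, list):
                    # Promote an earlier scalar (if any) into a list.
                    bucket = [] if bucket in (None, "") else [bucket]
                    merged[field] = bucket
                for item in value:
                    if item not in bucket:
                        bucket.append(item)
            elif merged.get(field) in (None, ""):
                merged[field] = value
    return merged


# Hypothetical outputs from two document segments.
page_outputs = [
    json.loads('{"title": "Supply Agreement", "parties": ["Acme"], "effective_date": ""}'),
    json.loads('{"title": "", "parties": ["Acme", "Globex"], "effective_date": "2024-01-01"}'),
]
unified = merge_segment_metadata(page_outputs)
print(json.dumps(unified, indent=2))
```

A prompt-based merge can also normalize values (for example, reconciling spelling variants of the same entity), which fixed rules like these do not attempt.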

To use this feature, drag and drop the Summarize Metadata tile onto the canvas, then select a published summarization metadata prompt that has been tested and validated in the Prompt Playground. Refer to Prompt Management for more details.

Link the summarization prompt to the Knowledge Base.

Knowledge Base

The Graph RAG recipe enables the integration and configuration of a Knowledge Base for graph data, providing seamless connectivity with graph databases such as Neo4j and Amazon Neptune.

The following steps outline the process for configuring and connecting a Knowledge Base for graph processing.

Neo4j (On-Premises)

For users employing an on-premises setup, the "On-premises" checkbox can be enabled to configure the connection to a locally hosted Neo4j database.

Neo4j with Credentials

When selecting Neo4j, users must input their Neo4j credentials:

  • URI: The URI required for connecting to the Neo4j instance.

  • Database Name: The specific database within Neo4j to connect to.

  • User Name: The username for authentication.

  • User Password: The password corresponding to the user account.

  • Processing Region: Specify the region (e.g., us-east-1) for processing the graph data.

Test Neo4j Connection: Prior to saving the configuration, click the Test Neo4j Connection button to verify that the connection details are correct and the connection to the Neo4j instance is successful.
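If you want to sanity-check the same details outside the UI, a minimal sketch along these lines may help. The `Neo4jSettings` class, its field names, and the sample values are hypothetical; the URI schemes listed are the ones accepted by the official Neo4j drivers, and `check_connection` assumes the `neo4j` Python package is installed and the server is reachable. It is not what the Test Neo4j Connection button runs internally.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse

# URI schemes accepted by the official Neo4j drivers.
NEO4J_SCHEMES = {"neo4j", "neo4j+s", "neo4j+ssc", "bolt", "bolt+s", "bolt+ssc"}


@dataclass
class Neo4jSettings:
    """Hypothetical container mirroring the fields listed above."""
    uri: str
    database: str
    username: str
    password: str
    processing_region: str

    def problems(self) -> List[str]:
        """Return a list of obvious configuration problems (empty if none)."""
        issues = []
        if urlparse(self.uri).scheme not in NEO4J_SCHEMES:
            issues.append("uri: unsupported scheme")
        for name in ("database", "username", "password", "processing_region"):
            if not getattr(self, name).strip():
                issues.append(f"{name}: must not be empty")
        return issues


def check_connection(settings: Neo4jSettings) -> None:
    """Raise an exception if the server cannot be reached or rejects the credentials."""
    # Imported lazily so the validation above works without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(
        settings.uri, auth=(settings.username, settings.password)
    ) as driver:
        driver.verify_connectivity()


# Placeholder values for illustration only.
settings = Neo4jSettings(
    uri="neo4j+s://graph.example.com:7687",
    database="neo4j",
    username="neo4j",
    password="secret",
    processing_region="us-east-1",
)
issues = settings.problems()
```

Calling `check_connection(settings)` after `issues` comes back empty separates typos in the form fields from genuine network or authentication failures.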

Amazon Neptune

For Neptune, you will need to enter your Neptune connection details as follows:

  • Namespace: The logical partition within Neptune used to organize the data.

  • Neptune Endpoint: The URL of the Neptune cluster for API access.

  • Neptune Role ARN: The AWS IAM role that grants the necessary permissions to access Neptune.

  • AWS Region: The geographic location of the Neptune cluster.

Test Neptune Connection: Click this button to validate the connection and credentials to Neptune.
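Before clicking the button, it can be useful to check that the values at least have the expected shape. The sketch below is a loose format check only: the function name `neptune_problems` and the sample values are hypothetical, the regular expressions approximate the IAM role ARN and AWS region formats, and nothing here contacts Neptune.

```python
import re
from typing import List

# Approximate shapes; these patterns do not prove the role or region exists.
ROLE_ARN = re.compile(r"^arn:(aws|aws-cn|aws-us-gov):iam::\d{12}:role/[\w+=,.@/-]+$")
REGION = re.compile(r"^[a-z]{2}(-[a-z]+)+-\d$")


def neptune_problems(namespace: str, endpoint: str, role_arn: str, region: str) -> List[str]:
    """Return a list of obvious formatting problems (empty if none)."""
    issues = []
    if not namespace.strip():
        issues.append("namespace: must not be empty")
    if not endpoint.strip():
        issues.append("endpoint: must not be empty")
    if not ROLE_ARN.match(role_arn):
        issues.append("role_arn: not an IAM role ARN")
    if not REGION.match(region):
        issues.append("region: unexpected format")
    return issues


# Placeholder values for illustration only.
neptune_issues = neptune_problems(
    namespace="contracts",
    endpoint="my-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com",
    role_arn="arn:aws:iam::123456789012:role/ExampleNeptuneAccess",
    region="us-east-1",
)
```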


Graph Descriptor

To configure a recipe, a descriptor must be provided. The descriptor can be either generated or uploaded through the available interface. After providing the descriptor, select the indexing type, either Vector Index or Text Index, that suits your use case, and save it. Refer to Descriptor for more details.

Before generating a descriptor, ensure the following configurations are properly set up:

  • Source: The source element must be correctly configured with the relevant data source for the recipe.

  • Dataset: The dataset element should be associated with either a newly created dataset or the default dataset for processing within the Graph RAG recipe.

  • Natural Language Assistant Model: The Natural Language Assistant model must be configured on the Organization page to enable descriptor generation.

Finally, the Graph RAG recipe appears as follows:
