Dataset search configuration

The following settings control how search results are retrieved and ranked. Tune them to match your dataset's requirements and to improve the relevance and accuracy of search results.

Prompt Context Options

Choose how retrieved embeddings are used to build prompts for LLMs.

Option 1: Use Embedding Chunks

Description: Use the vector-embedded chunks directly as context.

How it Works:

  1. Semantic search retrieves Top-K chunks

  2. Chunks are concatenated as context

  3. LLM receives query + chunk context

  4. LLM generates answer from chunks

Option 2: Summarize Chunks

Description: Summarize retrieved chunks before using as context.

How it Works:

  1. Semantic search retrieves Top-K chunks

  2. Chunks are sent to LLM for summarization

  3. Summary becomes the context

  4. LLM generates answer from summary

Option 3: Use Document Embeddings

Description: Retrieve entire documents based on chunk matches.

How it Works:

  1. Semantic search retrieves Top-K chunks

  2. System identifies source documents for chunks

  3. Full document text becomes context (or specific pages)

  4. LLM generates answer from document

Sub-Options:

  • Use Entire Document: Include full document text

  • Use Matching Page: Include only the page containing the match

    • Include Previous and Next Page: Add surrounding pages for context
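The three options above differ only in how the context is assembled before the final LLM call. A minimal sketch, assuming hypothetical `summarizer` and `doc_loader` callables (these are illustrative placeholders, not this platform's API):

```python
def build_context(chunks, mode="chunks", summarizer=None, doc_loader=None):
    """Assemble LLM context from retrieved chunks, per the selected option."""
    if mode == "chunks":
        # Option 1: concatenate the Top-K chunks directly.
        return "\n\n".join(c["text"] for c in chunks)
    if mode == "summarize":
        # Option 2: one LLM pass condenses the chunks into a summary.
        return summarizer("\n\n".join(c["text"] for c in chunks))
    if mode == "document":
        # Option 3: pull the full source document for each matched chunk,
        # deduplicating chunks that share a source.
        doc_ids = {c["doc_id"] for c in chunks}
        return "\n\n".join(doc_loader(d) for d in sorted(doc_ids))
    raise ValueError(f"unknown mode: {mode}")
```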

Top-K Configuration

Controls the number of results retrieved from the vector database.

Top-K: Maximum number of matching vectors to retrieve (1-100). Default is 5.

Reranker

The Reranker refines search results by reordering the Top-K results with a reranker LLM, improving precision.

  • Configuration:

    • Enable Reranker: Check "Enable reranker" in dataset settings.

    • Top-N: Number of results to return after reranking (1-100, ≤ Top-K). Recommended: 3-5.

    • Reranker Threshold: Filters results below a relevance score (0.0-1.0, default: 0.5). Example: 0.5 for balanced relevance.

  • Reranking Pipeline:

    • Top-K results are retrieved.

    • Reranker scores results.

    • Chunks are filtered by threshold, and Top-N are returned.
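The pipeline above can be sketched as follows; the `score` callable stands in for the reranker LLM, and the names are illustrative rather than the platform's API:

```python
def rerank(query, chunks, score, top_n=5, threshold=0.5):
    """Score Top-K chunks, drop those below the threshold, return Top-N."""
    scored = [(score(query, c), c) for c in chunks]       # reranker scores results
    kept = [(s, c) for s, c in scored if s >= threshold]  # filter by threshold
    kept.sort(key=lambda sc: sc[0], reverse=True)         # most relevant first
    return [c for _, c in kept[:top_n]]                   # return Top-N
```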

Advanced query reconstruction

Query Rewrite

Automatically improves user queries for better semantic search results by handling ambiguity and expanding context.

  • Methods:

    • Multi-Query Rewrite: Breaks complex queries into simpler sub-queries.

      • How it Works: LLM identifies sub-questions, generates focused queries, and merges results.

    • Query Expansion: Adds contextual information to queries for better retrieval.

      • How it Works: LLM enhances the query with additional context.

  • Configuration: Enable "Advanced query reconstruction" and choose a method (Multi-query or Query expansion).
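As a rough sketch of the multi-query method: sub-queries are searched independently and the hits are merged with duplicates removed. The `llm_split` and `search` callables below are stand-ins for the LLM and vector-search steps, not this platform's API:

```python
def multi_query_search(query, llm_split, search):
    """Split a complex query into sub-queries, search each, merge the results."""
    results, seen = [], set()
    for sub in llm_split(query):       # LLM identifies focused sub-queries
        for hit in search(sub):        # retrieve per sub-query
            if hit["id"] not in seen:  # deduplicate across sub-queries
                seen.add(hit["id"])
                results.append(hit)
    return results
```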

ACL Pre-filtering

ACL (Access Control List) Pre-filtering ensures that users can only access search results from documents they are authorized to view, enforcing document-level security.

How It Works:

  1. Documents are tagged with ACL metadata during embedding.

  2. User Identity is passed with each search request.

  3. The Vector Database is pre-filtered to include only accessible documents based on the user's permissions.
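For example, the pre-filter can be expressed as an OpenSearch boolean filter on the document's ACL fields; the field names and group values here are illustrative, not the platform's exact schema:

```json
{
  "query": {
    "bool": {
      "filter": [
        { "terms": { "acl_groups": ["engineering", "all-employees"] } }
      ]
    }
  }
}
```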

Security Model:

  • Access control is enforced before semantic search.

  • Group-based permissions are supported, ensuring users cannot see unauthorized content.

  • The process is transparent to the end user.

Configuration

  • Enable ACL Pre-filtering: Check "Enable ACL restriction" in the dataset settings.

  • User Data: Automatically included in the session, with details such as user ID, name, and groups.

Example User Data:
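For illustration only, the session's user data might look like this; the exact field names are an assumption, not the platform's schema:

```json
{
  "user_id": "jdoe",
  "name": "Jane Doe",
  "groups": ["engineering", "all-employees"]
}
```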


Prerequisites

  • Document Tagging: Documents must carry ACL metadata. ACL tagging is supported for connectors with permission models, such as Box, Dropbox, SharePoint, Sitecore, Google Drive, Google File Storage, and Amazon S3 Manifest.

  • Embedding: ACL metadata is embedded with documents, and the vector database is configured to filter based on ACL fields.
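As an illustration, ACL metadata attached to a document might look like this; the field names are assumptions, not the exact schema:

```json
{
  "acl": {
    "allowed_users": ["jdoe"],
    "allowed_groups": ["engineering"]
  }
}
```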

Dynamic Metadata Filtering

Dynamic Metadata Filtering automatically generates filters based on query context, improving the relevance and precision of search results by narrowing the search space before semantic retrieval.

How It Works:

  1. LLM receives the query and metadata schema.

  2. Relevant filters are determined based on the query.

  3. OpenSearch query DSL is generated.

  4. The filter is applied to the vector database, and semantic search is performed on the filtered subset.

Configuration

  • Enable Dynamic Metadata Filtering: Check "Enable dynamic metadata filtering" in dataset settings.

  • Optional: Customize the prompt template for specific filtering strategies.

  • Metadata Configuration: Define metadata fields in your dataset (e.g., document type, date, department).

Example Metadata:
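A sketch of document metadata that generated filters could target, using the example fields named above (values are illustrative):

```json
{
  "document_type": "report",
  "date": "2024-03-31",
  "department": "engineering"
}
```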


Examples

Example 1: Date Range Filtering

Query: "Show me reports from last quarter."

Generated Filter:
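A plausible generated filter for this query, expressed as OpenSearch query DSL; the field names and the resolved quarter dates are illustrative assumptions:

```json
{
  "bool": {
    "filter": [
      { "term":  { "document_type": "report" } },
      { "range": { "date": { "gte": "2024-01-01", "lte": "2024-03-31" } } }
    ]
  }
}
```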


Example 2: Multi-Field Filtering

Query: "Find engineering team documents about API design."

Generated Filter:
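A plausible generated filter for this query, combining a metadata constraint with a topical match; the field names are illustrative assumptions:

```json
{
  "bool": {
    "filter": [
      { "term":  { "department": "engineering" } },
      { "match": { "topic": "API design" } }
    ]
  }
}
```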

Hybrid Search

Hybrid Search combines semantic vector search and keyword-based search to optimize retrieval accuracy by merging the strengths of both methods.

How It Works:

  1. Vector Search:

    • Embed query.

    • Find semantically similar chunks.

    • Score based on cosine similarity (0.0-1.0).

  2. Keyword Search:

    • Tokenize query.

    • Find matching keywords using BM25.

    • Score based on BM25, normalized to 0.0-1.0.

  3. Score Combination:

    • Final Score = (vector_weight × vector_score) + (keyword_weight × keyword_score).

  4. Merged Results:

    • Deduplicate and sort by combined score.

    • Return Top-K results.
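The score-combination and merge steps above can be sketched as follows, assuming both score sets are already normalized to 0.0-1.0 (names are illustrative):

```python
def hybrid_merge(vector_hits, keyword_hits,
                 vector_weight=0.6, keyword_weight=0.4, top_k=5):
    """Combine normalized vector and BM25 scores, dedupe, return Top-K."""
    combined = {}
    for doc_id, score in vector_hits.items():
        combined[doc_id] = vector_weight * score
    for doc_id, score in keyword_hits.items():
        # A document found by both searches gets both weighted contributions.
        combined[doc_id] = combined.get(doc_id, 0.0) + keyword_weight * score
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```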

Configuration

  • Enable Hybrid Search: Check "Enable Hybrid Search" in settings.

  • Query Template: Customize the weights for vector and keyword search in JSON format.

Example Query Template:
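A template might look like this, using the weight keys and field-boost syntax described in this section; the exact schema is an illustrative assumption:

```json
{
  "vector_weight": 0.6,
  "keyword_weight": 0.4,
  "fields": ["title^3.0", "content"]
}
```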

Weight Parameters

  • vector_weight: Higher values prioritize semantic search (0.0-1.0).

    • Example: 0.6 = 60% vector search.

  • keyword_weight: Higher values prioritize exact keyword matches (0.0-1.0).

    • Example: 0.4 = 40% keyword search.

  • Fields: Specify text fields to search, with the option to boost field importance (e.g., "title^3.0").

Common Configurations:

  • Semantic-heavy: 0.7 vector, 0.3 keyword

  • Balanced: 0.6 vector, 0.4 keyword (default)

  • Keyword-heavy: 0.4 vector, 0.6 keyword
