Dataset search configuration

The following settings control how search results are retrieved and ranked. Tune them to match your dataset's requirements and to improve the relevance and accuracy of search results.

Prompt Context Options

Choose how retrieved embeddings are used to build prompts for LLMs.

Option 1: Use Embedding Chunks

Description: Use the vector-embedded chunks directly as context.

How it Works:

  1. Semantic search retrieves Top-K chunks

  2. Chunks are concatenated as context

  3. LLM receives query + chunk context

  4. LLM generates answer from chunks

Option 2: Summarize Chunks

Description: Summarize retrieved chunks before using as context.

How it Works:

  1. Semantic search retrieves Top-K chunks

  2. Chunks are sent to LLM for summarization

  3. Summary becomes the context

  4. LLM generates answer from summary

Option 3: Use Document Embeddings

Description: Retrieve entire documents based on chunk matches.

How it Works:

  1. Semantic search retrieves Top-K chunks

  2. System identifies source documents for chunks

  3. Full document text becomes context (or specific pages)

  4. LLM generates answer from document

Sub-Options:

  • Use Entire Document: Include full document text

  • Use Matching Page: Include only the page containing the match

    • Include Previous and Next Page: Add surrounding pages for context
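The three options above differ only in how the context is assembled before the final LLM call. A minimal sketch, assuming hypothetical `summarizer` and `doc_loader` callables (these are illustrative placeholders, not this platform's API):

```python
def build_context(chunks, mode="chunks", summarizer=None, doc_loader=None):
    """Assemble LLM context from retrieved chunks, per the selected option."""
    if mode == "chunks":
        # Option 1: concatenate the Top-K chunks directly.
        return "\n\n".join(c["text"] for c in chunks)
    if mode == "summarize":
        # Option 2: one LLM pass condenses the chunks into a summary.
        return summarizer("\n\n".join(c["text"] for c in chunks))
    if mode == "document":
        # Option 3: pull the full source document for each matched chunk,
        # deduplicating chunks that share a source.
        doc_ids = {c["doc_id"] for c in chunks}
        return "\n\n".join(doc_loader(d) for d in sorted(doc_ids))
    raise ValueError(f"unknown mode: {mode}")
```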

Top-K Configuration

Controls the number of results retrieved from the vector database.

Top-K: Maximum number of matching vectors to retrieve (1-100). Default is 5.

Reranker

The Reranker refines search results by reordering the Top-K results with a reranker LLM, improving precision.

  • Configuration:

    • Enable Reranker: Check "Enable reranker" in dataset settings.

    • Top-N: Number of results to return after reranking (1-100, ≤ Top-K). Recommended: 3-5.

    • Reranker Threshold: Filters results below a relevance score (0.0-1.0, default: 0.5). Example: 0.5 for balanced relevance.

  • Reranking Pipeline:

    • Top-K results are retrieved.

    • Reranker scores results.

    • Chunks are filtered by threshold, and Top-N are returned.
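The pipeline above can be sketched as follows; the `score` callable stands in for the reranker LLM, and the names are illustrative rather than the platform's API:

```python
def rerank(query, chunks, score, top_n=5, threshold=0.5):
    """Score Top-K chunks, drop those below the threshold, return Top-N."""
    scored = [(score(query, c), c) for c in chunks]       # reranker scores results
    kept = [(s, c) for s, c in scored if s >= threshold]  # filter by threshold
    kept.sort(key=lambda sc: sc[0], reverse=True)         # most relevant first
    return [c for _, c in kept[:top_n]]                   # return Top-N
```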

Advanced query reconstruction

Query Rewrite

Automatically improves user queries for better semantic search results by handling ambiguity and expanding context.

  • Methods:

    • Multi-Query Rewrite: Breaks complex queries into simpler sub-queries.

      • How it Works: LLM identifies sub-questions, generates focused queries, and merges results.

    • Query Expansion: Adds contextual information to queries for better retrieval.

      • How it Works: LLM enhances the query with additional context.

  • Configuration: Enable "Advanced query reconstruction" and choose a method (Multi-query or Query expansion).
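As a rough sketch of the multi-query method: sub-queries are searched independently and the hits are merged with duplicates removed. The `llm_split` and `search` callables below are stand-ins for the LLM and vector-search steps, not this platform's API:

```python
def multi_query_search(query, llm_split, search):
    """Split a complex query into sub-queries, search each, merge the results."""
    results, seen = [], set()
    for sub in llm_split(query):       # LLM identifies focused sub-queries
        for hit in search(sub):        # retrieve per sub-query
            if hit["id"] not in seen:  # deduplicate across sub-queries
                seen.add(hit["id"])
                results.append(hit)
    return results
```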

ACL Pre-filtering

ACL (Access Control List) Pre-filtering ensures that users can only access search results from documents they are authorized to view, enforcing document-level security.

How It Works:

  1. Documents are tagged with ACL metadata during embedding.

  2. User Identity is passed with each search request.

  3. The Vector Database is pre-filtered to include only accessible documents based on the user's permissions.
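For example, the pre-filter can be expressed as an OpenSearch boolean filter on the document's ACL fields; the field names and group values here are illustrative, not the platform's exact schema:

```json
{
  "query": {
    "bool": {
      "filter": [
        { "terms": { "acl_groups": ["engineering", "all-employees"] } }
      ]
    }
  }
}
```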

Security Model:

  • Access control is enforced before semantic search.

  • Group-based permissions are supported, ensuring users cannot see unauthorized content.

  • The process is transparent to the end user.

Configuration

  • Enable ACL Pre-filtering: Check "Enable ACL restriction" in the dataset settings.

  • User Data: Automatically included in the session, with details such as user ID, name, and groups.

Example User Data:
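For illustration only, the session's user data might look like this; the exact field names are an assumption, not the platform's schema:

```json
{
  "user_id": "jdoe",
  "name": "Jane Doe",
  "groups": ["engineering", "all-employees"]
}
```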


Prerequisites

  • Document Tagging: Documents must carry ACL metadata. ACL tagging is supported for connectors with permission models, such as Box, Dropbox, SharePoint, Sitecore, Google Drive, Google File Storage, and Amazon S3 Manifest.

  • Embedding: ACL metadata is embedded with documents, and the vector database is configured to filter based on ACL fields.
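As an illustration, ACL metadata attached to a document might look like this; the field names are assumptions, not the exact schema:

```json
{
  "acl": {
    "allowed_users": ["jdoe"],
    "allowed_groups": ["engineering"]
  }
}
```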

Dynamic Metadata Filtering

Dynamic Metadata Filtering automatically generates filters based on query context, improving the relevance and precision of search results by narrowing the search space before semantic retrieval.

How It Works:

  1. LLM receives the query and metadata schema.

  2. Relevant filters are determined based on the query.

  3. OpenSearch query DSL is generated.

  4. The filter is applied to the vector database, and semantic search is performed on the filtered subset.

Configuration

  • Enable Dynamic Metadata Filtering: Check "Enable dynamic metadata filtering" in dataset settings.

  • Optional: Customize the prompt template for specific filtering strategies.

  • Metadata Configuration: Define metadata fields in your dataset (e.g., document type, date, department).

Example Metadata:
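A sketch of document metadata that generated filters could target, using the example fields named above (values are illustrative):

```json
{
  "document_type": "report",
  "date": "2024-03-31",
  "department": "engineering"
}
```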


Examples

Example 1: Date Range Filtering

Query: "Show me reports from last quarter."

Generated Filter:
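A plausible generated filter for this query, expressed as OpenSearch query DSL; the field names and the resolved quarter dates are illustrative assumptions:

```json
{
  "bool": {
    "filter": [
      { "term":  { "document_type": "report" } },
      { "range": { "date": { "gte": "2024-01-01", "lte": "2024-03-31" } } }
    ]
  }
}
```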


Example 2: Multi-Field Filtering

Query: "Find engineering team documents about API design."

Generated Filter:
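A plausible generated filter for this query, combining a metadata constraint with a topical match; the field names are illustrative assumptions:

```json
{
  "bool": {
    "filter": [
      { "term":  { "department": "engineering" } },
      { "match": { "topic": "API design" } }
    ]
  }
}
```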

Hybrid Search

Hybrid Search combines semantic vector search and keyword-based search to optimize retrieval accuracy by merging the strengths of both methods.

How It Works:

  1. Vector Search:

    • Embed query.

    • Find semantically similar chunks.

    • Score based on cosine similarity (0.0-1.0).

  2. Keyword Search:

    • Tokenize query.

    • Find matching keywords using BM25.

    • Score based on BM25, normalized to 0.0-1.0.

  3. Score Combination:

    • Final Score = (vector_weight × vector_score) + (keyword_weight × keyword_score).

  4. Merged Results:

    • Deduplicate and sort by combined score.

    • Return Top-K results.
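The score-combination and merge steps above can be sketched as follows, assuming both score sets are already normalized to 0.0-1.0 (names are illustrative):

```python
def hybrid_merge(vector_hits, keyword_hits,
                 vector_weight=0.6, keyword_weight=0.4, top_k=5):
    """Combine normalized vector and BM25 scores, dedupe, return Top-K."""
    combined = {}
    for doc_id, score in vector_hits.items():
        combined[doc_id] = vector_weight * score
    for doc_id, score in keyword_hits.items():
        # A document found by both searches gets both weighted contributions.
        combined[doc_id] = combined.get(doc_id, 0.0) + keyword_weight * score
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```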

Configuration

  • Enable Hybrid Search: Check "Enable Hybrid Search" in settings.

  • Query Template: Customize the weights for vector and keyword search in JSON format.

Example Query Template:
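A template might look like this, using the weight keys and field-boost syntax described in this section; the exact schema is an illustrative assumption:

```json
{
  "vector_weight": 0.6,
  "keyword_weight": 0.4,
  "fields": ["title^3.0", "content"]
}
```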

Weight Parameters

  • vector_weight: Higher values prioritize semantic search (0.0-1.0).

    • Example: 0.6 = 60% vector search.

  • keyword_weight: Higher values prioritize exact keyword matches (0.0-1.0).

    • Example: 0.4 = 40% keyword search.

  • Fields: Specify text fields to search, with the option to boost field importance (e.g., "title^3.0").

Common Configurations:

  • Semantic-heavy: 0.7 vector, 0.3 keyword

  • Balanced: 0.6 vector, 0.4 keyword (default)

  • Keyword-heavy: 0.4 vector, 0.6 keyword
