Dataset search configuration
The following settings control how search results are retrieved and ranked. Adjust them to tune the relevance and accuracy of search results for your dataset.
Prompt Context Options
Choose how retrieved embeddings are used to build prompts for LLMs.
Option 1: Use Embedding Chunks
Description: Use the vector-embedded chunks directly as context.
How it Works:
Semantic search retrieves Top-K chunks
Chunks are concatenated as context
LLM receives query + chunk context
LLM generates answer from chunks
Option 2: Summarize Chunks
Description: Summarize retrieved chunks before using as context.
How it Works:
Semantic search retrieves Top-K chunks
Chunks are sent to LLM for summarization
Summary becomes the context
LLM generates answer from summary
Option 3: Use Document Embeddings
Description: Retrieve entire documents based on chunk matches.
How it Works:
Semantic search retrieves Top-K chunks
System identifies source documents for chunks
Full document text becomes context (or specific pages)
LLM generates answer from document
Sub-Options:
Use Entire Document: Include full document text
Use Matching Page: Include only the page containing the match
Include Previous and Next Page: Add surrounding pages for context
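The three context options above can be sketched as a single context-assembly step. This is an illustrative sketch, not the product's actual API; the function and data shapes are assumptions.

```python
# Sketch of the three prompt-context options. Names and data shapes are
# illustrative, not the product's actual API.

def build_context(chunks, documents, mode="chunks", summarize=None):
    """Assemble LLM context from Top-K retrieved chunks.

    chunks:    list of dicts with "text" and "doc_id" keys
    documents: dict mapping doc_id -> full document text
    mode:      "chunks" (Option 1), "summary" (Option 2), "document" (Option 3)
    summarize: callable standing in for the summarization LLM (Option 2)
    """
    if mode == "chunks":
        # Option 1: concatenate the retrieved chunks directly.
        return "\n\n".join(c["text"] for c in chunks)
    if mode == "summary":
        # Option 2: summarize the chunks first; the summary becomes the context.
        return summarize("\n\n".join(c["text"] for c in chunks))
    if mode == "document":
        # Option 3: expand each matching chunk to its full source document,
        # preserving retrieval order and deduplicating documents.
        doc_ids = dict.fromkeys(c["doc_id"] for c in chunks)
        return "\n\n".join(documents[d] for d in doc_ids)
    raise ValueError(f"unknown mode: {mode}")
```

The same pattern extends to the page-level sub-options by substituting page text for full document text.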
Top-K Configuration
Controls the number of results retrieved from the vector database.
Top-K: Maximum number of matching vectors to retrieve (1-100). Default is 5.
Reranker
The Reranker refines search results by reordering the Top-K results with a reranker model, improving precision.
Configuration:
Enable Reranker: Check "Enable reranker" in dataset settings.
Top-N: Number of results to return after reranking (1-100, ≤ Top-K). Recommended: 3-5.
Reranker Threshold: Filters out results whose relevance score falls below this value (0.0-1.0, default: 0.5).
Reranking Pipeline:
Top-K results are retrieved.
Reranker scores results.
Chunks are filtered by threshold, and Top-N are returned.
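The reranking pipeline above can be sketched as follows; the scoring function stands in for the reranker model and is purely illustrative.

```python
# Sketch of the reranking pipeline: score Top-K results, filter by
# threshold, return Top-N. `score_fn` stands in for the reranker model.

def rerank(results, score_fn, top_n=3, threshold=0.5):
    """Score Top-K results, drop those below the threshold, return Top-N."""
    scored = [(score_fn(r), r) for r in results]          # reranker scores results
    kept = [(s, r) for s, r in scored if s >= threshold]  # filter by threshold
    kept.sort(key=lambda sr: sr[0], reverse=True)         # highest score first
    return [r for _, r in kept[:top_n]]                   # return Top-N
```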
Advanced query reconstruction
Query Rewrite
Automatically improves user queries for better semantic search results by handling ambiguity and expanding context.
Methods:
Multi-Query Rewrite: Breaks complex queries into simpler sub-queries.
How it Works: LLM identifies sub-questions, generates focused queries, and merges results.
Query Expansion: Adds contextual information to queries for better retrieval.
How it Works: LLM enhances the query with additional context.
Configuration: Enable "Advanced query reconstruction" and choose a method (Multi-query or Query expansion).
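The multi-query rewrite flow can be sketched as below. The `rewrite` and `search` callables are stand-ins for the LLM rewrite step and the vector search; only the split/merge/deduplicate logic is shown.

```python
# Sketch of multi-query rewrite: an LLM (stubbed here as `rewrite`) splits a
# complex query into focused sub-queries, each sub-query is searched
# independently, and the results are merged with duplicates removed.

def multi_query_search(query, rewrite, search):
    sub_queries = rewrite(query)      # LLM generates focused sub-queries
    merged, seen = [], set()
    for sq in sub_queries:
        for hit in search(sq):        # retrieve results per sub-query
            if hit not in seen:       # deduplicate across sub-queries
                seen.add(hit)
                merged.append(hit)
    return merged
```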
ACL Pre-filtering
ACL (Access Control List) Pre-filtering ensures that users can only access search results from documents they are authorized to view, enforcing document-level security.
How It Works:
Documents are tagged with ACL metadata during embedding.
User Identity is passed with each search request.
The vector database query is pre-filtered so that only documents the user is permitted to access are searched.
Security Model:
Access control is enforced before semantic search.
Group-based permissions are supported, ensuring users cannot see unauthorized content.
The process is transparent to the end user.
Configuration
Enable ACL Pre-filtering: Check "Enable ACL restriction" in the dataset settings.
User Data: Automatically included in the session, with details such as user ID, name, and groups.
Example User Data:
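The exact shape of the session user data is not documented here; a representative example with hypothetical field names might look like:

```json
{
  "user_id": "jdoe",
  "name": "Jane Doe",
  "groups": ["engineering", "all-employees"]
}
```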
Prerequisites
Document Tagging: Documents must carry ACL metadata. This is supported for connectors that expose permissions, such as Box, Dropbox, SharePoint, SiteCore, Google Drive, Google File Storage, and Amazon S3 Manifest.
Embedding: ACL metadata is embedded with documents, and the vector database is configured to filter based on ACL fields.
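One way to picture the pre-filtering step is as a filter clause built from the user's identity and applied before the vector query. This is a minimal sketch assuming an OpenSearch-style filter; the field names `acl_users` and `acl_groups` are illustrative, not the product's actual schema.

```python
# Sketch of ACL pre-filtering: build an OpenSearch-style filter clause from
# the user's identity so the vector search only touches authorized documents.
# Field names ("acl_users", "acl_groups") are illustrative.

def acl_prefilter(user_id, groups):
    """Return a filter clause matching documents the user may access."""
    return {
        "bool": {
            "should": [
                {"term": {"acl_users": user_id}},     # explicit per-user grant
                {"terms": {"acl_groups": groups}},    # any of the user's groups
            ],
            "minimum_should_match": 1,
        }
    }
```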
Dynamic Metadata Filtering
Dynamic Metadata Filtering automatically generates filters based on query context, improving the relevance and precision of search results by narrowing the search space before semantic retrieval.
How It Works:
LLM receives the query and metadata schema.
Relevant filters are determined based on the query.
OpenSearch query DSL is generated.
The filter is applied to the vector database, and semantic search is performed on the filtered subset.
Configuration
Enable Dynamic Metadata Filtering: Check "Enable dynamic metadata filtering" in dataset settings.
Optional: Customize the prompt template for specific filtering strategies.
Metadata Configuration: Define metadata fields in your dataset (e.g., document type, date, department).
Example Metadata:
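A hypothetical metadata record using the fields suggested above (document type, date, department) might look like:

```json
{
  "document_type": "report",
  "date": "2024-03-31",
  "department": "engineering"
}
```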
Examples
Example 1: Date Range Filtering
Query: "Show me reports from last quarter."
Generated Filter:
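A generated filter for this query might resemble the following OpenSearch query DSL; the field name and date boundaries are illustrative.

```json
{
  "range": {
    "date": {
      "gte": "2024-01-01",
      "lte": "2024-03-31"
    }
  }
}
```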
Example 2: Multi-Field Filtering
Query: "Find engineering team documents about API design."
Generated Filter:
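A generated filter for this query might combine several term filters, leaving the topic ("API design") to semantic search; the field names and values below are illustrative.

```json
{
  "bool": {
    "filter": [
      { "term": { "department": "engineering" } },
      { "term": { "document_type": "design-doc" } }
    ]
  }
}
```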
Hybrid Search
Hybrid Search combines semantic vector search and keyword-based search to optimize retrieval accuracy by merging the strengths of both methods.
How It Works:
Vector Search:
Embed query.
Find semantically similar chunks.
Score based on cosine similarity (0.0-1.0).
Keyword Search:
Tokenize query.
Find matching keywords using BM25.
Score based on BM25 (0.0-1.0).
Score Combination:
Final Score = (vector_weight × vector_score) + (keyword_weight × keyword_score).
Merged Results:
Deduplicate and sort by combined score.
Return Top-K results.
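The score-combination and merging steps above can be sketched as follows. Inputs are per-chunk score maps from the vector and keyword searches; the weights follow the formula above.

```python
# Sketch of hybrid score combination: combine per-chunk scores from vector
# and keyword search, deduplicate via the key union, sort, return Top-K.

def hybrid_merge(vector_scores, keyword_scores,
                 vector_weight=0.6, keyword_weight=0.4, top_k=5):
    """Combine per-chunk scores, deduplicate, and return Top-K chunk ids."""
    combined = {}
    for cid in set(vector_scores) | set(keyword_scores):   # union = dedupe
        combined[cid] = (vector_weight * vector_scores.get(cid, 0.0)
                         + keyword_weight * keyword_scores.get(cid, 0.0))
    ranked = sorted(combined, key=combined.get, reverse=True)
    return ranked[:top_k]
```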
Configuration
Enable Hybrid Search: Check "Enable Hybrid Search" in settings.
Query Template: Customize the weights for vector and keyword search in JSON format.
Example Query Template:
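The exact template format is not reproduced here; a hypothetical JSON template consistent with the weight and field parameters described in this section might look like:

```json
{
  "vector_weight": 0.6,
  "keyword_weight": 0.4,
  "fields": ["title^3.0", "body"]
}
```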
Weight Parameters
vector_weight: Higher values prioritize semantic search (0.0-1.0).
Example: 0.6 = 60% vector search.
keyword_weight: Higher values prioritize exact keyword matches (0.0-1.0).
Example: 0.4 = 40% keyword search.
Fields: Specify the text fields to search, with optional per-field boosting (e.g., "title^3.0").
Common Configurations:
Semantic-heavy: 0.7 vector, 0.3 keyword
Balanced: 0.6 vector, 0.4 keyword (default)
Keyword-heavy: 0.4 vector, 0.6 keyword