Sitecore
To configure Sitecore integration, follow these steps based on your data source type:
1. Folder Configuration:
Source Type: Select "Folders".
Folder ID: Enter the unique Folder ID where your data is stored in Sitecore.
Recursive Search: Enable "Recursive search" if you want to include subfolders and their contents.
2. Manifest File Configuration:
Source Type: Choose "SiteCore Manifest".
S3 Bucket Path: Provide the path to the S3 bucket where the Sitecore manifest file is stored.
Credentials: Ensure the correct credentials are set up for both Sitecore and the S3 bucket to enable access.
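The effect of the "Recursive search" option (and of the recursive flag on manifest sources) can be pictured with a small sketch. The folder tree, the collect_items function, and the item names below are hypothetical and only illustrate the idea; they are not part of Sitecore or of this integration.

```python
from typing import Dict, List

# Hypothetical in-memory folder tree: folder ID -> direct items and subfolder IDs.
TREE: Dict[str, Dict[str, List[str]]] = {
    "root": {"items": ["a.html"], "subfolders": ["child"]},
    "child": {"items": ["b.html"], "subfolders": ["grandchild"]},
    "grandchild": {"items": ["c.html"], "subfolders": []},
}


def collect_items(folder_id: str, recursive: bool) -> List[str]:
    """Return the items in a folder, optionally descending into subfolders."""
    folder = TREE[folder_id]
    found = list(folder["items"])
    if recursive:
        # Walk each subfolder depth-first and append its contents.
        for sub_id in folder["subfolders"]:
            found.extend(collect_items(sub_id, recursive=True))
    return found


print(collect_items("root", recursive=False))  # ['a.html']
print(collect_items("root", recursive=True))   # ['a.html', 'b.html', 'c.html']
```

With recursion disabled only the items directly inside the chosen folder are picked up; with it enabled, every nested subfolder contributes its items as well.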
Sample Manifest Structure
A manifest defines how content is extracted, filtered, and processed. It provides reusable, declarative configurations for large-scale ingestion workflows.
{
  "knowledge_articles": {
    "type": "Knowledge Articles",
    "extraction_rules": {
      "content_type": "html",
      "content_fields": ["Body"],
      "metadata_fields": {
        "Date": "date"
      }
    },
    "filters": ["PLACEHOLDER_FILTER"],
    "sources": [
      { "path": "/placeholder/path/one", "recursive": true },
      { "path": "/placeholder/path/two", "recursive": true }
    ]
  },
  "questions_answers": {
    "type": "questions_answers",
    "extraction_rules": {
      "content_type": "html",
      "content_fields": [
        "Question",
        "Answer"
      ]
    },
    "filters": ["Release"],
    "sources": [
      { "path": "/placeholder/path/faq", "recursive": true }
    ]
  },
  "video_image_files": {
    "type": "Video and Image Files",
    "extraction_rules": {
      "content_type": "file",
      "content_fields": [],
      "folder_type": "Media Library",
      "metadata_fields": {
        "ItemID": "keyword",
        "Display Date": "date"
      }
    },
    "filters": [
      "*png",
      "*jpg",
      "*vidyard player",
      "*mp4"
    ],
    "sources": [
      { "path": "/placeholder/media/path/one", "recursive": true },
      { "path": "/placeholder/media/path/two", "recursive": true },
      { "path": "/placeholder/media/path/three", "recursive": true }
    ]
  },
  "pdf_files": {
    "type": "PDF Files",
    "extraction_rules": {
      "content_type": "file",
      "content_fields": [],
      "metadata_fields": {
        "ItemID": "keyword",
        "Display Date": "date",
        "Date": "date"
      }
    },
    "filters": [
      "*pdf",
      "*docx"
    ],
    "sources": [
      { "path": "/placeholder/path/documents", "recursive": true }
    ]
  },
  "archived_files": {
    "type": "Archived Files",
    "extraction_rules": {
      "content_type": "file",
      "content_fields": [],
      "metadata_fields": {
        "ItemID": "keyword",
        "Date": "date",
        "Display Date": "date"
      }
    },
    "filters": [
      "*zip",
      "*rar"
    ],
    "unzipped_files_filter": [
      ["*.pdf", "*.ppt", "*.html"]
    ],
    "sources": [
      { "path": "/placeholder/path/archive", "recursive": true }
    ]
  }
}
knowledge_articles: Defines the configuration for extracting and processing knowledge article data.
  type: Specifies the type of data or document being processed (e.g., PLACEHOLDER_TYPE).
  extraction_rules: Defines the rules for extracting content and metadata from the documents.
    content_type: Specifies the format of the document content (e.g., html or file).
    content_fields: Lists the fields within the content to extract (e.g., text or specific data).
    metadata_fields: Defines the metadata fields and their types (e.g., author, creation date).
  filters: Defines filters to apply to the content, such as specific keywords or conditions.
  sources: Specifies the paths to the data sources, with an option to recurse through subdirectories.
    path: The location of the data source.
    recursive: A boolean indicating whether to process subdirectories.
questions_answers: Defines the configuration for extracting question and answer pairs.
video_image_files: Configuration for processing video and image files.
  folder_type: Specifies the type of folder containing the files.
  filters: A list of filters to apply specifically for video and image files.
pdf_files: Defines the configuration for processing PDF files.
  filters: Specifies filters such as file extensions for processing PDFs.
archived_files: Configuration for processing archived files (e.g., ZIP files).
  unzipped_files_filter: Filters for processing files after they have been unzipped.
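Before uploading a manifest to S3, it can help to check it locally. The following is a minimal sketch, not the integration's own validator: the validate_manifest function and the choice of which keys count as required (type, extraction_rules, sources) are assumptions drawn from the sample above, and the real service may accept or reject manifests differently.

```python
import json
from typing import Any, List

# Assumption: every section in the sample carries these keys, so treat them as required.
REQUIRED_SECTION_KEYS = ("type", "extraction_rules", "sources")


def validate_manifest(text: str) -> List[str]:
    """Return a list of problems found in a manifest; an empty list means none were found."""
    try:
        manifest: Any = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(manifest, dict):
        return ["top level must be a JSON object"]

    problems: List[str] = []
    for name, section in manifest.items():
        if not isinstance(section, dict):
            problems.append(f"{name}: section must be an object")
            continue
        for key in REQUIRED_SECTION_KEYS:
            if key not in section:
                problems.append(f"{name}: missing '{key}'")

        rules = section.get("extraction_rules")
        if isinstance(rules, dict):
            # metadata_fields maps a field name to its type, e.g. "Date": "date".
            if not isinstance(rules.get("metadata_fields", {}), dict):
                problems.append(f"{name}: metadata_fields must map field names to types")
            if not isinstance(rules.get("content_fields", []), list):
                problems.append(f"{name}: content_fields must be a list")

        sources = section.get("sources", [])
        for source in sources if isinstance(sources, list) else []:
            if not isinstance(source, dict) or not isinstance(source.get("path"), str):
                problems.append(f"{name}: each source needs a string 'path'")
            elif not isinstance(source.get("recursive", False), bool):
                problems.append(f"{name}: 'recursive' must be true or false")
    return problems


sample = """
{
  "pdf_files": {
    "type": "PDF Files",
    "extraction_rules": {
      "content_type": "file",
      "content_fields": [],
      "metadata_fields": {"ItemID": "keyword", "Date": "date"}
    },
    "filters": ["*pdf"],
    "sources": [{"path": "/placeholder/path/documents", "recursive": true}]
  }
}
"""
print(validate_manifest(sample))  # []
```

The function collects every problem it finds instead of stopping at the first one, so a single run shows everything that needs fixing in a large manifest.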