Sitecore

To configure the Sitecore integration, use the option that matches your data source type:

1. Folder Configuration:

  • Source Type: Select "Folders".

  • Folder ID: Enter the unique ID of the Sitecore folder where your data is stored.

  • Recursive Search: Enable "Recursive search" if you want to include subfolders and their contents.

2. Manifest File Configuration:

  • Source Type: Choose "SiteCore Manifest".

  • S3 Bucket Path: Provide the path to the S3 bucket where the Sitecore manifest file is stored.

  • Credentials: Ensure the correct credentials are set up for both Sitecore and the S3 bucket to enable access.
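
The two options above can be summarized as a small settings check. This is only an illustrative sketch: the key names (`source_type`, `folder_id`, `recursive`, `manifest_s3_path`) are hypothetical and not taken from the product, and the code simply shows which value each source type needs before it can run.

```python
# Hypothetical key names; the real UI or API fields may differ.
REQUIRED_FIELDS = {
    "Folders": ["folder_id"],
    "SiteCore Manifest": ["manifest_s3_path"],
}


def missing_fields(config: dict) -> list:
    """Return the required settings that are absent or empty for the chosen source type."""
    source_type = config.get("source_type")
    if source_type not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown source type: {source_type!r}")
    return [name for name in REQUIRED_FIELDS[source_type] if not config.get(name)]


# Folder-based source: a folder ID plus the optional recursive flag.
folder_config = {
    "source_type": "Folders",
    "folder_id": "PLACEHOLDER_FOLDER_ID",
    "recursive": True,
}

# Manifest-based source with the S3 path left empty on purpose.
manifest_config = {"source_type": "SiteCore Manifest", "manifest_s3_path": ""}

print(missing_fields(folder_config))    # []
print(missing_fields(manifest_config))  # ['manifest_s3_path']
```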

Sample Manifest Structure

A manifest defines how content is extracted, filtered, and processed. It provides reusable, declarative configurations for large-scale ingestion workflows.

{
  "knowledge_articles": {
    "type": "Knowledge Articles",
    "extraction_rules": {
      "content_type": "html",
      "content_fields": ["Body"],
      "metadata_fields": {
        "Date": "date"
      }
    },
    "filters": ["PLACEHOLDER_FILTER"],
    "sources": [
      { "path": "/placeholder/path/one", "recursive": true },
      { "path": "/placeholder/path/two", "recursive": true }
    ]
  },

  "questions_answers": {
    "type": "questions_answers",
    "extraction_rules": {
      "content_type": "html",
      "content_fields": [
        "Question",
        "Answer"
      ]
    },
    "filters": ["Release"],
    "sources": [
      { "path": "/placeholder/path/faq", "recursive": true }
    ]
  },

  "video_image_files": {
    "type": "Video and Image Files",
    "extraction_rules": {
      "content_type": "file",
      "content_fields": [],
      "folder_type": "Media Library",
      "metadata_fields": {
        "ItemID": "keyword",
        "Display Date": "date"
      }
    },
    "filters": [
      "*png",
      "*jpg",
      "*vidyard player",
      "*mp4"
    ],
    "sources": [
      { "path": "/placeholder/media/path/one", "recursive": true },
      { "path": "/placeholder/media/path/two", "recursive": true },
      { "path": "/placeholder/media/path/three", "recursive": true }
    ]
  },

  "pdf_files": {
    "type": "PDF Files",
    "extraction_rules": {
      "content_type": "file",
      "content_fields": [],
      "metadata_fields": {
        "ItemID": "keyword",
        "Display Date": "date",
        "Date": "date"
      }
    },
    "filters": [
      "*pdf",
      "*docx"
    ],
    "sources": [
      { "path": "/placeholder/path/documents", "recursive": true }
    ]
  },

  "archived_files": {
    "type": "Archived Files",
    "extraction_rules": {
      "content_type": "file",
      "content_fields": [],
      "metadata_fields": {
        "ItemID": "keyword",
        "Date": "date",
        "Display Date": "date"
      }
    },
    "filters": [
      "*zip",
      "*rar"
    ],
    "unzipped_files_filter": [
      ["*.pdf","*.ppt","*.html"]
    ],
    "sources": [
      { "path": "/placeholder/path/archive", "recursive": true }
    ]
  }
}

  • knowledge_articles: Defines the configuration for extracting and processing knowledge article data.

  • type: Specifies the type of data or document being processed (e.g., "Knowledge Articles" or "PDF Files").

  • extraction_rules: Defines the rules for extracting content and metadata from the documents.

    • content_type: Specifies the format of the document content (e.g., html or file).

    • content_fields: Lists the fields whose content is extracted (e.g., Body, or Question and Answer). The sample leaves this list empty for file-based sections.

    • metadata_fields: Maps each metadata field name to its type (e.g., "ItemID": "keyword", "Display Date": "date").

  • filters: Lists the filters applied to the content, such as a keyword (e.g., Release) or a file-name pattern (e.g., *pdf).

  • sources: Specifies the paths to the data sources, with an option to recurse through subdirectories.

    • path: The location of the data source.

    • recursive: A boolean indicating whether to process subdirectories.

  • questions_answers: Defines the configuration for extracting question and answer pairs.

  • video_image_files: Configuration for processing video and image files.

    • folder_type: Specifies the type of folder containing the files (e.g., Media Library).

    • filters: Lists the patterns that select video and image files (e.g., *png, *jpg, *mp4).

  • pdf_files: Defines the configuration for processing PDF and other document files.

    • filters: Lists the file-extension patterns to process (*pdf and *docx in the sample).

  • archived_files: Configuration for processing archived files (e.g., ZIP files).

    • unzipped_files_filter: Lists the patterns that select which files are processed after an archive has been unzipped (e.g., *.pdf, *.ppt, *.html).
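
The field descriptions above can be turned into a quick sanity check before a manifest is uploaded. The sketch below is not part of the product: `validate_manifest` and `matches_filters` are hypothetical helpers, the list of required keys is inferred from the sample only, `content_type` is checked against just the two values shown in the sample, and the filter patterns are assumed to behave like case-insensitive glob patterns.

```python
import json
from fnmatch import fnmatch

# Keys that every section in the sample manifest carries (an assumption, not a spec).
REQUIRED_SECTION_KEYS = ("type", "extraction_rules", "filters", "sources")


def validate_manifest(manifest: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for name, section in manifest.items():
        for key in REQUIRED_SECTION_KEYS:
            if key not in section:
                problems.append(f"{name}: missing '{key}'")
        rules = section.get("extraction_rules", {})
        # Only the two content types that appear in the sample are accepted here.
        if rules.get("content_type") not in ("html", "file"):
            problems.append(f"{name}: content_type should be 'html' or 'file'")
        for source in section.get("sources", []):
            if not isinstance(source.get("path"), str):
                problems.append(f"{name}: source without a string 'path'")
            if not isinstance(source.get("recursive"), bool):
                problems.append(f"{name}: 'recursive' should be a boolean")
    return problems


def matches_filters(file_name: str, filters: list) -> bool:
    """Assumes filters such as "*pdf" are case-insensitive glob patterns."""
    return any(fnmatch(file_name.lower(), pattern.lower()) for pattern in filters)


sample = json.loads("""
{
  "pdf_files": {
    "type": "PDF Files",
    "extraction_rules": {
      "content_type": "file",
      "content_fields": [],
      "metadata_fields": {"ItemID": "keyword", "Date": "date"}
    },
    "filters": ["*pdf", "*docx"],
    "sources": [{"path": "/placeholder/path/documents", "recursive": true}]
  }
}
""")

filters = sample["pdf_files"]["filters"]
print(validate_manifest(sample))              # []
print(matches_filters("Guide.PDF", filters))  # True
print(matches_filters("photo.png", filters))  # False
```

Parsing the file with `json.loads` also catches syntax problems, such as a metadata field listed without a type, before any field-level check runs.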
