# Sitecore

To configure the Sitecore integration, follow the steps that match your data source type:

#### 1. **Folder Configuration**:

* **Source Type**: Select **"Folders"**.
* **Folder ID**: Enter the unique **Folder ID** where your data is stored in Sitecore.
* **Recursive Search**: Enable **"Recursive search"** if you want to include subfolders and their contents.

#### 2. **Manifest File Configuration**:

* **Source Type**: Choose **"SiteCore Manifest"**.
* **S3 Bucket Path**: Provide the path to the **S3 bucket** where the Sitecore manifest file is stored.
* **Credentials**: Ensure the correct credentials are set up for both Sitecore and the S3 bucket to enable access.

**Sample Manifest Structure**

A manifest defines how content is extracted, filtered, and processed. It provides reusable, declarative configurations for large-scale ingestion workflows.

```json
{
  "knowledge_articles": {
    "type": "Knowledge Articles",
    "extraction_rules": {
      "content_type": "html",
      "content_fields": ["Body"],
      "metadata_fields": {
        "Date": "date"
      }
    },
    "filters": ["PLACEHOLDER_FILTER"],
    "sources": [
      { "path": "/placeholder/path/one", "recursive": true },
      { "path": "/placeholder/path/two", "recursive": true }
    ]
  },

  "questions_answers": {
    "type": "questions_answers",
    "extraction_rules": {
      "content_type": "html",
      "content_fields": [
        "Question",
        "Answer"
      ]
    },
    "filters": ["Release"],
    "sources": [
      { "path": "/placeholder/path/faq", "recursive": true }
    ]
  },

  "video_image_files": {
    "type": "Video and Image Files",
    "extraction_rules": {
      "content_type": "file",
      "content_fields": [],
      "folder_type": "Media Library",
      "metadata_fields": {
        "ItemID": "keyword",
        "Display Date": "date"
      }
    },
    "filters": [
      "*png",
      "*jpg",
      "*vidyard player",
      "*mp4"
    ],
    "sources": [
      { "path": "/placeholder/media/path/one", "recursive": true },
      { "path": "/placeholder/media/path/two", "recursive": true },
      { "path": "/placeholder/media/path/three", "recursive": true }
    ]
  },

  "pdf_files": {
    "type": "PDF Files",
    "extraction_rules": {
      "content_type": "file",
      "content_fields": [],
      "metadata_fields": {
        "ItemID": "keyword",
        "Display Date": "date",
        "Date": "date"
      }
    },
    "filters": [
      "*pdf",
      "*docx"
    ],
    "sources": [
      { "path": "/placeholder/path/documents", "recursive": true }
    ]
  },

  "archived_files": {
    "type": "Archived Files",
    "extraction_rules": {
      "content_type": "file",
      "content_fields": [],
      "metadata_fields": {
        "ItemID": "keyword",
        "Date": "date",
        "Display Date": "date"
      }
    },
    "filters": [
      "*zip",
      "*rar"
    ],
    "unzipped_files_filter": [
      ["*.pdf","*.ppt","*.html"]
    ],
    "sources": [
      { "path": "/placeholder/path/archive", "recursive": true }
    ]
  }
}
```
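
Because a single syntax slip makes the whole manifest unreadable, it can help to check the file locally before uploading it to S3. The following is a minimal sketch, not part of the product: `check_manifest` and `REQUIRED_SECTION_KEYS` are hypothetical names, and the set of required keys is an assumption drawn from the sample above rather than from a documented schema.

```python
import json

# Keys that every section of the sample manifest contains.
# Assumption: inferred from the sample, not from a published schema.
REQUIRED_SECTION_KEYS = {"type", "extraction_rules", "filters", "sources"}


def check_manifest(text):
    """Return a list of problems found in the manifest text (empty if none)."""
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(manifest, dict):
        return ["top level must be a JSON object"]

    problems = []
    for name, section in manifest.items():
        if not isinstance(section, dict):
            problems.append(f"{name}: section must be an object")
            continue
        # Report each expected key that the section lacks.
        for key in sorted(REQUIRED_SECTION_KEYS - section.keys()):
            problems.append(f"{name}: missing '{key}'")
        # metadata_fields, when present, should map field names to types.
        rules = section.get("extraction_rules")
        if isinstance(rules, dict):
            metadata = rules.get("metadata_fields", {})
            if not isinstance(metadata, dict):
                problems.append(f"{name}: metadata_fields must be an object")
        # Every source entry needs at least a path.
        sources = section.get("sources", [])
        if not isinstance(sources, list):
            problems.append(f"{name}: sources must be a list")
            continue
        for source in sources:
            if not isinstance(source, dict) or "path" not in source:
                problems.append(f"{name}: every source needs a 'path'")
    return problems
```

Calling `check_manifest(open("manifest.json").read())` (the file name is a placeholder) returns an empty list when nothing is flagged. The check is structural only; it does not verify that the Sitecore paths exist.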

* **`knowledge_articles`**: Defines the configuration for extracting and processing knowledge article data.
* **`type`**: Specifies the type of data or document being processed (e.g., `Knowledge Articles`).
* **`extraction_rules`**: Defines the rules for extracting content and metadata from the documents.
  * **`content_type`**: Specifies the format of the document content (e.g., `html` or `file`).
  * **`content_fields`**: Lists the item fields whose content is extracted (e.g., `Body`, or `Question` and `Answer`). The file-based sections in the sample leave this list empty.
  * **`metadata_fields`**: Maps each metadata field name to its type (e.g., `"ItemID": "keyword"`, `"Display Date": "date"`).
* **`filters`**: Lists the filters applied when selecting content, such as a keyword (e.g., `Release`) or a file-name pattern (e.g., `*pdf`).
* **`sources`**: Specifies the paths to the data sources, with an option to recurse through subdirectories.
  * **`path`**: The location of the data source.
  * **`recursive`**: A boolean indicating whether to process subdirectories.
* **`questions_answers`**: Defines the configuration for extracting question and answer pairs.
* **`video_image_files`**: Configuration for processing video and image files.
  * **`folder_type`**: Specifies the type of folder containing the files.
  * **`filters`**: File-name patterns that select video and image files (e.g., `*png`, `*jpg`, `*mp4`).
* **`pdf_files`**: Defines the configuration for processing PDF and other document files.
  * **`filters`**: File-name patterns that select the documents to process (in the sample, `*pdf` and `*docx`).
* **`archived_files`**: Configuration for processing archived files (e.g., ZIP or RAR files).
  * **`unzipped_files_filter`**: File-name patterns that select which extracted files are processed after an archive is unpacked (e.g., `*.pdf`, `*.ppt`, `*.html`).
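
The file patterns in `filters` and `unzipped_files_filter` look like glob patterns. The sketch below shows how such patterns behave under that assumption, using Python's standard `fnmatch` module. It is illustrative only: the product's actual matching rules (case sensitivity in particular) are not stated here, and `matches_any` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def matches_any(name, patterns):
    """Return True if name matches at least one glob-style pattern.

    Assumption: matching is case-insensitive, so both sides are lowercased.
    """
    lowered = name.lower()
    return any(fnmatchcase(lowered, pattern.lower()) for pattern in patterns)


document_filters = ["*pdf", "*docx"]  # patterns from the pdf_files section

print(matches_any("Release-Notes.PDF", document_filters))  # True
print(matches_any("logo.png", document_filters))           # False
```

Under glob semantics, `*pdf` matches any name that ends in `pdf`, with or without a dot, while `*.pdf` requires the dot. The sample uses both styles, so it is worth choosing one deliberately.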
