Drupal CMS Connector

For a general introduction to the connector, please refer to RheinInsights Drupal CMS Enterprise Search and RAG Connector.

Drupal Configuration

Our connector uses the JSON:API module for crawling. This means you need to enable the following modules in your Drupal instance.

Therefore, as an administrator, do the following:

  1. Open your Drupal instance

  2. Open the administration bar

  3. Click on Extend

  4. Enable the module JSON:API

  5. Enable the module Serialization

  6. Enable the module HTTP Basic Authentication

  7. Click on Install

  8. Go to Configuration > Web Services > JSON:API

  9. Limit the allowed operations to "Accept only JSON:API read operations"

  10. Save the configuration
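
You can verify the module setup from outside Drupal by requesting the JSON:API index. Below is a minimal sketch, assuming the hypothetical base URL https://drupal.example.com; depending on your permission setup, the request may additionally need the Authorization header shown in the next section.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class JsonApiCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical base URL; replace with your Drupal instance.
            URI index = URI.create("https://drupal.example.com/jsonapi");

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(index)
                    .header("Accept", "application/vnd.api+json")
                    .GET()
                    .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());

            // A working JSON:API module answers with HTTP 200 and the
            // JSON:API media type application/vnd.api+json.
            System.out.println(response.statusCode());
            System.out.println(response.headers()
                    .firstValue("Content-Type").orElse("<none>"));
        }
    }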

Crawl User

The connector needs a crawl user with the following permissions:

  1. Read access to all pages (nodes) that should be indexed

  2. Read access to all media items that should be indexed

  3. Read access to the entity definitions and languages

The connector uses Basic Auth to authenticate against Drupal.
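
The following is a minimal sketch of such an authenticated request, assuming a hypothetical crawl user search-crawler and the standard page content type; it verifies that Basic Auth works and that the crawl user can read nodes.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class CrawlUserCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical credentials of the crawl user.
            String user = "search-crawler";
            String password = "secret";
            String basicAuth = "Basic " + Base64.getEncoder().encodeToString(
                    (user + ":" + password).getBytes(StandardCharsets.UTF_8));

            // Reading pages (nodes) exercises the most important permission.
            HttpRequest request = HttpRequest.newBuilder(
                            URI.create("https://drupal.example.com/jsonapi/node/page"))
                    .header("Accept", "application/vnd.api+json")
                    .header("Authorization", basicAuth)
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // HTTP 200 means authentication and read access are in place;
            // HTTP 401/403 points to missing Basic Auth or missing permissions.
            System.out.println(response.statusCode());
        }
    }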

Content Source Configuration

The content source configuration of the connector comprises the following configuration fields.

  1. Instance base URL: the fully qualified domain name or host name of the Drupal instance

  2. Username: the username of the crawl user

  3. Password: the password of this crawl user

  4. Public keys for SSL certificates: this configuration is needed if you run the environment with self-signed certificates or certificates that are not known to the Java key store.
    We use a straightforward approach to validate SSL certificates. In order to render a certificate valid, add the modulus of the public key into this text field. You can access this modulus by viewing the certificate details within the browser, or extract it programmatically as shown in the first sketch after this list.

  5. Index draft articles: if enabled, draft articles are indexed as well. Otherwise, only articles in published state are indexed.

  6. Excluded files from crawling: here you can add file extensions to filter attachments which should not be sent to the search engine.

  7. Rate limiting: defines a rate limit for the connector, i.e., limits the number of API requests per second (across all threads).

  8. Response timeout (ms): defines how long the connector waits until an API call is aborted and the operation is marked as failed.

  9. Connection timeout (ms): defines how long the connector waits for a connection for an API call.

  10. Socket timeout (ms): defines how long the connector waits for receiving all data from an API call. The second sketch after this list illustrates how these three timeouts typically map to an HTTP client.

  11. The general settings are described at General Crawl Settings; you can leave these at their default values.
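
The following sketch shows one way to obtain the certificate modulus programmatically instead of via the browser. It performs a TLS handshake against the hypothetical host drupal.example.com, accepts any certificate chain solely for the purpose of reading it, and prints the RSA modulus in hexadecimal (assuming an RSA certificate). It is an illustration, not the connector's validation code.

    import java.security.cert.X509Certificate;
    import java.security.interfaces.RSAPublicKey;
    import javax.net.ssl.SSLContext;
    import javax.net.ssl.SSLSocket;
    import javax.net.ssl.SSLSocketFactory;
    import javax.net.ssl.TrustManager;
    import javax.net.ssl.X509TrustManager;

    public class PrintCertificateModulus {
        public static void main(String[] args) throws Exception {
            // Hypothetical host; replace with your Drupal instance.
            String host = "drupal.example.com";

            // Trust manager that accepts any chain, so the handshake also
            // succeeds for self-signed certificates. Only use this for
            // reading the modulus, never for production traffic.
            TrustManager acceptAll = new X509TrustManager() {
                public void checkClientTrusted(X509Certificate[] c, String a) {}
                public void checkServerTrusted(X509Certificate[] c, String a) {}
                public X509Certificate[] getAcceptedIssuers() {
                    return new X509Certificate[0];
                }
            };
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] {acceptAll}, null);

            SSLSocketFactory factory = context.getSocketFactory();
            try (SSLSocket socket = (SSLSocket) factory.createSocket(host, 443)) {
                socket.startHandshake();
                // The server certificate is the first element of the chain.
                X509Certificate cert = (X509Certificate)
                        socket.getSession().getPeerCertificates()[0];
                RSAPublicKey key = (RSAPublicKey) cert.getPublicKey();
                System.out.println(key.getModulus().toString(16));
            }
        }
    }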
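The three timeouts correspond to the usual timeout settings of an HTTP client. The following sketch, using Apache HttpClient 5, illustrates what each value controls; the connector's internal client and exact semantics may differ.

    import org.apache.hc.client5.http.config.RequestConfig;
    import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
    import org.apache.hc.client5.http.impl.classic.HttpClients;
    import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;
    import org.apache.hc.core5.http.io.SocketConfig;
    import org.apache.hc.core5.util.Timeout;

    public class TimeoutExample {
        public static void main(String[] args) {
            // Socket timeout: maximum wait for data arriving on the socket.
            PoolingHttpClientConnectionManager connections =
                    new PoolingHttpClientConnectionManager();
            connections.setDefaultSocketConfig(SocketConfig.custom()
                    .setSoTimeout(Timeout.ofMilliseconds(60_000))
                    .build());

            RequestConfig requestConfig = RequestConfig.custom()
                    // Connection timeout: maximum wait for establishing
                    // a connection to the server.
                    .setConnectTimeout(Timeout.ofMilliseconds(10_000))
                    // Response timeout: maximum wait for the response;
                    // afterwards the call is aborted and the operation fails.
                    .setResponseTimeout(Timeout.ofMilliseconds(60_000))
                    .build();

            CloseableHttpClient client = HttpClients.custom()
                    .setConnectionManager(connections)
                    .setDefaultRequestConfig(requestConfig)
                    .build();
            // Use the client for API calls ...
        }
    }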

After entering the configuration parameters, click on Validate. This validates the content crawl configuration directly against the content source. If there are issues when connecting, the validator indicates these on the page. Otherwise, you can save the configuration and continue with the Content Transformation configuration.

Limitations for Incremental Crawls and Recommended Crawl Schedules

Drupal does not offer a change log. Incremental crawls can therefore detect new and changed Drupal articles and media items, but deleted articles or media items will not be detected.
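
To illustrate the limitation: new and changed content can be found by filtering on the changed field via JSON:API, but a deleted node simply disappears from all result sets, so there is nothing left to query for. Below is a sketch, assuming the hypothetical host drupal.example.com and the article content type; the exact filter syntax and date format can vary with the Drupal version.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class ChangedSinceQuery {
        public static void main(String[] args) throws Exception {
            // Hypothetical timestamp of the last crawl.
            String since = "2025-01-01T00:00:00+00:00";

            // JSON:API filter: all articles whose "changed" field is greater
            // than the last crawl time. Deleted nodes never show up here,
            // which is why a periodic full scan is still required.
            // Readable form of the query string:
            //   filter[since][condition][path]=changed
            //   filter[since][condition][operator]=>
            //   filter[since][condition][value]=2025-01-01T00:00:00+00:00
            String url = "https://drupal.example.com/jsonapi/node/article"
                    + "?filter%5Bsince%5D%5Bcondition%5D%5Bpath%5D=changed"
                    + "&filter%5Bsince%5D%5Bcondition%5D%5Boperator%5D=%3E"
                    + "&filter%5Bsince%5D%5Bcondition%5D%5Bvalue%5D="
                    + URLEncoder.encode(since, StandardCharsets.UTF_8);

            HttpResponse<String> response = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url))
                            .header("Accept", "application/vnd.api+json")
                            .GET()
                            .build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }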

Therefore, we recommend configuring incremental crawls to run every 15 to 30 minutes, as well as a weekly full scan of the documents of the Drupal instance. For more information, see Crawl Scheduling.