Documentation
Atlassian Confluence Connector
For a general introduction to the connector, please refer to RheinInsights Atlassian Confluence Server and Data Center Connector.
Confluence Configuration
Crawl User
The connector needs a crawl user which has the following permissions:
1. Read access to all spaces and pages which should be indexed
2. Permission to access all space and page permissions
3. Read access to all users and groups
4. Read access to all group memberships
This means that up to Confluence 8, you need to provide a user who is a space administrator for all spaces in crawl scope. Beginning with Confluence 9, you need to provide a Confluence administrator (token) to fulfill Requirement 4 above.
Password Policy
The crawl user must not be subject to password rotation; otherwise, the password must be updated in the connector configuration whenever it changes.
Legacy Remote APIs
The connector needs to access space permissions, and unfortunately there are no REST APIs available for these. Therefore, if you want to run the connector in an enterprise search scenario, please enable Remote API (XML-RPC & SOAP) under “Further Configuration” within the Confluence administration.
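For illustration, the following minimal sketch shows what such a legacy remote API call looks like from a client, here using the Apache XML-RPC library. The host name and credentials are placeholders, and the getSpacePermissionSets method name is an assumption based on the legacy confluence2 remote API; the connector's actual calls may differ.

import java.net.URL;

import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

public class SpacePermissionsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the legacy XML-RPC endpoint of Confluence.
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        config.setServerURL(new URL("https://confluence.example.com/rpc/xmlrpc"));
        XmlRpcClient client = new XmlRpcClient();
        client.setConfig(config);

        // Log in with the crawl user; the legacy API returns a session token.
        String token = (String) client.execute("confluence2.login",
                new Object[] { "crawl-user", "secret" });

        // Fetch the permission sets of one space. The method name is an
        // assumption based on the legacy confluence2 remote API.
        Object permissions = client.execute("confluence2.getSpacePermissionSets",
                new Object[] { token, "SPACEKEY" });
        System.out.println(permissions);
    }
}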
Content Source Configuration
The content source configuration of the connector comprises the following mandatory configuration fields.
Confluence host name, which is the fully qualified domain name or host name of the Confluence instance, including port
Public keys for SSL certificates: this configuration is needed if you run the environment with self-signed certificates or certificates which are not known to the Java key store.
We use a straightforward approach to validate SSL certificates. To mark a certificate as valid, add the modulus of its public key to this text field. You can find the modulus by viewing the certificate in your browser.
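Alternatively, the modulus can be read programmatically. The following minimal sketch (with a hypothetical host name) connects to the server, accepts its certificate without validation, and prints the modulus of the leaf certificate in hexadecimal, assuming an RSA certificate:

import java.net.URL;
import java.security.cert.X509Certificate;
import java.security.interfaces.RSAPublicKey;

import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class PrintCertificateModulus {
    public static void main(String[] args) throws Exception {
        // Accept any certificate: we only want to read it, not to trust it.
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, new TrustManager[] { new X509TrustManager() {
            public void checkClientTrusted(X509Certificate[] chain, String authType) { }
            public void checkServerTrusted(X509Certificate[] chain, String authType) { }
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        } }, null);

        // Hypothetical host name; replace it with your Confluence instance.
        HttpsURLConnection connection = (HttpsURLConnection)
                new URL("https://confluence.example.com").openConnection();
        connection.setSSLSocketFactory(context.getSocketFactory());
        connection.setHostnameVerifier((hostname, session) -> true);
        connection.connect();

        // The first entry of the chain is the server's leaf certificate.
        X509Certificate leaf = (X509Certificate) connection.getServerCertificates()[0];
        RSAPublicKey publicKey = (RSAPublicKey) leaf.getPublicKey();

        // Print the modulus in hexadecimal, as browsers display it.
        System.out.println(publicKey.getModulus().toString(16));
        connection.disconnect();
    }
}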
Crawl user: the user name used by the connector to crawl the instance. Please see the section above for the necessary user permissions.
User password: the corresponding password for the crawl user.
Excluded files from crawling: here you can add file extensions to filter attachments which should not be sent to the search engine.
Excluded spaces from crawling: here you can add space names or keys to exclude these from crawling
Please note that a change to this list will trigger an incremental crawl which removes all pages and attachments of excluded spaces from the index.
Included spaces from crawling: this is an include list. If empty, all spaces except the excluded spaces are indexed. If you add at least one entry (even an empty one), only the listed spaces will be included in crawling. Please note that a change to this list will also trigger an incremental crawl which removes all pages and attachments of spaces that are no longer included.
Enable page postprocessing. If enabled (recommended), the connector performs lookups for people references on Confluence pages. Moreover, it extracts the titles for page references on a Confluence page.
Exclude comments from crawling. If not checked, the document bodies will contain the Confluence comments. For performance reasons, you can enable this option to skip comments.
Perform separate call to fetch attachments. If enabled, a separate call is performed per page to extract the attachments on the page. If disabled, a REST expansion is used to fetch this information.
Compute new attachment ids (recommended). If enabled, attachment ids are prefixed with “att”.
Page size for requests. Defines how many pages will be fetched per API request. Default is 100.
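To illustrate what the attachment expansion and the page size of the two fields above mean at the API level, the following sketch requests one batch of pages via the Confluence REST API; the host name and credentials are placeholders:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class FetchPagesExample {
    public static void main(String[] args) throws Exception {
        String auth = Base64.getEncoder().encodeToString("crawl-user:secret".getBytes());

        // limit=100 corresponds to the page size setting. The expand parameter
        // pulls page bodies and attachments in the same call, which is the
        // REST expansion used when no separate attachment call is performed.
        // (Expanding body.view instead of body.storage would return rendered HTML.)
        HttpRequest request = HttpRequest.newBuilder(URI.create(
                "https://confluence.example.com/rest/api/content"
                        + "?type=page&start=0&limit=100"
                        + "&expand=body.storage,children.attachment"))
                .header("Authorization", "Basic " + auth)
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}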
Rate Limiting. This defines a rate limit for the connector, i.e., it limits the number of API requests per second (across all threads).
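Conceptually, this corresponds to a single rate limiter shared by all crawl threads, as in the following sketch based on Guava's RateLimiter; whether the connector implements it this way is an assumption:

import com.google.common.util.concurrent.RateLimiter;

public class RateLimitedCalls {
    // One limiter shared across all crawl threads, here five requests per second.
    private static final RateLimiter LIMITER = RateLimiter.create(5.0);

    static void callApi(Runnable request) {
        LIMITER.acquire(); // blocks until a permit becomes available
        request.run();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            callApi(() -> System.out.println("request sent"));
        }
    }
}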
Response timeout (ms). Defines how long the connector waits until an API call is aborted and the operation is marked as failed.
Connection timeout (ms). Defines how long the connector waits for a connection for an API call.
Socket timeout (ms). Defines how long the connector waits for receiving all data from an API call.
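The three timeouts map to the standard settings of an HTTP client. The following sketch illustrates the distinction using Apache HttpClient 5; the values and the library choice are assumptions for illustration, not the connector's actual internals:

import org.apache.hc.client5.http.config.RequestConfig;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManagerBuilder;
import org.apache.hc.core5.http.io.SocketConfig;
import org.apache.hc.core5.util.Timeout;

public class TimeoutConfiguration {
    public static CloseableHttpClient build() {
        RequestConfig requestConfig = RequestConfig.custom()
                // Connection timeout: how long to wait for the TCP connection.
                .setConnectTimeout(Timeout.ofMilliseconds(30_000))
                // Response timeout: how long to wait before the call is aborted
                // and the operation is marked as failed.
                .setResponseTimeout(Timeout.ofMilliseconds(120_000))
                .build();

        // Socket timeout: how long to wait for data while reading the response.
        SocketConfig socketConfig = SocketConfig.custom()
                .setSoTimeout(Timeout.ofMilliseconds(120_000))
                .build();

        return HttpClients.custom()
                .setConnectionManager(PoolingHttpClientConnectionManagerBuilder.create()
                        .setDefaultSocketConfig(socketConfig)
                        .build())
                .setDefaultRequestConfig(requestConfig)
                .build();
    }
}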
Disable using RPC APIs for Space Permissions (impacts security trimming). If enabled, this disables lookups against the Confluence RPC APIs, but it also disables the retrieval of space permissions.
Storage View. Defines whether fully rendered pages (i.e., with Confluence macros expanded) are used or not.
Use fetch APIs. Defines whether the scan pages API or the fetch pages API should be used. The scan pages API is more efficient and scales well even for spaces with thousands of pages.
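The difference lies in the pagination model: the scan pages API uses cursors instead of start offsets, so deep pagination stays cheap. A sketch of such a request, analogous to the example above (the scan endpoint is, to our knowledge, only available in newer Confluence Data Center versions; host name and credentials are again placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class ScanPagesExample {
    public static void main(String[] args) throws Exception {
        String auth = Base64.getEncoder().encodeToString("crawl-user:secret".getBytes());

        // Unlike /rest/api/content with start/limit offsets, the scan endpoint
        // returns a cursor in each response which is passed into the next
        // request, so pagination stays cheap even for very large spaces.
        HttpRequest request = HttpRequest.newBuilder(URI.create(
                "https://confluence.example.com/rest/api/content/scan?limit=100"))
                .header("Authorization", "Basic " + auth)
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}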
Enable fetching rendered content. If enabled, you need to provide one or more strings in the matching strings field. If one of these strings is discovered in a page, the connector will fetch the rendered contents of the page.
This is for instance helpful for pages which contain complex and dynamic macros which otherwise would not be visible to the connector.
Please note that the connector will only fetch the page again if the page itself changed. This is independent of the dynamic data which might be shown in a macro; i.e., if the macro data changes because other pages have changed, the connector will not detect this, for efficiency reasons.
The general settings are described at General Crawl Settings and you can leave these at their default values.
After entering the configuration parameters, click on validate. This validates the content crawl configuration directly against the content source. If there are issues when connecting, the validator will indicate these on the page. Otherwise, you can save the configuration and continue with Content Transformation configuration.
Limitations for Incremental Crawls and Recommended Crawl Schedules
Atlassian Confluence does not offer a complete change log. This means that incremental crawls can detect new and changed Confluence spaces, pages and attachments. However, removed spaces, pages and attachments, as well as significant changes to the space permission schemes, will not be detected in incremental crawls.
Therefore, we recommend configuring incremental crawls to run every 15 to 30 minutes, full scan principal crawls to run twice a day, and a weekly full scan of the documents of the Confluence instance. For more information, see Crawl Scheduling.