GitHub Enterprise Cloud Connector

For a general introduction to our GitHub Enterprise Cloud Connector, please refer to RheinInsights GitHub Enterprise Cloud Connector.

Hard Disk Requirements

Please note that due to the nature of git and GitHub, the connector needs to clone repositories. The cloned repositories are stored in the connector’s tmp folder and remain there for the duration of the crawl. At the end of each crawl, the connector removes these folders.

From a storage perspective this means that the connector needs sufficient disk space in its working folder to store the respective cloned repositories. The required amount of storage is the same as when you execute a git clone command for every repository in scope.
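If you want to verify up front that the working folder offers sufficient space, a minimal sketch along the following lines can help. The folder path and the size estimate are placeholders which you need to adapt to your deployment; this is an illustration only, not part of the connector.

    import java.io.IOException;
    import java.nio.file.FileStore;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class DiskSpaceCheck {

        public static void main(String[] args) throws IOException {
            // Placeholder: the connector's working folder on the crawl server
            Path workingFolder = Path.of("/opt/connector/tmp");
            // Placeholder: estimated total size of git clone for all repositories in scope
            long estimatedCloneBytes = 50L * 1024 * 1024 * 1024; // e.g. 50 GB

            FileStore store = Files.getFileStore(workingFolder);
            long usableBytes = store.getUsableSpace();

            if (usableBytes < estimatedCloneBytes) {
                System.err.printf("Only %,d MB free, but about %,d MB are needed for the clones.%n",
                        usableBytes / (1024 * 1024), estimatedCloneBytes / (1024 * 1024));
            } else {
                System.out.println("Sufficient disk space for the repositories in scope.");
            }
        }
    }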

GitHub Enterprise Cloud Configuration

To allow the connector to crawl your GitHub instance, please configure a GitHub App as follows. For this, you need sufficient administrative privileges for your organization.

Your GitHub Organization’s Name

First, take note of your organization’s name, which appears in the URL https://github.com/organizations/<OrgName>/settings

Create a New GitHub App

  1. Navigate to your apps, i.e., https://github.com/settings/apps

  2. Click on “New GitHub App”

Configure the App Fundamentals

On the next page, do the following:

  1. Give the app a name, e.g., RheinInsights GitHub Connector.

  2. Set a homepage URL, for instance https://www.rheininsights.com.

  3. Disable “Expire user authorization tokens”.

  4. Disable webhooks.

  5. Then add the following repository permissions:

    1. Knowledge bases: read-only

    2. Contents: read-only

    3. Discussions: read-only

    4. Issues: read-only

    5. Metadata: read-only

  6. And the following organization permissions:

    1. Knowledge bases: read-only

    2. Members: read-only

    3. Organization codespaces: read-only

    4. Projects: read-only

  7. Click on “Create GitHub App”.

Install the App and Generate a Private Key

On the next page, or under https://github.com/settings/apps/<your app name>, do the following:

  1. Navigate to the bottom of the page, to the section “Private keys”.

  2. Click on “Generate a private key”.

  3. Store the downloaded key securely until you need it for the connector configuration.

On the right-hand side, click on “Install App”. Then:

  1. Click on “Install”.

  2. Choose all repositories or just the repositories you want to crawl into your search / RAG application.

  3. Click on “Install”.

Afterwards,

  1. On the next page, take note of the installation ID, which is part of the URL:
    https://github.com/organizations/<OrgName>/settings/installations/<InstallationId>

  2. Click on “App settings” and take note of the app ID.
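For background: the connector uses exactly these three values, the app ID, the installation ID and the private key, to authenticate as a GitHub App. This follows GitHub’s standard app authentication flow, in which a short-lived JWT is signed with the private key and then exchanged for an installation access token. The following sketch illustrates the flow with plain JDK classes. It is not the connector’s implementation; the IDs and the file name are placeholders, and it assumes the downloaded key was first converted from PKCS#1 to PKCS#8 format (e.g. with openssl pkcs8 -topk8 -nocrypt).

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.KeyFactory;
    import java.security.PrivateKey;
    import java.security.Signature;
    import java.security.spec.PKCS8EncodedKeySpec;
    import java.time.Instant;
    import java.util.Base64;

    public class GitHubAppAuth {

        public static void main(String[] args) throws Exception {
            String appId = "123456";           // placeholder: the app ID noted above
            String installationId = "7654321"; // placeholder: the installation ID from the URL

            // Load the PKCS#8-converted private key (placeholder file name)
            String pem = Files.readString(Path.of("app-pkcs8.pem"))
                    .replaceAll("-----(BEGIN|END) PRIVATE KEY-----", "")
                    .replaceAll("\\s", "");
            PrivateKey key = KeyFactory.getInstance("RSA")
                    .generatePrivate(new PKCS8EncodedKeySpec(Base64.getDecoder().decode(pem)));

            // Build a short-lived JWT: issuer is the app ID, validity under 10 minutes
            Base64.Encoder b64 = Base64.getUrlEncoder().withoutPadding();
            long now = Instant.now().getEpochSecond();
            String header = b64.encodeToString("{\"alg\":\"RS256\",\"typ\":\"JWT\"}".getBytes());
            String payload = b64.encodeToString(String.format(
                    "{\"iat\":%d,\"exp\":%d,\"iss\":%s}", now - 60, now + 540, appId).getBytes());

            Signature sig = Signature.getInstance("SHA256withRSA");
            sig.initSign(key);
            sig.update((header + "." + payload).getBytes());
            String jwt = header + "." + payload + "." + b64.encodeToString(sig.sign());

            // Exchange the JWT for an installation access token
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://api.github.com/app/installations/" + installationId + "/access_tokens"))
                    .header("Authorization", "Bearer " + jwt)
                    .header("Accept", "application/vnd.github+json")
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // contains the installation token
        }
    }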

Content Source Configuration

The content source configuration of the connector comprises the following mandatory configuration fields.

  1. GitHub Edition: Please set this to GitHub Enterprise Cloud.

  2. Organization ID. Please add your GitHub organization’s name, see Step “Your GitHub Organization’s Name”.

  3. App ID. Please add the app ID, as noted in Step “Install the App and Generate a Private Key”.

  4. Installation ID. Please add the installation ID, as noted in Step “Install the App and Generate a Private Key”.

  5. Private key. Please upload the private key which you downloaded in Step “Install the App and Generate a Private Key” above.

  6. Exclude archived repositories from crawling. Enable this option to skip archived repositories. Repositories which become archived between crawls will be removed from the search index.

  7. Exclude disabled repositories from crawling. Enable this option to skip disabled repositories. Repositories which become disabled between crawls will be removed from the search index.

  8. Head-only mode. This crawls only the most recent revision of a branch in each repository.
    By default, the connector crawls only the latest revision. If you want to crawl all changed files in all commits, you need to disable this option.

  9. Included repositories into crawling. Here you can add regular expressions (in Java format) which match the names of the repositories you want to crawl; only matching repositories are crawled (see the sketch after this list).

  10. Excluded repositories from crawling. Here you can add regular expressions (in Java format) which match the names of the repositories you do not want to crawl.

  11. Included branches into crawling. Here you can add regular expressions (in Java format) which match the names of the branches you want to crawl. These apply to all repositories.
    By default, the connector crawls refs/remotes/origin/master. If you want to crawl more or all branches, remove this option or leave it empty.

  12. Excluded branches from crawling. Here you can add regular expressions (in Java format) which match the names of the branches you do not want to crawl. This setting also applies to all repositories.

  13. Maximum content size (MB): This is a file size limit; files which exceed this size are not crawled.

  14. The general settings are described at General Crawl Settings and can usually be left at their default values.
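To illustrate the include and exclude fields from items 9 to 12, the sketch below shows how such patterns behave as Java regular expressions. The patterns and repository names are examples only, and matching them against the full repository name is an assumption made for this illustration.

    import java.util.List;
    import java.util.regex.Pattern;

    public class RepositoryFilterExample {

        public static void main(String[] args) {
            // Example include/exclude patterns in Java regex format (placeholders)
            Pattern include = Pattern.compile("docs-.*|website"); // crawl docs-* and website
            Pattern exclude = Pattern.compile(".*-archive");      // skip *-archive repositories

            for (String name : List.of("docs-api", "website", "docs-api-archive", "playground")) {
                boolean crawled = include.matcher(name).matches()
                        && !exclude.matcher(name).matches();
                System.out.printf("%s -> %s%n", name, crawled ? "crawled" : "skipped");
            }
        }
    }

The same mechanics apply to the branch patterns, for example a pattern such as refs/remotes/origin/(main|master) would, under this assumption, restrict crawling to the default branches.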

Please note that due to the nature of the GitHub APIs, the connector clones repositories during crawling and stores the contents in a tmp directory on the crawl server. After a repository has been crawled, this temporary folder is deleted.

After entering the configuration parameters, click on Validate. This validates the content crawl configuration directly against the content source. If there are issues when connecting, the validator indicates these on the page. Otherwise, you can save the configuration and continue with the Content Transformation configuration.

Recommended Crawl Schedules

Incremental Crawls

Even though Git comes with a native change log, the connector always crawls entire revisions. For each commit it adds the files which have changed, and during incremental crawling it skips revisions it has already seen. Incremental crawls will not delete skipped files, not even in “head-only” mode.

Depending on the size of your GitHub instance, we recommend configuring incremental crawls to run every 12 or 24 hours.
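As an illustration of this behavior, the sketch below uses the open-source JGit library to list the files which changed with each commit of a locally cloned repository. It mirrors the described crawl only conceptually and is not the connector’s code; the repository path is a placeholder.

    import java.io.File;

    import org.eclipse.jgit.api.Git;
    import org.eclipse.jgit.diff.DiffEntry;
    import org.eclipse.jgit.diff.DiffFormatter;
    import org.eclipse.jgit.revwalk.RevCommit;
    import org.eclipse.jgit.util.io.DisabledOutputStream;

    public class ChangedFilesPerCommit {

        public static void main(String[] args) throws Exception {
            // Placeholder: a repository previously cloned into the working folder
            try (Git git = Git.open(new File("/opt/connector/tmp/my-repo"))) {
                for (RevCommit commit : git.log().call()) {
                    if (commit.getParentCount() == 0) {
                        continue; // skip the root commit for brevity
                    }
                    try (DiffFormatter diff = new DiffFormatter(DisabledOutputStream.INSTANCE)) {
                        diff.setRepository(git.getRepository());
                        // Files which changed between this commit and its first parent
                        for (DiffEntry entry : diff.scan(commit.getParent(0), commit)) {
                            System.out.printf("%s %s in %s%n",
                                    entry.getChangeType(), entry.getNewPath(), commit.getName());
                        }
                    }
                }
            }
        }
    }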

Principal Crawls

Principal scans should run twice per day. These scans index the user and group relationships, which are important for private repositories.
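For background, the membership information which a principal scan collects corresponds to what GitHub’s REST API exposes, for instance via the organization members endpoint. A minimal sketch, assuming an installation access token (such as the one obtained in the authentication sketch above) in the GITHUB_TOKEN environment variable:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ListOrganizationMembers {

        public static void main(String[] args) throws Exception {
            String organization = "my-org"; // placeholder: your organization's name (<OrgName> from above)
            String token = System.getenv("GITHUB_TOKEN"); // placeholder: installation access token

            // GitHub REST API: list the members of an organization (first page, 100 entries)
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://api.github.com/orgs/" + organization + "/members?per_page=100"))
                    .header("Authorization", "Bearer " + token)
                    .header("Accept", "application/vnd.github+json")
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // JSON array of member accounts
        }
    }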

Full Content Scans

Furthermore, full content scans are normally only needed if you run “head-only” mode, or if you change content processing and need to reindex everything. For more information see Crawl Scheduling.