GitHub Enterprise Server Connector

For a general introduction to our GitHub Enterprise Server Connector, please refer to RheinInsights GitHub Enterprise Server Connector.

Hard Disk Requirements

Please note that due to the nature of Git and GitHub, the connector needs to clone repositories. The cloned repositories are stored in the connector’s tmp folder and stay there for the duration of the crawl. At the end of each crawl, the connector removes these folders.

From a storage perspective, this means that the connector needs sufficient disk space in its working folder to store all cloned repositories. The required storage is the same as when you execute a git clone command for every repository in scope.
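
As a rough pre-flight check, the storage requirement above can be compared with the free space in the working folder. The following is only a minimal sketch: the function name, the list of repository sizes and the safety factor are illustrative assumptions and not part of the connector.

```python
import shutil
import tempfile


def has_enough_space(repo_sizes_kb, work_dir, safety_factor=1.2):
    """Return True if work_dir has room for all clones.

    repo_sizes_kb: estimated size of each repository in KB
                   (hypothetical input, gather it however suits you).
    safety_factor: assumed headroom multiplier, not a connector setting.
    """
    required_bytes = sum(repo_sizes_kb) * 1024 * safety_factor
    free_bytes = shutil.disk_usage(work_dir).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    sizes_kb = [120_000, 45_000, 800_000]  # example figures only
    print(has_enough_space(sizes_kb, tempfile.gettempdir()))
```

Point work_dir at the volume which actually hosts the connector’s tmp folder, since free space can differ between volumes.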

GitHub Enterprise Server Configuration

To allow the connector to crawl your GitHub instance, please configure a GitHub App as follows. This requires sufficient administrative privileges for your organization.

Your GitHub Organization’s Name and Hostname

Take note of the hostname and your organization name, both of which are part of the URL:

https://<hostname>/organizations/<organization>/settings/profile
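
If you prefer to derive both values from a copied URL, a small helper like the one below could do it. This is a sketch under assumptions: parse_org_url is a hypothetical name, and the returned hostname is formatted with scheme and trailing slash because that is the form the Hostname field of the content source configuration expects.

```python
from urllib.parse import urlsplit


def parse_org_url(url):
    """Return (hostname with scheme and trailing slash, organization name)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Expected path layout: /organizations/<organization>/...
    if len(segments) < 2 or segments[0] != "organizations":
        raise ValueError(f"not an organization URL: {url}")
    hostname = f"{parts.scheme}://{parts.netloc}/"
    return hostname, segments[1]


if __name__ == "__main__":
    # Example URL with made-up host and organization names.
    print(parse_org_url(
        "https://githubserver.organization.com/organizations/org/settings/profile"
    ))
```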

Create a New GitHub App

  1. Navigate to your apps, i.e., https://<hostname>/organizations/<organization>/settings/apps

  2. Click on “New GitHub App”

Configure the App Fundamentals

On the next page, do the following:

  1. Give it a name, e.g. RheinInsights GitHub Connector

  2. Set a homepage URL, for instance https://www.rheininsights.com

  3. Disable “expire user authorization tokens”

  4. Disable webhooks

  5. Then add the following repository permissions:

    1. Knowledge bases: read-only

    2. Contents: read-only

    3. Discussions: read-only

    4. Issues: read-only

    5. Metadata: read-only

  6. Add the following organization permissions:

    1. Knowledge bases: read-only

    2. Members: read-only

    3. Organization codespaces: read-only

    4. Projects: read-only

  7. Click on “Create GitHub App” for this enterprise

Install the App and Generate a Private Key

On the next page, or at https://<hostname>/organizations/<organization>/settings/apps/<yourapp>:

  1. Click on “generate a private key” to jump to the bottom of the page

  2. Generate the key

  3. Store the key securely until you need it for the connector configuration.

On the right-hand side, click on “Install App”

  1. Click on “Install”

  2. Choose all repositories, or only the repositories which you want to crawl into your search / RAG

  3. Click on “Install”

Afterwards,

  1. On the next page, take note of the installation id, which is part of the URL:
    https://<hostname>/organizations/<organization>/settings/installations/<InstallationId>

  2. Click on “App settings” and take note of the app id
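
To avoid copy-and-paste mistakes, the installation id can also be read from the URL programmatically. The helper below is a minimal sketch; its name is hypothetical, and it assumes that the installation id is the numeric last path segment of the installations URL.

```python
import re
from urllib.parse import urlsplit

# Assumed path layout: /organizations/<organization>/settings/installations/<id>
_INSTALLATION_PATH = re.compile(
    r"^/organizations/[^/]+/settings/installations/(?P<id>\d+)/?$"
)


def installation_id_from_url(url):
    """Extract the numeric installation id from an installation URL."""
    match = _INSTALLATION_PATH.match(urlsplit(url).path)
    if match is None:
        raise ValueError(f"not an installation URL: {url}")
    return int(match.group("id"))


if __name__ == "__main__":
    # Made-up host, organization and id for illustration.
    print(installation_id_from_url(
        "https://github.example.com/organizations/acme/settings/installations/42"
    ))
```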

Content Source Configuration

The content source configuration of the connector comprises the following mandatory configuration fields.

  1. GitHub Edition. Please choose GitHub Enterprise Server as the edition to crawl.

  2. Organization Id. Please add your GitHub organization’s name, as noted in step “Your GitHub Organization’s Name and Hostname”.

  3. Hostname. Please add the hostname (with https:// or http:// and with a trailing /) of your installation.
    For instance https://githubserver.organization.com/organizations/org/ becomes https://githubserver.organization.com/

  4. Public keys for SSL certificates. This configuration is needed if you run the environment with self-signed certificates or with certificates which are not known to the Java key store.
    We use a straightforward approach to validate SSL certificates. To render a certificate valid, add the modulus of its public key to this text field. You can access this modulus by viewing the certificate within the browser.

  5. App id. Please add the app id noted in step “Install the App and Generate a Private Key”.

  6. Installation id. Please add the installation id noted in step “Install the App and Generate a Private Key”.

  7. Private key. Please upload the private key which you generated in step “Install the App and Generate a Private Key” above.

  8. Exclude archived repositories from crawling. Enable this option to skip all archived repositories. Repositories which become archived between crawls will be removed from the search index.

  9. Exclude disabled repositories from crawling. Enable this option to skip all disabled repositories. Repositories which become disabled between crawls will be removed from the search index.

  10. Head-only mode. This crawls only the most recent version of a branch in each repository. This mode is the default; turn it off explicitly to crawl all revisions and index changed files.

  11. Included repositories into crawling. Add regular expressions (in Java format) which match the names of the repositories you want to crawl. Only matching repositories are crawled.

  12. Excluded repositories from crawling. Add regular expressions (in Java format) which match the names of the repositories you do not want to crawl.

  13. Included branches into crawling. Add regular expressions (in Java format) which match the names of the branches you want to crawl. Only matching branches are crawled. These expressions apply to all repositories.
    By default, we crawl refs/remotes/origins/master. If you want to crawl other branches, please add them here.

  14. Excluded branches from crawling. Add regular expressions (in Java format) which match the names of the branches you do not want to crawl. This setting also applies to all repositories.

  15. Maximum content size (MB). This is the file size limit. Files which exceed this size won’t be crawled.

  16. The general settings are described at General Crawl Settings; you can leave them at their default values.
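
The interplay of the include and exclude expressions can be tried out before saving the configuration. The sketch below is an illustration only and rests on several assumptions: that an empty include list means “everything”, that excludes take precedence over includes, and that an expression must match the whole name. The connector itself evaluates Java regular expressions; Python’s re module is used here as a stand-in, and simple patterns such as the ones shown behave the same in both dialects.

```python
import re


def is_in_scope(name, include_patterns, exclude_patterns):
    """Decide whether a repository or branch name would be crawled.

    Assumed semantics (not confirmed by the connector documentation):
    - no include patterns means every name is included
    - a name matching any exclude pattern is dropped
    - patterns must match the full name
    """
    included = not include_patterns or any(
        re.fullmatch(p, name) for p in include_patterns
    )
    excluded = any(re.fullmatch(p, name) for p in exclude_patterns)
    return included and not excluded


if __name__ == "__main__":
    # Made-up repository names for illustration.
    for repo in ["docs-site", "docs-archive", "api"]:
        print(repo, is_in_scope(repo, ["docs-.*"], [".*-archive"]))
```

The same check applies to branch names when the branch expressions are passed in instead.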

Please note that due to the nature of the GitHub APIs, the connector clones repositories during crawling and stores the contents in a tmp directory on the crawl server. After crawling a repository, this tmp folder is deleted.

After entering the configuration parameters, click on validate. This validates the content crawl configuration directly against the content source. If there are issues when connecting, the validator will indicate these on the page. Otherwise, you can save the configuration and continue with Content Transformation configuration.

Recommended Crawl Schedules

Incremental Crawls

Even though Git comes with a native change log, the connector always crawls an entire revision and adds the files which have changed in a commit. During incremental crawling, it skips revisions which it has already seen. Incremental crawls do not delete skipped files, not even in “head-only” mode.
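
The behavior described above can be pictured with a toy model. This is not the connector’s implementation; the data structures (a list of revisions with their changed files, a set of seen revision ids and a dictionary standing in for the search index) are hypothetical and only illustrate that seen revisions are skipped, changed files are added and nothing is deleted.

```python
def incremental_crawl(revisions, seen_revisions, index):
    """Apply unseen revisions to the index, oldest first.

    revisions: list of (revision_id, {path: content}) with changed files
    seen_revisions: set of revision ids handled in earlier crawls
    index: dict standing in for the search index
    """
    for revision_id, changed_files in revisions:
        if revision_id in seen_revisions:
            continue  # already crawled, skip the whole revision
        for path, content in changed_files.items():
            index[path] = content  # add or update, never delete
        seen_revisions.add(revision_id)
    return index


if __name__ == "__main__":
    seen = {"a1"}
    index = {"old.txt": "v1"}
    revisions = [("a1", {"old.txt": "ignored"}), ("b2", {"new.txt": "v1"})]
    print(incremental_crawl(revisions, seen, index))
```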

Depending on the size of your GitHub instance, we recommend configuring incremental crawls to run every 12 or 24 hours.

Principal Crawls

Principal scans should run twice per day. These scans index the user-group relationships, which are important for private repositories.

Full Content Scans

Furthermore, full content scans are normally only needed if you run “head-only” mode, or if you change content processing and need to reindex everything. For more information, see Crawl Scheduling.