Google Drive Connector

For a general introduction to our Google Drive Connector, please refer to https://www.rheininsights.com/en/connectors/google-drive.php .

Google Drive Configuration

Our Google Drive Connector is intended to index all documents of an organization. To do so, it needs the following:

  1. A service account user

  2. This service account authenticates using a JSON key, so the organization policy constraint iam.disableServiceAccountKeyCreation must not be enforced

  3. The service account must use domain-wide delegation

  4. The Google Drive API and the Admin SDK API (Admin Directory) must be enabled in the project of this service account

In order to set up the crawl user, please proceed as follows:

Create a new Project

  1. Open https://console.cloud.google.com/cloud-resource-manager

  2. Create a new project and give it a name such as “Google Drive Connector”

Enable the APIs

  1. Open the project

  2. Open APIs & Services

  3. Click on Enable APIs and Services

  4. Search for Google Drive API and click on it

  5. Click on Enable

  6. Go back to Enable APIs and Services

  7. Search for Admin SDK API and click on Admin SDK API

  8. Click on Enable

Create a Service Account

  1. Within the project, use the search to open the Service Accounts dialog in IAM

  2. Click on Create Service Account

  3. Give it a name and click on Create and continue

  4. At Grant this service account access to project, click Done

  5. At Grant users access to this service account, click Done

  6. In the next dialog, click on the newly created service account

  7. Click on Keys

  8. Click on Add Key

  9. Click on Create new key

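  10. Choose JSON and click on Create. Keep the downloaded key file, as its contents are needed for the content source configuration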
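
Before pasting the key into the connector configuration, it can help to check that the downloaded file is intact. The following is a minimal sketch, not part of the connector: the function name check_service_account_key is hypothetical, and the field list reflects what Google typically writes into a service account key file.

```python
import json

# Fields that a downloaded service account key file normally contains.
REQUIRED_KEY_FIELDS = (
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "client_id", "token_uri",
)


def check_service_account_key(raw_json: str) -> list[str]:
    """Return a list of problems found in the key JSON (empty means it looks fine)."""
    try:
        key = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]

    # Report every required field that is absent or empty.
    problems = [f"missing field: {name}" for name in REQUIRED_KEY_FIELDS if not key.get(name)]
    if key.get("type") and key["type"] != "service_account":
        problems.append(f"unexpected type: {key['type']!r}")
    if key.get("private_key") and "BEGIN PRIVATE KEY" not in key["private_key"]:
        problems.append("private_key does not look like a PEM block")
    return problems
```

The client_id field of the key file normally holds the same value as the service account's Unique ID, which is needed for domain-wide delegation.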

Enable Domain-Wide Delegation

  1. Copy the service account’s Unique ID from the service account’s detail view

  2. Go to Domain-wide Delegation in the Google Workspace Admin Console

  3. Click on Add New

  4. Enter the Unique ID of the crawl account and the following scopes:

    https://www.googleapis.com/auth/drive.metadata.readonly,
    https://www.googleapis.com/auth/drive.file,
    https://www.googleapis.com/auth/drive.readonly,
    https://www.googleapis.com/auth/admin.directory.group,
    https://www.googleapis.com/auth/admin.directory.user,
    https://www.googleapis.com/auth/admin.directory.group.member.readonly
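
To illustrate what domain-wide delegation enables, here is a minimal Python sketch. It is not the connector's own code. It assumes the google-auth package is installed for the second function, and the names scopes_for_admin_console and delegated_credentials are hypothetical helpers. The first function builds the comma-separated scope string for the delegation dialog; the second creates credentials that act on behalf of a given user.

```python
SCOPES = [
    "https://www.googleapis.com/auth/drive.metadata.readonly",
    "https://www.googleapis.com/auth/drive.file",
    "https://www.googleapis.com/auth/drive.readonly",
    "https://www.googleapis.com/auth/admin.directory.group",
    "https://www.googleapis.com/auth/admin.directory.user",
    "https://www.googleapis.com/auth/admin.directory.group.member.readonly",
]


def scopes_for_admin_console(scopes=SCOPES) -> str:
    """Join the scopes into the comma-separated form used in the delegation dialog."""
    return ",".join(scopes)


def delegated_credentials(key_info: dict, impersonated_user: str):
    """Build credentials that act on behalf of impersonated_user.

    key_info is the parsed JSON key of the service account.
    """
    # Imported lazily so the scope helper works without google-auth installed.
    from google.oauth2 import service_account

    return service_account.Credentials.from_service_account_info(
        key_info, scopes=SCOPES, subject=impersonated_user
    )
```

If delegation is not set up correctly, or a scope is missing in the Admin Console, requests made with such credentials are typically rejected when the access token is requested.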

Customer ID

  1. Go to https://admin.google.com > Account > Account Settings > Profile and make a note of your customer ID

Content Source Configuration

The content source configuration of the connector comprises the following mandatory configuration fields.

 

  1. Service name: This identifier tells the Google APIs which client is connecting to them. You will find this ID in the API metrics in your Google Cloud project.

  2. Service user certificate (as JSON): Paste the contents of the private key file for the crawl user here. You generated this key in one of the steps above.

  3. Organization's customer ID: Enter your organization's customer ID here. The section Customer ID above describes where you can find it.

  4. Admin directory user: Enter a valid e-mail address of a Google Directory admin here. This admin needs permission to view users, groups and user-group relationships in your Admin Directory.

  5. Crawl personal drives: This flag tells the connector to crawl or skip personal drives. We recommend indexing personal drives as well.

  6. Crawl items in trash: Enable this flag if you still want to index items which are in the Google Drive trash. By default, this flag is disabled.

  7. Excluded files by extension: Here you can add a list of file suffixes. Files with these suffixes are filtered out while crawling and are not indexed at all.

  8. Excluded drives by regular expression: Here you can add regexes or individual shared drive IDs or names to exclude these drives from crawling.

  9. Included drives by regular expression: Here you can add regexes or individual shared drive IDs to only include these drive(s) in crawling.

  10. Excluded users by regular expression: Here you can add regexes or individual user IDs to exclude the drives of these users from crawling. Please note that if all users who have access to a specific shared drive are excluded, the connector will not index this shared drive either.

  11. Included users by regular expression: Here you can add regexes or individual user IDs to only include their drive(s) in crawling.

  12. Include delegates in ACL: Enable this flag to include delegate users in the ACLs. Each mailbox comes with a Group ACL and delegate users will be part of this group. Otherwise, it will be just the owner.

  13. Maximum content size (MB): This is the file size limit. Files which exceed this size are not crawled.

  14. The general settings are described at General Crawl Settings. You can leave these at their default values.
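
The interplay of the filter fields above can be sketched in Python as follows. This is an illustration only, not the connector's implementation. The class CrawlFilters and all function names are hypothetical. It assumes that patterns must match the whole value, that exclusions take precedence over inclusions, that an empty include list includes everything, and that one MB equals 1024 * 1024 bytes.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CrawlFilters:
    """Hypothetical mirror of some filter fields, not the connector's data model."""
    excluded_extensions: set = field(default_factory=set)
    excluded_users: list = field(default_factory=list)  # regexes or plain user IDs
    included_users: list = field(default_factory=list)
    max_content_size_mb: int = 50


def _matches_any(patterns, value):
    # Assumption: a pattern has to match the whole value.
    return any(re.fullmatch(pattern, value) for pattern in patterns)


def user_is_crawled(filters, user_id):
    if _matches_any(filters.excluded_users, user_id):
        return False
    # Assumption: an empty include list means "include everyone".
    return not filters.included_users or _matches_any(filters.included_users, user_id)


def file_is_crawled(filters, name, size_bytes):
    extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    if extension in filters.excluded_extensions:
        return False
    return size_bytes <= filters.max_content_size_mb * 1024 * 1024


def shared_drive_is_crawled(filters, member_ids):
    # A shared drive is skipped only when every user with access is filtered out.
    return any(user_is_crawled(filters, member) for member in member_ids)
```

For example, with excluded_users set to a pattern covering a contractor domain, a shared drive whose members all belong to that domain would be skipped, while a drive with at least one remaining member would still be crawled.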

Permissions to an item in a personal drive are limited to the owners and to the users with whom the file has been explicitly shared. If a document is shared with everyone who has the link, the connector ignores this permission.
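
The permission rule above can be illustrated with a short Python sketch. It is a simplified assumption about the behavior, not the connector's code. The function name effective_principals is hypothetical. The dictionary keys type, role, emailAddress and domain follow the permission resource of the Google Drive API, where link sharing appears as an entry of type "anyone". How entries of other types are handled here is an assumption.

```python
def effective_principals(permissions):
    """Return the principals kept in the ACL, dropping "anyone with the link" entries."""
    principals = []
    for perm in permissions:
        if perm.get("type") == "anyone":
            # Link sharing does not grant access in the index.
            continue
        identity = perm.get("emailAddress") or perm.get("domain")
        if identity:
            principals.append((perm["type"], identity, perm.get("role")))
    return principals
```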

After entering the configuration parameters, click on Validate. This validates the content crawl configuration directly against the content source. If there are issues when connecting, the validator will indicate them on the page. Otherwise, you can save the configuration and continue with the Content Transformation configuration.

Recommended Crawl Schedules

Google Drive offers a complete change log, so the connector can efficiently detect new, updated and deleted documents. However, due to the large number of changes in an organization, how quickly the connector gets through all changes may vary.

We recommend configuring incremental crawls to run every 60 minutes.

Principal scans should run twice per day. These pick up the user-group relationships, which are important for shared drives.

Full content scans are normally not needed for Google Drive. They are only required if you change the content processing and need to reindex everything. For more information, see Crawl Scheduling.
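
To show why a change log makes incremental crawls cheap, the following Python sketch drains a paged change feed. It is an illustration under assumptions, not the connector's implementation. The function name collect_changes and the fetch_page callback are hypothetical. The keys changes, fileId, removed, nextPageToken and newStartPageToken follow the response format of the changes.list method of the Google Drive API.

```python
def collect_changes(fetch_page, start_token):
    """Drain the change log beginning at start_token.

    fetch_page is a callable that takes a page token and returns one response page.
    Returns (upserted_ids, deleted_ids, next_start_token).
    """
    upserted, deleted = [], []
    token = start_token
    while True:
        page = fetch_page(token)
        for change in page.get("changes", []):
            if change.get("removed"):
                deleted.append(change["fileId"])
            else:
                upserted.append(change["fileId"])
        if "nextPageToken" in page:
            # More pages are pending for this run.
            token = page["nextPageToken"]
        else:
            # The last page carries the token to store for the next incremental crawl.
            return upserted, deleted, page.get("newStartPageToken", token)
```

With the google-api-python-client package, fetch_page could wrap a call such as drive.changes().list(pageToken=token).execute(). Each incremental crawl then only touches the files reported since the stored token, which is why a full content scan is rarely necessary.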