ACL hydration

Premium

DataRobot's RAG ACL management capabilities are a premium feature; contact your DataRobot representative for enablement information. This functionality is not available in the DataRobot trial experience.

Access Control List (ACL) hydration enforces fine-grained authorization for vector database (VDB) results based on original document permissions from external sources, such as Google Drive and SharePoint. When files are ingested from these sources, DataRobot captures and maintains their access control information, ensuring that users can only access VDB chunks for documents they have permission to view in the original source system.

Overview

ACL hydration enables organizations to maintain the same access control policies in DataRobot that exist in their source systems. This is particularly important for users who need to ensure that sensitive documents remain protected when used in vector databases for generative AI applications.

For datasets, ACL hydration works as described below:

  1. Ingest files from an external source (e.g., Google Drive or SharePoint) and register them in the File Registry.
  2. DataRobot captures initial ACLs during ingestion and stores them in the cache.
  3. DataRobot continuously monitors ACL changes in the source system using polling mechanisms:
    • For Google Drive, DataRobot queries the Drive Activity API for permission changes.
    • For SharePoint, DataRobot uses Delta Query to track permission updates.
  4. When a user queries a VDB, DataRobot filters results based on the user's permissions in the original source system, ensuring only accessible chunks are returned.

What is the latency for ACL updates?

DataRobot targets low-latency ACL updates to ensure permissions are applied as quickly as possible. While the system is designed to support near real-time updates (targeting a few seconds in the long term), initial implementations may have latencies of approximately 1 minute, which is acceptable for most use cases. Updates that take 10 minutes or longer are not acceptable.
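The thresholds above can be expressed as a simple check, for example when measuring how long a permission change in the source system takes to be applied. This is a minimal sketch, not DataRobot code: the function name, the `degraded` label for latencies between 1 and 10 minutes, and the use of exactly 60 seconds for "approximately 1 minute" are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Thresholds taken from the text above; the constant names are illustrative.
ACCEPTABLE = timedelta(minutes=1)     # "approximately 1 minute" in the initial release
UNACCEPTABLE = timedelta(minutes=10)  # "10 minutes or longer are not acceptable"


def classify_acl_latency(changed_at: datetime, applied_at: datetime) -> str:
    """Classify the delay between a source-side change and its application."""
    latency = applied_at - changed_at
    if latency < timedelta(0):
        raise ValueError("applied_at must not precede changed_at")
    if latency >= UNACCEPTABLE:
        return "unacceptable"
    if latency <= ACCEPTABLE:
        return "acceptable"
    # The text does not name this middle band; "degraded" is an assumption.
    return "degraded"
```

A check like this could feed an alert that fires only on `unacceptable`, while `degraded` results are logged for trend analysis.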

Key capabilities

ACL hydration provides the following capabilities:

| Capability | Description |
|---|---|
| Initial ACL capture | Automatically captures and stores access control information when files are first ingested from Google Drive or SharePoint. |
| ACL change monitoring | Continuously polls source systems to detect and apply permission changes as they occur. |
| VDB result filtering | Filters vector database query results to return only chunks from documents the user has permission to access. |
| Multi-user support | Supports multiple users with different permission levels, ensuring each user sees only the data they're authorized to access. |

Supported data sources

ACL hydration is currently supported for the following data sources:

| Data source | ACL tracking method | Notes |
|---|---|---|
| Google Drive | Drive Activity API | Supports shared drives. Personal drives are not supported in the initial release. |
| SharePoint | Delta Query | Uses Microsoft's Delta Query API to track permission changes. |

Future data sources

Support for additional data sources may be added in future releases. Locally uploaded files and database connections do not support ACL hydration, as they do not have external access control systems to reference.

Setting up ACL hydration

To enable ACL hydration for your organization, you must configure admin connections for the supported data sources.

Prerequisites

Before setting up ACL hydration, ensure you have:

  • An organization administrator account with appropriate permissions.
  • Admin-level connections configured for Google Drive and/or SharePoint.
  • Files ingested from the supported data sources.

Configuration steps

  1. Configure admin connections: Set up organization-level admin connections for Google Drive and SharePoint:

    • Navigate to organization settings.
    • Configure the Google Drive admin connection (recommended: use a service account).
    • Configure the SharePoint admin connection.
    • These connections are used to poll for ACL changes across all files in your organization.
  2. Associate external principals: Map external user identities (Google/SharePoint emails) to DataRobot users:

    • This step associates local DataRobot users with their external identities.
    • For new users, an async job automatically refreshes external principals to minimize the no-ACL gap.
  3. Ingest files: Create data sources and ingest files from Google Drive or SharePoint:

    • Retrieve the connection ID for your Google Drive or SharePoint connection.
    • For Google Drive, extract both the Drive ID and Folder ID from the Google Drive URL.
    • Create the data source and begin ingesting files.
    • Initial ACLs are automatically captured during ingestion.
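As an illustration of the ID-extraction step, the sketch below pulls the ID out of a Google Drive folder link. It assumes the common `https://drive.google.com/drive/folders/<id>` URL shape, which this page does not specify. The function name is hypothetical. Shared drive roots typically open under the same pattern, with the Drive ID in the ID position, but verify this against your own URLs before relying on it.

```python
import re
from urllib.parse import urlparse

# Assumed URL shape: https://drive.google.com/drive/[u/<n>/]folders/<id>
_FOLDER_RE = re.compile(r"/folders/([A-Za-z0-9_-]+)")


def extract_drive_item_id(url: str) -> str:
    """Return the ID that follows '/folders/' in a Google Drive URL."""
    parsed = urlparse(url)
    if parsed.netloc != "drive.google.com":
        raise ValueError(f"not a Google Drive URL: {url!r}")
    match = _FOLDER_RE.search(parsed.path)
    if match is None:
        raise ValueError(f"no '/folders/<id>' segment in: {url!r}")
    return match.group(1)


# Query strings such as '?usp=sharing' are ignored because only the path is searched.
folder_id = extract_drive_item_id(
    "https://drive.google.com/drive/u/0/folders/1AbC_dEf-123?usp=sharing"
)
```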

How ACL hydration works

Initial ACL capture

When files are first ingested from Google Drive or SharePoint:

  1. DataRobot captures the current access control list for each file.
  2. The ACL information is stored in the File Registry, including:
    • The origin of each file (connector type and external file ID).
    • Original catalog ID and catalog version ID.
    • Initial ACLs with user and group permissions.
  3. This information is preserved even if the catalog item is updated or files are modified.
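The stored information listed above can be pictured as a small record per file. The sketch below is an illustrative data model only: the class and field names are hypothetical and do not reflect the actual File Registry schema.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass(frozen=True)
class FileOrigin:
    """Where the file came from (hypothetical field names)."""
    connector_type: str    # e.g. "google_drive" or "sharepoint"
    external_file_id: str  # the file's ID in the source system


@dataclass
class FileAclRecord:
    """Per-file ACL information captured at ingestion (illustrative only)."""
    origin: FileOrigin
    original_catalog_id: str
    original_catalog_version_id: str
    allowed_users: Set[str] = field(default_factory=set)
    allowed_groups: Set[str] = field(default_factory=set)


record = FileAclRecord(
    origin=FileOrigin("google_drive", "file-123"),
    original_catalog_id="catalog-1",
    original_catalog_version_id="version-1",
    allowed_users={"alice@example.com"},
    allowed_groups={"finance@example.com"},
)
```

Keeping the origin immutable (`frozen=True`) mirrors the statement that this information is preserved when the catalog item is updated, while the permission sets remain mutable so that later ACL changes can be applied.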

ACL change monitoring

DataRobot continuously monitors for ACL changes in the source systems:

  1. Polling mechanism: The system queries the source APIs at regular intervals (targeting low latency):
    • For Google Drive, DataRobot uses the Drive Activity API to query for permission change events.
    • For SharePoint, DataRobot uses Delta Query to retrieve incremental changes.
  2. Update processing: When ACL changes are detected:
    • The system updates the stored ACL information in the File Registry.
    • Changes take effect immediately for subsequent VDB queries.
  3. Latency: Updates are designed to be applied within acceptable timeframes (targeting seconds, with 1 minute being acceptable for the initial release).
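The poll-and-apply cycle described above can be sketched as a single polling step that fetches changes since a cursor and applies them to stored ACLs. This is a simplified illustration under stated assumptions: `fetch_changes` is a hypothetical callable standing in for a Drive Activity API or Delta Query call, the in-memory dictionary stands in for the File Registry, and none of the names below are DataRobot or vendor APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class PermissionChange:
    """One permission event reported by the source (hypothetical shape)."""
    external_file_id: str
    principal: str  # user or group identity in the source system
    granted: bool   # True = access added, False = access removed


# Hypothetical fetcher: takes the last cursor, returns (changes, next_cursor).
FetchChanges = Callable[[Optional[str]], Tuple[List[PermissionChange], str]]


def poll_once(
    acl_store: Dict[str, Set[str]],
    fetch_changes: FetchChanges,
    cursor: Optional[str],
) -> str:
    """Apply all changes since `cursor` and return the cursor for the next poll."""
    changes, next_cursor = fetch_changes(cursor)
    for change in changes:
        principals = acl_store.setdefault(change.external_file_id, set())
        if change.granted:
            principals.add(change.principal)
        else:
            principals.discard(change.principal)
    return next_cursor


def fake_source(cursor: Optional[str]) -> Tuple[List[PermissionChange], str]:
    """Stand-in for a real source API, used only to exercise the sketch."""
    if cursor is None:
        return [PermissionChange("file-1", "bob@example.com", True)], "c1"
    return [PermissionChange("file-1", "alice@example.com", False)], "c2"


store: Dict[str, Set[str]] = {"file-1": {"alice@example.com"}}
cursor = poll_once(store, fake_source, None)    # grants Bob access
cursor = poll_once(store, fake_source, cursor)  # revokes Alice's access
```

Carrying a cursor between polls means each cycle processes only incremental changes, which is the property that both change-tracking methods named above are meant to provide.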

VDB result filtering

When a user queries a vector database:

  1. Permission check: DataRobot checks the user's permissions for each file referenced in the VDB chunks.
  2. Filtering: Only chunks from files the user has permission to access are included in the results.
  3. Authorization: The system uses the stored ACL information combined with the user's external principal mapping to determine access.
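The three steps above amount to a post-retrieval filter. The sketch below shows the idea with plain dictionaries. All names are hypothetical, and returning no results for a user without an external principal mapping (failing closed) is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Chunk:
    """A retrieved VDB chunk and the source file it came from (illustrative)."""
    text: str
    external_file_id: str


def filter_chunks(
    chunks: List[Chunk],
    datarobot_user: str,
    principal_map: Dict[str, str],  # DataRobot user -> external identity
    file_acls: Dict[str, Set[str]], # external file ID -> allowed identities
) -> List[Chunk]:
    """Keep only chunks whose source file the user may view."""
    principal = principal_map.get(datarobot_user)
    if principal is None:
        # Assumption: with no external identity on record, return nothing.
        return []
    return [
        chunk
        for chunk in chunks
        if principal in file_acls.get(chunk.external_file_id, set())
    ]


retrieved = [Chunk("Q3 revenue summary", "file-1"), Chunk("Office map", "file-2")]
mapping = {"dr-user-1": "alice@example.com"}
acls = {"file-1": {"bob@example.com"}, "file-2": {"alice@example.com"}}
visible = filter_chunks(retrieved, "dr-user-1", mapping, acls)
```

In this example, only the chunk from `file-2` remains in `visible`, because the mapped identity appears only in that file's ACL.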

Access control behavior

User access levels

ACL hydration respects the original source system permissions:

| Access level | Behavior |
|---|---|
| Full access | Users with full access to a file in the source system can see all VDB chunks from that file. |
| Restricted access | Users with restricted access see only the chunks they're authorized for based on source system permissions. |
| No access | Users without access to a file in the source system cannot see any VDB chunks from that file, even if they have access to the catalog item in DataRobot. |

File access scenarios

The following scenarios illustrate how ACL hydration works in practice:

Scenario 1: User who ingested files

  • A user who ingests files from Google Drive has access to all files they ingested.
  • This user can see all VDB chunks from those files in query results.

Scenario 2: User with restricted access

  • A user who is a member of a specific group (e.g., "connectivity group") has access only to files shared with that group.
  • This user can see VDB chunks only from files they have permission to access in the source system.
  • Files in folders they don't have access to are filtered out.

Scenario 3: Public files

  • Files marked as public or with "connectivity access" are accessible to all users.
  • All users can see VDB chunks from these public files.
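The three scenarios can be reproduced with a small access check that considers direct grants, group membership, and public files. This is a toy model for illustration only: the `SourceAcl` class, the `can_view` function, and the file names are hypothetical, and real source systems have richer permission models than this sketch covers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SourceAcl:
    """Simplified view of one file's permissions in the source system."""
    users: Set[str] = field(default_factory=set)
    groups: Set[str] = field(default_factory=set)
    public: bool = False


def can_view(email: str, member_of: Set[str], acl: SourceAcl) -> bool:
    """True if the file is public, shared with the user, or shared with one of their groups."""
    return acl.public or email in acl.users or bool(member_of & acl.groups)


file_acls: Dict[str, SourceAcl] = {
    "ingested.pdf": SourceAcl(users={"ingestor@example.com"}),
    "team-notes.pdf": SourceAcl(
        users={"ingestor@example.com"}, groups={"connectivity group"}
    ),
    "handbook.pdf": SourceAcl(public=True),
}


def visible_files(email: str, member_of: Set[str]) -> List[str]:
    """Names of files whose VDB chunks this user would be allowed to see."""
    return sorted(
        name for name, acl in file_acls.items() if can_view(email, member_of, acl)
    )


scenario_1 = visible_files("ingestor@example.com", set())            # ingesting user
scenario_2 = visible_files("member@example.com", {"connectivity group"})  # group member
scenario_3 = visible_files("other@example.com", set())               # public files only
```

Here the ingesting user sees all three files, the group member sees the group-shared and public files, and any other user sees only the public file.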

Considerations

  • ACL hydration is not automatically applied to existing VDBs. ACL enforcement applies only to newly created VDBs that contain files from supported sources.

Troubleshooting

| Issue | Solution |
|---|---|
| ACL updates are taking too long | Check the status of the ACL service. Note that DataRobot does not support personal Google Drive drives in the initial release. |
| New users cannot access files they should have permission for | Wait for the automatic async job to complete. New users may experience a delay of up to 1 hour before ACLs are fully applied. |