ACL hydration¶
Premium feature
DataRobot's RAG ACL management capabilities are a premium feature; contact your DataRobot representative for enablement information. This functionality is not available in the DataRobot trial experience.
Access Control List (ACL) hydration enforces fine-grained authorization for vector database results based on original document permissions from external sources, such as Google Drive and SharePoint. When files are ingested from these sources, DataRobot captures and maintains their access control information, ensuring that users can only access vector database chunks for documents they have permission to view in the original source system.
Overview¶
ACL hydration enables organizations to maintain the same access control policies in DataRobot that exist in their source systems. This is particularly important for administrators who need to ensure that sensitive documents remain protected when used in vector databases for generative AI applications.
For datasets, ACL hydration works as follows:
- Ingest files from a supported external source. DataRobot registers them in the File Registry.
- DataRobot captures and caches initial ACLs during ingestion.
- DataRobot continuously monitors ACL changes in the source system using polling mechanisms:
  - For Google Drive, DataRobot queries the Drive Activity API for permission changes.
  - For SharePoint, DataRobot uses Delta Query to track permission updates.
- When a user queries a vector database, DataRobot filters results based on the user's permissions in the original source system, ensuring only accessible chunks are returned.
What is the latency for ACL updates?
DataRobot targets low-latency ACL updates to ensure permissions are applied as quickly as possible. While the system is designed to support near real-time updates (targeting a few seconds in the long term), initial implementations may have latencies of approximately 1 minute, which is acceptable for most use cases. Updates that take 10 minutes or longer are not acceptable.
See the feature considerations for more information.
Key capabilities¶
ACL hydration provides the following capabilities:
| Capability | Description |
|---|---|
| Initial ACL capture | Automatically captures and stores access control information when files are first ingested from supported data sources. |
| ACL change monitoring | Continuously polls source systems to detect and apply permission changes as they occur. |
| Vector database result filtering | Filters vector database query results to return only chunks from documents the user has permission to access. |
| Multi-user support | Supports multiple users with different permission levels, ensuring each user sees only the data they're authorized to access. |
Supported data sources¶
ACL hydration is currently supported for the following data sources:
| Data source | ACL tracking method |
|---|---|
| Google Drive | Drive Activity API |
| SharePoint | Delta Query |
Setting up ACL hydration¶
To enable ACL hydration for your organization, configure admin connections for the supported data sources.
Prerequisites¶
Before setting up ACL hydration, ensure you have:
- An organization administrator account with appropriate permissions.
- A connection to a supported data source.
- Files ingested from the supported data sources (after the connection is configured).
Configuration steps¶
- As a DataRobot organization admin, go to the Data connections page.
- Select an existing connection of the supported data source type or create a new connection (using service accounts for admin connections is highly recommended).
- Toggle on Enable access control list synchronization. DataRobot will use this connection to poll for ACL changes across all files in your organization.
- Configure the ACL management details as needed:
  - Admin impersonation email: Set the administrator account to impersonate in the background. This field is mandatory for Google Drive.
  - Domain (optional): Overwrites the user's email domain with the specified string. For example, if Domain is set to sharepoint.com and the user email is my-account@company.com, DataRobot tracks ACLs for my-account@sharepoint.com.
- Click Save.
  There can be only one admin connection per data source type. Any admin user in the DataRobot organization can enable ACL synchronization for a connection they have permission to use. Doing so overwrites the existing admin connection for that data source type for the entire organization. Organization admin users do not see all admin connections, only those that they own or that have been shared with them explicitly.
- Retrieve the connection ID to create the data source and begin ingesting files. Initial ACLs are automatically captured during ingestion.
Note
For Google Drive, extract both the Drive ID and Folder ID from the Google Drive URL.
For new DataRobot users, an async job automatically refreshes external principals (user IDs) to minimize the no-ACL gap.
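The Domain override described in the configuration steps can be illustrated with a small sketch. The function name `apply_domain_override` is hypothetical and is not part of any DataRobot API; it only mirrors the documented behavior of replacing the email domain before ACLs are looked up.

```python
from typing import Optional


def apply_domain_override(email: str, domain: Optional[str]) -> str:
    """Return the email with its domain replaced by the override, if one is set.

    Hypothetical helper for illustration only; DataRobot performs this
    substitution internally when the Domain field is configured.
    """
    # No override configured, or the value is not an email address: leave it as-is.
    if not domain or "@" not in email:
        return email
    local_part, _, _ = email.rpartition("@")
    return f"{local_part}@{domain}"


tracked = apply_domain_override("my-account@company.com", "sharepoint.com")
print(tracked)
```

With Domain set to sharepoint.com, the user my-account@company.com is matched against ACL entries for my-account@sharepoint.com. Leaving Domain empty keeps the original address.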
How ACL hydration works¶
ACL hydration captures the access control list, monitors the source system settings, and returns vector database results to each user based on those settings.
When files are first ingested, DataRobot captures the current access control list for each file.
ACL information is stored in the File Registry, including:
- The origin of each file (connector type and external file ID).
- Original catalog ID and catalog version ID.
- Initial ACLs with user and group permissions.
Permission information is preserved even if the catalog item is updated or files are modified.
DataRobot continuously monitors for ACL changes in the source system. A polling mechanism queries the source APIs at regular intervals, targeting low latency. For example:
- For Google Drive, DataRobot uses the Drive Activity API to query for permission change events.
- For SharePoint, DataRobot uses Delta Query to retrieve incremental changes.
When ACL changes are detected, DataRobot updates the stored ACL information in the File Registry. Changes are applied immediately and affect all subsequent vector database queries.
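As a rough mental model of the update step, the sketch below applies a batch of polled permission changes to a stored ACL map. The event shape (`file_id`, `action`, `principal`) and the function `apply_acl_changes` are assumptions made for illustration; they do not reflect the actual Drive Activity API or Delta Query payloads, nor DataRobot's internal storage.

```python
from typing import Dict, Iterable, Set


def apply_acl_changes(
    stored_acls: Dict[str, Set[str]], changes: Iterable[dict]
) -> Dict[str, Set[str]]:
    """Apply grant/revoke events to a stored ACL map, in place.

    stored_acls maps an external file ID to the set of principals that
    may read the file. Both structures are hypothetical.
    """
    for change in changes:
        principals = stored_acls.setdefault(change["file_id"], set())
        if change["action"] == "grant":
            principals.add(change["principal"])
        elif change["action"] == "revoke":
            # discard() is a no-op if the principal was never granted access.
            principals.discard(change["principal"])
    return stored_acls


stored = {"file-1": {"alice@example.com"}}
polled_changes = [
    {"file_id": "file-1", "action": "grant", "principal": "bob@example.com"},
    {"file_id": "file-1", "action": "revoke", "principal": "alice@example.com"},
]
apply_acl_changes(stored, polled_changes)
print(stored)
```

After this batch is applied, only bob@example.com can read file-1, and any query that runs afterward is evaluated against the updated map.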
When a user queries a vector database:
- DataRobot checks the user's permissions for each file referenced in the vector database chunks.
- Only chunks from files the user has permission to access are included in the results.
- The system uses the stored ACL information, combined with the user's external principal mapping, to determine access.
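The query-time filtering can be sketched as a set intersection between a file's stored principals and the requesting user's external principals. Everything in this example is hypothetical (the `filter_chunks` function, the chunk dictionaries, and the `group:` naming); it also assumes a default-deny policy for files with no stored ACL, which the source does not state explicitly.

```python
from typing import Dict, List, Set


def filter_chunks(
    chunks: List[dict], file_acls: Dict[str, Set[str]], user_principals: Set[str]
) -> List[dict]:
    """Keep only chunks whose source file shares a principal with the user.

    Files missing from file_acls are treated as inaccessible (assumed default deny).
    """
    return [
        chunk
        for chunk in chunks
        if file_acls.get(chunk["file_id"], set()) & user_principals
    ]


# Stored ACLs per external file ID (hypothetical data).
file_acls = {
    "doc-a": {"alice@example.com", "group:finance"},
    "doc-b": {"group:legal"},
}
# External principal mapping: the user's own identity plus group memberships.
carol_principals = {"carol@example.com", "group:finance"}

retrieved = [
    {"file_id": "doc-a", "text": "Q3 revenue summary"},
    {"file_id": "doc-b", "text": "Contract terms"},
    {"file_id": "doc-c", "text": "File with no stored ACL"},
]
visible = filter_chunks(retrieved, file_acls, carol_principals)
print([chunk["file_id"] for chunk in visible])
```

Here the user sees the doc-a chunk through group membership, while doc-b (shared with a different group) and doc-c (no ACL on record) are filtered out.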
User access levels¶
ACL hydration respects the original source system permissions:
| Access level | Behavior |
|---|---|
| Full access | Users with full access to a file in the source system can see all vector database chunks from that file. |
| Restricted access | Users with restricted access see only the chunks they're authorized for, based on source system permissions. |
| No access | Users without access to a file in the source system cannot see any vector database chunks from that file, even if they have access to the catalog item in DataRobot. |
File access scenarios¶
The following sample scenarios illustrate how ACL hydration works:
Scenario 1: User who ingested files
- A user who ingests files from Google Drive has access to all files they ingested.
- This user can see all vector database chunks from those files in query results.
Scenario 2: User with restricted access
- A user who is a member of a specific group has access only to files shared with that group.
- This user can see vector database chunks only from files they have permission to access in the source system.
- Results from files in folders they don't have access to are filtered out.
Scenario 3: Public files
- Files marked as public or with "connectivity access" are accessible to all users.
- All users can see vector database chunks from these public files.
Integrate user identities with applications and agents¶
DataRobot supports integration between applications, agents, vector databases, and ACL enforcement. This integration is built so that applications and agents automatically enforce ACL-filtered vector database results for the requesting user.
For applications that call DataRobot agents and use ACL-enabled vector databases, you must propagate the requesting user's identity so that the agent can apply the correct ACL filtering.
Follow these steps:
- Read the identity token from the incoming request. Your app receives requests (e.g., from a front end or API client). Read the X-DataRobot-Identity-Token header from each request.
- Pass the X-DataRobot-Identity-Token header when calling the agent. When your app invokes a DataRobot agent (e.g., for chat or RAG), include the same header in the outbound call so that the agent can identify the user and filter vector database results by that user's permissions.
The way you pass the header depends on how your app calls the agent (e.g., as extra_headers to an LLM client or as a headers object to a stream manager). Reading the header is the same in all cases:
# Example: read the identity token from the request (e.g., FastAPI/Starlette)
identity_token = request.headers.get("X-DataRobot-Identity-Token")
Then, forward this value in the headers you send when calling the agent. If the identity token is not propagated, the agent cannot apply per-user ACL filtering and behavior may not match the intended access policy.
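One way to forward the token is to build the outbound headers in a single helper and pass the result to whatever client calls the agent. This is a minimal sketch: `build_agent_headers` is a hypothetical name, and raising an error when the token is missing is a design choice of this example, not documented DataRobot behavior.

```python
from typing import Dict, Mapping, Optional

IDENTITY_HEADER = "X-DataRobot-Identity-Token"


def build_agent_headers(
    incoming_headers: Mapping[str, str],
    base_headers: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Copy the identity token from the incoming request into the outbound headers."""
    # HTTP header names are case-insensitive, but plain dicts are not,
    # so compare lowercased names.
    token = next(
        (
            value
            for name, value in incoming_headers.items()
            if name.lower() == IDENTITY_HEADER.lower()
        ),
        None,
    )
    if token is None:
        # Fail loudly rather than call the agent without per-user ACL filtering.
        raise ValueError(f"Missing {IDENTITY_HEADER} header on incoming request")
    headers = dict(base_headers or {})
    headers[IDENTITY_HEADER] = token
    return headers


outbound = build_agent_headers(
    {"x-datarobot-identity-token": "example-token"},
    {"Content-Type": "application/json"},
)
print(outbound)
```

The returned dictionary can then be supplied in whichever form your client expects, such as extra_headers on an LLM client or a headers object on a stream manager.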
Feature considerations¶
- DataRobot supports ACL hydration with SharePoint and Google Drive.
- DataRobot does not support personal drives for Google Drive. It only supports shared drives.
- Locally uploaded files and database connections do not support ACL hydration, as they do not have external access control systems to reference.
- ACL hydration enforcement is not automatically applied to existing vector databases and ingested files. It is only applied to newly created vector databases with newly ingested files from supported sources.
- DataRobot applies the latest retrieved ACLs regardless of the background synchronization health. If you need to confirm that identity and ACL sync are healthy, use the admin API monitoring described in Monitoring background synchronization.
- DataRobot does not support cross-drive links for ACL hydration. For example, if you ingest files using a link pointing to another drive, the ingestion completes successfully but the files are considered inaccessible via ACL.
Monitoring background synchronization¶
Administrators can monitor the health of background synchronization using the Event Logs API. Two event types are relevant:
User identity and membership synchronization¶
Use the following to return a report on user identity and membership synchronization with the external source:
GET https://{DATAROBOT_URL}/api/v2/eventLogs/?event=External+principals+synchronized+for+a+connector
The request returns the following response fields:
| Field | Description |
|---|---|
| timestamp | The time of the report, represented in UTC. |
| context.connectorType | The data source (e.g., gdrive). |
| context.mostOutdatedMinutes | The user identity synchronization latency, in minutes. |
| context.mode | The job status, one of: INITIALIZE (user identity mapping is still being created), UPDATE (normal operation), or FAILURE (the job did not complete successfully). |
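The request above can be assembled with the Python standard library, as in the following sketch. The function `build_event_log_request` and the host `app.example.com` are placeholders, and the Bearer token header is an assumption based on how the DataRobot public API is commonly authenticated; confirm the scheme for your deployment.

```python
from urllib.parse import urlencode
from urllib.request import Request

IDENTITY_SYNC_EVENT = "External principals synchronized for a connector"


def build_event_log_request(base_url: str, event: str, api_token: str) -> Request:
    """Build a GET request for the Event Logs API filtered by event name."""
    # urlencode uses quote_plus by default, so spaces become "+" as in the
    # documented URL.
    query = urlencode({"event": event})
    url = f"{base_url.rstrip('/')}/api/v2/eventLogs/?{query}"
    # Assumption: the API accepts a Bearer token in the Authorization header.
    return Request(url, headers={"Authorization": f"Bearer {api_token}"})


request = build_event_log_request(
    "https://app.example.com", IDENTITY_SYNC_EVENT, "YOUR_API_TOKEN"
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. The same helper works for the ACL synchronization event by changing the event name.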
ACL synchronization¶
Use the following to return a report on ACL file synchronization:
GET https://{DATAROBOT_URL}/api/v2/eventLogs/?event=ACL+synchronized+for+a+connector
The request returns the following response fields:
| Field | Description |
|---|---|
| timestamp | The time of the report, represented in UTC. |
| context.connectorType | The data source. |
| context.result | The job status. |
| context.seconds | The time taken to complete the ACL synchronization, in seconds. |
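Once the records are retrieved, a short script can summarize synchronization duration per connector. This sketch assumes each record nests the documented fields under a `context` object, as the dotted field names suggest. The sample values (including the `result` strings and the `sharepoint` connector type) are invented for illustration and are not real API output.

```python
from typing import Dict, Iterable


def slowest_sync_seconds(records: Iterable[dict]) -> Dict[str, float]:
    """Return the longest observed ACL synchronization time per connector type."""
    slowest: Dict[str, float] = {}
    for record in records:
        context = record["context"]
        connector = context["connectorType"]
        duration = float(context["seconds"])
        slowest[connector] = max(slowest.get(connector, 0.0), duration)
    return slowest


# Hypothetical records shaped after the documented response fields.
sample_records = [
    {"timestamp": "2025-01-01T00:00:00Z",
     "context": {"connectorType": "gdrive", "result": "success", "seconds": 12.5}},
    {"timestamp": "2025-01-01T00:05:00Z",
     "context": {"connectorType": "gdrive", "result": "success", "seconds": 40.0}},
    {"timestamp": "2025-01-01T00:05:00Z",
     "context": {"connectorType": "sharepoint", "result": "success", "seconds": 8.0}},
]
print(slowest_sync_seconds(sample_records))
```

Comparing these maxima against your own latency expectations gives a quick signal of whether ACL updates are keeping up with changes in the source system.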
Troubleshooting¶
| Issue | Solution |
|---|---|
| ACL updates are taking too long | Check the status of the ACL service (see Monitoring background synchronization). For Google Drive, confirm that the files are in a shared drive; DataRobot does not support personal drives. |
| New users cannot access files they should have permission to | Wait for the automatic async job to complete. New users may experience a delay of up to one hour before ACLs are fully applied. |
