File registry¶

To work with unstructured data, DataRobot provides a simple file system interface through which you upload and work with files. This file system interface mimics a traditional file system with a directory structure and supports common Unix file system operations. DataRobot's file system uses containers, referred to as catalog items, to store one or more files using a key-value storage approach where the file's path is the key and its contents the value. Uploaded files can be leveraged behind the scenes in other areas or workflows in the DataRobot platform, such as creating a vector database with files uploaded from a SharePoint site.

Using DataRobot's datarobot.fs.DataRobotFileSystem, an fsspec-compatible implementation, you can quickly stand up file-based workflows that leverage the same code patterns as other fsspec-backed file systems.

File system terminology¶

The DataRobot file system uses the following terminology:

Catalog item: The container that stores one or more files. Catalog items are a form of data assets in the DataRobot platform. Each catalog item has its own ID, permissions, and version history.
Catalog item directory: The top-level directory in the file system that maps to a catalog item. The directory name matches the catalog item's ID. All files in a catalog item live as paths inside this directory.
Path: The location of a file or directory in the file system. A path includes the catalog item ID and the internal path within the catalog item. Paths take the form dr://<catalog_item_id>/path/to/file or <catalog_item_id>/path/to/file.
Overwrite strategy: A setting that controls the behavior when an upload or write targets a path where a file already exists. See FilesOverwriteStrategy for the available options.
Signed URL: A temporary, time-limited URL that grants direct read access to a single file without further authentication. Useful for sharing files or handing them to external tools.
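
The two path forms described above can be taken apart with a few lines of standard-library Python. This is a minimal sketch: `split_dr_path` is a hypothetical helper written for this guide, not part of the datarobot package, and `abc123` is a made-up catalog item ID.

```python
from typing import Tuple


def split_dr_path(path: str) -> Tuple[str, str]:
    """Split a file system path into (catalog_item_id, internal_path).

    Hypothetical helper for illustration. Accepts both documented forms:
    dr://<catalog_item_id>/path/to/file and <catalog_item_id>/path/to/file.
    """
    prefix = "dr://"
    if path.startswith(prefix):
        path = path[len(prefix):]
    catalog_item_id, _, internal_path = path.partition("/")
    if not catalog_item_id:
        raise ValueError("path must start with a catalog item ID")
    return catalog_item_id, internal_path


# "abc123" is a placeholder ID used only for this example.
print(split_dr_path("dr://abc123/notes/readme.txt"))
print(split_dr_path("abc123/notes/readme.txt"))
print(split_dr_path("dr://abc123/"))
```

An empty internal path, as in the last call, corresponds to the catalog item directory itself.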

Reminders¶

The following should be kept in mind when working with the DataRobot file system:

The file system simulates a top-level directory structure by giving each catalog item its own directory named after its ID. Files inside the catalog item appear as paths inside that directory.
Permissions are attached to the catalog item containing the files. For the file system API documented here, all files inside a catalog item inherit that catalog item's permissions. Files may also carry external access control list (ACL) permissions if the connector used to ingest them supports it. See the ACL Hydration documentation for more information.
Because the file system uses key-value pairs to store files inside containers, directory structures are simulated and may change based on their contents. This results in the following consequences:
- The file system does not support empty directories. A directory is deleted automatically when all files inside it are deleted.
- To create a directory X, upload a file to a path that contains the directory name (for example, X/file.txt).
A catalog item itself may be empty even though empty directories inside it are not supported.
Some file operations may cause name collisions when creating/moving/copying files in the file system. File collisions are handled according to the overwrite strategy specified when performing the operation.
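
The simulated-directory behavior described in these reminders can be pictured with a plain dictionary. The sketch below is a conceptual model only: the dict stands in for a catalog item's key-value store, and `list_directories` is a hypothetical helper, not a datarobot function.

```python
# A plain dict stands in for a catalog item's key-value store (an assumption
# for illustration; this is not how the datarobot package is implemented).
store = {
    "X/file.txt": b"hello",
    "X/nested/data.csv": b"a,b\n1,2\n",
    "readme.txt": b"top-level file",
}


def list_directories(keys):
    """Derive the sorted list of directories implied by the file paths."""
    dirs = set()
    for key in keys:
        parts = key.split("/")[:-1]  # drop the file name
        for depth in range(1, len(parts) + 1):
            dirs.add("/".join(parts[:depth]))
    return sorted(dirs)


print(list_directories(store))  # directories exist only because files do

# Deleting the only file under X/nested makes that directory disappear.
del store["X/nested/data.csv"]
print(list_directories(store))

# Deleting the last file under X removes X as well; the store itself remains.
del store["X/file.txt"]
print(list_directories(store))
```

Because a directory is only ever a shared prefix of file paths, there is nothing left to represent it once its last file is gone.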

Set up the file system¶

The examples in this guide build on each other. The setup below configures the DataRobot client and creates a DataRobotFileSystem to use.

With Python 3.9 or later, install the datarobot package with its fs extra:

pip install 'datarobot[fs]'

import datarobot as dr
from datarobot.fs import DataRobotFileSystem

dr.Client(token="<YOUR_API_TOKEN>", endpoint="https://app.datarobot.com/api/v2")

fs = DataRobotFileSystem()

Create a new catalog item¶

A catalog item is the container that holds your files in the DataRobot file system. Every path you reference is rooted at a catalog item, so you'll need one to start your workflow. There are two ways to create a file catalog item: create a new empty catalog item, or clone an existing catalog item. Both approaches return the new catalog item's ID, which you'll reuse to build paths in the format dr://{catalog_id}/... for every subsequent operation.

Use create_catalog_item_dir to create a brand-new, empty catalog item.

# Create a brand-new, empty catalog item
catalog_id = fs.create_catalog_item_dir()

Use clone_catalog_item_dir to create a copy of an existing catalog item. Pass files_to_omit to exclude specific files from the clone. The paths in files_to_omit are relative to the source catalog item's root.

# Clone an existing catalog item, copying every file into a new one
source_catalog_id = "<EXISTING_CATALOG_ITEM_ID>"
clone_id = fs.clone_catalog_item_dir(source_catalog_id)

# Or clone but omit specific files from the source
partial_clone_id = fs.clone_catalog_item_dir(
    source_catalog_id,
    files_to_omit=["data/scores.csv", "notes/draft.txt"],
)

Add files¶

Add files to the DataRobot file system by uploading files from your local machine, a public URL, or a data source. Alternatively, write content directly to a file path to create a new file.

Write content directly to new files¶

Write content directly to a file path to create a new file in that location. Use open in write mode for text or buffered binary writes, and pipe_file or pipe for a one-shot write of raw bytes.

# Write a text file in place
with fs.open(f"dr://{catalog_id}/notes/readme.txt", mode="w") as f:
    f.write("This catalog item contains demo files for the file system guide.")

# Write raw bytes in a single call.
fs.pipe_file(f"dr://{catalog_id}/data/sample.csv", b"name,score\nCharlie,72\n")

By default, open uses FilesOverwriteStrategy.REPLACE, so writing to a path that already contains a file will overwrite the existing file. Alternatively, specify a different overwrite_strategy to change this behavior. For example, use FilesOverwriteStrategy.RENAME to create a duplicate file suffixed with (2) instead.

from datarobot.enums import FilesOverwriteStrategy

# Write to the existing path notes/readme.txt.
# The new content is placed in a new file, notes/readme (2).txt.
with fs.open(
    f"dr://{catalog_id}/notes/readme.txt",
    mode="w",
    overwrite_strategy=FilesOverwriteStrategy.RENAME
) as f:
    f.write("This content is written to a new file because RENAME was specified.")
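
The collision behaviors named in this guide (REPLACE, RENAME, and SKIP) can be modeled on a dict-backed store to see how they differ. This is a sketch under stated assumptions: `OverwriteStrategy` and `write_with_strategy` are hypothetical stand-ins, the real FilesOverwriteStrategy enum may define other members, and counting upward past (2) for repeated collisions is an assumption rather than documented behavior.

```python
import posixpath
from enum import Enum


class OverwriteStrategy(Enum):
    """Local stand-in for the three strategies named in this guide."""
    REPLACE = "replace"
    RENAME = "rename"
    SKIP = "skip"


def write_with_strategy(store, path, content, strategy):
    """Write content into a dict-backed store; return the path used, or None."""
    if path not in store or strategy is OverwriteStrategy.REPLACE:
        store[path] = content
        return path
    if strategy is OverwriteStrategy.SKIP:
        return None  # keep the existing file untouched
    # RENAME: add a numeric suffix before the extension, e.g. "readme (2).txt".
    # Counting upward past (2) is an assumption made for this sketch.
    stem, ext = posixpath.splitext(path)
    counter = 2
    while f"{stem} ({counter}){ext}" in store:
        counter += 1
    new_path = f"{stem} ({counter}){ext}"
    store[new_path] = content
    return new_path


store = {"notes/readme.txt": "original"}
print(write_with_strategy(store, "notes/readme.txt", "skipped", OverwriteStrategy.SKIP))
print(write_with_strategy(store, "notes/readme.txt", "renamed", OverwriteStrategy.RENAME))
print(write_with_strategy(store, "notes/readme.txt", "replaced", OverwriteStrategy.REPLACE))
print(sorted(store))
```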

Upload local files¶

To copy a file from your local machine into the catalog item, use put_file for a single file, or put for multiple files or directories.

The example below first creates a few small local files, then uploads them in two different ways.

import tempfile
import fsspec

# Use the local fsspec implementation to stage demo files in a temp directory.
local_fs = fsspec.filesystem("local")
local_dir = tempfile.mkdtemp()

local_fs.makedirs(f"{local_dir}/notes", exist_ok=True)
with local_fs.open(f"{local_dir}/scores.csv", "w") as f:
    f.write("name,score\nAlice,95\nBob,87\n")
with local_fs.open(f"{local_dir}/notes/agenda.txt", "w") as f:
    f.write("Q3 planning agenda")
with local_fs.open(f"{local_dir}/notes/actions.txt", "w") as f:
    f.write("1. Review roadmap\n2. Confirm budget\n")

# Upload a single file
fs.put_file(f"{local_dir}/scores.csv", f"dr://{catalog_id}/data/scores.csv")

# Upload a directory recursively. Trailing slashes mark both paths as directories.
fs.put(f"{local_dir}/notes/", f"dr://{catalog_id}/notes/", recursive=True)

Upload files from a URL¶

Use put_from_url to ingest a file directly from any URL the DataRobot server can reach. The file is streamed server-side, so there is no need to download it locally first.

# Ingest from a URL and create the file dr://<catalog_id>/external/iris.csv
fs.put_from_url(
    path=f"dr://{catalog_id}/external/",
    url="https://s3.amazonaws.com/datarobot_public_datasets/iris.csv",
)

By default, put_from_url blocks until the upload completes. To start the upload and return immediately, pass wait_for_completion=False, or use upload_timeout to control how long to wait when blocking.

Upload files from a data source¶

To bring files in from a connector-backed system (S3, SharePoint, Google Drive, Confluence, and others), use put_from_data_source. This requires a DataSource configured against an unstructured DataStore, plus a Credential that can access it.

The example below configures an S3 bucket DataSource and copies a folder of documents into the catalog item. The same pattern can be applied for other source systems. See put_from_data_source for SharePoint and Google Drive variants.

credential = dr.Credential.create_s3(
    name="S3 Credential",
    aws_access_key_id="<AWS_ACCESS_KEY_ID>",
    aws_secret_access_key="<AWS_SECRET_ACCESS_KEY>",
)
s3_connector = next(c for c in dr.Connector.list() if c.connector_type == "s3")

s3_data_store = dr.DataStore.create(
    data_store_type=dr.enums.DataStoreTypes.DR_CONNECTOR_V1,
    canonical_name="My S3 Bucket",
    fields=[
        {"id": "fs.defaultFS", "name": "Bucket Name", "value": "my-bucket-name"},
        {"id": "fs.rootDirectory", "name": "Prefix", "value": "/"},
        {"id": "fs.s3.awsRegion", "name": "S3 Bucket Region", "value": "us-east-1"},
    ],
    connector_id=s3_connector.id,
)
s3_data_source = dr.DataSource.create(
    data_source_type=dr.enums.DataStoreTypes.DR_CONNECTOR_V1,
    canonical_name="S3 Documents",
    params=dr.DataSourceParameters(
        data_store_id=s3_data_store.id,
        path="documents/",
    ),
)

fs.put_from_data_source(
    path=f"dr://{catalog_id}/s3_documents/",
    data_source_id=s3_data_source.id,
    credential_id=credential.credential_id,
)

By default, put_from_data_source blocks until the upload completes. To start the upload and return immediately, pass wait_for_completion=False, or use upload_timeout to control how long to wait when blocking.

Browse and search files¶

The file system supports the standard fsspec discovery methods:

ls: Shallow listing of directory contents.
find: Recursively list all files under a path (optionally including directories).
walk: Generator that yields directory trees one level at a time (similar to Python's os.walk).
glob: Match files or directories by pattern.
tree: Visualize the directory tree structure.

# List the immediate contents of the catalog item. Set detail=False for just the paths.
fs.ls(f"dr://{catalog_id}/", detail=False)

# Use detail=True (the default) to also retrieve size, type, and format.
for item in fs.ls(f"dr://{catalog_id}/", detail=True):
    print(f"{item['name']:50s} type={item['type']:10s} size={item['size']}")

# Recursively list every file. Pass withdirs=True to include directories.
all_files = fs.find(f"dr://{catalog_id}/")

# Walk the directory tree one level at a time, similar to os.walk().
for dirpath, dirnames, filenames in fs.walk(f"dr://{catalog_id}/"):
    print((dirpath, dirnames, filenames))

# Glob lets you match by pattern. Supports *, **, ?, and [abc] character classes.
csv_files = fs.glob(f"dr://{catalog_id}/**/*.csv")

# Visualize the catalog item layout. The recursion_limit controls how deep to walk.
print(fs.tree(f"dr://{catalog_id}/", recursion_limit=3))
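
Since walk is described as similar to os.walk, the shape of what each iteration yields can be previewed on a local temporary directory. The sketch below uses only the standard library and a made-up layout that mirrors this guide's demo files; it illustrates the (dirpath, dirnames, filenames) structure and makes no claim about the exact strings DataRobotFileSystem returns.

```python
import os
import tempfile

# Build a small local tree that mirrors the demo layout used in this guide.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "notes"))
os.makedirs(os.path.join(root, "data"))
for rel_path in ("data/scores.csv", "notes/agenda.txt", "notes/actions.txt"):
    with open(os.path.join(root, rel_path), "w") as f:
        f.write("demo")

# Each iteration yields one directory: its path, its subdirectories, its files.
levels = []
for dirpath, dirnames, filenames in os.walk(root):
    dirnames.sort()  # sorting in place makes the traversal order deterministic
    rel = os.path.relpath(dirpath, root)
    levels.append((rel, list(dirnames), sorted(filenames)))
    print(rel, dirnames, sorted(filenames))
```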

Tip

Patterns ending with / will only match directories. For example, dr://{catalog_id}/*/ returns the top-level subdirectories of the catalog item.
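
Python's standard glob module follows comparable conventions on a local disk, which makes it a convenient way to try patterns before running them against a catalog item. This is an analogy under the assumption that the patterns behave alike; details of DataRobotFileSystem.glob may differ. The directory layout and the `rel` helper are made up for this example.

```python
import glob
import os
import tempfile

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "data", "raw"))
os.makedirs(os.path.join(root, "notes"))
for rel_path in ("scores.csv", "data/iris.csv", "data/raw/a.csv", "notes/agenda.txt"):
    with open(os.path.join(root, rel_path), "w") as f:
        f.write("demo")


def rel(paths):
    """Return sorted paths relative to the temporary root, using / separators."""
    return sorted(os.path.relpath(p, root).replace(os.sep, "/") for p in paths)


# A trailing separator restricts matches to directories.
top_level_dirs = rel(glob.glob(os.path.join(root, "*/")))
# A single * matches exactly one path segment.
one_level_csvs = rel(glob.glob(os.path.join(root, "*", "*.csv")))
# ** matches any number of segments, including none.
all_csvs = rel(glob.glob(os.path.join(root, "**", "*.csv"), recursive=True))

print(top_level_dirs)
print(one_level_csvs)
print(all_csvs)
```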

Manipulate files¶

The DataRobot file system supports methods to copy, move, and delete files. All three methods accept single paths, lists of paths, and glob patterns, and support recursive operations on directories.

Copy files¶

Use copy to duplicate files or directories to a new path. Pass recursive=True to copy a directory and all of its contents, and use glob patterns to copy multiple files at once. Pass overwrite_strategy to specify how to handle naming collisions at the destination. Copying between catalog items is also supported, provided the user has permissions to the source and destination catalog items.

from datarobot.enums import FilesOverwriteStrategy

# Copy a single file.
fs.copy(
    f"dr://{catalog_id}/data/scores.csv",
    f"dr://{catalog_id}/backups/scores_backup.csv",
)

# Copy a directory recursively, skipping any files that already exist in the destination.
# Both paths end with / to mark them as directories.
fs.copy(
    f"dr://{catalog_id}/notes/",
    f"dr://{catalog_id}/archive/notes_snapshot/",
    recursive=True,
    overwrite_strategy=FilesOverwriteStrategy.SKIP
)

# Use a glob pattern to copy all .txt files into a single folder.
fs.copy(
    f"dr://{catalog_id}/**/*.txt",
    f"dr://{catalog_id}/all_text_files/",
    recursive=True,
)

Move and rename files¶

Use mv to move a file to a new path or rename it. Moving a file between catalog items is supported, provided the user has permissions to both the source and destination catalog items.

# Rename a file by moving it to a new path within the same catalog item.
fs.mv(f"dr://{catalog_id}/backups/scores_backup.csv", f"dr://{catalog_id}/backups/scores_v1.csv")

# Move a file into a different directory (note the trailing slash on the target).
fs.mv(f"dr://{catalog_id}/backups/scores_v1.csv", f"dr://{catalog_id}/archive/")
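
The trailing slash in the second call marks the target as a directory, so the file keeps its name at the new location. The sketch below models that rule with a hypothetical helper, `resolve_destination`, which is not part of the datarobot package; `abc123` is a placeholder catalog item ID.

```python
import posixpath


def resolve_destination(source: str, target: str) -> str:
    """Return the full destination path for a move or copy.

    Hypothetical helper: a target ending in "/" is treated as a directory,
    so the source file name is appended; otherwise the target is used as-is.
    """
    if target.endswith("/"):
        return target + posixpath.basename(source)
    return target


src = "dr://abc123/backups/scores_v1.csv"
print(resolve_destination(src, "dr://abc123/archive/"))
print(resolve_destination(src, "dr://abc123/backups/scores_v2.csv"))
```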

Delete files¶

Use rm to delete files and directories. Pass recursive=True to delete a directory and all of its contents. Use glob patterns to delete multiple files at once.

# Delete a single file.
fs.rm(f"dr://{catalog_id}/archive/scores_v1.csv")

# Delete a directory recursively.
fs.rm(f"dr://{catalog_id}/all_text_files/", recursive=True)

# Delete every csv file under archive.
fs.rm(f"dr://{catalog_id}/archive/**/*.csv", recursive=True)

Deleting a directory or deleting a catalog item

Deleting all files inside a directory automatically removes the directory because the DataRobot file system does not support empty directories. However, catalog items are different. Deleting all files inside a catalog item does not delete the catalog item.

To delete the catalog item itself, call fs.rm on the catalog item root (for example: fs.rm(f"dr://{catalog_id}/")). This soft-deletes the catalog item. A soft-deleted catalog item is hidden but can be restored with Files.un_delete() if you change your mind.

Read files¶

To read files, use:

open for streaming and standard file-like access.
cat or cat_file for one-shot reads.
sign to generate a temporary signed URL.
get to download files locally.

Stream a file with open¶

open returns a DataRobotFile that behaves like a standard Python file object. This is the most flexible way to read large files: you can iterate line by line, seek to a position, or read fixed-size chunks.

# Iterate line by line. Never loads the full file into memory.
with fs.open(f"dr://{catalog_id}/data/scores.csv", mode="r") as f:
    for line in f:
        print(line.rstrip())

# Read in binary mode with seeking.
with fs.open(f"dr://{catalog_id}/data/scores.csv", mode="rb") as f:
    header = f.read(20)
    f.seek(0)
    full = f.read()
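
Fixed-size chunk reads follow the usual file-object loop. The sketch below uses io.BytesIO as a stand-in for the object returned by fs.open in binary mode, on the assumption that it supports read and seek as described above; `iter_chunks` is a hypothetical helper.

```python
import io


def iter_chunks(file_obj, chunk_size=8):
    """Yield successive fixed-size chunks until the file is exhausted."""
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        yield chunk


# io.BytesIO stands in for the object returned by fs.open(..., mode="rb").
payload = b"name,score\nAlice,95\nBob,87\n"
with io.BytesIO(payload) as f:
    chunks = list(iter_chunks(f, chunk_size=8))
    f.seek(5)         # jump to an absolute byte offset
    tail = f.read(5)  # read the next five bytes from that offset

print(len(chunks), tail)
```

Only one chunk is held in memory at a time inside the generator, which is what makes this pattern suitable for large files.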

Read files with cat¶

cat returns the file contents in a single call. Pass a single path to get back bytes, or a glob/list of paths to get back a {path: bytes} dictionary.

# Read a single file as bytes.
data = fs.cat(f"dr://{catalog_id}/data/scores.csv")

# Read every CSV file in the catalog item at once.
all_csvs = fs.cat(f"dr://{catalog_id}/**/*.csv", recursive=True)
for path, content in all_csvs.items():
    print(f"{path}: {len(content)} bytes")
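
Because cat returns raw bytes, the contents usually need decoding before use. The sketch below parses a {path: bytes} dictionary into CSV rows with the standard library. The dictionary is a hand-written stand-in for a multi-path cat result: the key format, file contents, and the `parse_csv_bytes` helper are all assumptions made for illustration.

```python
import csv
import io

# Stand-in for the {path: bytes} dictionary returned by a multi-path cat call.
# The paths and contents are made up for this example.
all_csvs = {
    "abc123/data/scores.csv": b"name,score\nAlice,95\nBob,87\n",
    "abc123/data/sample.csv": b"name,score\nCharlie,72\n",
}


def parse_csv_bytes(content: bytes, encoding: str = "utf-8"):
    """Decode raw bytes and parse them into a list of row dictionaries."""
    text = content.decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


rows_by_path = {path: parse_csv_bytes(content) for path, content in all_csvs.items()}
for path, rows in rows_by_path.items():
    print(path, rows)
```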

Generate a signed URL¶

A signed URL gives a third-party tool (a browser, a downstream service, a notebook user) temporary read access to a file without sharing your DataRobot API token. The example below generates a signed URL and uses it to download a file from the DataRobot file system.

import requests

url = fs.sign(f"dr://{catalog_id}/data/scores.csv", expiration=300)

# Download file locally using signed url
local_path = "scores.csv"
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open(local_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)

Download files locally¶

To download a file to your local machine, use get_file for a single file, or get for multiple files or an entire directory.

import tempfile

local_dir = tempfile.mkdtemp()

# Download a single file to a local path.
fs.get_file(f"dr://{catalog_id}/data/scores.csv", f"{local_dir}/scores.csv")

# Download a directory recursively. Trailing slashes mark both paths as directories.
fs.get(f"dr://{catalog_id}/notes/", f"{local_dir}/notes/", recursive=True)

# Use a glob pattern to download all .csv files into a single local directory.
fs.get(f"dr://{catalog_id}/**/*.csv", f"{local_dir}/all_csvs/", recursive=True)

Inspect files and directories¶

Use info to get detailed metadata for a single file or directory.
Use exists to check if a file or directory exists.
Use isfile to determine if the path refers to a file.
Use isdir to determine if the path refers to a directory.
Use du to check disk usage for files and directories.

# Detailed metadata for a single file or directory.
file_info = fs.info(f"dr://{catalog_id}/data/scores.csv")

# Quick existence checks.
print("exists?", fs.exists(f"dr://{catalog_id}/data/scores.csv"))
print("isfile?", fs.isfile(f"dr://{catalog_id}/data/scores.csv"))
print("isdir?",  fs.isdir(f"dr://{catalog_id}/data/"))

# Total disk usage for the entire catalog item.
print(f"Total bytes: {fs.du(f'dr://{catalog_id}/', total=True):,}")

Dict-like access with get_mapper¶

get_mapper returns an instance of DataRobotFSMap, a MutableMapping rooted at the given path. This is useful when working with libraries that accept fsspec-style mappers (for example, Zarr, Xarray, and other array stores), or when you simply prefer dictionary semantics for reading and writing files.

mapper = fs.get_mapper(f"dr://{catalog_id}/data/")

# List files in mapping
print("Keys:", list(mapper))
# Read bytes from a file
print("scores.csv first 30 bytes:", mapper["scores.csv"][:30])

# Write to create a new file
mapper["generated.txt"] = b"This file was created via the mapper interface."

# Membership checks and size.
print("'scores.csv' in mapper?", "scores.csv" in mapper)
print("Total files:", len(mapper))

# Delete file
del mapper["generated.txt"]
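
To make "a MutableMapping rooted at the given path" concrete, the sketch below implements the same idea over a plain dictionary. `PrefixMap` is a hypothetical class written for this guide; it is not DataRobotFSMap and says nothing about how that class is implemented.

```python
from collections.abc import MutableMapping


class PrefixMap(MutableMapping):
    """Dict-like view over the keys of `store` that start with `root`.

    Hypothetical class for illustration only; it is not DataRobotFSMap.
    """

    def __init__(self, store, root):
        self._store = store
        self._root = root.rstrip("/") + "/"

    def __getitem__(self, key):
        return self._store[self._root + key]

    def __setitem__(self, key, value):
        self._store[self._root + key] = value

    def __delitem__(self, key):
        del self._store[self._root + key]

    def __iter__(self):
        # Copy the keys so callers can modify the store while iterating.
        for full_key in list(self._store):
            if full_key.startswith(self._root):
                yield full_key[len(self._root):]

    def __len__(self):
        return sum(1 for _ in self)


store = {"data/scores.csv": b"name,score\n", "notes/readme.txt": b"hi"}
mapper = PrefixMap(store, "data/")
mapper["generated.txt"] = b"created via the mapper"
print(sorted(mapper), len(mapper), "scores.csv" in mapper)
del mapper["generated.txt"]
```

Keys are relative to the root, so the mapper sees only scores.csv and never notes/readme.txt, even though both live in the same underlying store.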