Incremental Learning Overview¶
Incremental Learning enables modeling on static datasets up to 100 GB (and effectively unlimited size for dynamic datasets). Datasets are split into chunks by the Chunking Service, and those chunks are fed into Incremental Learning for training. Incremental Learning supports chunk sizes up to 4 GB. Larger workers are provisioned dynamically as chunk sizes grow.
Feature Flags¶
The Chunking Service is enabled by default. To disable it, set DISABLE_DATA_CHUNKING_SERVICE to true.
The feature flag INCREMENTAL_LEARNING_IMPROVEMENTS is only required if you want to tune Spark configurations (see below).
Example configuration (Helm values):
```yaml
# helm chart values snippet
core:
  config_env_vars:
    DISABLE_DATA_CHUNKING_SERVICE: false
    INCREMENTAL_LEARNING_IMPROVEMENTS: false
```
Worker Resources for Static Datasets¶
For static datasets in AI Registry, Incremental Learning uses Apache Spark workers to chunk the dataset. Memory requirements are based on dataset size. Because all chunks are created at once (e.g., a 100 GB dataset becomes ~25 × 4 GB chunks), sufficient resources are required to complete the job.
Datasets are grouped into four size bins: SMALL, MEDIUM, LARGE, and XLARGE.
Thresholds for these bins are controlled by environment variables (defaults in parentheses):
- CHUNKING_WORKER_DATASET_SIZE_SMALL_START (10 GB)
- CHUNKING_WORKER_DATASET_SIZE_MEDIUM_START (25 GB)
- CHUNKING_WORKER_DATASET_SIZE_LARGE_START (50 GB)
- CHUNKING_WORKER_DATASET_SIZE_XLARGE_START (75 GB)
For each bin, you can set the worker memory limit (defaults in parentheses):
- CHUNKING_WORKER_MEMORY_LIMIT_FOR_SMALL_DATASET (60 GB)
- CHUNKING_WORKER_MEMORY_LIMIT_FOR_MEDIUM_DATASET (120 GB)
- CHUNKING_WORKER_MEMORY_LIMIT_FOR_LARGE_DATASET (178 GB)
- CHUNKING_WORKER_MEMORY_LIMIT_FOR_XLARGE_DATASET (256 GB)
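To illustrate how the thresholds and memory limits interact, the sketch below classifies a dataset into a bin and looks up that bin's default worker memory limit. It is a minimal illustration, not product code, and it assumes a dataset falls into the largest bin whose *_START threshold it meets or exceeds; the function and variable names are hypothetical.

```python
# Illustrative only: bin selection assuming a dataset lands in the largest
# bin whose *_START threshold it meets or exceeds (defaults listed above).
GIB = 1024 ** 3

BIN_THRESHOLDS = [          # (bin name, *_START threshold, worker memory limit)
    ("XLARGE", 75 * GIB, 256 * GIB),
    ("LARGE",  50 * GIB, 178 * GIB),
    ("MEDIUM", 25 * GIB, 120 * GIB),
    ("SMALL",  10 * GIB,  60 * GIB),
]

def classify_dataset(size_bytes: int) -> tuple[str, int]:
    """Return the size bin and its default worker memory limit."""
    for name, start, memory_limit in BIN_THRESHOLDS:
        if size_bytes >= start:
            return name, memory_limit
    raise ValueError("Datasets under 10 GB are below the supported minimum.")

# Example: a 60 GB dataset lands in the LARGE bin (178 GB worker memory limit).
print(classify_dataset(60 * GIB))
```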
General Guidance¶
The maximum supported static dataset size is 100 GB (upper end of XLARGE). The minimum is 10 GB. Recommended instance sizes:
| Dataset Size Range | Size Classification | Instance Size Recommendation (Default) |
|---|---|---|
| 10 GB–25 GB | SMALL | 64 GB |
| 25 GB–50 GB | MEDIUM | 128 GB |
| 50 GB–75 GB | LARGE | 256 GB |
| 75 GB–100 GB | XLARGE | 512 GB |
These thresholds and memory limits are configurable to accommodate deployments that define “SMALL” through “XLARGE” differently and provision worker resources accordingly.
Spark Configurations¶
Certain Apache Spark settings can be tuned via environment variables. To enable tuning, set INCREMENTAL_LEARNING_IMPROVEMENTS to true (Private Preview). When enabled, the following variables override the Spark defaults listed in the Default Spark Values table below:
- CHUNKING_WORKER_MAX_PARTITIONS_BYTES: Max size (bytes) of a single input partition when reading files.
- CHUNKING_WORKER_SPARK_MEMORY_FRACTION: Fraction of executor heap for unified execution/storage.
- CHUNKING_WORKER_SPARK_MEMORY_STORAGE_FRACTION: Fraction of the unified pool reserved for storage.
- CHUNKING_WORKER_SPARK_DRIVER_MEMORY_FRACTION: Fraction of spark.driver.memory to use.
- CHUNKING_WORKER_SPARK_SHUFFLE_COMPRESS: Compress shuffle blocks exchanged over the network.
- CHUNKING_WORKER_SPARK_SHUFFLE_SPILL_COMPRESS: Compress shuffle spill files written to disk.
- CHUNKING_WORKER_SPARK_MAX_SIZE_IN_FLIGHT: Max bytes per in-flight shuffle fetch.
- CHUNKING_WORKER_SPARK_MAX_REQS_IN_FLIGHT: Max concurrent shuffle fetch requests.
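The variable names and the defaults table below suggest a one-to-one correspondence with standard Spark properties. The sketch below makes that correspondence explicit by collecting any set tuning variables into `--conf` overrides; it is an assumption about how the values could be wired through, not the Chunking Service's actual implementation, and the helper name is hypothetical.

```python
import os

# Assumed mapping of tuning variables onto the Spark properties listed in
# "Default Spark Values"; the service's actual wiring may differ.
# CHUNKING_WORKER_SPARK_DRIVER_MEMORY_FRACTION is omitted because it scales
# spark.driver.memory rather than mapping onto a single Spark property.
ENV_TO_SPARK_CONF = {
    "CHUNKING_WORKER_MAX_PARTITIONS_BYTES": "spark.sql.files.maxPartitionBytes",
    "CHUNKING_WORKER_SPARK_MEMORY_FRACTION": "spark.memory.fraction",
    "CHUNKING_WORKER_SPARK_MEMORY_STORAGE_FRACTION": "spark.memory.storageFraction",
    "CHUNKING_WORKER_SPARK_SHUFFLE_COMPRESS": "spark.shuffle.compress",
    "CHUNKING_WORKER_SPARK_SHUFFLE_SPILL_COMPRESS": "spark.shuffle.spill.compress",
    "CHUNKING_WORKER_SPARK_MAX_SIZE_IN_FLIGHT": "spark.reducer.maxSizeInFlight",
    "CHUNKING_WORKER_SPARK_MAX_REQS_IN_FLIGHT": "spark.reducer.maxReqsInFlight",
}

def spark_conf_overrides() -> list[str]:
    """Collect --conf overrides for every tuning variable that is set."""
    overrides = []
    for env_var, spark_key in ENV_TO_SPARK_CONF.items():
        value = os.environ.get(env_var)
        if value is not None:
            overrides.extend(["--conf", f"{spark_key}={value}"])
    return overrides

print(spark_conf_overrides())
```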
Default Spark Values¶
| Setting | Default | Notes |
|---|---|---|
| spark.driver.memory | 1g | Driver JVM heap. |
| spark.memory.fraction | 0.6 | Portion of executor heap for unified exec/storage. |
| spark.memory.storageFraction | 0.5 | Share of unified pool reserved for storage (cache). |
| spark.sql.files.maxPartitionBytes | 134217728 (128 MB) | Max size of a single input split when reading files. |
| spark.shuffle.compress | true | Compress shuffle blocks exchanged over the network. |
| spark.shuffle.spill.compress | true | Compress shuffle spill files written to disk. |
| spark.reducer.maxSizeInFlight | 48m | Max bytes per in-flight shuffle fetch. |
| spark.reducer.maxReqsInFlight | 1000 | Max concurrent shuffle fetch requests. |
Chunk Size¶
The default chunk size is 4 GB. It cannot be changed within the app UI, but you can adjust it via environment variables:
- MIN_CHUNK_SIZE_BYTES (1 GB)
- MAX_CHUNK_SIZE_BYTES (4 GB)
Note: For Incremental Learning to work, the chunk size must be set so that the dataset is split into at least two chunks (one chunk is reserved for validation).
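To make the constraint concrete, the small check below estimates how many chunks a dataset produces at a given chunk size and confirms that at least two result. It is an illustration only; the exact splitting arithmetic is an assumption (simple ceiling division), and the function name is hypothetical.

```python
import math

def chunk_count(dataset_bytes: int, max_chunk_bytes: int = 4 * 1024**3) -> int:
    """Estimated number of chunks when a dataset is split at the given chunk size."""
    return math.ceil(dataset_bytes / max_chunk_bytes)

# A 100 GB dataset at the default 4 GB chunk size yields 25 chunks,
# one of which is reserved for validation per the note above.
chunks = chunk_count(100 * 1024**3)
assert chunks >= 2, "Chunk size too large: at least two chunks are required."
print(chunks)  # 25
```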
To configure these options, see Tuning DataRobot Environment Variables.
Example Customization (values in bytes)¶
```yaml
# helm chart values snippet
core:
  config_env_vars:
    CHUNKING_WORKER_DATASET_SIZE_SMALL_START: 10737418240
    CHUNKING_WORKER_DATASET_SIZE_MEDIUM_START: 26843545600
    CHUNKING_WORKER_DATASET_SIZE_LARGE_START: 53687091200
    CHUNKING_WORKER_DATASET_SIZE_XLARGE_START: 80530636800
    CHUNKING_WORKER_MEMORY_LIMIT_FOR_SMALL_DATASET: 64424509440
    CHUNKING_WORKER_MEMORY_LIMIT_FOR_MEDIUM_DATASET: 128849018880
    CHUNKING_WORKER_MEMORY_LIMIT_FOR_LARGE_DATASET: 191126044672
    CHUNKING_WORKER_MEMORY_LIMIT_FOR_XLARGE_DATASET: 274877906944
    CHUNKING_WORKER_MAX_PARTITIONS_BYTES: 67108864  # 64 MB
    CHUNKING_WORKER_SPARK_MEMORY_FRACTION: 0.7
    CHUNKING_WORKER_SPARK_MEMORY_STORAGE_FRACTION: 0.3
    CHUNKING_WORKER_SPARK_DRIVER_MEMORY_FRACTION: 0.5  # Use half of the driver memory
    CHUNKING_WORKER_SPARK_SHUFFLE_COMPRESS: "true"
    CHUNKING_WORKER_SPARK_SHUFFLE_SPILL_COMPRESS: "true"
    CHUNKING_WORKER_SPARK_MAX_SIZE_IN_FLIGHT: "16m"  # 16 megabytes
    CHUNKING_WORKER_SPARK_MAX_REQS_IN_FLIGHT: "64"
```