Incremental Learning Overview¶
Incremental Learning enables modeling on static datasets up to 100 GB (and effectively unlimited size for dynamic datasets). Datasets are split into chunks by the Chunking Service, and those chunks are fed into Incremental Learning for training. Incremental Learning supports chunk sizes up to 4 GB. Larger workers are provisioned dynamically as chunk sizes grow.
Feature Flags¶
The Chunking Service is enabled by default. To disable it, set DISABLE_DATA_CHUNKING_SERVICE to true.
The feature flag INCREMENTAL_LEARNING_IMPROVEMENTS is only required if you want to tune Spark configurations (see below).
Example configuration (Helm values):
```yaml
# helm chart values snippet
core:
  config_env_vars:
    DISABLE_DATA_CHUNKING_SERVICE: false
    INCREMENTAL_LEARNING_IMPROVEMENTS: false
```
Worker Resources for Static Datasets¶
For static datasets in AI Registry, Incremental Learning uses Apache Spark workers to chunk the dataset. Memory requirements are based on dataset size. Because all chunks are created at once (e.g., a 100 GB dataset becomes ~25 × 4 GB chunks), sufficient resources are required to complete the job.
Datasets are grouped into four size bins: SMALL, MEDIUM, LARGE, and XLARGE.
Thresholds for these bins are controlled by environment variables (defaults in parentheses):
- CHUNKING_WORKER_DATASET_SIZE_SMALL_START (10 GB)
- CHUNKING_WORKER_DATASET_SIZE_MEDIUM_START (25 GB)
- CHUNKING_WORKER_DATASET_SIZE_LARGE_START (50 GB)
- CHUNKING_WORKER_DATASET_SIZE_XLARGE_START (75 GB)
For each bin, you can set the worker memory limit (defaults in parentheses):
- CHUNKING_WORKER_MEMORY_LIMIT_FOR_SMALL_DATASET (60 GB)
- CHUNKING_WORKER_MEMORY_LIMIT_FOR_MEDIUM_DATASET (120 GB)
- CHUNKING_WORKER_MEMORY_LIMIT_FOR_LARGE_DATASET (178 GB)
- CHUNKING_WORKER_MEMORY_LIMIT_FOR_XLARGE_DATASET (256 GB)
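To illustrate how the thresholds and memory limits interact, the sketch below classifies a dataset into a bin and looks up that bin's default worker memory limit. It is a minimal illustration, not product code, and it assumes a dataset falls into the largest bin whose *_START threshold it meets or exceeds; the function and variable names are hypothetical.

```python
# Illustrative only: bin selection assuming a dataset lands in the largest
# bin whose *_START threshold it meets or exceeds (defaults listed above).
GIB = 1024 ** 3

BIN_THRESHOLDS = [          # (bin name, *_START threshold, worker memory limit)
    ("XLARGE", 75 * GIB, 256 * GIB),
    ("LARGE",  50 * GIB, 178 * GIB),
    ("MEDIUM", 25 * GIB, 120 * GIB),
    ("SMALL",  10 * GIB,  60 * GIB),
]

def classify_dataset(size_bytes: int) -> tuple[str, int]:
    """Return the size bin and its default worker memory limit."""
    for name, start, memory_limit in BIN_THRESHOLDS:
        if size_bytes >= start:
            return name, memory_limit
    raise ValueError("Datasets under 10 GB are below the supported minimum.")

# Example: a 60 GB dataset lands in the LARGE bin (178 GB worker memory limit).
print(classify_dataset(60 * GIB))
```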
General Guidance¶
The maximum supported static dataset size is 100 GB (upper end of XLARGE). The minimum is 10 GB. Recommended instance sizes:
| Dataset Size Range | Size Classification | Instance Size Recommendation (Default) |
|---|---|---|
| 10 GB–25 GB | SMALL | 64 GB |
| 25 GB–50 GB | MEDIUM | 128 GB |
| 50 GB–75 GB | LARGE | 256 GB |
| 75 GB–100 GB | XLARGE | 512 GB |
These thresholds and memory limits are configurable to accommodate deployments that define “SMALL” through “XLARGE” differently and provision worker resources accordingly.
Spark Configurations¶
Certain Apache Spark settings can be tuned via environment variables. To enable tuning, set INCREMENTAL_LEARNING_IMPROVEMENTS to true (Private Preview). When enabled, the following variables override the Spark defaults listed in the Default Spark Values table below:
- CHUNKING_WORKER_MAX_PARTITIONS_BYTES: Max size (bytes) of a single input partition when reading files.
- CHUNKING_WORKER_SPARK_MEMORY_FRACTION: Fraction of executor heap for unified execution/storage.
- CHUNKING_WORKER_SPARK_MEMORY_STORAGE_FRACTION: Fraction of the unified pool reserved for storage.
- CHUNKING_WORKER_SPARK_DRIVER_MEMORY_FRACTION: Fraction of spark.driver.memory to use.
- CHUNKING_WORKER_SPARK_SHUFFLE_COMPRESS: Compress shuffle blocks exchanged over the network.
- CHUNKING_WORKER_SPARK_SHUFFLE_SPILL_COMPRESS: Compress shuffle spill files written to disk.
- CHUNKING_WORKER_SPARK_MAX_SIZE_IN_FLIGHT: Max bytes per in-flight shuffle fetch.
- CHUNKING_WORKER_SPARK_MAX_REQS_IN_FLIGHT: Max concurrent shuffle fetch requests.
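The variable names and the defaults table below suggest a one-to-one correspondence with standard Spark properties. The sketch below makes that correspondence explicit by collecting any set tuning variables into `--conf` overrides; it is an assumption about how the values could be wired through, not the Chunking Service's actual implementation, and the helper name is hypothetical.

```python
import os

# Assumed mapping of tuning variables onto the Spark properties listed in
# "Default Spark Values"; the service's actual wiring may differ.
# CHUNKING_WORKER_SPARK_DRIVER_MEMORY_FRACTION is omitted because it scales
# spark.driver.memory rather than mapping onto a single Spark property.
ENV_TO_SPARK_CONF = {
    "CHUNKING_WORKER_MAX_PARTITIONS_BYTES": "spark.sql.files.maxPartitionBytes",
    "CHUNKING_WORKER_SPARK_MEMORY_FRACTION": "spark.memory.fraction",
    "CHUNKING_WORKER_SPARK_MEMORY_STORAGE_FRACTION": "spark.memory.storageFraction",
    "CHUNKING_WORKER_SPARK_SHUFFLE_COMPRESS": "spark.shuffle.compress",
    "CHUNKING_WORKER_SPARK_SHUFFLE_SPILL_COMPRESS": "spark.shuffle.spill.compress",
    "CHUNKING_WORKER_SPARK_MAX_SIZE_IN_FLIGHT": "spark.reducer.maxSizeInFlight",
    "CHUNKING_WORKER_SPARK_MAX_REQS_IN_FLIGHT": "spark.reducer.maxReqsInFlight",
}

def spark_conf_overrides() -> list[str]:
    """Collect --conf overrides for every tuning variable that is set."""
    overrides = []
    for env_var, spark_key in ENV_TO_SPARK_CONF.items():
        value = os.environ.get(env_var)
        if value is not None:
            overrides.extend(["--conf", f"{spark_key}={value}"])
    return overrides

print(spark_conf_overrides())
```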
Default Spark Values¶
| Setting | Default | Notes |
|---|---|---|
| spark.driver.memory | 1g | Driver JVM heap. |
| spark.memory.fraction | 0.6 | Portion of executor heap for unified exec/storage. |
| spark.memory.storageFraction | 0.5 | Share of unified pool reserved for storage (cache). |
| spark.sql.files.maxPartitionBytes | 134217728 (128 MB) | Max size of a single input split when reading files. |
| spark.shuffle.compress | true | Compress shuffle blocks exchanged over the network. |
| spark.shuffle.spill.compress | true | Compress shuffle spill files written to disk. |
| spark.reducer.maxSizeInFlight | 48m | Max bytes per in-flight shuffle fetch. |
| spark.reducer.maxReqsInFlight | 1000 | Max concurrent shuffle fetch requests. |
Chunk Size¶
The default chunk size is 4 GB. It cannot be changed within the app UI, but you can adjust it via environment variables:
- MIN_CHUNK_SIZE_BYTES (1 GB)
- MAX_CHUNK_SIZE_BYTES (4 GB)
Note: For Incremental Learning to work, the chunk size must be set so that the dataset is split into at least two chunks (one chunk is reserved for validation).
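To make the constraint concrete, the small check below estimates how many chunks a dataset produces at a given chunk size and confirms that at least two result. It is an illustration only; the exact splitting arithmetic is an assumption (simple ceiling division), and the function name is hypothetical.

```python
import math

def chunk_count(dataset_bytes: int, max_chunk_bytes: int = 4 * 1024**3) -> int:
    """Estimated number of chunks when a dataset is split at the given chunk size."""
    return math.ceil(dataset_bytes / max_chunk_bytes)

# A 100 GB dataset at the default 4 GB chunk size yields 25 chunks,
# one of which is reserved for validation per the note above.
chunks = chunk_count(100 * 1024**3)
assert chunks >= 2, "Chunk size too large: at least two chunks are required."
print(chunks)  # 25
```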
To configure these options, see Tuning DataRobot Environment Variables.
Example Customization (values in bytes)¶
```yaml
# helm chart values snippet
core:
  config_env_vars:
    CHUNKING_WORKER_DATASET_SIZE_SMALL_START: 10737418240
    CHUNKING_WORKER_DATASET_SIZE_MEDIUM_START: 26843545600
    CHUNKING_WORKER_DATASET_SIZE_LARGE_START: 53687091200
    CHUNKING_WORKER_DATASET_SIZE_XLARGE_START: 80530636800
    CHUNKING_WORKER_MEMORY_LIMIT_FOR_SMALL_DATASET: 64424509440
    CHUNKING_WORKER_MEMORY_LIMIT_FOR_MEDIUM_DATASET: 128849018880
    CHUNKING_WORKER_MEMORY_LIMIT_FOR_LARGE_DATASET: 191126044672
    CHUNKING_WORKER_MEMORY_LIMIT_FOR_XLARGE_DATASET: 274877906944
    CHUNKING_WORKER_MAX_PARTITIONS_BYTES: 67108864  # 64 MB
    CHUNKING_WORKER_SPARK_MEMORY_FRACTION: 0.7
    CHUNKING_WORKER_SPARK_MEMORY_STORAGE_FRACTION: 0.3
    CHUNKING_WORKER_SPARK_DRIVER_MEMORY_FRACTION: 0.5  # Use half of the driver memory
    CHUNKING_WORKER_SPARK_SHUFFLE_COMPRESS: "true"
    CHUNKING_WORKER_SPARK_SHUFFLE_SPILL_COMPRESS: "true"
    CHUNKING_WORKER_SPARK_MAX_SIZE_IN_FLIGHT: "16m"  # 16 megabytes
    CHUNKING_WORKER_SPARK_MAX_REQS_IN_FLIGHT: "64"
```