Incremental Learning Overview

Incremental Learning enables modeling on static datasets up to 100 GB (and effectively unlimited size for dynamic datasets). Datasets are split into chunks by the Chunking Service, and those chunks are fed into Incremental Learning for training. Incremental Learning supports chunk sizes up to 4 GB. Larger workers are provisioned dynamically as chunk sizes grow.
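
For a rough sense of the chunk count, here is a minimal Python sketch (illustrative only, not part of the product) of how a static dataset maps onto 4 GB chunks:

# chunk count estimate (illustrative)
import math

CHUNK_SIZE_GB = 4          # maximum chunk size supported by Incremental Learning
dataset_size_gb = 100      # example: the largest supported static dataset size

# Each chunk holds up to CHUNK_SIZE_GB, so the count is the ceiling of the ratio.
num_chunks = math.ceil(dataset_size_gb / CHUNK_SIZE_GB)
print(num_chunks)          # -> 25 chunks for a 100 GB dataset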

Feature Flags

The Chunking Service is enabled by default. To disable it, set DISABLE_DATA_CHUNKING_SERVICE to true. The feature flag INCREMENTAL_LEARNING_IMPROVEMENTS is only required if you want to tune Spark configurations (see below).

Example configuration (Helm values):

# helm chart values snippet
core:
  config_env_vars:
    DISABLE_DATA_CHUNKING_SERVICE: "false"
    INCREMENTAL_LEARNING_IMPROVEMENTS: "false"

Worker Resources for Static Datasets

For static datasets in AI Registry, Incremental Learning uses Apache Spark workers to chunk the dataset. Memory requirements are based on dataset size. Because all chunks are created at once (e.g., a 100 GB dataset becomes ~25 × 4 GB chunks), sufficient resources are required to complete the job.

Datasets are grouped into four size bins:

  • SMALL
  • MEDIUM
  • LARGE
  • XLARGE

Thresholds for these bins are controlled by environment variables (defaults in parentheses):

  • CHUNKING_WORKER_DATASET_SIZE_SMALL_START (10 GB)
  • CHUNKING_WORKER_DATASET_SIZE_MEDIUM_START (25 GB)
  • CHUNKING_WORKER_DATASET_SIZE_LARGE_START (50 GB)
  • CHUNKING_WORKER_DATASET_SIZE_XLARGE_START (75 GB)

For each bin, you can set the worker memory limit (defaults in parentheses):

  • CHUNKING_WORKER_MEMORY_LIMIT_FOR_SMALL_DATASET (60 GB)
  • CHUNKING_WORKER_MEMORY_LIMIT_FOR_MEDIUM_DATASET (120 GB)
  • CHUNKING_WORKER_MEMORY_LIMIT_FOR_LARGE_DATASET (178 GB)
  • CHUNKING_WORKER_MEMORY_LIMIT_FOR_XLARGE_DATASET (256 GB)
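
To illustrate how these variables fit together, the following Python sketch maps a dataset size to its bin and default worker memory limit. The thresholds and limits are the GiB-based defaults listed above (matching the byte values in the customization example later in this page); the bin-selection logic itself is an assumption for illustration, not the service's actual code:

# dataset-size-to-bin mapping (illustrative; assumes GiB-based thresholds)
GIB = 1024 ** 3

# Default bin start thresholds from the environment variables above.
BIN_STARTS = [
    ("SMALL", 10 * GIB),
    ("MEDIUM", 25 * GIB),
    ("LARGE", 50 * GIB),
    ("XLARGE", 75 * GIB),
]

# Default worker memory limits per bin from the environment variables above.
MEMORY_LIMITS = {"SMALL": 60 * GIB, "MEDIUM": 120 * GIB, "LARGE": 178 * GIB, "XLARGE": 256 * GIB}

def pick_bin(dataset_size_bytes):
    # Choose the largest bin whose start threshold the dataset has reached.
    chosen = "SMALL"
    for name, start in BIN_STARTS:
        if dataset_size_bytes >= start:
            chosen = name
    return chosen

size = 60 * GIB                      # example: a 60 GB dataset
bin_name = pick_bin(size)
print(bin_name, MEMORY_LIMITS[bin_name] // GIB, "GB worker memory limit")   # -> LARGE 178 GB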

General Guidance

The maximum supported static dataset size is 100 GB (upper end of XLARGE). The minimum is 10 GB. Recommended instance sizes:

Dataset Size Range   Size Classification   Instance Size Recommendation (Default)
10 GB–25 GB          SMALL                 64 GB
25 GB–50 GB          MEDIUM                128 GB
50 GB–75 GB          LARGE                 256 GB
75 GB–100 GB         XLARGE                512 GB

These thresholds and memory limits are configurable so customers can adjust what qualifies as “SMALL” through “XLARGE” and the worker resources assigned to each bin.

Spark Configurations

Certain Apache Spark settings can be tuned via environment variables. To enable tuning, set INCREMENTAL_LEARNING_IMPROVEMENTS to true (Private Preview). When enabled, the following environment variables override the Spark defaults listed in the table below:

  • CHUNKING_WORKER_MAX_PARTITIONS_BYTES — Max size (bytes) of a single input partition when reading files.
  • CHUNKING_WORKER_SPARK_MEMORY_FRACTION — Fraction of executor heap for unified execution/storage.
  • CHUNKING_WORKER_SPARK_MEMORY_STORAGE_FRACTION — Fraction of the unified pool reserved for storage.
  • CHUNKING_WORKER_SPARK_DRIVER_MEMORY_FRACTION — Fraction of spark.driver.memory to use.
  • CHUNKING_WORKER_SPARK_SHUFFLE_COMPRESS — Compress shuffle blocks exchanged over the network.
  • CHUNKING_WORKER_SPARK_SHUFFLE_SPILL_COMPRESS — Compress shuffle spill files written to disk.
  • CHUNKING_WORKER_SPARK_MAX_SIZE_IN_FLIGHT — Max bytes per in-flight shuffle fetch.
  • CHUNKING_WORKER_SPARK_MAX_REQS_IN_FLIGHT — Max concurrent shuffle fetch requests.
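
As an illustration of how these variables line up with the Spark properties in the table below, here is a small Python sketch that reads the overrides from the environment and prints the corresponding --conf pairs. The mapping follows the list above; the code is an assumption about shape, not DataRobot's implementation:

# env-var-to-Spark-property mapping (illustrative)
import os

ENV_TO_SPARK = {
    "CHUNKING_WORKER_MAX_PARTITIONS_BYTES": "spark.sql.files.maxPartitionBytes",
    "CHUNKING_WORKER_SPARK_MEMORY_FRACTION": "spark.memory.fraction",
    "CHUNKING_WORKER_SPARK_MEMORY_STORAGE_FRACTION": "spark.memory.storageFraction",
    "CHUNKING_WORKER_SPARK_SHUFFLE_COMPRESS": "spark.shuffle.compress",
    "CHUNKING_WORKER_SPARK_SHUFFLE_SPILL_COMPRESS": "spark.shuffle.spill.compress",
    "CHUNKING_WORKER_SPARK_MAX_SIZE_IN_FLIGHT": "spark.reducer.maxSizeInFlight",
    "CHUNKING_WORKER_SPARK_MAX_REQS_IN_FLIGHT": "spark.reducer.maxReqsInFlight",
}

# CHUNKING_WORKER_SPARK_DRIVER_MEMORY_FRACTION is a fraction of spark.driver.memory,
# so it has no one-to-one Spark property and is omitted from this sketch.
overrides = {spark_key: os.environ[env_var]
             for env_var, spark_key in ENV_TO_SPARK.items()
             if env_var in os.environ}

for key, value in overrides.items():
    print(f"--conf {key}={value}")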

Default Spark Values

Setting                              Default               Notes
spark.driver.memory                  1g                    Driver JVM heap.
spark.memory.fraction                0.6                   Portion of executor heap for unified execution/storage.
spark.memory.storageFraction         0.5                   Share of unified pool reserved for storage (cache).
spark.sql.files.maxPartitionBytes    134217728 (128 MB)    Max size of a single input split when reading files.
spark.shuffle.compress               true                  Compress shuffle blocks exchanged over the network.
spark.shuffle.spill.compress         true                  Compress shuffle spill files written to disk.
spark.reducer.maxSizeInFlight        48m                   Max bytes per in-flight shuffle fetch.
spark.reducer.maxReqsInFlight        1000                  Max concurrent shuffle fetch requests.

Chunk Size

The default chunk size is 4 GB. It cannot be changed within the app UI, but you can adjust it via environment variables:

  • MIN_CHUNK_SIZE_BYTES (1 GB)
  • MAX_CHUNK_SIZE_BYTES (4 GB)

Note: For Incremental Learning to work, the dataset must be large enough, relative to the configured chunk size, to yield at least two chunks (one chunk is reserved for validation).
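
A quick way to sanity-check this constraint is the Python sketch below (illustrative; the variable name comes from above, and the two-chunk rule is as stated in the note):

# minimum-chunk check (illustrative)
import math

MAX_CHUNK_SIZE_BYTES = 4 * 1024 ** 3     # default maximum chunk size (4 GB)
dataset_size_bytes = 6 * 1024 ** 3       # example: a 6 GB dataset

# One chunk is reserved for validation, so at least two chunks are needed in total.
num_chunks = math.ceil(dataset_size_bytes / MAX_CHUNK_SIZE_BYTES)
print("OK" if num_chunks >= 2 else "too small: Incremental Learning needs at least two chunks")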

To configure these options, see Tuning DataRobot Environment Variables.

Example Customization (values in bytes)

# helm chart values snippet
core:
  config_env_vars:
    CHUNKING_WORKER_DATASET_SIZE_SMALL_START: 10737418240
    CHUNKING_WORKER_DATASET_SIZE_MEDIUM_START: 26843545600
    CHUNKING_WORKER_DATASET_SIZE_LARGE_START: 53687091200
    CHUNKING_WORKER_DATASET_SIZE_XLARGE_START: 80530636800
    CHUNKING_WORKER_MEMORY_LIMIT_FOR_SMALL_DATASET: 64424509440
    CHUNKING_WORKER_MEMORY_LIMIT_FOR_MEDIUM_DATASET: 128849018880
    CHUNKING_WORKER_MEMORY_LIMIT_FOR_LARGE_DATASET: 191126044672
    CHUNKING_WORKER_MEMORY_LIMIT_FOR_XLARGE_DATASET: 274877906944

    CHUNKING_WORKER_MAX_PARTITIONS_BYTES: 67108864             # 64 MB
    CHUNKING_WORKER_SPARK_MEMORY_FRACTION: 0.7
    CHUNKING_WORKER_SPARK_MEMORY_STORAGE_FRACTION: 0.3
    CHUNKING_WORKER_SPARK_DRIVER_MEMORY_FRACTION: 0.5          # Use half of the driver memory
    CHUNKING_WORKER_SPARK_SHUFFLE_COMPRESS: "true"
    CHUNKING_WORKER_SPARK_SHUFFLE_SPILL_COMPRESS: "true"
    CHUNKING_WORKER_SPARK_MAX_SIZE_IN_FLIGHT: "16m"            # 16 megabytes
    CHUNKING_WORKER_SPARK_MAX_REQS_IN_FLIGHT: "64"
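
If you need to derive your own byte values, a small Python helper (illustrative) reproduces the numbers used above, which are GiB-based multiples of 1024³:

# GiB-to-bytes conversion used for the values above (illustrative)
def gib_to_bytes(gib: int) -> int:
    return gib * 1024 ** 3

print(gib_to_bytes(10))    # 10737418240  (CHUNKING_WORKER_DATASET_SIZE_SMALL_START)
print(gib_to_bytes(75))    # 80530636800  (CHUNKING_WORKER_DATASET_SIZE_XLARGE_START)
print(gib_to_bytes(256))   # 274877906944 (CHUNKING_WORKER_MEMORY_LIMIT_FOR_XLARGE_DATASET)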