Training on large datasets
Hi Team, we've got a couple of questions from a customer about our data ingest limits. I'd appreciate it if someone could answer them or point me in the right direction.
What are some good practices for handling AutoML and EDA on larger datasets?
We are still using this presentation based on my R&D work:
The original training data, converted from a SAS model, has 21M records and 6425 features; the physical data size is about 260GB in CSV format. We want to feed in all 21M records with 960+ features, for an estimated data size of about 40GB in CSV format.
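As a quick sanity check on the 40GB figure, a back-of-envelope sketch: the full dataset implies an average bytes-per-cell rate, which can be projected onto the reduced feature set. (The numbers are from the post above; the bytes-per-cell model is an assumption that ignores per-feature width differences.)

```python
# Back-of-envelope check of the CSV size estimate above.
rows = 21_000_000
full_features = 6425
full_size_gb = 260

# Average bytes per cell implied by the full dataset (includes delimiters).
bytes_per_cell = full_size_gb * 1e9 / (rows * full_features)

# Projected size with the reduced 960-feature set.
reduced_features = 960
est_gb = rows * reduced_features * bytes_per_cell / 1e9
print(f"{bytes_per_cell:.2f} bytes/cell -> ~{est_gb:.0f} GB")  # ~1.93 bytes/cell -> ~39 GB
```

The projection lands at roughly 39GB, consistent with the ~40GB estimate.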
- 100GB: 99GB (training) + 1GB (external holdout)
- Self-managed: 96 CPUs, 3TB RAM, 50TB HDD
- SaaS: cloud account with 20 modelling workers
Do we really need to train the model on a large dataset?
Divide and conquer approach
Dataset sampling:

1. Randomly sample the original dataset down to N GB.
2. Run Autopilot.
3. Deploy the recommended-to-deploy model (trained on 100% of the sampled dataset).

Feature sampling:

1. Take an N GB sample, with all features.
2. Run Autopilot.
3. Run Feature Impact on the recommended-to-deploy model (trained on 100% of the sampled dataset).
4. Use Feature Impact to select the features having impact of more than 1%.
5. Select those features (>= 1%) from the full dataset; if the result is less than 10GB, model all rows.
6. If the result is > 10GB, randomly sample the dataset from Step 5 down to 10GB.
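The feature-sampling loop above can be sketched outside the platform as well. Below is a minimal, hedged illustration: scikit-learn's permutation importance stands in for Feature Impact, a random row sample stands in for the N GB sample, and the data is synthetic so the snippet is self-contained. The 1% threshold mirrors the step above; none of this is the platform's actual implementation.

```python
# Illustrative sketch of the feature-sampling steps, with permutation
# importance as a stand-in for Feature Impact. All names and data here
# are synthetic, for demonstration only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n_rows, n_features = 2_000, 20
X = pd.DataFrame(rng.normal(size=(n_rows, n_features)),
                 columns=[f"f{i}" for i in range(n_features)])
# Only f0 and f1 carry signal; the rest are noise.
y = 3 * X["f0"] + 2 * X["f1"] + rng.normal(scale=0.1, size=n_rows)

# Step 1: take a row sample of the full dataset (stand-in for the N GB sample).
sample = X.sample(frac=0.5, random_state=0)
y_sample = y.loc[sample.index]

# Steps 2-3: fit a model on the sample and measure feature impact.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(sample, y_sample)
imp = permutation_importance(model, sample, y_sample, n_repeats=5, random_state=0)

# Step 4: normalize impact so it sums to 1, then keep features above 1%.
impact = pd.Series(imp.importances_mean, index=X.columns).clip(lower=0)
impact /= impact.sum()
selected = impact[impact > 0.01].index.tolist()

# Steps 5-6: carry only the selected columns into the full-data modelling run
# (in practice, re-check the resulting size against the 10GB threshold here).
X_reduced = X[selected]
print(sorted(selected))
```

On this synthetic data the selection recovers the two signal features, and the reduced frame is what would go into the final large-scale run.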
- Both divide and conquer approaches (dataset sampling and feature sampling) can challenge models trained on a full-size dataset.
- Full-size trained model vs. dataset sampling: +1.5% (worst case) and +15.2% (best case).
- Full-size trained model vs. feature sampling: -0.7% (worst case) and +8.7% (best case).
- Feature sampling is suitable for datasets containing hundreds of features (or more) and can lead to models that are similar, or even superior, across all metrics (accuracy, training/scoring time) compared to models trained on the full dataset.