Sample datasets¶
You may find it useful to work with a sample of a dataset before bringing all the data into your Data Prep project. For large datasets, this can make initial exploration and discovery easier. The Sampling tool also gives you the flexibility to filter down to a specific set of rows in your data, and then sample on the remainder.
Work with the Sampling tool¶
To access the Sampling tool, click sampling in the project Tools bar.
Note
If you choose to sample your data, you are only shown the patterns, lookup combinations, and aggregations for that sample. When your exploration is complete, you can easily remove the sampling operation by either muting or deleting it in the Steps pane.
Sampling methods¶
Sampling can be based on a percentage of your dataset or a specific number of rows in the dataset.
- Percentage-based sampling: Perform a random, repeatable sample across your dataset based on the percentage you specify. You can optionally specify a column in your dataset to use for generating the sample; in that case, only the data in that column is used to determine the sample.
- Row-based sampling: Perform a random, repeatable sample across your dataset based on the number of rows you specify. The number of rows you specify is divided by the total number of rows in your dataset to determine the proportion to sample, and a subset of your data is returned. If you perform row-based sampling as a data prep step in your project, the number of rows you specify is divided by the total number of rows in the dataset produced by the previous step.
For both types of sampling, you can save the "sampling seed" number so that you can reproduce the same sampled subset later. To produce a different subset sample of your data, click the green reseed icon. For an optimal sample, your dataset should exceed 100,000 rows.
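The idea of a random but repeatable sample can be illustrated outside the tool. The following is a minimal Python sketch, not the Sampling tool's actual implementation: the function name `sample_by_percentage` and the per-row random draw are assumptions made for illustration. It shows how a fixed seed reproduces the same subset and how a new seed (a "reseed") can produce a different one.

```python
import random


def sample_by_percentage(rows, percentage, seed):
    """Keep each row with probability percentage/100 (hypothetical helper)."""
    rng = random.Random(seed)  # the same seed yields the same sequence of draws
    threshold = percentage / 100.0
    return [row for row in rows if rng.random() < threshold]


rows = list(range(1000))  # stand-in for the rows of a dataset

first = sample_by_percentage(rows, 10, seed=42)
again = sample_by_percentage(rows, 10, seed=42)     # same seed: same subset
reseeded = sample_by_percentage(rows, 10, seed=43)  # new seed: new draws

print(first == again)
print(len(first), len(reseeded))
```

With this approach the sample size is approximately, not exactly, the requested percentage, which is one reason small datasets tend to give less representative samples than large ones.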
Sample using percentage¶
To create a sample based on a percentage of your dataset:
- From the Tools bar, click sampling. The Sample using pane appears.
- Click Percentage if it is not already selected.
- Optionally, select a column. The sampling percentage is then based on the selected column.
- In the By Percentage field, enter the percentage of the dataset that you want included in the sample.
- Optionally, click the green reseed icon.
- Click Save.
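When a column is selected, only the data in that column determines the sample. One way to picture this is hash-based sampling on the column value, sketched below in Python. This is an assumption about how such sampling could work, not a description of the tool's internals; the helper names `keep_value` and `sample_by_column` and the `customer_id` column are hypothetical.

```python
import hashlib


def keep_value(value, percentage, seed):
    """Decide from the column value alone whether its rows are sampled."""
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000  # bucket in 0..9999
    return bucket < percentage * 100


def sample_by_column(rows, column, percentage, seed):
    """Keep the rows whose value in `column` falls inside the sampled buckets."""
    return [row for row in rows if keep_value(row[column], percentage, seed)]


# Hypothetical dataset: 500 rows spread over 50 customer IDs.
rows = [{"customer_id": i % 50, "amount": i} for i in range(500)]
sample = sample_by_column(rows, "customer_id", 20, seed=7)

kept_ids = {row["customer_id"] for row in sample}
print(len(sample), sorted(kept_ids))
```

In this sketch, rows that share a column value are kept or dropped together, so the percentage applies to the distinct values in the column and not to individual rows.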
Sample using rows¶
To create a sample based on a number of rows in your dataset:
- From the Tools bar, click sampling. The Sample using pane appears.
- Select the row-based option if it is not already selected.
- Enter the number of rows that you want included in the sample.
- Optionally, click the green reseed icon.
- Click Save.
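Row-based sampling, as described under Sampling methods, divides the requested number of rows by the row count of the dataset from the previous step. The Python sketch below illustrates that arithmetic for a filter step followed by a sample step. It is an illustration under stated assumptions: the function `sample_by_rows`, the `region` column, and the per-row random draw are hypothetical, and the source does not say whether the tool returns exactly the requested number of rows.

```python
import random


def sample_by_rows(rows, requested_rows, seed):
    """Turn a requested row count into a ratio of the input, then sample."""
    if not rows:
        return []
    ratio = min(1.0, requested_rows / len(rows))  # requested / total rows
    rng = random.Random(seed)                     # seeded for repeatability
    return [row for row in rows if rng.random() < ratio]


# Hypothetical dataset of 2000 rows; every fourth row belongs to region "EU".
dataset = [{"region": "EU" if i % 4 == 0 else "US", "id": i} for i in range(2000)]

# Previous step: filter down to a specific set of rows (500 remain).
previous_step = [row for row in dataset if row["region"] == "EU"]

# Sample step: 100 requested rows / 500 input rows gives a ratio of 0.2.
sample = sample_by_rows(previous_step, 100, seed=42)
print(len(previous_step), len(sample))
```

Because the ratio is computed against the filtered output and not the full dataset, the same requested row count covers a larger share of the data after a restrictive filter.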