Your Data Prep Administrator must enable this feature in your application.
When you profile a dataset, you generate statistics about the data in that dataset. The results are displayed on the dataset's Profile page.
The profile is also saved automatically in the library as an AnswerSet, with a name that identifies it as a profile.
How can I use profiles of my data?
Receiving data is like receiving a package without knowing what's inside. A package comes with a packing slip, so you don't have to tear through everything to learn its contents. In Data Prep, a profile plays the same role for your data: it lets you understand the data quickly without digging through it.
Data profiles are essential for determining the quality of data in a dataset before you begin working with that data. For example, you can quickly determine whether the data contains mixed types, nulls, non-printable characters, or patterns that don't belong.
Based on the data profile, you can address quality issues by bringing the data into a Data Prep project.
As you continue to update versions of a dataset in the library—through either manual or automated import—you can continue to profile each subsequent version. In this way, you can monitor the data quality, version over version, and you can remediate as necessary.
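The version-over-version monitoring described above can be illustrated with a minimal Python sketch. It assumes the profile statistics for two dataset versions have already been exported and loaded into plain dictionaries; the dictionary layout, the function name `quality_regressions`, and the `tolerance` parameter are hypothetical and not part of Data Prep.

```python
def quality_regressions(previous, current, metric="% Blank", tolerance=1.0):
    """Return columns whose metric rose by more than `tolerance` points.

    `previous` and `current` map column name -> {statistic name: value}.
    This layout is an assumed, in-memory stand-in for two profile AnswerSets.
    """
    flagged = {}
    for column, stats in current.items():
        if column not in previous:
            continue  # new column: nothing to compare against
        delta = stats[metric] - previous[column][metric]
        if delta > tolerance:
            flagged[column] = delta
    return flagged


# Two hypothetical profile snapshots of the same dataset.
v1 = {"city": {"% Blank": 2.0}, "zip": {"% Blank": 0.0}}
v2 = {"city": {"% Blank": 2.5}, "zip": {"% Blank": 7.5}}
regressions = quality_regressions(v1, v2)
```

Here only `zip` is flagged, because its blank percentage rose by more than the one-point tolerance. Columns flagged this way are candidates for remediation in a project.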
What is the meaning of each column that I see in a profile's AnswerSet?
When you profile a dataset, the result is an AnswerSet that has one row for each column in your dataset. Each column in the profile AnswerSet holds one statistic about those dataset columns.
The data profile includes the following statistics for each column:
|# of Rows||Total number of rows in the dataset|
|% Blank||Percentage of blanks in the column|
|% Text||Percentage of text values in the column|
|% Number||Percentage of numeric values in the column|
|% Date||Percentage of date values in the column|
|% Boolean||Percentage of boolean values in the column|
|% of Predominant Data Type||Percentage of values in the column that contain the most dominant data type|
|Predominant Data Type||Most dominant data type in the column|
|# of Unique Values||Number of unique values in the column|
|# of Phonetically Unique Values (Metaphone)||Number of unique values in the column after clustering like values using the metaphone (sounds like) algorithm. For example, "Good Samaritan" and "Good Samaratan" are clustered and counted as the same value.|
|% Possible Phonetic Duplicates (Metaphone)||Ratio of the number of phonetically unique values (metaphone) to the number of unique values. This ratio indicates the possibility of duplicates in the column. A higher number indicates a higher probability of potential duplicate values, and suggests that you may need to run a cluster-and-edit operation on the column to identify them.|
|Top 5||Top five most common values in the column|
|Min String Length||Shortest string length in the column|
|Max String Length||Longest string length in the column|
|Avg String Length||Average string length for strings in the column|
|# of NAs or NONEs or NULLs||Number of times the column contains "na", "none", or "null"|
|% All Upper Case||Percentage of cells containing all upper case characters|
|% All Lower Case||Percentage of cells containing all lower case characters|
|% with Non Standard ASCII Chars||Percentage of cells containing non-printable characters, for example control characters|
|% with HTML Tags||Percentage of cells containing HTML tags|
|Avg # of Consecutive Spaces||Average number of consecutive spaces that are found in the column|
|% Negative Numbers||Percentage of cells containing negative numbers|
|% Zeros||Percentage of cells containing the value zero|
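To make a few of these definitions concrete, here is a minimal Python sketch that computes a subset of the statistics for a single column. It is not how Data Prep computes a profile. The function name `profile_column`, the treatment of `None` and empty strings as blanks, the case-insensitive NA matching, and the choice to use all rows as the denominator for percentages are assumptions made for illustration. The phonetic statistics are omitted because they require a Metaphone implementation.

```python
from collections import Counter

# Assumed tokens for the NA/NONE/NULL count, matched case-insensitively.
NA_TOKENS = {"na", "none", "null"}


def profile_column(values):
    """Compute a few profile-style statistics for one column.

    `values` is a list of cell values as strings; None or "" counts as blank.
    """
    total = len(values)
    present = [v for v in values if v is not None and v != ""]
    blank_count = total - len(present)

    def pct(count):
        # Percentage of all rows, guarded against an empty column.
        return 100.0 * count / total if total else 0.0

    lengths = [len(v) for v in present]
    return {
        "# of Rows": total,
        "% Blank": pct(blank_count),
        "# of Unique Values": len(set(present)),
        "Top 5": [v for v, _ in Counter(present).most_common(5)],
        "Min String Length": min(lengths) if lengths else 0,
        "Max String Length": max(lengths) if lengths else 0,
        "Avg String Length": sum(lengths) / len(lengths) if lengths else 0.0,
        "# of NAs or NONEs or NULLs": sum(
            v.strip().lower() in NA_TOKENS for v in present
        ),
        "% All Upper Case": pct(sum(v.isupper() for v in present)),
        "% All Lower Case": pct(sum(v.islower() for v in present)),
    }


# A small made-up column with mixed case, blanks, and an "na" marker.
sample = ["Boston", "BOSTON", "", "na", "Chicago", "Boston", None, "austin"]
stats = profile_column(sample)
```

For this sample, "Boston" and "BOSTON" count as two distinct values, which is the kind of inconsistency the upper case and lower case percentages help surface.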
Create a profile for a dataset
To create a data profile:
On the Library page, hover over the dataset for which you want to create a data profile.
Click More Actions and select Profile.
On the Profile page, click Generate Profile on the top right.
The profile appears in the Profile pane. In addition, the profile is automatically saved as an AnswerSet in the library.
The library preview of the AnswerSet is limited to the first 100 rows of the profile.