Data type transformations¶
Data Prep's data transformation capabilities support heterogenous data types—meaning that data types are automatically identified at the cell level when importing a dataset into the Data Prep library. The ability to support heterogenous data types, and within the same Data Prep project column, is powerful because it enables you to bring all of the data into your project. The mixed data types with inherent data quality issues can then be easily homogenized and harmonized with Data Prep, as explained in the Best Practices section below. In other strongly-typed applications that do not support heterogenous data, the source data must be homogenized, using a different tool, before your data prep work can even begin.
The purpose of this article is to explain when and how Data Prep determines data types for cells and columns, and the best practices for working with heterogeneous data in a Data Prep project.
How does Data Prep identify data types?¶
When you import a dataset into the Data Prep library, all data, in all cells, is automatically identified as one of the following data types during the import process:
String or Text
Date Time (under conditions described below)
Data Prep does this through an algorithm that follows these rules:
If the value is null, ignore the value.
If the value is, exactly, either “true” or “false”, treat the value as a Boolean.
If the value can be programmatically read as a number, treat it as Numeric.
All other values default to String.
For example, you have a dataset with 10 columns and 1 million rows of data. This translates to a total of 10 million cells. In this case, Data Prep identifies the data type for each one of the 10 million cells following the algorithm rules above.
Then, the data type for each column is cast based on the predominant data type that resides within each cell of a column.
Date Time values¶
As a rule, due to the inherent complexities required to analyze and resolve for the multitude of date time formats, Data Prep does not identify date time values in flat files. However, there are some exceptions to this rule. Under the following conditions, cell data will be recognized as type date time when imported from:
Database table (JDBC, Hive, etc) and the database provides a schema
Parquet file and the Parquet format provides the schema
Microsoft Excel file and the Excel format specifies a data type for each cell
How does Data Prep determine the column type for heterogeneous data within the same column? Returning to the example above, in 1 million rows of data, there's a good possibility that, within the same column, there is data belonging to different data types—for example string and numeric values could be mixed in the same column. In this case, Data Prep has further logic to determine how to cast the column's data type. Let's use another very simple example to illustrate that logic.
Here is a column of data that has 15 rows. The first 9 rows are identified as type Numeric, and the remaining 6 values are identified as type String. Notice the column type has been cast as Numeric:
During the import process, the algorithm calculation is automatically performed to detect that 60% of values are Numeric and 40% Strings. The predominant data type in a column then determines how Data Prep will cast the column type—in this case, the column is cast as type Numeric.
What if there's a tie for data types in a column?¶
In the event there is a tie—meaning 50% of the column values are one type while 50% are another type—the calculation logic provides these additional rules to break the tie:
|Column data mix||Winner|
|50% Boolean and 50% Date||Boolean|
|50% Boolean and 50% Numeric||Boolean|
|50% Boolean and 50% String||Boolean|
|50% Date and 50% Numeric||Numeric|
|50% Date and 50% String||String|
|50% Numeric and 50% String||String|
To summarize, in order of predominance, ties are broken are as follows:
Effects of adding data¶
What happens if the predominant data type in a column changes when you bring in new data to my Data Prep library or an existing project? Column type inferencing and subsequent casting only occurs during the import process to the Data Prep library.
There are two potential scenarios when a column type may not accurately reflect the predominant data type in a column:
During import into the Data Prep library, the first 1,000 rows of data are used to inference the column type. As a general rule, Data Prep has found that 1,000 rows of data—a configurable value—are sufficient to accurately inference and cast column types for your datasets. These first 1,000 rows are informally known as the "preview" state and it's the state you see in the application while a dataset is loading—either for the first time, or as an updated version for an existing dataset:
There may be unusual cases in which the predominant data type for a column changes after the first 1,000 rows. In this case, the column will remain cast using those first 1,000 rows. You can configure the number of rows in the preview by changing the value of the Rows to process for schema field in the options pane. Although the number of rows is configurable during import, Data Prep best practices recommend that you use Filtergrams to identify and address data quality issues. See Best Practices for details.
After a lookup or an append operation in an existing project, the predominant type for a column may change based on the data that comes into the column as a result of the operation. Because inferencing for column type occurs only during the import process, the column type, as it was originally cast, will remain despite the new predominant type. However, Data Prep best practices recommends that, as part of your standard data harmonization practices, you always use Filtergrams after blending your data from multiple sources to identify and address data quality issues.
How do I use Data Prep to locate and remediate data typing issues in my data?¶
Data Prep was built from the very beginning to identify and address such data quality issues. Typically, as soon as a dataset is imported into the library or appended in a project, the next recommended step is to harmonize the data type so that data quality is enhanced. Data harmonization is one of the key aspects of data preparation and Data Prep provides you with visual indicators and tools like Filtergrams for your harmonization exercise.
Following an append into an existing project, the predominant data type changes for this column from "numeric" to "string". The column was initially and correctly cast as numeric during the import process and subsequent use in this project. After the append operation, notice the column type remains string type, though the predominant type is now numeric. However, a visual indicator—numeric values are right-justified—allows you to quickly notice a typing disparity in the column:
When opening a Filtergram on this column, you can quickly determine the values in the column that are not "valid" for this type of column:
The red bar on the Filtergram indicates there are non-conforming data types. Click the Invalid link to view only the values that do not conforn.
After you've filtered the display to identify the "invalid" data types—all types other than the predominant type—you can create a lens in your project to generate an AnswerSet that lists only those non-conforming values. You can then use the AnswerSet to assist in your remediation process for those values. If after reviewing the "invalid" types, you want to convert the column type to another data type, you can do this using the column menu: