Normalize using clustering¶
You can use the Data Prep Cluster + Edit operation to quickly normalize column data. It is especially useful in spotting inconsistencies and errors in a column.
When you execute Cluster + Edit on a column in your dataset:
- Data Prep searches all column values and groups closely matching values together in a cluster.
- Each cluster is listed in the Cluster + Edit pane, along with its cluster size (the number of unique values in the cluster) and its row count (the number of times each unique value occurs in the column.)
Based on the clustered data, Data Prep suggests a single replacement value to normalize all of the values in the cluster. You can accept the suggestion or specify another value to use for normalizing the cluster.
The following is the Cluster + Edit pane with descriptions of its components:
|Column field||The column on which you want to perform the operation.|
|using||Use the dropdown menu to select an algorithm to use for the clustering operation. See Clustering algorithms for details.|
|with||Use the dropdown menu to select one of the available algorithms to use for the output option. The algorithm you choose determines the new value suggestion for the cluster. See Output algorithms for details.|
|Select||Click to select a cluster you want to update.|
|Size||Number of unique values in the cluster.|
|Row Count||Number of rows in the cluster.|
|Values in Cluster||Displays the value Data Prep suggests as a replacement for all values in the cluster. Data Prep's suggestion is based on the clustering algorithm you choose. To override Data Prep's suggestion, enter a different value. Click the 'X' next to a value to prevent it from being updated.|
|25 / 50 / 100||Select the number of clusters per page for bulk edits.|
|Select page||Lets you select clusters for bulk editing. The bulk edit operates on one page of clusters. Use the Page Size field to specify the number of clusters per page.|
|Cluster automatically||Performs the bulk edit on all rows in the dataset after you click Save.|
When to use Cluster + Edit¶
The following table describes common scenarios for using Cluster + Edit.
|Corrections and inconsistencies||Bulk correction of data entry mistakes, spelling errors, and use of different abbreviations or shorthand conventions.||Acme Co., Acme Company, Acme Comp.|
|Reclassification||Reclassifying detailed values into an aggregate value.||“12oz soda” and “8oz soda” both become “soda.”|
|Consolidation||Consolidating consistent, but different, values that emerge when the data from different systems is combined in a single column.||One data source refers to "soda" consistently and another data source refers to "tonic" consistently.|
Perform Cluster + Edit on a column¶
Locate the column you want to normalize.
Hover over the column menu icon and click Other > cluster + edit.
The Cluster + Edit pane opens.
In the using field, use the dropdown menu to select an algorithm to use for the clustering operation.
See Clustering algorithms for details.
In the with field, use the dropdown menu to select an algorithm to use for the output option. The algorithm you choose determines the new value suggestion for the cluster.
See Output algorithms for details.
To change the value Data Prep has suggested for the cluster, enter a term in the Use field.
To prevent Data Prep from replacing a value, click the X next to the value.
In the Select column, click the check box next to a cluster you want to update and click Save.
Continue making individual cluster edits.
Perform bulk editing on clusters¶
Use bulk editing to quickly normalize all clusters on a page.
Select 25, 50, or 100 clusters per page.
The bulk edit operation is restricted to one page of clusters.
To perform bulk editing, do one of the following:
- Click Select page to select all clusters on the page. Use this method if you want to review the suggested replacements and make edits before saving.
- Click Cluster automatically to select all of the clusters in the dataset. Use this method if you are sure you want to accept all of the suggested replacement values.
Click Save to update the clusters.
For each cluster, all values change to the suggested value.
Tools for working with clusters¶
The following tools provide visual queues to better recognize how the suggested value for a cluster was derived.
|Fixed-width font||By default, cluster values display in a variable-width font. Click this option to display cluster values in a fixed-width font. The fixed-width option aligns all text characters, which allows you to more easily identify extra spaces within a cluster value and differentiate characters across the clusters.|
|Highlight tools||Highlighting lets you recognize how the suggested cluster replacement value was derived. The Additions tool highlights the characters that will be added for all common characters. The Deletions tool indicates where deletions will be made in order to derive the common characters. Deletions are condensed into a red X. You can enable the Additions and Deletions tools simultaneously.|
The using clustering algorithm defines the values that should be grouped together.
For all clustering algorithms, blanks and nulls are not included when building a cluster.
The following algorithms are available in the application.
The metaphone algorithm, the default selection, groups words together based on their English language pronunciation. It is classified as a “phonetic” algorithm because it is based on how similar or different the text would sound if spoken. This algorithm is particularly useful when working with manually entered data (where misspellings may occur) and data appended from multiple source systems (where minor variations may occur).
The ngram algorithm breaks the data in the column into a specified number (n) of characters. These “chunks” (or grams) of text are then compared based on the probability of what might follow each. Ngram algorithms are often used by search engines—as a user enters characters into the search bar, the engine examines the probability of what form the final search terms might take and makes suggestions as the user types.
The fingerprint algorithm groups similar values into a cluster where the only differences are punctuation, word order, and capitalization. Fingerprint algorithms are often used to match names, for example, "Adèle Smith" and "SMITH, ADELE".
The with output option determines default replacement values for clusters. The output options attempt to make the best recommendation for the New Value replacement. The replacement value can always be manually edited to meet your specific business requirements.
The application provides the following algorithms.
Most Frequent Value¶
The Most Frequent Value output algorithm, the default selection, builds clusters using the value that occurs most frequently in the cluster.
All Common Words¶
The All Common Words output algorithm builds clusters using a string of matching words, starting at the beginning of the string, regardless of order. The frequency at which each string occurs then determines the New Value.
Apple Computer Corporation
Apple Computer Inc
Apple Corporation Computer
Apple Corp Computer
New Value: Apple Computer
The algorithm used to build your clusters affects the suggested value.
Because metaphone attempts to preserve the semantic meaning of the words in your clusters, you may notice that some of the suggested values do not strictly reflect all common words in your clusters. This may be the case when punctuation is included in your clusters.
The ngram algorithm must be used in order to include non-consecutive, common words in the cluster.
Consecutive Common Words¶
The Consecutive Common Words output algorithm builds clusters using the longest sequence of matching consecutive words, starting at the beginning of the string. Values that occur in less than 10% of the cluster are not included when determining the New Value recommendation. Note that most punctuation does not interrupt the sequencing for the match.
Apple Computer Corp
Apple Computer Corporation
Apple Computer Inc
New Value: Apple Computer