Word Cloud
Text variables often contain words that are highly indicative of the response. The Word Cloud insight displays up to 200 of the most impactful words and short phrases in word cloud format. Text coloration indicates the coefficient value for the word; the rendered size in the cloud indicates the frequency of the term's appearance in the data.
When viewing the Word Cloud, you can view individual word detail, filter the display, and export the insight.
Note
The model's Word Cloud is based on the data used to train that model, not on the entire dataset. For example, a model trained on a 64% sample produces a Word Cloud that reflects the same 64% of rows.
View word detail
Click a term displayed in the insight to view its details:
Detail | Description |
---|---|
Word | The selected word. Click again to de-select and clear the details. |
Coefficient | The correlation that the word has to the target, either positively or negatively, in the context of the specified parent feature. For example, in a diabetes dataset you might see the word insulin appear in several different text columns, potentially with a different coefficient in each one. |
Count | The number of rows in which the word appears in the data, both as a raw count and a percentage. |
Feature | The feature from the data in which the word was found (the parent feature). |
Filter the display
Use the filtering options to set the criteria words must match to be included in the results. Once you apply the filters, the Word Cloud refreshes to show only applicable words.
Filter | Description |
---|---|
Coefficient | Use the dropdown to set a range for the coefficient value of the words displayed. Additional entry boxes become available based on your selection (any, greater or less than, in or not in). |
Count | Use the dropdown to set a value criterion for the word count. Additional entry boxes become available based on your selection (any, greater or less than). |
Feature | Use the dropdown to choose a specific parent feature. Only words that appeared in that feature column will display. |
Include stop words | Check the box to include commonly used terms that are typically excluded from searches ("to", "of", "the", etc.). When unchecked, common terms are removed from the display. |
Clear filters individually or clear all to return to the original display.
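The filter criteria above can be pictured as simple predicates applied to each word record. The following is a minimal sketch only, not DataRobot's implementation: the `WordEntry` record, the `filter_words` helper, and the sample values are all hypothetical and exist purely to illustrate how the Coefficient, Count, Feature, and stop word filters combine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordEntry:
    """Hypothetical record for one word in the cloud (not a DataRobot type)."""
    word: str
    coefficient: float
    count: int
    feature: str
    is_stop_word: bool = False


def filter_words(
    entries: List[WordEntry],
    min_coefficient: Optional[float] = None,
    min_count: Optional[int] = None,
    feature: Optional[str] = None,
    include_stop_words: bool = False,
) -> List[WordEntry]:
    """Keep only the entries that satisfy every filter that is set."""
    kept = []
    for entry in entries:
        if not include_stop_words and entry.is_stop_word:
            continue
        if min_coefficient is not None and entry.coefficient < min_coefficient:
            continue
        if min_count is not None and entry.count < min_count:
            continue
        if feature is not None and entry.feature != feature:
            continue
        kept.append(entry)
    return kept


# Made-up sample data: the same word can appear under several parent features.
entries = [
    WordEntry("insulin", 0.82, 120, "diagnosis_notes"),
    WordEntry("the", 0.01, 900, "diagnosis_notes", is_stop_word=True),
    WordEntry("stable", -0.40, 45, "discharge_summary"),
    WordEntry("insulin", 0.35, 30, "discharge_summary"),
]

filtered = filter_words(entries, min_coefficient=0.3, feature="discharge_summary")
print([(e.word, e.coefficient) for e in filtered])
```

Leaving every argument at its default corresponds to the unfiltered display with stop words removed.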
Export
You can export the full Word Cloud as a CSV, PNG, or ZIP file. Note that applied filters are not reflected in the exported files; however, the removal of stop words is applied.
Text-based insight availability
If you expect to see one of these text insights and do not, view the Log tab for error messages to help understand why the models may be missing.
One common reason that text models are not built is that DataRobot removes single-character "words" during model building. It does this because such words are typically uninformative (e.g., "a" or "I"). A side effect of this removal is that single-digit numbers are also removed. In other words, DataRobot removes "1" or "2" as well as "a" or "I". This is common practice in text mining (for example, the scikit-learn TfidfVectorizer selects tokens of 2 or more alphanumeric characters by default).
This can be an issue if you have encoded words as numbers (which some organizations do to anonymize data). For example, if you use "1 2 3" instead of "john jacob schmidt" and "1 4 3" instead of "john jingleheimer schmidt," DataRobot removes the single digits; the texts become "" and "". DataRobot returns an error if it cannot find any words for features of type text (because they are all single digits).
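The effect can be reproduced with a short sketch. It assumes a tokenizer that keeps only runs of two or more word characters, which is the shape of scikit-learn's default `token_pattern` (`(?u)\b\w\w+\b`); DataRobot's actual tokenizer is not shown in this document and may differ in detail. The `tokenize` helper is hypothetical.

```python
import re

# Assumption: tokens are runs of 2+ word characters, mirroring the default
# scikit-learn token pattern. Single characters never match.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text: str) -> list:
    """Return the tokens that survive single-character removal."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("john jacob schmidt"))  # ['john', 'jacob', 'schmidt']
print(tokenize("1 2 3"))               # []
print(tokenize("I saw a 7 today"))     # ['saw', 'today']
```

With every row reduced to an empty token list, there is nothing left from which to build a text model.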
If you need a workaround to avoid the error, here are two solutions:
- Start numbering at 10 (e.g., "11 12 13" and "11 14 13").
- Add a single letter to each ID (e.g., "x1 x2 x3" and "x1 x4 x3").
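Either workaround can be applied as a preprocessing step before upload. The sketch below is illustrative only: `offset_ids` and `prefix_ids` are hypothetical helper names, and it assumes each text value is a space-separated list of integer IDs.

```python
def offset_ids(text: str, offset: int = 10) -> str:
    """Shift every numeric ID so that none is a single digit."""
    return " ".join(str(int(token) + offset) for token in text.split())


def prefix_ids(text: str, prefix: str = "x") -> str:
    """Prepend a letter to every ID so each token has at least two characters."""
    return " ".join(prefix + token for token in text.split())


print(offset_ids("1 2 3"))  # 11 12 13
print(offset_ids("1 4 3"))  # 11 14 13
print(prefix_ids("1 2 3"))  # x1 x2 x3
print(prefix_ids("1 4 3"))  # x1 x4 x3
```

Whichever transformation you choose, apply it consistently to both training and prediction data so that the same ID always maps to the same token.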