Skip to content

Click in-app to access the full platform documentation for your version of DataRobot.

Attach datasets

The Data Prep Attach tools lets you perform lookup, join, and append operations on your dataset.

Work with the Attach tools

To attach datasets, hover over the Attach tool in the project Tools bar and select the lookup, join, or append tools.

Work with the Lookup tool

The Lookup tool provides a lookup type operation, which is comparable to an MS Excel VLOOKUP. A reverse lookup is also supported.

After you select the Lookup tool, select a Lookup Source dataset from your data library. You then select the columns for the lookup operation. The Current column refers to your base dataset, and when you click in each column, the available columns from each dataset display.

Note

If you use click to select the green Detect Joins option, the Connections menu displays the columns that your two datasets share. Additionally, a percentage score provides guidance for how best to combine the datasets. See Lookup scoring calculations to learn how scores are calculated for Detect Joins operations.

After selecting the columns for the lookup, a preview of the data as it will be combined is displayed on the grid:

Next, click the Options tab to define:

  • The Lookup Type, which defines what to do with any unmatched rows.
  • The Matching Method algorithm to use for the lookup operation. Note: the Fuzzy Matching Method is only available for a Standard type lookup.

When you are satisfied with how your lookup is previewed on the grid, click the green Save button to complete the lookup operation.

Lookup scoring calculations

When Data Prep calculates lookup scores for the Detect Joins option, it considers two factors:

  • Selectivity: The extent to which each column in the lookup connection is populated with unique values
  • Overlap: The percentage of rows that match in the lookup connection

These two factors produce a percentage score to reflect the quality of the connection. The higher the percentage score, the better the connection. However, scores of less than 100% do not necessarily indicate a problem with the proposed combine operation. In fact, there are legitimate use cases where the percentage may be less than 1%.

Rather than being an absolute indicator of “right” or “wrong,” the percentage should serve as a sanity check that should match some level of expectation you have regarding the amount of data the two datasets have in common.

Following are examples of data that produce low and high scores:

Low Scores

Base (driving) dataset Lookup (source) dataset Explanation
a,b,c a,a,b,b,c,c The lookup has duplicate values.
a,b,c. . c,d,e,f The lookup has only one overlap value—'c'.

High Scores

Base (driving) dataset Lookup (source) dataset Explanation
a,b,c a,b,c,d,e,f,g All lookup values are unique.
a,a,b,b,c,c a,b,c All lookup values overlap.

Work with the Join tool

The Join tool supports the following join types:

  • Standard Join that combines all matches across both datasets—comparable to a SQL join.
  • Range Join that combines datasets based on a marker column that is matched against two columns representing a range in the joining dataset.
  • Cross Product Join that combines all rows from both datasets. Important: the Cross-Product join significantly increases the number of rows that will be added to your project because every row from the joining dataset is brought into the project for each row in the Base dataset.

Note

The Join tool must be enabled by your System Administrator in order for it to be available in the UI.

Standard Join

After you select the Join tool, select the Standard type, then select the columns for the join operation. The Current column refers to your base dataset. When you click in each column, the available columns from each dataset display. Begin by selecting the columns on which to create the join:

After selecting the columns for the join, a preview of the data as it will be combined is displayed on the grid:

Next, click the Options tab to define:

  • The Lookup Type, which defines what to do with any unmatched rows.
  • The Matching Method algorithm to use for the join operation.

When you are satisfied with how your lookup is previewed on the grid, click the green Save button to complete the lookup operation

Range Join

The Range Join feature allows you to join datasets together based on a "marker" column in your base dataset that is matched against two separate "range" columns in the adjoining dataset.

Note

The Range Join feature is only for column types numeric and date. You can always use the Change into operation in the column menu to convert columns into these types if the data in those columns can be validly converted.

Note

The Range Join feature must be enabled by your System Administrator in order for it to display as a join option in the UI.

After you select the Join tool, select the Range option, then select the columns for the join operation. The Current column refers to the marker column in your base dataset. It is used to match against the lower and upper range of values in the adjoining dataset:

There are two ways to utilize the Range Join feature:

  • Keyless: A keyless range join groups together entries that fall within a certain range of the marker value.
  • Keyed: Optional Keys are extra columns you can select from the base and adjoining dataset to more carefully increase the specificity of the join criteria.

After selecting the columns for the join, a preview of the data as it will be combined is displayed on the grid.

Example

Our base dataset is a lending club dataset that contains typical information about individual loans. To that base, we are joining the loan_riskfactors.csv dataset that contains averaged risk factor of loans based on grade and date. The goal of our project is to determine the risk factor of each loan in lending club base dataset with the industry averages provided by the risk factor dataset.

Here we are using the column "loan_date" as our marker column in the lending club base dataset and matched it against the "StartDate" and "EndDate" columns from the loan_riskfactors.csv dataset. Notice the grid displays a preview of the resulting join:

Tip

If the resulting dataset contains too many rows after the join, remember that you can use the Optional Keys to create a keyed join that increases the specificity of the join criteria, thereby reducing the number of resulting rows.

Next, click the Options tab to define:

  • The Lookup Type, which defines what to do with any unmatched rows.
  • The Matching Method algorithm to use for the join operation.

When you are satisfied with how your lookup is previewed on the grid, click the green Save button to complete the lookup operation.

Cross Product Join

After you select the Join tool, select the Cross Product type, then select the columns for the join operation. The Current column refers to your base dataset, and when you click in each column, the available columns from each dataset display. Begin by selecting the columns on which to create the join:

After selecting the columns for the join, a preview of the data as it will be combined is displayed on the grid:

Note

Unlike the other join types, there are no lookup or matching options for a Cross Product join because all rows will be matched in this operation.

When you are satisfied with how your lookup is previewed on the grid, click the green Save button to complete the lookup operation.

Note

The Cross-Product join significantly increases the number of rows that will be added to your project because every row from the joining dataset is brought into the project for each row in the Base dataset. If the number of rows exceeds your project's row limit, then the Join operation will fail with an error message. In this case, reduce the number of rows in your datasets before combining them or speak with your System Administrator regarding the project row limit.

Lookup Type options for Lookups and Joins

The Lookup Type options specify what to do with unmatched rows. There are four options: Left Lookup, Inner Lookup, Right Lookup, and Outer Lookup.

Note

When you mouse over the icons for each Lookup Type option, its associated type displays.

One of the easiest ways to understand the different Lookup Types is by using a Venn diagram:

Left Lookup

A Left Lookup returns a combined dataset containing all rows with matching values in the specified columns, as well as all rows from the left (Base) dataset that had no corresponding matching value in the right (Lookup) dataset.

Inner Lookup

An Inner Lookup returns a combined dataset containing only those rows which contain matching values within the specified columns:

Right Lookup

A Right Lookup is similar to a Left Lookup in that it returns a combined dataset containing all rows with matching values in the specified columns; however, rather than return the rows with non-matching values from the left (Base) dataset, it returns all non-matching rows from the right (Lookup) dataset instead.

Outer Lookup

An Outer Lookup returns a combined dataset containing all rows with matching values in the specified columns, as well as all rows from both datasets that contained no matching values:

Matching methods for Lookups and Joins

Select an Intellifusion™ matching method algorithm to customize how to match text values in columns that have a connection: Exact or Fuzzy. Each option is explained below.

Exact option

There are three choices for the Exact option: Automatic Match, Exact Match, and Custom Match.

Automatic Match

Automatic Match (the default option) ignores case, word order, and punctuation around words for matching text values. Generally speaking, space characters and punctuation define word boundaries, but there are important exceptions. This matching method ignores punctuation around words, so .ave. matches AVE. However, Automatic Match detects special kinds of words that have punctuation inside the word. Any punctuation inside a word does not define a word boundary, and the punctuation must match exactly in both datasets. Automatic Match detects the following words with inner punctuation:

  • Numbers with periods: Currency and floating point numbers such as 12.34. Currency symbols and punctuation around the number are ignored. Comparison is text comparison not numeric, so 3.0 and 3 do not match. Tip: To reduce issues, use the "Parse cell text into numbers" option during import or convert a column to numbers using the column drop-down menu.
  • Email addresses: Email addresses are one word including internal periods.
  • Acronyms with periods: Acronyms such as U.S.A. include inner punctuation and counts as one word. However, U.S.A. does not match USA because punctuation inside a word is not ignored when comparing words.

Example of Automatic Match

The following table shows the behavior of Automatic Match. The first two columns show example values. The third column shows if those values match with Automatic Match. For examples where Exact Match and Automatic Match return different results, the answers appears in bold.

Exact Match

Use the Exact Match option if every character in the two values must exactly match.

Custom Match

Use the Custom Match option for making precise adjustments. This option allows you to selectively choose how to handle: word order, case, whitespace, and specific punctuation values. This matching method ignores or keeps word order and case depending on your selection. These options are commonly used for data containing names. Any combination of Ignore and Keep for word order and case can be selected. Click the Ignore and Keep buttons to see an example of how word order and case is handled based on your selections.

For Whitespace, specify whether to ignore, keep or split on when whitespace is encountered. The Split On button splits your data into separate terms on whitespace. Whitespace includes characters from the space bar, tab key and carriage returns. The Whitespace options are commonly used to increase matches for data containing address information. Click the Ignore, Keep and Split On buttons to see an example of how a whitespace is handled based on your selection.

Punctuation options allow you to ignore, keep or split on for specific punctuation. The default punctuation values are: comma and hyphen. Click the white Other button to add to the default values. A new field appears for adding more punctuation. To delete the defaults or any punctuation you have added, move your mouse over a punctuation field and click the orange "X" that appears in the top of t****hat field. If you delete all of the punctuation fields, the default is to keep all punctuation.

Settings that you specify for word order, case, whitespace, and punctuation apply only to the current Lookup data prep step. These settings do not apply to other Lookup steps.

Fuzzy option

The Fuzzy option is only available for a Standard Lookup and must be enabled by your System Administrator in order for it to display as an option in the UI.

The Fuzzy option employs an edit distance algorithm to predict the probability of likely matches between two join keys that you select.

Example of a Fuzzy option

In this example, you have a base dataset with company names and you want to enrich that data with addresses for each company.

You have a second dataset, with address information, but the company names do not match closely enough to your base dataset.

In this case, you can use the Fuzzy option to quickly create a join for the company names without having to do any cleanup ahead of time on the "company name" column in either dataset:

Your base dataset has company names:

Your enrichment dataset with addresses has company names listed this way:

You tune the Fuzzy options as follows:

After you find the best threshold match, click Save to join the data.

Note that you can simultaneously create multiple join keys for the fuzzy option by adding more column names in the Connections panel:

However, both of your datasets must contain the exact same number of columns. Otherwise, you will receive an error and cannot proceed.

After the data is joined, you can apply additional options:

  • Use the Column tool to remove the Similarity Score column and one of the Company name columns.
  • Use a Filtergram on the remaining Company column to ensure the name formats are all correct. If you find names that you want to edit, use the Find and Replace option to correctly update those names.

Examples of Lookups and Joins

Example: Reverse Lookup

A Reverse Lookup operation is useful when you want to achieve a one-to-many relationship between your current dataset and an enrichment dataset. This Reverse operation uses your current base dataset as the Lookup table—meaning only the first match from the base dataset is brought into the selected enrichment dataset.

In this example, you have two datasets:

  • An Orders dataset with unique Order IDs

  • An Order Details dataset that has all of the ordering details associated with an Order ID

You create a new project with the Orders dataset as your Base dataset and you want to enrich it with data from Order Details based on the "Order ID" column in both datasets. In this case, you select the Lookup Reverse option, and then choose "Order_ID" as the column for the matching the datasets:

Note

To keep any unmatched (blank) rows containing the "Order_ID", choose an Outer or Left Lookup type. Alternatively, you can discard those rows by selecting an Inner Lookup type. Refer to Lookup Type, which defines what to do with any unmatched rows.

Important

A Reverse Lookup affects the number of rows that will be added to your base dataset and, possibly, the sort order.

Example: Join datasets

This selection combines all matches across both datasets. The Join operation is useful when you want to achieve a many-to-many relationship between two datasets. Note: this selection affects the number of rows that will be added to your project.

In this example, you have two datasets:

  • Transactions with a set of purchase transaction IDs, each row indicating a book that was purchased. Note that there may be multiple transactions IDs with the same book because that book may have sold multiple times.

  • Books with a set of books and their authors. Note that each book may have multiple authors, such that one book may be listed in multiple rows, once for each co-author.

Your objective is to determine the number of transactions per author. And because there are multiple authors for many of the books—instead of a single, unique value for author—the "Join" is the option you want to select for combining these datasets.

Create a new project with the Transactions dataset. Then do a standard Join operation with the Books dataset, and choose "Book" as the column for joining the datasets:

Note

Performing a Join can substantially increase the number of rows in the resulting dataset. If the number of rows exceeds your project's row limit, then the Join operation will fail with an error message. In this case, reduce the number of rows in your datasets before joining them or speak with your System Administrator regarding the project row limit.

Example: Cross Product Join to return all combinations

This selection combines all rows from both of the datasets in the operation.

In this example, you have three datasets:

  • Customer Master with all of your customers IDs.

  • Products Master with all of your products and associated IDs.

  • Customer Orders with the Customer IDs, and the associated Products and Quantity that were purchased.

Your objective is to create a master project that includes all customers and all products, and then determine all of the products that each customer has not purchased.

Create a new project with the Customer Master as your base dataset. Then do a Cross-Product join operation with the Products Master dataset. Choose "Customer_ID" and "Product_ID" as the columns for the cross-product operation.

You now have a master dataset that has all of your customers and all of your products:

Next, do a Standard Lookup operation with the Customer Orders dataset. Choose "Customer_ID" and "Product_ID" as the columns for the lookup, and be sure to keep all unmatched rows. Refer to Lookup Type, which defines what to do with any unmatched rows.

The resulting dataset enables you to easily identify all of the products that each customer has not yet purchased:

Each time you update the Customer Orders dataset for this project, you automatically see the data on the grid updated to reflect the new purchases. To capture all of the purchase information over time, you can create a project lens to publish an AnswerSet for each result.

Note

The Cross-Product operation significantly increases the number of rows that will be added to your project because every row from the Lookup source is brought into the project for each row in the Base dataset. If the number of rows exceeds your project's row limit, then the Join operation will fail with an error message. In this case, reduce the number of rows in your datasets before combining them or speak with your System Administrator regarding the project row limit.

Work with the Append tool

Using the Append tool, you can select an additional dataset that contains rows that you want to add to the end of your base dataset. You can customize column matching between the two datasets. If all columns from the appended dataset are matched to columns in the current dataset, the set of columns in your original dataset are unchanged in the result dataset. If you leave columns from the appended dataset unmatched, those columns become new columns in the result dataset.


Updated October 28, 2021
Back to top