Skip to content

Click in-app to access the full platform documentation for your version of DataRobot.

Shape data

Data Prep provides a Shape tool that lets you:

Work with the Shape tool

To access the shape tool, click shape in the project Tools bar:

Following is an overview of the elements in the Shape using pane:

Element Description
Shape tool Click shape to access the Shape using pane.
Shape using pane Select a shaping tool in the Shape using pane:

Deduplicate

The Deduplicate function searches your data for rows that match each other exactly, and condenses them into a single row, removing duplicate entries from your data.

In the Deduplicate pane, you have the option to add or remove columns from the Columns field. Any columns added to this field are included in the deduplication process. As you add columns, they appear in the Data Preview.

Note

Only those columns added to the Columns field are included in the deduplication process and resulting output; any columns not selected are not considered during the deduplication process and are removed from the data upon its completion. Once you are satisfied with the parameters of the deduplication function, click Save to finalize the process and commit it to your project.

You can also use the Fuzzy option with the Deduplicate function. When enabled, you see this option as an alternate selection to Exact match.

Fuzzy deduplication finds matching rows using a fuzzy algorithm. This means similar values that are not exact matches are grouped together and deduplicated, for example:

Similarly, fuzzy deduplication groups similar items even if they have blank values:

Group By

The Group By function lets you use a number of different aggregate functions on existing columns within the dataset. When Group By is selected, a pane appear above the Data Preview where you can specify which columns to include in the Group By process, which columns you want to perform an aggregate function on, which aggregate function to use, as well as specify the name of the new aggregate column being created.

The Data Preview displays your selections, highlighted in blue, so that you can see how it affects the data. Remember that only those columns included in the Columns (aggregates) field remain in your data following the Group By function. Those columns included are used when identifying duplicate rows for grouping.

For a list of available aggregate functions, see Group by aggregate functions.

These operations are called aggregate because they find matching rows in the dataset and then combine them into one row. A matching row is defined as one that, excluding the reference column, share the same values in a column-by-column examination. The reference column is excluded from the column-by-column examination because its value are submitted to the aggregate function in order to produce the reference column value in the single-row result.

Transpose

The Transpose functions lets you switch your rows and columns (imagine rotating your data 90 degrees).

The transpose function allows you to select any one column to create new column headers with. The values of the selected column become the new column headers, while all other column headers shift and become row headers (unless the user removes them during the transpose process). In this new transposition, if there is more than one value that matches a particular set of headings, Data Prep displays the last available value from the original dataset.

Here we have a simple sales dataset:

After clicking the shape tool, select Transpose at the top of the Shape using pane:

You can choose one column for the column header (the values of this chosen column become the new column headers) as well as any columns you would like to include as rows in the newly transposed data. Pay attention to the grid below the Shape using pane as it displays a preview of how the data will appear based on the options you have chosen for the transpose process.

Be aware, depending on which column you choose for your transpose, you may not end up with a complete picture of your data. Notice that in the example, there is more than one instance where the "Item Sold" value is "Baseball". If you were to transpose with the "Item Sold" as your new columns, Data Prep will only display the last value in the dataset, as shown here:

Once you are satisified with the options you have chosen for the transpose process, click Save in the upper right to finalize and commit the process to your project.

Pivot

The Pivot function is similar to the Transpose function in that it allows you to rearrange your column headers into a row header. The Pivot function also allows you to perform aggregate functions on a selected column to be displayed as the body of data within your pivot table. Unlike the Transpose function, with Pivot, you can choose multiple headings as column headers.

When you open the shape tool and select Pivot, the Pivot options display. You can choose which columns to use as your column headers, as well as which columns to include as rows in the pivot table to be created. Finally, you can choose which aggregate function to apply to which column to create the body of the pivot table. If you click the + button, you can add more than one aggregation to your pivot table.

The selected aggregate function is applied to the chosen column, and then displays the total based on the intersection of your chosen column and row headers. For a list of the available aggregate functions, see Group by aggregate functions.

Pay attention to the grid below the Shape using pane as it displays a preview of how the data will appear based on the options you have chosen for the pivot process.

Once you are satisified with the options you have chosen for the transpose process, click Save in the upper right to finalize and commit the process to your project.

Depivot

The Depivot function is a powerful function that lets you take your data and stack columns into two columns—one column containing the column headers that the original data came from and a second column containing the value from that column. For users who have never used a function like Depivot, it can be difficult to understand exactly what is happening during the depivot process.

In the Depivot window, you are given multiple options to configure.

  • Row Labels: Select which columns you want to remain static in your data. These columns will not be "stacked" during the depivot process.

  • Values: Choose which columns from your data will be included (or "stacked") during the depivot process.

  • Column Label: Designate a column name for the new column containing the column labels from the columns selected in the Values field.

  • Value Label: Designate a column name for the new column containing the values from the columns selected in the Values field.

In the previous example, the preview displays how the Depivot function takes the selected columns from the Values field and stacks them into the two new columns. To visualize exactly what is happening, see the following table:

The columns have been color-coded to see where specific values are coming from and moving to during the depivot process.

In this example, the LOCATION column has been selected as the only value for the Row Label. Due to the stacking of the column values during the depivot process, there are now duplicate rows for each location. This is to allow each value from the columns selected in the Values field to have its own row. You can see that by using the Depivot function, all of the sales numbers are in a single column. In this format, Group By functions can be useful and can provide insight into your data that may otherwise go unnoticed.


Updated October 28, 2021
Back to top