Skip to content

Click in-app to access the full platform documentation for your version of DataRobot.

Group by aggregate functions

Data Prep aggregate functions let you combine sets of rows and compute a specific function on the referenced column.

These operations are called aggregate because they find matching rows in the dataset and then combine them into one row. A matching row is defined as one that, excluding the reference column, shares the same values in a column-by-column examination. The reference column is excluded from the column-by-column examination because its value are submitted to the aggregate function in order to produce the reference column value in the single-row result.

The following sections describe the aggregate functions available when you use Group By shaping operations:

Function Purpose
array Condenses duplicate rows into a single row of data and assembles the reference column data into a single, comma-separated string.
average Condenses duplicate rows into a single row of data and displays the average of the numbers in the reference column.
count Condenses duplicate rows into a single row of data and displays the number of duplicate rows in the reference column.
count (numbers only) Condenses duplicate rows into a single row of data and displays the number of duplicate rows but only counts numeric values.
count distinct Condenses duplicate rows into a single row of data and displays the number of unique values in the reference column.
first Condenses duplicate rows into a single row of data and displays the first value that appeared for the duplicate rows.
last Condenses duplicate rows into a single row of data and displays the last value that appeared for the duplicate rows.
max Condenses duplicate rows into a single row of data and displays the largest of the numbers in the reference column.
min Condenses duplicate rows into a single row of data and displays the smallest of the numbers in the reference column.
median Condenses duplicate rows into a single row of data and displays the median of the numbers in the reference column.
mode Condenses duplicate rows into a single row of data and displays the mode of the numbers in the reference column.
stdev Condenses duplicate rows into a single row of data and displays the standard deviation of the numeric values in the reference column.
stdevp Condenses duplicate rows into a single row of data and displays the standard deviation for a population contained within the reference column.
sum Condenses duplicate rows into a single row of data and displays the sum of the numbers in the reference column.
var Condenses duplicate rows into a single row of data and displays the variance of the numeric values in the reference column.
varp Condenses duplicate rows into a single row of data and displays the variance of a population contained within the reference column.

array

Of the available aggregate functions available, array is unique in that it operates on both text and numeric values. Rather than perform a mathematical operation on collapsed rows, all values in the reference column (the column to which array is applied) are temporarily stored. As the unique single row is created, array assembles the reference column data from the set into a single, comma-separated string within the column.

Excluding the column to which the array is being applied, all rows are examined column-by-column in order to find identical rows. The header of the reference column receives a name change to become Array of .

Examples

The small dataset below will be used to show how array operates.

Column A Column B Column C
1 two 5
1 two 6
two two 7
1 two 4

Example 1

Applying the array function to Column C reduces the row count from four to two. The value in the column Array of Column C shows the sum of of the Column C values in the duplicate rows that were collapsed during the operation.

Column A Column B Column C
1 two 5.0, 6.0, 4.0
two two 7.0

Example 2

Applying the array function to Column A results in no row count reduction since examination of Column B and Column C values in each row reveals that each is already unique. The value in the column Array of Column A therefore displays each row with its original value—however, numbers have been converted into text.

Column A Column B Column C
1.0 two 5
two two 7
1.0 two 6
1.0 two 4

average

The average aggregate function finds an average of the numbers in the reference column (the column to which it is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which average is being applied, all rows are examined column-by-column in order to find identical rows.

A mathematical average (also called an “arithmetic mean” or simply “mean”) is calculated by adding all of the numeric values in a set and then dividing the resulting sum by the number of items the set contained. Note that attempts to apply average to a text value in the reference column will result in a 0 for that row. The header of the reference column receives a name change to become Average of .

Related aggregate functions are median and mode.

Examples

The small dataset below will be used to show how average operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A Column B Column C
1 two 5
1 two 6
two two 7
1 two 4

Example 1

Applying the average function to Column C reduces the row count from four to two. The value in the column Average of Column C shows the average of of the Column C values in the duplicate rows that were collapsed during the operation: (5 + 6 + 4) ÷ 3 = 5 while 7 ÷ 1 = 7.

Column A Column B Column C
1 two 5
two two 7

Example 2

Applying the average function to Column A results in no row count reduction since examination of Column B and Column C values in each row reveals that each is already unique. The value in the column Average of Column A therefore displays each row with its original numeric value (in this case, a number 1) since the average function found no duplicate rows that could be involved in a mathematical operation. The 0 replaces the text “two” because the average operation cannot be applied to text values.

Column A Column B Column C
1 two 5
1 two 6
1 two 4
0 two 7

count

The count aggregate function returns the number of duplicate rows in a dataset. Excluding the column to which it is being applied (the reference column), all rows are examined column-by-column in order to find duplicate rows. Those rows that contain duplicate data are collapsed into a single unique row. The reference column receives a name change to become Count of and the number that appears in the column indicates the number of duplicate rows that have been collapsed.

Examples

The small dataset below will be used to show how count operates.

Column A Column B Column C
one two 5
one two 6
two two 7
one two 4

Example 1

Applying the count function to Column C reduces the row count from four to two. The value in the column Count of Column C shows the count of (number of times) duplicate rows that were collapsed during the operation.

Column A Column B Column C
one two 3
two two 1

Example 2

Applying the count function to Column A results in no row count reduction since examination of Column B and Column C values in each row reveals that each is already unique. The value in the column Count of Column A therefore shows a value of 1 for each of the four rows.

Column A Column B Column C
1 two 5
1 two 6
1 two 7
1 two 4

count (numbers only)

The count (numbers only) aggregate function operates exactly the same as the count function. However, count (numbers only) only counts numeric values and ignores text values during the counting process.

count distinct

The count distinct aggregate function returns the number of unique values in the column being counted, as opposed to the count function which counts all values.

first

The first aggregate function searches the data for duplicate rows (based on the columns chosen by the user) and condenses them into a single row of data. The first function will then display the first value within the data that appeared for the duplicate rows. Any values following the first value will be lost during this process.

Example

The small dataset below will be used to show how first operates.

Column A Column B Column C
one two 5
one two 6
two two 7
one two 4

Applying the first function to Column C reduces the row count from four to two. The value in the column First of Column C shows the first value from Column C of the duplicate rows that were collapsed during the operation.

Column A Column B Column C
one two 5
two two 7

last

The last aggregate function searches the data for duplicate rows (based on the columns chosen by the user) and condenses them into a single row of data. The last function will then display the last value within the data that appeared for the duplicate rows. Any values prior to the last value will be lost during this process.

Example

The small dataset below will be used to show how last operates.

Column A Column B Column C
one two 5
one two 6
two two 7
one two 4

Applying the Last function to Column C reduces the row count from four to two. The value in the column Last of Column C shows the last value from Column C of the duplicate rows that were collapsed during the operation.

Column A Column B Column C
one two 4
two two 7

max

The max aggregate function returns the largest of the numbers in the reference column (the column to which it is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which max is being applied, all rows are examined column-by-column in order to find identical rows.

The counterpart to this function is min (minimum).

Example

The small dataset below will be used to show how max operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A Column B Column C
1 two 5
1 two 6
two two 7
1 two 4

Applying the max function to Column C reduces the row count from four to two. The value in the column Max of Column C shows the maximum of of the Column C values in the duplicate rows that were collapsed during the operation.

In the dataset that is returned by the max function (shown below) number 6 in the first row resulted from the set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when Column A and Column B were examined. (Column C was excluded from this examination because it is the reference column.) In this set of three numbers, 6 is the largest—therefore it became the value shown in the reference column.

The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is both the minimum and maximum of the single-row set, the function returns a 7 for that row.

min

The min aggregate function returns the smallest of the numbers in the reference column (the column to which it is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which the min is being applied, all rows are examined column-by-column in order to find identical rows.

The counterpart to this function is max (maximum).

Example

The small dataset below will be used to show how min operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A Column B Column C
1 two 5
1 two 6
two two 7
1 two 4

Applying the min function to Column C reduces the row count from four to two. The value in the column Min of Column C shows the minimum of of the Column C values in the duplicate rows that were collapsed during the operation.

In the dataset that is returned by the min function (shown below) number 4 in the first row resulted from the set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when Column A and Column B were examined. (Column C was excluded from this examination because it is the reference column.) In this set of three numbers, 4 is the smallest—therefore it became the value shown in the reference column.

The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is both the minimum and maximum of the single-row set, the function returns a 7 for that row.

Column A Column B Column C
1 two 4
two two 7

median

The median aggregate function finds the median of the numbers in the reference column (the column to which median is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which median is being applied, all rows are examined column-by-column in order to find identical rows.

A median value is one in the middle of a range of numbers ordered from lowest to highest value. This means that half of the numbers are to the “right” of the value returned and half of the numbers are to the “left” of the value returned. Where there is an even set of numbers (i.e., no single number in the middle of the collection), the function calculates an average on the pair of numbers in the middle of the range (i.e., the two numbers on either side of the mid-point).

Note that attempts to apply median to a text value in the reference column will result in a error for that row. The header of the reference column receives a name change to become Median of .

Related aggregate functions are average and mode.

Example

The small dataset below will be used to show how median operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A Column B Column C
one two 5
one two 6
two two 7
one two 4

Applying the median function to Column C reduces the row count from four to two. The value in the column Median of Column C shows the median of the Column C values in the duplicate rows that were collapsed during the operation.

In the dataset that is returned by the median function (shown below) number 5 in the first row resulted from the ordered set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when Column A and Column B were examined. (Column C was excluded from this examination because it is the reference column.) In this set of three numbers, 5 is the middle value with one number in the set on either side of it.

The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is in the middle of the set (there are zero numbers on either side of it) the function returns a 7 for that row.

Column A Column B Column C
one two 5
two two 7

mode

A mode is the value that occurs most frequently in a set of numbers. The mode aggregate function finds the most frequently occurring number in the reference column (the column to which mode is applied) among those rows that are identical. All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which mode is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of a set against which the mode function operates.

Although mode will still combine rows that have text in the reference column, it ignores text in finding the actual mode for the set. The header of the reference column receives a name change to become Mode of .

Important

Where there is a “tie” for the mode of a set (i.e., where multiple numbers have an equal number of occurrences and there is no other number that appears more frequently) the result of mode is unpredictable. Although one of the “tied” numbers will appear in the column Mode of for the appropriate row, it is not possible to determine which of the values will occur.

Related aggregate functions are average and median.

Example

The small dataset below will be used to show how mode operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A Column B Column C
one two 3
one two 6
two two 7
one two 3

Applying the mode function to Column C reduces the row count from four to two. The value in the column Mode of Column C shows the mode of the Column C values in the duplicate rows that were collapsed during the operation.

In the resulting dataset (shown below) number 3 in the first row resulted from the ordered set of numbers {3, 6, 3} where each member of the set was contributed by row that was duplicate and then collapsed into a single row. In this set of three numbers, 3 occurs most frequently (twice in the set of three numbers) and is therefore the mode for the set.

The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is the only number in the set, it is therefore also the most frequently occurring.

Column A Column B Column C
one two 3
two two 7

stdev

The stdev (standard deviation) aggregate function computes the standard deviation—the amount of variation from the average—that exists within a sample set of data. This aggregate function calculates the standard deviation of the numeric values in the reference column (the column to which stdev is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which stdev is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the standard deviation calculation. The header of the reference column receives a name change to become Stdev of .

If there are text values in the reference column, they are ignored within the stdev calculation. Note also that the aggregate stdev function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

The standard deviation for data is the square root of its variance. If the set under analysis represents all data points (referred to as a “population”), use of stdevp is recommended for a more accurate result. A related function that deals with statistical variance is varp.

Example

The dataset below will be used to show how stdev operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A Column B Column C
one two 0.2
one two 0.1
one two 1.1
one two 0.2
one two 0.6
one one 0.2
one one 0.27
one two 0.2
one two 0.4

As shown in the table below, applying the stdev function to Column C reduces the row count from nine to two. The value in the column Stdev of Column C shows the standard deviation of the Column C sample data values in the duplicate rows that were collapsed during the operation.

Column A Column B Column C
one two 0.3511884584284246
one one 0.049497474683058325

stdevp

The stdevp (standard deviation for a population) aggregate function computes the standard deviation—the amount of variation from the average—that exists within an entire set of data (a population). This aggregate function calculates the standard deviation for a population using the numeric values in the reference column (the column to which stdevp is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which stdevp is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the standard deviation calculation for the population. The header of the reference column receives a name change to become StdevP of .

If there are text values in the reference column, they are ignored within the stdevp calculation. Note also that the aggregate stdevp function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

If the set under analysis represents only a sample of data, use of stdev is recommended for a more accurate result. Other related functions that deal with statistical variance are var and varp.

Example

The dataset below will be used to show how stdevp operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A Column B Column C
one two 0.2
one two 0.1
one two 1.1
one two 0.2
one two 0.6
one one 0.2
one one 0.27
one two 0.2
one two 0.4

As shown in the table below, applying the stdevp function to Column C reduces the row count from nine to two. The value in the column StdevP of Column C shows the standard deviation of the population contained in Column C among the duplicate rows that were collapsed during the operation.

Column A Column B Column C
one two 0.32513733362117264
one one 0.034999999999999996

sum

The sum aggregate function executes an addition operation on the numbers in the reference column (the column to which it is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which sum is being applied, all rows are examined column-by-column in order to find identical rows. Note that attempts to apply sum to a text value in the reference column will result in a 0 for that row. The header of the reference column receives a name change to become Sum of .

Examples

The small dataset below will be used to show how sum operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A Column B Column C
1 two 5
1 two 6
two two 7
1 two 4

Example 1

Applying the sum function to Column C reduces the row count from four to two. The value in the column Sum of Column C shows the sum of the Column C values in the duplicate rows that were collapsed during the operation: 5 + 6 + 4 = 15 while 7 + 0 = 7.

Column A Column B Column C
1 two 15
two two 7

Example 2

Applying the sum function to Column A results in no row count reduction since examination of Column B and Column C values in each row reveals that each is already unique. The value in the column Sum of Column A therefore displays each row with its original numeric value (in this case, a number 1) since the sum function found no duplicate rows that could be involved in an adding operation. The 0 replaces the text “two” because the sum operation cannot be applied to text values.

Column A Column B Column C
1 two 5
1 two 6
1 two 4
0 two 7

var

The var (variance) aggregate function estimates dispersion—how much the values are spread out)—within a sample set of data. This aggregate function calculates the variance of the numeric values in the reference column (the column to which it is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which var is being applied) before they are collapsed into un ique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the variance calculation. The header of the reference column receives a name change to become Var of .

If there are text values in the reference column, they will be ignored within the var calculation. Note also that the aggregate var function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

If the set under analysis represents all data points (referred to as a “population”) use of varp is recommended for a more accurate result. Related functions that deal with statistical variance are stdev and stdevp.

Example

The dataset below will be used to show how var operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A Column B Column C
one two 0.2
one two 0.1
one two 1.1
one two 0.2
one two 0.6
one one 0.2
one one 0.27
one two 0.2
one two 0.4

As shown in the table below, applying the var function to Column C reduces the row count from nine to two. The value in the column Var of Column C shows the variance of the Column C sample data values in the duplicate rows that were collapsed during the operation.

Column A Column B Column C
one two 0.12333333333333334
one one 0.00245

varp

The varp (variance for a population) aggregate function computes dispersion—how much the values are spread out—for an entire set of data (a population). This aggregate function calculates the variance for a population contained within the reference column (the column to which it is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which varp is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the variance calculation for the population. The header of the reference column receives a name change to become VarP of .

If there are text values in the reference column, they will be ignored within the varp calculation. Note also that the aggregate varp function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

If the set under analysis represents only a sample set of data, use of var is recommended for a more accurate result. Related functions that deal with statistical variance are stdev and stdevp.

Example

The dataset below will be used to show how varp operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A Column B Column C
one two 0.2
one two 0.1
one two 1.1
one two 0.2
one two 0.6
one one 0.2
one one 0.27
one two 0.2
one two 0.4

As shown in the table below, applying the varp function to Column C reduces the row count from nine to two. The value in the column VarP of Column C shows the variance of the population contained in Column C among the duplicate rows that were collapsed during the operation.

Column A Column B Column C
one two 0.10571428571428572
one one 0.001225

Updated October 28, 2021
Back to top