Group by aggregate functions¶
Data Prep aggregate functions let you combine sets of rows and compute a specific function on the referenced column.
These operations are called aggregate because they find matching rows in the dataset and then combine them into one row. A matching row is defined as one that, excluding the reference column, shares the same values in a column-by-column examination. The reference column is excluded from the column-by-column examination because its value are submitted to the aggregate function in order to produce the reference column value in the single-row result.
The following sections describe the aggregate functions available when you use Group By shaping operations:
Function | Purpose |
---|---|
array | Condenses duplicate rows into a single row of data and assembles the reference column data into a single, comma-separated string. |
average | Condenses duplicate rows into a single row of data and displays the average of the numbers in the reference column. |
count | Condenses duplicate rows into a single row of data and displays the number of duplicate rows in the reference column. |
count (numbers only) | Condenses duplicate rows into a single row of data and displays the number of duplicate rows but only counts numeric values. |
count distinct | Condenses duplicate rows into a single row of data and displays the number of unique values in the reference column. |
first | Condenses duplicate rows into a single row of data and displays the first value that appeared for the duplicate rows. |
last | Condenses duplicate rows into a single row of data and displays the last value that appeared for the duplicate rows. |
max | Condenses duplicate rows into a single row of data and displays the largest of the numbers in the reference column. |
min | Condenses duplicate rows into a single row of data and displays the smallest of the numbers in the reference column. |
median | Condenses duplicate rows into a single row of data and displays the median of the numbers in the reference column. |
stdev | Condenses duplicate rows into a single row of data and displays the standard deviation of the numeric values in the reference column. |
stdevp | Condenses duplicate rows into a single row of data and displays the standard deviation for a population contained within the reference column. |
sum | Condenses duplicate rows into a single row of data and displays the sum of the numbers in the reference column. |
var | Condenses duplicate rows into a single row of data and displays the variance of the numeric values in the reference column. |
varp | Condenses duplicate rows into a single row of data and displays the variance of a population contained within the reference column. |
array¶
Of the available aggregate functions available, array is unique in that it operates on both text and numeric values. Rather than perform a mathematical operation on collapsed rows, all values in the reference column (the column to which array is applied) are temporarily stored. As the unique single row is created, array assembles the reference column data from the set into a single, comma-separated string within the column.
Excluding the column to which the array is being applied, all rows are examined column-by-column in order to find identical rows. The header of the reference column receives a name change to become Array of
Examples
The small dataset below will be used to show how array operates.
Column A | Column B | Column C |
---|---|---|
1 | two | 5 |
1 | two | 6 |
two | two | 7 |
1 | two | 4 |
Example 1
Applying the array function to Column C reduces the row count from four to two. The value in the column Array of Column C shows the sum of of the Column C values in the duplicate rows that were collapsed during the operation.
Column A | Column B | Column C |
---|---|---|
1 | two | 5.0, 6.0, 4.0 |
two | two | 7.0 |
Example 2
Applying the array function to Column A results in no row count reduction since examination of Column B and Column C values in each row reveals that each is already unique. The value in the column Array of Column A therefore displays each row with its original value—however, numbers have been converted into text.
Column A | Column B | Column C |
---|---|---|
1.0 | two | 5 |
two | two | 7 |
1.0 | two | 6 |
1.0 | two | 4 |
average¶
The average aggregate function finds an average of the numbers in the reference column (the column to which it is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which average is being applied, all rows are examined column-by-column in order to find identical rows.
A mathematical average (also called an “arithmetic mean” or simply “mean”) is calculated by adding all of the numeric values in a set and then dividing the resulting sum by the number of items the set contained. Note that attempts to apply average to a text value in the reference column will result in a 0 for that row. The header of the reference column receives a name change to become Average of
Related aggregate functions are median and mode.
Examples
The small dataset below will be used to show how average operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.
Column A | Column B | Column C |
---|---|---|
1 | two | 5 |
1 | two | 6 |
two | two | 7 |
1 | two | 4 |
Example 1
Applying the average function to Column C reduces the row count from four to two. The value in the column Average of Column C shows the average of of the Column C values in the duplicate rows that were collapsed during the operation: (5 + 6 + 4) ÷ 3 = 5 while 7 ÷ 1 = 7.
Column A | Column B | Column C |
---|---|---|
1 | two | 5 |
two | two | 7 |
Example 2
Applying the average function to Column A results in no row count reduction since examination of Column B and Column C values in each row reveals that each is already unique. The value in the column Average of Column A therefore displays each row with its original numeric value (in this case, a number 1) since the average function found no duplicate rows that could be involved in a mathematical operation. The 0 replaces the text “two” because the average operation cannot be applied to text values.
Column A | Column B | Column C |
---|---|---|
1 | two | 5 |
1 | two | 6 |
1 | two | 4 |
0 | two | 7 |
count¶
The count aggregate function returns the number of duplicate rows in a dataset. Excluding the column to which it is being applied (the reference column), all rows are examined column-by-column in order to find duplicate rows. Those rows that contain duplicate data are collapsed into a single unique row. The reference column receives a name change to become Count of
Examples
The small dataset below will be used to show how count operates.
Column A | Column B | Column C |
---|---|---|
one | two | 5 |
one | two | 6 |
two | two | 7 |
one | two | 4 |
Example 1
Applying the count function to Column C reduces the row count from four to two. The value in the column Count of Column C shows the count of (number of times) duplicate rows that were collapsed during the operation.
Column A | Column B | Column C |
---|---|---|
one | two | 3 |
two | two | 1 |
Example 2
Applying the count function to Column A results in no row count reduction since examination of Column B and Column C values in each row reveals that each is already unique. The value in the column Count of Column A therefore shows a value of 1 for each of the four rows.
Column A | Column B | Column C |
---|---|---|
1 | two | 5 |
1 | two | 6 |
1 | two | 7 |
1 | two | 4 |
count (numbers only)¶
The count (numbers only) aggregate function operates exactly the same as the count function. However, count (numbers only) only counts numeric values and ignores text values during the counting process.
count distinct¶
The count distinct aggregate function returns the number of unique values in the column being counted, as opposed to the count function which counts all values.
first¶
The first aggregate function searches the data for duplicate rows (based on the columns chosen by the user) and condenses them into a single row of data. The first function will then display the first value within the data that appeared for the duplicate rows. Any values following the first value will be lost during this process.
Example
The small dataset below will be used to show how first operates.
Column A | Column B | Column C |
---|---|---|
one | two | 5 |
one | two | 6 |
two | two | 7 |
one | two | 4 |
Applying the first function to Column C reduces the row count from four to two. The value in the column First of Column C shows the first value from Column C of the duplicate rows that were collapsed during the operation.
Column A | Column B | Column C |
---|---|---|
one | two | 5 |
two | two | 7 |
last¶
The last aggregate function searches the data for duplicate rows (based on the columns chosen by the user) and condenses them into a single row of data. The last function will then display the last value within the data that appeared for the duplicate rows. Any values prior to the last value will be lost during this process.
Example
The small dataset below will be used to show how last operates.
Column A | Column B | Column C |
---|---|---|
one | two | 5 |
one | two | 6 |
two | two | 7 |
one | two | 4 |
Applying the Last function to Column C reduces the row count from four to two. The value in the column Last of Column C shows the last value from Column C of the duplicate rows that were collapsed during the operation.
Column A | Column B | Column C |
---|---|---|
one | two | 4 |
two | two | 7 |
max¶
The max aggregate function returns the largest of the numbers in the reference column (the column to which it is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which max is being applied, all rows are examined column-by-column in order to find identical rows.
The counterpart to this function is min (minimum).
Example
The small dataset below will be used to show how max operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.
Column A | Column B | Column C |
---|---|---|
1 | two | 5 |
1 | two | 6 |
two | two | 7 |
1 | two | 4 |
Applying the max function to Column C reduces the row count from four to two. The value in the column Max of Column C shows the maximum of of the Column C values in the duplicate rows that were collapsed during the operation.
In the dataset that is returned by the max function (shown below) number 6 in the first row resulted from the set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when Column A and Column B were examined. (Column C was excluded from this examination because it is the reference column.) In this set of three numbers, 6 is the largest—therefore it became the value shown in the reference column.
The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is both the minimum and maximum of the single-row set, the function returns a 7 for that row.
min¶
The min aggregate function returns the smallest of the numbers in the reference column (the column to which it is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which the min is being applied, all rows are examined column-by-column in order to find identical rows.
The counterpart to this function is max (maximum).
Example
The small dataset below will be used to show how min operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.
Column A | Column B | Column C |
---|---|---|
1 | two | 5 |
1 | two | 6 |
two | two | 7 |
1 | two | 4 |
Applying the min function to Column C reduces the row count from four to two. The value in the column Min of Column C shows the minimum of of the Column C values in the duplicate rows that were collapsed during the operation.
In the dataset that is returned by the min function (shown below) number 4 in the first row resulted from the set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when Column A and Column B were examined. (Column C was excluded from this examination because it is the reference column.) In this set of three numbers, 4 is the smallest—therefore it became the value shown in the reference column.
The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is both the minimum and maximum of the single-row set, the function returns a 7 for that row.
Column A | Column B | Column C |
---|---|---|
1 | two | 4 |
two | two | 7 |
median¶
The median aggregate function finds the median of the numbers in the reference column (the column to which median is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which median is being applied, all rows are examined column-by-column in order to find identical rows.
A median value is one in the middle of a range of numbers ordered from lowest to highest value. This means that half of the numbers are to the “right” of the value returned and half of the numbers are to the “left” of the value returned. Where there is an even set of numbers (i.e., no single number in the middle of the collection), the function calculates an average on the pair of numbers in the middle of the range (i.e., the two numbers on either side of the mid-point).
Note that attempts to apply median to a text value in the reference column will result in a error for that row. The header of the reference column receives a name change to become Median of
Related aggregate functions are average and mode.
Example
The small dataset below will be used to show how median operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.
Column A | Column B | Column C |
---|---|---|
one | two | 5 |
one | two | 6 |
two | two | 7 |
one | two | 4 |
Applying the median function to Column C reduces the row count from four to two. The value in the column Median of Column C shows the median of the Column C values in the duplicate rows that were collapsed during the operation.
In the dataset that is returned by the median function (shown below) number 5 in the first row resulted from the ordered set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when Column A and Column B were examined. (Column C was excluded from this examination because it is the reference column.) In this set of three numbers, 5 is the middle value with one number in the set on either side of it.
The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is in the middle of the set (there are zero numbers on either side of it) the function returns a 7 for that row.
Column A | Column B | Column C |
---|---|---|
one | two | 5 |
two | two | 7 |
stdev¶
The stdev (standard deviation) aggregate function computes the standard deviation—the amount of variation from the average—that exists within a sample set of data. This aggregate function calculates the standard deviation of the numeric values in the reference column (the column to which stdev is applied) among those rows that are identical.
All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which stdev is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the standard deviation calculation. The header of the reference column receives a name change to become Stdev of
If there are text values in the reference column, they are ignored within the stdev calculation. Note also that the aggregate stdev function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.
The standard deviation for data is the square root of its variance. If the set under analysis represents all data points (referred to as a “population”), use of stdevp is recommended for a more accurate result. A related function that deals with statistical variance is varp.
Example
The dataset below will be used to show how stdev operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.
Column A | Column B | Column C |
---|---|---|
one | two | 0.2 |
one | two | 0.1 |
one | two | 1.1 |
one | two | 0.2 |
one | two | 0.6 |
one | one | 0.2 |
one | one | 0.27 |
one | two | 0.2 |
one | two | 0.4 |
As shown in the table below, applying the stdev function to Column C reduces the row count from nine to two. The value in the column Stdev of Column C shows the standard deviation of the Column C sample data values in the duplicate rows that were collapsed during the operation.
Column A | Column B | Column C |
---|---|---|
one | two | 0.3511884584284246 |
one | one | 0.049497474683058325 |
stdevp¶
The stdevp (standard deviation for a population) aggregate function computes the standard deviation—the amount of variation from the average—that exists within an entire set of data (a population). This aggregate function calculates the standard deviation for a population using the numeric values in the reference column (the column to which stdevp is applied) among those rows that are identical.
All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which stdevp is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the standard deviation calculation for the population. The header of the reference column receives a name change to become StdevP of
If there are text values in the reference column, they are ignored within the stdevp calculation. Note also that the aggregate stdevp function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.
If the set under analysis represents only a sample of data, use of stdev is recommended for a more accurate result. Other related functions that deal with statistical variance are var and varp.
Example
The dataset below will be used to show how stdevp operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.
Column A | Column B | Column C |
---|---|---|
one | two | 0.2 |
one | two | 0.1 |
one | two | 1.1 |
one | two | 0.2 |
one | two | 0.6 |
one | one | 0.2 |
one | one | 0.27 |
one | two | 0.2 |
one | two | 0.4 |
As shown in the table below, applying the stdevp function to Column C reduces the row count from nine to two. The value in the column StdevP of Column C shows the standard deviation of the population contained in Column C among the duplicate rows that were collapsed during the operation.
Column A | Column B | Column C |
---|---|---|
one | two | 0.32513733362117264 |
one | one | 0.034999999999999996 |
sum¶
The sum aggregate function executes an addition operation on the numbers in the reference column (the column to which it is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which sum is being applied, all rows are examined column-by-column in order to find identical rows. Note that attempts to apply sum to a text value in the reference column will result in a 0 for that row. The header of the reference column receives a name change to become Sum of
Examples
The small dataset below will be used to show how sum operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.
Column A | Column B | Column C |
---|---|---|
1 | two | 5 |
1 | two | 6 |
two | two | 7 |
1 | two | 4 |
Example 1
Applying the sum function to Column C reduces the row count from four to two. The value in the column Sum of Column C shows the sum of the Column C values in the duplicate rows that were collapsed during the operation: 5 + 6 + 4 = 15 while 7 + 0 = 7.
Column A | Column B | Column C |
---|---|---|
1 | two | 15 |
two | two | 7 |
Example 2
Applying the sum function to Column A results in no row count reduction since examination of Column B and Column C values in each row reveals that each is already unique. The value in the column Sum of Column A therefore displays each row with its original numeric value (in this case, a number 1) since the sum function found no duplicate rows that could be involved in an adding operation. The 0 replaces the text “two” because the sum operation cannot be applied to text values.
Column A | Column B | Column C |
---|---|---|
1 | two | 5 |
1 | two | 6 |
1 | two | 4 |
0 | two | 7 |
var¶
The var (variance) aggregate function estimates dispersion—how much the values are spread out)—within a sample set of data. This aggregate function calculates the variance of the numeric values in the reference column (the column to which it is applied) among those rows that are identical.
All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which var is being applied) before they are collapsed into un ique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the variance calculation. The header of the reference column receives a name change to become Var of
If there are text values in the reference column, they will be ignored within the var calculation. Note also that the aggregate var function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.
If the set under analysis represents all data points (referred to as a “population”) use of varp is recommended for a more accurate result. Related functions that deal with statistical variance are stdev and stdevp.
Example
The dataset below will be used to show how var operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.
Column A | Column B | Column C |
---|---|---|
one | two | 0.2 |
one | two | 0.1 |
one | two | 1.1 |
one | two | 0.2 |
one | two | 0.6 |
one | one | 0.2 |
one | one | 0.27 |
one | two | 0.2 |
one | two | 0.4 |
As shown in the table below, applying the var function to Column C reduces the row count from nine to two. The value in the column Var of Column C shows the variance of the Column C sample data values in the duplicate rows that were collapsed during the operation.
Column A | Column B | Column C |
---|---|---|
one | two | 0.12333333333333334 |
one | one | 0.00245 |
varp¶
The varp (variance for a population) aggregate function computes dispersion—how much the values are spread out—for an entire set of data (a population). This aggregate function calculates the variance for a population contained within the reference column (the column to which it is applied) among those rows that are identical.
All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which varp is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the variance calculation for the population. The header of the reference column receives a name change to become VarP of
If there are text values in the reference column, they will be ignored within the varp calculation. Note also that the aggregate varp function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.
If the set under analysis represents only a sample set of data, use of var is recommended for a more accurate result. Related functions that deal with statistical variance are stdev and stdevp.
Example
The dataset below will be used to show how varp operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.
Column A | Column B | Column C |
---|---|---|
one | two | 0.2 |
one | two | 0.1 |
one | two | 1.1 |
one | two | 0.2 |
one | two | 0.6 |
one | one | 0.2 |
one | one | 0.27 |
one | two | 0.2 |
one | two | 0.4 |
As shown in the table below, applying the varp function to Column C reduces the row count from nine to two. The value in the column VarP of Column C shows the variance of the population contained in Column C among the duplicate rows that were collapsed during the operation.
Column A | Column B | Column C |
---|---|---|
one | two | 0.10571428571428572 |
one | one | 0.001225 |