SFMC SQL Aggregate Functions
Group rows and apply aggregate functions to see the big picture in your data.
GROUP BY
GROUP BY allows you to create aggregation by grouping the input data on selected column(s).
It limits the data you can SELECT to either GROUP BY column(s) or aggregate functions.
Let's say you want to check which email domains are most popular in your database.
- Query
- Outcome
SELECT
Domain
, COUNT(*) AS SubscriberCount
FROM _Subscribers
GROUP BY Domain
| Domain | SubscriberCount |
|---|---|
| mateuszdabrowski.pl | 1 |
| gmail.com | 1234 |
| yahoo.com | 987 |
| outlook.com | 1111 |
| ... | ... |
With this simple query, you have all subscribers grouped by their domain with the number of matching rows thanks to the COUNT function.
You can quickly improve this query by limiting the records with TOP and sorting ORDER BY to leverage Pareto's principle and pick 20% of the records responsible for 80% of the results. In this case - 20% domains covering approximately 80% subscribers.
- Query
- Outcome
SELECT TOP 20 PERCENT
Domain
, COUNT(*) AS SubscriberCount
FROM _Subscribers
GROUP BY Domain
ORDER BY COUNT(*) DESC
| Domain | SubscriberCount |
|---|---|
| gmail.com | 1234 |
| outlook.com | 1111 |
| yahoo.com | 987 |
With such data, you can assess your mailing list and decide, for example, whether using AMP for Email will be worth the effort for your business.
It is, however, just the beginning of what you can do with GROUP BY. The real magic happens with multiple columns used for the grouping.
Let's check how the domain numbers are changing in years with the help of the DATEPART function.
- Query
- Outcome
SELECT TOP 20 PERCENT
Domain
, DATEPART(YEAR, DateJoined) AS YearJoined
, COUNT(*) AS SubscriberCount
FROM _Subscribers
GROUP BY
Domain
, DATEPART(YEAR, DateJoined)
ORDER BY COUNT(*) DESC
| Domain | YearJoined | SubscriberCount |
|---|---|---|
| gmail.com | 2021 | 642 |
| outlook.com | 2021 | 578 |
| gmail.com | 2020 | 567 |
| outlook.com | 2020 | 456 |
| yahoo.com | 2021 | 432 |
| yahoo.com | 2020 | 345 |
As you can see, by just adding more columns after a comma (order is irrelevant), we can make even more exciting groupings. In this case, we will have a separate row for each domain and year. Easy to see how many Subscribers using Gmail joined your database in 2021 vs 2020.
GROUP BY will group NULL values as if they are equal.
GROUP BY ROLLUP
ROLLUP takes the multi-column grouping to the next level by showing totals and subtotals in the outcomes. Not useful for campaign segmentation, but excellent for quick analysis and reporting.
To use it - add the ROLLUP keyword and wrap grouping columns in parentheses.
- Query
- Outcome
SELECT TOP 20 PERCENT
Domain
, DATEPART(YEAR, DateJoined) AS YearJoined
, COUNT(*) AS SubscriberCount
FROM _Subscribers
GROUP BY ROLLUP (
Domain
, DATEPART(YEAR, DateJoined)
)
ORDER BY COUNT(*) DESC
| Domain | YearJoined | SubscriberCount |
|---|---|---|
| null | null | 12304 |
| gmail.com | null | 1234 |
| outlook.com | null | 1111 |
| yahoo.com | null | 987 |
| gmail.com | 2021 | 642 |
| outlook.com | 2021 | 578 |
| gmail.com | 2020 | 567 |
| outlook.com | 2020 | 456 |
| yahoo.com | 2021 | 432 |
| yahoo.com | 2020 | 345 |
In the outcomes, you will see some rows with NULL as columns - those are rows for total and subtotals.
Notice that the NULLs are either in all columns (total) or the YearJoined column (subtotals). There are no rows with NULL only in Domain. It is because the ROLLUP rolls columns provided in parentheses from right to left when calculating subtotals.
Order of columns will have a massive impact on outcomes!
You can also perform partial rollup by keeping some columns outside of the ROLLUP operator. In the example below, I excluded Domain from the ROLLUP resulting in the lack of total in the outcomes.
- Query
- Outcome
SELECT TOP 20 PERCENT
Domain
, DATEPART(YEAR, DateJoined) AS YearJoined
, COUNT(*) AS SubscriberCount
FROM _Subscribers
GROUP BY
Domain
, ROLLUP(DATEPART(YEAR, DateJoined))
ORDER BY COUNT(*) DESC
| Domain | YearJoined | SubscriberCount |
|---|---|---|
| gmail.com | null | 1234 |
| outlook.com | null | 1111 |
| yahoo.com | null | 987 |
| gmail.com | 2021 | 642 |
| outlook.com | 2021 | 578 |
| gmail.com | 2020 | 567 |
| outlook.com | 2020 | 456 |
| yahoo.com | 2021 | 432 |
| yahoo.com | 2020 | 345 |
GROUP BY CUBE
If you thought that right-to-left execution of the ROLLUP is not helping your use case and instead you would rather have all possible groupings - CUBE is here to help.
- Query
- Outcome
SELECT TOP 20 PERCENT
Domain
, DATEPART(YEAR, DateJoined) AS YearJoined
, COUNT(*) AS SubscriberCount
FROM _Subscribers
GROUP BY CUBE (
Domain
, DATEPART(YEAR, DateJoined)
)
ORDER BY COUNT(*) DESC
| Domain | YearJoined | SubscriberCount |
|---|---|---|
| null | null | 12304 |
| null | 2021 | 6201 |
| null | 2020 | 4321 |
| gmail.com | null | 1234 |
| outlook.com | null | 1111 |
| yahoo.com | null | 987 |
| gmail.com | 2021 | 642 |
| outlook.com | 2021 | 578 |
| gmail.com | 2020 | 567 |
| outlook.com | 2020 | 456 |
| yahoo.com | 2021 | 432 |
| yahoo.com | 2020 | 345 |
It rolls data like ROLLUP, but for all combinations of columns. Notice in the outcomes the second and third rows with NULLs in first column and count for the whole year.
COUNT
COUNT function allows you to count the number of rows that matches a query. It is excellent for both ad hoc data exploration and creating administrative queries that help maintain your instance. It's also a must-have for GROUP BY.
Let's get the count of our Subscribers in Master Data Extension.
SELECT COUNT(*)
FROM ContactMasterDE
Nice, but it does not give us any information that we couldn't find more comfortable in the User Interface. So, where is the added value?
For example, in seeing whether you have potential duplicate contacts. Instead of looking for all rows with COUNT(*), we will change the function argument to the specific column to count only its non-NULL values.
SELECT
EmailAddress
, COUNT(EmailAddress) AS EmailCount
FROM ContactMasterDE
GROUP BY EmailAddress
HAVING COUNT(EmailAddress) > 1
With GROUP BY and HAVING above, we can simplify the output by showing only the duplicate values with corresponding counts.
You can take it to the next level with the subquery approach:
SELECT
subscribers.SubscriberKey
, subscribers.EmailAddress
, counted.EmailCount
FROM ContactMasterDE AS subscribers
INNER JOIN (
SELECT
contacts.EmailAddress AS EmailAddress
, COUNT(contacts.EmailAddress) AS EmailCount
FROM ContactMasterDE AS contacts
GROUP BY contacts.EmailAddress
HAVING COUNT(contacts.EmailAddress) > 1
) AS counted
ON counted.EmailAddress = subscribers.EmailAddress
This extended query will not only tell you that there are potential duplicates. It will also give you Subscriber Keys of duplicate contacts so that you can investigate the reason and clean up records after fixing the underlying issue.
Pack it into Automation with Verification Activity on top and keep your database clean with ease.
COUNT *, ALL and DISTINCT
There are three approaches to using COUNT that differ slightly from each other.
COUNT(*)will return the number of items, includingNULLvalues and duplicates.COUNT(FieldName)/COUNT(ALL FieldName)will return the number on non-NULLvalues including duplicates.COUNT(DISTINCT FieldName)will return the number of non-NULLand non-duplicate values.
SELECT
COUNT(EmailAddress) AS EmailCount
, COUNT(DISTINCT EmailAddress) AS UniqueEmailCount
FROM _Subscribers
HAVING
HAVING works nearly the same as WHERE and allows you to filter outcomes of your GROUP BY.
SELECT
EmailAddress
, COUNT(EmailAddress) AS EmailCount
FROM _Subscribers
GROUP BY EmailAddress
HAVING COUNT(EmailAddress) > 1
You can leverage all filtering tricks from WHERE. In practice, I mostly use it along with aggregate functions like COUNT.
HAVING vs WHERE
I just wrote that HAVING and WHERE work similarly, and both filter the outcomes. What is the difference?
WHEREfilters individual data (rows beforeGROUP BYis applied)HAVINGfilters aggregated data (rows afterGROUP BYis applied)
SELECT
EmailAddress
, COUNT(EmailAddress) AS EmailCount
FROM _Subscribers
WHERE Status = 'active'
GROUP BY EmailAddress
HAVING COUNT(EmailAddress) > 1
Thanks to using both WHERE and HAVING in this query, we will look only for duplicate active Subscribers. If we have two subscribers with the same email, but one is inactive, this query won't output it.
You can use both in the same query and even mix it with DISTINCT.
However, when building such layered filters, keep in mind the SQL's order of execution for clauses:
FROM > WHERE > GROUP BY > HAVING > DISTINCT > ORDER BY
It means that rows filtered with WHERE will not be taken into consideration for GROUP BY and that DISTINCT deduplication will be applied to results grouped and filtered by HAVING - not the source data.
MIN and MAX
MIN and MAX functions can help you find the minimum and maximum values within a group.
As they are aggregate functions, we can use them in both the SELECT part of the query and for filtering with WHERE or HAVING.
- Query
- Outcome
SELECT TOP 10
Domain
, MIN(DATEDIFF(DAY, DateJoined, GETDATE())) AS YoungestSubscriberDaysOfService
, MAX(DATEDIFF(DAY, DateJoined, GETDATE())) AS OldestSubscriberDaysOfService
FROM _Subscribers
GROUP BY Domain
ORDER BY MIN(DATEDIFF(DAY, DateJoined, GETDATE())) DESC
| Domain | YoungestSubscriberDaysOfService | OldestSubscriberDaysOfService |
|---|---|---|
| mateuszdabrowski.pl | 784 | 784 |
| test.com | 530 | 743 |
| yahoo.com | 253 | 723 |
| ... | ... | ... |
Learn more about MIN and MAX in Numeric Functions documentation.
AVG
AVG function allows you to find the average of values within the group.
It ignores NULL values for the calculation and we can use it in SELECT, WHERE or HAVING part of the query.
- Query
- Outcome
SELECT TOP 10
Domain
, AVG(BounceCount) AS AverageBounceCount
FROM _Subscribers
GROUP BY Domain
HAVING AVG(BounceCount) > 0
ORDER BY AVG(BounceCount) DESC
| Domain | AverageBounceCount |
|---|---|
| test.com | 164 |
| yahoo.com | 2 |
| outlook.com | 1 |
| gmail.com | 1 |
| ... | ... |
Learn more about AVG in Numeric Functions documentation.
SUM
SUM function allows you to find the sum of values within the group.
It ignores NULL values for the calculation and we can use it in SELECT, WHERE or HAVING part of the query.
- Query
- Outcome
SELECT TOP 3
ProductCategory
, COUNT(Orders) AS TotalOrders
, SUM(Price) AS TotalIncome
FROM Purchases
WHERE Status = 'Complete'
GROUP BY ProductCategory
HAVING SUM(Price) > 1000
ORDER BY SUM(Price) DESC
| ProductCategory | TotalOrders | TotalIncome |
|---|---|---|
| Course | 12 | 5500 |
| Webinar | 76 | 5120 |
| Ebook | 136 | 3450 |