Skip to main content

SFMC SQL Aggregate Functions

Group rows and apply aggregate functions to see the big picture in your data.

GROUP BY

GROUP BY allows you to create aggregation by grouping the input data on selected column(s).

It limits the data you can SELECT to either GROUP BY column(s) or aggregate functions.

Let's say you want to check which email domains are most popular in your database.

SELECT
Domain
, COUNT(*) AS SubscriberCount
FROM _Subscribers
GROUP BY Domain

With this simple query, you have all subscribers grouped by their domain with the number of matching rows thanks to the COUNT function.

You can quickly improve this query by limiting the records with TOP and sorting ORDER BY to leverage Pareto's principle and pick 20% of the records responsible for 80% of the results. In this case - 20% domains covering approximately 80% subscribers.

SELECT TOP 20 PERCENT
Domain
, COUNT(*) AS SubscriberCount
FROM _Subscribers
GROUP BY Domain
ORDER BY COUNT(*) DESC

With such data, you can assess your mailing list and decide, for example, whether using AMP for Email will be worth the effort for your business.

It is, however, just the beginning of what you can do with GROUP BY. The real magic happens with multiple columns used for the grouping.

Let's check how the domain numbers are changing in years with the help of the DATEPART function and DateJoined field.

SELECT TOP 20 PERCENT
Domain
, DATEPART(YEAR, DateJoined) AS YearJoined
, COUNT(*) AS SubscriberCount
FROM _Subscribers
GROUP BY
Domain
, DATEPART(YEAR, DateJoined)
ORDER BY COUNT(*) DESC

As you can see, by just adding more columns after a comma (order is irrelevant), we can make even more exciting groupings. In this case, we will have a separate row for each domain and year. Easy to see how many Subscribers using Gmail joined your database in 2021 vs 2020.

You Should Know

GROUP BY will group NULL values as if they are equal.

GROUP BY ROLLUP

ROLLUP takes the multi-column grouping to the next level by showing totals and subtotals in the outcomes. Not useful for campaign segmentation, but excellent for quick analysis and reporting.

To use it - add the ROLLUP keyword and wrap grouping columns in parentheses.

SELECT TOP 20 PERCENT
Domain
, DATEPART(YEAR, DateJoined) AS YearJoined
, COUNT(*) AS SubscriberCount
FROM _Subscribers
GROUP BY ROLLUP (
Domain
, DATEPART(YEAR, DateJoined)
)
ORDER BY COUNT(*) DESC

In the outcomes, you will see some rows with NULL as columns - those are rows for total and subtotals.

Notice that the NULLs are either in all columns (total) or the YearJoined column (subtotals). There are no rows with NULL only in Domain. It is because the ROLLUP rolls columns provided in parentheses from right to left when calculating subtotals.

Order of columns will have a massive impact on outcomes!

You Should Know

You can also perform partial rollup by keeping some columns outside of the ROLLUP operator. In the example below, I excluded Domain from the ROLLUP resulting in the lack of total in the outcomes.

SELECT TOP 20 PERCENT
Domain
, DATEPART(YEAR, DateJoined) AS YearJoined
, COUNT(*) AS SubscriberCount
FROM _Subscribers
GROUP BY
Domain
, ROLLUP(DATEPART(YEAR, DateJoined))
ORDER BY COUNT(*) DESC

GROUP BY CUBE

If you thought that right-to-left execution of the ROLLUP is not helping your use case and instead you would rather have all possible groupings - CUBE is here to help.

SELECT TOP 20 PERCENT
Domain
, DATEPART(YEAR, DateJoined) AS YearJoined
, COUNT(*) AS SubscriberCount
FROM _Subscribers
GROUP BY CUBE (
Domain
, DATEPART(YEAR, DateJoined)
)
ORDER BY COUNT(*) DESC

It rolls data like ROLLUP, but for all combinations of columns. Notice in the outcomes the second and third rows with NULLs in first column and count for the whole year.

COUNT

COUNT function allows you to count the number of rows that matches a query. It is excellent for both ad hoc data exploration and creating administrative queries that help maintain your instance. It's also a must-have for GROUP BY.

Let's get the count of our Subscribers in Master Data Extension.

Count all rows in the _Subscribers Data View
SELECT COUNT(*)
FROM ContactMasterDE

Nice, but it does not give us any information that we couldn't find more comfortable in the User Interface. So, where is the added value?

For example, in seeing whether you have potential duplicate contacts. Instead of looking for all rows with COUNT(*), we will change the function argument to the specific column to count only its non-NULL values.

This query will show you which Email Addresses exists on more than one subscriber
SELECT
EmailAddress
, COUNT(EmailAddress) AS EmailCount
FROM ContactMasterDE
GROUP BY EmailAddress
HAVING COUNT(EmailAddress) > 1

With GROUP BY and HAVING above, we can simplify the output by showing only the duplicate values with corresponding counts.

You Should Know

You can take it to the next level with the subquery approach:

Returns Subscriber Key, Email Address and Duplicate Count of your database
SELECT
subscribers.SubscriberKey
, subscribers.EmailAddress
, counted.EmailCount
FROM ContactMasterDE AS subscribers
INNER JOIN (
SELECT
contacts.EmailAddress AS EmailAddress
, COUNT(contacts.EmailAddress) AS EmailCount
FROM ContactMasterDE AS contacts
GROUP BY contacts.EmailAddress
HAVING COUNT(contacts.EmailAddress) > 1
) AS counted
ON counted.EmailAddress = subscribers.EmailAddress

This extended query will not only tell you that there are potential duplicates. It will also give you Subscriber Keys of duplicate contacts so that you can investigate the reason and clean up records after fixing the underlying issue.

Pack it into Automation with Verification Activity on top and keep your database clean with ease.

COUNT *, ALL and DISTINCT

There are three approaches to using COUNT that differ slightly from each other.

  • COUNT(*) will return the number of items, including NULL values and duplicates.
  • COUNT(FieldName) / COUNT(ALL FieldName) will return the number on non-NULL values including duplicates.
  • COUNT(DISTINCT FieldName) will return the number of non-NULL and non-duplicate values.
Let's find out whether we have multiple Subscribers with the same Email Address
SELECT
COUNT(EmailAddress) AS EmailCount
, COUNT(DISTINCT EmailAddress) AS UniqueEmailCount
FROM _Subscribers

HAVING

HAVING works nearly the same as WHERE and allows you to filter outcomes of your GROUP BY.

This query will show you which Email Addresses exists on more than one subscriber
SELECT
EmailAddress
, COUNT(EmailAddress) AS EmailCount
FROM _Subscribers
GROUP BY EmailAddress
HAVING COUNT(EmailAddress) > 1

You can leverage all filtering tricks from WHERE. In practice, I mostly use it along with aggregate functions like COUNT.

HAVING vs WHERE

I just wrote that HAVING and WHERE work similarly, and both filter the outcomes. What is the difference?

  1. WHERE filters individual data (rows before GROUP BY is applied)
  2. HAVING filters aggregated data (rows after GROUP BY is applied)
This query will show you which Email Addresses exists on more than one active subscriber
SELECT
EmailAddress
, COUNT(EmailAddress) AS EmailCount
FROM _Subscribers
WHERE Status = 'active'
GROUP BY EmailAddress
HAVING COUNT(EmailAddress) > 1

Thanks to using both WHERE and HAVING in this query, we will look only for duplicate active Subscribers. If we have two subscribers with the same email, but one is inactive, this query won't output it.

You Should Know

You can use both in the same query and even mix it with DISTINCT.

However, when building such layered filters, keep in mind the SQL's order of execution for clauses:

FROM > WHERE > GROUP BY > HAVING > DISTINCT > ORDER BY

It means that rows filtered with WHERE will not be taken into consideration for GROUP BY and that DISTINCT deduplication will be applied to results grouped and filtered by HAVING - not the source data.

MIN and MAX

MIN and MAX functions can help you find the minimum and maximum values within a group.

As they are aggregate functions, we can use them in both the SELECT part of the query and for filtering with WHERE or HAVING.

Let's find domains that are no longer getting new subscribers
SELECT TOP 10
Domain
, MIN(DATEDIFF(DAY, DateJoined, GETDATE())) AS YoungestSubscriberDaysOfService
, MAX(DATEDIFF(DAY, DateJoined, GETDATE())) AS OldestSubscriberDaysOfService
FROM _Subscribers
GROUP BY Domain
ORDER BY MIN(DATEDIFF(DAY, DateJoined, GETDATE())) DESC

Learn more about MIN and MAX in Numeric Functions documentation.

AVG

AVG function allows you to find the average of values within the group.

It ignores NULL values for the calculation and we can use it in SELECT, WHERE or HAVING part of the query.

Let's find domain that are bouncing for most users
SELECT TOP 10
Domain
, AVG(BounceCount) AS AverageBounceCount
FROM _Subscribers
GROUP BY Domain
HAVING AVG(BounceCount) > 0
ORDER BY AVG(BounceCount) DESC

Learn more about AVG in Numeric Functions documentation.

You Should Know

AVG will ignore rows with NULL, which can skew your results. COALESCE will be your best friend to make sense out of imperfect data.

SUM

SUM function allows you to find the sum of values within the group.

It ignores NULL values for the calculation and we can use it in SELECT, WHERE or HAVING part of the query.

Let's find product categories responsible for the highest income
SELECT TOP 3
ProductCategory
, COUNT(Orders) AS TotalOrders
, SUM(Price) AS TotalIncome
FROM Purchases
WHERE Status = 'Complete'
GROUP BY ProductCategory
HAVING SUM(Price) > 1000
ORDER BY SUM(Price) DESC