Skip to main content

SFMC SQL Aggregate Functions

Group rows and apply aggregate functions to see the big picture in your data.

GROUP BY#

GROUP BY allows you to create aggregation by grouping the input data on selected column(s).

It limits the data you can SELECT to either GROUP BY column(s) or aggregate functions.

Let's say you want to check which email domains are most popular in your database.

SELECT      Domain    , COUNT(*) AS SubscriberCountFROM _SubscribersGROUP BY Domain

With this simple query, you have all subscribers grouped by their domain with the number of matching rows thanks to the COUNT function.

You can quickly improve this query by limiting the records with TOP and sorting ORDER BY to leverage Pareto's principle and pick 20% of the records responsible for 80% of the results. In this case - 20% domains covering approximately 80% subscribers.

SELECT TOP 20 PERCENT      Domain    , COUNT(*) AS SubscriberCountFROM _SubscribersGROUP BY DomainORDER BY COUNT(*) DESC

With such data, you can assess your mailing list and decide, for example, whether using AMP for Email will be worth the effort for your business.

It is, however, just the beginning of what you can do with GROUP BY. The real magic happens with multiple columns used for the grouping.

Let's check how the domain numbers are changing in years with the help of the DATEPART function and DateJoined field.

SELECT TOP 20 PERCENT      Domain    , DATEPART(YEAR, DateJoined) AS YearJoined    , COUNT(*) AS SubscriberCountFROM _SubscribersGROUP BY      Domain    , DATEPART(YEAR, DateJoined)ORDER BY COUNT(*) DESC

As you can see, by just adding more columns after a comma (order is irrelevant), we can make even more exciting groupings. In this case, we will have a separate row for each domain and year. Easy to see how many Subscribers using Gmail joined your database in 2021 vs 2020.

GROUP BY ROLLUP#

ROLLUP takes the multi-column grouping to the next level by showing totals and subtotals in the outcomes. Not useful for campaign segmentation, but excellent for quick analysis and reporting.

To use it - add the ROLLUP keyword and wrap grouping columns in parentheses.

SELECT TOP 20 PERCENT      Domain    , DATEPART(YEAR, DateJoined) AS YearJoined    , COUNT(*) AS SubscriberCountFROM _SubscribersGROUP BY ROLLUP (          Domain        , DATEPART(YEAR, DateJoined)    )ORDER BY COUNT(*) DESC

In the outcomes, you will see some rows with NULL as columns - those are rows for total and subtotals.

Notice that the NULLs are either in all columns (total) or the YearJoined column (subtotals). There are no rows with NULL only in Domain. It is because the ROLLUP rolls columns provided in parentheses from right to left when calculating subtotals.

Order of columns will have a massive impact on outcomes!

You Should Know

You can also perform partial rollup by keeping some columns outside of the ROLLUP operator. In the example below, I excluded Domain from the ROLLUP resulting in the lack of total in the outcomes.

SELECT TOP 20 PERCENT      Domain    , DATEPART(YEAR, DateJoined) AS YearJoined    , COUNT(*) AS SubscriberCountFROM _SubscribersGROUP BY      Domain    , ROLLUP(DATEPART(YEAR, DateJoined))ORDER BY COUNT(*) DESCORDER BY COUNT(*) DESC

GROUP BY CUBE#

If you thought that right-to-left execution of the ROLLUP is not helping your use case and instead you would rather have all possible groupings - CUBE is here to help.

SELECT TOP 20 PERCENT      Domain    , DATEPART(YEAR, DateJoined) AS YearJoined    , COUNT(*) AS SubscriberCountFROM _SubscribersGROUP BY CUBE (          Domain        , DATEPART(YEAR, DateJoined)    )ORDER BY COUNT(*) DESC

It rolls data like ROLLUP, but for all combinations of columns. Notice in the outcomes the second and third rows with NULLs in first column and count for the whole year.

COUNT#

COUNT function allows you to count the number of rows that matches a query. It is excellent for both ad hoc data exploration and creating administrative queries that help maintain your instance. It's also a must-have for GROUP BY.

Let's get the count of our Subscribers.

Count all rows in the _Subscribers Data View
SELECT COUNT(*)FROM _Subscribers

Nice, but it does not give us any information that we couldn't find more comfortable in the User Interface. So, where is the added value?

For example, in seeing whether you have potential duplicate contacts. Instead of looking for all rows with COUNT(*), we will change the function parameter to the specific column to count only its values.

This query will show you which Email Addresses exists on more than one subscriber
SELECT      EmailAddress    , COUNT(EmailAddress) AS EmailCountFROM _SubscribersGROUP BY EmailAddressHAVING COUNT(EmailAddress) > 1

With GROUP BY and HAVING above, we can simplify the output by showing only the duplicate values with corresponding counts.

You Should Know

You can take it to the next level with the subquery approach:

Returns Subscriber Key, Email Address and Duplicate Count of your database
SELECT      subscribers.SubscriberKey    , subscribers.EmailAddress    , counted.EmailCountFROM _Subscribers AS subscribers    INNER JOIN (        SELECT              subscribers.EmailAddress        AS EmailAddress            , COUNT(subscribers.EmailAddress) AS EmailCount        FROM _Subscribers AS subscribers        GROUP BY subscribers.EmailAddress        HAVING COUNT(subscribers.EmailAddress) > 1    ) AS counted        ON counted.EmailAddress = subscribers.EmailAddress

This extended query will not only tell you that there are potential duplicates. It will also give you Subscriber Keys of duplicate contacts so that you can investigate the reason and clean up records after fixing the underlying issue.

Pack it into Automation with Verification Activity on top and keep your database clean with ease.

COUNT *, ALL and DISTINCT#

There are three approaches to using COUNT that differ slightly from each other.

  • COUNT(*) will return the number of items, including NULL values and duplicates.
  • COUNT(FieldName) / COUNT(ALL FieldName) will return the number on non-NULL values including duplicates.
  • COUNT(DISTINCT FieldName) will return the number of non-NULL and non-duplicate values.
Let's find out whether we have multiple Subscribers with the same Email Address
SELECT      COUNT(EmailAddress) AS EmailCount    , COUNT(DISTINCT EmailAddress) AS UniqueEmailCountFROM _Subscribers

HAVING#

HAVING works nearly the same as WHERE and allows you to filter outcomes of your GROUP BY.

This query will show you which Email Addresses exists on more than one subscriber
SELECT      EmailAddress    , COUNT(EmailAddress) AS EmailCountFROM _SubscribersGROUP BY EmailAddressHAVING COUNT(EmailAddress) > 1

You can leverage all filtering tricks from WHERE. In practice, I mostly use it along with aggregate functions like COUNT.

HAVING vs WHERE#

I just wrote that HAVING and WHERE work similarly, and both filter the outcomes. What is the difference?

  1. WHERE filters individual data (rows before GROUP BY is applied)
  2. HAVING filters aggregated data (rows after GROUP BY is applied)
This query will show you which Email Addresses exists on more than one active subscriber
SELECT      EmailAddress    , COUNT(EmailAddress) AS EmailCountFROM _SubscribersWHERE Status = 'active'GROUP BY EmailAddressHAVING COUNT(EmailAddress) > 1

Thanks to using both WHERE and HAVING in this query, we will look only for duplicate active Subscribers. If we have two subscribers with the same email, but one is inactive, this query won't output it.

You Should Know

You can use both in the same query and even mix it with DISTINCT.

However, when building such layered filters, keep in mind the SQL's order of execution for clauses:

FROM > WHERE > GROUP BY > HAVING > DISTINCT > ORDER BY

It means that rows filtered with WHERE will not be taken into consideration for GROUP BY and that DISTINCT deduplication will be applied to results grouped and filtered by HAVING - not the source data.

MIN and MAX#

MIN and MAX functions can help you find the minimum and maximum values within a group.

As they are aggregate functions, we can use them in both the SELECT part of the query and for filtering with WHERE or HAVING.

Let's find domains that are no longer getting new subscribers
SELECT TOP 10      Domain    , MIN(DATEDIFF(DAY, DateJoined, GETDATE())) AS YoungestSubscriberDaysOfService    , MAX(DATEDIFF(DAY, DateJoined, GETDATE())) AS OldestSubscriberDaysOfServiceFROM _SubscribersGROUP BY DomainORDER BY MIN(DATEDIFF(DAY, DateJoined, GETDATE())) DESC

Learn more about MIN and MAX in Numeric Functions documentation.

AVG#

AVG function allows you to find the average of values within the group.

It ignores NULL values for the calculation and we can use it in SELECT, WHERE or HAVING part of the query.

Let's find domain that are bouncing for most users
SELECT TOP 10      Domain    , AVG(BounceCount) AS AverageBounceCountFROM _SubscribersGROUP BY DomainHAVING AVG(BounceCount) > 0ORDER BY AVG(BounceCount) DESC

Learn more about AVG in Numeric Functions documentation.

SUM#

SUM function allows you to find the sum of values within the group.

It ignores NULL values for the calculation and we can use it in SELECT, WHERE or HAVING part of the query.

Let's find product categories responsible for the highest income
SELECT TOP 3      ProductCategory    , COUNT(Orders) AS TotalOrders    , SUM(Price) AS TotalIncomeFROM PurchasesWHERE Status = 'Complete'GROUP BY ProductCategoryHAVING SUM(Price) > 1000ORDER BY SUM(Price) DESC