Introduction

Windowing functions are SQL functions that are applied over a set of rows. They are typically used for analytical purposes, such as calculating running totals, moving averages, distributions and so on. They are very flexible and efficient, and generally easier to understand and write than traditional alternatives.


For example, a traditional approach to calculate a running total would look like this (using the Adventure Works 2012 data warehouse):

WITH CTE_basedata AS
(
	SELECT
		 [Month]		= YEAR([OrderDate])*100 + MONTH([OrderDate])
		,[SalesAmount]	= SUM(SalesAmount)
	FROM [dbo].[FactInternetSales]
	WHERE YEAR([OrderDate]) = 2008
	GROUP BY YEAR([OrderDate])*100 + MONTH([OrderDate])
)
SELECT
	 c1.[Month]
	,RunningTotal = SUM(c2.[SalesAmount])
FROM CTE_basedata c1
JOIN CTE_basedata c2 ON c2.[Month] <= c1.[Month]
GROUP BY c1.[Month]
ORDER BY [Month];

Using windowing functions, we can rewrite the query in a more concise format:

WITH CTE_basedata AS
(
	SELECT
		 [Month]		= YEAR([OrderDate])*100 + MONTH([OrderDate])
		,[SalesAmount]	= SUM(SalesAmount)
	FROM [dbo].[FactInternetSales]
	WHERE YEAR([OrderDate]) = 2008
	GROUP BY YEAR([OrderDate])*100 + MONTH([OrderDate])
)
SELECT
	 [Month]
	,RunningTotal = SUM([SalesAmount]) OVER (ORDER BY [Month] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM CTE_basedata;

The first implementation of windowing functions was added to the T-SQL language in SQL Server 2005. Ranking functions were fully supported, but window aggregate functions only received partial support. Full support for those window aggregates, as well as new offset and distribution functions, was added in SQL Server 2012.

This article will introduce windowing functions and their syntax, while making a distinction between what is available in SQL Server 2005 and what was added in SQL Server 2012. The second article will focus on optimizations and on practical use cases for windowing functions. Understanding windowing functions is crucial for Exam 70-461 – Querying Microsoft SQL Server 2012.

The OVER clause

The backbone of windowing functions is the OVER clause. It can take the following form:

windowing function
OVER 	(
	[PARTITION BY expression]
	[ORDER BY expression]
	[ROWS or RANGE clause]
	)

Depending on the type of windowing function, parts of the OVER clause are optional, mandatory or not allowed at all.

The PARTITION BY clause defines the segment of rows over which the window is defined; you can think of it as a sort of GROUP BY clause. The ORDER BY clause defines the order of the rows within the window. The ROWS or RANGE frame extent further limits the rows in the window to which the function is applied.

Let’s illustrate this using the following example:

WITH CTE_basedata AS
(
	SELECT
		 [Month]		= YEAR([OrderDate])*100 + MONTH([OrderDate])
		,[SalesTerritoryKey]
		,[SalesAmount]	= SUM(SalesAmount)
	FROM [dbo].[FactInternetSales]
	WHERE YEAR([OrderDate]) = 2008
	GROUP BY YEAR([OrderDate])*100 + MONTH([OrderDate]), [SalesTerritoryKey]
)
SELECT
	 [Month]
	,[SalesTerritoryKey]
	,SalesAmount
	,RunningTotal = SUM([SalesAmount]) OVER (PARTITION BY [SalesTerritoryKey] ORDER BY [Month] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM CTE_basedata
ORDER BY [SalesTerritoryKey],[Month]

Here we calculate a running total for each sales territory over the different months. The diagram explains how the window function is calculated for each row:

The ROWS or RANGE frame extent can take the following format:

ROWS/RANGE BETWEEN	  UNBOUNDED PRECEDING
			| <n> PRECEDING
			| <n> FOLLOWING
			| CURRENT ROW
		AND	  UNBOUNDED FOLLOWING
			| <n> PRECEDING
			| <n> FOLLOWING
			| CURRENT ROW

The difference between ROWS and RANGE is that the ROWS clause limits rows within a partition by specifying a fixed number of rows, while the RANGE clause limits rows logically by specifying a range of values with respect to the value in the current row. An example will make the concept clearer.

SELECT ID = 1, Category = 'A'
INTO #Temp
UNION ALL
SELECT ID = 2, Category = 'A'
UNION ALL
SELECT ID = 3, Category = 'B'
UNION ALL
SELECT ID = 4, Category = 'B'
UNION ALL
SELECT ID = 5, Category = 'B'
UNION ALL
SELECT ID = 6, Category = 'C'
UNION ALL
SELECT ID = 7, Category = 'C'
UNION ALL
SELECT ID = 8, Category = 'C';

SELECT
	 ID
	,Category
	,[RowCount] = COUNT(1) OVER (ORDER BY Category ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM #Temp;

This query with the ROWS frame extent calculates a running row count. When we use the RANGE frame extent, we get a drastically different result:

SELECT
	 ID
	,Category
	,[RowCount] = COUNT(1) OVER (ORDER BY Category RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM #Temp;

When using RANGE, the frame does not include only the current row, but also other rows with the same value as the current row. For category A – which is the first category – there are a total of two rows with the same category, so the count returns 2 instead of 1 for the first row.

PRECEDING and FOLLOWING cannot be used with RANGE. It’s important to know that when you specify an ORDER BY clause but you do not specify a frame extent, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is used as the default. This can have serious implications, which I describe in my blog post Beware the defaults! (in windowing functions). Not only can this default lead to unexpected results, as explained in the row count example, but it also can have a performance impact.
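As a quick sketch of that default in action, the following query specifies an ORDER BY but no frame extent. It returns exactly the same result as the earlier RANGE query, not the ROWS one:

```sql
-- No frame extent specified: the default
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies
SELECT
	 ID
	,Category
	,[RowCount] = COUNT(1) OVER (ORDER BY Category)
FROM #Temp;
```

Stating the frame extent explicitly whenever you use an ORDER BY clause avoids this surprise.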

Logically, windowing functions with their OVER clause are processed when the final result set is almost known. This means that windowing functions are available in the SELECT clause and in the ORDER BY clause, but nowhere else, in order to avoid ambiguity. For more information, please refer to Logical Query Processing Phases – Order of Statement Execution. Itzik Ben-Gan has made an excellent poster describing the subject. You can download it here.
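Because they are processed so late, windowing functions cannot appear in the WHERE clause. A common workaround, sketched here against the same temp table, is to compute the window function in a CTE and filter on it in the outer query:

```sql
WITH CTE_numbered AS
(
	SELECT
		 ID
		,Category
		,rn = ROW_NUMBER() OVER (PARTITION BY Category ORDER BY ID)
	FROM #Temp
)
-- Keep only the first row of each category
SELECT ID, Category
FROM CTE_numbered
WHERE rn = 1;
```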

SQL Server 2005
Ranking functions

Ranking functions were immediately fully supported with the release of SQL Server 2005 and today they are commonplace in the toolkits of many SQL developers.

ROW_NUMBER

This function is pretty straightforward: it computes a sequential row number starting at 1 within a window partition, following the ordering specified with the ORDER BY. Using the same temp table as in the previous paragraph, we can construct a row number like this:

SELECT
	 ID
	,Category
	,[RowNumber] = ROW_NUMBER() OVER (ORDER BY ID)
FROM #Temp;

If the window ordering is unique, the function is deterministic. This means that it can have only one predictable output. If the ordering is not unique, the function is not deterministic. Let’s sort our example according to category:

SELECT
	 ID
	,Category
	,[RowNumber] = ROW_NUMBER() OVER (ORDER BY Category DESC)
FROM #Temp
ORDER BY [RowNumber];

There is nothing that guarantees that the results are ordered like this. Another correct result for example would be if ID 1 and 2 were switched around. If you need a deterministic result, a tie-breaker must be added to the ORDER BY clause. In our case, this could be the ID column:

SELECT
	 ID
	,Category
	,[RowNumber] = ROW_NUMBER() OVER (ORDER BY Category DESC, ID)
FROM #Temp
ORDER BY [RowNumber];

NTILE

The NTILE function distributes the rows of the result set in roughly equal sized buckets. For example, if I have 100 rows and I specify NTILE(10), I will end up with 10 buckets with 10 rows each. If I have 100 rows and I specify NTILE(9), the first bucket will have 12 rows and the other 8 buckets will have 11 rows.

WITH CTE_Tally AS
(
	SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
	FROM sys.columns c1
	CROSS JOIN sys.columns c2
)
SELECT N, buckets = NTILE(9) OVER(ORDER BY N)
FROM CTE_Tally;

Each bucket is assigned a sequential ID, starting at 1. Just as with ROW_NUMBER(), determinism is only applicable if the ordering is unique.

RANK and DENSE_RANK

Both functions are similar to ROW_NUMBER(), but they don’t have to produce unique values. When the ordering is ascending, RANK calculates one more than the number of rows with an ordering value less than the current one in the partition. DENSE_RANK calculates one more than the number of distinct ordering values less than the current one in the partition. And of course, it’s the other way around if the ordering is descending.

In other words, RANK and DENSE_RANK can both have ties, but DENSE_RANK doesn’t have gaps in its sequence. Using the same temp table as before:

SELECT
	 ID
	,Category
	,[Rank]			= RANK() OVER (ORDER BY Category DESC)
	,[DenseRank]	= DENSE_RANK() OVER (ORDER BY Category DESC)
FROM #Temp;

By definition, both functions are deterministic, since ties get the same ranking. (The official documentation nevertheless labels both functions as nondeterministic.) If the ordering is unique, both functions return the same result as the ROW_NUMBER function.

Partial support for aggregate functions

SQL Server 2005 had only limited support for the regular aggregate functions used as window functions (such as SUM, AVG, MIN and MAX; the OVER clause is not allowed for GROUPING and GROUPING_ID): the partitioning clause could be used, but the ordering and framing options were absent.

This still made it easy to calculate subtotals or grand totals, for example. Using the AdventureWorks data warehouse again, the following query calculates the subtotal of sales amount per territory, the grand total of sales amount and the “percentage of whole” (subtotal divided by grand total):

WITH CTE_basedata AS
(
	SELECT
		 [Month]		= YEAR([OrderDate])*100 + MONTH([OrderDate])
		,[SalesTerritoryKey]
		,[SalesAmount]	= SUM(SalesAmount)
	FROM [dbo].[FactInternetSales]
	WHERE YEAR([OrderDate]) = 2008
	GROUP BY YEAR([OrderDate])*100 + MONTH([OrderDate]), [SalesTerritoryKey]
)
SELECT
	 [Month]
	,[SalesTerritoryKey]
	,SalesAmount
	,SubtotalByTerritory	= SUM([SalesAmount]) OVER (PARTITION BY [SalesTerritoryKey])
	,GrandTotal				= SUM([SalesAmount]) OVER ()
	,PercentageOfWhole		= 1.0 * SUM([SalesAmount]) OVER (PARTITION BY [SalesTerritoryKey]) / SUM([SalesAmount]) OVER ()
FROM CTE_basedata
ORDER BY [SalesTerritoryKey],[Month]

It’s easy to see that even the limited support for window aggregate functions already allows great analytical possibilities.

SQL Server 2012
Full support for aggregate functions

With the release of SQL Server 2012, support for the ORDER BY clause and the frame extent was added to the aggregate functions. However, you can still write aggregate functions exactly as in SQL Server 2005 and they will behave just the same.

Once you specify an ORDER BY clause, though, the frame extent becomes active (ranking functions are the exception, as they are unchanged since SQL Server 2005). If you specify an ORDER BY clause but no frame extent, the default of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is used, as mentioned earlier.
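As a sketch of what the new frame support enables, the following query reuses the CTE_basedata pattern from earlier to calculate a three-month moving average of the sales amount, something that was not possible with window aggregates in SQL Server 2005:

```sql
WITH CTE_basedata AS
(
	SELECT
		 [Month]		= YEAR([OrderDate])*100 + MONTH([OrderDate])
		,[SalesAmount]	= SUM(SalesAmount)
	FROM [dbo].[FactInternetSales]
	WHERE YEAR([OrderDate]) = 2008
	GROUP BY YEAR([OrderDate])*100 + MONTH([OrderDate])
)
SELECT
	 [Month]
	-- Frame covers the current month and the two before it
	,MovingAverage = AVG([SalesAmount]) OVER (ORDER BY [Month] ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM CTE_basedata;
```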

Distribution functions

SQL Server 2012 also introduced brand-new functions, such as the distribution functions. These functions deal with the distribution of data and are mostly used for statistical analysis.

PERCENT_RANK and CUME_DIST

The first two functions are PERCENT_RANK and CUME_DIST, also called rank distribution functions. They are calculated as follows:

  • rk = the rank of the row in the partition
  • nr = the number of rows in the partition
  • np = the number of rows that precede or are the same as the current row

PERCENT_RANK = (rk - 1) / (nr - 1)

CUME_DIST = np / nr

In other words, percent rank is the fraction of rows (out of the nr - 1 other rows) that have a lower rank than the current row, and cumulative distribution is the fraction of rows with a rank lower than or equal to the current rank.

Let’s illustrate this with our earlier category data:

SELECT
	 ID
	,Category
	,PercentRank	= PERCENT_RANK() OVER (ORDER BY Category)
	,CumeDist	= CUME_DIST() OVER (ORDER BY Category)
FROM #Temp;

For category A: rk = 1; np = 2; nr = 8 (no partition was specified). This results in PercentRank = 0 and CumeDist = 2/8 = 0.25.

For category B: rk = 3; np = 5; nr = 8. PercentRank = 2/7 and CumeDist = 5/8.

For category C: rk = 6; np = 8; nr = 8. PercentRank = 5/7 and CumeDist = 8/8 = 1.

PERCENTILE_DISC and PERCENTILE_CONT

The other two functions, PERCENTILE_DISC and PERCENTILE_CONT, both calculate a percentile. They are also called inverse distribution functions, because they accept a parameter as input and return a value from the partition (or an interpolation of values) as output. The best-known percentile is the fiftieth percentile, also known as the median.

The difference between PERCENTILE_DISC and PERCENTILE_CONT is that the former follows a discrete distribution model and will always return an actual value from the partition. More specifically, it returns the value with the smallest CUME_DIST that is greater than or equal to the specified percentile P.

PERCENTILE_CONT is a continuous distribution model and will return the exact percentile value, which is an interpolation. This means that the value returned might not actually be present in the column. The easiest way to understand this is how you calculate the median when you have an even number of values: the median is the average of the two middle values.

The two inverse distribution functions differ from all other windowing functions in that they require a WITHIN GROUP (ORDER BY expression) clause: in contrast with the other windowing functions, the sorting is specified there and not in the OVER clause. The OVER clause, in turn, can only contain an optional PARTITION BY clause, and only numeric data types are supported for the sorting. None of the distribution functions allow the ROWS/RANGE clause, so it is not possible, for example, to calculate a moving median with the inverse distribution functions.

An example using the same data set:

SELECT
	 ID
	,Category
	,PercDisc	= PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY ID) OVER ()
	,PercCont	= PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY ID) OVER ()
FROM #Temp;

Offset functions

The last set of functions introduced in SQL Server 2012 are the offset functions, which allow you to retrieve rows relative to a certain position in the partition window.

LAG and LEAD

The first two functions have an offset that is relative to the current row. LAG accesses data from a previous row, while LEAD accesses data from a subsequent row. The syntax is like this:

LAG/LEAD (expression [, offset [, default]]) OVER ([partition by clause] ORDER BY clause)

The ORDER BY is of course mandatory, to give meaning to previous and subsequent rows within the current partition. The default offset is 1. When the edge of a partition is reached and there is no previous or subsequent row, NULL is returned, unless the third parameter supplies a default value. Framing is not supported for these two functions.

An example:

SELECT
	 ID
	,Category
	,Previous	= LAG(Category,1) OVER (ORDER BY ID)
	,[Next]		= LEAD(Category,1) OVER (ORDER BY ID)
FROM #Temp;

These two functions are really powerful and important, as they can replace many T-SQL solutions that rely on the inefficient cursor structure and thus enable you to avoid looping.
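For instance, a month-over-month difference, a classic cursor or self-join scenario, reduces to a single LAG call. The sketch below reuses the CTE_basedata pattern from earlier; the third parameter supplies 0 as the default for the first month:

```sql
WITH CTE_basedata AS
(
	SELECT
		 [Month]		= YEAR([OrderDate])*100 + MONTH([OrderDate])
		,[SalesAmount]	= SUM(SalesAmount)
	FROM [dbo].[FactInternetSales]
	WHERE YEAR([OrderDate]) = 2008
	GROUP BY YEAR([OrderDate])*100 + MONTH([OrderDate])
)
SELECT
	 [Month]
	,[SalesAmount]
	-- LAG returns 0 (the default) for the first month, so no NULL arithmetic
	,MonthOverMonth = [SalesAmount] - LAG([SalesAmount], 1, 0) OVER (ORDER BY [Month])
FROM CTE_basedata;
```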

FIRST_VALUE and LAST_VALUE

The last two window functions, FIRST_VALUE and LAST_VALUE, have an offset that is relative to the current partition and frame extent. The names of the functions pretty much explain their behaviour. Their syntax looks like this:

FIRST_VALUE/LAST_VALUE (expression) OVER ([partition by clause] ORDER BY clause [rows or range clause])

As with the other two offset functions, the ORDER BY is mandatory, since sorting is necessary to calculate the function. The ROWS or RANGE clause is supported, however, and it determines the ordered set of rows over which the function is applied.

An example:

SELECT
	 ID
	,Category
	,[First]	= FIRST_VALUE(Category) OVER (ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
	,[Last]		= LAST_VALUE(Category) OVER (ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM #Temp;

Do not forget to specify a frame extent; otherwise RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is used, which can lead to unexpected results for the LAST_VALUE function.
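To see the pitfall in action, compare the earlier query with a version that omits the frame extent. With the default frame the window ends at the current row, so LAST_VALUE simply echoes each row's own value:

```sql
-- Default frame stops at the current row,
-- so [Last] returns the current row's own Category
SELECT
	 ID
	,Category
	,[Last] = LAST_VALUE(Category) OVER (ORDER BY ID)
FROM #Temp;
```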

Conclusion

This article introduced the various windowing functions and their syntax. Since this is just an introduction, it may not yet be clear how enormously powerful, concise and efficient they can be. The second part of this series will go deeper on optimizations and on practical use cases.