Introduction

In the previous parts of this article – read them here and here – the T-SQL windowing functions were introduced, explained and we took a look at how we could improve their performance. This final part of the article will provide you with a few interesting use cases where those functions really prove their worth.

MCSE Training – Resources (Intense)

Moving Averages and Running Totals

The most obvious use cases are calculating moving averages – easily accomplished using the OVER clause with a frame extent like ROWS BETWEEN x PRECEDING AND y FOLLOWING – and running totals, which can be accomplished with frame extents like ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. In the two previous parts of this article, enough examples were shown, so I will not repeat them here.

Generating a Tally Table

A tally table is an auxiliary table that contains only a single column with sequential integers (forget for a moment that rows in a table actually don’t have an order), typically starting at 0 and ending at a very high number. Using such a numbers table, it becomes easier to do some complex calculations in a set-based manner.

Starting with two rows with the value 1 and a combination of cross joins, we can generate over 4 billion rows – about the range of a 4-byte integer – with only a few lines of code. At the end, the ROW_NUMBER() function is used to create the sequential numbers.

WITH T0 AS (SELECT N	 FROM (VALUES (1),(1)) AS tmp(N))
	,T1 AS (SELECT N = 1 FROM T0 AS a CROSS JOIN T0 AS b)
	,T2 AS (SELECT N = 1 FROM T1 AS a CROSS JOIN T1 AS b)
	,T3 AS (SELECT N = 1 FROM T2 AS a CROSS JOIN T2 AS b)
	,T4 AS (SELECT N = 1 FROM T3 AS a CROSS JOIN T3 AS b)
	,T5 AS (SELECT N = 1 FROM T4 AS a CROSS JOIN T4 AS b) -- over 4 billion rows
	,Tally AS (SELECT N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM T5)
SELECT TOP 500000 N FROM Tally;

With the TOP clause, the output is limited to only 500,000 rows in this example. Because the OVER clause sorts on (SELECT NULL), no actual sorting takes place which you typically want to avoid on 4 billion rows. The optimizer is smart enough to realize that only 500,000 rows are needed, so the plan doesn’t actually generate 4 billion rows but rather stops a bit over 500,000.

This code is very fast and the half million rows are generated in a little over one second.

Now what can we do with our tally table? You can for example quickly generate a list of sequential dates. The following code generates a list of all dates of the year 2015:

DECLARE @StartDate DATE = '2015-01-01';

WITH T0 AS (SELECT N	 FROM (VALUES (1),(1)) AS tmp(N))
	,T1 AS (SELECT N = 1 FROM T0 AS a CROSS JOIN T0 AS b)
	,T2 AS (SELECT N = 1 FROM T1 AS a CROSS JOIN T1 AS b)
	,T3 AS (SELECT N = 1 FROM T2 AS a CROSS JOIN T2 AS b)
	,T4 AS (SELECT N = 1 FROM T3 AS a CROSS JOIN T3 AS b)
	,T5 AS (SELECT N = 1 FROM T4 AS a CROSS JOIN T4 AS b) -- over 4 billion rows
	,Tally AS (SELECT N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM T5)
SELECT TOP 365
	SequentialDate = DATEADD(DAY,N-1,@StartDate)
FROM Tally;

With this list of dates, we could do other useful things. Suppose we have a table detailing when a contractor was working on a certain project. However, only the start date and the end date are provided. In the enterprise data warehouse, data is kept at the daily level so we need to have all the different days on which the contractor was working on the project. Using the generated list of dates, we can easily “explode” the data to the daily level.

CREATE TABLE #Contractors
	(ID INT NOT NULL
	,ContractorName VARCHAR(100) NOT NULL
	,ProjectName VARCHAR(100) NOT NULL
	,StartDate DATE NOT NULL
	,EndDate DATE NULL);

INSERT INTO #Contractors VALUES(1,'Koen','Project X','2015-01-05','2015-01-21');

DECLARE @StartDate DATE = '2015-01-01';

WITH T0 AS (SELECT N	 FROM (VALUES (1),(1)) AS tmp(N))
	,T1 AS (SELECT N = 1 FROM T0 AS a CROSS JOIN T0 AS b)
	,T2 AS (SELECT N = 1 FROM T1 AS a CROSS JOIN T1 AS b)
	,T3 AS (SELECT N = 1 FROM T2 AS a CROSS JOIN T2 AS b)
	,T4 AS (SELECT N = 1 FROM T3 AS a CROSS JOIN T3 AS b)
	,T5 AS (SELECT N = 1 FROM T4 AS a CROSS JOIN T4 AS b) -- over 4 billion rows
	,Tally AS (SELECT N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM T5)
	,DateList AS
(
	SELECT TOP 365
		SequentialDate = DATEADD(DAY,N-1,@StartDate)
	FROM Tally
)
SELECT
	 c.ContractorName
	,c.ProjectName
	,ActiveDate = d.SequentialDate
FROM DateList		d
JOIN #Contractors	c ON	d.SequentialDate >= c.StartDate
						AND	d.SequentialDate <= c.EndDate
						AND c.EndDate IS NOT NULL -- only take closed periods

For another great example on using the tally table, check out the excellent article by Jeff Moden on creating a function in SQL Server to split delimited strings: Tally OH! An Improved SQL 8K “CSV Splitter” Function.

Retrieving Top N from a Group

A common query problem is retrieving a fixed number of rows from the various partitions. For example, get the 5 most recent orders for each client. Another example is logging the various statuses of an item in a lifecycle. When grouping on a daily basis, only last logged status of the day should be retrieved. Check out the following sample data:

CREATE TABLE #StatusLog
	(ID					INT IDENTITY(1,1) NOT NULL
	,Item				VARCHAR(20) NOT NULL
	,StatusMessage		VARCHAR(100) NOT NULL
	,DateLogged			DATE NOT NULL
	,TimeLogged			TIME NOT NULL
	);

INSERT INTO #StatusLog
VALUES	 ('ItemA','Created'		,'2015-01-05','14:30:45')
		,('ItemA','Payed'		,'2015-01-05','14:30:55')
		,('ItemA','Handled'		,'2015-01-05','20:05:12')
		,('ItemA','Shipped'		,'2015-01-06','06:00:57')
		,('ItemB','Created'		,'2015-01-04','23:30:41')
		,('ItemB','Cancelled'	,'2015-01-05','00:02:08');

If we want to retrieve the last status of the day for each item, we expect four rows to be returned. Using ROW_NUMBER, we can assign an order to all of the events of an item in one specific day. By sorting the rows on descending order according to the timestamp, the most recent item status for a specific day will get row number 1.

SELECT
	 Item
	,DateLogged
	,TimeLogged
	,RID = ROW_NUMBER() OVER (PARTITION BY Item,DateLogged ORDER BY TimeLogged DESC)
FROM #StatusLog;

All what is left to do is to filter out all the rows we don’t need with a subquery.

SELECT Item, DateLogged
FROM
(
	SELECT
		 Item
		,DateLogged
		,TimeLogged
		,RID = ROW_NUMBER() OVER (PARTITION BY Item,DateLogged ORDER BY TimeLogged DESC)
	FROM #StatusLog
) tmp
WHERE RID = 1;

Using this method, you can also filter out duplicates from a table. I leave this as an exercise for the reader.

Maximum Concurrent Intervals

Suppose you have to calculate the maximum number of sessions that were active during a time period for a specific website, or the maximum number of viewers of a video on a streaming site. This type of problem is named the Maximum Concurrent Intervals. Let’s take a look at the following source data where we capture the user, the video he or she was watching, the start time and the end time.

CREATE TABLE #UserLog
	(ID			INT IDENTITY(1,1) NOT NULL
	,UserName	VARCHAR(50) NOT NULL
	,VideoID	INT NOT NULL
	,StartTime	DATETIME2(0) NOT NULL
	,EndTime	DATETIME2(0) NOT NULL);

INSERT INTO #UserLog
VALUES	 ('Bruce',1,'2015-01-05 12:30:45','2015-01-05 12:31:58')
		,('Clark',1,'2015-01-05 12:30:54','2015-01-05 12:31:20')
		,('Selena',1,'2015-01-05 12:35:01','2015-01-05 12:38:43')
		,('Dick',1,'2015-01-05 12:34:45','2015-01-05 12:39:23')
		,('Gordon',1,'2015-01-05 12:38:58','2015-01-05 12:40:33')
		,('Harley',1,'2015-01-05 12:39:15','2015-01-05 12:41:02');

Looking at the start and end times, we can determine that the output should be three. How do we solve this problem using a query? The solution is found by arranging all the timestamps chronologically. When a user starts watching, the number of concurrent viewers is increment with 1. However, if a user stops watching, the count is decremented with 1.

We can simulate this with the following UNION ALL:

SELECT VideoID, [TimeStamp] = StartTime, [Type] = 1
FROM #UserLog
UNION ALL 
SELECT VideoID, [TimeStamp] = EndTime, [Type] = -1
FROM #UserLog
ORDER BY [TimeStamp];

When we now calculate a running total over the type column, we can find the concurrent users.

WITH CTE_Timestamps AS
(
SELECT VideoID, [TimeStamp] = StartTime, [Type] = 1
FROM #UserLog
UNION ALL 
SELECT VideoID, [TimeStamp] = EndTime, [Type] = -1
FROM #UserLog
)
SELECT
	 VideoID
	,ConcurrentUsers = SUM([Type]) OVER (PARTITION BY VideoID
											ORDER BY [TimeStamp],[Type]
											ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM CTE_Timestamps;

The Type column is included in the sorting to consistently handle ties, e.g. when a user leaves at the same time a new user starts watching.

All that is left is to find the maximum among the concurrent users.

WITH CTE_Timestamps AS
(
SELECT VideoID, [TimeStamp] = StartTime, [Type] = 1
FROM #UserLog
UNION ALL 
SELECT VideoID, [TimeStamp] = EndTime, [Type] = -1
FROM #UserLog
)
SELECT VideoID, MaxConcurrentUsers = MAX(ConcurrentUsers)
FROM
(
	SELECT
		 VideoID
		,ConcurrentUsers = SUM([Type]) OVER (PARTITION BY VideoID
												ORDER BY [TimeStamp],[Type]
												ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
	FROM CTE_Timestamps
) cu
GROUP BY VideoID;

Gaps and Islands

Gaps and Islands is a well-known problem in SQL. It is about discovering holes in sequences or about finding ranges of consecutive values. One possible manifestation of gaps is finding out which days a student didn’t attend his classes. The islands problem is the other way around. An example could be identifying the periods of time where there was no work related accident at the factory.

Gaps

For the sake of simplicity of the example, assume for a moment that students have to show up on weekends and on holidays. The following query creates some sample data for a particular student. It contains the days where the student has come to class. Now we want to find all date ranges for which the student was not present.

CREATE TABLE #Attendance
	(ID			INT IDENTITY(1,1) NOT NULL
	,Student	VARCHAR(50) NOT NULL
	,DayPresent	DATE NOT NULL);

INSERT INTO #Attendance
VALUES	 ('StudentA','2015-01-01')
		,('StudentA','2015-01-02')
		,('StudentA','2015-01-03')
		,('StudentA','2015-01-04')
		,('StudentA','2015-01-05')
		,('StudentA','2015-01-07')
		,('StudentA','2015-01-08')
		,('StudentA','2015-01-09')
		,('StudentA','2015-01-15')
		,('StudentA','2015-01-16')
		,('StudentA','2015-01-17')
		,('StudentA','2015-01-19')
		,('StudentA','2015-01-20');

Using LEAD, we can associate a given day from the table with the day that follows in the sequence.

SELECT [current] = DayPresent, [next] = LEAD(DayPresent) OVER (ORDER BY DayPresent)
FROM #Attendance

We can now identify the end of an interval when the difference between current and next is bigger than 1 day. For example at row 5, the difference is bigger than 2 and it represents the end of the first interval (1st of January till the 5th of January). The gap itself is identified by adding one day to current and subtracting one day from next. This gives us the following query:

SELECT
	 gapStart	= DATEADD(DAY,1,[current])
	,gapEnd		= DATEADD(DAY,-1,[next])
FROM
(
	SELECT [current] = DayPresent, [next] = LEAD(DayPresent) OVER (PARTITION BY Student ORDER BY DayPresent)
	FROM #Attendance
) tmp
WHERE DATEDIFF(DAY,[current],[next]) > 1;

This solution is a great example of how windowing functions can really simplify T-SQL code.

Islands

Now we want to identify the ranges for which the student did show up for the classes. This can be done by using the DENSE_RANK function. If we subtract the number of days indicated by the rank from the date of the current row, we can see there is a “jump” each time a gap breaks the sequence.

SELECT DayPresent, groupID = DATEADD(DAY,-1*DENSE_RANK() OVER (PARTITION BY Student ORDER BY DayPresent),DayPresent)
FROM #Attendance

By doing this, we have created some sort of group identifier for an island. Now we just need to find the minimum and maximum date of each group to find the boundaries of the island interval.

SELECT
	 islandStart	= MIN(DayPresent)
	,islandEnd		= MAX(DayPresent)
FROM
(
SELECT DayPresent, groupID = DATEADD(DAY,-1*DENSE_RANK() OVER (PARTITION BY Student ORDER BY DayPresent),DayPresent)
FROM #Attendance
) tmp
GROUP BY groupID;

Resources

If you are intrigued by the power and simplicity of the T-SQL windowing functions, make sure to check out Itzik Ben-Gan’s book Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions. It’s one of the best T-SQL books I have ever read and it has taught me a great deal about windowing functions. The book was an inspiration for this 3-part article series, but the book covers a lot more ground. There is more explanation on optimizing the queries, there are more use cases (packed intervals, calculating the mode and so on) and in overall there is certainly more depth. I can’t recommend it enough.

Conclusion

In this final part of the article series about the T-SQL window functions, we saw a couple of real-life use cases and how you could solve this problems easily using window functions. Only thing to do is to try it out yourself!