A repost of http://hashrocket.com/blog/posts/sql-window-functions by Joshua Davey.
SQL Window Functions and You
Suppose you have a storefront application that sells pictures of cats. These cat pictures are categorized in meaningful ways. For example, there are LOLcats pictures and “Classic” cat pictures. Now, on the landing page of the store, you’d like to feature one picture from each category. It can’t be a random picture from each. You need to feature the cheapest picture from each category, displaying its name and price.
Also, it turns out that some “low” prices are very common. For example, $9.99 is a common sale price for LOLcats pictures. However, we should only ever feature one picture per category. When there are multiple pictures with the same low price, we fallback to the name, and show the first one alphabetically. How can we solve this problem, while also remaining performant?
As an aside, adding a cat to a Rennaisance painting amplifies its appeal ninefold.
Let’s look at some of the ways that we can approach this problem, displaying a list of cat pictures that are the cheapest for their respective category.
Approach 1: Ruby
Implementing the solution in Ruby is fairly straightforward.
ActiveSupportEnumerable provides the
sort_bymethods on collections, and we can use those to help us cut down on some typing.
First, we group all of the cat pictures by their category. Then, for each set of pictures, we sort them by their price and name, and take only the first one.
Perhaps you are wondering if inverting the responsibility would improve the implementation, putting the mapping and reduction impetus in the Category model instead. Although it would be possible to go through the Category model to find its cheapest picture, that would lead to an “n+1”, as each category would subsequently need fetch its cat pictures. Alternatively, eager-loading all categories with their cat pictures would be expensive, and would essentially duplicate what we’ve done above with the
Either way, as you can probably imagine, the above method would become more expensive as the data set continued to grow. Additionally, we lose the ability to continue to chain ActiveRecord scopes to filter the set further: as soon as we fetch the collection from the database, all filtering has to be done in Ruby.
- Easy to grok
- All domain logic stays in application
- Expensive (all objects loaded into memory)
- No scope chaining
- Once you go Ruby, you don’t go back
Approach 2: SQL subselects
We can improve performance by doing the filtering at the database level, rather than loading all cat pictures into memory each time.
Here, we use a subselect to filter the initial set down to only those that have the cheapest price per category. In this inner query, each row will contain a
category_idand its lowest
price. In the outer query, we choose all cat pictures whose
category_idmatch a row from this inner query, using the
We would be done here, except that there still exists the possibility that there could be more than one that have that low price for a given category. So, depending on the database vendor, we can here find “distinct” rows, according the columns of interest. In Postgresql, the syntax for this is
DISTINCT ON([column,...]), which will omit duplicates of the listed columns. For our purposes, we don’t want more than one per category, so we distinct on
It is worth noting that without an
DISTINCT ONis nondeterministic: we are not guaranteed to get the same result each time. Thus, we order by
name, so that only the first cat picture alphabetically will show up.
We can improve the implementation above by making it a true chainable scope. Whereas
find_by_sqlreturns an array of objects, we can refactor this to return an ActiveRelation instead.
Functionally, this generates the exact same query as before, but allows further chaining. Using ActiveRelation’s
to_sqlmethod, we’re able to build up our inner query without actually executing it. We then interpolate that query into what was the outer query, which we’ve reduced to calls to
- More performant than Ruby method
- Scope chaining still possible
- Nested subselects
- Very difficult to read in application code
- The use of
DISTINCT ON- only some RDBMS’ have such functionality
Approach 3: Window functions
But there is still another option. The SQL standard defines a concept called window functions, which act a lot like aggregates, but don’t change the result set. From the Postgresql documentation’s excellent introduction to window functions:
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row - the rows retain their separate identities.
Let’s see how this would work with our dataset. First of all, let’s assume the following cat pictures:
Given this data, our goal is to select “Hugs not Drugs” and “Turkleton’s Folly”, which are the cheapest pictures from their categories.
Whereas a normal aggregate function with
GROUP BYwould collapse the results, a window function retains the original row. Let’s consider how this would affect the inner query from the subselect approach above:
Above, we’ve replaced the
GROUP BYclause with an
OVERclause. We have the original rows with an additional column for this aggregate data. This is useful in its own right, but the real power of window functions comes from this concept of window framing. The use of
PARTITION BYcreates a frame for each group. In our case, we have two frames, one for each
category_id. Then, all aggregate and window functions before the
OVERclause operate against this frame. Each window frame effectively has its own result set, according to the defined partition.
When a window frame is ordered, using an
ORDER BYclause, even more options are possible. For example, consider the following:
Look familiar? This is essentially the original , except we’ve added a new column: its price rank within a window partitioned by
category_id. It’s a mouthful to describe, but we’re very close to our original goal of finding the cheapest cat picture per category. All we need to do now is select rows that have a rank of 1.
Not so fast. Can you spot the issue with the above? The
rank()window function assigns the same rank to ties, but we need the first one alphabetically in the case of “ties”. We can remedy that by using a different window function,
row_number(), which guarantees different numbers.
Perfect! Looking at the rows with “1” as their “row_number”, we see what we expect, “Hugs not Drugs” and “Turkleton’s Folly”, which are the cheapest pictures from their categories. We can use an
INclause for filtering, similar to the previous approach:
The where clause above filters records that both have an id that appears in the subquery next to a rank of 1. Now that we have the SQL down, let’s convert our Ruby model to take advantage of this window function technique:
Groovy. Just like before, we can use to the power of ActiveRelation to build up our subselect, which then gets interpolated into the
whereclause. I’ve also prepended
table_name, to avoid potential ambiguous column problems.
There is one potential issue with using window functions: limited vendor support. While most of the big boys implement window functions (Oracle, Postgresql, and SQLServer, to name a few), MySQL and SQLite users are out of luck.
- Very performant (consistently twice as fast as Approach 2 on my laptop)
- Much less noise than SQL subselect stuff
- Easy to understand, assuming a basic knowledge of SQL window functions
- Not portable (window functions are not available in MySQL or SQLite)
While they may not be appropriate for every situation, window functions are a great tool for your toolbelt. They excel at filtering down rows based on aggregate data, or adding aggregate data to the rows you’d already like to select.
While writing this post, I created a sample Rails app to iterate quickly. I used TDD to write the pure-ruby approach, and reused the specs while I “refactored” the implementation to the subsequent approaches. Of particular note is the history of the CatPicture model, which mirrors the code above.
Who We Are
Hashrocket is a Ruby on Rails design & development shop based in Jacksonville Beach, FL and Chicago.
We practice pair programming, test-driven development, user-centric design, and Agile.
The Hashrocket Blog is a collection of things we’ve learned, places we’re going, and general goings-on in our world.