Group-by From Scratch "Part 2"

The group-by (or split-apply-combine) pattern, illustrated below, is ubiquitous in data analysis. Essentially, you have data with predefined groups and want to compute some summary statistics for each of those groups. For example, suppose you have a set of high school students and want to estimate the average height by grade. You would need to split the dataset by grade, apply the mean operation to each group, then combine the means of all the groups together. The figure below is from the Python Data Science Handbook.

In Python, the Pandas library provides a fast, general implementation of this algorithm. Jake VanderPlas wrote an excellent blog post looking at implementing this algorithm from scratch using NumPy (and SciPy). He benchmarks the various approaches to help build intuition about writing fast NumPy code. This post extends that one (which is why it’s labelled “Part 2”), so please read Jake’s post first, then come back here.

Jake looks at the general case where you want to compute a sum within each group, then return the results as a Python dictionary. The keys for the groups can be any hashable type (strings, integers, etc.). In this post, I want to relax a few of those requirements to optimize the implementations and see how the results differ. Those relaxations are:

  1. The return value does not have to be a dictionary. It can be any type you can index into using a key to retrieve the result (e.g. a list or array).
  2. The inputs are always NumPy arrays.
  3. The keys are integers from 0 to $k$.
  4. The maximum key, $k$, is known in advance.

(1) seems reasonable, and should save us some time since we can avoid converting between container types. (2) is realistic, since you almost always want to manipulate numeric data in NumPy arrays instead of Python lists. (3) is fine since we can always map our arbitrary labels to integers in advance (a sketch of this is shown below). (4) seems restrictive, but in practice, you often know how many groups you have in your data. For the high school example, you know there are 4 grades (at least in Canada) a priori. If you are using a group-by to perform unsupervised clustering, many algorithms require you to decide the number of clusters in advance.
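As a quick aside on (3), here is one way to do that label-to-integer mapping, a minimal sketch using np.unique with return_inverse (the labels here are made up, and this preprocessing step is not part of the benchmarked implementations):

import numpy as np

labels = np.asarray(['b', 'a', 'c', 'b'])  # arbitrary hashable labels
uniq, int_keys = np.unique(labels, return_inverse=True)
int_keys  # integer keys from 0 to k
array([1, 0, 2, 1])
max_key = len(uniq) - 1  # k is now known in advance

This does sort the labels once, but only as a one-time preprocessing cost.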

We’ll see that these relaxations allow us to simplify the implementations and improve performance.

The rest of this post is structured as follows. First, we update Jake’s implementations according to our new constraints. The Pandas/Python implementations see very little change, while the NumPy-based solutions see large improvements. We do some big-O time complexity analysis to understand where we gain efficiency. Next, we show some new implementations that these relaxations afford us and go into more detail about them (as they’re not in Jake’s post). Finally, we benchmark these implementations to see if the conclusions from our analysis hold in practice.

For quick reference, here are links to each updated implementation:

  1. Pandas
  2. Dictionary
  3. Masking
  4. np.bincount
  5. Sparse
  6. np.arange
  7. np.eye
  8. scipy.ndimage.sum
  9. np.add.at

All the code is available in this Colab. It is similar to Jake’s, except we use time.perf_counter() for timing, seaborn for plotting, and include 95% confidence bands in the line plots. Notably, the data distributions are the same.

Updated Implementations

We use the following arrays as our example data:

import numpy as np

keys = np.asarray([0, 1, 2, 0, 1, 2])
vals = np.asarray([1, 2, 3, 4, 5, 6])
max_key = keys.max()

This is the same as Jake’s running example, except our keys are integers and we pre-compute the max key.

Non-NumPy Approaches

Our non-NumPy approaches don’t benefit much from our updated rules. First, the most idiomatic and common solution in Python: Pandas.

import pandas as pd

def pandas_groupby(keys, vals, max_key=None):
    return pd.Series(vals).groupby(keys).sum()

pandas_groupby(keys, vals, max_key)
0    5
1    7
2    9
dtype: int64

The code is nearly identical, except we remove the .to_dict(), so the result is a Pandas Series.
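The Series still satisfies relaxation (1): we can index into it with an integer key to get a group’s sum. For example:

pandas_groupby(keys, vals, max_key)[0]
5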

Since we know that the keys are from 0 to max_key in advance, we don’t have to use a defaultdict for the dictionary-based approach, and can instead prepopulate with 0s (shown below). In the end, we’ll see this doesn’t make a big difference in performance.

def dict_groupby(keys, vals, max_key=None):
    count = {k: 0 for k in range(max_key+1)}
    for key, val in zip(keys, vals):
        count[key] += val
    return count

dict_groupby(keys, vals, max_key)
{0: 5, 1: 7, 2: 9}

The itertools implementation does not change, so I excluded the code (it’s in Jake’s post).

NumPy-based Approaches

The additional information allows us to improve the asymptotic behaviour of these NumPy-based approaches, that is, the behaviour as the dataset size ($n$) and the number of groups ($k$) grow large. This post assumes you’re familiar with this concept; here is a great explanation if you’re not.

In the original problem, all these implementations require np.unique in order to determine the keys. np.unique sorts the data, which is an $O(n \log n)$ operation, where $n$ is the length of the dataset. Knowing that our keys are 0 to $k$, with $k$ known in advance, allows us to avoid this sort and save time.

def masking_groupby(keys, vals, max_key):
    return [vals[keys == key].sum() for key in range(max_key+1)]
    
masking_groupby(keys, vals, max_key)
[5, 7, 9]

Every boolean mask computation takes $O(n)$ time, and we do one for each key, resulting in $O(nk)$ time overall. This is more efficient than the original implementation, which requires $O(n \log n + nk)$ time due to the sort.

The other small improvement is constructing a list instead of a dict, since our keys are integers from 0 to max_key.

The np.bincount implementation becomes very sleek:

def bincount_groupby(keys, vals, max_key=None):
    return np.bincount(keys, weights=vals, minlength=max_key+1)

bincount_groupby(keys, vals, max_key)
array([5., 7., 9.])

Since we have max_key, we can tell bincount how much space to pre-allocate for the output (via minlength); this also guarantees one slot per group even if some key happens to be absent from the data. This approach only takes $O(n)$ time, since under the hood, the function just iterates through the array once and adds each value to the appropriate bucket.
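Here is a small standalone illustration of the minlength behaviour, with hypothetical keys separate from our running example (keys 2 and 3 never appear but still get zero entries):

np.bincount([0, 0, 1], weights=[1., 2., 3.], minlength=4)
array([3., 3., 0., 0.])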

For the sparse implementation, we again save time by skipping np.unique (and the conversion to a dict).

from scipy import sparse

def sparse_groupby(keys, vals, max_key=None):
    col = np.arange(len(keys))
    mat = sparse.coo_matrix((vals, (keys, col)))
    return mat.sum(1)

sparse_groupby(keys, vals, max_key)
matrix([[5],
        [7],
        [9]])

Since the sparse representation essentially stores a key-value lookup, the time complexity here is once again $O(n)$. If this were a dense matrix instead of a sparse one (i.e. if we summed up the 0s that the sparse matrix hides), it would be more expensive (which we’ll see below).
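To make the structure concrete, here is the dense view of the COO matrix for our running example (reconstructed outside the function purely for illustration):

mat = sparse.coo_matrix((vals, (keys, np.arange(len(keys)))))
mat.toarray()
array([[1, 0, 0, 4, 0, 0],
       [0, 2, 0, 0, 5, 0],
       [0, 0, 3, 0, 0, 6]])

Each row corresponds to a key, and summing along axis 1 recovers [5, 7, 9]. The sparse format never stores the zeros, so the sum only touches the $n$ stored values.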

Our output seems a bit strange now, but this still gives us the desired property of being able to index into it to get our sum:

x = sparse_groupby(keys, vals, max_key)
x[0]
matrix([[5]])

But if you want an array, you can convert between NumPy matrices and arrays without copying memory:

np.asarray(x).ravel()
array([5, 7, 9])

New Implementations

The convenient properties of our keys afford us some alternative implementations. We’ll cover these in more detail since they were not in Jake’s post. First:

def arange_groupby(keys, vals, max_key):
    one_hot = keys == np.arange(max_key+1)[:,None]
    return np.dot(one_hot, vals)

arange_groupby(keys, vals, max_key)
array([5, 7, 9])

Let’s break this one down. Conceptually, this is almost identical to the sparse matrix approach. Instead of using integer keys directly, we convert them to a “one hot” representation (which you may be familiar with from other machine learning tasks). Essentially, each label becomes a vector with max_key + 1 dimensions. All the values of that vector are 0 except at the index of the label. For example, if max_key = 2 and our key = 0, our one hot vector would be [1, 0, 0] (1 only in the 0th position).

Note that np.arange(max_key+1) = [0, 1, 2]. If key = 0, then (key == [0, 1, 2]) = [True, False, False], which can be interpreted as [1, 0, 0], which is our one hot representation. We want to obtain the one hot representation for every key in our collection, which we achieve using a broadcasting trick. The linked documentation provides an excellent explanation, which I highly encourage you to read. We use this trick to get:

one_hot = keys == np.arange(max_key+1)[:,None]
one_hot.astype(int)
array([[1, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 1]])

Here, each column in our matrix represents the one hot representation of each key. Now, when we use np.dot, we are computing the inner product between each row of this matrix and the entire values vector. This gives us the sum within each group (I’ll leave it as an exercise to you to verify that). It’s essentially the same as the summation in the sparse COO matrix case.
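As a partial check of that claim, take the first row of one_hot and the values vector (the rest of the verification is still your exercise):

np.dot(one_hot[0], vals)  # [1, 0, 0, 1, 0, 0] . [1, 2, 3, 4, 5, 6] = 1 + 4
5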

In terms of time complexity, the one_hot computation compares every value in our data to every potential key, resulting in $O(nk)$ time. The matrix-vector product is $O(n)$ for each row (since you’re multiplying $n$ values and summing), for a total of $O(nk)$ over all $k+1$ rows. This results in a total running time of $O(nk + nk) = O(nk)$.

Our next implementation leverages a one hot representation as well:

def eye_groupby(keys, vals, max_key):
    one_hot = np.eye(max_key+1)[keys]
    return np.dot(one_hot.T, vals)

eye_groupby(keys, vals, max_key)
array([5., 7., 9.])

Here, we compute the identity matrix using np.eye. We know row $i$ of this matrix is zero everywhere but the $i$th column, which is precisely our one hot representation. We use advanced indexing to extract the row corresponding to each label. Concretely:

np.eye(max_key+1)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

Recall our keys are [0, 1, 2, 0, 1, 2]. With advanced indexing:

np.eye(max_key+1)[keys]
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

This matrix is the transpose of what we had with arange_groupby, which is why we transpose it before applying the dot product.
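We can verify this relationship directly (a quick sanity check, not part of the implementation):

np.array_equal(np.eye(max_key+1)[keys],
               (keys == np.arange(max_key+1)[:,None]).T)
True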

The time complexity is the same as the arange case. The advanced indexing hides some of the operations, but you still have to materialize $nk$ values, then do the same multiplication.

Our next implementation leverages unbuffered operations. Before showing it, consider the following:

def incorrect_at_groupby(keys, vals, max_key):
    s = np.zeros(max_key+1)
    s[keys] += vals
    return s

incorrect_at_groupby(keys, vals, max_key)
array([4., 5., 6.])

We used advanced indexing to pull out the right position in s depending on the key, then added the corresponding item in vals to it. The problem is that this is a buffered operation: the increments are computed from the original values of s, so when an index is repeated, only the last update to that index is observed. Since the last occurrences of 0, 1, and 2 in keys correspond to 4, 5, and 6 in vals, those are the values we see. We can leverage np.add.at to perform the same operation in an unbuffered manner, which means each occurrence will accumulate:

def at_groupby(keys, vals, max_key):
    s = np.zeros(max_key+1)
    np.add.at(s, keys, vals)
    return s

at_groupby(keys, vals, max_key)
array([5., 7., 9.])
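The difference between buffered and unbuffered updates is easiest to see on a tiny example with a repeated index (unrelated to our running example):

s = np.zeros(3)
s[[0, 0]] += 1           # buffered: the repeated index ends up incremented only once
s
array([1., 0., 0.])

s = np.zeros(3)
np.add.at(s, [0, 0], 1)  # unbuffered: both increments are applied
s
array([2., 0., 0.])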

Finally, we can use another built-in function for performing a group-by, this time in SciPy:

from scipy import ndimage

def ndimage_groupby(keys, vals, max_key):
    return ndimage.sum(vals, labels=keys, 
                       index=np.arange(max_key+1))

ndimage_groupby(keys, vals, max_key)
array([5., 7., 9.])

This method is implemented using np.bincount under the hood, so it will be strictly slower (due to the additional work needed to handle multidimensional inputs). For that reason, it is not included in the benchmarks below.

Timings

The timings were measured in Python 3.8.5 on an Intel i5-8265U CPU @ 1.60GHz with 8 GB of RAM on Ubuntu 20.04. For a baseline, below are timings of the implementations provided in Jake’s blog post:

The colored bands around the lines are 95% confidence intervals (for most of the implementations they’re invisible). Notice that both axes are logarithmic. As shown in Jake’s post, the Pandas implementation really starts to shine as we increase the dataset size. At smaller sizes, the NumPy-based implementations all do well, with the pure-NumPy ones (i.e. no Python dicts or lists) pulling ahead. Masking slows down significantly as the number of groups grows: at every iteration of the masking loop (one iteration per unique key), we have to evaluate a boolean mask over the entire dataset, which gets expensive.

Below are timings for the updated implementations:

For the updated implementations, we observe some similar trends: the Pandas implementation goes from slowest to among the fastest as our data size increases, while the others increase with generally the same slope. When we increase the number of groups, the masking and one-hot implementations suffer, since their time complexity depends on the number of groups $k$.

The pure-NumPy implementations benefit the most from our relaxations. All of them used np.unique, and removing that call improved speed across the board (as predicted by our big-O analysis). In particular, the np.bincount-based implementation is now the fastest at all dataset and group sizes.

Below, we show the relative speed-up:

As we saw in the other two plots, the np.unique-based solutions saw the biggest benefit (bincount, sparse, and masking), while the others saw negligible improvements. This is expected: as the data size grows large, container and key conversions are just small overheads compared to the large computation.


Code

All the code can be found in the Colab here. The main benchmarking functions are shown below:

import time

def time_groupby(func, n_group, size, rseed=754389, n_iter=500):
    times = np.empty(n_iter)
    rand = np.random.RandomState(rseed)
    for n in range(n_iter):
        keys = rand.randint(0, n_group, size)
        vals = rand.rand(size)

        start = time.perf_counter()
        _ = func(keys, vals, n_group)
        end = time.perf_counter()
        times[n] = end - start

    return times


def bench(funcs, n_groups, sizes, n_iter=500):
    """Run a set of benchmarks and return as a dataframe"""    
    n_groups, sizes = np.broadcast_arrays(n_groups, sizes)
    
    dfs = []
    for func in funcs:
        for n_group, size in zip(n_groups, sizes):
            timings = time_groupby(func, n_group, size, n_iter=n_iter)
            dfs.append(pd.DataFrame({
                'Times (s)': timings,
                'Name': [func.__name__.split('_')[0]]*n_iter,
                'Num Groups': [n_group]*n_iter,
                'Size': [size]*n_iter,
            }))
            
    return pd.concat(dfs)

n_groups = (10 ** np.linspace(0, 4, 10)).astype(int)
sizes = (10 ** np.linspace(2, 6, 10)).astype(int)

funcs = [pandas_groupby, dict_groupby, itertools_groupby,
         masking_groupby, bincount_groupby, sparse_groupby,
         list_groupby, arange_groupby, eye_groupby, at_groupby]
timings_sizes = bench(funcs, 10, sizes, n_iter=100)
timings_groups = bench(funcs, n_groups, 10000, n_iter=100)

Updates

2021-02-21: Removed ndimage from timings, added at_groupby.

