Dias: Automatically Rewriting Pandas for 1000x Speedups

Mar 28, 2023

Co-written with Daniel Kang and Charith Mendis.

An unsatisfied need in EDA workloads

Pandas is the main driving force behind the ever more popular Exploratory Data Analysis (EDA) workloads [1], [2], [3], [4], but there is still no good optimization techniques for Pandas for this setting. EDA workloads are special because they are ad-hoc and unrefined. The code written is diverse, and may include anything from pure Python to arbitrary combinations of multiple libraries. Furthermore, EDA is also supposed to be quick! The code should be fast to write and fast to run. EDA is not the time to learn new APIs and it is not the time to carefully optimize your code.

These characteristics make current solutions fall short in optimizing Pandas for the EDA setting. They require either learning new APIs, or they incur significant overheads even when running on powerful hardware [5]. We introduce Dias, a novel, extremely lightweight, and yet effective, optimizer of Pandas code, catered to EDA workloads.

Introducing Dias

Docs & Code

Try Dias Online

Dias is a dynamic rewriter of Pandas code. Dias is not a replacement for Pandas, but rather a process that runs in the background of your Jupyter notebook, looking for code patterns it can rewrite into faster equivalents. Upon recognizing one, Dias rewrites the code automatically and correctly.

Dias has several advantages. First, it can offer substantial performance improvements, possibly reaching 100x or 1000x speedups. Second, Dias is extremely lightweight as it incurs virtually no runtime or memory overheads. Thus, Dias generally won't make your code slower than vanilla Pandas. Third, Dias inherently does not suffer from a lack of API support, because it is not a replacement for Pandas. If Dias does not understand a piece of code, it will leave it untouched. Finally, you don't have to know anything about Dias to understand the cause of an optimization. The code Dias outputs is still standard Python/Pandas code. You can simply ask Dias to show you the rewritten version and even copy-and-paste it to a new cell and experiment.

For a more detailed analysis of Dias, please take a look at our paper.

"I'm sold, how do I start using it?"

Just pip install dias and add the %%rewrite magic function in all of your cells. That's it! Let's see an example. Suppose we have the following notebook which populates a DataFrame with random data and then calls a function 1, with apply(), on every row.

In [1]

import pandas as pd
import dias.rewriter
import numpy as np
rand_arr = np.random.rand(2_500_000,20)
df = pd.DataFrame(rand_arr)

In [2]

%%time
def weighted_rating(x, m=50, C=5.6):
    v = x[0]
    R = x[9]
    return (v/(v+m) * R) + (m/(m+v) * C)
_ = df.apply(weighted_rating, axis=1)

CPU times: user 9.97 s, sys: 90.9 ms, total: 10.1 s
Wall time: 10.1 s

The operation is quite slow, taking about 10s. Now, we will leave the cell untouched and just add the %%rewrite magic function.

In [3]

%%time
%%rewrite
def weighted_rating(x, m=50, C=5.6):
    v = x[0]
    R = x[9]
    return (v/(v+m) * R) + (m/(m+v) * C)
_ = df.apply(weighted_rating, axis=1)

CPU times: user 59.6 ms, sys: 7.57 ms, total: 67.2 ms
Wall time: 66 ms

We can see that just like that we got a 153x speedup. And this experiment was done on a laptop. We can also ask Dias to see what it did with %%rewrite verbose.

In [4]

%%rewrite verbose
def weighted_rating(x, m=50, C=5.6):
    v = x[0]
    R = x[9]
    return (v/(v+m) * R) + (m/(m+v) * C)
_ = df.apply(weighted_rating, axis=1)

Dias rewrote code:

def weighted_rating(x, m=50, C=5.6):
    v = x[0]
    R = x[9]
    return v / (v + m) * R + m / (m + v) * C
_ = weighted_rating(df)

Dias recognized that instead of calling the function individually for every row (which is what apply() does), it can simply apply the function directly to the whole DataFrame.

This is basically all you need to know to use Dias. However, we do recommend that you take a look at our documentation and examples.

"Why can't the current Pandas replacements help me with EDA?"

Given that EDA's popularity is rapidly growing, it should be of no surprise that industrial and academic communities have devoted considerable effort into optimizing Pandas, usually by shipping Pandas replacements. Examples include Modin, Dask, and Koalas, which e.g., can scale the workload out to multiple servers.

Unfortunately, these libraries can incur significant overheads. As we detail in the Dias paper, these can include significant runtime and memory overheads 2, but also an overhead in the human effort, as the user has to learn new APIs. In our opinion, these overheads are acceptable when the data preparation pipeline is fixed, because the amortized performance gain, especially when moving to huge datasets, rationalizes the trade-off.

However, at the time of EDA, this trade-off is probably not worth it, because learning new APIs, requiring resources that are quite more demanding than consumer machines (or the limited resources of Kaggle and Google Colab), and incurring significant runtime overheads, can impair the quick-and-dirty nature of EDA. This situation is what led us to a new research direction, looking for an alternative, lightweight optimization technique.

We should clarify that Dias is not a replacement for these frameworks. For example, Dias does not scale up as the number of cores, or memory, are increased, and Dias won't be able load any dataset that Pandas cannot load. Dias and these other techniques/frameworks, are simply intended for different settings (and they are also conceptually orthogonal).

Let us know what you think and help us by contributing!

Dias is an ongoing research project by the ADAPT group @ UIUC. You can help by sending us notebooks that you want to speed up and we will our best to make Dias do it automatically! Moreover, if you are aware of a pattern that can be rewritten to a faster version, please consider submitting a relevant issue.

We also welcome feedback from all backgrounds, including industry specialists, data analysts and academics. Please reach out to sb54@illinois.edu to share your opinion!

Footnotes

Note that this function is not made up. It comes from a real-world notebook. We use random data to make the code smaller, but if we use real data (e.g., look at this example), the speedups are larger.↩
Figures 11 and 12 in the paper show the interesting case of the real-world notebook named adidas-retail-eda. Dias and Pandas consume less than 2GB of memory for this notebook, while Modin skyrockets its memory consumption to almost 90GB. At the same time, Modin is at least 4x slower than both Pandas and Dias.↩