Accidentally Quadratic with DataFrames in Julia

A post recently done the rounds where it looks like GTA had a bad implementation of an algorithm that scaled in a quadratic fashion (How I cut GTA Online loading times by 70%), which echoed a Bruce Dawson quote article about how it is common for quadratically efficient processes to end up in production. Quadratic algorithms are fast enough when testing but once in production all of a sudden the performance issues catch up with you and your sat with a very inefficient process. Well that happened to me.

Enjoy these types of posts? Then you should sign up for my newsletter.

Every month I recalibrate a model using the latest data pulled from a database. I take this raw data and generate some features, fit a model and save down the results. One of those operations is to match all the id’s with the old data and new data to work out which trades need new features needed to be generated.

Basically, imagine I have a dataframe, and I want to find all the rows that match some values. In this mock example, column B contains the IDs and I’ve some new IDs that I want to filter the dataframe for.

using DataFrames, DataFramesMeta
using BenchmarkTools
using Plots

I’ll create a large mock dataframe as an example.

N = 1000000
df = DataFrame(A = rand(N), B = 1:N);

My slow implementation use the DataFramesMeta package and used the broadcasted in function to check whether each value was in the new ids. This worked without a hitch last month, but then all of a sudden seemed to be incredibly slow. This was strange as I hadn’t changed anything, did the usual reboot of the machine and start afresh but it was still painfully slow.

function slow(df, ids)
  @where(df, in(ids).(:B))
end

After a quick profiling, I found that it was the above function that was the bottleneck. So I refactored it to remove the DataFramesMeta dependancy and just used the base functions.

function quick(df, ids)
  df[findall(in(ids), df.B), :]
end

Thankfully this solved the issue, was much quicker and allowed my process to complete without a hitch. This got me thinking, how slow was my originally implementation and how much different is the new version. So onto the benchmarking.

Using the BenchmarkTools.jl package I can run multiple iterations of each function across larger and larger IDs samples.

nSamps = [1, 10, 100, 1000, 10000, 100000, 1000000]
resQuick = zeros(length(nSamps))
resSlow = zeros(length(nSamps))

for (i, n) in enumerate(nSamps)
  ids = collect(1:n) 
    
  qb = @benchmark quick($df, $ids)
  sb = @benchmark slow($df, $ids)
    
  resQuick[i] = median(qb).time
  resSlow[i] = median(sb).time
end

I’ve made sure that I compiled the original function before starting this benchmarking too.

plot(log.(nSamps), log.(resQuick), label="Quick", legend=:topleft, xlabel="log(Number of IDs selected)", ylab="log(Time)")
plot!(log.(nSamps), log.(resSlow), label="Slow")

svg

The difference in performance in remarkable. The quick function is pretty much flat and just a slight increase towards the large sizes in this log-log plot, whereas the slow version is always increasing. When we model the slow implementation performance as a power law we find that it is not quite quadratic, but more importantly, we can see that the faster method is pretty much constant, so a much scalable solution.

using GLM
lm(@formula(LogTime ~ LogSamps),
     DataFrame(LogSamps = log.(nSamps), LogTime=log.(resSlow)))

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

LogTime ~ 1 + LogSamps

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  15.1134     0.275726   54.81    <1e-07  14.4046    15.8221
LogSamps      0.885168   0.0332117  26.65    <1e-05   0.799794   0.970541
─────────────────────────────────────────────────────────────────────────

When I first come across this issue I was ready to book out my week to rewriting the data functions to iron out any of the slow downs, so I was pretty happy that rewriting that one function made everything manageable.