Enjoy these types of posts? Then you should sign up for my newsletter.

I first encountered the Almgren-Chriss model in the first year of my PhD through a Microstructure and Machine Learning course. It ran for two hours from 18:00 on Friday evenings, on the other side of London from where I lived, so it was a bit of a pain to attend. This post is in essence inspired by those notes, as I’ve always wanted to summarise them into a digital version. So this is a maths-heavy post that will act as a springboard for some future content.

We have an amount \(X_0\) of something to trade over some time \(0\) to \(T\) such that \(X_T = 0\). How should we slice and dice our trades to minimise the execution cost?

We need a model of

- How the price moves
- How our trading affects prices

then we can build a trading cost function that we then optimise in different ways.

The price evolves like \(S_t = \bar{S} _t + \eta v_t + \theta (X_0 - X_t),\)

- \(\bar{S} _t\) is the unperturbed stock price
- \(\eta \cdot v_t\) is the temporary market impact that scales with the trading speed \(v_t\)
- \(\theta \cdot (X_0 - X_t)\) is the permanent market impact

The unperturbed price is a simple Gaussian random walk with no drift: \(\mathrm{d} \bar{S} _t = \sigma S_0 \mathrm{d} W_t\)

The trading rate is \(v_t = - \frac{\mathrm{d} X_t}{\mathrm{d}t} = - \dot{X} _t\), simply the speed at which we are executing the trades.

So the fundamental price (\(\bar{S}\)) evolves as a random walk, but the act of trading means the observed price is higher by an amount proportional to our trading speed. The signs of the components are set up such that we are buying, so the faster we trade, the further we push the observed price above the true price.

The final cost of the execution is the sum of the amount we traded multiplied by the price of all the trades. In continuous time this is simply the integral of this observed stock price multiplied by the trading speed over the execution window:

\[C_{0, T} = \int _0 ^T S_t v_t \mathrm{d} t,\]which after inserting the equation for the asset price gives us three different components

\[C_{0,T} = \underbrace {\int _0 ^T \bar{S}_t v_t \mathrm{d} t}_\text{(1)} + \underbrace{\int_0 ^T \eta v_t ^2 \mathrm{d} t}_\text{(2)} + \underbrace{\int _0 ^T \theta (X_0 - X_t) v_t \mathrm{d}t}_\text{(3)}\]For term \((1)\) we use integration by parts:

\[\begin{align*} \int _0 ^T \bar{S}_t v_t \mathrm{d} t & = - \int _0 ^T \bar{S}_t \mathrm{d}X_t \\ & = - \left[\bar{S}_t X_t \right]_0^T + \int _0 ^T X_t \mathrm{d} \bar{S}_t \\ & = -(\bar{S}_TX_T - \bar{S}_0X_0) + \int _0 ^T X_t \sigma S_0 \mathrm{d} W_t \\ & = \bar{S}_0 X_0 + \int _0 ^T X_t \sigma S_0 \mathrm{d} W_t, \end{align*}\]where the last line uses the boundary condition \(X_T = 0\).

For term (3), substituting \(v_t \mathrm{d}t = -\mathrm{d}X_t\) again

\[\theta \int _0 ^T (X_0 - X_t) v_t \mathrm{d} t = -\theta \int _0 ^T (X_0 - X_t) \mathrm{d} X_t = \frac{\theta}{2} X_0^2,\]using the substitution \(u = X_0 - X_t\), \(\mathrm{d}u = -\mathrm{d}X_t\), which turns the integral into \(\theta \int _0 ^{X_0} u \, \mathrm{d}u\). This gives us a formula for \(C_{0, T}\)

\[C_{0, T} = \bar{S}_0 X_0 + \int _0 ^T X_t \sigma S_0 \mathrm{d} W_t + \eta \int _0 ^T v_t ^2 \mathrm{d}t + \frac{\theta}{2} X_0^2.\]This is our cost function and we want to find the \(v_t\) that minimises the final cost.

If we take expectations (we want to minimise the *average* execution cost - each path will be different as this is a stochastic problem), the Itô integral has zero mean and drops out, leaving just one term in the expected cost that our trading can influence.

So we minimise the expected cost by finding the trading speed that minimises this term

\[\min _{v_t} \eta \int _0 ^T v^2_t \mathrm{d} t.\]To solve this we apply the Euler-Lagrange equation, with the integrand playing the role of the action.

\[\frac{\partial f}{\partial X} = \frac{\mathrm{d}}{\mathrm{d}t} \frac{\partial f}{\partial \dot{X}},\]remembering that \(v_t = -\dot{X}_t\), so \(\frac{\partial f}{\partial \dot{X}} = -\frac{\partial f}{\partial v}\). And from the above

\[\begin{align*} f & = v^2_t \\ \frac{\partial f}{\partial X} & = 0 \\ \frac{\partial f}{\partial v} & = 2 v_t, \end{align*}\]so

\[\frac{\mathrm{d}}{\mathrm{d} t} v_t = 0,\]which means the speed of the execution must be constant \(v_t = B\).

\[X_t = A + B t.\]We have the boundary conditions

\[X_0 = A,\] \[X_T = X_0 + BT = 0,\] \[B = \frac{-X_0}{T},\] \[X_t = X_0 - \frac{X_0}{T} t.\]Putting this trading schedule back into the expected cost formula gives us an overall result

\[\int _0 ^T v_t^2\mathrm{d} t = \frac{X^2_0}{T^2} (T - 0) = \frac{X_0^2}{T}.\]When we plot this schedule we can see that the speed is constant and we are simply running a TWAP (time-weighted average price).
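As a numerical sanity check, we can discretise \(\int_0^T v_t^2 \mathrm{d}t\) and confirm the constant-speed schedule beats a front-loaded alternative. This is just a sketch with made-up values of \(X_0\) and \(T\):

```julia
# Discretised impact cost ∫ v_t^2 dt for any schedule X (illustrative numbers).
X0, T, n = 100.0, 1.0, 10_000
dt = T / n
ts = range(0, T; length = n + 1)

cost(X) = sum((diff(X) ./ dt) .^ 2) * dt   # v_t = -dX/dt; squared, so the sign drops out

twap = X0 .* (1 .- ts ./ T)                                               # constant speed
frontloaded = X0 .* (exp.(-3 .* ts) .- exp(-3 * T)) ./ (1 - exp(-3 * T))  # same endpoints

cost(twap)          # ≈ X0^2 / T = 10_000
cost(frontloaded)   # strictly larger
```

Any schedule that starts at \(X_0\) and ends at \(0\) but varies its speed pays more impact than the straight line.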

The maths is telling us:

- To minimise the cost of trading an amount \(X_0\), you should run your TWAP over as long a window as possible: the impact cost \(\eta X_0^2 / T\) vanishes as \(T \to \infty\).

This neglects the price risk, so sure, run a very long TWAP but don’t complain when the market trends against you!

How can we account for this price risk?

We now need to minimise both the expected cost and the *variance* of
the expected cost with our trading schedule. This means we will now be
sensitive to cases where the price moves far away from the starting
value.

We introduce a new parameter, \(\lambda\), that controls our risk aversion. So now we are worried about the price potentially running away from us if we take too long to finish the trade

\[\min _ {v_t} \left( \mathbb{E} [C] + \lambda \text{Var} [C] \right ),\]so now we want to minimise the average and the variation of the trading cost and see what schedule that produces.

When we took the expectation, only the deterministic bits remained. When we calculate the variance only the random bits remain

\[\text{Var} [C] = \mathbb{E} \left[ \left( \sigma S_0 \int _0 ^T X_t \mathrm{d} W_t \right)^2 \right] = \sigma ^2 S_0^2 \int _0 ^T X_t ^2 \mathrm{d} t,\]by the Itô isometry (treating the schedule \(X_t\) as deterministic), which means our minimisation problem can be written as:

\[\min _{v_t} \eta \int _0 ^T v_t ^2 \mathrm{d} t + \lambda \sigma ^2 S_0^2 \int _0 ^T X_t ^2 \mathrm{d} t.\]Using the Euler-Lagrange equation again, with \(A = \eta\) and \(B = \lambda \sigma^2 S_0^2\):

\[\begin{align*} f & = A v_t^2 + B X_t^2 \\ \frac{\partial f}{\partial X} & = 2B X_t \\ \frac{\partial f}{\partial \dot{X}} & = -2A v_t \\ B X_t & = -A\frac{\mathrm{d} }{\mathrm{d} t} v_t = A \frac{\mathrm{d}^2}{\mathrm{d} t^2} X_t. \end{align*}\]This is a second-order linear ordinary differential equation with solution

\[X_t = c_1 e^{\sqrt{\frac{B}{A}} t} + c_2 e ^{- \sqrt{\frac{B}{A}} t}.\]Again, applying boundary conditions

\[X_0 = c_1 + c_2,\] \[X_T = 0 = c_1 e^{\sqrt{\frac{B}{A}} T} + c_2 e^{-\sqrt{\frac{B}{A}} T},\] \[X_t = X_0 \frac{\sinh \left( \kappa (T-t) \right)}{\sinh \left( \kappa T \right)}, \qquad \kappa = \sqrt{\frac{B}{A}} = \sqrt{\frac{\lambda \sigma ^2 S_0^2}{\eta}}.\]Which is a funny-looking expression, but underneath it is just a combination of exponentials.

We now have the additional \(\lambda\) parameter and so plot the execution schedule for different risk aversions
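A minimal sketch of computing these schedules for a few values of \(\lambda\) (the \(\eta\), \(\sigma\) and \(S_0\) values here are illustrative assumptions, and a few points of each schedule are printed rather than plotted):

```julia
# Optimal schedule X_t = X0 sinh(κ(T - t)) / sinh(κT), with κ = sqrt(λ σ² S0² / η).
X0, T = 100.0, 1.0
η, σ, S0 = 0.1, 0.02, 1.0   # illustrative parameters, not calibrated to anything

κ(λ) = sqrt(λ * σ^2 * S0^2 / η)
schedule(t, λ) = X0 * sinh(κ(λ) * (T - t)) / sinh(κ(λ) * T)

for λ in (1.0, 1_000.0, 100_000.0)
    # remaining inventory at a few points in the window
    println(round.(schedule.([0.0, 0.25, 0.5, 0.75], λ); digits = 2))
end
```

As \(\lambda \to 0\) the \(\sinh\) ratio tends to \((T-t)/T\) and we recover the TWAP; large \(\lambda\) produces heavily front-loaded schedules.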

A lower \(\lambda\) means a higher risk tolerance, so the schedule becomes closer to the TWAP. In general, we can see that the Almgren-Chriss solution is front-loaded - most of the trading is done early in the time window.

Ok, maths over, put down your pencils and breathe. We’ve gone through the full problem set-up and shown how a TWAP minimises the expected cost for a risk-neutral investor and how an exponential execution schedule minimises cost for a risk-sensitive investor.

Now we know the maths we can go on to do some interesting things.

---

This post was inspired by a problem on the r/quant subreddit where someone posted their interview/take-home question.

A client is considering using SGD to (proxy) hedge their exposure to a basket of other Asian currencies. Is this likely to be effective? What analysis could you produce that would help inform their decision?

- The client is a US corporate.
- The client is exposed to medium-term changes (say monthly) in the currency.
- The client has equal (USD-equivalent) revenues in each Asian currency.
- We are not considering hedging costs for this analysis (spot-only component).
- The data for daily close spot values against USD for each pair is provided.

Which currency pairs will it work better for? Would it work for an equally weighted currency portfolio? Would another (single) currency work better? Which correlations should we consider and how reliable are these?

This is an interesting question and not too dissimilar to the occasional question I answer in my day job. So I thought I’d run through how I might answer it.

First, we need to get some data, and I’ll be using AlphaVantage to pull daily closing prices of the different currencies. I’ll calculate the log returns and save the data to cache it for future use. Plus, AlphaVantage only lets you make 25 calls a day, so each time I mucked up I got locked out for the day, delaying the analysis. We have to start from 2014 as this is the earliest common date across all currencies.

```
function _pull_data(ccy)
    println(ccy)
    res = AlphaVantage.fx_daily("USD", ccy, outputsize="full", datatype="csv")
    res = DataFrame(Dict(:Date => Date.(res[1][:, 1]), :c => Float64.(res[1][:, 5]), :ccy => ccy));
    res = sort(res, :Date)
    res = @transform(res, :LogReturn = [0; diff(log.(:c))])
    res
end

function pull_data(ccy)
    if isfile("$ccy.csv")
        res = CSV.read("$ccy.csv", DataFrame)
    else
        res = _pull_data(ccy)
        CSV.write("$ccy.csv", res)
    end
    res
end

ccys = ["JPY", "CNH", "SGD", "THB", "HKD", "KRW", "TWD"]
res = vcat(pull_data.(ccys)...);
res = sort(res, :Date)
res = @transform(groupby(res, :ccy), :LogReturn = [0; diff(log.(:c))])
res = @subset(res, :Date .>= Date("2014-11-24"))
```

Like all good blog posts, let’s start with the plot of the cumulative returns. Only HKD stands out as something different given its peg to USD.

```
p = plot(ylabel = "Cumulative Return")
for ccy in ccys
    plot!(p, res[res.ccy .== ccy, :].Date, cumsum(res[res.ccy .== ccy, :].LogReturn), label = ccy, lw = 2)
end
p
```

According to the problem, our client is long equal amounts of these Asian currencies, so it makes sense to calculate the market returns by taking the average return each day.

```
market = @combine(groupby(res, :Date), :LogReturn = mean(:LogReturn))
market[!, :ccy] .= "Market"
market[!, :c] .= NaN;
```

Which we add to the original plot.

```
p = plot!(p, market.Date, cumsum(market.LogReturn),
    label = "Market", color = "black", lw = 2)
```

The client thinks that hedging with SGD alone is enough to protect against the overall market returns. We can see from the graph that this probably isn’t the case. But how do we recommend a better approach?

First, we will start with the correlation in returns between the different currencies. This will shed some light on how linked they are and is also simple to explain to the client.

```
modelData = dropmissing(unstack(res, :Date, :ccy, :LogReturn))
cr = cor(Matrix(modelData[:, [:JPY, :CNH, :SGD, :THB, :HKD, :KRW, :TWD]]))
heatmap(ccys, ccys, cr .> 0.5)
```

We use a heat-map, but only highlight when two currencies have a correlation > 0.5, otherwise it’s a bit of a psychedelic nightmare.

We can see that HKD has a low correlation with most, KRW and SGD have a high correlation between each other and KRW has a high correlation with the majority of these currencies. However, we will use the covariance matrix to analyse the best hedging portfolio rather than the correlation matrix.

Principal component analysis (or PCA) is a tool that tries to find a common basis of variation in a matrix. It’s about transforming the data into uncorrelated components through linear algebra.

For this we are using the covariance matrix, so the diagonals are the individual return series variances and the off-diagonals are the covariances between two currencies. If this were a different problem we might rescale the returns so they all had the same volatility, but this would mean applying leverage, which our hypothetical customer probably wouldn’t be up for.
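Under the hood, PCA on a covariance matrix is an eigen-decomposition. A self-contained toy sketch (two simulated return series sharing one common factor - nothing to do with our FX data):

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(1)
common = randn(1000)                          # one shared "market" factor
returns = hcat(common .+ 0.1 .* randn(1000),
               common .+ 0.1 .* randn(1000))  # two noisy copies of it

C = cov(returns)                              # 2x2 covariance matrix
vals, vecs = eigen(Symmetric(C))              # eigenvalues in ascending order
explained = reverse(vals) ./ sum(vals)        # variance explained per component
# explained[1] is close to 1: the common factor dominates both series
```

The eigenvectors are the component weights and each eigenvalue, divided by their sum, is that component's share of the total variance - the same quantities we read off the fitted PCA model below.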

We pull out the covariance matrix

```
modelData = dropmissing(unstack(res, :Date, :ccy, :LogReturn))
cm = cov(Matrix(modelData[:, [:JPY, :CNH, :SGD, :THB, :HKD, :KRW, :TWD]]))
```

The `MultivariateStats.jl` package has the functions for doing PCA and the appropriate functions for pulling out the right data after fitting the PCA model.

```
pcaRes = fit(PCA, cm; maxoutdim=3)
```

Firstly the weights of all the currencies for the three principal components.

| | PC1 Weights | PC2 Weights | PC3 Weights |
|---|---|---|---|
| JPY | 4.96845E-06 | 9.11362E-06 | -2.98467E-07 |
| CNH | 2.11372E-06 | -1.1987E-06 | -4.78571E-08 |
| SGD | 3.35545E-06 | -5.17405E-07 | -1.00414E-07 |
| THB | 3.21579E-06 | -7.50513E-07 | 3.05907E-06 |
| HKD | 4.21256E-08 | -7.74387E-08 | -1.84514E-08 |
| KRW | 7.67389E-06 | -4.39207E-06 | -8.40943E-07 |
| TWD | 2.42907E-06 | -2.01299E-06 | -6.01965E-07 |

- PC1 shows the weights for each currency but is unnormalised. The key thing we can see here is that HKD is magnitudes smaller than the others.
- PC2 is long JPY and short all the others
- PC3 is long THB and short all the others

Then the explained variance of the three components.

| | PC1 | PC2 | PC3 |
|---|---|---|---|
| Eigenvalues | 1.15544e-10 | 1.08674e-10 | 1.05292e-11 |
| Variance explained | 0.47267 | 0.444567 | 0.0430731 |
| Cumulative variance | 0.47267 | 0.917237 | 0.96031 |

The first component explains 47% of the variance, including the second component takes us to 92%, and the final component adds another 4% to reach 96% in total. This means that this dataset can be broken down quite nicely into the first two principal components, which explain most of the variation.

The first principal component is commonly called the ‘market’ portfolio and represents the overall combined market dynamics. The next portfolio (using the 2nd PC weights) is uncorrelated with the market and thus provides diversification relative to it.

In our problem then we can see that we are trying to come up with a representation of the market and use that to decide how to hedge out our currencies. So the first principal component is the most relevant.

We take these principal component weights and join them to the original dataframe to start exploring what the market portfolio looks like.

```
evFrame = DataFrame(Dict(:ccy => String.([:JPY, :CNH, :SGD, :THB, :HKD, :KRW, :TWD]),
                         :ev1 => eigvecs(pcaRes)[:,1],
                         :ev2 => eigvecs(pcaRes)[:,2]))
sort!(evFrame, :ev1)
res = leftjoin(res, dropmissing(evFrame), on = :ccy)
evFrame = sort(evFrame, :ev1);
```

Then plotting the weights by currency pair

```
bar(evFrame.ccy, evFrame.ev1 ./ sum(evFrame.ev1), label = "Eigen Weights")
```

These are the weights of the different currencies of the first eigen portfolio. This combination of currencies is what we would recommend if the client was exposed to a similar basket. The key points:

- The client is long these currencies through their business
- They short this portfolio and thus are market-neutral

We now calculate the returns of the eigen portfolios, the portfolio that only uses the largest 2 (and 3) weights.

```
evPortfolios = @combine(groupby(res, :Date),
    :ReturnEV1 = sum(:LogReturn .* :ev1) ./ sum(:ev1),
    :ReturnEV2 = sum(:LogReturn .* :ev2) ./ sum(:ev2));

ccy2Portfolio = @combine(groupby(res[in.(res.ccy, Ref(["KRW", "JPY"])), :], :Date),
    :Return2Ccy = sum(:LogReturn .* :ev1) ./ sum(:ev1));

ccy3Portfolio = @combine(groupby(res[in.(res.ccy, Ref(["KRW", "JPY", "SGD"])), :], :Date),
    :Return3Ccy = sum(:LogReturn .* :ev1) ./ sum(:ev1));
```

And plotting these returns

```
plot(market.Date, cumsum(market.LogReturn), label = "Market", color = "black", lw = 2)
plot!(evPortfolios.Date, cumsum(evPortfolios.ReturnEV1), label = "Eigen Portfolio", lw = 2)
plot!(ccy2Portfolio.Date, cumsum(ccy2Portfolio.Return2Ccy), label = "2 Ccy", lw =2)
plot!(ccy3Portfolio.Date, cumsum(ccy3Portfolio.Return3Ccy), label = "3 Ccy", lw = 2)
```

Then finally, looking at the correlation between these portfolios

| | Market Return | Market Eigen Portfolio | 2nd Eigen Portfolio | KRW + JPY | KRW + JPY + SGD |
|---|---|---|---|---|---|
| Market Return | 1.0 | 0.99 | 0.01 | 0.93 | 0.95 |
| Market Eigen Portfolio | 0.99 | 1.0 | 0.01 | 0.97 | 0.98 |
| 2nd Eigen Portfolio | 0.01 | 0.01 | 1.0 | 0.11 | 0.08 |
| KRW + JPY | 0.93 | 0.97 | 0.11 | 1.0 | 0.99 |
| KRW + JPY + SGD | 0.95 | 0.99 | 0.08 | 0.99 | 1.0 |

- Eigen portfolio 1 is the most correlated with the equal-weighted market portfolio.
- With just KRW and JPY you get to a 93% correlation with the market.
- KRW, JPY and SGD gets you to 95% correlation with the market.

As expected Eigen portfolio 2 is the most uncorrelated with the market.

So our final answer to the client would be:

- We have a proprietary portfolio (the market eigen portfolio) that you should hedge with - this will give you the best outcome.
- If you don’t want the full portfolio use a 60/40 ratio of KRW and JPY.
- SGD probably isn’t a great idea and will leave you exposed.

Now, we are assuming that these weightings are stable through time, haven’t changed recently, and are therefore valid for future returns too. We are also ignoring transaction costs: KRW is an NDF and more expensive to trade than a spot-deliverable currency (like JPY), so this approach will break down if the client needs to hedge a significant amount.
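One way to gauge how reliable those correlations are is to look at them over rolling windows rather than the full sample. A self-contained sketch with simulated returns (swap in the real SGD and market return series for the actual check):

```julia
using Statistics, Random

Random.seed!(2)
a = randn(1000)                                  # stand-in for the market returns
b = 0.7 .* a .+ sqrt(1 - 0.7^2) .* randn(1000)   # stand-in hedge, ~0.7 correlated by construction

window = 250   # roughly a trading year of daily returns
rollcor = [cor(a[i:i+window-1], b[i:i+window-1]) for i in 1:(length(a) - window + 1)]
extrema(rollcor)   # even a "stable" correlation wanders across windows
```

If the rolling correlation drifts materially across windows, the full-sample eigen weights are less trustworthy for forward-looking hedging.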

---

I’ve briefly touched on mean reversion and OU processes before in my Stat Arb - An Easy Walkthrough blog post, where we modelled the spread between an asset and its respective ETF. The whole concept of ‘mean reversion’ comes up frequently in finance and at different time scales. It can be thought of as the first basic extension of Brownian motion: instead of moving purely at random, the process now has some structure and oscillates around a constant value.

The Hudson Thames group have a similar post on OU processes (Mean-Reverting Spread Modeling: Caveats in Calibrating the OU Process) and my post should be a nice complement, with code and some extensions.

As a continuous process, we write the change in \(X_t\) as an increment in time and some noise

\[\mathrm{d}X_t = \theta (\mu - X_t) \mathrm{d}t + \sigma \mathrm{d}W_t\]The amount it changes in time depends on the previous value \(X_t\) and two free parameters, \(\mu\) and \(\theta\).

- The \(\mu\) is the long-term drift of the process
- The \(\theta\) is the mean reversion or momentum parameter depending on the sign.

If \(\theta\) is 0 we can see the equation collapses down to a simple random walk.

If we assume \(\mu = 0\), so the long-term average is 0, then a **positive** value of \(\theta\) means we see mean reversion. Large values of \(X\) mean the next change is likely to have a negative sign, leading to a smaller value in \(X\).

A **negative** value of \(\theta\) means the opposite: a large value of \(X\) generates a further large positive change and the process explodes.

If we discretise the process we can simulate some samples with different parameters to illustrate these two modes:

\[X_{t + \Delta t} = X_t + \theta (\mu - X_t) \Delta t + \sigma \sqrt{\Delta t} \, W_t,\]

where \(W_t \sim N(0,1)\).

This is easy to write out in Julia. We can save some time by drawing the random values first and then just summing everything together.

```
using Distributions, Plots

function simulate_os(theta, mu, sigma, dt, maxT, initial)
    p = Array{Float64}(undef, length(0:dt:maxT))
    p[1] = initial
    w = sigma * rand(Normal(), length(p)) * sqrt(dt)
    for i in 1:(length(p)-1)
        p[i+1] = p[i] + theta*(mu-p[i])*dt + w[i]
    end
    return p
end
```

We have two classes of OU processes we want to simulate, a mean reverting \(\theta > 0\) and a momentum version (\(\theta < 0\)) and we also want to simulate a random walk at the same time, so \(\theta = 0\). We will assume \(\mu = 0\) which keeps the pictures simple.

```
maxT = 5
dt = 1/(60*60)
vol = 0.005
initial = 0.00*rand(Normal())
p1 = simulate_os(-0.5, 0, vol, dt, maxT, initial)
p2 = simulate_os(0.5, 0, vol, dt, maxT, initial)
p3 = simulate_os(0, 0, vol, dt, maxT, initial)
plot(0:dt:maxT, p1, label = "Momentum")
plot!(0:dt:maxT, p2, label = "Mean Reversion")
plot!(0:dt:maxT, p3, label = "Random Walk")
```

The mean reversion path (orange) hasn’t moved away from the long-term average (\(\mu=0\)), and the momentum path has diverged the furthest from the starting point, which lines up with the name. The random walk sits in between the two, as we would expect.

Now we have successfully simulated the process we want to try and estimate the \(\theta\) parameter from the simulation. We have two slightly different (but similar methods) to achieve this.

When we look at the generating equation we can simply rearrange it into a linear equation.

\[\Delta X = \theta \mu \Delta t - \theta \Delta t X_t + \epsilon\]and the usual OLS equation

\[y = \alpha + \beta X + \epsilon\]such that

\[\alpha = \theta \mu \Delta t\] \[\beta = -\theta \Delta t\]where \(\epsilon\) is the noise. So we just need a DataFrame relating the difference between subsequent observations to the previous observation - just a `diff` and a shift.

```
using DataFrames, DataFramesMeta
momData = DataFrame(y=p1)
momData = @transform(momData, :diffY = [NaN; diff(:y)], :prevY = [NaN; :y[1:(end-1)]])
```

Then using the standard OLS process from the `GLM` package.

```
using GLM

mdl = lm(@formula(diffY ~ prevY), momData[2:end, :])
alpha, beta = coef(mdl)
theta = -beta / dt
mu = alpha / (theta * dt)
```

Which gives us \(\mu = 0.0075, \theta = -0.3989\), so close to zero for the drift and the reversion parameter has the correct sign.

Doing the same for the mean reversion data.

```
revData = DataFrame(y=p2)
revData = @transform(revData, :diffY = [NaN; diff(:y)], :prevY = [NaN; :y[1:(end-1)]])

mdl = lm(@formula(diffY ~ prevY), revData[2:end, :])
alpha, beta = coef(mdl)
theta = -beta / dt
mu = alpha / (theta * dt)
```

This time \(\mu = 0.001\) and \(\theta = 1.2797\). So quite a way off from the true value of \(0.5\), but at least with the correct sign.

It could be that we need more data, so we use the bootstrap to randomly sample from the population to give us pseudo-new draws. We use the DataFrames again and pull random rows with replacement to build out the data set. We do this sampling 1000 times.

```
using StatsBase   # provides `sample`

res = zeros(1000)
for i in 1:1000
    mdl = lm(@formula(diffY ~ prevY + 0), momData[sample(2:nrow(momData), nrow(momData), replace=true), :])
    res[i] = -first(coef(mdl)/dt)
end
bootMom = histogram(res, label = :none, title = "Momentum", color = "#7570b3")
bootMom = vline!(bootMom, [-0.5], label = "Truth", lw = 2)
bootMom = vline!(bootMom, [0.0], label = :none, color = "black")
```

We then do the same for the reversion data.

```
res = zeros(1000)
for i in 1:1000
    mdl = lm(@formula(diffY ~ prevY + 0), revData[sample(2:nrow(revData), nrow(revData), replace=true), :])
    res[i] = first(-coef(mdl)/dt)
end
bootRev = histogram(res, label = :none, title = "Reversion", color = "#1b9e77")
bootRev = vline!(bootRev, [0.5], label = "Truth", lw = 2)
bootRev = vline!(bootRev, [0.0], label = :none, color = "black")
```

Then combining both the graphs into one plot.

```
plot(bootMom, bootRev,
    layout=(2,1), dpi=900, size=(800, 300),
    background_color=:transparent, foreground_color=:black,
    link=:all)
```

The momentum bootstrap has worked and centred around the correct value, but the same cannot be said for the reversion plot. However, it has correctly guessed the sign.

If we continue assuming that \(\mu = 0\) then we can simplify the OLS to a 1-parameter regression - OLS without an intercept. From the generating process, we can see that this is an AR(1) process - each observation depends on the previous observation by some amount.

\[\phi = \frac{\sum _i X_i X_{i-1}}{\sum _i X_{i-1}^2},\]which is the least-squares estimate of the AR(1) coefficient. From the discretisation, \(X_i = (1 - \theta \Delta t) X_{i-1} + \epsilon_i\), and since \(1 - \theta \Delta t \approx e^{-\theta \Delta t}\), the reversion parameter is calculated as

\[\theta = - \frac{\log \phi}{\Delta t}.\]This gives us a simple equation to calculate \(\theta\).

For the momentum sample:

```
phi = sum(p1[2:end] .* p1[1:(end-1)]) / sum(p1[1:(end-1)] .^2)
-log(phi)/dt
```

Gives \(\theta = -0.50184\), so very close to the true value.

For the reversion sample

```
phi = sum(p2[2:end] .* p2[1:(end-1)]) / sum(p2[1:(end-1)] .^2)
-log(phi)/dt
```

Gives \(\theta = 1.26\), so correct sign, but quite a way off.

Finally, for the random walk

```
phi = sum(p3[2:end] .* p3[1:(end-1)]) / sum(p3[1:(end-1)] .^2)
-log(phi)/dt
```

Produces \(\theta = -0.027\), so quite close to zero.

Again, values are similar to what we expect, so our estimation process appears to be working.

If you aren’t convinced I don’t blame you. Those point estimates above are nowhere near the actual values that simulated the data so it’s hard to believe the estimation method is working. Instead, what we need to do is repeat the process and generate many more price paths and estimate the parameters of each one.

To make things a bit more manageable code-wise, I’m going to introduce a `struct` that contains the parameters and allows us to simulate and estimate in a more contained manner.

```
struct OUProcess
    theta
    mu
    sigma
    dt
    maxT
    initial
end
```

We now write specific functions for this object and this allows us to simplify the code slightly.

```
function simulate(ou::OUProcess)
    simulate_os(ou.theta, ou.mu, ou.sigma, ou.dt, ou.maxT, ou.initial)
end

function estimate(ou::OUProcess)
    p = simulate(ou)
    phi = sum(p[2:end] .* p[1:(end-1)]) / sum(p[1:(end-1)] .^ 2)
    -log(phi) / ou.dt
end

function estimate(ou::OUProcess, N)
    res = zeros(N)
    for i in 1:N
        res[i] = estimate(ou)   # each call simulates a fresh path
    end
    res
end
```

We use these new functions to draw from the process 1,000 times and sample the parameters for each one, collecting the results as an array.

```
ou = OUProcess(0.5, 0.0, vol, dt, maxT, initial)
revPlot = histogram(estimate(ou, 1000), label = :none, title = "Reversion")
vline!(revPlot, [0.5], label = :none);
```

And the same for the momentum OU process

```
ou = OUProcess(-0.5, 0.0, vol, dt, maxT, initial)
momPlot = histogram(estimate(ou, 1000), label = :none, title = "Momentum")
vline!(momPlot, [-0.5], label = :none);
```

Plotting the distribution of the results gives us a decent understanding of how varied the samples can be.

```
plot(revPlot, momPlot, layout = (2,1), link=:all)
```

We can see the heavy-tailed nature of the estimation process, but thankfully the histograms are centred around the correct number. This goes to show how difficult it is to estimate the mean reversion parameter even in this simple setup. So for a real dataset, you need to work out how to collect more samples or radically adjust how accurate you think your estimate is.

We have progressed from simulating an Ornstein-Uhlenbeck process to estimating its parameters using various methods. We attempted to enhance the accuracy of the estimates through bootstrapping, but we discovered that the best approach to improve the estimation is to have multiple samples.

So if you are trying to fit this type of process on some real world data, be it the spread between two stocks (Statistical Arbitrage in the U.S. Equities Market), client flow (Unwinding Stochastic Order Flow: When to Warehouse Trades) or anything else you believe might be mean reverting, then understand how much data you might need to accurately model the process.

---

In this post, I’ll go through what skew is, how it can be used as a trading strategy, and backtest the portfolio across different asset classes. We will then see if it produces any alpha (\(\alpha\)) or if skew is just market beta (\(\beta\)). I’ll then take a deeper dive into the equity performance and how it compares to the typical factors.

I’ll be working through everything in Julia (1.9) and pulling daily data from AlpacaMarkets.

```
using AlpacaMarkets, Dates, CSV, DataFrames, DataFramesMeta, RollingFunctions
using Plots, StatsBase
using Distributions

function parse_date(t)
    Date(string(split(t, "T")[1]))
end

function clean(df, x)
    df = @transform(df, :Date = parse_date.(:t),
        :Ticker = x, :NextOpen = [:o[2:end]; NaN], :LogReturn = [NaN; diff(log.(:c))])
    @select(df, :Date, :Ticker, :c, :o, :NextOpen, :LogReturn)
end

function load(etf)
    df = AlpacaMarkets.stock_bars(etf, "1Day"; startTime = now() - Year(10), limit = 10000, adjustment = "all")[1]
    clean(df, etf)
end
```

Skew (or skewness) measures how symmetric the distribution is around the mean value. A distribution of values with more values to the right of the mean is a positively skewed distribution and vice versa for the left of the mean.

We can demonstrate this by generating some random values from a skewed distribution (lognormal) and unskewed (normal).
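The original plot isn't reproduced here; a minimal stand-in just computes the sample skewness of each set of draws (using `exp` of normal draws for the lognormal):

```julia
using Statistics, Random

Random.seed!(3)
skew(x) = mean(((x .- mean(x)) ./ std(x)) .^ 3)   # sample skewness

sym = randn(100_000)                 # normal draws: skew ≈ 0
pos = exp.(0.5 .* randn(100_000))    # lognormal draws: positive skew
neg = -pos                           # flipping the sign flips the skew

round.(skew.([sym, pos, neg]); digits = 2)
```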

Which shows the general tilt in the x-axis across the 3 different distributions.

Skew is weird in the sense that there isn’t a single way to calculate how skewed a distribution is. For the distributions above we can calculate the analytical values of skewness: zero for the middle (normal) graph and positive, as expected, for the right-hand (lognormal) graph. Flipping the sign of the lognormal draws gives the left-hand graph its negative skew.

```
skewness.([Normal(1,1), LogNormal(0, 0.5)])
```

```
2-element Vector{Float64}:
0.0
1.7501896550697178
```

In the paper, the skew of an asset is calculated as

\[S = \frac{1}{N} \sum _{i=1} ^N \frac{(r_i - \mu ) ^3}{\sigma ^3},\]where \(\mu\) is the average and \(\sigma ^2\) is the variance of the returns of an asset over a lookback window of \(N\) days. We can look at the skewness of the SPY ETF over a 256-day rolling window using the `RollingFunctions` package.

```
spy = load("SPY")
spy = @transform(spy, :Avg = runmean(:LogReturn, 256), :Dev = runstd(:LogReturn, 256))
spy = @transform(spy, :SkewDay = ((:LogReturn .- :Avg) ./ :Dev) .^3)
spy = @transform(spy, :Skew = runmean(:SkewDay, 256))
spy = @subset(spy, .!isnan.(:Skew))
plot(spy.Date, spy.Skew, label = "SPY Skew", dpi=900, size=(800, 200))
hline!([0], color="black", label = :none)
```

It’s jumpy, but the jumps make sense: it’s a cubed calculation, so large values are amplified. SPY became very negatively skewed over COVID-19 as the market corrections led to large down days. More recently it has become positively skewed as we’ve seen some larger positive returns.

The paper believes that skew can predict future returns and that we want to be long assets with a negative skew and short assets with a positive skew. This gives it a ‘mean reversion’ explanation for future returns, so over COVID-19 when there were lots of down days, we should be buying because the movement is likely to be overblown and the market will correct higher. Likewise, large jumps up mean that it’s a positive move that is overblown and will come back down. So again, looking at the skew of SPY in recent weeks, the skew is positive therefore we would be inclined to short this ETF.

The overall strategy looks at **cross-sectional skew**: how skewed an asset is relative to its peers, rather than the raw skew number on a given day. The paper looks at equity indexes across countries, bond futures across different countries, different currencies, and commodities. In our replication, we are going to use different ETFs that cover similar themes and should capture a broad cross-section of finance.
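As a sketch of what 'cross-sectional' means in practice (the skew values here are made up, not output from the data above): each day we rank assets by their skew, go long the most negative and short the most positive.

```julia
# Hypothetical one-day snapshot of skew values across a few assets.
skews = Dict("SPY" => 0.8, "GLD" => -0.3, "AGG" => 0.1, "USO" => -0.9)

ranked = sort(collect(skews); by = last)   # most negative skew first
long_leg = first.(ranked[1:2])             # buy the most negatively skewed
short_leg = first.(ranked[end-1:end])      # sell the most positively skewed
```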

The original paper uses futures data from 1990 up to 2017 to run the backtest. I will instead be using different ETFs and a much shorter timescale, just because that’s all the data I have available from my free AlpacaMarkets account using AlpacaMarkets.jl.

Blackrock is nice enough to publish this document for their different equity funds across the globe, Around the World with iShares Country ETFs, which I use to get the different country equity performance plus some broader indexes.

For the fixed income part I just try and take a cross-section of the different types of fixed income instruments available and different durations, mixing long-term, short-term, government, corporates, etc.

Commodities, again, just trying to get a broad mix, and the Other class is mainly real estate and whatever other cruft comes up on the ETF database website. Finally, the currency ETFs each represent a different currency, so they cover that part of the paper.

```
universe = [("Equity", ["SPY", "EWU", "EWJ", "INDA", "EWG", "EWL", "EWP", "EWQ",
                        "VTI", "FXI", "EWZ", "EWY", "EWA", "EWC",
                        "EWH", "EWI", "EWN", "EWD", "EWT", "EZA", "EWW", "ENOR", "EDEN", "TUR"]),
            ("FI", ["AGG", "TLT", "LQD", "JNK", "MUB", "MBB", "IAGG", "IGOV", "EMB", "BND", "BNDX", "VCIT", "VCSH", "BSV", "SRLN"]),
            ("Commodities", ["GLD", "SLV", "GSG", "USO", "PPLT", "UNG", "DBA"]),
            ("Other", ["IYR", "REET", "USRT", "ICF", "VNQ"]),
            ("Ccy", ["UUP", "FXY", "FXE", "FXF", "FXB", "FXA", "FXC"])]
```

We iterate through all the asset classes and pull as much daily data as possible.

```
allDataRaw = Array{DataFrame}(undef, length(universe))
for (j, (assetClass, etfs)) in enumerate(universe)
    println(assetClass)
    resdf = Array{DataFrame}(undef, length(etfs))
    for (i, etf) in enumerate(etfs)
        resdf[i] = load(etf)  # pull the full daily history for this ticker
    end
    resdfC = vcat(resdf...)
    resdfC.AssetClass .= assetClass
    allDataRaw[j] = resdfC
end
allData = vcat(allDataRaw...);
```

We then add in the average \(\mu\) and standard deviation \(\sigma\), calculate the skew value for each day, and take the rolling average to arrive at the overall skew measure. We need to group by each ETF (the `Ticker` column).

```
allData = groupby(allData, :Ticker)
allData = @transform(allData, :Avg = runmean(:LogReturn, 256), :Dev = runstd(:LogReturn, 256))
allData = @transform(allData, :SkewDay = ((:LogReturn .- :Avg) ./ :Dev) .^3)
allData = @transform(allData, :Skew = runmean(:SkewDay, 256))
allData = @subset(allData, .!isnan.(:Skew));
```

To check we’ve pulled the right data we plot the cumulative log returns.

```
plot(allData[allData.Ticker .== "SPY", :].Date, cumsum(allData[allData.Ticker .== "SPY", :].LogReturn), label = "SPY",
title="Returns", dpi=900, size=(800, 200))
plot!(allData[allData.Ticker .== "GLD", :].Date, cumsum(allData[allData.Ticker .== "GLD", :].LogReturn), label = "GLD")
plot!(allData[allData.Ticker .== "AGG", :].Date, cumsum(allData[allData.Ticker .== "AGG", :].LogReturn), label = "AGG")
```

Everything looks as we would expect. We can now look at the skew for these three assets.

The skews move differently and with different magnitudes; notably, GLD has the least variable skew, while equities and bonds follow a similar pattern.
The paper looks at the skew of the asset on the last day of the month and uses that to rebalance the portfolio, so with a `groupby` and `last` we can pull the skew value on the last day of each month.

We need to avoid the look-ahead bias in the backtest. The portfolio weight is calculated using the last day of the month, so we observe the closing price and use that to calculate the return and update the parameters - average return, volatility, and finally the skew. This skew then goes into the weighting calculation *but* it is only active on the next working day, otherwise, we are getting a ‘free’ day of return.

So on the 31st of January, we update the weights and then rebalance on the 1st of February (assuming that’s a working day). There is also the additional cost of trading into the position; for now we assume we can trade at the previous closing price, but that is a problem to solve for another day.

```
allData = @transform(allData, :Month = floor.(:Date, Month(1)), :Week = floor.(:Date, Week(1)));
allData = @transform(groupby(allData, :Ticker), :NextDay = [:Date[2:end]; Date(2015)])
monthlyVals = @combine(groupby(allData, [:Month, :AssetClass, :Ticker]),
                       :Date = last(:Date), :NextDate = last(:NextDay),
                       :EOMSkew = last(:Skew));
```
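As a sanity check on the shift logic, here is a toy version of the `:NextDay` construction on four hypothetical dates. The trailing `Date(2015)` is just a sentinel padding the final row so the vector lengths match:

```julia
using Dates

dates = [Date(2024, 1, 29), Date(2024, 1, 30), Date(2024, 1, 31), Date(2024, 2, 1)]
# Shift the date vector forward by one observation, padding the
# last entry with a sentinel so the lengths still line up.
nextday = [dates[2:end]; Date(2015)]
# The weight computed on 2024-01-31 only becomes active on 2024-02-01,
# which is what removes the look-ahead bias.
```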

We rank each asset in its respective asset class using the negative of the skew value, so the most positive skew gets the lowest rank and the most negative skew gets the highest rank. We also normalise the ranks by the number of assets in the group.

To come up with the portfolio weight, we want all the long positions (positive ranks) to have a total weighting of 1 and short positions (negative ranks) to have a total weighting of -1. This corresponds to being long 1 dollar and short 1 dollar so self-financed overall.

```
monthlyVals = groupby(monthlyVals, [:Date, :AssetClass])
monthlyVals = @transform(monthlyVals, :SkewWeightRaw = ordinalrank(-1*:EOMSkew) .- ((length(:EOMSkew) + 1) /2))
monthlyVals = groupby(monthlyVals, [:Date, :AssetClass])
monthlyVals = @transform(monthlyVals, :SkewWeight = :SkewWeightRaw ./ sum(1:maximum(:SkewWeightRaw)))
```
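As a quick check of the normalisation, we can run the calculation by hand on seven example skew values (these match the commodity example): the raw ranks come out as -3 to 3, dividing by sum(1:3) = 6 gives weights between -0.5 and 0.5, and the long side sums to 1. This assumes `ordinalrank` from StatsBase.jl, as in the strategy code:

```julia
using StatsBase

skews = [0.23, 0.02, -0.04, -0.07, -0.12, -0.16, -0.19]    # example EOM skews
raw = ordinalrank(-1 .* skews) .- (length(skews) + 1) / 2  # raw ranks: -3.0 ... 3.0
w = raw ./ sum(1:Int(maximum(raw)))                        # divide by 1 + 2 + 3 = 6
sum(w[w .> 0])  # long side ≈ 1.0
sum(w[w .< 0])  # short side ≈ -1.0
```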

As an example, we can look at the commodity ETFs, their latest skew values, and how those translate into portfolio weights.

| Date | Asset Class | Ticker | EOM Skew | SkewWeightRaw | Skew Weight |
|---|---|---|---|---|---|
| 2024-02-07 | Commodities | GLD | 0.23 | -3 | -0.5 |
| 2024-02-07 | Commodities | SLV | 0.02 | -2 | -0.333 |
| 2024-02-07 | Commodities | DBA | -0.04 | -1 | -0.167 |
| 2024-02-07 | Commodities | PPLT | -0.07 | 0 | 0 |
| 2024-02-07 | Commodities | GSG | -0.12 | 1 | 0.167 |
| 2024-02-07 | Commodities | UNG | -0.16 | 2 | 0.333 |
| 2024-02-07 | Commodities | USO | -0.19 | 3 | 0.5 |

The most negatively skewed ETF, USO, gets the highest positive weight and vice versa. We can also look at the weights over the period for the three example assets.

The portfolio weights for both SPY and AGG show that over the last two months we have been short SPY with no position in AGG. GLD has been allocated in the opposite direction to the other two; right now we are short GLD.

We join the weights to the original dataframe and forward fill the weightings to look at the daily performance. I pulled a forward-fill function from https://hongtaoh.com/en/2021/06/27/julia-ffill/; joining the portfolio weights to the daily returns lets us track the daily changes in the portfolios.

```
ffill(v) = v[accumulate(max, [i*!ismissing(v[i]) for i in 1:length(v)], init=1)]
weightings = @select(monthlyVals, :NextDate, :Ticker, :SkewWeight)
rename!(weightings,:NextDate => :Date)
allDataWeights = leftjoin(allData, weightings, on=[:Date, :Ticker]);
allDataWeights = sort(allDataWeights, :Date)
allDataWeights = @transform(groupby(allDataWeights, :Ticker), :SkewWeight2 = ffill(:SkewWeight));
```
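The `ffill` one-liner works by building, for each position, the index of the most recent non-missing entry (via `accumulate(max, …)`) and then indexing back into the vector; any leading `missing`s stay missing. A small example:

```julia
ffill(v) = v[accumulate(max, [i * !ismissing(v[i]) for i in 1:length(v)], init=1)]

v = [missing, 0.5, missing, missing, -0.2, missing]
# [i * !ismissing(v[i]) ...] gives [0, 2, 0, 0, 5, 0];
# the running max (init 1) gives indices [1, 2, 2, 2, 5, 5].
filled = ffill(v)  # [missing, 0.5, 0.5, 0.5, -0.2, -0.2]
```

In our case this carries each month-end weight forward across the daily rows until the next rebalance.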

Plotting the resulting portfolios gives us an idea of their performance.

```
assetPortfolios = dropmissing(@combine(groupby(allDataWeights, [:Date, :AssetClass]),
                                       :PortfolioReturn = sum(:SkewWeight2 .* :LogReturn),
                                       :MktReturn = mean(:LogReturn)))
p = plot(title = "Skew Portfolios")
for ac in unique(assetPortfolios.AssetClass)
    plot!(p, assetPortfolios[assetPortfolios.AssetClass .== ac, :].Date,
          cumsum(assetPortfolios[assetPortfolios.AssetClass .== ac, :].PortfolioReturn), label = ac)
end
hline!([0], color = "black", label = :none)
p
```

These are the results for each asset class. Interestingly, all of them (except Other) have a positive return as of February and most have never fallen below their starting returns. Commodities are very volatile and swung back and forth quite dramatically, equities have been one-way traffic in the right direction!

We also want to combine all the asset classes into a single portfolio, but first have to normalise the returns by their rolling volatility, scaling each to a 10% volatility target, so that the asset classes are equally weighted on a risk basis.

```
assetPortfolios = @transform(groupby(assetPortfolios, :AssetClass), :Vol = sqrt.(runvar(:PortfolioReturn, 256)))
assetPortfolios = @transform(groupby(assetPortfolios, :AssetClass),
                             :NormReturn = 0.1 * :PortfolioReturn ./ :Vol,
                             :NormMarketReturn = 0.1 * :MktReturn ./ :Vol)
gcf = @combine(groupby(assetPortfolios, :Date), :Return = mean(:NormReturn), :MktReturn = mean(:NormMarketReturn));
plot(gcf.Date[2:end], cumsum(gcf.Return[2:end]), label = "Global Skew Factor", title = "Global Portfolio")
plot!(gcf.Date[2:end], cumsum(gcf.MktReturn[2:end]), label = "Global Market Return")
hline!([0], color = "black", label = :none)
```

Again, a positive result, well at least recently. This indicates that skew has some associated premium. Now we want to see if this is alpha or beta.

It’s great that these portfolios both at an asset level and global level have ended up in the green but we want to compare the performance to the general market and see if it’s riding the market or adding something new.

This is simple enough to compare, we can look at the equal-weighted return of all the assets in the group and see how that ended up.

Again, all of the skew portfolios have outperformed the market portfolio (except the Other asset class), so this is a good indication that the skew strategy is adding something new.

A more systematic approach is to regress the portfolio return against the market return and this will give us a measure of the \(\alpha\) and \(\beta\) of the strategy.

\[\text{Skew Return} = \alpha + \beta \cdot \text{Market Return}\]

```
using GLM
for ac in unique(assetPortfolios.AssetClass)
    ols = lm(@formula(PortfolioReturn ~ MktReturn), assetPortfolios[assetPortfolios.AssetClass .== ac, :])
    println(ac)
    println(coeftable(ols))
    println(r2(ols))
end
```

| Asset Class | \(\alpha\) | \(p\) value | \(\beta\) | \(p\) value | \(R^2\) |
|---|---|---|---|---|---|
| Equity | 0.0003 | 0.0544 | -0.01 | 0.4465 | 0.0003 |
| FI | 0.0001 | 0.1796 | -0.05 | 0.0728 | 0.002 |
| Commodities | 0.0004 | 0.4799 | 0.113 | 0.0232 | 0.003 |
| Other | -0.00004 | 0.5845 | 0.007 | 0.1690 | 0.001 |
| Ccy | 0.0001 | 0.3622 | 0.498 | <1e-27 | 0.08 |

The first thing to note is the low \(R^2\) values across the board, which is to be expected in these types of models. Generally, the \(\alpha\)’s are all statistically insignificant, with only the equity portfolio getting close to significance, which indicates that the skew factor isn’t providing ‘new returns’. Interestingly, only commodities and currencies have a statistically significant \(\beta\), which means that for the other asset classes the modelling is essentially noise. So whilst the lack of \(\alpha\) is a problem, the lack of \(\beta\) sort of makes up for it. Overall I think this is a promising sign that there is perhaps something more to be done.

An equity fund manager who wants to allocate to skew also needs to verify that skew is providing something unique and not a repackaging of momentum/value/growth/carry factors. This is easy enough to test, as there are ETFs that represent these factors, so we just include them in the regression.

```
mtum = load("MTUM") #momentum
vtv = load("VTV") #value
vug = load("VUG") #growth
cry = load("VIG") #carry
equityFactors = vcat([mtum, vtv, vug, cry]...);
```

Joining these with the equity data gives us a bigger dataset to construct the OLS regression.

```
equity = assetPortfolios[assetPortfolios.AssetClass .== "Equity", :]
equity = leftjoin(equity,
                  unstack(@select(equityFactors, :Date, :Ticker, :LogReturn), :Date, :Ticker, :LogReturn),
                  on = "Date")
coeftable(lm(@formula(PortfolioReturn ~ MktReturn + MTUM + VTV + VUG + VIG),
             equity))
```

| | Coef. | Std. Error | t | Pr(>\(\mid t \mid\)) | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| (Intercept) | 0.000280318 | 0.000180867 | 1.55 | 0.1214 | -7.44597e-5 | 0.000635095 |
| MktReturn | -0.300453 | 0.0312806 | -9.61 | <1e-20 | -0.361811 | -0.239094 |
| MTUM | -0.0881885 | 0.0305466 | -2.89 | 0.0039 | -0.148107 | -0.0282701 |
| VTV | 0.450562 | 0.0614928 | 7.33 | <1e-12 | 0.329942 | 0.571183 |
| VUG | 0.109752 | 0.0358138 | 3.06 | 0.0022 | 0.0395015 | 0.180002 |
| VIG | -0.140079 | 0.0739041 | -1.90 | 0.0582 | -0.285045 | 0.00488637 |

Again, no \(\alpha\), a significant market \(\beta\), and significant momentum, value, and growth coefficients, but no significance for carry. This isn’t great for the skew factor, as the regression suggests we can replicate it using the other factors: it’s anti-correlated with the market and momentum and correlated with value and growth. Given it’s a mean-reversion-esque strategy this makes sense, as value is generally about finding underpriced assets.

This has been a successful replication of the original paper, using ETFs across different asset classes to explore skew. We now understand that skew is a measure of how left- or right-tailed a distribution is, and how it can be exploited in a trading strategy. By calculating skew across different assets and ranking it within asset-class groups, we allocate long positions to the most negatively skewed assets and short positions to the most positively skewed. This portfolio produced a positive return in equities, fixed income, currencies, and commodities (but not Other), and outperformed the market portfolio. A global skew portfolio was also constructed by scaling each asset class to 10% volatility and combining the returns, which also outperformed the market.

The Other asset class was the only sector where skew didn’t work, so it would be hurting the overall skew portfolio; going forward we would know to restrict the universe to equities, fixed income, currencies, and commodities.

However, when we regressed the portfolio return onto the market returns, we found no statistically significant alphas and significant betas. The equity portfolio was close to having a significant alpha, but given it had the largest number of underlying assets, it could be a function of asset size.

We have neglected the trading costs and potential capacity of the overall strategy, but given its low turnover (weights only updating every month), this is probably safe to ignore until you hit the super asset manager size.

Although the results are not as conclusive as the original paper, they are on a shorter timescale and smaller universe, and do not contradict the original findings. We have shown that skew is out there and can provide a source of returns.

Going forward, refining the calculation of the skew and tuning the lookback windows might improve the results. Also, expanding the universe into more specific funds could provide better insights. At the moment, the fixed income component is too broad to pick up on the skew changes.


Regularisation is normally taught as a method to reduce overfitting: you have a big model and you make it smaller by shrinking some of the factors. Work by Janzing (papers below) argues that this can help produce better causal models too, and in this blog post I will work through two papers to try and understand the process better.

I’ll work off two main papers for causal regularisation:

In truth, I am working backward. I first encountered causal regularisation in Better AB Testing via Causal Regularisation, which uses causal regularisation to produce better estimates by combining a biased and an unbiased dataset. I want to take a step back and understand causal regularisation from the original papers. Using free data from the UCI Machine Learning Repository we can attempt to replicate the methods from the papers and see how causal regularisation works to produce better **causal** models.

As ever, I’m in Julia (1.9), so fire up that notebook and follow along.

```
using CSV, DataFrames, DataFramesMeta
using Plots
using GLM, Statistics
```

The `wine-quality` dataset from the UCI repository provides measurements of the chemical properties of wine and a quality rating from someone drinking the wine. It’s a simple CSV file that you can download (winequality) and load with minimal data wrangling needed.

We will be working with the red wine data set as that’s what both Janzing papers use.

```
rawData = CSV.read("wine+quality/winequality-red.csv", DataFrame)
first(rawData)
```

APD! Always Plotting the Data to make sure the values are something you expect. Sometimes you need a visual confirmation that things line up with what you believe.

```
plot(scatter(rawData.alcohol, rawData.quality, title = "Alcohol", label = :none, color="#eac435"),
scatter(rawData.pH, rawData.quality, title = "pH", label = :none, color="#345995"),
scatter(rawData.sulphates, rawData.quality, title= "Sulphates", label = :none, color="#E40066"),
scatter(rawData.density, rawData.quality, title = "Density", label = :none, color="#03CEA4"), ylabel = "Quality")
```

By choosing four of the variables randomly we can see that some are correlated with quality and some are not.

A loose goal is to come up with a causal model that can explain the quality of the wine using the provided factors. We will change the data slightly to highlight how causal regularisation helps, but for now, let’s start with the simple OLS model.

In the paper they normalise the variables to be unit variance, so we divide by the standard deviation. We then model the quality of the wine using all the available variables.

```
vars = names(rawData, Not(:quality))
cleanData = deepcopy(rawData)
for var in vars
    cleanData[!, var] = cleanData[!, var] ./ std(cleanData[!, var])  # scale to unit variance
end
cleanData[!, :quality] .= Float64.(cleanData[!, :quality])
ols = lm(term(:quality) ~ sum(term.(Symbol.(vars))), cleanData)
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
quality ~ 1 + fixed acidity + volatile acidity + citric acid + residual sugar + chlorides + free sulfur dioxide + total sulfur dioxide + density + pH + sulphates + alcohol
Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept) 21.9652 21.1946 1.04 0.3002 -19.6071 63.5375
fixed acidity 0.043511 0.0451788 0.96 0.3357 -0.0451055 0.132127
volatile acidity -0.194027 0.0216844 -8.95 <1e-18 -0.23656 -0.151494
citric acid -0.0355637 0.0286701 -1.24 0.2150 -0.0917989 0.0206716
residual sugar 0.0230259 0.0211519 1.09 0.2765 -0.0184626 0.0645145
chlorides -0.088211 0.0197337 -4.47 <1e-05 -0.126918 -0.0495041
free sulfur dioxide 0.0456202 0.0227121 2.01 0.0447 0.00107145 0.090169
total sulfur dioxide -0.107389 0.0239718 -4.48 <1e-05 -0.154409 -0.0603698
density -0.0337477 0.0408289 -0.83 0.4086 -0.113832 0.0463365
pH -0.0638624 0.02958 -2.16 0.0310 -0.121883 -0.00584239
sulphates 0.155325 0.019381 8.01 <1e-14 0.11731 0.19334
alcohol 0.294335 0.0282227 10.43 <1e-23 0.238977 0.349693
────────────────────────────────────────────────────────────────────────────────────────
```

The dominant factor is the `alcohol` amount, which is the strongest variable in predicting the quality, i.e. higher quality has a higher alcohol content. We also note that 5 out of the 12 variables are deemed insignificant at the 5% level. We save these parameters and then look at the regression without the `alcohol` variable.

```
olsParams = DataFrame(Dict(zip(vars, coef(ols)[2:end])))
olsParams[!, :Model] .= "OLS"
olsParams
```

1×12 DataFrame

| Row | alcohol | chlorides | citric acid | density | fixed acidity | free sulfur dioxide | pH | residual sugar | sulphates | total sulfur dioxide | volatile acidity | Model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.294335 | -0.088211 | -0.0355637 | -0.0337477 | 0.043511 | 0.0456202 | -0.0638624 | 0.0230259 | 0.155325 | -0.107389 | -0.194027 | OLS |

```
cleanDataConfounded = select(cleanData, Not(:alcohol))
vars = names(cleanDataConfounded, Not(:quality))
confoundOLS = lm(term(:quality) ~ sum(term.(Symbol.(vars))), cleanDataConfounded)
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
quality ~ 1 + fixed acidity + volatile acidity + citric acid + residual sugar + chlorides + free sulfur dioxide + total sulfur dioxide + density + pH + sulphates
Coefficients:
───────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────────────────────
(Intercept) 189.679 14.2665 13.30 <1e-37 161.696 217.662
fixed acidity 0.299551 0.0391918 7.64 <1e-13 0.222678 0.376424
volatile acidity -0.176182 0.0223382 -7.89 <1e-14 -0.219997 -0.132366
citric acid 0.00912711 0.0292941 0.31 0.7554 -0.0483321 0.0665863
residual sugar 0.133781 0.0189031 7.08 <1e-11 0.0967031 0.170858
chlorides -0.107215 0.0203052 -5.28 <1e-06 -0.147043 -0.0673877
free sulfur dioxide 0.0394281 0.023462 1.68 0.0931 -0.00659172 0.0854479
total sulfur dioxide -0.128248 0.0246854 -5.20 <1e-06 -0.176668 -0.0798287
density -0.355576 0.0276265 -12.87 <1e-35 -0.409765 -0.301388
pH 0.0965662 0.0261087 3.70 0.0002 0.0453551 0.147777
sulphates 0.213697 0.0191745 11.14 <1e-27 0.176087 0.251307
───────────────────────────────────────────────────────────────────────────────────────────
```

`citric acid` and `free sulfur dioxide` are now the only insignificant variables; the rest are believed to contribute to the quality. This means we are experiencing *confounding*: `alcohol` is the better explainer, but its effect is now hiding behind these other variables.

**Confounding** - when a variable influences both the other variables and the outcome at the same time, leading to an incorrect view of the correlation between those variables and the outcome.
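A quick simulation makes this concrete. Below, a hidden variable `z` (think alcohol) drives both an observed variable `x` and the outcome `y`; `x` has no direct effect on `y`, yet the two end up strongly correlated. The coefficients and noise levels are made up purely for illustration:

```julia
using Random, Statistics

Random.seed!(42)
n = 10_000
z = randn(n)              # hidden confounder (think: alcohol)
x = 0.8 .* z .+ randn(n)  # an observed variable driven by z
y = 2.0 .* z .+ randn(n)  # the outcome, also driven by z

# x has no direct effect on y, yet they are correlated
# because both share the common cause z. A regression of y on x
# alone would wrongly attribute z's effect to x.
cor(x, y)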

This regression, after dropping the `alcohol` variable, is incorrect and provides the wrong causal conclusion. So can we do better and get closer to the true regression coefficients using some regularisation methods?

For now, we save these incorrect parameters and explore the causal regularisation methods.

```
olsParamsConf = DataFrame(Dict(zip(vars, coef(confoundOLS)[2:end])))
olsParamsConf[!, :Model] .= "OLS No Alcohol"
olsParamsConf[!, :alcohol] .= NaN
olsParamsConf
```

1×12 DataFrame

| Row | chlorides | citric acid | density | fixed acidity | free sulfur dioxide | pH | residual sugar | sulphates | total sulfur dioxide | volatile acidity | Model | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.107215 | 0.00912711 | -0.355576 | 0.299551 | 0.0394281 | 0.0965662 | 0.133781 | 0.213697 | -0.128248 | -0.176182 | OLS No Alcohol | NaN |

Some maths. Regression is taking our variables \(X\) and finding the parameters \(a\) that get us closest to \(Y\).

\[Y = a X\]

\(X\) is a matrix and \(a\) is a vector. When we fit this to some data, the values of \(a\) are free to converge to any value they want, so long as the fit gets close to the outcome variable. This means we are minimising the difference between \(Y\) and \(aX\):

\[||Y - a X|| ^2.\]

Regularisation is the act of restricting the values \(a\) can take.

For example, we can make the sum of the absolute values of the \(a\)’s equal to a constant (\(L_1\) regularisation), or the sum of the squares of the \(a\) values equal to a constant (\(L_2\) regularisation). In simpler terms, if we want to increase the coefficient of one parameter, we need to reduce that of another. Think of there being a finite amount of mass that we can allocate to the parameters: they can’t take on whatever value they like, but instead have to share it amongst themselves. This helps reduce overfitting as it constrains how much influence any one parameter can have, so the final result should converge to a model that doesn’t overfit.

In ridge regression we are minimising the \(L_2\) norm, so restricting the sum of the square of the \(a\)’s and at the same time minimising the original OLS regression.

\[||Y - a X|| ^2 + \lambda || a || ^2.\]

So we can see how regularisation is an additional component on top of the OLS objective. \(\lambda\) is a hyperparameter, just a number, that controls how much restriction we place on the \(a\) values.
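For intuition, ridge regression also has a well-known closed-form solution, \(a = (X^\top X + \lambda I)^{-1} X^\top Y\), and as \(\lambda\) grows the coefficients shrink towards zero. A minimal sketch on simulated data (the coefficients and noise here are made up):

```julia
using LinearAlgebra, Random

Random.seed!(1)
n, p = 500, 3
X = randn(n, p)
a_true = [1.0, -0.5, 0.25]
Y = X * a_true + 0.1 * randn(n)

# Closed-form ridge solution: (X'X + lambda * I)^-1 X'Y
ridge(X, Y, lambda) = (X'X + lambda * I) \ (X'Y)

a0 = ridge(X, Y, 0.0)      # lambda = 0 recovers the OLS estimate
a100 = ridge(X, Y, 100.0)  # a heavy penalty shrinks the coefficients
norm(a100) < norm(a0)      # the penalised fit has the smaller norm
```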

To do ridge regression in Julia I’ll be leaning on the MLJ.jl framework and using that to build out the learning machines.

```
using MLJ
@load RidgeRegressor pkg=MLJLinearModels
```

We will take the confounded dataset (so the data where the alcohol column is deleted), partition it into train and test sets, and get started with some regularisation.

```
y, X = unpack(cleanDataConfounded, ==(:quality); rng=123);
train, test = partition(eachindex(y), 0.7, shuffle=true)
mdl = MLJLinearModels.RidgeRegressor()
```

```
RidgeRegressor(
lambda = 1.0,
fit_intercept = true,
penalize_intercept = false,
scale_penalty_with_samples = true,
solver = nothing)
```

We can see the hyperparameter `lambda` is initialised to 1.

We want to know the optimal \(\lambda\) value so will use cross-validation to train the model on one set of data and verify on a hold-out set before repeating. This is all simple in MLJ.jl, we define a grid of penalisations between 0 and 1 and fit the regression using cross-validation across the different lambdas. We are optimising for the best \(R^2\) value.

```
lambda_range = range(mdl, :lambda, lower = 0, upper = 1)
lmTuneModel = TunedModel(model = mdl,
                         resampling = CV(nfolds = 6, shuffle = true),
                         tuning = Grid(resolution = 200),
                         range = [lambda_range],
                         measures = [rsq]);
lmTunedMachine = machine(lmTuneModel, X, y);
fit!(lmTunedMachine, rows = train, verbosity = 0)
report(lmTunedMachine).best_model
```

```
RidgeRegressor(
lambda = 0.020100502512562814,
fit_intercept = true,
penalize_intercept = false,
scale_penalty_with_samples = true,
solver = nothing)
```

The best value of \(\lambda\) is 0.0201. When we plot \(R^2\) against \(\lambda\) there isn’t much change, just a minor inflection at small values.

```
plot(lmTunedMachine)
```

Let’s save those parameters. This will be our basic ridge regression result that the other technique builds off.

```
res = fitted_params(lmTunedMachine).best_fitted_params.coefs
ridgeParams = DataFrame(res)
ridgeParams = hcat(ridgeParams, DataFrame(Model = "Ridge", alcohol=NaN))
ridgeParams
```

1×12 DataFrame

| Row | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | Model | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.190892 | -0.157286 | 0.0410523 | 0.117846 | -0.142458 | 0.0374597 | -0.153419 | -0.29919 | 0.0375852 | 0.232461 | Ridge | NaN |

The main result from the paper is that we first need to estimate the confounding strength \(\beta\) and then choose a penalisation factor \(\lambda\) whose fitted parameters satisfy

\[|| a || ^2 \approx (1-\beta) \, || a_{\text{OLS}} || ^2,\]

so the \(L_2\) norm of the ridge parameters can only be so large. In the 2nd paper, they estimate \(\beta\) to be 0.8. For us, we can use the above grid search, calculate the norm of the parameters for each \(\lambda\), and find which one satisfies this criterion.

So we iterate through the grid-search results and calculate the \(L_2\) norm of the parameters for each model.

```
mdls = report(lmTunedMachine).history
l = zeros(length(mdls))
a = zeros(length(mdls))
for (i, mdl) in enumerate(mdls)
    l[i] = mdl.model.lambda
    # squared L2 norm of the fitted coefficients for this lambda
    a[i] = sum(map(x -> x[2], fitted_params(fit!(machine(mdl.model, X, y))).coefs) .^ 2)
end
```

Plotting the results gives us a visual idea of how the penalisation works. Larger values of \(\lambda\) mean the model parameters are more and more restricted.

```
inds = sortperm(l)
l = l[inds]
a = a[inds]
mdlsSorted = report(lmTunedMachine).history[inds]
scatter(l, a, label = :none)
hline!([(1-0.8) * sum(coef(confoundOLS)[2:end] .^ 2)], label = "Target Length", xlabel = "Lambda", ylabel = "a Length")
```

We search the lengths for the one closest to the target length and save those parameters.

```
targetLength = (1-0.8) * sum(coef(confoundOLS)[2:end] .^ 2)
ind = findfirst(x-> x < targetLength, a)
res = fitted_params(fit!(machine(mdlsSorted[ind].model, X, y))).coefs
finalParams = DataFrame(res)
finalParams = hcat(finalParams, DataFrame(Model = "With Beta", alcohol=NaN))
finalParams
```

1×12 DataFrame

| Row | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | Model | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0521908 | -0.139099 | 0.0598797 | 0.0377729 | -0.0786037 | 0.00654776 | -0.0856938 | -0.124057 | 0.00682623 | 0.11735 | With Beta | NaN |

Now, the code to calculate \(\beta\) isn’t straightforward to implement (hence why I took their estimate). Instead, we could take the approach from Better AB Testing via Causal Regularisation and use the test set to optimise the penalisation parameter \(\lambda\), then use that value when training the model on the train set.

Applying this method to the wine dataset isn’t a true replication of their paper, as their test and train data sets are instead two data sets, one with bias and one without like you might observe from an AB test. So it’s more of a demonstration of the method rather than a direct comparison to the Janzing method.

Again, `MLJ` makes this simple: we just fit the machine using the `test` rows to produce the best-fitting model.

```
lambda_range = range(mdl, :lambda, lower = 0, upper = 1)
lmTuneModel = TunedModel(model = mdl,
                         resampling = CV(nfolds = 6, shuffle = true),
                         tuning = Grid(resolution = 200),
                         range = [lambda_range],
                         measures = [rsq]);
lmTunedMachine = machine(lmTuneModel, X, y);
fit!(lmTunedMachine, rows = test, verbosity = 0)
plot(lmTunedMachine)
```

```
report(lmTunedMachine).best_model
```

```
RidgeRegressor(
lambda = 0.010050251256281407,
fit_intercept = true,
penalize_intercept = false,
scale_penalty_with_samples = true,
solver = nothing)
```

Our best \(\lambda\) is 0.01 so we retrain the same machine, this time using the training rows.

```
res2 = fit!(machine(report(lmTunedMachine).best_model, X, y), rows=train)
```

Saving these parameters down leaves us with all the different methods and their sets of parameters.

```
finalParams2 = DataFrame(fitted_params(res2).coefs)
finalParams2 = hcat(finalParams2, DataFrame(Model = "No Beta", alcohol=NaN))
allParams = vcat([olsParams, olsParamsConf, ridgeParams, finalParams, finalParams2]...)
allParams
```

5×12 DataFrame

| Row | alcohol | chlorides | citric acid | density | fixed acidity | free sulfur dioxide | pH | residual sugar | sulphates | total sulfur dioxide | volatile acidity | Model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.294335 | -0.088211 | -0.0355637 | -0.0337477 | 0.043511 | 0.0456202 | -0.0638624 | 0.0230259 | 0.155325 | -0.107389 | -0.194027 | OLS |
| 2 | NaN | -0.107215 | 0.00912711 | -0.355576 | 0.299551 | 0.0394281 | 0.0965662 | 0.133781 | 0.213697 | -0.128248 | -0.176182 | OLS No Alcohol |
| 3 | NaN | -0.142458 | 0.0410523 | -0.29919 | 0.190892 | 0.0374597 | 0.0375852 | 0.117846 | 0.232461 | -0.153419 | -0.157286 | Ridge |
| 4 | NaN | -0.0786037 | 0.0598797 | -0.124057 | 0.0521908 | 0.00654776 | 0.00682623 | 0.0377729 | 0.11735 | -0.0856938 | -0.139099 | With Beta |
| 5 | NaN | -0.141766 | 0.031528 | -0.323596 | 0.222812 | 0.03869 | 0.048907 | 0.127026 | 0.23961 | -0.153488 | -0.157603 | No Beta |

What method has done the best at uncovering the confounded relationship?

We have our different estimates of the model parameters; we now want to compare these to the ‘true’ unconfounded values and see whether we have recovered the correct model. To do this we calculate the squared difference from the truth and normalise by the overall \(L_2\) norm of the fitted parameters:

\[e = \frac{\sum_i (a_i^{\text{true}} - \hat{a}_i)^2}{\sum_i \hat{a}_i^2}.\]

In practice, this just means we are comparing how far the fitted parameters are away from the true (unconfounded) model parameters.

```
allParamsLong = stack(allParams, Not(:Model))
trueParams = select(@subset(allParamsLong, :Model .== "OLS"), Not(:Model))
rename!(trueParams, ["variable", "truth"])
allParamsLong = leftjoin(allParamsLong, trueParams, on = :variable)
errorRes = @combine(groupby(@subset(allParamsLong, :variable .!= "alcohol"), :Model),
:a = sum((:truth .- :value) .^2),
:a2 = sum(:value .^ 2))
errorRes = @transform(errorRes, :e = :a ./ :a2)
sort(errorRes, :e)
```

5×4 DataFrame

| Row | Model | a | a2 | e |
|---|---|---|---|---|
| 1 | OLS | 0.0 | 0.0920729 | 0.0 |
| 2 | With Beta | 0.0291038 | 0.0698576 | 0.416616 |
| 3 | Ridge | 0.129761 | 0.266952 | 0.486085 |
| 4 | No Beta | 0.157667 | 0.301286 | 0.523314 |
| 5 | OLS No Alcohol | 0.213692 | 0.349675 | 0.611116 |

Using the \(\beta\) estimation method gives the best model (smallest \(e\)), which lines up with the paper, and the magnitude of the error is also in line with the paper (they had 0.35 and 0.45 for Lasso and ridge regression respectively). The ridge regression and No Beta methods also improved on the naive OLS approach, which indicates there is some benefit from using them. The No Beta method is not a faithful reproduction of the Better AB Testing paper because that method requires the ‘test’ dataset to come from an AB test scenario, which we don’t have here, so that might explain why the values don’t quite line up.

All methods improve on the naive ‘OLS No Alcohol’ parameters though, which shows this approach to causal regularisation can uncover better models if you have underlying confounding in your data.

We are always stuck with the data we are given and most of the time can’t collect more to uncover further relationships. Causal regularisation gives us a chance to use standard machine learning techniques to build better causal models, by guiding what the regularisation parameters should be and using that to restrict the overall parameters. When we can estimate the expected confounding \(\beta\) we get the best results, but plain ridge regression and the Webster-Westray method also improve on a naive regression. So whilst overfitting is the main motivation for regularisation, it also brings some causal benefits and helps you get closer to the true relationships between the variables.

I’ve written about causal analysis techniques before with Double Machine Learning - An Easy Introduction. This is another way of building causal models.

Enjoy these types of posts? Then you should sign up for my newsletter.

I’ve tried to cover different assets and frequencies to hopefully inspire the various types of quant finance out there.

My day-to-day job is in FX so, naturally, that’s where I think all the best data can be found. TrueFX provides tick-by-tick quotes with millisecond timestamps, so high-frequency data is available for free and across lots of different currencies. If you are interested in working out how to deal with large amounts of data efficiently (1 month of EURUSD is 600MB), this source is a good place to start.

As a demo, I’ve downloaded the USDJPY October dataset.

```
using CSV, DataFrames, DataFramesMeta, Dates, Statistics
using Plots
```

It’s a big CSV file, so this isn’t the best way to store the data; instead, stick it into a database like QuestDB that is made for time-series data.

```
usdjpy = CSV.read("USDJPY-2023-10.csv", DataFrame,
header = ["Ccy", "Time", "Bid", "Ask"])
usdjpy.Time = DateTime.(usdjpy.Time, dateformat"yyyymmdd HH:MM:SS.sss")
first(usdjpy, 4)
```

4×4 DataFrame

| Row | Ccy | Time | Bid | Ask |
|---|---|---|---|---|
| 1 | USD/JPY | 2023-10-01T21:04:56.931 | 149.298 | 149.612 |
| 2 | USD/JPY | 2023-10-01T21:04:56.962 | 149.298 | 149.782 |
| 3 | USD/JPY | 2023-10-01T21:04:57.040 | 149.589 | 149.782 |
| 4 | USD/JPY | 2023-10-01T21:04:58.201 | 149.608 | 149.782 |

It’s simple data, just a bid and ask price with a time stamp.

```
usdjpy = @transform(usdjpy, :Spread = :Ask .- :Bid,
:Mid = 0.5*(:Ask .+ :Bid),
:Hour = round.(:Time, Minute(10)))
usdjpyHourly = @combine(groupby(usdjpy, :Hour), :open = first(:Mid), :close = last(:Mid), :avg_spread = mean(:Spread))
usdjpyHourly.Time = Time.(usdjpyHourly.Hour)
plot(usdjpyHourly.Hour, usdjpyHourly.open, lw =1, label = :none, title = "USDJPY Price Over October")
```

Looking at the hourly price over the month gives you flat periods over the weekend.

Let’s look at the average spread (ask - bid) throughout the day.

```
hourlyAvgSpread = sort(@combine(groupby(usdjpyHourly, :Time), :avg_spread = mean(:avg_spread)), :Time)
plot(hourlyAvgSpread.Time, hourlyAvgSpread.avg_spread, lw =2, title = "USDJPY Intraday Spread", label = :none)
```

We see a big spike at 10 pm because of the day roll: the secondary markets go offline briefly, which pollutes the data a bit. Looking at just midnight to 8 pm gives a more indicative picture.

```
plot(hourlyAvgSpread[hourlyAvgSpread.Time .<= Time("20:00:00"), :].Time,
hourlyAvgSpread[hourlyAvgSpread.Time .<= Time("20:00:00"), :].avg_spread, label = :none, lw=2,
title = "USDJPY Intraday Spread")
```

In October spreads have generally been wider in the later part of the day compared to the morning.

There is much more that can be done with this data across the different currencies though. For example:

- How stable are correlations across currencies at different time frequencies?
- Can you replicate my microstructure noise post? How does the microstructure noise change between currencies?
- Price updates are irregular, what are some statistical properties?
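For the correlation idea, here is a sketch of the mechanics with synthetic second-frequency prices (the series and numbers are made up for illustration): resample both series to each frequency and correlate the returns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
idx = pd.date_range("2023-10-02", periods=10_000, freq="s")

# Two synthetic price paths sharing a common random-walk component
common = rng.normal(size=len(idx)).cumsum()
eurusd = pd.Series(100 + 0.01 * (common + rng.normal(size=len(idx)).cumsum()), index=idx)
gbpusd = pd.Series(100 + 0.01 * (common + rng.normal(size=len(idx)).cumsum()), index=idx)

# Correlation of returns at increasing sampling intervals
for freq in ["1s", "1min", "10min"]:
    r1 = eurusd.resample(freq).last().pct_change().dropna()
    r2 = gbpusd.resample(freq).last().pct_change().dropna()
    print(freq, round(r1.corr(r2), 3))
```

With real tick data the interesting part is that updates arrive asynchronously across currencies, so how you align the two series before correlating matters.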

Let’s zoom out a little bit now, decrease the frequency, and widen the asset pool. Futures cover many asset classes: oil, coal, currencies, metals, agriculture, stocks, bonds, interest rates, and probably something else I’ve missed. This data is daily and roll-adjusted, so you have a continuous time series of an asset for many years. This means you can look at the classic momentum/mean reversion portfolio models and have a real stab at long-term trends.
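As a hedged sketch of what a first momentum experiment on such a series might look like (simulated prices here, not the actual futures data): go long when the trailing 50-day return is positive, short otherwise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated daily settle prices standing in for a roll-adjusted contract
price = pd.Series(100 * np.exp(rng.normal(0.0002, 0.01, 1000).cumsum()))

lookback = 50
signal = np.sign(price.pct_change(lookback)).shift(1)  # trade on yesterday's signal
daily_ret = price.pct_change()
strategy_ret = (signal * daily_ret).dropna()
sharpe = np.sqrt(252) * strategy_ret.mean() / strategy_ret.std()
print("Annualised Sharpe:", round(sharpe, 2))
```

The `shift(1)` is the important detail: you can only trade on information available before the return you earn, otherwise you are lookahead-biased.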

The data is part of the Nasdaq Data Link product (formerly Quandl) and once you sign up for an account you have access to the free data. This futures dataset is Wiki Continuous Futures, and after about 50 clicks of logging in, re-logging in, and 2FA codes, you can view the pages.

To get the data you can go through one of the API packages in your favourite language. In Julia, this means the QuandlAccess.jl package which keeps things simple.

```
using QuandlAccess
futuresMeta = CSV.read("continuous.csv", DataFrame)
futuresCodes = futuresMeta[!, "Quandl Code"] .* "1"
quandl = Quandl("QUANDL_KEY")
function get_data(code)
futuresData = quandl(TimeSeries(code))
futuresData.Code .= code
futuresData
end
futureData = get_data.(rand(futuresCodes, 4));
```

We have an array of all the available contracts in `futuresCodes` and sample 4 of them randomly to see what the data looks like.

```
# Collect one plot per contract and lay them out in a grid
p = []
for df in futureData
    push!(p, plot(df.Date, df.Settle, label = df.Code[1]))
end
plot(p..., layout = 4)
```

- ABY - WTI Brent Bullet - Spread between two oil futures on different exchanges.
- TZ6 - Transco Zone 6 Non-N.Y. Natural Gas (Platts IFERC) Basis - Spread between two different natural gas contracts
- PG - PG&E Citygate Natural Gas (Platts IFERC) Basis - Again, spread between two different natural gas contracts
- FMJP - MSCI Japan Index - Index containing Japanese stocks

I’ve managed to randomly select 3 energy futures and one stock index.

Project ideas with this data:

- Cross-asset momentum and mean reversion.
- Cross-asset correlations, does the price of oil drive some equity indexes?
- Macro regimes, can you pick out commonalities of market factors over the years?

Out there in the wild is the FI2010 dataset which is essentially a sample of the full order book for five different stocks on the Nordic stock exchange for 10 days. You have 10 levels of prices and volumes and so can reconstruct the order book throughout the day. It is the benchmark dataset for limit order book prediction and you will see it referenced in papers that are trying to implement new prediction models. For example Benchmark Dataset for Mid-Price Forecasting of Limit Order Book Data with Machine Learning Methods references some basic methods on the dataset and how they perform when predicting the mid-price.
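The target in these benchmarks is the mid-price. As a minimal illustration of it, alongside the volume-imbalance feature often used as a predictor (the numbers here are hypothetical, not FI2010 rows):

```python
def mid_price(best_bid, best_ask):
    # Midpoint of the top-of-book quotes
    return 0.5 * (best_bid + best_ask)

def imbalance(bid_vol, ask_vol):
    # +1 = all liquidity on the bid, -1 = all on the ask
    return (bid_vol - ask_vol) / (bid_vol + ask_vol)

print(mid_price(100.0, 100.5))  # 100.25
print(imbalance(300, 100))      # 0.5
```

With 10 levels per side you can build richer versions of the same idea, e.g. depth-weighted imbalances across levels.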

I found the dataset (as a Python package) here https://github.com/simaki/fi2010 but it’s just stored as a CSV which you can lift easily.

```
fi2010 = CSV.read(download("https://raw.githubusercontent.com/simaki/fi2010/main/data/data.csv"),DataFrame);
```

**Update on 7/01/2024**

Since posting this, the above link has gone offline and the user has deleted their GitHub account! Instead, the dataset can be found here: https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649/data . I’ve not verified if it’s in the same format, so there might be some additional work going from the raw data to how this blog post sets it up. Thanks to the commenters below for pointing this out.

The data is wide (each column is a depth level of the price and volume) so I turn each into a long data set and add the level, side and variable as a new column.

```
fi2010Long = stack(fi2010, 4:48, [:Column1, :STOCK, :DAY])
fi2010Long = @transform(fi2010Long, :a = collect.(eachsplit.(:variable, "_")))
fi2010Long = @transform(fi2010Long, :var = first.(:a), :level = last.(:a), :side = map(x->x[2], :a))
fi2010Long = @transform(groupby(fi2010Long, [:STOCK, :DAY]), :Time = collect(1:length(:Column1)))
first(fi2010Long, 4)
```

The ‘book depth’ is the sum of the liquidity available at all the levels and indicates how easy it is to trade the stock. As a quick example, we can take the average of each stock per day and use that as a proxy for the ease of trading these stocks.

```
intraDayDepth = @combine(groupby(fi2010Long, [:STOCK, :DAY, :var]), :avgDepth = mean(:value))
intraDayDepth = @subset(intraDayDepth, :var .== "VOLUME");
plot(intraDayDepth.DAY, intraDayDepth.avgDepth, group=intraDayDepth.STOCK,
marker = :circle, title = "Avg Daily Book Depth - FI2010")
```

Stocks 3 and 4 have the highest average depth, so they are most likely the easiest to trade, whereas Stock 1 has the thinnest depth. Stock 2 has an interesting switch between liquid and not liquid.

So if you want to look beyond top-of-book data, this dataset provides the extra level information needed and is closer to what a professional shop is using. Better than trying to predict daily Yahoo finance mid-prices with neural nets at least.

If you want to take a further step back then being able to build the tools that take in streaming data directly from the exchanges and save that into a database is another way you can build out your technical capabilities. This means you have full control over what you download and save. Do you want just the top of book every update, the full depth of the book, or just the reported trades? I’ve written about this before, Getting Started with High Frequency Finance using Crypto Data and Julia, and learned a lot in the process. Doing things this way means you have full control over the entire process and can fully understand the data you are saving and any additional quirks around the process.

Plenty to get stuck into and learn from. Getting the data and loading it into an environment is always the first challenge, and learning how to do that with all these different types of data should help you understand what these types of jobs entail.

Enjoy these types of posts? Then you should sign up for my newsletter.

Reinforcement learning is a pillar of machine learning that combines data with learning how to make better decisions automatically. One of the basic models in reinforcement learning is the *multi-armed bandit*. A bit of an anachronistic name: the single-armed bandit refers to a casino slot machine where you pull the lever (or push a button), the reels spin round, and you might win a prize.

The multi-armed bandit is an extension to this type of game and means we have different levers we can pull that lead to a different reward. The reward depends on the lever pulled.

This simple mental model is surprisingly applicable to lots of different problems and can act as a good approximation to whatever you are trying to solve. Take an advertising example. You have multiple adverts that you display to try and get people to click through to your website. Each time a page loads you can show one advert; you then record how many people click on it and use that to decide which advert to show next. With each page load you decide: do I show the most successful advert so far, or try a new advert to see how it performs? Over time you find out which advert performs best and show that as much as possible to get as many clicks.

Imagine we have a multi-armed bandit machine, where we pull a lever and get a reward. The reward depends on the lever pulled, how do we learn what the best lever is?

First let’s build our bandit. We will have 5 levers and the reward will be a sample from a normal distribution where each lever will have a random mean and standard deviation.

```
using Plots, StatsPlots
using Distributions
nLevers = 5
rewardMeans = rand(Normal(0, 3), nLevers)
rewardSD = rand(Gamma(2, 2), nLevers)
hcat(rewardMeans, rewardSD)
```

```
5×2 Matrix{Float64}:
-4.7724 5.88533
-4.60967 0.627556
-5.96987 1.14465
8.96919 3.80253
2.11311 4.84983
```

These are the parameters of our levers in our bandit, so lets look at the distribution of the rewards.

```
density(rand(Normal(rewardMeans[1], rewardSD[1]), 1000), label = "Lever 1")
for i in 2:nLevers
density!(rand(Normal(rewardMeans[i], rewardSD[i]), 1000), label = "Lever " * string(i))
end
plot!()
```

The plot above illustrates the reward distribution of each lever. The 4th lever looks like the best as it has the highest chance of a positive value and the widest right tail. As we are talking about rewards, large positive values are better.

So given we have a process of pulling a lever and getting a reward, how do we learn what the best lever is and importantly as quickly as possible?

Like all good statistics problems, we start with the most basic model and start pulling levers randomly.

Just pull a random lever every time. Nothing is being learned here though and we are just demonstrating how the problem setup works. With each play we generate a random integer that corresponds to the lever, pull the lever (draw a random normal variable with mean/deviation of that lever), record what lever was pulled and the reward amount. Then repeat several times.

```
function random_learner(rewardMeans, rewardSD, nPlays)
    nLevers = length(rewardMeans)
    selectedLever = zeros(Int64, nPlays)
    rewards = zeros(nPlays)
    cumSelection = zeros(Int64, nLevers)
    cumRewards = zeros(nLevers)
    optimalChoice = Array{Bool}(undef, nPlays)
    bestLever = findmax(rewardMeans)[2]
    for i = 1:nPlays
        selectedLever[i] = rand(1:nLevers)
        optimalChoice[i] = selectedLever[i] == bestLever
        rewards[i] = rand(Normal(rewardMeans[selectedLever[i]], rewardSD[selectedLever[i]]))
        cumSelection[selectedLever[i]] += 1
        cumRewards[selectedLever[i]] += rewards[i]
    end
    return selectedLever, rewards, cumSelection, cumRewards, optimalChoice
end
```

We run this learner for 1,000 steps and look at the number of times each lever is pulled.

```
randomStrat = random_learner(rewardMeans, rewardSD, 1000);
histogram(randomStrat[1], label = "Number of Time Lever Pulled")
```

Each of the levers is pulled a roughly equal amount of times, with no learning, just randomly pulling. Moving on, how do we learn?

Reinforcement learning is about balancing the explore/exploit set-up of the problem. We need to sample each of the levers and work out what kind of rewards they provide and then use that information to inform our next decision.

For each iteration, we randomly decide whether to pull a random lever or use the old information to choose our best guess at the best lever. Our information in this case is the rolling average of the reward from each time we pulled that lever. This is called a *greedy learner*: it’s just doing its best with what it knows and has no real ability to decide whether to explore a new lever.

The probability of choosing a random lever is called the learning rate (\(\eta\), usually written \(\epsilon\) in the \(\epsilon\)-greedy literature) and controls how often we make the perceived optimal choice. A high value of \(\eta\) means lots of exploring (learning) and a low value restricts the learning, meaning we pull the (perceived) best lever each time. So if we had many levers and a low learning rate it is possible that we never find the globally optimal lever and instead stick to a locally optimal one; hence why it is called a greedy learner, it can get stuck.

```
function greedy_learner(rewardMeans, rewardSD, nPlays, eta)
    nLevers = length(rewardMeans)
    selectedLever = zeros(Int64, nPlays)
    rewards = zeros(nPlays)
    cumSelection = zeros(Int64, nLevers)
    cumRewards = zeros(nLevers)
    optimalChoice = Array{Bool}(undef, nPlays)
    bestLever = findmax(rewardMeans)[2]
    for i = 1:nPlays
        if rand() < eta
            selectedLever[i] = rand(1:nLevers)
        else
            q = cumRewards ./ cumSelection
            q[isnan.(q)] .= 0
            selectedLever[i] = findmax(q)[2]
        end
        optimalChoice[i] = selectedLever[i] == bestLever
        rewards[i] = rand(Normal(rewardMeans[selectedLever[i]], rewardSD[selectedLever[i]]))
        cumSelection[selectedLever[i]] += 1
        cumRewards[selectedLever[i]] += rewards[i]
    end
    return selectedLever, rewards, cumSelection, cumRewards, optimalChoice
end
```

Again, we can run it for 1,000 steps and we set our learning rate to 0.5.

```
greedyStrat = greedy_learner(rewardMeans, rewardSD, 1000, 0.5)
histogram(greedyStrat[1], label = "Number of Time Lever Pulled", legend = :topleft)
```

This has done what we thought, it has selected the 4th lever that we thought looked the best from the distribution. So we’ve learned something, hooray!

The \(\eta\) parameter was set to 0.5 above, but how does varying it change the outcome? To explore this we will do multiple runs of multiple plays of the game and also increase the number of levers. For each run, we generate a new set of reward means/standard deviations and run the random learner and the greedy learner with different values of \(\eta\).

```
nRuns = 2000
nPlays = 1000
nLevers = 10
optimalLevel = zeros(nRuns)
randomRes = Array{Tuple}(undef, nRuns)
greedyRes = Array{Tuple}(undef, nRuns)
greedyRes05 = Array{Tuple}(undef, nRuns)
greedyRes01 = Array{Tuple}(undef, nRuns)
greedyRes001 = Array{Tuple}(undef, nRuns)
greedyRes0001 = Array{Tuple}(undef, nRuns)
for i = 1:nRuns
    rewardMeans = rand(Normal(0, 1), nLevers)
    rewardSD = ones(nLevers)
    randomRes[i] = random_learner(rewardMeans, rewardSD, nPlays)
    greedyRes[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0)
    greedyRes05[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.5)
    greedyRes01[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.1)
    greedyRes001[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.01)
    greedyRes0001[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.001)
    optimalLevel[i] = findmax(rewardMeans)[2]
end
```

For each of the runs we have the evolution of the reward, so we want to take the average of the reward on each time step and see how that evolves with each play of the game.

```
randomAvg = mapreduce(x-> x[2], +, randomRes) ./ nRuns
greedyAvg = mapreduce(x-> x[2], +, greedyRes) ./ nRuns
greedyAvg01 = mapreduce(x-> x[2], +, greedyRes01) ./ nRuns
greedyAvg05 = mapreduce(x-> x[2], +, greedyRes05) ./ nRuns
greedyAvg001 = mapreduce(x-> x[2], +, greedyRes001) ./ nRuns;
greedyAvg0001 = mapreduce(x-> x[2], +, greedyRes0001) ./ nRuns;
```

And plotting the average reward over time.

```
plot(1:nPlays, randomAvg, label="Random", legend = :bottomright, xlabel = "Time Step", ylabel = "Average Reward")
plot!(1:nPlays, greedyAvg, label="0")
plot!(1:nPlays, greedyAvg05, label="0.5")
plot!(1:nPlays, greedyAvg01, label="0.1")
plot!(1:nPlays, greedyAvg001, label="0.01")
plot!(1:nPlays, greedyAvg0001, label="0.001")
```

Good to see that all the greedy learners outperform the random learner, so the algorithm is doing something. If we focus on the greedy learners we can see how the learning rate changes performance.

```
plot(1:nPlays, greedyAvg, label="0", legend=:bottomright, xlabel = "Time Step", ylabel = "Average Reward")
plot!(1:nPlays, greedyAvg01, label="0.1")
plot!(1:nPlays, greedyAvg001, label="0.01")
plot!(1:nPlays, greedyAvg0001, label="0.001")
```

This is an interesting result! When \(\eta = 0\) the learner never reaches as high an average reward as the other learning rates. With \(\eta = 0\) we never explore the other options; we just select what we think is the best lever from history and never stray from our beliefs. This ultimately hurts us, because if we don’t find the best lever early on we are stuck in a suboptimal choice. Likewise, when the learning rate is very low it doesn’t do much better, which shows there is always value in exploring the options.

Philosophically, this shows that with any procedure you need to iterate through different configurations and explore the outcomes rather than sticking with what you believe is optimal.

```
scatter([0, 0.5, 0.1,0.01, 0.001],
map(x-> mean(x[750:1000]), [greedyAvg, greedyAvg05, greedyAvg01, greedyAvg001, greedyAvg0001]),
xlabel="Learning Rate",
ylabel = "Converged Reward", legend=:none)
```

The learning rate looks like it is optimal around 0.1. You can do a grid search to see how the overall behaviour changes in terms of both the speed of convergence to the final state and how good that final reward state is.

We can improve the above implementation, saving memory and CPU cycles, by doing ‘online learning’ of the rewards and using that to drive the selection. We keep one matrix \(Q\), update it with the running average reward of each lever, and take its maximum at each iteration to select our lever when we are not exploring.
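This ‘online’ update is the standard incremental form of the sample mean. If a lever has been pulled \(n\) times with rewards \(R_1, \dots, R_n\), then

\[Q_{n+1} = \frac{1}{n} \sum_{i=1}^{n} R_i = Q_n + \frac{1}{n} \left( R_n - Q_n \right),\]

so we only need to store the current estimate and a count rather than the full reward history.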

```
function greedy_learner_incremental(rewardMeans, rewardSD, nPlays, eta)
    nLevers = length(rewardMeans)
    selectedLever = zeros(Int64, nPlays)
    rewards = zeros(nPlays)
    cumSelection = zeros(Int64, nLevers)
    cumRewards = zeros(nLevers)
    Q = zeros((nPlays+1, nLevers))
    optimalChoice = Array{Bool}(undef, nPlays)
    bestLever = findmax(rewardMeans)[2]
    for i = 1:nPlays
        if rand() < eta
            selectedLever[i] = rand(1:nLevers)
        else
            selectedLever[i] = findmax(Q[i,:])[2]
        end
        lever = selectedLever[i]
        optimalChoice[i] = lever == bestLever
        reward = rand(Normal(rewardMeans[lever], rewardSD[lever]))
        rewards[i] = reward
        cumSelection[lever] += 1
        cumRewards[lever] += reward
        # Incremental mean update: only the pulled lever's estimate moves
        Q[i+1, :] = Q[i, :]
        Q[i+1, lever] += (1/cumSelection[lever]) * (reward - Q[i, lever])
    end
    return selectedLever, rewards, cumSelection, cumRewards, optimalChoice
end
```

Using the normal Julia benchmarking tools we can get a good idea if this rewrite has changed anything materially.

```
using BenchmarkTools
oldImp = @benchmark greedy_learner(rewardMeans, rewardSD, nPlays, 0.1)
newImp = @benchmark greedy_learner_incremental(rewardMeans, rewardSD, nPlays, 0.1)
judge(median(oldImp), median(newImp))
```

```
BenchmarkTools.TrialJudgement:
time: -43.91% => improvement (5.00% tolerance)
memory: -70.15% => improvement (1.00% tolerance)
```

It’s over 40% faster and uses 70% less memory, so a good optimisation.

This is the basic intro to reinforcement learning but a good foundation for how to think about these problems. The main step is going from data to decisions and how to update the decisions you make each time. You need to make sure you explore the problem space as otherwise you never know how much better some other options might be.

Enjoy these types of posts? Then you should sign up for my newsletter.

I’ve written before about predicting the number of goals in a game and this is a complement to that post. Part of my PhD involved fitting a multidimensional Hawkes process to the times of goals scored by the home and away teams; this post isn’t as complicated as that — instead we look at something simpler.

This is a change of language too, I’m writing R instead of Julia for once!

```
require(jsonlite)
require(dplyr)
require(tidyr)
require(ggplot2)
knitr::opts_chunk$set(fig.retina=2)
require(hrbrthemes)
theme_set(theme_ipsum())
extrafont::loadfonts()
require(wesanderson)
```

I have a dataset that contains the odds and the times of goals for many different football matches.

```
finalData <- readRDS("/Users/deanmarkwick/Documents/PhD/Research/Hawkes and Football/Data/allDataOddsAndGoals.RDS")
```

We do some wrangling of the data, converting it from the JSON format to give us a vector of each team’s goals split into whether they are home or away.

```
homeGoalTimes <- lapply(finalData$home.mins.goal, fromJSON)
awayGoalTimes <- lapply(finalData$away.mins.goal, fromJSON)
allGoals <- c(unlist(homeGoalTimes), unlist(awayGoalTimes))
```

To clean the data we need to replace the games without goals with an empty numeric vector and also truncate any goals scored in extra time to 90 minutes, as we need a fixed window for the point process modelling.

```
replaceEmptyWithNumeric <- function(x){
    if(length(x) == 0){
        return(numeric(0))
    } else {
        return(x)
    }
}

max90 <- function(x){
    x[x > 90] <- 90
    return(x)
}

homeGoalTimesClean <- lapply(homeGoalTimes, replaceEmptyWithNumeric)
homeGoalTimesClean <- lapply(homeGoalTimesClean, max90)
awayGoalTimesClean <- lapply(awayGoalTimes, replaceEmptyWithNumeric)
awayGoalTimesClean <- lapply(awayGoalTimesClean, max90)
```

As the number of goals scored by each team will be proportional to the strength of the team, we will use the odds of the team winning the match as a proxy for their strength. This works well, as my previous blog post, Goals from team strengths, explored.

```
homeProbsStrengths <- finalData$PSCH
awayProbsStrengths <- finalData$PSCA
allStrengths <- c(homeProbsStrengths, awayProbsStrengths)
allGoalTimes <- c(homeGoalTimesClean, awayGoalTimesClean)
```

Interestingly, we can do the same cleaning in `dplyr` easily using the `case_when` function.

```
allGoalsFrame <- data.frame(Time = allGoals)
allGoalsFrame %>%
mutate(TimeClean = case_when(Time > 90 ~ 90,
TRUE ~ as.numeric(Time))) -> allGoalsFrame
```

After all that we can plot our distribution of goal times.

```
ggplot(allGoalsFrame, aes(x=TimeClean, y=after_stat(density))) +
geom_histogram(binwidth = 1) +
xlab("Time (mins)") +
ylab("Goal Density")
```

Two bumps: one around 45 minutes, where goals are scored in the extra time of the first half, and one at 90 from the 90+ minute goals.

This is what we are trying to model. We want to predict when the goals will happen based on that team’s strength, which will also control how many goals are scored.

A point process is a mathematical model that describes when things happen in a fixed window. Our window is the 90 minutes of the football match and we want to know where the goals fall in this window.

A point process is described by its intensity \(\lambda (t)\), which is proportional to the likelihood of seeing an event at time \(t\): the higher the intensity, the larger the chance of a goal occurring. From our plot above we can see there are two main features we want our model to capture:

- The general increase in goals as the match progresses.
- The spike at 90 because of extra time.

To fit this type of model we will write an intensity function \(\lambda\) and optimise the parameters to maximise the likelihood.

The log-likelihood for a point process is the sum of the log intensity \(\lambda(t)\) at each event minus the integral of the intensity function over the window

\[\mathcal{L} = \sum _{i} \log \lambda (t_i) - \int _0^T \lambda (t) \mathrm{d} t.\]We have to specify the form of \(\lambda\) with a function and parameters and then fit the parameters to the data. By looking at the data we can see the intensity appears to be increasing and we need to account for the spike at 90

\[\lambda (t) = w \beta _0 + \beta _1 \frac{t}{T} + \beta _{90} \delta (t-90),\]where \(w\) is the team strength, \(T\) is 90 and \(\delta (x)\) is the Dirac delta function. More on that later.

Which we can easily integrate.

\[\int _0^T \lambda(t) \mathrm{d} t = w \beta_0 T + \beta _1 \frac{T}{2} + \beta_{90}.\]This gives us our likelihood function, so we can move on to optimising it over our data.
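Spelling out the integral term by term:

\[\int _0 ^T \lambda(t) \mathrm{d} t = w \beta _0 \int _0 ^T \mathrm{d} t + \frac{\beta _1}{T} \int _0 ^T t \, \mathrm{d} t + \beta _{90} \int _0 ^T \delta (t - 90) \mathrm{d} t = w \beta _0 T + \beta _1 \frac{T}{2} + \beta _{90},\]

where the Dirac delta integrates to one because \(t = 90\) lies inside the window \([0, T]\).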

It’s always good to make sure you are on the right track by simulating the models you are exploring. Jumping straight into the real data means you are hoping your methods are correct, but starting with a known model and using the methods to recover the parameters gives you some confidence that what you are doing is correct.

There are three components to our model:

- the intensity function
- the integrated intensity function
- the likelihood

We will also be using a Dirac delta function to represent the 90-minute spike. Given our data is measured in minutes and all the goals that happen in extra time have the value of `t=90`, we need a sensible way to account for this mega spike. Essentially, we want something that is 1 at a single point and 0 everywhere else. That way we can assign a weight to this component in the overall model, which helps describe the data and also integrates nicely.

Now I’m a physicist by training, so my mathematical rigour around the function might not be up to scratch.

```
diract <- function(t, x=90){
    2*as.numeric(round(t) == x)
}

qplot(seq(0, 100, 0.1), diract(seq(0, 100, 0.1))) +
    xlab("Time") +
    ylab("Weight")
```

As expected, it spikes at 90 and is 0 everywhere else.

We can now write the R code for our intensity function, and then the likelihood by combining the intensity and integrated intensity.

```
intensityFunction <- function(params, t, winProb, maxT){
    beta0 <- params[1]
    beta1 <- params[2]
    beta90 <- params[3]
    int <- (winProb * beta0) + (beta1 * (t/maxT)) + (beta90*diract(t))
    int[int < 0] <- 0
    int
}

intensityFunctionInt <- function(params, maxT, winProb){
    beta0 <- params[1]
    beta1 <- params[2]
    beta90 <- params[3]
    beta0*winProb*maxT + (beta1*maxT)/2 + beta90
}

likelihood <- function(params, t, winProb){
    ss <- sum(log(intensityFunction(params, t, winProb, 90)))
    int <- intensityFunctionInt(params, 90, winProb)
    ss - int
}
```

We now combine the three functions and simulate a point process from the
intensity function. We will use *thinning* to simulate the
inhomogeneous intensity. This means generating more points than expected
from a larger intensity, and then choosing what ones remain as a ratio
between the larger intensity and true intensity. For a more in-depth
discussion I’ve written about it previously in my
post.

```
sim_events <- function(params, winProb){
    lambdaMax <- 1.1*intensityFunction(params, 90, winProb, 90)
    nevents <- rpois(1, lambdaMax*90)
    tstar <- runif(nevents, 0, 90)
    accept_prob <- intensityFunction(params, tstar, winProb, 90) / lambdaMax
    sort(tstar[runif(length(accept_prob)) < accept_prob])
}
```

```
N <- 100
testParams <- c(3, 2, 2)
testWinProb <- 1
testEvents <- replicate(N, sim_events(testParams, testWinProb))
testWinProbs <- rep_len(testWinProb, N)
trueInt <- intensityFunction(testParams, 0:90, testWinProb, 90)
```

As we have multiple simulated games, we want to calculate the overall likelihood across the total sample and maximise that likelihood.

```
alllikelihood <- function(params, events, winProbs){
    ll <- sum(vapply(seq_along(events),
                     function(i) likelihood(params, events[[i]], winProbs[[i]]),
                     numeric(1)))
    if(ll == -Inf){
        return(-1e9)
    } else {
        return(ll)
    }
}

trueLikelihood <- alllikelihood(testParams, testEvents, testWinProbs)
```

Simple enough to do the optimisation: chuck the function into `optim` and away we go.

```
simRes <- optim(runif(3), function(x) -1*alllikelihood(c(x[1], x[2], x[3]),
testEvents,
testWinProbs), lower = c(0,0,0), method = "L-BFGS-B")
print(simRes$par)
```

3.005867 1.995551 1.932193

The parameters come out almost exactly as they were specified.

```
simResDF <- data.frame(Time = 0:90,
TrueIntensity = trueInt,
EstimatedIntensity = intensityFunction(simRes$par, 0:90, testWinProb, 90))
ggplot(simResDF, aes(x=Time, y=TrueIntensity, color = "True")) +
geom_line() +
geom_line(aes(y=EstimatedIntensity, color = "Estimated")) +
labs(color = NULL) +
xlab("Time") +
ylab("Intensity") +
theme(legend.position = "bottom")
```

Okay, so our method is good. We’ve recovered all three factors in the intensity so well that you can hardly tell the difference between the real and estimated intensities. So we can now go on looking at our data.

Let’s do the train/test split and fit our model on the training data.

```
trainInds <- sample.int(length(allGoalTimes), size = floor(length(allGoalTimes)*0.7))
goalTimesTrain <- allGoalTimes[trainInds]
strengthTrain <- allStrengths[trainInds]
goalTimesTest <- allGoalTimes[-trainInds]
strengthTest <- allStrengths[-trainInds]
```

We start by using a null model. This is where we will just use the constant parameter and the team strengths and see how well that fits the data.

```
optNull <- optim(runif(1), function(x) -1*alllikelihood(c(x[1], 0, 0),
goalTimesTrain,
strengthTrain), lower = c(0,0,0), method = "L-BFGS-B")
optNull
```

We add in the next parameter, the linear trend.

```
optNull2 <- optim(runif(2), function(x) -1*alllikelihood(c(x[1], x[2], 0),
goalTimesTrain,
strengthTrain), lower = c(0,0,0), method = "L-BFGS-B")
optNull2
```

We can now use all the features previously described and fit the full model across the data.

```
optRes <- optim(runif(3), function(x) -1*alllikelihood(x,
goalTimesTrain,
strengthTrain), lower = c(0,0,0), method = "L-BFGS-B")
optRes
```

And then just to check, let’s remove the linear parameter.

```
optRes2 <- optim(runif(2), function(x) -1*alllikelihood(c(x[1], 0, x[2]),
goalTimesTrain,
strengthTrain), lower = c(0,0,0), method = "L-BFGS-B")
optRes2
```

Putting all the results into a table lets us compare nicely.

Model | \(\beta _0\) | \(\beta _1\) | \(\beta _{90}\) |
---|---|---|---|
Constant | 0.0039 | — | — |
Linear | 0.0006 | 0.025 | — |
Delta | 0.00096 | 0.022 | 0.05 |
No Linear | 0.0037 | — | 0.06 |

The positive linear parameter (\(\beta _1\)) shows that the scoring intensity increases as the match progresses.
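To make this concrete, here is a small Python sketch (hypothetical, mirroring the R `intensityFunction`) of the fitted intensity \(\lambda(t) = \beta_0 \cdot \text{winProb} + \beta_1 t / T + \beta_{90} \cdot 1[t = 90]\), using the Delta-model parameters from the table and an illustrative team strength (winProb) of 2:

```python
# Hypothetical Python mirror of the R intensityFunction:
# lambda(t) = b0 * winProb + b1 * t / maxT + b90 (the spike, only at t == maxT).
def intensity(t, params, win_prob, max_t=90):
    b0, b1, b90 = params
    return b0 * win_prob + b1 * t / max_t + (b90 if t == max_t else 0.0)

delta_params = (0.00096, 0.022, 0.05)  # Delta-model fits from the table above

rate_start = intensity(0, delta_params, 2)   # per-minute scoring rate at kick-off
rate_late = intensity(89, delta_params, 2)   # rate just before the 90th minute

# Integrating the intensity over the match (plus the 90-minute spike)
# gives the expected number of goals for this team.
b0, b1, b90 = delta_params
expected_goals = b0 * 2 * 90 + b1 * 90 / 2 + b90
print(round(expected_goals, 4))  # 1.2128
```

Under these fitted values, the scoring rate just before full time is roughly ten times the kick-off rate, and the model implies around 1.2 expected goals per team per game at this strength.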

It is easier to compare the resultant intensity functions though.

```
modelFits <- data.frame(Time = 0:90)
modelFits$Null <- intensityFunction(c(optNull$par[1],0,0), modelFits$Time, 2, 90)
modelFits$Linear <- intensityFunction(c(optNull2$par ,0), modelFits$Time, 2, 90)
modelFits$Delta <- intensityFunction(optRes$par, modelFits$Time, 2, 90)
modelFits$NoLinear <- intensityFunction(c(optRes2$par[1], 0, optRes2$par[2]), modelFits$Time, 2, 90)
modelFits %>%
pivot_longer(!Time, names_to="Model", values_to="Intensity") -> modelFitsTidy
ggplot(modelFitsTidy, aes(x=Time, y=Intensity, color = Model)) +
geom_line() +
theme(legend.position = "bottom")
```

So there are some interesting differences between the models. The Delta model has a shallower linear slope because the 90-minute spike can absorb the burst of goals at the end. Looking at the final out-of-sample likelihoods from the models:

Model | Out of Sample Likelihood |
---|---|
Constant | -55337.35 |
Linear | -52268.48 |
Delta | -51917.7 |
No Linear | -54500.6 |

So, the best-fitting model (largest likelihood) is the Delta model, so that 90-minute spike is doing some work. It also shows that the linear component contributes something to the model, as the No Linear result has a worse likelihood.

Using the likelihood to evaluate the model is only one approach though. We could go further with BIC/AIC/DIC values, but given there are only three parameters in the model, it probably won’t be instructive. Instead, we should look at what the model’s simulated results look like.

We go through each of the test set matches and simulate a match 100 times, taking the maximum number of goals scored. We then compare this to the maximum observed number of goals across the data set and see how the distributions compare.

This is similar to the posterior p-value method of model checking, but slightly different in this case because we do not have a chain of parameters, just the optimised values.

```
maxGoals <- vapply(strengthTest,
function(x) max(replicate(100, length(sim_events(optRes$par, x)))),
numeric(1))
actualMaxGoals <- max(vapply(allGoalTimes, length, numeric(1)))
```

```
ggplot(data = data.frame(MaxGoals = maxGoals), aes(x=MaxGoals)) +
geom_histogram(binwidth = 1) +
geom_vline(xintercept = actualMaxGoals) +
xlab("Maximum Number of Goals")
```

10 is the largest number of goals observed, and our model congregates around 5 as the maximum, but we did see 2 simulations with 10 goals and another 2 with more than 10. So overall, the model can generate something that resembles reality, if only infrequently. But then again, how often do we see 10-goal games?

Overall this is a nice little model that shows the probability of a team scoring appears to increase linearly over time. We added a delta function to account for the fact that some games go beyond 90 minutes and many goals are scored in that period. We then did some model checking by simulating from the fitted parameters, and it turns out the model can generate goal tallies comparable to the real data.

I’ve fitted this model by optimising the likelihood, so the next logical step would be to take a Bayesian approach and throw the model into Stan so we have a proper sample of parameters that lets us judge the uncertainty around the model a bit better. The direction after that would be to relax the linearity of the model, throw a non-parametric approach at the data, and see if anything interesting turns up. I have been trying this with my dirichletprocess package but never managed to get a satisfying result that improved on the above. Plus, with the large dataset, it was taking forever to run. Maybe a blog post for the future!


I’m using Julia 1.9 and my AlpacaMarkets.jl package gets all the data we need.

```
using AlpacaMarkets
using DataFrames, DataFramesMeta
using Dates
using Plots
using RollingFunctions, Statistics
using GLM
```

To start with we simply want the daily prices of JPM, XLF, and SPY. JPM is the stock we think will go through mean reversion, XLF is the financial-sector ETF, and SPY is the general market ETF.

We theorise that if JPM rises higher than XLF then it will soon revert and trade lower. Likewise, if JPM falls below XLF then we think it will soon trade higher. Our mean reversion is all about JPM relative to XLF. We’ve chosen XLF as it represents the general financial-sector landscape, so it will reflect the overall sector outlook more consistently than JPM on its own.

```
jpm = AlpacaMarkets.stock_bars("JPM", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1]
xlf = AlpacaMarkets.stock_bars("XLF", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1];
spy = AlpacaMarkets.stock_bars("SPY", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1];
```

We want to clean the data to format the date correctly and select the close and open columns.

```
function parse_date(t)
Date(string(((split(t, "T")))[1]))
end
function clean(df, x)
df = @transform(df, :Date = parse_date.(:t), :Ticker = x, :NextOpen = [:o[2:end]; NaN])
@select(df, :Date, :c, :o, :Ticker, :NextOpen)
end
```

Now we calculate the close-to-close log returns and format the data into a column for each asset.

```
jpm = clean(jpm, "JPM")
xlf = clean(xlf, "XLF")
spy = clean(spy, "SPY")
allPrices = vcat(jpm, xlf, spy)
allPrices = sort(allPrices, :Date)
allPrices = @transform(groupby(allPrices, :Ticker),
:Return = [NaN; diff(log.(:c))],
:ReturnO = [NaN; diff(log.(:o))],
:ReturnTC = [NaN; diff(log.(:NextOpen))]);
modelData = unstack(@select(allPrices, :Date, :Ticker, :Return), :Date, :Ticker, :Return)
modelData = modelData[2:end, :];
last(modelData, 4)
```

Date | JPM | XLF | SPY |
---|---|---|---|
2023-06-30 | 0.0138731 | 0.00864001 | 0.0117316 |
2023-07-03 | 0.00799894 | 0.00562049 | 0.00114985 |
2023-07-05 | -0.00661524 | -0.00206703 | -0.0014883 |
2023-07-06 | -0.00993581 | -0.00860923 | -0.00786148 |

Looking at the actual returns we can see that all three move in sync

```
plot(modelData.Date, cumsum(modelData.JPM), label = "JPM")
plot!(modelData.Date, cumsum(modelData.XLF), label = "XLF")
plot!(modelData.Date, cumsum(modelData.SPY), label = "SPY", legend = :left)
```

The key point is that they are moving in sync with each other. Given XLF has JPM included in it, this is expected but it also presents the opportunity to trade around any dispersion between the ETF and the individual name.

- For background on how the Ornstein-Uhlenbeck process can be considered the continuous-time analogue of an AR(1) model: https://math.stackexchange.com/questions/345773/how-the-ornstein-uhlenbeck-process-can-be-considered-as-the-continuous-time-anal

Let’s think simply about pairs trading. We have two securities that we want to trade if their prices change too much, so our variable of interest is

\[e = P_1 - P_2\]and we will enter a trade if \(e\) becomes large enough in either the positive or negative direction.

To translate that into a statistical problem we have two steps.

- Work out the difference between the two securities
- Model how the difference changes over time.

Step 1 is a simple regression of the stock vs the ETF we are trading against. Step 2 needs a bit more thought, but is still only a simple regression.

In our data, we have the daily returns of JPM, the XLF ETF, and the SPY ETF. To work out the interdependence, it’s just a case of simple linear regression.

```
regModel = lm(@formula(JPM ~ XLF + SPY), modelData)
```

```
JPM ~ 1 + XLF + SPY
Coefficients:
──────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────────
(Intercept) 0.000188758 0.000162973 1.16 0.2469 -0.0001309 0.000508417
XLF 1.35986 0.0203485 66.83 <1e-99 1.31995 1.39977
SPY -0.363187 0.0260825 -13.92 <1e-41 -0.414345 -0.312028
──────────────────────────────────────────────────────────────────────────────────
```

From the slopes of the model, we can see that JPM ≈ 1.36·XLF − 0.36·SPY, so JPM has a \(\beta\) of 1.36 to the XLF index and a \(\beta\) of −0.36 to the SPY ETF, or general market. So each day, we can approximate JPM’s return by combining the XLF and SPY returns with these betas.
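As a quick worked example (with hypothetical ETF returns, not taken from the data), the fitted coefficients from the regression output above approximate JPM’s daily return like so:

```python
# Coefficients from the regression output above (intercept, XLF beta, SPY beta).
alpha, beta_xlf, beta_spy = 0.000188758, 1.35986, -0.363187

# Hypothetical daily log returns for the two ETFs.
xlf_ret, spy_ret = 0.01, 0.005

# Factor-model approximation of JPM's return for that day.
jpm_hat = alpha + beta_xlf * xlf_ret + beta_spy * spy_ret
print(jpm_hat)  # roughly 0.012, i.e. about a 1.2% move
```

The gap between this approximation and JPM’s actual return each day is the residual that the rest of the model works with.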

This is our economic factor model, which describes in a ‘big picture’ way how the stock trades vs the general market (SPY) and its sector-specific market (XLF).

What we need to do next is look at what this model *doesn’t* explain
and try and describe that.

Any difference around this model is captured by the residuals and their summation over time. In the paper, the cumulative sum of the residuals is called the ‘auxiliary process’, and this is the data behind the second regression.

```
plot(scatter(modelData.Date, residuals(regModel), label = "Residuals"),
plot(modelData.Date,cumsum(residuals(regModel)),
label = "Aux Process"),
layout = (2,1))
```

We believe the auxiliary process (the cumulative sum of the residuals) can be modeled using an Ornstein-Uhlenbeck (OU) process.

An OU process is a type of differential equation that displays mean reversion behaviour. If the process falls away from its average level then it will be forced back.

\[dX = \kappa (m - X(t))dt + \sigma \mathrm{d} W\]\(\kappa\) represents how quickly the mean reversion occurs, \(m\) is the long-run mean level, and \(\sigma\) scales the noise.

To fit this type of process we need to recognise that the above differential form of an OU process can be discretised to become a simple AR(1) model where the model parameters can be transformed to get the OU parameters.
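This is the standard exact discretisation of an OU process sampled at intervals of \(\Delta t\) (here one trading day, so \(\Delta t = 1/252\) years):

\[X_{n+1} = a + b X_n + \epsilon_n, \qquad \epsilon_n \sim N(0, \sigma^2 _\epsilon),\]with \(b = e^{-\kappa \Delta t}\), so that

\[\kappa = -\frac{\log b}{\Delta t}, \qquad m = \frac{a}{1 - b}, \qquad \sigma^2 = \frac{2 \kappa \sigma^2 _\epsilon}{1 - b^2}, \qquad \sigma^2 _{\text{eq}} = \frac{\sigma^2 _\epsilon}{1 - b^2},\]where \(\sigma^2 _{\text{eq}}\) is the stationary (equilibrium) variance of the process. These are the transformations applied in the code below, with the AR(1) residual variance playing the role of \(\sigma^2 _\epsilon\).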

We now fit the OU process onto the cumulative sum of the residuals from the first model. If the residuals have some sort of structure/pattern then this means our original model was missing some variable that explains the difference.

```
X = cumsum(residuals(regModel))
xDF = DataFrame(y=X[2:end], x = X[1:end-1])
arModel = lm(@formula(y~x), xDF)
```

```
y ~ 1 + x
Coefficients:
─────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept) 4.41618e-6 0.000162655 0.03 0.9783 -0.000314618 0.000323451
x 0.997147 0.00186733 534.00 <1e-99 0.993484 1.00081
─────────────────────────────────────────────────────────────────────────────────
```

We take these coefficients and transform them into the parameters from the paper.

```
varEta = var(residuals(arModel))
a, b = coef(arModel)
k = -log(b)*252
m = a/(1-b)
sigma = sqrt((varEta * 2 * k) / (1-b^2))
sigma_eq = sqrt(varEta / (1-b^2))
[m, sigma_eq]
```

```
2-element Vector{Float64}:
0.0015477568390823153
0.08709971423424319
```

So \(m\) gives us the average level and \(\sigma_{\text{eq}}\) the appropriate scale.

Now to build the mean reversion signal. We still have \(X\) as our auxiliary process which we believe is mean reverting. We now have the estimated parameters on the scale of this mean reversion so we can transform the auxiliary process by these parameters and use this to see when the process is higher or lower than the model suggests it should be.

```
modelData.Score = (X .- m)./sigma_eq;
plot(modelData.Date, modelData.Score, label = "s")
hline!([-1.25], label = "Long JPM, Short XLF", color = "red")
hline!([-0.5], label = "Close Long Position", color = "red", ls=:dash)
hline!([1.25], label = "Short JPM, Long XLF", color = "purple")
hline!([0.75], label = "Close Short Position", color = "purple", ls = :dash, legend=:topleft)
```

The red lines indicate when JPM has diverged from XLF on the negative side, i.e. we expect JPM to move higher and XLF to move lower. We enter the position if s < -1.25 (solid red line) and exit the position when s > -0.5 (dashed red line).

- Buy to open if \(s < -s_{bo}\) (s < -1.25): buy 1 JPM, sell \(\beta\) XLF
- Close long if \(s > -s_{c}\) (s > -0.5)

The purple line is the same but in the opposite direction.

- Sell to open if \(s > s_{so}\) (s > 1.25): sell 1 JPM, buy \(\beta\) XLF
- Close short if \(s < s_{bc}\) (s < 0.75)
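These entry/exit rules form a small state machine with hysteresis: once a position is open, it is held until the score crosses back over the less extreme closing threshold. A minimal Python sketch of the logic (illustrative only, with `+1` meaning long JPM/short XLF and `-1` the reverse):

```python
# Hysteresis rules for the score signal s: enter at the outer thresholds,
# hold until the score crosses back over the inner (closing) thresholds.
def update_position(prev, s):
    if s >= 1.25:                 # sell to open: short JPM, long XLF
        return -1
    if s <= -1.25:                # buy to open: long JPM, short XLF
        return 1
    if prev == -1 and s >= 0.75:  # short stays open until s < 0.75
        return -1
    if prev == 1 and s <= -0.5:   # long stays open until s > -0.5
        return 1
    return 0                      # otherwise flat

scores = [0.0, 1.3, 0.9, 0.6, -1.4, -0.7, -0.3]
positions = []
pos = 0
for s in scores:
    pos = update_position(pos, s)
    positions.append(pos)
print(positions)  # [0, -1, -1, 0, 1, 1, 0]
```

Note how the short opened at 1.3 survives the dip to 0.9 but closes at 0.6, which is exactly the behaviour the solid and dashed lines on the plot describe.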

That’s the modeling part done. We model how the stock moves based on the overall market, and then any differences from this we model as an OU process to come up with the mean reversion parameters.

So, does it make money?

To backtest this type of model we have to roll through time and calculate both regressions to construct the signal.

A couple of new additions too:

- We shift and scale the returns when doing the macro regression.
- The auxiliary process on the last day is always 0, which makes calculating the signal simple.
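That second point is worth a quick check: because the macro regression includes an intercept, its residuals sum to zero, so the cumulative-sum auxiliary process always ends at zero and the score on the last day reduces to \(-m/\sigma_{\text{eq}}\), as in the loop below. A small numpy sketch (with made-up data) confirms it:

```python
import numpy as np

# Made-up regression data: y against x with an intercept.
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

# OLS fit with an intercept column.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# The residuals sum to ~0, so the cumulative-sum auxiliary process ends at ~0.
aux = np.cumsum(resid)
print(abs(aux[-1]) < 1e-9)  # True
```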

```
paramsRes = Array{DataFrame}(undef, length(90:(nrow(modelData) - 90)))
for (j, i) in enumerate(90:(nrow(modelData) - 90))
modelDataSub = modelData[i:(i+90), :]
modelDataSub.JPM = (modelDataSub.JPM .- mean(modelDataSub.JPM)) ./ std(modelDataSub.JPM)
modelDataSub.XLF = (modelDataSub.XLF .- mean(modelDataSub.XLF)) ./ std(modelDataSub.XLF)
modelDataSub.SPY = (modelDataSub.SPY .- mean(modelDataSub.SPY)) ./ std(modelDataSub.SPY)
macroRegr = lm(@formula(JPM ~ XLF + SPY), modelDataSub)
auxData = cumsum(residuals(macroRegr))
ouRegr = lm(@formula(y~x), DataFrame(x=auxData[1:end-1], y=auxData[2:end]))
varEta = var(residuals(ouRegr))
a, b = coef(ouRegr)
k = -log(b)*252
m = a/(1-b)
sigma = sqrt((varEta * 2 * k) / (1-b^2))
sigma_eq = sqrt(varEta / (1-b^2))
paramsRes[j] = DataFrame(Date= modelDataSub.Date[end],
MacroBeta_XLF = coef(macroRegr)[2], MacroBeta_SPY = coef(macroRegr)[3], MacroAlpha = coef(macroRegr)[1],
VarEta = varEta, OUA = a, OUB = b, OUK = k, Sigma = sigma, SigmaEQ=sigma_eq,
Score = -m/sigma_eq)
end
paramsRes = vcat(paramsRes...)
last(paramsRes, 4)
```

4 rows × 11 columns (omitted printing of 4 columns)

Date | MacroBeta_XLF | MacroBeta_SPY | MacroAlpha | VarEta | OUA | OUB |
---|---|---|---|---|---|---|
2023-06-30 | 0.974615 | -0.230273 | 1.10933e-17 | 0.331745 | 0.175358 | 0.830417 |
2023-07-03 | 0.96943 | -0.228741 | -5.73883e-17 | 0.331222 | 0.198176 | 0.826816 |
2023-07-05 | 0.971319 | -0.230438 | 2.38846e-17 | 0.335844 | 0.242754 | 0.841018 |
2023-07-06 | 0.974721 | -0.232765 | 5.09875e-17 | 0.331695 | 0.256579 | 0.823822 |

The benefit of doing it this way is that we can also see how each \(\beta\) in the macro regression evolves.

```
plot(paramsRes.Date, paramsRes.MacroBeta_XLF, label = "XLF Beta")
plot!(paramsRes.Date, paramsRes.MacroBeta_SPY, label = "SPY Beta")
```

Good to see they are consistent in their signs and generally don’t vary a great deal.

In the OU process, we are also interested in the speed of the mean reversion as we don’t want to take a position that is very slow to revert to the mean level.

```
kplot = plot(paramsRes.Date, paramsRes.OUK, label = :none)
kplot = hline!([252/45], label = "K Threshold")
```

In the paper, they suggest making sure the reversion happens within half of the estimation period. As we are using 90 days, that means requiring \(k > 252/45 \approx 5.6\), i.e. a characteristic reversion time of less than 45 trading days; the horizontal line shows this threshold.

Plotting the score function also shows how the model wants to go long/short the different components over time.

```
splot = plot(paramsRes.Date, paramsRes.Score, label = "Score")
hline!([-1.25], label = "Long JPM, Short XLF", color = "red")
hline!([-0.5], label = "Close Long Position", color = "red", ls=:dash)
hline!([1.25], label = "Short JPM, Long XLF", color = "purple")
hline!([0.75], label = "Close Short Position", color = "purple", ls = :dash)
```

We run through the allocation procedure and label the positions: ±1 unit of JPM with the offsetting \(\mp\beta\) amounts of the ETFs.

```
paramsRes.JPM_Pos .= 0.0
paramsRes.XLF_Pos .= 0.0
paramsRes.SPY_Pos .= 0.0
for i in 2:nrow(paramsRes)
if paramsRes.OUK[i] > 252/45
if paramsRes.Score[i] >= 1.25
paramsRes.JPM_Pos[i] = -1
paramsRes.XLF_Pos[i] = paramsRes.MacroBeta_XLF[i]
paramsRes.SPY_Pos[i] = paramsRes.MacroBeta_SPY[i]
elseif paramsRes.Score[i] >= 0.75 && paramsRes.JPM_Pos[i-1] == -1
paramsRes.JPM_Pos[i] = -1
paramsRes.XLF_Pos[i] = paramsRes.MacroBeta_XLF[i]
paramsRes.SPY_Pos[i] = paramsRes.MacroBeta_SPY[i]
end
if paramsRes.Score[i] <= -1.25
paramsRes.JPM_Pos[i] = 1
paramsRes.XLF_Pos[i] = -paramsRes.MacroBeta_XLF[i]
paramsRes.SPY_Pos[i] = -paramsRes.MacroBeta_SPY[i]
elseif paramsRes.Score[i] <= -0.5 && paramsRes.JPM_Pos[i-1] == 1
paramsRes.JPM_Pos[i] = 1
paramsRes.XLF_Pos[i] = -paramsRes.MacroBeta_XLF[i]
paramsRes.SPY_Pos[i] = -paramsRes.MacroBeta_SPY[i]
end
end
end
```

To make sure we use the right price return, we lead the return columns by one, so that when we enter a position we receive the next day’s return.

```
modelData = @transform(modelData, :NextJPM= lead(:JPM, 1),
:NextXLF = lead(:XLF, 1),
:NextSPY = lead(:SPY, 1))
paramsRes = leftjoin(paramsRes, modelData[:, [:Date, :NextJPM, :NextXLF, :NextSPY]], on=:Date)
portRes = @combine(groupby(paramsRes, :Date), :Return = :NextJPM .* :JPM_Pos .+ :NextXLF .* :XLF_Pos .+ :NextSPY .* :SPY_Pos);
plot(portRes.Date, cumsum(portRes.Return), label = "Stat Arb Return")
```

Sad trombone noise. This is not a great result, as we’ve ended up
negative over the period. However, given the paper is 15 years old, it
would be very rare to still be able to make money this way
after *everyone* knows how to do it. Plus, I’ve only used one stock vs
the ETF portfolio; you typically want to diversify and use all the
stocks in the ETF, going long and short multiple single names and using
the ETF as a minimal hedge.

The good thing about it being a negative result means that we don’t have to start considering transaction costs or other annoying things like that.

When we break out the components of the strategy, we can see that it appears to pick the right times to go short/long JPM and SPY; it’s the hedging with the XLF ETF that is dragging the portfolio down.

```
plot(paramsRes.Date, cumsum(paramsRes.NextJPM .* paramsRes.JPM_Pos), label = "JPM Component")
plot!(paramsRes.Date, cumsum(paramsRes.NextXLF .* paramsRes.XLF_Pos), label = "XLF Component")
plot!(paramsRes.Date, cumsum(paramsRes.NextSPY .* paramsRes.SPY_Pos), label = "SPY Component")
plot!(portRes.Date, cumsum(portRes.Return), label = "Stat Arb Portfolio")
```

So whilst naively trying to trade the stat arb portfolio is probably a loss maker, there might be some value in using the model as a signal input or overlay to another strategy.

What about if we up the frequency and look at intraday stat arb?

Crypto markets are open 24 hours a day, 7 days a week, which gives much more opportunity to build a continuous trading model. We look back over the last year and repeat the backtesting process to see if this bears any fruit.

Once again AlpacaMarkets gives us an easy way to pull the hourly bar data for both ETH and BTC.

```
btcRaw = AlpacaMarkets.crypto_bars("BTC/USD", "1Hour"; startTime = now() - Year(1), limit = 10000)[1]
ethRaw = AlpacaMarkets.crypto_bars("ETH/USD", "1Hour"; startTime = now() - Year(1), limit = 10000)[1];
btc = @transform(btcRaw, :ts = DateTime.(chop.(:t)), :Ticker = "BTC")
eth = @transform(ethRaw, :ts = DateTime.(chop.(:t)), :Ticker = "ETH")
btc = btc[:, [:ts, :Ticker, :c]]
eth = eth[:, [:ts, :Ticker, :c]]
allPrices = vcat(btc, eth)
allPrices = sort(allPrices, :ts)
allPrices = @transform(groupby(allPrices, :Ticker),
:Return = [NaN; diff(log.(:c))]);
modelData = unstack(@select(allPrices, :ts, :Ticker, :Return), :ts, :Ticker, :Return);
modelData = @subset(modelData, .! isnan.(:ETH .+ :BTC))
```

Plotting out the returns we can see they are loosely related just like the stock example.

```
plot(modelData.ts, cumsum(modelData.BTC), label = "BTC")
plot!(modelData.ts, cumsum(modelData.ETH), label = "ETH")
```

We will be using BTC as the ‘index’ and see how ETH is related.

```
regModel = lm(@formula(ETH ~ BTC), modelData)
```

```
ETH ~ 1 + BTC
Coefficients:
─────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept) 7.72396e-6 3.64797e-5 0.21 0.8323 -6.37847e-5 7.92327e-5
BTC 1.115 0.00673766 165.49 <1e-99 1.10179 1.12821
─────────────────────────────────────────────────────────────────────────────
```

A fairly high beta for ETH against BTC. We now use a 90-hour rolling window instead of the 90-day one.

```
window = 90
paramsRes = Array{DataFrame}(undef, length(window:(nrow(modelData) - window)))
for (j, i) in enumerate(window:(nrow(modelData) - window))
modelDataSub = modelData[i:(i+window), :]
modelDataSub.ETH = (modelDataSub.ETH .- mean(modelDataSub.ETH)) ./ std(modelDataSub.ETH)
modelDataSub.BTC = (modelDataSub.BTC .- mean(modelDataSub.BTC)) ./ std(modelDataSub.BTC)
macroRegr = lm(@formula(ETH ~ BTC), modelDataSub)
auxData = cumsum(residuals(macroRegr))
ouRegr = lm(@formula(y~x), DataFrame(x=auxData[1:end-1], y=auxData[2:end]))
varEta = var(residuals(ouRegr))
a, b = coef(ouRegr)
k = -log(b)/((1/24)/252)
m = a/(1-b)
sigma = sqrt((varEta * 2 * k) / (1-b^2))
sigma_eq = sqrt(varEta / (1-b^2))
paramsRes[j] = DataFrame(ts= modelDataSub.ts[end], MacroBeta = coef(macroRegr)[2], MacroAlpha = coef(macroRegr)[1],
VarEta = varEta, OUA = a, OUB = b, OUK = k, Sigma = sigma, SigmaEQ=sigma_eq,
Score = -m/sigma_eq)
end
paramsRes = vcat(paramsRes...)
```

Again, looking at \(\beta\) over time, we see there has been a sudden shift.

```
plot(plot(paramsRes.ts, paramsRes.MacroBeta, label = "Macro Beta", legend = :left),
plot(paramsRes.ts, paramsRes.OUK, label = "K"), layout = (2,1))
```

Interesting that there has been a big change in \(\beta\) between ETH and BTC recently that has suddenly reverted. Ok, onto the backtesting again.

```
paramsRes.ETH_Pos .= 0.0
paramsRes.BTC_Pos .= 0.0
for i in 2:nrow(paramsRes)
if paramsRes.OUK[i] > (252/(1/24)/45)
if paramsRes.Score[i] >= 1.25
paramsRes.ETH_Pos[i] = -1
paramsRes.BTC_Pos[i] = paramsRes.MacroBeta[i]
elseif paramsRes.Score[i] >= 0.75 && paramsRes.ETH_Pos[i-1] == -1
paramsRes.ETH_Pos[i] = -1
paramsRes.BTC_Pos[i] = paramsRes.MacroBeta[i]
end
if paramsRes.Score[i] <= -1.25
paramsRes.ETH_Pos[i] = 1
paramsRes.BTC_Pos[i] = -paramsRes.MacroBeta[i]
elseif paramsRes.Score[i] <= -0.5 && paramsRes.ETH_Pos[i-1] == 1
paramsRes.ETH_Pos[i] = 1
paramsRes.BTC_Pos[i] = -paramsRes.MacroBeta[i]
end
end
end
modelData = @transform(modelData, :NextETH= lead(:ETH, 1), :NextBTC = lead(:BTC, 1))
paramsRes = leftjoin(paramsRes, modelData[:, [:ts, :NextETH, :NextBTC]], on=:ts)
portRes = @combine(groupby(paramsRes, :ts), :Return = :NextETH .* :ETH_Pos .+ :NextBTC .* :BTC_Pos);
plot(portRes.ts, cumsum(portRes.Return))
```

This looks slightly better. At least it is positive at the end of the testing period.

```
plot(paramsRes.ts, cumsum(paramsRes.NextETH .* paramsRes.ETH_Pos), label = "ETH Component")
plot!(paramsRes.ts, cumsum(paramsRes.NextBTC .* paramsRes.BTC_Pos), label = "BTC Component")
plot!(portRes.ts, cumsum(portRes.Return), label = "Stat Arb Portfolio", legend=:topleft)
```

Again, the components of the portfolio seem to be ok in the ETH case, but generally this comes from the overall long bias. Unlike the JPM/XLF example, there isn’t much more diversification we can add that might help. We could add more crypto assets, or an equity/gold angle, but then it becomes more of an asset-class arb than something truly statistical.

The original paper is one of those that all quants get recommended to read, and statistical arbitrage is a concept that you probably understand in theory, but actually doing it is another question. Hopefully this blog post gets you up to speed with the basic concepts and how to implement them. It boils down to two steps.

- Model as much as you can with a simple regression
- Model what’s left over as an OU process.

It can work with both high-frequency and low-frequency data, so have a look at different combinations of assets and see if you have more luck than I did backtesting.

If you do end up seeing something positive, make sure you are backtesting properly!


Step zero is to get yourself a GPS watch. I’ve got a Garmin 245 but any watch that can track your route and also your heart rate will do the job. You need the GPS to know how far you’ve gone, how fast you are running, and the heart rate monitoring to know how hard you worked. You’ll also want to be recording your heart rate throughout the day and when you sleep to get an accurate picture of your resting heart rate. Additionally, most watches also let you program various types of runs into the watch and schedule them into your calendar. This can save the mental load of trying to count laps or guess how fast you have been going whilst keeping you organised in the training.

When it comes to running in an actual race, Garmin will also pace the route out with PacePro and account for the elevation so your watch will tell you every kilometer how fast to go to hit your target time.

So in short, your watch will become your best friend in this training process.

Once you have the watch you’ll want to connect it to Runalyze. You might have heard of Strava, Runalyze is Strava that went to Uni and got a PhD in Sports Science. You get more of an idea of your training status and better tracking of each run and the effect the training is having on your fitness. It will also provide you with training paces that use ‘the science’ and also race time predictions. The predictions are also pretty accurate and lined up with my maximum efforts.

Buying a chest strap heart rate monitor is an optional extra. I had one from Garmin, but it broke, and I never felt the need to replace it. The newer watches track the additional metrics it used to provide (ground contact time, vertical oscillation). The heart rate monitoring tech in the watches is always improving, so the actual benefits of an additional heart rate monitor are lower now than they were a few years ago.

What do you wear when you are running? This is less important. Just go to TK Maxx and get whatever is cheap: shorts, T-shirts, socks, etc. That’s what I did! You just want something light and comfortable that won’t get soaked in sweat. Maybe a jacket if you are running when it is cold. If it is going to be cold, get a Merino base layer, specifically Merino, as it is very effective at keeping you warm. You’ll also want a running cap to keep the sun off your face, or a hat to keep your ears warm. You do you and whatever makes you comfortable. Spend loads on Soar Running or go cheap like me.

That money you saved on clothes, spend on running trainers. Notice that’s plural: you’ll want (though you don’t need) multiple different pairs. A slow pair, a medium pair (optional), and a fast pair.

The slow pair is for your easy runs. They need to be comfortable with plenty of cushioning and feel like clouds for your feet. My choice in this category is the Brooks Glycerins.

The fast pair is what you run the race in. They will have all the technology, like plates built into the sole and super-modern foam that returns energy to your legs. You’ll want to minimise the amount of training in this pair as, generally, they are more brittle than everyday shoes, but you want to make sure you can run the distance in them. I ran the marathon in Saucony Endorphin Speed 3s, which have a nylon plate and feel quick on your feet. You might have heard of the Nike supershoes (Alphafly, Vaporfly) that have carbon plates. This is what we are after: something to run quickly in, with as much help from the technology as possible.

You then want something in between the slow shoes and the fast shoes, a medium pair per se. This is for quicker runs where the slow shoes can feel clunky. I went for the Saucony Guide 15s. Handy to pack in a suitcase too, as they are a bit more lightweight than the Glycerins.

Ok now, you are fully dressed, new shoes laced up, how do you approach the training?

Run lots. To get better at running you have to put the hours in and do plenty of running. Throughout my training, I was on average running 6 of the 7 days a week. Just the process of running helps improve your fitness and gets you used to running long distances and for a long time. So unfortunately, there is no secret sauce, no hidden training method, just the harsh reality of giving up a part of your day to pound the streets. The majority of your running still needs to be slow; you should be comfortable and get through the easy runs without any trouble. Running more miles is more important than running fast miles. If you kill yourself running quickly one day and need to take two days off to recover, then you are at a net loss in terms of progress. So just slow down and get out more often.

Once you’ve got used to running frequently you can start to introduce more structure into the runs. You’ll want a weekly ‘long run’ where you are out for more than 90 minutes. Beyond the hour-and-a-half mark is when the body stops burning short-term energy reserves and switches to long-term energy stores. The long run should be slow enough that you can reach that magic 90-minute mark easily and continue for some time afterward. This long run is where your body adapts to going further and running for longer, and it gets you used to switching energy stores. You’ll also want to practice refueling on these long runs, as anything over an hour needs some sort of food to keep yourself performing.

You’ll also want to include two speed sessions each week. These are runs where you’ll be tuning the top end of your running, training to go faster and longer at quicker paces. The recommended way to hit the fast paces is up a hill with intervals. The increased incline means you can reach higher heart rates without putting as much stress on your legs. Doing it as an interval, i.e. running up the hill for a minute and then jogging down for a minute also means you can do a bit more, as the recovery can help extend the workout.

Then there is a ‘threshold’ run, where you run at your anaerobic threshold for an extended period. This is where the heart rate monitoring from the watch comes in: it’s a zone 4 run, where you want to hold the zone for more than 5 minutes at a time. The long runs train your aerobic capacity; the threshold runs train your anaerobic capacity.
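If your watch doesn’t set zones for you, here’s a minimal sketch of the usual five-zone scheme. It assumes band edges at 50/60/70/80/90% of maximum heart rate, which is a common convention rather than any particular device’s defaults:

```python
def hr_zones(max_hr: int) -> dict:
    """Five training zones as (low, high) bpm bands of max heart rate.

    Band edges at 50/60/70/80/90% are a common convention
    (assumption), not Garmin's exact defaults.
    """
    bands = [(1, 0.50, 0.60), (2, 0.60, 0.70), (3, 0.70, 0.80),
             (4, 0.80, 0.90), (5, 0.90, 1.00)]
    return {z: (round(max_hr * lo), round(max_hr * hi)) for z, lo, hi in bands}

# With an assumed max heart rate of 190 bpm, zone 4 comes out at roughly 152-171 bpm
print(hr_zones(190)[4])
```

Plug in your own max heart rate (or better, use a lab-tested or field-tested value) to see where your threshold runs should sit.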

But more importantly, the speed session can help break up the monotony of always running slow. Plus it’s more fun to start to see your records (PRs) for the shorter distances improve as you build up your fitness. But that only comes from the full package: lots of running, a long run, and some speed work.

So in conclusion, run lots, mostly slow, once long, and on the odd occasion fast up a hill.

“It never gets easier, you just go faster” - Greg LeMond

You’ve been running for a few months now, how do you know if you are having any effect? Firstly, some qualitative effects before looking at the data. One, you should be hungry all the time, or at least I was during training. Luckily I needed to shift a few kilograms, so the weight loss was welcomed. Secondly, you should find that the longer runs are getting longer for the same effort. It never gets easier, you just find a 10km run feels like what a 5km run felt like a few months ago.

What about something measurable? You should see your resting heart rate coming down.

This is my resting heart rate from Garmin. I started training in October/November and ramped up to a maximum weekly mileage in 2023. We can see a continuous downward trend as my heart gets stronger. Feb was a tough month where I got ill briefly, and then in April I was on a few long-haul flights that interrupted my training flow. Overall, year on year I’ve dropped my resting heart rate by about 10 bpm. Make sure your watch tracks while you sleep though; otherwise it will be undersampling when you are actually resting.

You should also start to see your average pace improving.

Given that my weekly running was of mixed paces (long run, intervals, etc.), this isn’t a pure comparison, but it still shows a trend: I was getting faster. This is taken from Runalyze.

Seeing your VO2 max go up will also reassure you that the right things are happening. Although the watch only gives an estimate of your VO2 max, it should hopefully be somewhat correlated with cardio performance.

This is a deep dive into the science behind human physiology in different endurance tasks: the more obvious ones like running and cycling, plus the more extreme trail running, Mount Everest climbing, and Antarctic exploration. It’s an engaging read that is very quanty in the sense that it wants experiments to back up claims rather than just anecdotes. So for example, taking ibuprofen in a marathon, swishing energy drinks around your mouth instead of swallowing, and getting yourself mentally tired before going out training are all things backed up by science that will help your performance. Although experiments in sports science always involve only a handful of people, and they are usually elite athletes too, so not quite as rigorous as other fields.

Matt Fitzgerald had the chance to train with an elite group of runners in the high-altitude area of Flagstaff, Arizona. This book details his training and shows how different the elite athletes are compared to us mere mortals. But it also highlights how being elite at anything is still a job. They wake up, run, think about running, and eat/nap to make sure they are fresh for more running. One injury can derail your life and how you earn money. Very stressful. Great book though; I’d recommend it to both inspire and humble.

This website is a repository of information and inspired me to write this blog post. It covers similar topics, but probably in better/more detail. So if you like what I’ve written, you’ll love this website.

Running is a simple activity but needs a little bit of thinking to get the most out of it. I couldn’t be any further from an expert, but I have experienced everything written above and seen my performance and health improve because of it. In the end, I ran my marathon in 04:02:27, which is roughly the modal/mean performance for the average male. In the training process, I managed to drop my 5k from 25 minutes to 23:24, my 10k from 52:44 to 49:17, and my half marathon from 02:15:14 to 01:52:47.
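As a quick sanity check on those numbers, a few lines of Python (my own rough sketch, using the before/after PBs above) turn them into percentage improvements:

```python
def to_seconds(t: str) -> int:
    """Convert a h:mm:ss or mm:ss string to total seconds."""
    secs = 0
    for part in t.split(":"):
        secs = secs * 60 + int(part)
    return secs

# Before/after PBs from the training block
pbs = {
    "5k": ("25:00", "23:24"),
    "10k": ("52:44", "49:17"),
    "half": ("2:15:14", "1:52:47"),
}

for dist, (before, after) in pbs.items():
    b, a = to_seconds(before), to_seconds(after)
    print(f"{dist}: {100 * (b - a) / b:.1f}% faster")
# 5k: 6.4% faster, 10k: 6.5% faster, half: 16.6% faster
```

The half marathon saw by far the biggest jump, which makes sense: the longer distances reward exactly the aerobic base that all those slow miles build.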

When I plug all these times into the VDOT calculator:

Distance | PB | VDOT | 5km equiv | 10km equiv | Half equiv | Marathon equiv
---|---|---|---|---|---|---
5km | 0:23:24 | 41.4 | 00:23:24 | 00:48:35 | 1:47:43 | 3:43:11
10km | 0:49:17 | 40.7 | 00:23:46 | 00:49:17 | 1:49:22 | 3:46:29
Half Marathon | 1:52:47 | 39.2 | 00:24:32 | 00:50:54 | 1:52:47 | 3:53:27
Marathon | 4:02:27 | 37.4 | 00:25:31 | 00:52:58 | 1:57:24 | 4:02:27

The VDOT calculator gives you a barometer of what your records mean for other distances and can give you an idea of both your training paces and also target times for other distances.
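For the curious, the calculator’s VDOT column can be reproduced with the published Daniels-Gilbert formulas. This is my own sketch, not the calculator’s code, but it lands on the same numbers for my PBs:

```python
import math

def vdot(distance_m: float, time_min: float) -> float:
    """Estimate VDOT from a race result via the Daniels-Gilbert formulas."""
    v = distance_m / time_min  # average speed in metres per minute
    # Oxygen cost of running at speed v (ml/kg/min)
    vo2 = -4.60 + 0.182258 * v + 0.000104 * v ** 2
    # Fraction of VO2 max sustainable for a race of this duration
    frac = (0.8
            + 0.1894393 * math.exp(-0.012778 * time_min)
            + 0.2989558 * math.exp(-0.1932605 * time_min))
    return vo2 / frac

# My 5k PB of 23:24 gives a VDOT of about 41.4, matching the table
print(round(vdot(5000, 23 + 24 / 60), 1))
```

Running the same function on the 10k PB of 49:17 gives 40.7, again agreeing with the table, so the gap between my distances really is in the fitness, not the calculator.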

My half-marathon time looks weaker compared to the others, so my next target will be a sub-1:50 half marathon. This means I’ll be lacing up those (probably new) shoes and running lots, mostly slow, with the odd fast session thrown in. Who knows, maybe another marathon could be on the horizon too.
