Enjoy these types of posts? Then you should sign up for my newsletter.

I’ve briefly touched on mean reversion and OU processes before in my Stat Arb - An Easy Walkthrough blog post, where we modelled the spread between an asset and its respective ETF. The concept of ‘mean reversion’ comes up frequently in finance and at different time scales. It can be thought of as the first basic extension of Brownian motion: instead of moving purely at random, the process now has a slight structure and oscillates around a constant value.

The Hudson Thames group have a similar post on OU processes (Mean-Reverting Spread Modeling: Caveats in Calibrating the OU Process) and my post should be a nice complement, with code and some extensions.

As a continuous process, we write the change in \(X_t\) as an increment in time and some noise

\[\mathrm{d}X_t = \theta (\mu - X_t) \mathrm{d}t + \sigma \mathrm{d}W_t\]The amount it changes in time depends on the previous value \(X_t\) and two free parameters, \(\mu\) and \(\theta\).

- The \(\mu\) is the long-term mean of the process.
- The \(\theta\) is the mean-reversion or momentum parameter, depending on its sign.

If \(\theta\) is 0 we can see the equation collapses down to a simple random walk.

If we assume \(\mu = 0\), so the long-term average is 0, then a **positive** value of \(\theta\) means we see mean reversion. Large values of \(X\) mean the next change is likely to have a negative sign, leading to a smaller value in \(X\).

A **negative** value of \(\theta\) means the opposite: a large value of \(X\) generates a further large positive change and the process explodes.
If we discretise the process we can simulate some samples with different parameters to illustrate these two modes:

\[X_{t + \Delta t} = X_t + \theta (\mu - X_t) \Delta t + \sigma \sqrt{\Delta t} W_t,\]

where \(W_t \sim N(0,1)\). This is easy to write out in Julia. We can save some time by drawing the random values first and then just summing everything together.

```
using Distributions, Plots

function simulate_os(theta, mu, sigma, dt, maxT, initial)
    p = Array{Float64}(undef, length(0:dt:maxT))
    p[1] = initial
    # draw all the noise terms up front
    w = sigma * rand(Normal(), length(p)) * sqrt(dt)
    for i in 1:(length(p)-1)
        p[i+1] = p[i] + theta*(mu-p[i])*dt + w[i]
    end
    return p
end
```

We have two classes of OU processes we want to simulate: a mean-reverting version (\(\theta > 0\)) and a momentum version (\(\theta < 0\)). We also want to simulate a random walk (\(\theta = 0\)) at the same time. We will assume \(\mu = 0\), which keeps the pictures simple.

```
maxT = 5
dt = 1/(60*60)
vol = 0.005
initial = 0.00*rand(Normal())
p1 = simulate_os(-0.5, 0, vol, dt, maxT, initial)
p2 = simulate_os(0.5, 0, vol, dt, maxT, initial)
p3 = simulate_os(0, 0, vol, dt, maxT, initial)
plot(0:dt:maxT, p1, label = "Momentum")
plot!(0:dt:maxT, p2, label = "Mean Reversion")
plot!(0:dt:maxT, p3, label = "Random Walk")
```

The mean reversion (orange) hasn’t moved away from the long-term average (\(\mu=0\)) and the momentum series has diverged the furthest from the starting point, which lines up with the name. The random walk sits in between the two, as we would expect.

Now we have successfully simulated the process we want to try and estimate the \(\theta\) parameter from the simulation. We have two slightly different, but related, methods to achieve this.

When we look at the generating equation we can simply rearrange it into a linear equation.

\[\Delta X = \theta \mu \Delta t - \theta \Delta t X_t + \epsilon\]and the usual OLS equation

\[y = \alpha + \beta X + \epsilon\]such that

\[\alpha = \theta \mu \Delta t\] \[\beta = -\theta \Delta t\]where \(\epsilon\) is the noise. So we just need a DataFrame with the difference between subsequent observations and relate that to the current observation. Just a `diff` and a shift.

```
using DataFrames, DataFramesMeta
momData = DataFrame(y=p1)
momData = @transform(momData, :diffY = [NaN; diff(:y)], :prevY = [NaN; :y[1:(end-1)]])
```

Then we use the standard OLS process from the `GLM` package.

```
using GLM

mdl = lm(@formula(diffY ~ prevY), momData[2:end, :])
alpha, beta = coef(mdl)
theta = -beta / dt
mu = alpha / (theta * dt)
```

Which gives us \(\mu = 0.0075, \theta = -0.3989\), so close to zero for the drift and the reversion parameter has the correct sign.

Doing the same for the mean reversion data.

```
# build the same diff/shift DataFrame for the mean-reversion sample
revData = DataFrame(y=p2)
revData = @transform(revData, :diffY = [NaN; diff(:y)], :prevY = [NaN; :y[1:(end-1)]])

mdl = lm(@formula(diffY ~ prevY), revData[2:end, :])
alpha, beta = coef(mdl)
theta = -beta / dt
mu = alpha / (theta * dt)
```

This time \(\mu = 0.001\) and \(\theta = 1.2797\), so some way off the true values (\(\mu = 0\), \(\theta = 0.5\)), but at least with the correct sign.

It could be that we need more data, so we use the bootstrap to randomly sample from the population to give us pseudo-new draws. We use the DataFrames again and pull random rows with replacement to build out the data set. We do this sampling 1000 times.

```
using StatsBase # for sample()

res = zeros(1000)
for i in 1:1000
    mdl = lm(@formula(diffY ~ prevY + 0), momData[sample(2:nrow(momData), nrow(momData), replace=true), :])
    res[i] = -first(coef(mdl)/dt)
end
bootMom = histogram(res, label = :none, title = "Momentum", color = "#7570b3")
bootMom = vline!(bootMom, [-0.5], label = "Truth", lw = 2)
bootMom = vline!(bootMom, [0.0], label = :none, color = "black")
```

We then do the same for the reversion data.

```
res = zeros(1000)
for i in 1:1000
    mdl = lm(@formula(diffY ~ prevY + 0), revData[sample(2:nrow(revData), nrow(revData), replace=true), :])
    res[i] = first(-coef(mdl)/dt)
end
bootRev = histogram(res, label = :none, title = "Reversion", color = "#1b9e77")
bootRev = vline!(bootRev, [0.5], label = "Truth", lw = 2)
bootRev = vline!(bootRev, [0.0], label = :none, color = "black")
```

Then combining both the graphs into one plot.

```
plot(bootMom, bootRev,
     layout=(2,1), dpi=900, size=(800, 300),
     background_color=:transparent, foreground_color=:black,
     link=:all)
```

The momentum bootstrap has worked and centred around the correct value, but the same cannot be said for the reversion plot. However, it has correctly guessed the sign.

If we continue assuming that \(\mu = 0\) then we can simplify the OLS to a 1-parameter regression - OLS without an intercept. From the generating process, we can see that this is an AR(1) process - each observation depends on the previous observation by some amount.

\[\phi = \frac{\sum _i X_i X_{i-1}}{\sum _i X_{i-1}^2}\]then the reversion parameter is calculated as

\[\theta = - \frac{\log \phi}{\Delta t}\]since the AR(1) coefficient satisfies \(\phi = 1 - \theta \Delta t \approx e^{-\theta \Delta t}\) for small \(\Delta t\). This gives us a simple equation to calculate \(\theta\).

For the momentum sample:

```
phi = sum(p1[2:end] .* p1[1:(end-1)]) / sum(p1[1:(end-1)] .^2)
-log(phi)/dt
```

Gives \(\theta = -0.50184\), so very close to the true value.

For the reversion sample

```
phi = sum(p2[2:end] .* p2[1:(end-1)]) / sum(p2[1:(end-1)] .^2)
-log(phi)/dt
```

Gives \(\theta = 1.26\), so correct sign, but quite a way off.

Finally, for the random walk

```
phi = sum(p3[2:end] .* p3[1:(end-1)]) / sum(p3[1:(end-1)] .^2)
-log(phi)/dt
```

Produces \(\theta = -0.027\), so quite close to zero.

Again, values are similar to what we expect, so our estimation process appears to be working.

If you aren’t convinced I don’t blame you. Those point estimates above are nowhere near the actual values that simulated the data so it’s hard to believe the estimation method is working. Instead, what we need to do is repeat the process and generate many more price paths and estimate the parameters of each one.

To make things a bit more manageable code-wise, I’m going to introduce a `struct` that contains the parameters and allows us to simulate and estimate in a more contained manner.

```
struct OUProcess
    theta
    mu
    sigma
    dt
    maxT
    initial
end
```

We now write specific functions for this object and this allows us to simplify the code slightly.

```
function simulate(ou::OUProcess)
    simulate_os(ou.theta, ou.mu, ou.sigma, ou.dt, ou.maxT, ou.initial)
end

function estimate(ou::OUProcess)
    p = simulate(ou)
    phi = sum(p[2:end] .* p[1:(end-1)]) / sum(p[1:(end-1)] .^2)
    -log(phi)/ou.dt
end

function estimate(ou::OUProcess, N)
    res = zeros(N)
    for i in 1:N
        res[i] = estimate(ou)  # each call simulates a fresh path
    end
    res
end
```

We use these new functions to draw from the process 1,000 times and sample the parameters for each one, collecting the results as an array.

```
ou = OUProcess(0.5, 0.0, vol, dt, maxT, initial)
revPlot = histogram(estimate(ou, 1000), label = :none, title = "Reversion")
vline!(revPlot, [0.5], label = :none);
```

And the same for the momentum OU process

```
ou = OUProcess(-0.5, 0.0, vol, dt, maxT, initial)
momPlot = histogram(estimate(ou, 1000), label = :none, title = "Momentum")
vline!(momPlot, [-0.5], label = :none);
```

Plotting the distribution of the results gives us a decent understanding of how varied the samples can be.

```
plot(revPlot, momPlot, layout = (2,1), link=:all)
```

We can see the heavy-tailed nature of the estimation process, but thankfully the histograms are centred around the correct number. This goes to show how difficult it is to estimate the mean reversion parameter even in this simple setup. So for a real dataset, you need to work out how to collect more samples or radically adjust how accurate you think your estimate is.
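To make the ‘more data’ point concrete, here is a small self-contained sketch (the simulator and estimator are re-inlined so it runs on its own; the parameter choices mirror the ones above) that estimates \(\theta\) from paths of increasing length. The spread of the estimates should shrink as `maxT` grows.

```
using Distributions, Statistics

# minimal OU simulator, same Euler scheme as simulate_os above
function sim_ou(theta, mu, sigma, dt, maxT)
    p = zeros(length(0:dt:maxT))
    w = sigma * sqrt(dt) * rand(Normal(), length(p))
    for i in 1:(length(p)-1)
        p[i+1] = p[i] + theta*(mu - p[i])*dt + w[i]
    end
    p
end

# AR(1)-based estimator of theta
function est_theta(p, dt)
    phi = sum(p[2:end] .* p[1:(end-1)]) / sum(p[1:(end-1)] .^ 2)
    -log(phi)/dt
end

dt = 1/(60*60)
for maxT in [1, 5, 25]
    ests = [est_theta(sim_ou(0.5, 0.0, 0.005, dt, maxT), dt) for _ in 1:100]
    println("maxT = $maxT: std of estimates = $(round(std(ests), digits = 3))")
end
```

Longer paths mean more observations per estimate, so the estimator's variance falls without needing a fundamentally different method.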

We have progressed from simulating an Ornstein-Uhlenbeck process to estimating its parameters using various methods. We attempted to enhance the accuracy of the estimates through bootstrapping, but we discovered that the best approach to improve the estimation is to have multiple samples.

So if you are trying to fit this type of process on some real world data, be it the spread between two stocks (Statistical Arbitrage in the U.S. Equities Market), client flow (Unwinding Stochastic Order Flow: When to Warehouse Trades) or anything else you believe might be mean reverting, then understand how much data you might need to accurately model the process.


In this post, I’ll go through what skew is, how it can be used as a trading strategy, and backtest the portfolio across different asset classes. We will then see if it produces any alpha (\(\alpha\)) or if skew is just market beta (\(\beta\)). I’ll then take a deeper dive into the equity performance and how it compares to the typical factors.

I’ll be working through everything in Julia (1.9) and pulling daily data from AlpacaMarkets.

```
using AlpacaMarkets, Dates, CSV, DataFrames, DataFramesMeta, RollingFunctions
using Plots, StatsBase
using Distributions

function parse_date(t)
    Date(string(split(t, "T")[1]))
end

function clean(df, x)
    df = @transform(df, :Date = parse_date.(:t),
        :Ticker = x, :NextOpen = [:o[2:end]; NaN], :LogReturn = [NaN; diff(log.(:c))])
    @select(df, :Date, :Ticker, :c, :o, :NextOpen, :LogReturn)
end

function load(etf)
    df = AlpacaMarkets.stock_bars(etf, "1Day"; startTime = now() - Year(10), limit = 10000, adjustment = "all")[1]
    clean(df, etf)
end
```

Skew (or skewness) measures how asymmetric a distribution is around its mean. A distribution with a longer tail to the right of the mean is positively skewed, and vice versa for a longer left tail.
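As a quick sanity check of the definition, the sample skewness is just the average cubed z-score, so a hand-rolled calculation (on a freshly drawn lognormal sample, purely for illustration) should agree with `StatsBase.skewness`.

```
using Distributions, StatsBase, Statistics

x = rand(LogNormal(0, 0.5), 100_000)      # a positively skewed sample
m, s = mean(x), std(x, corrected = false) # population-style moments
manual_skew = mean(((x .- m) ./ s) .^ 3)  # average cubed z-score
println(manual_skew)
println(skewness(x))  # matches the manual calculation
```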

We can demonstrate this by generating some random values from a skewed distribution (lognormal) and an unskewed one (normal), which shows the general tilt along the x-axis across the three different distributions (the third being the lognormal with its sign flipped).

Skew is weird in the sense that there isn’t a single way to calculate how skewed a distribution is. For the distributions above we can calculate the analytical values of skew: zero for the middle graph and positive (as expected) for the right-hand graph. Since the left-hand graph is the sign-flipped lognormal, it has the negative skew.

```
skewness.([Normal(1,1), LogNormal(0, 0.5)])
```

```
2-element Vector{Float64}:
0.0
1.7501896550697178
```

In the paper, the skew of an asset is calculated as

\[S = \frac{1}{N} \sum _{i=1} ^N \frac{(r_i - \mu ) ^3}{\sigma ^3},\]where \(\mu\) is the average and \(\sigma ^2\) is the variance of the returns of an asset over a lookback window of \(N\). We can look at the skewness of the SPY ETF over a 256-day rolling window using the `RollingFunctions` package.

```
spy = load("SPY")
spy = @transform(spy, :Avg = runmean(:LogReturn, 256), :Dev = runstd(:LogReturn, 256))
spy = @transform(spy, :SkewDay = ((:LogReturn .- :Avg) ./ :Dev) .^3)
spy = @transform(spy, :Skew = runmean(:SkewDay, 256))
spy = @subset(spy, .!isnan.(:Skew))
plot(spy.Date, spy.Skew, label = "SPY Skew", dpi=900, size=(800, 200))
hline!([0], color="black", label = :none)
```

It’s jumpy, but the jumps make sense: it’s a cubed calculation, so large values get amplified. SPY became very negatively skewed over COVID-19 as the market corrections led to large down days. More recently it has turned positively skewed, as we’ve seen some larger positive returns.

The paper believes that skew can predict future returns and that we want to be long assets with a negative skew and short assets with a positive skew. This gives it a ‘mean reversion’ explanation for future returns, so over COVID-19 when there were lots of down days, we should be buying because the movement is likely to be overblown and the market will correct higher. Likewise, large jumps up mean that it’s a positive move that is overblown and will come back down. So again, looking at the skew of SPY in recent weeks, the skew is positive therefore we would be inclined to short this ETF.

The overall strategy looks at **cross-sectional skew**: how skewed an asset is relative to its peers, rather than the raw skew number on a given day. The paper looks at equity indexes across countries, bond futures across different countries, different currencies, and commodities. In our replication, we are going to use different ETFs that cover similar themes and should capture a broad cross-section of finance.

The original paper uses futures data from 1990 up to 2017 to run the backtest. I will instead use different ETFs and a much shorter timescale, because that’s all the data I have available from my `AlpacaMarkets` free account using AlpacaMarkets.jl.

Blackrock is nice enough to publish this document for their different equity funds across the globe, Around the World with iShares Country ETFs, which I use to get the different country equity performance plus some broader indexes.

For the fixed income part I just try and take a cross-section of the different types of fixed income instruments available and different durations, mixing long-term, short-term, government, corporates, etc.

Commodities, again, are just a broad mix, and the Other class is mainly real estate and whatever other cruft comes up on the ETF database website. Finally, the currency ETFs each represent a different currency, covering that part of the paper.

```
universe = [("Equity", ["SPY", "EWU", "EWJ", "INDA", "EWG", "EWL", "EWP", "EWQ",
"VTI", "FXI", "EWZ", "EWY", "EWA", "EWC", "EWG",
"EWH", "EWI", "EWN", "EWD", "EWT", "EZA", "EWW", "ENOR", "EDEN", "TUR"]),
("FI", ["AGG", "TLT", "LQD", "JNK", "MUB", "MBB", "IAGG", "IGOV", "EMB", "BND", "BNDX", "VCIT", "VCSH", "BSV", "SRLN"]),
("Commodities", ["GLD", "SLV", "GSG", "USO", "PPLT", "UNG", "DBA"]),
("Other", ["IYR", "REET", "USRT", "ICF", "VNQ"]),
("Ccy", ["UUP", "FXY", "FXE", "FXF", "FXB", "FXA", "FXC"])
]
```

We iterate through all the asset classes and pull the most amount of daily data possible.

```
allDataRaw = Array{DataFrame}(undef, length(universe))
for (j, (assetClass, etfs)) in enumerate(universe)
    println(assetClass)
    resdf = Array{DataFrame}(undef, length(etfs))
    for (i, etf) in enumerate(etfs)
        resdf[i] = load(etf)
    end
    resdfC = vcat(resdf...)
    resdfC.AssetClass .= assetClass
    allDataRaw[j] = resdfC
end
allData = vcat(allDataRaw...);
```

We then add in the averages \(\mu\), standard deviation \(\sigma\), and calculate the skew value for each day before taking the rolling average to arrive at the overall skew measure. We need to group by each ETF (the `Ticker` column).

```
allData = groupby(allData, :Ticker)
allData = @transform(allData, :Avg = runmean(:LogReturn, 256), :Dev = runstd(:LogReturn, 256))
allData = @transform(allData, :SkewDay = ((:LogReturn .- :Avg) ./ :Dev) .^3)
allData = @transform(allData, :Skew = runmean(:SkewDay, 256))
allData = @subset(allData, .!isnan.(:Skew));
```

To check we’ve pulled the right data we plot the cumulative log returns.

```
plot(allData[allData.Ticker .== "SPY", :].Date, cumsum(allData[allData.Ticker .== "SPY", :].LogReturn), label = "SPY",
title="Returns", dpi=900, size=(800, 200))
plot!(allData[allData.Ticker .== "GLD", :].Date, cumsum(allData[allData.Ticker .== "GLD", :].LogReturn), label = "GLD")
plot!(allData[allData.Ticker .== "AGG", :].Date, cumsum(allData[allData.Ticker .== "AGG", :].LogReturn), label = "AGG")
```

Everything looks as we would expect. We can now look at the skew for these three assets.

The skews move differently and with different magnitudes; notably GLD has the least variable skew, while equity and bonds follow a similar pattern.

The paper takes the skew of each asset on the last day of the month and uses that to rebalance the portfolio, so with a `groupby` and `last` we can pull the skew value on the last day of the month.

We need to avoid the look-ahead bias in the backtest. The portfolio weight is calculated using the last day of the month, so we observe the closing price and use that to calculate the return and update the parameters - average return, volatility, and finally the skew. This skew then goes into the weighting calculation *but* it is only active on the next working day, otherwise, we are getting a ‘free’ day of return.

So on the 31st of January, we update the weights and then do the rebalance on the 1st of February (assuming that’s a working day). There is also the additional cost of trading into the position; for the minute we assume we can trade at the previous closing price, but that is a problem to solve for another day.

```
allData = @transform(allData, :Month = floor.(:Date, Month(1)), :Week = floor.(:Date, Week(1)));
allData = @transform(groupby(allData, :Ticker), :NextDay = [:Date[2:end]; Date(2015)])
monthlyVals = @combine(groupby(allData, [:Month, :AssetClass, :Ticker]),
:Date = last(:Date), :NextDate = last(:NextDay),
:EOMSkew = last(:Skew));
```

We rank each asset in its respective asset class using the negative of the skew value, so the most positive skew gets the lowest rank and the most negative skew gets the highest rank. We also normalise the ranks by the number of assets in the group.

To come up with the portfolio weight, we want all the long positions (positive ranks) to have a total weighting of 1 and short positions (negative ranks) to have a total weighting of -1. This corresponds to being long 1 dollar and short 1 dollar so self-financed overall.

```
monthlyVals = groupby(monthlyVals, [:Date, :AssetClass])
monthlyVals = @transform(monthlyVals, :SkewWeightRaw = ordinalrank(-1*:EOMSkew) .- ((length(:EOMSkew) + 1) /2))
monthlyVals = groupby(monthlyVals, [:Date, :AssetClass])
monthlyVals = @transform(monthlyVals, :SkewWeight = :SkewWeightRaw ./ sum(1:maximum(:SkewWeightRaw)))
```
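To sanity-check the weighting scheme, here is a toy version with five made-up skew values: the centred ordinal ranks, once normalised by the sum of the positive ranks, give longs summing to +1 and shorts summing to -1.

```
using StatsBase

skews = [0.3, 0.1, -0.05, -0.2, -0.4]        # hypothetical end-of-month skews
n = length(skews)
raw = ordinalrank(-1 .* skews) .- (n + 1)/2  # most negative skew gets the largest rank
w = raw ./ sum(1:floor(Int, maximum(raw)))   # normalise by the sum of the positive ranks
println(w)
println(sum(w[w .> 0]))  # the longs sum to 1
println(sum(w[w .< 0]))  # the shorts sum to -1
```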

For example, if we look at the commodity ETFs and their latest skew values and how that changes the portfolio weights.

Date | Asset Class | Ticker | EOM Skew | SkewWeightRaw | Skew Weight |
---|---|---|---|---|---|
2024-02-07 | Commodities | GLD | 0.23 | -3 | -0.5 |
2024-02-07 | Commodities | SLV | 0.02 | -2 | -0.333 |
2024-02-07 | Commodities | DBA | -0.04 | -1 | -0.167 |
2024-02-07 | Commodities | PPLT | -0.07 | 0 | 0 |
2024-02-07 | Commodities | GSG | -0.12 | 1 | 0.167 |
2024-02-07 | Commodities | UNG | -0.16 | 2 | 0.333 |
2024-02-07 | Commodities | USO | -0.19 | 3 | 0.5 |

The most negatively skewed ETF, USO, gets the highest positive weight and vice versa. If we look at the weights over the period for the three example assets.

The portfolio weights for both SPY and AGG show that the last two months have been short SPY and no position in AGG. GLD has been allocated in the opposite direction to the other two, right now we are short GLD.

We join the weights to the original dataframe and forward fill the weightings to look at the daily performance. I pulled a forward-fill function from https://hongtaoh.com/en/2021/06/27/julia-ffill/, and joining the portfolio weights to the daily returns lets us track the daily changes in the portfolios.

```
# carry the last non-missing value forward
ffill(v) = v[accumulate(max, [i*!ismissing(v[i]) for i in 1:length(v)], init=1)]

weightings = @select(monthlyVals, :NextDate, :Ticker, :SkewWeight)
rename!(weightings, :NextDate => :Date)
allDataWeights = leftjoin(allData, weightings, on=[:Date, :Ticker]);
allDataWeights = sort(allDataWeights, :Date)
allDataWeights = @transform(groupby(allDataWeights, :Ticker), :SkewWeight2 = ffill(:SkewWeight));
```

Plotting the resulting portfolios gives us an idea of their performance.

```
assetPortfolios = dropmissing(@combine(groupby(allDataWeights, [:Date, :AssetClass]),
    :PortfolioReturn = sum(:SkewWeight2 .* :LogReturn),
    :MktReturn = mean(:LogReturn)))

p = plot(title = "Skew Portfolios")
for ac in unique(assetPortfolios.AssetClass)
    plot!(p, assetPortfolios[assetPortfolios.AssetClass .== ac, :].Date,
        cumsum(assetPortfolios[assetPortfolios.AssetClass .== ac, :].PortfolioReturn), label = ac)
end
hline!([0], color = "black", label = :none)
p
```

These are the results for each asset class. Interestingly, all of them (except Other) have a positive return as of February, and most have never fallen below their starting point. Commodities are very volatile and swung back and forth quite dramatically, while equities have been one-way traffic in the right direction!

We also want to combine all the asset classes to produce a single portfolio but first have to normalise the returns by the volatility so that they are equally weighted on a risk basis.

```
assetPortfolios = @transform(groupby(assetPortfolios, :AssetClass), :Vol = sqrt.(runvar(:PortfolioReturn, 256)))
assetPortfolios = @transform(groupby(assetPortfolios, :AssetClass),
    :NormReturn = 0.1*:PortfolioReturn ./ :Vol,
    :NormMarketReturn = 0.1*:MktReturn ./ :Vol)
gcf = @combine(groupby(assetPortfolios, :Date), :Return = mean(:NormReturn), :MktReturn = mean(:NormMarketReturn));

plot(gcf.Date[2:end], cumsum(gcf.Return[2:end]), label = "Global Skew Factor", title = "Global Portfolio")
plot!(gcf.Date[2:end], cumsum(gcf.MktReturn[2:end]), label = "Global Market Return")
hline!([0], color = "black", label = :none)
```

Again, a positive result, well at least recently. This indicates that skew has some associated premium. Now we want to see if this is alpha or beta.

It’s great that these portfolios both at an asset level and global level have ended up in the green but we want to compare the performance to the general market and see if it’s riding the market or adding something new.

This is simple enough to compare, we can look at the equal-weighted return of all the assets in the group and see how that ended up.

Again, all of the skew portfolios have outperformed the market portfolio (except the Other asset class), so this is a good indication that the skew strategy is adding something new.

A more systematic approach is to regress the portfolio return against the market return and this will give us a measure of the \(\alpha\) and \(\beta\) of the strategy.

\[\text{Skew Return} = \alpha + \beta \cdot \text{Market Return}\]

```
using GLM

for ac in unique(assetPortfolios.AssetClass)
    ols = lm(@formula(PortfolioReturn ~ MktReturn), assetPortfolios[assetPortfolios.AssetClass .== ac, :])
    println(ac)
    println(coeftable(ols))
    println(r2(ols))
end
```

Asset Class | \(\alpha\) | \(p\) value | \(\beta\) | \(p\) value | \(R^2\) |
---|---|---|---|---|---|
Equity | 0.0003 | 0.0544 | -0.01 | 0.4465 | 0.0003 |
FI | 0.0001 | 0.1796 | -0.05 | 0.0728 | 0.002 |
Commodities | 0.0004 | 0.4799 | 0.113 | 0.0232 | 0.003 |
Other | -0.00004 | 0.5845 | 0.007 | 0.1690 | 0.001 |
Ccy | 0.0001 | 0.3622 | 0.498 | <1e-27 | 0.08 |

The first thing to note is the low \(R^2\)s across the board, which is to be expected in these types of models. Generally, the \(\alpha\)s are all statistically insignificant, with only the equity portfolio getting close to significance, which indicates that the skew factor isn’t providing ‘new’ returns. Interestingly, only commodities and currencies have a statistically significant \(\beta\), which means for the other asset classes the model is essentially noise. So whilst the lack of \(\alpha\) is a problem, the lack of \(\beta\) sort of makes up for it. Overall, I think this is a promising sign that there is perhaps something more to be done.

An equity fund manager who wants to allocate to skew also needs to verify that skew is providing something unique and not a repackaging of momentum/value/growth/carry factors. This is easy enough as there are ETFs that represent these factors, so we just include it in the regression.

```
mtum = load("MTUM") #momentum
vtv = load("VTV") #value
vug = load("VUG") #growth
cry = load("VIG") #carry
equityFactors = vcat([mtum, vtv, vug, cry]...);
```

Joining these with the equity data gives us a bigger dataset to construct the OLS regression.

```
equity = assetPortfolios[assetPortfolios.AssetClass .== "Equity", :]
equity = leftjoin(equity,
    unstack(@select(equityFactors, :Date, :Ticker, :LogReturn), :Date, :Ticker, :LogReturn),
    on = "Date")
coeftable(lm(@formula(PortfolioReturn ~ MktReturn + MTUM + VTV + VUG + VIG), equity))
```

| Coef. | Std. Error | t | Pr(> \(\mid t \mid\)) | Lower 95% | Upper 95% |
---|---|---|---|---|---|---|
(Intercept) | 0.000280318 | 0.000180867 | 1.55 | 0.1214 | -7.44597e-5 | 0.000635095 |
MktReturn | -0.300453 | 0.0312806 | -9.61 | <1e-20 | -0.361811 | -0.239094 |
MTUM | -0.0881885 | 0.0305466 | -2.89 | 0.0039 | -0.148107 | -0.0282701 |
VTV | 0.450562 | 0.0614928 | 7.33 | <1e-12 | 0.329942 | 0.571183 |
VUG | 0.109752 | 0.0358138 | 3.06 | 0.0022 | 0.0395015 | 0.180002 |
VIG | -0.140079 | 0.0739041 | -1.90 | 0.0582 | -0.285045 | 0.00488637 |

Again, no \(\alpha\), a significant market \(\beta\), and significant momentum, value, and growth coefficients, but no significance for carry. This isn’t great for the skew factor, as the regression suggests we can replicate it using the other factors; namely, it’s anti-correlated with the market and momentum and correlated with value and growth. Given it’s a mean-reversion-esque strategy this makes sense, as value is generally about finding underpriced assets.

This has been a successful replication of the original paper, using ETFs from different asset sectors to explore skew. We now understand that skew is a measure of how left- or right-tailed a distribution is, and how it can be exploited in a trading strategy. By calculating skew across different assets and ranking it within asset-class groups, we allocate long positions to the most negatively skewed assets and short positions to positively skewed assets. This portfolio has produced a positive return in equities, fixed income, currencies, and commodities (but not Other), and has outperformed the market portfolio. A global skew portfolio was also constructed by scaling each asset class to 10% volatility and combining the returns, which also outperformed the market.

The Other asset class was the only sector where skew didn’t work, so it would be hurting the overall skew portfolio; going forward we would know to restrict the universe to equity, fixed income, currencies, and commodities.

However, when we regressed the portfolio return onto the market returns, we found no statistically significant alphas and significant betas. The equity portfolio was close to having a significant alpha, but given it had the largest number of underlying assets, it could be a function of asset size.

We have neglected the trading costs and potential capacity of the overall strategy, but given its low turnover (weights only updating every month), this is probably safe to ignore until you hit the super asset manager size.

Although the results are not as conclusive as the original paper, they are on a shorter timescale and smaller universe, and do not contradict the original findings. We have shown that skew is out there and can provide a source of returns.

Going forward, refining the calculation of the skew and tuning the lookback windows might improve the results. Also, expanding the universe into more specific funds could provide better insights. At the moment, the fixed income component is too broad to pick up on the skew changes.


Regularisation is normally taught as a method to reduce overfitting: you have a big model and you make it smaller by shrinking some of the coefficients. Work by Janzing (papers below) argues that this can help produce better causal models too, and in this blog post I will work through two papers to try and understand the process better.
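To see what ‘shrinking the coefficients’ means mechanically, here is a minimal ridge-regression sketch on synthetic data (the data and the penalty \(\lambda = 50\) are made up for illustration; this is not the papers’ method): adding \(\lambda\) to the diagonal of \(X^\top X\) pulls the OLS coefficients towards zero.

```
using LinearAlgebra, Random

Random.seed!(1)
n, p = 100, 5
X = randn(n, p)
beta_true = [1.0, 0.5, 0.0, -0.5, -1.0]
y = X * beta_true + 0.5 * randn(n)

# closed-form ridge estimate; lambda = 0 recovers OLS
ridge(X, y, lambda) = (X'X + lambda*I) \ (X'y)

beta_ols   = ridge(X, y, 0.0)
beta_ridge = ridge(X, y, 50.0)
println(norm(beta_ridge) < norm(beta_ols))  # the penalised coefficients are smaller
```

Increasing \(\lambda\) trades variance for bias; the question the papers ask is whether that trade-off can also move the coefficients closer to the causal ones.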

I’ll work off two main papers for causal regularisation:

In truth, I am working backward. I first encountered causal regularisation in Better AB testing via Causal Regularisation, where it is used to produce better estimates by combining a biased and an unbiased dataset. I want to take a step back and understand causal regularisation from the original papers. Using free data from the UCI Machine Learning Repository we can attempt to replicate the methods from the papers and see how causal regularisation works to produce better **causal** models.

As ever, I’m in Julia (1.9), so fire up that notebook and follow along.

```
using CSV, DataFrames, DataFramesMeta
using Plots
using GLM, Statistics
```

The `wine-quality` dataset from the UCI repository provides measurements of the chemical properties of wine and a quality rating from someone drinking the wine. It’s a simple CSV file that you can download (winequality) and load with minimal data wrangling needed.

We will be working with the red wine data set as that’s what both Janzing papers use.

```
rawData = CSV.read("wine+quality/winequality-red.csv", DataFrame)
first(rawData)
```

APD! Always Plot the Data to make sure the values are something you expect. Sometimes you need a visual confirmation that things line up with what you believe.

```
plot(scatter(rawData.alcohol, rawData.quality, title = "Alcohol", label = :none, color="#eac435"),
scatter(rawData.pH, rawData.quality, title = "pH", label = :none, color="#345995"),
scatter(rawData.sulphates, rawData.quality, title= "Sulphates", label = :none, color="#E40066"),
scatter(rawData.density, rawData.quality, title = "Density", label = :none, color="#03CEA4"), ylabel = "Quality")
```

By choosing four of the variables randomly we can see that some are correlated with quality and some are not.

A loose goal is to come up with a causal model that can explain the quality of the wine using the provided factors. We will change the data slightly to highlight how causal regularisation helps, but for now, let’s start with the simple OLS model.

In the paper they normalise the variables to be unit variance, so we divide by the standard deviation. We then model the quality of the wine using all the available variables.

```
vars = names(rawData, Not(:quality))
cleanData = deepcopy(rawData)
for var in filter(!isequal("White"), vars)
cleanData[!, var] = cleanData[!, var] ./ std(cleanData[!, var])
end
cleanData[!, :quality] .= Float64.(cleanData[!, :quality])
ols = lm(term(:quality) ~ sum(term.(Symbol.(vars))), cleanData)
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
quality ~ 1 + fixed acidity + volatile acidity + citric acid + residual sugar + chlorides + free sulfur dioxide + total sulfur dioxide + density + pH + sulphates + alcohol
Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept) 21.9652 21.1946 1.04 0.3002 -19.6071 63.5375
fixed acidity 0.043511 0.0451788 0.96 0.3357 -0.0451055 0.132127
volatile acidity -0.194027 0.0216844 -8.95 <1e-18 -0.23656 -0.151494
citric acid -0.0355637 0.0286701 -1.24 0.2150 -0.0917989 0.0206716
residual sugar 0.0230259 0.0211519 1.09 0.2765 -0.0184626 0.0645145
chlorides -0.088211 0.0197337 -4.47 <1e-05 -0.126918 -0.0495041
free sulfur dioxide 0.0456202 0.0227121 2.01 0.0447 0.00107145 0.090169
total sulfur dioxide -0.107389 0.0239718 -4.48 <1e-05 -0.154409 -0.0603698
density -0.0337477 0.0408289 -0.83 0.4086 -0.113832 0.0463365
pH -0.0638624 0.02958 -2.16 0.0310 -0.121883 -0.00584239
sulphates 0.155325 0.019381 8.01 <1e-14 0.11731 0.19334
alcohol 0.294335 0.0282227 10.43 <1e-23 0.238977 0.349693
────────────────────────────────────────────────────────────────────────────────────────
```

The dominant factor is the `alcohol` amount, which is the strongest variable in predicting the quality, i.e. higher-quality wines have a higher alcohol content. We also note that 4 of the 11 explanatory variables are deemed insignificant at the 5% level. We save these parameters and then look at the regression without the `alcohol` variable.

```
olsParams = DataFrame(Dict(zip(vars, coef(ols)[2:end])))
olsParams[!, :Model] .= "OLS"
olsParams
```

1×12 DataFrame

| Row | alcohol | chlorides | citric acid | density | fixed acidity | free sulfur dioxide | pH | residual sugar | sulphates | total sulfur dioxide | volatile acidity | Model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.294335 | -0.088211 | -0.0355637 | -0.0337477 | 0.043511 | 0.0456202 | -0.0638624 | 0.0230259 | 0.155325 | -0.107389 | -0.194027 | OLS |

```
cleanDataConfounded = select(cleanData, Not(:alcohol))
vars = names(cleanDataConfounded, Not(:quality))
confoundOLS = lm(term(:quality) ~ sum(term.(Symbol.(vars))), cleanDataConfounded)
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
quality ~ 1 + fixed acidity + volatile acidity + citric acid + residual sugar + chlorides + free sulfur dioxide + total sulfur dioxide + density + pH + sulphates
Coefficients:
───────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────────────────────
(Intercept) 189.679 14.2665 13.30 <1e-37 161.696 217.662
fixed acidity 0.299551 0.0391918 7.64 <1e-13 0.222678 0.376424
volatile acidity -0.176182 0.0223382 -7.89 <1e-14 -0.219997 -0.132366
citric acid 0.00912711 0.0292941 0.31 0.7554 -0.0483321 0.0665863
residual sugar 0.133781 0.0189031 7.08 <1e-11 0.0967031 0.170858
chlorides -0.107215 0.0203052 -5.28 <1e-06 -0.147043 -0.0673877
free sulfur dioxide 0.0394281 0.023462 1.68 0.0931 -0.00659172 0.0854479
total sulfur dioxide -0.128248 0.0246854 -5.20 <1e-06 -0.176668 -0.0798287
density -0.355576 0.0276265 -12.87 <1e-35 -0.409765 -0.301388
pH 0.0965662 0.0261087 3.70 0.0002 0.0453551 0.147777
sulphates 0.213697 0.0191745 11.14 <1e-27 0.176087 0.251307
───────────────────────────────────────────────────────────────────────────────────────────
```

`citric acid` and `free sulfur dioxide` are now the only insignificant variables; the rest are believed to contribute to the quality. This means we are experiencing *confounding*: `alcohol` is the better explainer, but its effect is now hiding behind these other variables.

**Confounding** - When a variable influences other variables and the outcome at the same time leading to an incorrect view on the correlation between the variables and outcomes.
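To make the definition concrete, here is a tiny simulation of my own (purely illustrative, not part of the wine data): a hidden variable `z` drives both `x` and `y`, so a regression of `y` on `x` alone finds a strong coefficient even though `x` has no causal effect, while controlling for `z` recovers the truth.

```julia
using Random, Statistics

Random.seed!(42)
n = 10_000
z = randn(n)                      # hidden confounder
x = 0.8 .* z .+ randn(n)          # x is driven by z
y = 2.0 .* z .+ randn(n)          # y is driven by z, NOT by x

# Naive regression of y on x: a large, spurious coefficient
slope_biased = cov(x, y) / var(x)

# Residualise x and y on z first, then regress: coefficient near the true 0
xr = x .- (cov(x, z) / var(z)) .* z
yr = y .- (cov(y, z) / var(z)) .* z
slope_adjusted = cov(xr, yr) / var(xr)
```

With these settings `slope_biased` comes out close to \(1.6/1.64 \approx 0.98\) while `slope_adjusted` sits near zero, which is exactly the trap the no-alcohol regression falls into.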

This regression after dropping the `alcohol` variable is incorrect and provides the wrong causal conclusion. So can we do better and get closer to the true regression coefficients using some regularisation methods?

For now, we save these incorrect parameters and explore the causal regularisation methods.

```
olsParamsConf = DataFrame(Dict(zip(vars, coef(confoundOLS)[2:end])))
olsParamsConf[!, :Model] .= "OLS No Alcohol"
olsParamsConf[!, :alcohol] .= NaN
olsParamsConf
```

1×12 DataFrame

| Row | chlorides | citric acid | density | fixed acidity | free sulfur dioxide | pH | residual sugar | sulphates | total sulfur dioxide | volatile acidity | Model | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.107215 | 0.00912711 | -0.355576 | 0.299551 | 0.0394281 | 0.0965662 | 0.133781 | 0.213697 | -0.128248 | -0.176182 | OLS No Alcohol | NaN |

Some maths. Regression is taking our variables \(X\) and finding the parameters \(a\) that get us closest to \(Y\).

\[Y = a X\]

\(X\) is a matrix and \(a\) is a vector. When we fit this to some data, the values of \(a\) are free to converge to any value they want, so long as the result gets close to the outcome variable. This means we are minimising the difference between \(Y\) and \(aX\)

\[||(Y - a X)|| ^2.\]

Regularisation is the act of restricting the values \(a\) can take.

For example, we can make the sum of the absolute values of the \(a\)’s equal to a constant (\(L_1\) regularisation), or the sum of the squares of the \(a\) values equal a constant (\(L_2\) regularisation). In simpler terms, if we want to increase the coefficient of one parameter, we need to reduce the coefficient of a different term. Think of there being a finite amount of mass that we can allocate to the parameters: they can’t take on whatever value they like, but instead have to share a fixed budget. This helps reduce overfitting, as it constrains how much influence any one parameter can have, and the final result should converge to a model that doesn’t overfit.

In ridge regression we are minimising the \(L_2\) norm, so restricting the sum of the squares of the \(a\)’s while at the same time minimising the original OLS loss:

\[||(Y - a X)|| ^2 + \lambda || a || ^2.\]

So we can see how regularisation is an additional component on top of OLS regression. \(\lambda\) is a hyperparameter, just a number, that controls how much restriction we place on the \(a\) values.
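The ridge objective also has a closed-form solution, \(a = (X^\top X + \lambda I)^{-1} X^\top Y\), which makes the shrinkage easy to verify on synthetic data (a sketch of my own, not the paper's code):

```julia
using LinearAlgebra, Random

Random.seed!(1)
n, p = 200, 5
X = randn(n, p)
Y = X * [1.0, -0.5, 0.0, 0.25, 2.0] .+ 0.1 .* randn(n)

# Closed-form ridge estimator: (X'X + lambda*I) \ X'Y
ridge(X, Y, lambda) = (X'X + lambda * I) \ (X'Y)

a_ols = ridge(X, Y, 0.0)     # lambda = 0 collapses to plain OLS
a_pen = ridge(X, Y, 50.0)    # a larger lambda shrinks the coefficients
```

`norm(a_pen)` comes out strictly smaller than `norm(a_ols)`: increasing \(\lambda\) squeezes the parameter budget, which is the behaviour the grid search below exploits.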

To do ridge regression in Julia I’ll be leaning on the MLJ.jl framework and using that to build out the learning machines.

```
using MLJ
@load RidgeRegressor pkg=MLJLinearModels
```

We will take the confounded dataset (so the data where the alcohol column is deleted), partition it into train and test sets, and get started with some regularisation.

```
y, X = unpack(cleanDataConfounded, ==(:quality); rng=123);
train, test = partition(eachindex(y), 0.7, shuffle=true)
mdl = MLJLinearModels.RidgeRegressor()
```

```
RidgeRegressor(
lambda = 1.0,
fit_intercept = true,
penalize_intercept = false,
scale_penalty_with_samples = true,
solver = nothing)
```

We can see the hyperparameter `lambda` is initialised to 1.

We want to know the optimal \(\lambda\) value, so we will use cross-validation: train the model on one subset of the data, verify on a hold-out set, and repeat. This is all simple in MLJ.jl; we define a grid of penalisations between 0 and 1 and fit the regression using cross-validation across the different lambdas, optimising for the best \(R^2\) value.

```
lambda_range = range(mdl, :lambda, lower = 0, upper = 1)
lmTuneModel = TunedModel(model=mdl,
resampling = CV(nfolds=6, shuffle=true),
tuning = Grid(resolution=200),
range = [lambda_range],
measures=[rsq]);
lmTunedMachine = machine(lmTuneModel, X, y);
fit!(lmTunedMachine, rows=train, verbosity=0)
report(lmTunedMachine).best_model
```

```
RidgeRegressor(
lambda = 0.020100502512562814,
fit_intercept = true,
penalize_intercept = false,
scale_penalty_with_samples = true,
solver = nothing)
```

The best value of \(\lambda\) is 0.0201. When we plot \(R^2\) against the \(\lambda\) values there isn’t much change, just a minor inflection around the small values.

```
plot(lmTunedMachine)
```
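For intuition, the tuner above is essentially running this loop: a hand-rolled sketch of k-fold cross-validation on synthetic data (my own code, not MLJ's actual implementation).

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(2)
n, p = 120, 4
X = randn(n, p)
Y = X * [1.0, 0.5, -1.0, 0.0] .+ randn(n)

ridge(X, Y, lambda) = (X'X + lambda * I) \ (X'Y)

# k-fold CV: fit on k-1 folds, score the held-out fold, average the errors
function cv_mse(X, Y, lambda; k = 6)
    n = size(X, 1)
    folds = Iterators.partition(shuffle(1:n), ceil(Int, n / k))
    errs = Float64[]
    for hold in folds
        train = setdiff(1:n, hold)
        a = ridge(X[train, :], Y[train], lambda)
        push!(errs, mean((Y[hold] .- X[hold, :] * a) .^ 2))
    end
    mean(errs)
end

lambdas = range(0, 1, length = 20)
bestLambda = lambdas[argmin([cv_mse(X, Y, l) for l in lambdas])]
```

MLJ does the same grid-then-average dance, just with shuffling, measures, and model plumbing handled for you.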

Let’s save those parameters. This will be our basic ridge regression result that the other technique builds off.

```
res = fitted_params(lmTunedMachine).best_fitted_params.coefs
ridgeParams = DataFrame(res)
ridgeParams = hcat(ridgeParams, DataFrame(Model = "Ridge", alcohol=NaN))
ridgeParams
```

1×12 DataFrame

| Row | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | Model | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.190892 | -0.157286 | 0.0410523 | 0.117846 | -0.142458 | 0.0374597 | -0.153419 | -0.29919 | 0.0375852 | 0.232461 | Ridge | NaN |

The main result from the paper is that we first need to estimate the confounding effect \(\beta\) and then choose a penalisation factor \(\lambda\) that satisfies

\[||a||^2 \approx (1-\beta) ||\hat{a}_{\text{OLS}}||^2,\]

where \(\hat{a}_{\text{OLS}}\) are the confounded OLS coefficients. So the \(L_2\) norm of the ridge parameters can only be so large. In the second paper they estimate \(\beta\) to be 0.8. For us, we can use the above grid search, calculate the norm of the parameters at each \(\lambda\), and find which ones satisfy this criterion.

So we iterate through the results of the grid search and calculate the \(L_2\) norm of the parameters.

```
mdls = report(lmTunedMachine).history
l = zeros(length(mdls))
a = zeros(length(mdls))
for (i, mdl) in enumerate(mdls)
    l[i] = mdl.model.lambda
    a[i] = sum(map(x -> x[2], fitted_params(fit!(machine(mdl.model, X, y))).coefs) .^ 2)
end
```

Plotting the results gives us a visual idea of how the penalisation works. Larger values of \(\lambda\) mean the model parameters are more and more restricted.

```
inds = sortperm(l)
l = l[inds]
a = a[inds]
mdlsSorted = report(lmTunedMachine).history[inds]
scatter(l, a, label = :none)
hline!([(1-0.8) * sum(coef(confoundOLS)[2:end] .^ 2)], label = "Target Length", xlabel = "Lambda", ylabel = "a Length")
```

We search the lengths for the one closest to the target length and save those parameters.

```
targetLength = (1-0.8) * sum(coef(confoundOLS)[2:end] .^ 2)
ind = findfirst(x-> x < targetLength, a)
res = fitted_params(fit!(machine(mdlsSorted[ind].model, X, y))).coefs
finalParams = DataFrame(res)
finalParams = hcat(finalParams, DataFrame(Model = "With Beta", alcohol=NaN))
finalParams
```

1×12 DataFrame

| Row | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | Model | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0521908 | -0.139099 | 0.0598797 | 0.0377729 | -0.0786037 | 0.00654776 | -0.0856938 | -0.124057 | 0.00682623 | 0.11735 | With Beta | NaN |

Now, the code to calculate \(\beta\) isn’t the easiest to implement (hence why I took their estimate). Instead, we could take the approach from Better AB Testing via Causal Regularisation and use the test set to optimise the penalisation parameter \(\lambda\), then use that value when training the model on the train set.

Applying this method to the wine dataset isn’t a true replication of their paper, as their test and train data sets are instead two data sets, one with bias and one without like you might observe from an AB test. So it’s more of a demonstration of the method rather than a direct comparison to the Janzing method.

Again, `MLJ` makes this simple: we just fit the machine using the `test` rows to produce the best-fitting model.

```
lambda_range = range(mdl, :lambda, lower = 0, upper = 1)
lmTuneModel = TunedModel(model=mdl,
resampling = CV(nfolds=6, shuffle=true),
tuning = Grid(resolution=200),
range = [lambda_range],
measures=[rsq]);
lmTunedMachine = machine(lmTuneModel, X, y);
fit!(lmTunedMachine, rows=test, verbosity=0)
plot(lmTunedMachine)
```

```
report(lmTunedMachine).best_model
```

```
RidgeRegressor(
lambda = 0.010050251256281407,
fit_intercept = true,
penalize_intercept = false,
scale_penalty_with_samples = true,
solver = nothing)
```

Our best \(\lambda\) is 0.01 so we retrain the same machine, this time using the training rows.

```
res2 = fit!(machine(report(lmTunedMachine).best_model, X, y), rows=train)
```

Again saving these parameters down leaves us with three methods and three sets of parameters.

```
finalParams2 = DataFrame(fitted_params(res2).coefs)
finalParams2 = hcat(finalParams2, DataFrame(Model = "No Beta", alcohol=NaN))
allParams = vcat([olsParams, olsParamsConf, ridgeParams, finalParams, finalParams2]...)
allParams
```

5×12 DataFrame

| Row | alcohol | chlorides | citric acid | density | fixed acidity | free sulfur dioxide | pH | residual sugar | sulphates | total sulfur dioxide | volatile acidity | Model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.294335 | -0.088211 | -0.0355637 | -0.0337477 | 0.043511 | 0.0456202 | -0.0638624 | 0.0230259 | 0.155325 | -0.107389 | -0.194027 | OLS |
| 2 | NaN | -0.107215 | 0.00912711 | -0.355576 | 0.299551 | 0.0394281 | 0.0965662 | 0.133781 | 0.213697 | -0.128248 | -0.176182 | OLS No Alcohol |
| 3 | NaN | -0.142458 | 0.0410523 | -0.29919 | 0.190892 | 0.0374597 | 0.0375852 | 0.117846 | 0.232461 | -0.153419 | -0.157286 | Ridge |
| 4 | NaN | -0.0786037 | 0.0598797 | -0.124057 | 0.0521908 | 0.00654776 | 0.00682623 | 0.0377729 | 0.11735 | -0.0856938 | -0.139099 | With Beta |
| 5 | NaN | -0.141766 | 0.031528 | -0.323596 | 0.222812 | 0.03869 | 0.048907 | 0.127026 | 0.23961 | -0.153488 | -0.157603 | No Beta |

What method has done the best at uncovering the confounded relationship?

We have our different estimates of the parameters of the model, we now want to compare these to the ‘true’ unconfounded variables and see whether we have recovered the correct variables. To do this we calculate the square difference and normalise by the overall \(L_2\) norm of the parameters.

In practice, this just means we are comparing how far the fitted parameters are away from the true (unconfounded) model parameters.

```
allParamsLong = stack(allParams, Not(:Model))
trueParams = select(@subset(allParamsLong, :Model .== "OLS"), Not(:Model))
rename!(trueParams, ["variable", "truth"])
allParamsLong = leftjoin(allParamsLong, trueParams, on = :variable)
errorRes = @combine(groupby(@subset(allParamsLong, :variable .!= "alcohol"), :Model),
:a = sum((:truth .- :value) .^2),
:a2 = sum(:value .^ 2))
errorRes = @transform(errorRes, :e = :a ./ :a2)
sort(errorRes, :e)
```

5×4 DataFrame

| Row | Model | a | a2 | e |
|---|---|---|---|---|
| 1 | OLS | 0.0 | 0.0920729 | 0.0 |
| 2 | With Beta | 0.0291038 | 0.0698576 | 0.416616 |
| 3 | Ridge | 0.129761 | 0.266952 | 0.486085 |
| 4 | No Beta | 0.157667 | 0.301286 | 0.523314 |
| 5 | OLS No Alcohol | 0.213692 | 0.349675 | 0.611116 |

Using the \(\beta\) estimation method gives the best model (smallest \(e\)), which lines up with the paper, and the magnitude of the error is also in line with their results (they had 0.35 and 0.45 for Lasso/ridge regression respectively). The ridge regression and No Beta methods also improve on the naive OLS approach, which indicates there is some benefit from using these techniques. The No Beta method is not a faithful reproduction of the Better AB testing paper because it requires the ‘test’ dataset to come from an AB-test-like scenario, which we don’t have here, so that might explain why the values don’t quite line up.

All methods improve on the naive ‘OLS No Alcohol’ parameters though, which shows this approach to causal regularisation can uncover better models if you have underlying confounding in your data.

We are always stuck with the data we are given and most of the time can’t collect more to try and uncover more relationships. Causal regularisation gives us a chance to use normal machine learning techniques to build better causal models, by guiding what the regularisation parameters should be and using that to restrict the overall parameters. When we can estimate the expected confounding amount \(\beta\) we get the best results, but regular ridge regression and the Webster-Westray method also improve on a naive regression. So whilst overfitting is the main motivation for regularisation, it also brings some causal benefits and helps you get closer to the true relationships between the variables.

I’ve written about causal analysis techniques before with Double Machine Learning - An Easy Introduction. This is another way of building causal models.


I’ve tried to cover different assets and frequencies to hopefully inspire the various types of quant finance out there.

My day-to-day job is in FX so, naturally, that’s where I think the best data can be found. TrueFX provides tick-by-tick data with millisecond timestamps, so high-frequency data is available for free across lots of different currencies. If you are interested in working out how to deal with large amounts of data efficiently (one month of EURUSD is 600MB), this source is a good place to start.
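One way to handle that size problem: CSV.jl can stream the file row by row with `CSV.Rows` instead of materialising a full `DataFrame`. A sketch of my own (the column layout matches the TrueFX format used below; the function name is hypothetical):

```julia
using CSV

# Stream a TrueFX tick file (Ccy, Time, Bid, Ask; no header row) in one pass,
# keeping only running statistics instead of the full table.
function stream_spread_stats(path)
    n, maxSpread = 0, -Inf
    for row in CSV.Rows(path;
                        header = ["Ccy", "Time", "Bid", "Ask"],
                        types = Dict(:Bid => Float64, :Ask => Float64))
        n += 1
        maxSpread = max(maxSpread, row.Ask - row.Bid)
    end
    (nticks = n, maxSpread = maxSpread)
end
```

Calling `stream_spread_stats("EURUSD-2023-10.csv")` then runs in roughly constant memory regardless of the file size.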

As a demo, I’ve downloaded the USDJPY October dataset.

```
using CSV, DataFrames, DataFramesMeta, Dates, Statistics
using Plots
```

It’s a big CSV file, so this isn’t the best way to store the data; instead, stick it into a database like QuestDB that is made for time-series data.

```
usdjpy = CSV.read("USDJPY-2023-10.csv", DataFrame,
header = ["Ccy", "Time", "Bid", "Ask"])
usdjpy.Time = DateTime.(usdjpy.Time, dateformat"yyyymmdd HH:MM:SS.sss")
first(usdjpy, 4)
```

4×4 DataFrame

| Row | Ccy | Time | Bid | Ask |
|---|---|---|---|---|
| 1 | USD/JPY | 2023-10-01T21:04:56.931 | 149.298 | 149.612 |
| 2 | USD/JPY | 2023-10-01T21:04:56.962 | 149.298 | 149.782 |
| 3 | USD/JPY | 2023-10-01T21:04:57.040 | 149.589 | 149.782 |
| 4 | USD/JPY | 2023-10-01T21:04:58.201 | 149.608 | 149.782 |

It’s simple data, just a bid and ask price with a time stamp.

```
usdjpy = @transform(usdjpy, :Spread = :Ask .- :Bid,
:Mid = 0.5*(:Ask .+ :Bid),
:Hour = round.(:Time, Minute(10)))
usdjpyHourly = @combine(groupby(usdjpy, :Hour), :open = first(:Mid), :close = last(:Mid), :avg_spread = mean(:Spread))
usdjpyHourly.Time = Time.(usdjpyHourly.Hour)
plot(usdjpyHourly.Hour, usdjpyHourly.open, lw =1, label = :none, title = "USDJPY Price Over October")
```

Looking at the hourly price over the month gives you flat periods over the weekend.

Let’s look at the average spread (ask - bid) throughout the day.

```
hourlyAvgSpread = sort(@combine(groupby(usdjpyHourly, :Time), :avg_spread = mean(:avg_spread)), :Time)
plot(hourlyAvgSpread.Time, hourlyAvgSpread.avg_spread, lw =2, title = "USDJPY Intraday Spread", label = :none)
```

We see a big spike at 10 pm because of the day roll, when the secondary markets go offline briefly, which pollutes the data a bit. Looking at just midnight to 8 pm gives a more indicative picture.

```
plot(hourlyAvgSpread[hourlyAvgSpread.Time .<= Time("20:00:00"), :].Time,
hourlyAvgSpread[hourlyAvgSpread.Time .<= Time("20:00:00"), :].avg_spread, label = :none, lw=2,
title = "USDJPY Intraday Spread")
```

In October spreads have generally been wider in the later part of the day compared to the morning.

There is much more that can be done with this data across the different currencies though. For example:

- How stable are correlations across currencies at different time frequencies?
- Can you replicate my microstructure noise post? How does the microstructure noise change between currencies?
- Price updates are irregular, what are some statistical properties?
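The last idea is quick to get started on. Here is a sketch of inter-arrival (tick gap) statistics on synthetic millisecond timestamps; on the real data you would use the `usdjpy.Time` column instead (variable names are my own):

```julia
using Dates, Statistics

# Irregularly spaced quote times: gaps drawn uniformly from 1-500 ms
times = DateTime(2023, 10, 2) .+ Millisecond.(cumsum(rand(1:500, 10_000)))

# Inter-arrival times between consecutive updates, in milliseconds
gaps = Dates.value.(diff(times))

tickStats = (mean = mean(gaps), median = median(gaps), longest = maximum(gaps))
```

Real tick gaps are far from uniform (they cluster around news and session opens), which is exactly the sort of property worth characterising.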

Let’s zoom out a little bit now, decrease the frequency, and widen the asset pool. Futures cover many asset classes, oil, coal, currencies, metals, agriculture, stocks, bonds, interest rates, and probably something else I’ve missed. This data is daily and roll adjusted, so you have a continuous time series of an asset for many years. This means you can look at the classic momentum/mean reversion portfolio models and have a real stab at long-term trends.

The data is part of the Nasdaq data link product (formerly Quandl) and once you sign up for an account you have access to the free data. This futures dataset is Wiki Continuous Futures and after about 50 clicks and logging in, re-logging in, 2FA codes you can view the pages.

To get the data you can go through one of the API packages in your favourite language. In Julia, this means the QuandlAccess.jl package which keeps things simple.

```
using QuandlAccess
futuresMeta = CSV.read("continuous.csv", DataFrame)
futuresCodes = futuresMeta[!, "Quandl Code"] .* "1"
quandl = Quandl("QUANDL_KEY")
function get_data(code)
    futuresData = quandl(TimeSeries(code))
    futuresData.Code .= code
    futuresData
end
futureData = get_data.(rand(futuresCodes, 4));
```

We have an array of all the available contracts, `futuresCodes`, and sample 4 of them randomly to see what the data looks like.

```
p = []
for df in futureData
    push!(p, plot(df.Date, df.Settle, label = df.Code[1]))  # one panel per contract
end
plot(p..., layout = 4)
```

- ABY - WTI Brent Bullet - Spread between two oil futures on different exchanges.
- TZ6 - Transco Zone 6 Non-N.Y. Natural Gas (Platts IFERC) Basis - Spread between two different natural gas contracts
- PG - PG&E Citygate Natural Gas (Platts IFERC) Basis - Again, spread between two different natural gas contracts
- FMJP - MSCI Japan Index - Index containing Japanese stocks

I’ve managed to randomly select 3 energy futures and one stock index.

Project ideas with this data:

- Cross-asset momentum and mean reversion.
- Cross-asset correlations, does the price of oil drive some equity indexes?
- Macro regimes, can you pick out commonalities of market factors over the years?

Out there in the wild is the FI2010 dataset which is essentially a sample of the full order book for five different stocks on the Nordic stock exchange for 10 days. You have 10 levels of prices and volumes and so can reconstruct the order book throughout the day. It is the benchmark dataset for limit order book prediction and you will see it referenced in papers that are trying to implement new prediction models. For example Benchmark Dataset for Mid-Price Forecasting of Limit Order Book Data with Machine Learning Methods references some basic methods on the dataset and how they perform when predicting the mid-price.

I found the dataset (as a Python package) here https://github.com/simaki/fi2010 but it’s just stored as a CSV which you can lift easily.

```
fi2010 = CSV.read(download("https://raw.githubusercontent.com/simaki/fi2010/main/data/data.csv"),DataFrame);
```

**Update on 7/01/2024**

Since posting this, the above link has gone offline and the user has deleted their GitHub account! Instead, the dataset can be found here: https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649/data . I’ve not verified if it’s in the same format, so there might be some additional work going from the raw data to how this blog post sets it up. Thanks to the commenters below for pointing this out.

The data is wide (each column is a depth level of the price and volume) so I turn each into a long data set and add the level, side and variable as a new column.

```
fi2010Long = stack(fi2010, 4:48, [:Column1, :STOCK, :DAY])
fi2010Long = @transform(fi2010Long, :a = collect.(eachsplit.(:variable, "_")))
fi2010Long = @transform(fi2010Long, :var = first.(:a), :level = last.(:a), :side = map(x->x[2], :a))
fi2010Long = @transform(groupby(fi2010Long, [:STOCK, :DAY]), :Time = collect(1:length(:Column1)))
first(fi2010Long, 4)
```

The ‘book depth’ is the sum of the liquidity available at all the levels and indicates how easy it is to trade the stock. As a quick example, we can take the average of each stock per day and use that as a proxy for the ease of trading these stocks.

```
intraDayDepth = @combine(groupby(fi2010Long, [:STOCK, :DAY, :var]), :avgDepth = mean(:value))
intraDayDepth = @subset(intraDayDepth, :var .== "VOLUME");
plot(intraDayDepth.DAY, intraDayDepth.avgDepth, group=intraDayDepth.STOCK,
marker = :circle, title = "Avg Daily Book Depth - FI2010")
```

Stocks 3 and 4 have the highest average depth, so they are most likely the easiest to trade, whereas Stock 1 has the thinnest book. Stock 2 shows an interesting switch between liquid and not liquid.

So if you want to look beyond top-of-book data, this dataset provides the extra level information needed and is closer to what a professional shop is using. Better than trying to predict daily Yahoo finance mid-prices with neural nets at least.

If you want to take a step further, building the tools that take in streaming data directly from the exchanges and save it into a database is another way to develop your technical capabilities. This means you have full control over what you download and save. Do you want just the top of book on every update, the full depth of the book, or just the reported trades? I’ve written about this before in Getting Started with High Frequency Finance using Crypto Data and Julia, and learned a lot in the process. Doing things this way means you own the entire pipeline and can fully understand the data you are saving and any quirks in the process.

Plenty to get stuck into and learn from. Getting the data and loading it into an environment is always the first challenge, and learning how to do that with all these different types of data should help you understand what these types of jobs entail.


Reinforcement learning is a pillar of machine learning: it combines data with learning how to make better decisions automatically. One of the basic models in reinforcement learning is the *multi-armed bandit*. A bit of an anachronistic name, but the single-armed bandit refers to a casino game where you pull a lever (or push a button), some reels spin round, and you might win a prize.

The multi-armed bandit is an extension to this type of game and means we have different levers we can pull that lead to a different reward. The reward depends on the lever pulled.

This simple mental model is surprisingly applicable to lots of different problems and can act as a good approximation to whatever you are trying to solve. For example, take advertising. You have multiple adverts that you display to try and get people to click through to your website. Each time a page loads you can show one advert; you then record how many people click on it and use that to decide which advert to show next. With each page load you decide: do I show the most successful advert so far, or try a new advert to see how it performs? Over time you find out which advert performs the best and show that as much as possible to get as many clicks.

Imagine we have a multi-armed bandit machine, where we pull a lever and get a reward. The reward depends on the lever pulled, how do we learn what the best lever is?

First let’s build our bandit. We will have 5 levers and the reward will be a sample from a normal distribution where each lever will have a random mean and standard deviation.

```
using Plots, StatsPlots
using Distributions
nLevers = 5
rewardMeans = rand(Normal(0, 3), nLevers)
rewardSD = rand(Gamma(2, 2), nLevers)
hcat(rewardMeans, rewardSD)
```

```
5×2 Matrix{Float64}:
-4.7724 5.88533
-4.60967 0.627556
-5.96987 1.14465
8.96919 3.80253
2.11311 4.84983
```

These are the parameters of the levers in our bandit, so let’s look at the distribution of the rewards.

```
density(rand(Normal(rewardMeans[1], rewardSD[1]), 1000), label = "Lever 1")
for i in 2:nLevers
    density!(rand(Normal(rewardMeans[i], rewardSD[i]), 1000), label = "Lever " * string(i))
end
plot!()
```

The distribution of rewards from each lever is illustrated above. The 4th lever looks the best as it has the highest chance of producing a positive value and the wider right tail too. As we are talking about rewards, large positive values are better.

So given we have a process of pulling a lever and getting a reward, how do we learn what the best lever is and importantly as quickly as possible?

Like all good statistics problems, we start with the most basic model and start pulling levers randomly.

Just pull a random lever every time. Nothing is being learned here; we are just demonstrating how the problem setup works. With each play we generate a random integer that corresponds to a lever, pull that lever (draw a random normal variable with the mean/standard deviation of that lever), and record which lever was pulled and the reward amount. Then repeat several times.

```
function random_learner(rewardMeans, rewardSD, nPlays)
    nLevers = length(rewardMeans)
    selectedLever = zeros(Int64, nPlays)
    rewards = zeros(nPlays)
    cumSelection = zeros(Int64, nLevers)
    cumRewards = zeros(nLevers)
    optimalChoice = Array{Bool}(undef, nPlays)
    bestLever = findmax(rewardMeans)[2]
    for i = 1:nPlays
        selectedLever[i] = rand(1:nLevers)
        optimalChoice[i] = selectedLever[i] == bestLever
        rewards[i] = rand(Normal(rewardMeans[selectedLever[i]], rewardSD[selectedLever[i]]))
        cumSelection[selectedLever[i]] += 1
        cumRewards[selectedLever[i]] += rewards[i]
    end
    return selectedLever, rewards, cumSelection, cumRewards, optimalChoice
end
```

We run this learner for 1,000 steps and look at the number of times each lever is pulled.

```
randomStrat = random_learner(rewardMeans, rewardSD, 1000);
histogram(randomStrat[1], label = "Number of Time Lever Pulled")
```

Each of the levers is pulled a roughly equal amount of times, with no learning, just randomly pulling. Moving on, how do we learn?

Reinforcement learning is about balancing the explore/exploit set-up of the problem. We need to sample each of the levers and work out what kind of rewards they provide and then use that information to inform our next decision.

For each iteration, we randomly decide whether to pull a random lever or to use the information collected so far to choose our best guess at the best lever. Our information in this case is the rolling average of the reward each time we pulled the lever. This is called a *greedy learner*: outside of the random pulls, it just does its best with what it knows and has no deeper reasoning about which lever is worth exploring.
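That rolling average doesn't need the full reward history: it can be maintained incrementally via \(Q_n = Q_{n-1} + (R_n - Q_{n-1})/n\), which is equivalent to the `cumRewards ./ cumSelection` ratio but with no running sums. A quick sketch (function name is my own):

```julia
# Incremental mean: Q_n = Q_{n-1} + (R_n - Q_{n-1}) / n
function incremental_mean(rewards)
    Q, n = 0.0, 0
    for r in rewards
        n += 1
        Q += (r - Q) / n   # nudge the estimate towards the latest reward
    end
    Q
end
```

`incremental_mean(rewards)` matches `sum(rewards) / length(rewards)` up to floating-point error, and replacing the `1/n` factor with a constant step size gives the exponentially-weighted variant often used for non-stationary bandits.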

The probability of choosing a random lever is called the learning rate (\(\eta\)) and controls how often we make the perceived optimal choice (in the wider bandit literature this is usually written \(\epsilon\), as in \(\epsilon\)-greedy). A high value of \(\eta\) means lots of exploring (learning), and a low value restricts the learning, meaning we pull the (perceived) best lever almost every time. So if we had many levers and a low learning rate, it is possible that we never find the globally optimal lever and instead stick to a locally optimal one, hence the name greedy learner: it can get stuck.

```
function greedy_learner(rewardMeans, rewardSD, nPlays, eta)
    nLevers = length(rewardMeans)
    selectedLever = zeros(Int64, nPlays)
    rewards = zeros(nPlays)
    cumSelection = zeros(Int64, nLevers)
    cumRewards = zeros(nLevers)
    optimalChoice = Array{Bool}(undef, nPlays)
    bestLever = findmax(rewardMeans)[2]
    for i = 1:nPlays
        if rand() < eta
            selectedLever[i] = rand(1:nLevers)
        else
            q = cumRewards ./ cumSelection
            q[isnan.(q)] .= 0
            selectedLever[i] = findmax(q)[2]
        end
        optimalChoice[i] = selectedLever[i] == bestLever
        rewards[i] = rand(Normal(rewardMeans[selectedLever[i]], rewardSD[selectedLever[i]]))
        cumSelection[selectedLever[i]] += 1
        cumRewards[selectedLever[i]] += rewards[i]
    end
    return selectedLever, rewards, cumSelection, cumRewards, optimalChoice
end
```

Again, we can run it for 1,000 steps and we set our learning rate to 0.5.

```
greedyStrat = greedy_learner(rewardMeans, rewardSD, 1000, 0.5)
histogram(greedyStrat[1], label = "Number of Times Lever Pulled", legend = :topleft)
```

This has done what we thought, it has selected the 4th lever that we thought looked the best from the distribution. So we’ve learned something, hooray!

The \(\eta\) parameter was set to 0.5 above, but how does varying it change the outcome? To explore this we will do multiple runs of multiple plays of the game and also increase the number of levers. For each run, we generate a new set of reward means/standard deviations and run the random learner and the greedy learner with different values of \(\eta\).

```
nRuns = 2000
nPlays = 1000
nLevers = 10
optimalLevel = zeros(nRuns)
randomRes = Array{Tuple}(undef, nRuns)
greedyRes = Array{Tuple}(undef, nRuns)
greedyRes05 = Array{Tuple}(undef, nRuns)
greedyRes01 = Array{Tuple}(undef, nRuns)
greedyRes001 = Array{Tuple}(undef, nRuns)
greedyRes0001 = Array{Tuple}(undef, nRuns)
for i=1:nRuns
    rewardMeans = rand(Normal(0, 1), nLevers)
    rewardSD = ones(nLevers)
    randomRes[i] = random_learner(rewardMeans, rewardSD, nPlays)
    greedyRes[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0)
    greedyRes05[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.5)
    greedyRes01[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.1)
    greedyRes001[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.01)
    greedyRes0001[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.001)
    optimalLevel[i] = findmax(rewardMeans)[2]
end
```

For each of the runs we have the evolution of the reward, so we want to take the average of the reward on each time step and see how that evolves with each play of the game.

```
randomAvg = mapreduce(x-> x[2], +, randomRes) ./ nRuns
greedyAvg = mapreduce(x-> x[2], +, greedyRes) ./ nRuns
greedyAvg01 = mapreduce(x-> x[2], +, greedyRes01) ./ nRuns
greedyAvg05 = mapreduce(x-> x[2], +, greedyRes05) ./ nRuns
greedyAvg001 = mapreduce(x-> x[2], +, greedyRes001) ./ nRuns;
greedyAvg0001 = mapreduce(x-> x[2], +, greedyRes0001) ./ nRuns;
```

And plotting the average reward over time.

```
plot(1:nPlays, randomAvg, label="Random", legend = :bottomright, xlabel = "Time Step", ylabel = "Average Reward")
plot!(1:nPlays, greedyAvg, label="0")
plot!(1:nPlays, greedyAvg05, label="0.5")
plot!(1:nPlays, greedyAvg01, label="0.1")
plot!(1:nPlays, greedyAvg001, label="0.01")
plot!(1:nPlays, greedyAvg0001, label="0.001")
```

Good to see that all the greedy learners outperform the random learner, so the algorithm is doing something. If we focus on the greedy learners we can see how the learning rate changes performance.

```
plot(1:nPlays, greedyAvg, label="0", legend=:bottomright, xlabel = "Time Step", ylabel = "Average Reward")
plot!(1:nPlays, greedyAvg01, label="0.1")
plot!(1:nPlays, greedyAvg001, label="0.01")
plot!(1:nPlays, greedyAvg0001, label="0.001")
```

This is an interesting result! When \(\eta = 0\) the learner never reaches as high a reward as the other learning rates. With \(\eta = 0\) we never explore the other options; we just select what we think is the best one from history and never stray from our beliefs. This ultimately hurts us because if we don't happen upon the best lever early on we are stuck in a suboptimal choice. Likewise, when the learning rate is very low it doesn't do much better, so there is always value in exploring the options.

Philosophically, this shows that with any procedure you need to iterate through different configurations and explore the outcomes rather than sticking with what you believe is optimal.

```
scatter([0, 0.5, 0.1,0.01, 0.001],
map(x-> mean(x[750:1000]), [greedyAvg, greedyAvg05, greedyAvg01, greedyAvg001, greedyAvg0001]),
xlabel="Learning Rate",
ylabel = "Converged Reward", legend=:none)
```

The learning rate looks like it is optimal around 0.1. You can do a grid search to see how the overall behaviour changes in terms of both the speed of convergence to the final state and how good that final reward state is.
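Such a grid search is only a few lines. Here is a hedged Python sketch of the idea (a plain-Python analogue of the Julia code above; all names and parameter values are illustrative, not from the post):

```python
import random

# Minimal grid search over the exploration rate eta for a Gaussian
# multi-armed bandit, averaging reward over several independent runs.
def grid_search(etas=(0.0, 0.01, 0.1, 0.5), n_levers=10, n_plays=500,
                n_runs=50, seed=1):
    rng = random.Random(seed)
    avg_reward = {}
    for eta in etas:
        total = 0.0
        for _ in range(n_runs):
            means = [rng.gauss(0, 1) for _ in range(n_levers)]
            q = [0.0] * n_levers          # running average reward per lever
            counts = [0] * n_levers
            for _ in range(n_plays):
                if rng.random() < eta:    # explore: random lever
                    lever = rng.randrange(n_levers)
                else:                     # exploit: current best estimate
                    lever = max(range(n_levers), key=q.__getitem__)
                reward = rng.gauss(means[lever], 1)
                counts[lever] += 1
                q[lever] += (reward - q[lever]) / counts[lever]
                total += reward
        avg_reward[eta] = total / (n_runs * n_plays)
    return avg_reward
```

Plotting the returned dictionary against the grid of \(\eta\) values reproduces the convergence comparison above on a smaller scale.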

We can improve the above implementation, saving memory and CPU cycles, by doing 'online learning' of the rewards and using that to drive the selection. We create one matrix \(Q\), update it with the running average reward of each lever, and use the maximum of each row to select our lever when we are not exploring.

```
function greedy_learner_incremental(rewardMeans, rewardSD, nPlays, eta)
    nLevers = length(rewardMeans)
    selectedLever = zeros(Int64, nPlays)
    rewards = zeros(nPlays)
    cumSelection = zeros(Int64, nLevers)
    cumRewards = zeros(nLevers)
    Q = zeros((nPlays+1, nLevers))
    rewardsArray = zeros(nLevers)
    optimalChoice = Array{Bool}(undef, nPlays)
    bestLever = findmax(rewardMeans)[2]
    for i = 1:nPlays
        if rand() < eta
            selectedLever[i] = rand(1:nLevers)
        else
            selectedLever[i] = findmax(Q[i,:])[2]
        end
        optimalChoice[i] = selectedLever[i] == bestLever
        reward = rand(Normal(rewardMeans[selectedLever[i]], rewardSD[selectedLever[i]]))
        rewards[i] = reward
        rewardsArray[selectedLever[i]] = reward
        cumSelection[selectedLever[i]] += 1
        cumRewards[selectedLever[i]] += reward
        Q[i+1, :] = Q[i, :] + (1/i) * (rewardsArray - Q[i,:])
    end
    return selectedLever, rewards, cumSelection, cumRewards, optimalChoice
end
```

Using the standard Julia benchmarking tools we can get a good idea of whether this rewrite has changed anything materially.

```
using BenchmarkTools
oldImp = @benchmark greedy_learner(rewardMeans, rewardSD, nPlays, 0.1)
newImp = @benchmark greedy_learner_incremental(rewardMeans, rewardSD, nPlays, 0.1)
judge(median(oldImp), median(newImp))
```

```
BenchmarkTools.TrialJudgement:
time: -43.91% => improvement (5.00% tolerance)
memory: -70.15% => improvement (1.00% tolerance)
```

It’s roughly 44% faster and uses 70% less memory, so a good optimisation.
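The saving comes from the incremental-mean update \(Q_{n} = Q_{n-1} + (R_n - Q_{n-1})/n\), which reproduces the batch average with O(1) memory instead of re-summing the reward history. A quick plain-Python check of that identity (values are illustrative):

```python
# Verify the incremental-mean update Q_n = Q_{n-1} + (R_n - Q_{n-1})/n
# matches the batch average of the rewards seen so far.
rewards = [1.0, 3.0, 2.0, 6.0]

q = 0.0
for n, r in enumerate(rewards, start=1):
    q += (r - q) / n              # O(1) memory, no history kept

batch_mean = sum(rewards) / len(rewards)
assert abs(q - batch_mean) < 1e-12
print(q)  # 3.0
```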

This is a basic intro to reinforcement learning, but a good foundation for how to think about these problems. The main step is going from data to decisions and updating those decisions each time. You need to make sure you explore the problem space, as otherwise you never know how much better some other options might be.

]]>Enjoy these types of posts? Then you should sign up for my newsletter.

I’ve written before about predicting the number of goals in a game and this is a complement to that post. Part of my PhD involved fitting a multidimensional Hawkes process to the times of goals scored by the home and away teams; this post isn’t as complicated as that, instead we look at something simpler.

This is a change of language too, I’m writing R instead of Julia for once!

```
require(jsonlite)
require(dplyr)
require(tidyr)
require(ggplot2)
knitr::opts_chunk$set(fig.retina=2)
require(hrbrthemes)
theme_set(theme_ipsum())
extrafont::loadfonts()
require(wesanderson)
```

I have a dataset that contains the odds and the times of goals for many different football matches.

```
finalData <- readRDS("/Users/deanmarkwick/Documents/PhD/Research/Hawkes and Football/Data/allDataOddsAndGoals.RDS")
```

We do some wrangling of the data, converting it from the JSON format to give us a vector of each team’s goals split into whether they are home or away.

```
homeGoalTimes <- lapply(finalData$home.mins.goal, fromJSON)
awayGoalTimes <- lapply(finalData$away.mins.goal, fromJSON)
allGoals <- c(unlist(homeGoalTimes), unlist(awayGoalTimes))
```

To clean the data we need to convert the games without goals to an empty numeric vector and also truncate any goals scored in extra time, since we need a fixed window for the point process modelling.

```
replaceEmptyWithNumeric <- function(x){
  if(length(x) == 0){
    return(numeric(0))
  } else {
    return(x)
  }
}

max90 <- function(x){
  x[x > 90] <- 90
  return(x)
}

homeGoalTimesClean <- lapply(homeGoalTimes, replaceEmptyWithNumeric)
homeGoalTimesClean <- lapply(homeGoalTimesClean, max90)
awayGoalTimesClean <- lapply(awayGoalTimes, replaceEmptyWithNumeric)
awayGoalTimesClean <- lapply(awayGoalTimesClean, max90)

As the number of goals each team scores will be proportional to the strength of the team, we will use the odds of the team winning the match as a proxy for their strength. This works well, as my previous blog post, Goals from team strengths, explored.

```
homeProbsStrengths <- finalData$PSCH
awayProbsStrengths <- finalData$PSCA
allStrengths <- c(homeProbsStrengths, awayProbsStrengths)
allGoalTimes <- c(homeGoalTimesClean, awayGoalTimesClean)
```

Interestingly, we can do the same cleaning in `dplyr` easily using the `case_when` function.

```
allGoalsFrame <- data.frame(Time = allGoals)
allGoalsFrame %>%
mutate(TimeClean = case_when(Time > 90 ~ 90,
TRUE ~ as.numeric(Time))) -> allGoalsFrame
```

After all that we can plot our distribution of goal times.

```
ggplot(allGoalsFrame, aes(x=TimeClean, y=after_stat(density))) +
geom_histogram(binwidth = 1) +
xlab("Time (mins)") +
ylab("Goal Density")
```

Two bumps: one around 45 minutes, where goals are scored during first-half stoppage time, and the 90+ minute goals.

This is what we are trying to model. We want to predict when the goals will happen based on that team’s strength, which will also control how many goals are scored.

A point process is a mathematical model that describes when things happen in a fixed window. Our window is the 90 minutes of the football match and we want to know where the goals fall in this window.

A point process is described by its intensity \(\lambda (t)\), which is proportional to the likelihood of seeing an event at time \(t\): the higher the intensity, the larger the chance of a goal occurring. From our plot above we can see there are two main features we want our model to capture:

- The general increase in goals as the match progresses.
- The spike at 90 because of extra time.

To fit this type of model we will write an intensity function \(\lambda\) and optimise the parameters to maximise the likelihood.

The log-likelihood for a point process is the sum of the log-intensity \(\log \lambda(t_i)\) at each event minus the integral of the intensity function over the window

\[\mathcal{L} = \sum _{i} \log \lambda (t_i) - \int _0^T \lambda (t) \mathrm{d} t.\]We have to specify the form of \(\lambda\) with a function and parameters and then fit the parameters to the data. By looking at the data we can see the intensity appears to be increasing and we need to account for the spike at 90

\[\lambda (t) = w \beta _0 + \beta _1 \frac{t}{T} + \beta _{90} \delta (t-90),\]where \(w\) is the team strength, \(T\) is 90 and \(\delta (x)\) is the Dirac delta function. More on that later.

Which we can easily integrate.

\[\int _0^T \lambda(t) \mathrm{d} t = w \beta_0 T + \beta _1 \frac{T}{2} + \beta_{90}.\]This gives us our likelihood function, so we can move on to optimising it over our data.
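As a quick numerical sanity check on that integral (parameter values here are illustrative; the delta term contributes exactly \(\beta_{90}\) by definition of its unit mass):

```python
# Midpoint-rule check that the continuous part of the intensity
# integrates to w*b0*T + b1*T/2, with the delta adding b90 on top.
w, b0, b1, b90, T = 1.0, 3.0, 2.0, 2.0, 90.0

n = 100_000
dt = T / n
numeric = sum((w * b0 + b1 * ((i + 0.5) * dt) / T) * dt for i in range(n)) + b90
closed = w * b0 * T + b1 * T / 2 + b90
assert abs(numeric - closed) < 1e-6
print(closed)  # 362.0
```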

It’s always good to make sure you are on the right track by simulating the models you are exploring. Jumping straight into the real data means you are hoping your methods are correct, but starting with a known model and using the methods to recover the parameters gives you some confidence that what you are doing is correct.

There are three components to our model:

- the intensity function
- the integrated intensity function
- the likelihood

We will also be using a Dirac delta function to represent the 90-minute spike. Given our data is measured in minutes and all the goals that happen in extra time have the value `t=90`, we need a sensible way to account for this mega spike. Essentially, we want something that is 1 at a single point and 0 everywhere else. That way we can assign a weight to this component in the overall model, helping describe the data, and it also integrates nicely.

Now I’m a physicist by training, so my mathematical rigour around the function might not be up to scratch.

```
diract <- function(t, x=90){
  2*as.numeric((round(t) == x))
}

qplot(seq(0, 100, 0.1), diract(seq(0, 100, 0.1))) +
  xlab("Time") +
  ylab("Weight")
```

As expected, the function is non-zero at 90 and 0 everywhere else. (The factor of 2 compensates for the fact that only the half-minute window \([89.5, 90]\) of the match rounds to 90, so the spike integrates to roughly 1 over the interval.)

We can now write the R code for our intensity function, and then the likelihood by combining the intensity and integrated intensity.

```
intensityFunction <- function(params, t, winProb, maxT){
  beta0 <- params[1]
  beta1 <- params[2]
  beta90 <- params[3]
  int <- (winProb * beta0) + (beta1 * (t/maxT)) + (beta90*diract(t))
  int[int < 0] <- 0
  int
}

intensityFunctionInt <- function(params, maxT, winProb){
  beta0 <- params[1]
  beta1 <- params[2]
  beta90 <- params[3]
  beta0*winProb*maxT + (beta1*maxT)/2 + beta90
}

likelihood <- function(params, t, winProb){
  ss <- sum(log(intensityFunction(params, t, winProb, 90)))
  int <- intensityFunctionInt(params, 90, winProb)
  ss - int
}
```

We now combine the three functions and simulate a point process from the intensity function. We will use *thinning* to simulate the inhomogeneous intensity. This means generating more points than expected from a larger intensity, and then choosing which ones remain based on the ratio between the larger intensity and the true intensity. For a more in-depth discussion, I’ve written about it previously in my post.
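In Python terms, thinning looks like the following (a hedged generic sketch mirroring the R below, using NumPy; `lambda_fn` and `lam_max` are assumed inputs, with `lam_max` an upper bound on the intensity over the window):

```python
import numpy as np

# Thinning for an inhomogeneous Poisson process on [0, T]:
# draw candidates from a dominating homogeneous process at rate lam_max,
# then keep each candidate with probability lambda(t) / lam_max.
def thin(lambda_fn, lam_max, T, rng):
    n = rng.poisson(lam_max * T)                         # candidate count
    t = rng.uniform(0, T, size=n)                        # candidate times
    keep = rng.uniform(size=n) < lambda_fn(t) / lam_max  # accept ratio
    return np.sort(t[keep])

rng = np.random.default_rng(0)
# Illustrative linear intensity 3 + 2t/90, bounded above by 5 on [0, 90];
# the expected event count is its integral, 3*90 + 2*45 = 360.
events = thin(lambda t: 3 + 2 * t / 90, 5.0, 90, rng)
```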

```
sim_events <- function(params, winProb){
  lambdaMax <- 1.1*intensityFunction(params, 90, winProb, 90)
  nevents <- rpois(1, lambdaMax*90)
  tstar <- runif(nevents, 0, 90)
  accept_prob <- intensityFunction(params, tstar, winProb, 90) / lambdaMax
  sort(tstar[runif(length(accept_prob)) < accept_prob])
}
```

```
N <- 100
testParams <- c(3, 2, 2)
testWinProb <- 1
testEvents <- replicate(N, sim_events(testParams, testWinProb))
testWinProbs <- rep_len(testWinProb, N)
trueInt <- intensityFunction(testParams, 0:90, testWinProb, 90)
```

As we have multiple simulated games, we want to calculate the overall likelihood across the total sample and maximise that likelihood.

```
alllikelihood <- function(params, events, winProbs){
  ll <- sum(vapply(seq_along(events),
                   function(i) likelihood(params, events[[i]], winProbs[[i]]),
                   numeric(1)))
  if(ll == -Inf){
    return(-1e9)
  } else {
    return(ll)
  }
}
trueLikelihood <- alllikelihood(testParams, testEvents, testWinProbs)
```

Simple enough to do the optimisation: chuck the function into `optim` and away we go.

```
simRes <- optim(runif(3), function(x) -1*alllikelihood(c(x[1], x[2], x[3]),
testEvents,
testWinProbs), lower = c(0,0,0), method = "L-BFGS-B")
print(simRes$par)
```

3.005867 1.995551 1.932193

The parameters come out almost exactly as they were specified.

```
simResDF <- data.frame(Time = 0:90,
TrueIntensity = trueInt,
EstimatedIntensity = intensityFunction(simRes$par, 0:90, testWinProb, 90))
ggplot(simResDF, aes(x=Time, y=TrueIntensity, color = "True")) +
geom_line() +
geom_line(aes(y=EstimatedIntensity, color = "Estimated")) +
labs(color = NULL) +
xlab("Time") +
ylab("Intensity") +
theme(legend.position = "bottom")
```

Okay, so our method is good. We’ve recovered all three factors in the intensity so well that you can hardly tell the difference between the real and estimated intensities. So we can now go on looking at our data.

Let’s do the train/test split and fit our model on the training data.

```
trainInds <- sample.int(length(allGoalTimes), size = floor(length(allGoalTimes)*0.7))
goalTimesTrain <- allGoalTimes[trainInds]
strengthTrain <- allStrengths[trainInds]
goalTimesTest <- allGoalTimes[-trainInds]
strengthTest <- allStrengths[-trainInds]
```

We start by using a null model. This is where we will just use the constant parameter and the team strengths and see how well that fits the data.

```
optNull <- optim(runif(1), function(x) -1*alllikelihood(c(x[1], 0, 0),
goalTimesTrain,
strengthTrain), lower = c(0,0,0), method = "L-BFGS-B")
optNull
```

We add in the next parameter, the linear trend.

```
optNull2 <- optim(runif(2), function(x) -1*alllikelihood(c(x[1], x[2], 0),
goalTimesTrain,
strengthTrain), lower = c(0,0,0), method = "L-BFGS-B")
optNull2
```

We can now use all the features previously described and fit the full model across the data.

```
optRes <- optim(runif(3), function(x) -1*alllikelihood(x,
goalTimesTrain,
strengthTrain), lower = c(0,0,0), method = "L-BFGS-B")
optRes
```

And then just to check, let’s remove the linear parameter.

```
optRes2 <- optim(runif(2), function(x) -1*alllikelihood(c(x[1], 0, x[2]),
goalTimesTrain,
strengthTrain), lower = c(0,0,0), method = "L-BFGS-B")
optRes2
```

Putting all the results into a table lets us compare nicely.

Model | \(\beta _0\) | \(\beta _1\) | \(\beta _{90}\) |
---|---|---|---|
Constant | 0.0039 | - | - |
Linear | 0.0006 | 0.025 | - |
Delta | 0.00096 | 0.022 | 0.05 |
No Linear | 0.0037 | - | 0.06 |

The positive linear parameter (\(\beta _1\)) shows that there is an increase in probability towards the end of the match.

It is easier to compare the resultant intensity functions though.

```
modelFits <- data.frame(Time = 0:90)
modelFits$Null <- intensityFunction(c(optNull$par[1],0,0), modelFits$Time, 2, 90)
modelFits$Linear <- intensityFunction(c(optNull2$par ,0), modelFits$Time, 2, 90)
modelFits$Delta <- intensityFunction(optRes$par, modelFits$Time, 2, 90)
modelFits$NoLinear <- intensityFunction(c(optRes2$par[1], 0, optRes2$par[2]), modelFits$Time, 2, 90)
modelFits %>%
pivot_longer(!Time, names_to="Model", values_to="Intensity") -> modelFitsTidy
ggplot(modelFitsTidy, aes(x=Time, y=Intensity, color = Model)) +
geom_line() +
theme(legend.position = "bottom")
```

So, some interesting differences between the models. The Delta model has a shallower slope than the Linear model because the spike term can accommodate the burst of goals at the end. When looking at the final likelihoods from the models:

Model | Out of Sample Likelihood |
---|---|
Constant | -55337.35 |
Linear | -52268.48 |
Delta | -51917.7 |
No Linear | -54500.6 |

So, the best-fitting model (largest likelihood) is the Delta model; that 90-minute spike is doing some work. It also shows that the linear component contributes something, as the No Linear result has a worse likelihood.

Using the likelihood to evaluate the model is only one approach though. We could go further with BIC/AIC/DIC values, but given there are only three parameters in the model it probably won’t be instructive. Instead, we should look at what the model’s simulated results look like.

We go through each of the test set matches and simulate each match 100 times, taking the maximum number of goals scored. We then compare this to the maximum observed number of goals across the data set and see how the distributions compare.

This is similar to the posterior predictive p-value method for model checking, but slightly different in this case because we do not have a chain of parameters, just the optimised values.

```
maxGoals <- vapply(strengthTest,
                   function(x) max(replicate(100, length(sim_events(optRes$par, x)))),
                   numeric(1))
actualMaxGoals <- max(vapply(allGoalTimes, length, numeric(1)))
```

```
ggplot(data = data.frame(MaxGoals = maxGoals), aes(x=MaxGoals)) +
geom_histogram(binwidth = 1) +
geom_vline(xintercept = actualMaxGoals) +
xlab("Maximum Number of Goals")
```

10 is the largest number of goals observed, and our model congregates around 5 as the maximum, but we did see 2 simulations with 10 goals and another 2 with more than 10. So overall, the model can generate something that resembles reality, if only infrequently. But then again, how often do we see 10-goal games?

Overall this is a nice little model that shows the probability of a team scoring appears to increase linearly over time. We added a delta function to account for the fact that some games go beyond 90 minutes and many goals are scored in that period. We then did some model checking by simulating with the fitted parameters, and it turns out the model can generate numbers of goals comparable to the real data.

I’ve fitted this model by optimising the likelihood, so the next logical step would be to take a Bayesian approach and throw the model into Stan so we have a proper sample of parameters that lets us judge the uncertainty around the model a bit better. The next direction after that would be to relax the linearity of the model, throw a non-parametric approach at the data, and see if anything interesting turns up. I have been trying this with my dirichletprocess package but never managed to get a satisfying result that improved on the above. Plus, with the large dataset, it was taking forever to run. Maybe a blog post for the future!

]]>Enjoy these types of posts? Then you should sign up for my newsletter.

I’m using Julia 1.9 and my AlpacaMarkets.jl package gets all the data we need.

```
using AlpacaMarkets
using DataFrames, DataFramesMeta
using Dates
using Plots
using RollingFunctions, Statistics
using GLM
```

To start with we simply want the daily prices of JPM, XLF, and SPY. JPM is the stock we think will go through mean reversion, XLF is the financial sector ETF, and SPY is the broad market ETF.

Our thesis is that if JPM rises higher than XLF then it will soon revert and trade lower. Likewise, if JPM falls lower than XLF then we think it will soon trade higher. Our mean reversion is all about JPM around XLF. We’ve chosen XLF as it represents the general financial sector landscape, so it will represent the sector outlook more consistently than JPM on its own.

```
jpm = AlpacaMarkets.stock_bars("JPM", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1]
xlf = AlpacaMarkets.stock_bars("XLF", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1];
spy = AlpacaMarkets.stock_bars("SPY", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1];
```

We want to clean the data to format the date correctly and select the close and open columns.

```
function parse_date(t)
    Date(string(split(t, "T")[1]))
end

function clean(df, x)
    df = @transform(df, :Date = parse_date.(:t), :Ticker = x, :NextOpen = [:o[2:end]; NaN])
    @select(df, :Date, :c, :o, :Ticker, :NextOpen)
end
```

Now we calculate the close-to-close log returns and format the data into a column for each asset.

```
jpm = clean(jpm, "JPM")
xlf = clean(xlf, "XLF")
spy = clean(spy, "SPY")
allPrices = vcat(jpm, xlf, spy)
allPrices = sort(allPrices, :Date)
allPrices = @transform(groupby(allPrices, :Ticker),
:Return = [NaN; diff(log.(:c))],
:ReturnO = [NaN; diff(log.(:o))],
:ReturnTC = [NaN; diff(log.(:NextOpen))]);
modelData = unstack(@select(allPrices, :Date, :Ticker, :Return), :Date, :Ticker, :Return)
modelData = modelData[2:end, :];
last(modelData, 4)
```

4 rows × 4 columns

 | Date | JPM | XLF | SPY |
---|---|---|---|---|
1 | 2023-06-30 | 0.0138731 | 0.00864001 | 0.0117316 |
2 | 2023-07-03 | 0.00799894 | 0.00562049 | 0.00114985 |
3 | 2023-07-05 | -0.00661524 | -0.00206703 | -0.0014883 |
4 | 2023-07-06 | -0.00993581 | -0.00860923 | -0.00786148 |

Looking at the cumulative returns we can see that all three move in sync

```
plot(modelData.Date, cumsum(modelData.JPM), label = "JPM")
plot!(modelData.Date, cumsum(modelData.XLF), label = "XLF")
plot!(modelData.Date, cumsum(modelData.SPY), label = "SPY", legend = :left)
```

The key point is that they are moving in sync with each other. Given XLF has JPM included in it, this is expected but it also presents the opportunity to trade around any dispersion between the ETF and the individual name.

- https://math.stackexchange.com/questions/345773/how-the-ornstein-uhlenbeck-process-can-be-considered-as-the-continuous-time-anal

Let’s think simply about pairs trading. We have two securities that we want to trade if their prices change too much, so our variable of interest is

\[e = P_1 - P_2\]and we will enter a trade if \(e\) becomes large enough in either the positive or negative direction.

To translate that into a statistical problem we have two steps.

- Work out the difference between the two securities
- Model how the difference changes over time.

Step 1 is a simple regression of the stock vs the ETF we are trading against. Step 2 needs a bit more thought, but is still only a simple regression.

In our data, we have the daily returns of JPM, the XLF ETF, and the SPY ETF. To work out the interdependence, it’s just a case of simple linear regression.

```
regModel = lm(@formula(JPM ~ XLF + SPY), modelData)
```

```
JPM ~ 1 + XLF + SPY
Coefficients:
──────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────────
(Intercept) 0.000188758 0.000162973 1.16 0.2469 -0.0001309 0.000508417
XLF 1.35986 0.0203485 66.83 <1e-99 1.31995 1.39977
SPY -0.363187 0.0260825 -13.92 <1e-41 -0.414345 -0.312028
──────────────────────────────────────────────────────────────────────────────────
```

From the slope coefficients, we can see that JPM = 1.36XLF - 0.36SPY, so JPM has a \(\beta\) of 1.36 to the XLF ETF and a \(\beta\) of -0.36 to the SPY ETF, or the general market. So each day we can approximate JPM’s return by combining the XLF and SPY returns.

This is our economic factor model, which describes from a ‘big picture’ kind of way how the stock trades vs the general market (SPY) and its sector-specific market (XLF).

What we need to do next is look at what this model *doesn’t* explain and try to describe that.

Any difference around this model can be explained by the summation of the residuals over time. In the paper the sum of the residuals over time is called the ‘auxiliary process’ and this is the data behind the second regression.

```
plot(scatter(modelData.Date, residuals(regModel), label = "Residuals"),
plot(modelData.Date,cumsum(residuals(regModel)),
label = "Aux Process"),
layout = (2,1))
```

We believe the auxiliary process (the cumulative sum of the residuals) can be modelled using an Ornstein-Uhlenbeck (OU) process.

An OU process is a type of differential equation that displays mean reversion behaviour. If the process falls away from its average level then it will be forced back.

\[dX = \kappa (m - X(t))dt + \sigma \mathrm{d} W\]\(\kappa\) represents how quickly the mean reversion occurs.

To fit this type of process we need to recognise that the above differential form of an OU process can be discretised to become a simple AR(1) model where the model parameters can be transformed to get the OU parameters.
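Concretely, over a time step \(\Delta t\) the OU process satisfies \(X_{t+1} = a + b X_t + \epsilon_t\) with \(b = e^{-\kappa \Delta t}\) and \(a = m(1-b)\), which is why an AR(1) regression recovers the OU parameters. A hedged NumPy sketch on simulated data (parameter values are illustrative, not from the post):

```python
import numpy as np

# Simulate an AR(1) with b = exp(-kappa*dt) and a = m*(1-b), then recover
# the OU parameters from an OLS fit, mirroring the transforms used later.
rng = np.random.default_rng(42)
kappa, m_true, dt = 5.0, 1.5, 1 / 252
b = np.exp(-kappa * dt)
a = m_true * (1 - b)

x = np.empty(100_000)
x[0] = m_true                     # start at the long-run mean (no burn-in)
for t in range(1, len(x)):
    x[t] = a + b * x[t - 1] + 0.01 * rng.standard_normal()

# OLS of x[t] on x[t-1] gives (a_hat, b_hat); transform back to OU terms.
A = np.column_stack([np.ones(len(x) - 1), x[:-1]])
a_hat, b_hat = np.linalg.lstsq(A, x[1:], rcond=None)[0]
kappa_hat = -np.log(b_hat) / dt   # mean-reversion speed (per year)
m_hat = a_hat / (1 - b_hat)       # long-run mean
```

With this many samples the estimates land close to \(\kappa = 5\) and \(m = 1.5\).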

We now fit the OU process onto the cumulative sum of the residuals from the first model. If the residuals have some sort of structure/pattern then this means our original model was missing some variable that explains the difference.

```
X = cumsum(residuals(regModel))
xDF = DataFrame(y=X[2:end], x = X[1:end-1])
arModel = lm(@formula(y~x), xDF)
```

```
y ~ 1 + x
Coefficients:
─────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept) 4.41618e-6 0.000162655 0.03 0.9783 -0.000314618 0.000323451
x 0.997147 0.00186733 534.00 <1e-99 0.993484 1.00081
─────────────────────────────────────────────────────────────────────────────────
```

We take these coefficients and transform them into the parameters from the paper.

```
varEta = var(residuals(arModel))
a, b = coef(arModel)
k = -log(b)*252
m = a/(1-b)
sigma = sqrt((varEta * 2 * k) / (1-b^2))
sigma_eq = sqrt(varEta / (1-b^2))
[m, sigma_eq]
```

```
2-element Vector{Float64}:
0.0015477568390823153
0.08709971423424319
```

So \(m\) gives us the average level and \(\sigma_{\text{eq}}\) the appropriate scale.

Now to build the mean reversion signal. We still have \(X\) as our auxiliary process which we believe is mean reverting. We now have the estimated parameters on the scale of this mean reversion so we can transform the auxiliary process by these parameters and use this to see when the process is higher or lower than the model suggests it should be.

```
modelData.Score = (X .- m)./sigma_eq;
plot(modelData.Date, modelData.Score, label = "s")
hline!([-1.25], label = "Long JPM, Short XLF", color = "red")
hline!([-0.5], label = "Close Long Position", color = "red", ls=:dash)
hline!([1.25], label = "Short JPM, Long XLF", color = "purple")
hline!([0.75], label = "Close Short Position", color = "purple", ls = :dash, legend=:topleft)
```

The red lines indicate when JPM has diverged from XLF on the negative side, i.e. we expect JPM to move higher and XLF to move lower. We enter the position if s < -1.25 (solid red line) and exit the position when s > -0.5 (dashed red line).

- Buy to open if \(s < -s_{bo}\) (\(s < -1.25\)): buy 1 JPM, sell \(\beta\) XLF
- Close the long position if \(s > -s_{c}\) (\(s > -0.5\))

The purple line is the same but in the opposite direction.

- Sell to open if \(s > s_{so}\) (\(s > 1.25\)): sell 1 JPM, buy \(\beta\) XLF
- Close the short position if \(s < s_{bc}\) (\(s < 0.75\))
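These rules amount to a small flat/long/short state machine. A minimal plain-Python sketch (`positions` is a hypothetical helper, with +1/-1/0 standing in for the sized positions):

```python
# Entry/exit thresholds as in the text: open long below -1.25, close the
# long above -0.5; open short above 1.25, close the short below 0.75.
def positions(scores):
    pos, out = 0, []
    for s in scores:
        if pos == 0:
            if s < -1.25:
                pos = 1       # buy to open
            elif s > 1.25:
                pos = -1      # sell to open
        elif pos == 1 and s > -0.5:
            pos = 0           # close the long
        elif pos == -1 and s < 0.75:
            pos = 0           # close the short
        out.append(pos)
    return out

print(positions([-1.3, -1.0, -0.4, 1.3, 0.9, 0.5]))  # [1, 1, 0, -1, -1, 0]
```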

That’s the modelling part done. We model how the stock moves based on the overall market, and then use the OU process on any differences from this model to come up with the mean reversion parameters.

So, does it make money?

To backtest this type of model we have to roll through time and calculate both regressions to construct the signal.

A couple of new additions too:

- We shift and scale the returns when doing the macro regression.
- The auxiliary process on the last day is always 0 (OLS residuals with an intercept sum to zero), which makes calculating the signal simple: it is just \(-m/\sigma_{eq}\).

```
paramsRes = Array{DataFrame}(undef, length(90:(nrow(modelData) - 90)))

for (j, i) in enumerate(90:(nrow(modelData) - 90))
    modelDataSub = modelData[i:(i+90), :]
    modelDataSub.JPM = (modelDataSub.JPM .- mean(modelDataSub.JPM)) ./ std(modelDataSub.JPM)
    modelDataSub.XLF = (modelDataSub.XLF .- mean(modelDataSub.XLF)) ./ std(modelDataSub.XLF)
    modelDataSub.SPY = (modelDataSub.SPY .- mean(modelDataSub.SPY)) ./ std(modelDataSub.SPY)

    macroRegr = lm(@formula(JPM ~ XLF + SPY), modelDataSub)
    auxData = cumsum(residuals(macroRegr))
    ouRegr = lm(@formula(y~x), DataFrame(x=auxData[1:end-1], y=auxData[2:end]))

    varEta = var(residuals(ouRegr))
    a, b = coef(ouRegr)
    k = -log(b)*252
    m = a/(1-b)
    sigma = sqrt((varEta * 2 * k) / (1-b^2))
    sigma_eq = sqrt(varEta / (1-b^2))

    paramsRes[j] = DataFrame(Date = modelDataSub.Date[end],
        MacroBeta_XLF = coef(macroRegr)[2], MacroBeta_SPY = coef(macroRegr)[3], MacroAlpha = coef(macroRegr)[1],
        VarEta = varEta, OUA = a, OUB = b, OUK = k, Sigma = sigma, SigmaEQ = sigma_eq,
        Score = -m/sigma_eq)
end
paramsRes = vcat(paramsRes...)
last(paramsRes, 4)
```

4 rows × 11 columns (omitted printing of 4 columns)

 | Date | MacroBeta_XLF | MacroBeta_SPY | MacroAlpha | VarEta | OUA | OUB |
---|---|---|---|---|---|---|---|
1 | 2023-06-30 | 0.974615 | -0.230273 | 1.10933e-17 | 0.331745 | 0.175358 | 0.830417 |
2 | 2023-07-03 | 0.96943 | -0.228741 | -5.73883e-17 | 0.331222 | 0.198176 | 0.826816 |
3 | 2023-07-05 | 0.971319 | -0.230438 | 2.38846e-17 | 0.335844 | 0.242754 | 0.841018 |
4 | 2023-07-06 | 0.974721 | -0.232765 | 5.09875e-17 | 0.331695 | 0.256579 | 0.823822 |

The benefit of doing it this way also means we can see how each \(\beta\) in the macro regression evolves.

```
plot(paramsRes.Date, paramsRes.MacroBeta_XLF, label = "XLF Beta")
plot!(paramsRes.Date, paramsRes.MacroBeta_SPY, label = "SPY Beta")
```

Good to see they are consistent in their signs and generally don’t vary a great deal.

In the OU process, we are also interested in the speed of the mean reversion as we don’t want to take a position that is very slow to revert to the mean level.

```
kplot = plot(paramsRes.Date, paramsRes.OUK, label = :none)
kplot = hline!([252/45], label = "K Threshold")
```

In the paper, they suggest making sure the reversion happens within half of the estimation period. As we are using 90 days, the horizontal line shows the threshold that \(k\) must exceed.
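My reading of where that threshold comes from (treat this as an assumption): the characteristic reversion time of an OU process is \(1/\kappa\) in years, and half of the 90-trading-day window is 45 days, so

\[\frac{1}{\kappa} \leq \frac{45}{252} \quad \Longrightarrow \quad \kappa \geq \frac{252}{45} \approx 5.6.\]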

Plotting the score function also shows how the model wants to go long/short the different components over time.

```
splot = plot(paramsRes.Date, paramsRes.Score, label = "Score")
hline!([-1.25], label = "Long JPM, Short XLF", color = "red")
hline!([-0.5], label = "Close Long Position", color = "red", ls=:dash)
hline!([1.25], label = "Short JPM, Long XLF", color = "purple")
hline!([0.75], label = "Close Short Position", color = "purple", ls = :dash)
```

We run through the allocation procedure and record the positions: long or short one unit of the stock, hedged with the opposite \(\beta\)-weighted amount of the ETFs.

```
paramsRes.JPM_Pos .= 0.0
paramsRes.XLF_Pos .= 0.0
paramsRes.SPY_Pos .= 0.0

for i in 2:nrow(paramsRes)
    if paramsRes.OUK[i] > 252/45
        if paramsRes.Score[i] >= 1.25
            paramsRes.JPM_Pos[i] = -1
            paramsRes.XLF_Pos[i] = paramsRes.MacroBeta_XLF[i]
            paramsRes.SPY_Pos[i] = paramsRes.MacroBeta_SPY[i]
        elseif paramsRes.Score[i] >= 0.75 && paramsRes.JPM_Pos[i-1] == -1
            paramsRes.JPM_Pos[i] = -1
            paramsRes.XLF_Pos[i] = paramsRes.MacroBeta_XLF[i]
            paramsRes.SPY_Pos[i] = paramsRes.MacroBeta_SPY[i]
        end
        if paramsRes.Score[i] <= -1.25
            paramsRes.JPM_Pos[i] = 1
            paramsRes.XLF_Pos[i] = -paramsRes.MacroBeta_XLF[i]
            paramsRes.SPY_Pos[i] = -paramsRes.MacroBeta_SPY[i]
        elseif paramsRes.Score[i] <= -0.5 && paramsRes.JPM_Pos[i-1] == 1
            paramsRes.JPM_Pos[i] = 1
            paramsRes.XLF_Pos[i] = -paramsRes.MacroBeta_XLF[i]
            paramsRes.SPY_Pos[i] = -paramsRes.MacroBeta_SPY[i]
        end
    end
end
```

To make sure we use the right return, we lead the return columns by one so that we enter the position and earn the next period’s return.
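A hand-rolled toy version of that shift (illustrative only; the actual code relies on the `lead` function available through DataFramesMeta):

```julia
# Leading by one pairs today's signal with tomorrow's return.
lead1(x) = [x[2:end]; missing]          # hand-rolled equivalent of lead(x, 1)
rets = [0.01, -0.02, 0.03]
next_rets = lead1(rets)                 # [-0.02, 0.03, missing]
```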

```
modelData = @transform(modelData, :NextJPM = lead(:JPM, 1),
    :NextXLF = lead(:XLF, 1),
    :NextSPY = lead(:SPY, 1))
paramsRes = leftjoin(paramsRes, modelData[:, [:Date, :NextJPM, :NextXLF, :NextSPY]], on=:Date)
portRes = @combine(groupby(paramsRes, :Date), :Return = :NextJPM .* :JPM_Pos .+ :NextXLF .* :XLF_Pos .+ :NextSPY .* :SPY_Pos);
plot(portRes.Date, cumsum(portRes.Return), label = "Stat Arb Return")
```

Sad trombone noise. This is not a great result, as we’ve ended up negative over the period. However, given the paper is 15 years old, it would be very rare to still be able to make money this way after *everyone* knows how to do it. Plus, I’ve only used one stock against the ETF portfolio; you typically want to diversify, going long and short multiple single names from the ETF’s constituents and using the ETF itself as a minimal hedge.

The good thing about a negative result is that we don’t have to start considering transaction costs or other annoying things like that.

When we break out the components of the strategy, we can see that it appears to pick the right times to go long/short JPM and SPY; it’s the hedging with the XLF ETF that is dragging the portfolio down.

```
plot(paramsRes.Date, cumsum(paramsRes.NextJPM .* paramsRes.JPM_Pos), label = "JPM Component")
plot!(paramsRes.Date, cumsum(paramsRes.NextXLF .* paramsRes.XLF_Pos), label = "XLF Component")
plot!(paramsRes.Date, cumsum(paramsRes.NextSPY .* paramsRes.SPY_Pos), label = "SPY Component")
plot!(portRes.Date, cumsum(portRes.Return), label = "Stat Arb Portfolio")
```

So whilst naively trying to trade the stat arb portfolio is probably a loss maker, there might be some value in using the model as a signal input or overlay to another strategy.

What about if we up the frequency and look at intraday stat arb?

Crypto markets are open 24 hours a day, 7 days a week, so there is that much more opportunity to build out a continuous trading model. We look back over the last year and repeat the backtesting process to see if this bears any fruit.

Once again AlpacaMarkets gives us an easy way to pull the hourly bar data for both ETH and BTC.

```
btcRaw = AlpacaMarkets.crypto_bars("BTC/USD", "1Hour"; startTime = now() - Year(1), limit = 10000)[1]
ethRaw = AlpacaMarkets.crypto_bars("ETH/USD", "1Hour"; startTime = now() - Year(1), limit = 10000)[1];

btc = @transform(btcRaw, :ts = DateTime.(chop.(:t)), :Ticker = "BTC")
eth = @transform(ethRaw, :ts = DateTime.(chop.(:t)), :Ticker = "ETH")
btc = btc[:, [:ts, :Ticker, :c]]
eth = eth[:, [:ts, :Ticker, :c]]

allPrices = vcat(btc, eth)
allPrices = sort(allPrices, :ts)
allPrices = @transform(groupby(allPrices, :Ticker),
    :Return = [NaN; diff(log.(:c))]);

modelData = unstack(@select(allPrices, :ts, :Ticker, :Return), :ts, :Ticker, :Return);
modelData = @subset(modelData, .! isnan.(:ETH .+ :BTC))
```
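The `:Return` column above is just first differences of log prices; on a toy series, the log returns telescope back to the log of the total price ratio:

```julia
prices = [100.0, 101.0, 99.0]
rets = diff(log.(prices))               # hourly log returns
total = sum(rets)                       # equals log(99/100)
```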

Plotting out the returns we can see they are loosely related just like the stock example.

```
plot(modelData.ts, cumsum(modelData.BTC), label = "BTC")
plot!(modelData.ts, cumsum(modelData.ETH), label = "ETH")
```

We will be using BTC as the ‘index’ and see how ETH is related.

```
regModel = lm(@formula(ETH ~ BTC), modelData)
```

```
ETH ~ 1 + BTC

Coefficients:
─────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       t  Pr(>|t|)    Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)  7.72396e-6  3.64797e-5    0.21    0.8323  -6.37847e-5  7.92327e-5
BTC          1.115       0.00673766  165.49    <1e-99    1.10179     1.12821
─────────────────────────────────────────────────────────────────────────────
```

Fairly high beta for ETH against BTC. We now use a 90-hour rolling window instead of a 90-day one.
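The only change to the OU calibration is the time step: assuming crypto trades around the clock, one hourly bar is \((1/24)/252\) of a 252-day trading year, which is where the annualisation in the rolling-window loop comes from. A sketch with an illustrative AR(1) coefficient:

```julia
# Annualising the AR(1) coefficient b with an hourly step.
dt = (1 / 24) / 252                     # one hour as a fraction of a trading year
b = 0.9                                 # illustrative AR(1) coefficient
k = -log(b) / dt                        # annualised mean-reversion speed
```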

```
window = 90
paramsRes = Array{DataFrame}(undef, length(window:(nrow(modelData) - window)))

for (j, i) in enumerate(window:(nrow(modelData) - window))
    modelDataSub = modelData[i:(i+window), :]
    modelDataSub.ETH = (modelDataSub.ETH .- mean(modelDataSub.ETH)) ./ std(modelDataSub.ETH)
    modelDataSub.BTC = (modelDataSub.BTC .- mean(modelDataSub.BTC)) ./ std(modelDataSub.BTC)

    macroRegr = lm(@formula(ETH ~ BTC), modelDataSub)
    auxData = cumsum(residuals(macroRegr))
    ouRegr = lm(@formula(y~x), DataFrame(x=auxData[1:end-1], y=auxData[2:end]))
    varEta = var(residuals(ouRegr))
    a, b = coef(ouRegr)
    k = -log(b)/((1/24)/252)
    m = a/(1-b)
    sigma = sqrt((varEta * 2 * k) / (1-b^2))
    sigma_eq = sqrt(varEta / (1-b^2))

    paramsRes[j] = DataFrame(ts = modelDataSub.ts[end], MacroBeta = coef(macroRegr)[2], MacroAlpha = coef(macroRegr)[1],
        VarEta = varEta, OUA = a, OUB = b, OUK = k, Sigma = sigma, SigmaEQ = sigma_eq,
        Score = -m/sigma_eq)
end
paramsRes = vcat(paramsRes...)
```

Again, looking at \(\beta\) over time, we see there has been a sudden shift.

```
plot(plot(paramsRes.ts, paramsRes.MacroBeta, label = "Macro Beta", legend = :left),
plot(paramsRes.ts, paramsRes.OUK, label = "K"), layout = (2,1))
```

Interesting that there has been a big change in \(\beta\) between ETH and BTC recently that then suddenly reverted. Ok, onto the backtesting again.

```
paramsRes.ETH_Pos .= 0.0
paramsRes.BTC_Pos .= 0.0

for i in 2:nrow(paramsRes)
    if paramsRes.OUK[i] > (252/(1/24)/45)
        if paramsRes.Score[i] >= 1.25
            paramsRes.ETH_Pos[i] = -1
            paramsRes.BTC_Pos[i] = paramsRes.MacroBeta[i]
        elseif paramsRes.Score[i] >= 0.75 && paramsRes.ETH_Pos[i-1] == -1
            paramsRes.ETH_Pos[i] = -1
            paramsRes.BTC_Pos[i] = paramsRes.MacroBeta[i]
        end
        if paramsRes.Score[i] <= -1.25
            paramsRes.ETH_Pos[i] = 1
            paramsRes.BTC_Pos[i] = -paramsRes.MacroBeta[i]
        elseif paramsRes.Score[i] <= -0.5 && paramsRes.ETH_Pos[i-1] == 1
            paramsRes.ETH_Pos[i] = 1
            paramsRes.BTC_Pos[i] = -paramsRes.MacroBeta[i]
        end
    end
end

modelData = @transform(modelData, :NextETH = lead(:ETH, 1), :NextBTC = lead(:BTC, 1))
paramsRes = leftjoin(paramsRes, modelData[:, [:ts, :NextETH, :NextBTC]], on=:ts)
portRes = @combine(groupby(paramsRes, :ts), :Return = :NextETH .* :ETH_Pos .+ :NextBTC .* :BTC_Pos);
plot(portRes.ts, cumsum(portRes.Return))
```

This looks slightly better. At least it is positive at the end of the testing period.

```
plot(paramsRes.ts, cumsum(paramsRes.NextETH .* paramsRes.ETH_Pos), label = "ETH Component")
plot!(paramsRes.ts, cumsum(paramsRes.NextBTC .* paramsRes.BTC_Pos), label = "BTC Component")
plot!(portRes.ts, cumsum(portRes.Return), label = "Stat Arb Portfolio", legend=:topleft)
```

Again, the components of the portfolio seem to be ok in the ETH case, but generally this comes from the overall long bias. Unlike the JPM/XLF example, there isn’t much more diversification we can add that might help. We could add in more crypto assets, or an equity/gold angle, but then it becomes more of an asset-class arb than something truly statistical.

The original paper is one of those that all quants get recommended to read, and statistical arbitrage is a concept that you probably understand in theory, but practically doing it is another question. Hopefully, this blog post gets you up to speed with the basic concepts and how to implement them. It can be boiled down to two steps.

- Model as much as you can with a simple regression
- Model what’s left over as an OU process.
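Those two steps can be sketched end-to-end on synthetic data (everything here is illustrative: hand-rolled OLS instead of GLM, made-up coefficients, and random inputs):

```julia
using Statistics

# Ordinary least squares intercept/slope for a single regressor.
function ols(x, y)
    b = cov(x, y) / var(x)
    a = mean(y) - b * mean(x)
    (a, b)
end

# Step 1: regress the asset on the factor and keep the residuals.
x = randn(500)
y = 0.8 .* x .+ 0.1 .* randn(500)
a1, b1 = ols(x, y)
resid = y .- a1 .- b1 .* x

# Step 2: model the cumulative residual as an AR(1) (the discretised OU process).
s = cumsum(resid)
a2, b2 = ols(s[1:end-1], s[2:end])
k = -log(b2) * 252                      # annualised reversion speed
```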

It can work with both high-frequency and low-frequency data, so have a look at different combinations of assets and see if you have more luck than I did backtesting.

If you do end up seeing something positive, make sure you are backtesting properly!


Step zero is to get yourself a GPS watch. I’ve got a Garmin 245 but any watch that can track your route and also your heart rate will do the job. You need the GPS to know how far you’ve gone, how fast you are running, and the heart rate monitoring to know how hard you worked. You’ll also want to be recording your heart rate throughout the day and when you sleep to get an accurate picture of your resting heart rate. Additionally, most watches also let you program various types of runs into the watch and schedule them into your calendar. This can save the mental load of trying to count laps or guess how fast you have been going whilst keeping you organised in the training.

When it comes to running in an actual race, Garmin will also pace the route out with PacePro and account for the elevation so your watch will tell you every kilometer how fast to go to hit your target time.

So in short, your watch will become your best friend in this training process.

Once you have the watch you’ll want to connect it to Runalyze. You might have heard of Strava, Runalyze is Strava that went to Uni and got a PhD in Sports Science. You get more of an idea of your training status and better tracking of each run and the effect the training is having on your fitness. It will also provide you with training paces that use ‘the science’ and also race time predictions. The predictions are also pretty accurate and lined up with my maximum efforts.

Buying a chest strap heart rate monitor is an optional extra. I had one from Garmin, but it broke, and never felt the need to replace it. The newer watches track the additional metrics it used to provide (ground contact time, vertical oscillation). I think the heart rate monitoring tech in the watches is always improving and the actual benefits of an additional heart rate monitor are lower now than say a few years ago.

What do you wear when you are running? This is less important. Just go to TK Maxx and get whatever is cheap: shorts, T-shirts, socks, etc. That’s what I did! You just want something light, comfortable, and that won’t get soaked in sweat. Maybe a jacket if you are running when it is cold. If it is going to be cold, get a base layer, specifically a Merino one; they are very effective at keeping you warm. You’ll also want a running cap to keep the sun off your face or a hat to keep your ears warm. You do you and wear whatever makes you comfortable. Spend loads on Soar Running or go cheap like me.

That money you saved on clothes, spend on running trainers. Notice that’s plural: you’ll want (though you don’t strictly need) multiple pairs. A slow pair, a medium pair (optional), and a fast pair.

The slow pair is for your easy runs. They need to be comfortable with plenty of cushioning and feel like clouds for your feet. My choice in this category is the Brooks Glycerins.

The fast pair is what you run the race in. They will have all the technology, like plates built into the sole and super modern foam that returns energy to your legs. You’ll want to minimise the amount of training in this pair as, generally, they are more fragile than everyday shoes, but you want to make sure you can run the distance in them. I used Saucony Endorphin Speed 3’s to run the marathon in; they have a nylon plate and feel quick on your feet. You might have heard of the Nike supershoes (Alphafly, Vaporfly) that have carbon plates. This is what we are after: something to run quickly in, with as much help from the technology as possible.

You then want something in between the slow shoes and fast shoes, a medium pair so to speak. This is for quicker runs where the slow shoes can feel clunky. I went for Saucony Guide 15’s. Handy to pack in a suitcase too, as they are a bit more lightweight than the Glycerins.

Ok now, you are fully dressed, new shoes laced up, how do you approach the training?

Run lots. To get better at running you have to put the hours in and do plenty of running. Throughout my training, I was running on average 6 of the 7 days a week. Just the process of running helps improve your fitness and gets you used to running long distances and for a long time. So unfortunately, there is no secret sauce, no hidden training method, just the harsh reality of giving up a part of your day to pound the streets. The majority of your running still needs to be slow; you should be comfortable and get through the easy runs without any trouble. Running more miles is more important than running fast miles. If you kill yourself running quickly one day and need to take two days off to recover, then you are at a net loss in terms of progress. So just slow down and get out more often.

Once you’ve got used to running frequently you can start to introduce more structure into the runs. You’ll want a weekly ‘long run’ where you are out for more than 90 minutes. Beyond the hour-and-a-half mark is when the body stops burning short-term energy reserves and switches to long-term energy stores. The long run should be slow enough that you can reach that magic 90-minute mark easily and continue for some time afterward. This long-run is where your body adapts to going further and running for longer and gets you used to changing energy stores. You’ll also want to practice refueling on these long runs, as anything over an hour needs some sort of food to keep yourself performing.

You’ll also want to include two speed sessions each week. These are runs where you’ll be tuning the top end of your running, training to go faster and longer at quicker paces. The recommended way to hit the fast paces is up a hill with intervals. The increased incline means you can reach higher heart rates without putting as much stress on your legs. Doing it as an interval, i.e. running up the hill for a minute and then jogging down for a minute also means you can do a bit more, as the recovery can help extend the workout.

Then there is a ‘threshold’ run, where you are running at your anaerobic threshold for an extended amount of time. This is where the heart rate monitoring comes in from the watch. This is a zone 4 run, where you want to be in the zone for an extended amount of time (>5 minutes). The long runs train your aerobic capacity, the threshold runs train your anaerobic capacity.

But more importantly, the speed session can help break up the monotony of always running slow. Plus it’s more fun to start to see your records (PRs) for the shorter distances improve as you build up your fitness. But that only comes from the full package: lots of running, a long run, and some speed work.

So in conclusion, run lots, mostly slow, once long, and on the odd occasion fast up a hill.

“It never gets easier, you just go faster” - Greg LeMond

You’ve been running for a few months now, how do you know if you are having any effect? Firstly some qualitative effects before looking at the data. One, you should be hungry all the time, or at least I was in the training. Luckily I needed to shift a few kilograms, so the weight loss was welcomed. Secondly, you should find that the longer runs are getting longer for the same effort. It never gets easier, you just find a 10km run feels like what a 5km run felt like a few months ago.

What about something measurable? You should see your resting heart rate coming down.

This is my resting heart rate from Garmin, I started training in October/November and ramped up to a maximum weekly mileage in 2023. We can see a continuous downward trend as my heart is getting stronger. Feb was a tough month where I got ill briefly and then in April, I was on a few long-haul flights that interrupted my training flow. Overall, year on year I’ve dropped my heart rate by about 10 bpm. Make sure it tracks when you sleep though, otherwise, it will be undersampling when you are resting.

You should also start to see your average pace improving.

Given that my weekly running was at mixed paces (long run, intervals, etc.), this isn’t a pure comparison, but it still shows that I was getting faster. This is taken from Runalyze.

Seeing your VO2 max go up will also reassure you the right things are happening. Although it’s only an estimate of your VO2 max, it should hopefully be somewhat correlated with cardio performance.

This is a deep dive into the science behind human physiology in different endurance tasks. The more obvious ones like running and cycling plus the more extreme trail running, Mount Everest climbing and Antarctica exploring. It’s an engaging read that is very quanty in the sense it wants experiments to back up claims rather than just anecdotes. So for example, taking ibuprofen in a marathon, swishing energy drinks around your mouth instead of swallowing, and getting yourself mentally tired before going out training are all things backed up by science that will help your performance. Although, experiments in sport science always involve a handful of people and they are usually elite athletes too. So not quite as rigorous as other fields.

Matt Fitzgerald had the chance to train with an elite group of runners in the high-altitude area of Flagstaff, Arizona. This book details his training and shows how different the elite athletes are compared to us mere mortals. But it also highlights how being elite at anything is still a job. They wake up, run, think about running, and eat/nap to make sure they are fresh for more running. One injury can derail your life and how you earn money. Very stressful. Great book though, would recommend it to both inspire and humble.

This website is a repository of information and inspired me to write this blog post. It covers similar topics, but probably in better/more detail. So if you like what I’ve written, you’ll love this website.

Running is a simple activity but needs a little bit of thinking to get the most out of it. I couldn’t be any further from an expert but have experienced everything written above and seen my performance and health improve because of it. In the end, I ran my marathon in 04:02:27 which is roughly the modal/mean performance for the average male. In the training process, I managed to drop my 5k from 25 minutes to 23:24, my 10k from 52:44 to 49:17, half marathon from 02:15:14 to 01:52:47.

When I plug all these times into the VDOT calculator:

Distance | PB | VDOT | 5km equiv | 10km equiv | Half equiv | Marathon equiv
---|---|---|---|---|---|---
5km | 0:23:24 | 41.4 | 00:23:24 | 00:48:35 | 1:47:43 | 3:43:11
10km | 0:49:17 | 40.7 | 00:23:46 | 00:49:17 | 1:49:22 | 3:46:29
Half Marathon | 1:52:47 | 39.2 | 00:24:32 | 00:50:54 | 1:52:47 | 3:53:27
Marathon | 4:02:27 | 37.4 | 00:25:31 | 00:52:58 | 1:57:24 | 4:02:27

The VDOT calculator gives you a barometer of what your records mean for other distances and can give you an idea of both your training paces and also target times for other distances.
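For comparison, a rough cross-distance rule of thumb is Riegel's formula, \(T_2 = T_1 (D_2/D_1)^{1.06}\). To be clear, this is not what the VDOT calculator uses, just a common approximation, but it lands in the same ballpark:

```julia
# Riegel's formula: scale a known time t1 over distance d1 to distance d2.
riegel(t1, d1, d2) = t1 * (d2 / d1)^1.06
t_5k = 23 * 60 + 24                     # 5k PB in seconds
t_marathon = riegel(t_5k, 5.0, 42.195)  # roughly 3h44, close to the VDOT estimate
```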

My half-marathon time looks weaker compared to the others. So my next target will be to get a sub 1:50 half marathon. This means I’ll be lacing up those (probably new) shoes and running lots, mostly slow with the odd fast session thrown in. Who knows, maybe another marathon could be on the horizon too.


Over the past few months, I’ve been training for a marathon and have been trying to understand the best way to train and maximise my performance. This means extensive research and reading to get an idea of what the science says. Endure by Alex Hutchinson is a book I recommend and it takes a look at the way the human body functions over long distances/extreme tasks - such as climbing Mount Everest with no oxygen or ultra, ultra marathoners with an overarching reference to the Breaking2 project by Nike.

In one section the book references something called the Himalayan Database which is a database of expeditions to Mount Everest and other mountains in the Himalayas. As a data lover, this piqued my interest as an interesting data source and something a bit different from my usual data explorations around finance/sports. So I downloaded the database, worked out how to load it, and had a poke around the data.

If you go to the website, himalayandatabase, you can download the data yourself and follow along.

The database is distributed in the DBF format and the website itself is a bit of a blast from the past. It expects you to download a custom data viewer program to look at the data, but thankfully there are people in the R world who have demonstrated how to load the raw DBF files. I’ve taken inspiration from this, downloaded the DBF files, loaded up `DBFTables.jl`, and loaded the data into Julia.

```
using DataFrames, DataFramesMeta
using Plots, StatsPlots
using Statistics
using Dates
```

I hit a roadblock straight away and had to patch `DBFTables.jl` with a new datatype that the Himalayan database uses that isn’t in the original spec. Pull request here if you are interested: DBFTables.jl - Add M datatype. Another feather in my open-source contributions hat!

```
using DBFTables
```

There are 6 tables in the database but we are only interested in 3 of them:

- `exped` details the expeditions, so each trip up a mountain by one or more people.
- `peaks` has the details of the mountains in the Himalayas.
- `members` has information on each person that has attempted to climb one of the mountains.

```
function load_dbf(fn)
    dbf = DBFTables.Table(fn)
    DataFrame(dbf)
end

exped = load_dbf("exped.DBF")
peaks = load_dbf("peaks.DBF")
members = load_dbf("members.DBF");
```

Taking a look at the mountains with the most entries.

```
first(sort(@combine(groupby(exped, :PEAKID), :N = length(:YEAR)),
:N, rev=true), 3)
```

Row | PEAKID | N
---|---|---
1 | EVER | 2191
2 | AMAD | 1456
3 | CHOY | 1325

Unsurprisingly Mount Everest is the most attempted mountain with Ama Dablam in second and Cho Oyu in third place.

We start with some basic groupings to look at how the data is distributed per year.

```
expSummary = @combine(groupby(@subset(members, :CALCAGE .> 0), :EXPID),
    :N = length(:CALCAGE),
    :YoungestAge = minimum(:CALCAGE),
    :AvgAge = mean(:CALCAGE),
    :NFemale = sum(:SEX .== "F"))

expSummary = leftjoin(expSummary,
    @select(exped, :EXPID, :PEAKID, :BCDATE, :SMTDATE, :MDEATHS, :HDEATHS, :SUCCESS1), on = :EXPID)
expSummary = leftjoin(expSummary, @select(peaks, :PEAKID, :PKNAME), on = :PEAKID)

everest = dropmissing(@subset(expSummary, :PKNAME .== "Everest"))
everest = @transform(everest, :DeathRate = (:MDEATHS .+ :HDEATHS) ./ :N, :Year = floor.(:BCDATE, Dates.Year))

everestYearly = @combine(groupby(everest, :Year), :N = sum(:N),
    :Deaths = sum(:MDEATHS .+ :HDEATHS),
    :Success = sum(:SUCCESS1))
everestYearly = @transform(everestYearly, :DeathRate = :Deaths ./ :N, :SuccessRate = :Success ./ :N)
everestYearly = @transform(everestYearly,
    :DeathRateErr = sqrt.(:DeathRate .* (1 .- :DeathRate) ./ :N),
    :SuccessRateErr = sqrt.(:SuccessRate .* (1 .- :SuccessRate) ./ :N));
```
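The two error columns at the end use the usual binomial standard error, \(\sqrt{p(1-p)/N}\), for a rate estimated from \(N\) trials. As a standalone sketch:

```julia
# Standard error of a proportion p estimated from n Bernoulli trials.
binom_se(p, n) = sqrt(p * (1 - p) / n)
```

These values feed the `yerr` error bars in the plots that follow.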

What is the average age of those who climb Mount Everest?

```
scatter(everest.SMTDATE, everest.AvgAge, label = "Average Age of Attempting Everest")
```

By eye, it looks like the average age has been steadily increasing. Generally, your expedition’s average age needs to be at least 30. Given it costs a small fortune to climb Everest, this is probably more a ‘need money’ effect than a statement about the overall fitness of 30-year-olds.

When we look at the number of attempts yearly and the annual death rate:

```
plot(bar(everestYearly.Year, everestYearly.N, label = "Number of Attempts in a Year"),
scatter(everestYearly.Year, everestYearly.DeathRate, yerr=everestYearly.DeathRateErr,
label = "Yearly Death Rate"),
layout = (2,1))
```

```
scatter(everestYearly[everestYearly.Year .> Date("2000-01-01"), :].Year,
everestYearly[everestYearly.Year .> Date("2000-01-01"), :].DeathRate,
yerr=everestYearly[everestYearly.Year .> Date("2000-01-01"), :].DeathRateErr,
label = "Yearly Death Rate")
```

But how ‘easy’ has it been to conquer Everest over the years? Looking at the success rate at best 10% of attempted expeditions are completed, which highlights how tough it is. Given some of the photos of people queueing to reach the summit, you’d think it would be much easier, but out of the 400 expeditions, less than 100 will make it.

```
scatter(everestYearly.Year, everestYearly.SuccessRate, yerr=everestYearly.SuccessRateErr,
label = "Mt. Everest Success Rate")
```

A couple of interesting points from this graph:

- 2014 was an outlier due to an avalanche that led to Mount Everest being closed from April for the rest of the year.
- No one successfully climbed Mt Everest in 2015 because of the earthquake.
- Only 1 success in 2020 before the pandemic closed everything.

So a decent amount of variation in what can happen in a given year on Mt Everest.

The data has some interesting quirks and we now turn to our next step, trying to build a model. Endure was about what it takes to complete impressive human feats. So let’s do that here: can we use the database to predict and explain what leads to success?

We will be using the `MLJ.jl` package again to fit some machine learning models easily.

```
using MLJ, LossFunctions
```

To start with we are going to pull out the relevant factors that we think will help climb a mountain. Not specifically Everest, but any of the Himalayan peaks from the database.

```
modelData = members[:, ["MSUCCESS", "PEAKID","MYEAR",
"MSEASON", "SEX", "CALCAGE", "CITIZEN", "STATUS",
"MROUTE1", "MO2USED"]]
modelData = @subset(modelData, :PEAKID .== "EVER")
modelData.MROUTE1 = modelData.PEAKID .* "_" .* string.(modelData.MROUTE1)
modelData = dropmissing(modelData)
modelData.MYEAR = parse.(Int, modelData.MYEAR)
modelData = @subset(modelData, :CALCAGE .> 0)
print(size(modelData))
```

```
(22583, 10)
```

```
first(modelData, 4)
```

4×10 DataFrame

Row | MSUCCESS | PEAKID | MYEAR | MSEASON | SEX | CALCAGE | CITIZEN | STATUS | MROUTE1 | MO2USED
---|---|---|---|---|---|---|---|---|---|---
1 | false | EVER | 1963 | 1 | M | 36 | USA | Climber | EVER_2 | true
2 | true | EVER | 1963 | 1 | M | 31 | USA | Climber | EVER_1 | true
3 | false | EVER | 1963 | 1 | M | 27 | USA | Climber | EVER_1 | false
4 | false | EVER | 1963 | 1 | M | 26 | USA | Climber | EVER_2 | true

Just over 22k rows and 10 columns, so plenty of data to sink our teeth into. MLJ needs us to define the `Multiclass` type for the factor variables, and we also want to separate the response from the predictors and then split into the train/test sets.

```
modelData2 = coerce(modelData,
:MSUCCESS => OrderedFactor,
:MSEASON => Multiclass,
:SEX => Multiclass,
:CITIZEN => Multiclass,
:STATUS => Multiclass,
:MROUTE1 => Multiclass,
:MO2USED => OrderedFactor);
```

```
y, X = unpack(modelData2, ==(:MSUCCESS), colname -> true; rng=123);
train, test = partition(eachindex(y), 0.7, shuffle=true);
```

All these multi-class features need to be one-hot encoded, so we use the continuous encoder. The workflow is:

- Create the encoder/standardizer.
- Train on the data
- Transform the data

This gives confidence that you aren’t leaking information from the test data into the training process.

```
encoder = ContinuousEncoder()
encMach = machine(encoder, X) |> fit!
X_encoded = MLJ.transform(encMach, X);
X_encoded.MO2USED = X_encoded.MO2USED .- 1;
```

```
standardizer = @load Standardizer pkg=MLJModels
stanMach = fit!(machine(
standardizer(features = [:CALCAGE]),X_encoded);
rows=train)
X_trans = MLJ.transform(stanMach, X_encoded);
X_trans.MYEAR = X_trans.MYEAR .- minimum(X_trans.MYEAR);
```

```
plot(
histogram(X_trans.CALCAGE, label = "Age"),
histogram(X_trans.MYEAR, label = "Year"),
histogram(X_trans.MO2USED, label = "02 Used")
)
```

Looking at the distribution of the transformed data gives a good indication of how these variables change post-transformation.

I’ll now explore some different models using the MLJ.jl workflow similar to my previous post on Machine Learning Property Loans for Fun and Profit. MLJ.jl gives you a common interface to fit a variety of different models and evaluate their performance all from one package, so handy here when we want to look at a simple linear model and also an XGBoost model.

Let’s start with our null model to get the baseline.

```
constantModel = @load ConstantClassifier pkg=MLJModels
constMachine = machine(constantModel(), X_trans, y)
evaluate!(constMachine,
rows=train,
resampling=CV(shuffle=true),
operation = predict_mode,
measures=[accuracy, balanced_accuracy, kappa],
verbosity=0)
```

Model | Accuracy | Kappa
---|---|---
Null | 0.512 | 0.0

For classification tasks, the null model always predicts the most common class, so the accuracy is just the majority-class share (roughly 50% here) and the \(\kappa\) is zero.
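To see why a constant prediction scores \(\kappa = 0\), here is a minimal Cohen's \(\kappa\) computed from a confusion matrix (a sketch, not MLJ's implementation):

```julia
# κ = (observed agreement − chance agreement) / (1 − chance agreement).
function cohen_kappa(cm)
    n = sum(cm)
    po = sum(cm[i, i] for i in 1:size(cm, 1)) / n                          # observed
    pe = sum(sum(cm[i, :]) * sum(cm[:, i]) for i in 1:size(cm, 1)) / n^2   # chance
    (po - pe) / (1 - pe)
end

cm = [50 0; 50 0]   # always predicting class 1 on a 50/50 sample
cohen_kappa(cm)     # 0.0: no better than chance, despite 50% accuracy
```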

Next we move on to the simple linear model using all the features.

```
logisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
lmMachine = machine(logisticClassifier(lambda=0), X_trans, y)
fit!(lmMachine, rows=train, verbosity=0)
evaluate!(lmMachine,
rows=train,
resampling=CV(shuffle=true),
operation = predict_mode,
measures=[accuracy, balanced_accuracy, kappa], verbosity = 0)
```

Model | Accuracy | Kappa
---|---|---
Null | 0.512 | 0.0
Linear Regression | 0.884 | 0.769

This gives a good improvement over the null model, so indicates our included features have some sort of information useful in predicting success.

Inspecting the parameters indicates how strong each variable is. Route 0 leads to a large reduction in the probability of success whereas using oxygen increases the probability of success. Climbing in the Autumn or Winter also looks like it reduces your chance of success.

```
params = mapreduce(x-> DataFrame(Param=collect(x)[1], Value = collect(x)[2]),
vcat, fitted_params(lmMachine).coefs)
params = sort(params, :Value)
vcat(first(params, 5), last(params, 5))
```

10×2 DataFrame

Row | Param | Value
---|---|---
1 | MROUTE1__EVER_0 | -4.87433
2 | SEX__F | -1.97957
3 | SEX__M | -1.94353
4 | MSEASON__3 | -1.39251
5 | MSEASON__4 | -1.1516
6 | MROUTE1__EVER_2 | 0.334305
7 | CITIZEN__USSR | 0.43336
8 | CITIZEN__Russia | 0.518197
9 | MROUTE1__EVER_1 | 0.697601
10 | MO2USED | 3.85578
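These coefficients are on the log-odds scale, so exponentiating turns them into odds ratios; for example, for the oxygen coefficient (value copied from the fitted model above):

```julia
beta_o2 = 3.85578           # MO2USED coefficient from the logistic fit
odds_ratio = exp(beta_o2)   # ≈ 47: oxygen multiplies the odds of success ~47-fold
```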

What’s a model if we’ve not tried xgboost to squeeze the most performance out of all the data? Easy to fit using MLJ and without having to do any heavy lifting.

```
xgboostModel = @load XGBoostClassifier pkg=XGBoost verbosity = 0
xgboostmodel = xgboostModel()
xgbMachine = machine(xgboostmodel, X_trans, y)
evaluate!(xgbMachine,
rows=train,
resampling=CV(nfolds = 6, shuffle=true),
measures=[accuracy,balanced_accuracy, kappa],
verbosity=0)
```

Model | Accuracy | Kappa
---|---|---
Null | 0.512 | 0.0
Linear Regression | 0.884 | 0.769
XGBoost | 0.889 | 0.778

We get 88.9% accuracy compared to the linear regression 88.4% and a \(\kappa\) increase too, so looking like a good model.

The whole point of these models is to try and work out what combination of these parameters gets us the highest probability of success on a mountain. We want some idea of feature importance that can direct us to the optimal approach to a mountain. Should I be an Austrian Doctor or is there an easier route that should be taken?

With xgboost we can use the `feature_importances` function to do exactly what it says on the tin and look at which features are most important in the model.

```
fi = feature_importances(xgbMachine)
fi = mapreduce(x -> DataFrame(Param = collect(x)[1], Value = collect(x)[2]),
               vcat, fi)
first(fi, 5)
```

5×2 DataFrame

Row | Param | Value |
---|---|---|
 | Symbol | Float32 |
1 | MO2USED | 388.585 |
2 | MROUTE1__EVER_0 | 46.3129 |
3 | STATUS__H-A Worker | 15.0079 |
4 | CITIZEN__Nepal | 11.6299 |
5 | CITIZEN__UK | 4.25651 |

So using oxygen, taking the 0th route up, being an H-A Worker, and being a Nepalese or UK citizen appear to have the greatest impact on being successful. Using oxygen is an obvious benefit that can’t really be avoided; I don’t think anyone believes their chance of success would be higher without it. Being Nepalese is the one I would struggle to change.

How does the model perform on the hold-out set? We’ve got 30% of the data that hasn’t been used in the fitting that can also validate how well the model performs.

```
modelNames = ["Null", "LM", "XGBoost"]
modelMachines = [constMachine, lmMachine, xgbMachine]

aucRes = DataFrame(Model = modelNames,
                   AUC = map(x -> auc(MLJ.predict(x, rows=test), y[test]),
                             modelMachines))
kappaRes = DataFrame(Kappa = map(x -> kappa(MLJ.predict_mode(x, rows=test), y[test]), modelMachines),
                     Accuracy = map(x -> accuracy(MLJ.predict_mode(x, rows=test), y[test]), modelMachines),
                     Model = modelNames)
evalRes = leftjoin(aucRes, kappaRes, on = :Model)
```

3×4 DataFrame

Row | Model | AUC | Kappa | Accuracy |
---|---|---|---|---|
 | String | Float64 | Float64? | Float64? |
1 | Null | 0.5 | 0.0 | 0.512768 |
2 | LM | 0.937092 | 0.768528 | 0.883838 |
3 | XGBoost | 0.939845 | 0.775408 | 0.88738 |

On the test set, the XGBoost model is only slightly better than the linear model in terms of \(\kappa\) and accuracy, and only marginally better on AUC, so alarm bells should be ringing that the model isn’t quite there yet.

```
X_trans2 = copy(X_trans[1:2, :])
X_trans2.MO2USED = 1 .- X_trans2.MO2USED
predict(xgbMachine, vcat(X_trans[1:2, :], X_trans2))
```

```
4-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, Bool, UInt32, Float32}:
UnivariateFinite{OrderedFactor{2}}(false=>0.245, true=>0.755)
UnivariateFinite{OrderedFactor{2}}(false=>1.0, true=>0.000227)
UnivariateFinite{OrderedFactor{2}}(false=>0.901, true=>0.0989)
UnivariateFinite{OrderedFactor{2}}(false=>1.0, true=>0.000401)
```

By taking the first two entries and switching whether they used oxygen or not, we can see how the output probability of success changes. In each case, it provides a dramatic shift in the probabilities. Again, from the feature importance output, we know this is the most important variable, but it does seem to dominate what happens with and without oxygen.

Finally, let’s look at the calibration of the models.

```
using CategoricalArrays

modelData.Prediction = pdf.(predict(xgbMachine, X_trans), 1)
lData = @transform(modelData, :prob = cut(:Prediction, (0:0.1:1.1)))
gData = groupby(lData, :prob)
calibData = @combine(gData, :N = length(:MSUCCESS),
                     :SuccessRate = mean(:MSUCCESS),
                     :PredictedProb = mean(:Prediction))
calibData = @transform(calibData, :Err = 1.96 .* sqrt.((:PredictedProb .* (1 .- :PredictedProb)) ./ :N))

p = plot(calibData[:, :PredictedProb],
         calibData[:, :SuccessRate],
         yerr = calibData[:, :Err],
         seriestype = :scatter, label = "XGBoost Calibration")
p = plot!(p, 0:0.1:1, 0:0.1:1, label = :none)
h = histogram(modelData.Prediction, normalize = :pdf, label = "Prediction Distribution")
plot(p, h, layout = (2,1))
```

To say the model is poorly calibrated is an understatement. There is no association between an increased success rate and an increase in model probability, and from the distribution of predictions we can see the output is quite binary; there isn’t an even spread across probabilities. So whilst the evaluation metrics look better than a null model, the reality is that the model isn’t doing much. With all the different factors in the model matrix, there is likely some degeneracy in the data, such that a single occurrence of a variable ends up predicting success or not. There is also a potential issue with using the members table instead of the expeditions table, as a successful expedition will lead to multiple members being marked successful.
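If you want a single-number summary of calibration alongside the plot, one option (my addition, not something the post computes) is the Brier score: the mean squared error between predicted probabilities and binary outcomes.

```julia
using Statistics

# Brier score: mean squared error between predicted probability and outcome.
# Lower is better; 0 means perfectly confident and perfectly correct.
brier(p, y) = mean((p .- y) .^ 2)

# Toy numbers: mostly-right, reasonably confident predictions.
brier([0.9, 0.1, 0.8], [1, 0, 1])
```

For the model above you would call `brier(modelData.Prediction, modelData.MSUCCESS)`; a score near the base rate’s variance would confirm the model adds little beyond the class frequencies.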

Overall it’s an interesting data set even if it does take a little work to get it loaded into Julia. There is a wealth of different features in the data that lead to some nice graphs, but using these features to predict whether you will be successful or not in climbing Mount Everest doesn’t lead to a useful model.


A few packages to get started and I’m running Julia 1.8 for this project.

```
using AlpacaMarkets
using DataFrames, DataFramesMeta
using Dates
using Plots, PlotThemes, StatsPlots
using RollingFunctions, Statistics, StatsBase
using GLM
```

All good data analysis starts with the data. I’m downloading the daily bars of SPY, the S&P 500 ETF, which will represent the overall stock market.

```
function parse_date(t)
    Date(string(split(t, "T")[1]))
end

function clean(df, x)
    df = @transform(df, :Date = parse_date.(:t), :Ticker = x, :NextOpen = [:o[2:end]; NaN])
    @select(df, :Date, :Ticker, :c, :o, :NextOpen)
end

spyPrices = stock_bars("SPY", "1Day"; startTime = now() - Year(10), limit = 10000, adjustment = "all")[1]
spyPrices = clean(spyPrices, "SPY")
last(spyPrices, 3)
```

3×5 DataFrame

Row | Date | Ticker | c | o | NextOpen |
---|---|---|---|---|---|
 | Date | String | Float64 | Float64 | Float64 |
1 | 2023-02-22 | SPY | 398.54 | 399.52 | 401.56 |
2 | 2023-02-23 | SPY | 400.66 | 401.56 | 395.42 |
3 | 2023-02-24 | SPY | 396.38 | 395.42 | NaN |

I’m doing the usual close-to-close returns and then taking the 100-day moving average as my trend signal.

```
spyPrices = @transform(spyPrices, :Return = [missing; diff(log.(:c))])
spyPrices = @transform(spyPrices, :Avg = lag(runmean(:Return, 100), 1))
spyPrices = @transform(spyPrices, :BigMove = abs.(:Return) .>= 0.025)
dropmissing!(spyPrices);
```

```
sp = scatter(spyPrices[spyPrices.BigMove, :].Date, spyPrices[spyPrices.BigMove, :].Return, legend = :none)
sp = scatter!(sp, spyPrices[.!spyPrices.BigMove, :].Date, spyPrices[.!spyPrices.BigMove, :].Return)
plot(sp, plot(spyPrices.Date, spyPrices.Avg), layout = (2,1), legend=:none)
```

By calling anything greater than \(\pm\) 0.025 (in log terms) a ‘big move’, we can see that they (the blue dots) are slightly clustered into common periods. In the plot below it, the 100-day rolling average of the returns, our trend signal, also appears slightly correlated with these big returns.

```
scatter(spyPrices.Avg, abs.(spyPrices.Return), label = :none,
xlabel = "Trend Signal", ylabel = "Daily Return")
```

Here we have the 100-day rolling average on the x-axis and the absolute return on that day on the y-axis. If we squint a little we can imagine there is a slight quadratic pattern, or at the least, these down trends appear to correspond with the more extreme day moves. We want to try and understand if this is a significant effect.

We will start by looking at the probability that each day might have a ‘large move’. We first split into a train/test split of 70/30.

```
trainData = spyPrices[1:Int(floor(nrow(spyPrices)*0.7)), :]
testData = spyPrices[Int(ceil(nrow(spyPrices)*0.7)):end, :];
```

The `GLM.jl` package lets you write out the formula and fit a wide variety of linear models. We have two models: the proper one that uses the `Avg` column (our trend signal) as its feature, and a null model that just fits an intercept.

```
binomialModel = glm(@formula(BigMove ~ Avg + Avg^2), trainData, Binomial())
nullModel = glm(@formula(BigMove ~ 1), trainData, Binomial())
spyPrices[!, :Binomial] = predict(binomialModel, spyPrices);
```

To look at the model we can plot the output of the model relative to the signal at the time.

```
plot(scatter(spyPrices.Avg, spyPrices[!, :Binomial], label ="Response Function"),
plot(spyPrices.Date, spyPrices[!, :Binomial], label = "Probability of a Large Move"), layout = (2,1))
```

From the top graph, we see the higher probability of an extreme move comes from when the moving average is a large negative number. The probability then flatlines beyond zero, which suggests there isn’t that much of an effect for large moves when the momentum in the market is positive.

We also plot the daily probability of a large move and see that it has been pretty elevated in the past few months: lots of big moves!

We need to check if the model is any good though. We will just check the basic accuracy.

```
using Metrics
binary_accuracy(predict(binomialModel, testData), testData.BigMove)
binary_accuracy(predict(nullModel)[1] .* ones(nrow(testData)), testData.BigMove)
```

```
0.93
0.95
```

So the null model has an accuracy of 95% on the test set, but the fitted model has an accuracy of 93%. Not good: it looks like the trend signal isn’t adding anything. We might be able to salvage the model with a robust windowed fit-and-test procedure, or by looking at a single stock name, but overall I think it’s more a testament to how hard this data is to model than anything too specific.
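To see why raw accuracy flatters the null model when events are rare, here is a toy sketch with a made-up 5% event rate (not the SPY data itself):

```julia
using Statistics

# ~5% of days are 'big moves', so a classifier that always predicts
# "no big move" is right ~95% of the time while catching nothing.
y = vcat(trues(5), falses(95))      # made-up labels: 5% event rate
yhat = falses(100)                  # the do-nothing classifier

acc = mean(yhat .== y)              # 0.95, despite zero events caught
recall = sum(yhat .& y) / sum(y)    # 0.0: it never flags a big move
```

This is why a recall- or likelihood-based comparison would be a fairer test of whether the trend signal adds anything.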

We could also consider the self-exciting nature of these large moves. If one happens, is there a higher probability of another happening? Given my Ph.D. was in Hawkes processes, I have done lots of writing around them before and this is just another example of how they can be applied.

Hawkes processes! The bane of my life for four years. Still, I am forever linked with them now so might as well put that Ph.D. to use. If you haven’t come across Hawkes processes before it is a self-exciting point process where the occurrence of one event can lead to further events. In our case, this means one extreme event can cause further extreme events, something we are trying to use the downtrend to predict. With the Hawkes process, we are checking whether the events are just self-correlated.
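For the unfamiliar, the conditional intensity of a Hawkes process with an exponential kernel can be sketched in a few lines. The parameter values and event times here are made up purely for illustration:

```julia
# Hawkes conditional intensity with an exponential kernel:
# lambda(t) = mu + kappa * sum over past events of beta * exp(-beta*(t - t_i)).
# Each past event temporarily raises the rate of future events.
hawkes_intensity(t, events, mu, kappa, beta) =
    mu + kappa * sum(beta * exp(-beta * (t - ti)) for ti in events if ti < t; init = 0.0)

events = [10.0, 12.0, 40.0]          # illustrative event times (days)
hawkes_intensity(13.0, events, 0.005, 0.84, 0.067)
```

Before any events the intensity sits at the background rate `mu`; just after a cluster of events it spikes and then decays at rate `beta`, which is exactly the clustering behaviour we saw in the big-move scatter plot.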

I’ve built the HawkesProcesses.jl package to make it easy to work with Hawkes processes.

```
using HawkesProcesses, Distributions
```

Firstly, we get the data in the right shape by pulling the number of days since the start of the data of each big event.

```
startDate = minimum(spyPrices.Date)
allEvents = getfield.(spyPrices[spyPrices.BigMove, :Date] .- startDate, :value);
allDatesNorm = getfield.(spyPrices.Date .- startDate, :value);
maxT = getfield.(maximum(spyPrices[spyPrices.BigMove, :Date]) .- startDate, :value)
```

We then fit the Hawkes process using the standard Bayesian method for 5,000 iterations.

```
bgSamps1, kappaSamps1, kernSamps1 = HawkesProcesses.fit(allEvents .+ rand(length(allEvents)), maxT, 5000)
bgSamps2, kappaSamps2, kernSamps2 = HawkesProcesses.fit(allEvents .+ rand(length(allEvents)), maxT, 5000)
bgEst = mean(bgSamps1[2500:end])
kappaEst = mean(kappaSamps1[2500:end])
kernEst = mean(kernSamps1[2500:end])
intens = HawkesProcesses.intensity(allDatesNorm, allEvents, bgEst, kappaEst, Exponential(1/kernEst));
spyPrices[!, :Intensity] = intens;
```

We get three parameters out of the Hawkes process. The background rate \(\mu\), the self-exciting parameter \(\kappa\) and an exponential parameter that describes how long each event has an impact on the probability of another event, \(\beta\).

```
(bgEst, kappaEst, kernEst)
```

```
(0.005, 0.84, 0.067)
```

We get \(\kappa = 0.84\) and \(\beta = 0.07\) which we can interpret as a high probability that another large move follows and that takes around 14 days (business days) to decay. So with each large move, expect another large move within 3 weeks.
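A rough sketch of how I’m reading those numbers, using the branching-ratio interpretation (values hand-copied from the estimates above):

```julia
# Back-of-the-envelope interpretation of the fitted Hawkes parameters.
kappa_est, beta_est = 0.84, 0.067

mean_decay_days = 1 / beta_est       # ~15 days: average time an event stays influential
cascade_size = 1 / (1 - kappa_est)   # ~6.25: expected total events per exogenous trigger
```

With \(\kappa\) this close to 1, each exogenous large move is expected to set off a cascade of around six further large moves before the process settles down, which matches the bursty clusters in the data.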

When we compare the Hawkes intensity to the previous binomial intensity we get a similar shape between both models.

```
plot(spyPrices.Date, spyPrices.Binomial, label = "Binomial")
plot!(spyPrices.Date, intens, label = "Hawkes")
```

They line up quite well, which is encouraging and shows they are on a similar path. If we zoom in specifically to 2022.

```
plot(spyPrices[spyPrices.Date .>= Date("2022-01-01"), :].Date,
spyPrices[spyPrices.Date .>= Date("2022-01-01"), :].Binomial, label = "Binomial")
plot!(spyPrices[spyPrices.Date .>= Date("2022-01-01"), :].Date,
spyPrices[spyPrices.Date .>= Date("2022-01-01"), :].Intensity, label = "Hawkes")
```

Here we can see the binomial intensity stays higher for longer whereas the Hawkes process goes through quicker bursts of intensity. This is intuitive as the binomial model is using a 100-day moving average under the hood, whereas the Hawkes process is much more reactive to the underlying events.

To check whether the Hawkes process is any good we compare its likelihood to a null likelihood of a constant Poisson process.

We first fit the null point process model by optimising the `null_likelihood` across the events.

```
using Optim
null_likelihood(events, lambda, maxT) = length(events)*log(lambda) - lambda*maxT
opt = optimize(x-> -1*null_likelihood(allEvents, x[1], maxT), 0, 10)
Optim.minimizer(opt)
```

```
0.031146179404103084
```

Which gives a likelihood of:

```
null_likelihood(allEvents, Optim.minimizer(opt), maxT)
```

```
-335.1797769669301
```

Whereas the Hawkes process has a likelihood of:

```
likelihood(allEvents, bgEst, kappaEst, Exponential(1/kernEst), maxT)
```

```
-266.63091365640366
```

A substantial improvement, so all in the Hawkes process looks pretty good.
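Since the two models have different numbers of parameters, a slightly fairer comparison (my addition, not in the original analysis) is AIC, using the log-likelihood values above:

```julia
# AIC = 2k - 2*loglik; lower is better, and it penalises extra parameters.
aic(loglik, k) = 2k - 2loglik

aic_null   = aic(-335.18, 1)   # constant Poisson: one rate parameter
aic_hawkes = aic(-266.63, 3)   # Hawkes: background, kappa, and kernel decay
aic_null - aic_hawkes          # a positive gap strongly favours the Hawkes model
```

The roughly 133-point gap is far beyond what two extra parameters could explain away, so the self-exciting structure is doing real work.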

Overall, the Hawkes model subdues quite quickly, but the binomial model can remain elevated. They are covering two different behaviours. The Hawkes model can describe what happens *after* one of these large moves happens. The binomial model is mapping the momentum onto a probability of a large event.

How do we combine both the binomial and the Hawkes process model?

To start with, we need to consider a point process with variable intensity. This is known as an inhomogeneous point process. In our case, these events depend on the value of the trend signal.

\[\lambda (t) \propto \hat{r} (t)\]

\[\lambda (t) = \exp \left( \beta _0 + \beta _1 \hat{r} (t) + \beta _2 \hat{r} ^2 (t) \right)\]

Like the binomial model, we will use a quadratic combination of the values, exponentiated to keep the intensity positive. Then, given we know how to write the likelihood for a point process, we can do some maximum likelihood estimation to find the appropriate parameters.

Our `rhat` function needs to return the signal at a given time.

```
function rhat(t, spy)
    dt = minimum(spy.Date) + Day(Int(floor(t)))
    spy[spy.Date .<= dt, :Avg][end]
end
```

And our intensity function `lambda`, which uses `rhat`, plus a method to make it compatible with arrays.

```
function lambda(t, params, spy)
    exp(params[1] + params[2] * rhat(t, spy) + params[3] * rhat(t, spy)^2)
end

lambda(t::Array{<:Number}, params::Array{<:Number}, spy::DataFrame) = map(x -> lambda(x, params, spy), t)
```

The likelihood of a point process is

\[\mathcal{L} = \sum _{t_i} \log \lambda (t_i) - \int _0 ^T \lambda (t) \mathrm{d} t\]

We have to use numerical integration for the second half of the equation, which is where the `QuadGK.jl` package comes in. We pass it a function and it will do the integration for us. Job done!

```
function likelihood(params, rate, events, maxT, spy)
    sum(log.(rate(events, params, spy))) - quadgk(t -> rate(t, params, spy), 0, maxT)[1]
end
```

With all the functions ready we can optimise and find the correct parameters.

```
using Optim, QuadGK
opt = optimize(x-> -1*likelihood(x, lambda, allEvents, maxT, spyPrices), rand(3))
Optim.minimizer(opt)
```

```
3-element Vector{Float64}:
-3.4684622926014783
1.6204408269570916
2.902098418452392
```

This has a maximum likelihood of -334 which, if you scroll up, isn’t much better than the null model. So warning bells should be ringing that this isn’t a good model.

```
plot(minimum(spyPrices.Date) + Day.(Int.(collect(0:maxT))),
lambda(collect(0:maxT), Optim.minimizer(opt), spyPrices), label = :none,
title = "Poisson Intensity")
```

The intensity isn’t showing too much structure over time.

To check the fit of this model we simulate some events with the same intensity pattern.

```
lambdaMax = maximum(lambda(collect(0:0.1:maxT), Optim.minimizer(opt), spyPrices)) * 1.1
rawEvents = rand(Poisson(lambdaMax * maxT), 1)[1]
unthinnedEvents = sort(rand(Uniform(0, maxT), rawEvents))
acceptProb = lambda(unthinnedEvents, Optim.minimizer(opt), spyPrices) / lambdaMax
events = unthinnedEvents[rand(length(unthinnedEvents)) .< acceptProb];
histogram(events,label= "Simulated", bins = 100)
histogram!(allEvents, label = "True", bins = 100)
```

It’s not a great model as the simulated events don’t line up with the true events. Looking back at the intensity function we can see it doesn’t vary much around 0.03, so whilst the intensity function looks varied, zooming out shows it is quite flat.

I wanted to integrate the variable background into the Hawkes process so we could combine both models. As my Hawkes sampling is Bayesian, I have an old blog post on turning the above from an MLE into a full Bayesian estimation, but that code doesn’t work anymore. You need to use the `LogDensityProblems.jl` package to get it working, so I’m going to have to invest some time in learning that. I’ll be honest, I’m not sure how bothered I can be: I’ve got a long list of other things I want to explore, and learning another abstract interface doesn’t feel like a good use of my time. It’s frustrating because the whole point of Julia is composability; I could write a pure Julia function and run HMC on it, but now I’ve got to get another package involved. I’m sure there is a good reason and LogDensityProblems solves some real issues, but it feels a bit like the Javascript ecosystem, where everything changes and the way to do something is outdated the minute it is pushed to main.

So overall we’ve shown that the large moves don’t happen more often in down-trending markets, at least in the broad S&P500 view of the market. Both a binomial and point process model showed no improvement on a null model for predicting these extreme days whereas the Hawkes model shows that they are potentially self-exciting.
