Dean Markwick

Alpha Capture and Acquired

2024-09-19T00:00:00+00:00

People are never short of a trade idea. There is a whole industry of researchers, salespeople and amateurs coming up with trading ideas and making big calls on what stock will go up, what country will cut interest rates and what the price of gold will do next. Alpha capture is about systematically assessing ideas and working out who has alpha and generates profitable ideas and who is just making it up as they are going along.

Enjoy these types of posts? Then you should sign up for my newsletter.

Alpha capture started as a way of profiling a broker’s stock recommendation. If you have 50 people recommending you 50 different ideas, how do you know who is good? You’ll quickly run out of money if you blindly follow all the recommendations that hit your inbox. Instead, you need to profile each person’s idea and see who on average can make good recommendations. Whoever is good at picking stocks probably deserves more of your business.

It has since expanded: some hedge funds now have internal desks that run a similar analysis on their portfolio managers (PMs), to double down on profitable bets and to mitigate the risk of all the PMs picking the same stock. Picking stocks and managing a portfolio across many PMs are two different skills, and at a modern hedge fund they sit in different departments.

A simple way to measure the alpha of a PM or broker recommendation is to see if the price of a stock they buy (or recommend) goes up after the day they suggest it. Over a large enough sample, those with alpha would see their picks move higher on average, while those without alpha would average out to zero: some ideas go higher, some go lower, and the net result is zero alpha. If a PM has the opposite effect and every stock they buy goes down, they are a contrarian indicator, so take their idea and do the opposite!

Alpha Capture Systems: Past, Present, and Future Directions goes through the history of alpha capture and is a good short read that inspired this blog post.

Basic Alpha Capture

What if we wanted to try our own Alpha Capture? We need some stock recommendations and a way of calculating what happens to the price after the recommendation. This is where the Acquired podcast comes in.

Acquired tells the stories and strategies of great companies (taken from their website). It’s a pretty popular podcast and each episode gets close to a million listeners. So this makes it an ideal Alpha Capture study - when they release an episode about a company does the stock price of that company go higher or lower on average? If it were to go higher then each time an episode is released call your broker and go long the stock!

They aren’t explicitly recommending a stock by talking about it, as they say in their intro. So it’s just a toy exercise to see if there is any correlation between the stock price and the release date of an episode.

To systematically test this we need to get a list of the episodes and calculate a ‘markout’ from each episode.

Collecting Podcast Data

The internet is a wonderful thing and each episode of Acquired is available as an XML feed from transistor.fm. So with some fun XML parsing I can get the full history of the podcast with each date and title.

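using HTTP, XML, TimeZones, Dates, DataFrames, CSV

# Pull the publication date and title out of one <item> node of the feed.
function parseEpisode(x)
  rawDate = first(simplevalue.(x[tag.(x) .== "pubDate"]))
  # Note: in a Julia dateformat `ss` is milliseconds (seconds are `SS`), so the
  # feed's seconds land in the millisecond field. Only the date is used later on.
  date = ZonedDateTime(rawDate, dateformat"eee, dd uuu yyyy HH:MM:ss z")

  Dict("title" => first(simplevalue.(x[tag.(x) .== "title"])),
       "date" => date)
end

# Keep only the date part of an ISO timestamp string (everything before the "T").
function parse_date(t)
   Date(string(split(t, "T")[1]))
end

url = "https://feeds.transistor.fm/acquired"

data = parse(Node, String(HTTP.get(url).body))

# Walk down to the channel and keep only the <item> (episode) nodes.
episodes = children(data[3][1])
filter!(x -> tag(x) == "item", episodes)
episodes = children.(episodes)

episodeData = parseEpisode.(episodes)

episodeFrame = vcat(DataFrame.(episodeData)...)
CSV.write("episodeRaw.csv", episodeFrame)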

After writing the data to a CSV I need to somehow map each episode title to a stock ticker. This is a tricky task as the episode names are human-friendly, not computer-friendly. So time for our LLM overlords to lend a hand and do the heavy lifting. I drop the CSV into Perplexity and prompt it to add the relevant stock ticker to the file. I then reread the CSV into my notebook.

episodeFrame = CSV.read("episodeTicker.csv", DataFrame)
episodeFrame.date = ZonedDateTime.(String.(episodeFrame.date), dateformat"yyyy-mm-ddTHH:MM:SS.sss-z")

vcat(first(@subset(episodeFrame, :stock_ticker .!= "-"), 4),
        last(@subset(episodeFrame, :stock_ticker .!= "-"), 4))

date `ZonedDateTime`	title `String`	stock_ticker `String15`	sector_etf `String7`
2024-03-17T17:54:00.400+07:00	Renaissance Technologies	RNR	PSI
2024-02-19T17:56:00.410+08:00	Hermès	RMS.PA	GXLU
2024-01-21T17:59:00.450+08:00	Novo Nordisk (Ozempic)	NOVO-B.CO	IHE
2023-11-26T16:24:00.250+08:00	Visa	V	IPAY
2018-09-23T18:28:00.550+07:00	Season 3, Episode 5: Alibaba	BABA	KWEB
2018-08-20T09:20:00.370+07:00	Season 3, Episode 3: The Sonos IPO	SONO	GAMR
2018-08-05T18:15:00.030+07:00	Season 3, Episode 2: The Xiaomi IPO	XIACF	KWEB
2018-07-16T21:40:00.560+07:00	Season 3, Episode 1: Tesla	TSLA	TSLA

It’s done an OK job. Most of the episodes seem to correspond to the right ticker, but we can see it has hallucinated a ticker for RenTech. RenTech is a private company with no stock ticker; instead, Perplexity has decided that RNR (a reinsurance company) is the correct one. So not 100% accurate. Still, it has saved me a good chunk of time and we can move on to getting the stock price data.

We want to measure the average price move of a stock after an episode is released. If Acquired had stock-picking skill, you would expect the price to increase after the release of an episode, as they generally speak positively about the various companies.

So using AlpacaMarkets.jl we get the stock price for the days before and the days after the episode. As AlpacaMarkets only has US stock data, only some of the episodes end up with a full dataset.

What is a Markout?

We calculate the percentage change relative to the episode date and then aggregate all the stock tickers together.

\[\text{Markout} = \frac{p - p_{\text{episode released}}}{p_{\text{episode released}}}\]
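As a quick worked example of the formula, here is a minimal sketch on made-up prices. The price series, the `days` vector and the helper name `markout_bps` are all hypothetical and only there for illustration; the scaling by 1e4 (basis points) mirrors what the code further down does.

```julia
# Hypothetical daily prices around an episode; the numbers are made up.
prices = [98.0, 99.0, 100.0, 101.0, 99.5]
days   = [-2, -1, 0, 1, 2]            # days relative to the release

# Price on the release day (the "arrival" price).
arrival = prices[findfirst(==(0), days)]

# Markout in basis points: 1e4 * (p - p_release) / p_release.
markout_bps(p, p0) = 1e4 * (p - p0) / p0

markouts = markout_bps.(prices, arrival)

for (d, m) in zip(days, markouts)
    println("day ", d, ": ", m, " bps")
end
```

By construction the markout is zero on the release day, negative where the price was below the release price and positive where it was above.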

Acquired is about great companies so they choose to speak favourably about a company, therefore I think it’s a reasonable assumption that we expect the stock price to increase after everyone gets round to listening to it. So once we aggregate all the episodes we should hopefully have enough data to decide if this is true.

using AlpacaMarkets, Dates, DataFrames, DataFramesMeta, Statistics, Plots

# Daily bars for one ticker, starting a month before the episode date.
function getStockData(stock, startDate)
  prices = AlpacaMarkets.stock_bars(stock, "1Day", startTime=startDate - Month(1), limit=10000)[1]
  prices.date .= startDate
  prices.t = parse_date.(prices.t)
  prices[:, [:t, :symbol, :vw, :date]]
end

# Markout in basis points relative to the last price on or before the episode date.
function calcMarkout(data)
   arrivalInd = findlast(data.t .<= data.date)
   arrivalPrice = data[arrivalInd, :vw]
   data.arrivalPrice .= arrivalPrice
   data.ts = [x.value for x in (data.t .- data.date)]  # calendar days from release
   data.markout = 1e4*(data.vw .- data.arrivalPrice) ./ data.arrivalPrice
   data
end

res = DataFrame[]

for row in eachrow(episodeFrame)
    try
        stockData = getStockData(row.stock_ticker, Date(row.date))
        stockData = calcMarkout(stockData)
        push!(res, stockData)
    catch e
        # No usable data for this ticker (e.g. a non-US listing or "-"), so log it and move on.
        println(row.stock_ticker)
    end
end

res = vcat(res...)

With the data pulled we now aggregate by each day before and after the episode.

markoutRes = @combine(groupby(res, :ts), :n = length(:markout), 
                                         :avgMarkout = mean(:markout),
                                         :devMarkout = std(:markout))
markoutRes = @transform(markoutRes, :errMarkout = :devMarkout ./sqrt.(:n))

Always need error bars as this data gets noisy.

markoutResSub = @subset(markoutRes, :ts .<= 60, :n .>= 10)
plot(markoutResSub.ts, markoutResSub.avgMarkout, yerr=markoutResSub.errMarkout, 
     xlabel = "Days", ylabel = "Markout", title = "Acquired Alpha Capture", label = :none)
hline!([0], ls = :dash, color = "grey", label = :none)
vline!([0], ls = :dash, color = "grey", label = :none)

Not really a pattern. The majority of the error bars overlap zero after the podcast is released. If you squint a little there seems to be a slight downward trend post-episode, which would suggest they talk about a company at the peak of its stock price.

Beforehand there is a bit of positive momentum, again suggesting that they release the podcast at the peak of the stock price. Now this is even more of a stretch given there is only 1 podcast a month and it takes more than 20 days to prepare an episode (I imagine!), so more noise than signal.

markoutIndRes = @combine(groupby(res, [:symbol, :ts]), :n = length(:markout), 
                                         :avgMarkout = mean(:markout),
                                         :devMarkout = std(:markout))
markoutIndRes = @transform(markoutIndRes, :errMarkout = :devMarkout ./sqrt.(:n))

p = plot()
hline!(p, [0], ls = :dash, color = "grey", label = :none)
vline!(p, [0], ls = :dash, color = "grey", label = :none)
for sym in ["TSLA", "V", "META"]
   markoutResSub = sort(@subset(markoutIndRes, :symbol .== sym, :ts .<= 60, :n .>= 1), :ts)
    plot!(p, markoutResSub.ts, markoutResSub.avgMarkout, yerr=markoutResSub.errMarkout, 
     xlabel = "Days", ylabel = "Markout", title = "Acquired Alpha Capture", label = sym, lw =2) 
end
p

When we pull out 3 examples of episodes we can see the randomness and specifically the volatility of TSLA here.

Conclusion

From this, we would not put any specific weight on the stock performance after an episode is released. There doesn’t appear to be any statistical pattern to exploit. No alpha means no alpha capture. It is a nice exercise though and has hopefully explained the concept of a markout.

Solving the Almgren Chriss Model

2024-06-06T00:00:00+00:00

The Almgren Chriss model from Optimal Execution of Portfolio Transactions is the best-known optimal execution model and provides the foundational maths for thinking about how to trade some quantity of an asset. This blog post goes through the maths: how we set the problem up and arrive at the various solutions.

I first encountered the Almgren Chriss model in my initial PhD year through a Microstructure and Machine Learning course. It was for 2 hours at 18:00 on a Friday night and on the other side of London from where I lived, so a bit of a pain for me to attend. This post in essence is inspired by these notes as I’ve always wanted to summarise them into a digital version. So this is a maths-heavy post that will act as a springboard for some more future content.

The Trading Problem

We have an amount $X_0$ of something to trade between time $0$ and $T$, such that $X_T = 0$. How should we slice and dice our trades to minimise the execution cost?

We need a model of

How the price moves
How our trading affects prices

then we can build a trading cost function that we then optimise in different ways.

Price Dynamics

The price evolves like $S_t = \bar{S} _t + \eta v_t + \theta (X_0 - X_t),$

$\bar{S} _t$ is the unperturbed stock price
$\eta \cdot v_t$ is the temporary market impact that scales with the trading speed $v_t$
$\theta \cdot (X_0 - X_t)$ is the permanent market impact, which scales with the amount traded so far

The unperturbed price is a simple Gaussian random walk with no drift: $\mathrm{d} \bar{S} _t = \sigma S_0 \mathrm{d} W_t$

The trading rate $v_t = - \frac{\mathrm{d} X_t}{\mathrm{d}t} = - \dot{X} _t$ so simply the speed at which we are executing the trades.

So the fundamental price ($\bar{S}$) evolves as a random walk but our trading means that the observed price is higher by an amount proportional to our trading speed. The signs of the components are set up such that we are buying - so the faster we trade the more we distort the price from the true price by pushing it higher.

Trading Costs

The final cost of the execution is the sum of the amount we traded multiplied by the price of all the trades. In continuous time this is simply the integral of this observed stock price multiplied by the trading speed over the execution window:

\[C_{0, T} = \int _0 ^T S_t v_t \mathrm{d} t,\]

which after inserting the equation for the asset price gives us three different components

\[C_{0,T} = \underbrace {\int _0 ^T \bar{S}_t v_t \mathrm{d} t}_\text{(1)} + \underbrace{\int_0 ^T \eta v_t ^2 \mathrm{d} t}_\text{(2)} + \underbrace{\int _0 ^T \theta (X_0 - X_t) v_t \mathrm{d}t}_\text{(3)}\]

For term $(1)$ we use $v_t \mathrm{d}t = -\mathrm{d}X_t$, integrate by parts and substitute in the random walk for $\bar{S}_t$. The boundary term at $T$ vanishes because $X_T = 0$, and we write $S_0$ for the starting price $\bar{S}_0$:

\[\begin{align*} \int _0 ^T \bar{S}_t v_t \mathrm{d} t & =- \int _0 ^T \bar{S}_t \mathrm{d}X_t \\ & = - \left[\bar{S}_t X_t \right]_0^T + \int _0 ^T X_t \mathrm{d} \bar{S}_t \\ & = -(\bar{S}_T X_T - \bar{S}_0 X_0) + \int _0 ^T X_t \sigma S_0 \mathrm{d} W_t \\ & = X_0 S_0 + \int _0 ^T X_t \sigma S_0 \mathrm{d} W_t \end{align*}\]

For term $(3)$, again using $v_t \mathrm{d}t = -\mathrm{d}X_t$:

\[\theta \int _0 ^T (X_0 - X_t) v_t \mathrm{d} t = -\theta \int _0 ^T (X_0 - X_t) \mathrm{d} X_t = \theta \left[ \frac{(X_0 - X_t)^2}{2} \right]_0^T = \frac{\theta X_0^2}{2},\]

which gives us a formula for $C_{0, T}$

\[C_{0, T} = X_0 S_0 + \int _0 ^T X_t \sigma S_0 \mathrm{d} W_t + \eta \int _0 ^T v_t ^2 \mathrm{d}t + \frac{\theta X_0^2}{2}.\]

This is our cost function and we want to find the $v_t$ that minimises the final cost.

Minimising the Expected Cost

If we take expectations (we want to minimise the average execution path - each path will be different as it is a stochastic problem) we end up with just one term through which we can influence the expected cost:

\[\mathbb{E}[C] = \underbrace{X_0 S_0 + \frac{\theta X_0^2}{2}}_{\text{Constant}} + \underbrace{\mathbb{E} \left[\int _0 ^T X_t \sigma S_0 \mathrm{d} W_t \right]}_{= 0} + \mathbb{E} \left[ \eta \int _0 ^T v_t ^2 \mathrm{d}t \right]\]

So we minimise the expected cost by finding the trading speed that minimises this term

\[\min _{v_t} \eta \int _0 ^T v^2_t \mathrm{d} t.\]

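To solve this we apply the Euler-Lagrange equation to minimise the action. The action is the integral and the Lagrangian $f$ is the term inside it. Because $v_t = -\dot{X}_t$, the equation picks up a minus sign when written in terms of $v$:

\[\frac{\partial f}{\partial X} = \frac{\mathrm{d}}{\mathrm{d}t} \frac{\partial f}{\partial \dot{X}} = -\frac{\mathrm{d}}{\mathrm{d}t} \frac{\partial f}{\partial v}\]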

And from the above

\[\begin{align*} f & = v^2_t \\ \frac{\partial f}{\partial X} & = 0 \\ \frac{\partial f}{\partial v} & = 2 v_t, \end{align*}\]

\[\frac{\mathrm{d}}{\mathrm{d} t} v_t = 0,\]

which means the speed of the execution must be constant. Writing the constant slope of the inventory as $\dot{X}_t = -v_t = B$, the position is linear in time:

\[X_t = A + B t.\]

We have the boundary conditions

\[X_0 = A,\] \[X_T = X_0 + BT = 0,\] \[B = \frac{-X_0}{T},\] \[X_t = X_0 - \frac{X_0}{T} t.\]

Putting this trading schedule back into the expected cost formula gives us an overall result

\[\int _0 ^T v_t^2\mathrm{d} t = \frac{X^2_0}{T^2} (T - 0) = \frac{X_0^2}{T}.\]
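To make the result concrete, here is a minimal sketch of the TWAP schedule and its temporary impact cost $\eta X_0^2 / T$. The function names and the parameter values are assumptions chosen purely for illustration, not calibrated numbers.

```julia
# Linear (TWAP) inventory path: X_t = X0 * (1 - t / T).
twap_inventory(X0, T, t) = X0 * (1 - t / T)

# Temporary impact cost of trading at the constant speed X0 / T: η * X0^2 / T.
twap_impact_cost(η, X0, T) = η * X0^2 / T

X0, η = 1_000.0, 0.01   # illustrative values only

for T in (1.0, 2.0, 4.0)
    println("T = ", T,
            "  inventory at T/2 = ", twap_inventory(X0, T, T / 2),
            "  impact cost = ", twap_impact_cost(η, X0, T))
end
```

Halfway through the window half of the position is always left, and doubling the execution window halves the impact cost.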

When we plot this schedule we can see that the speed is constant and we are simply running a TWAP (time-weighted average price).

The maths is telling us:

To minimise cost for an amount $X_0$ then you should run your TWAP for an infinite amount of time.

This neglects the price risk, so sure, run a very long TWAP but don’t complain when the market trends against you!

How can we account for this price risk?

Mean-Variance Optimisation of the Almgren Chriss Model

We now need to minimise both the expected cost and the variance of the cost with our trading schedule. This means we will now be sensitive to cases where the price moves far away from the starting value.

We introduce a new parameter, $\lambda$, that controls our risk aversion. So now we are worried about the price potentially running away from us if we take too long to finish the trade

\[\min _ {v_t} \left( \mathbb{E} [C] + \lambda \text{Var} [C] \right ),\]

so now we want to minimise the average and the variation of the trading cost and see what schedule that produces.

When we took the expectation, only the deterministic bits remained. When we calculate the variance only the random bit remains, and the Itô isometry turns the squared stochastic integral into an ordinary one:

\[\text{Var} [C] = \mathbb{E} \left[ \left( \sigma \bar{S} _0 \int _0 ^T X_t \mathrm{d} W_t \right) ^2 \right] = \sigma ^2 \bar{S}_0^2 \int _0 ^T X_t ^2 \mathrm{d} t,\]

which means our minimisation problem can be written as:

\[\min _{v_t} \eta \int _0 ^T v_t ^2 \mathrm{d} t + \lambda \sigma ^2 \bar{S}_0^2 \int _0 ^T X_t ^2 \mathrm{d} t.\]

Using the Euler-Lagrange equations again, with $A = \eta$, $B = \lambda \sigma^2 \bar{S}_0^2$ and $v_t = -\dot{X}_t$:

\[\begin{align*} f & = A v_t^2 + B X_t^2 \\ \frac{\partial f}{\partial X} & = 2B X_t \\ \frac{\partial f}{\partial v} & = 2A v_t \\ B X_t & = -A\frac{\mathrm{d} }{\mathrm{d} t} v_t = A \frac{\mathrm{d}^2}{\mathrm{d} t^2} X_t \\ \frac{\mathrm{d}^2}{\mathrm{d} t^2} X_t & = \frac{B}{A} X_t. \end{align*}\]

This is a second-order linear ordinary differential equation with solution

\[X_t = c_1 e^{\sqrt{\frac{B}{A}} t} + c_2 e ^{- \sqrt{\frac{B}{A}} t}.\]

Again, applying boundary conditions

\[X_0 = c_1 + c_2,\] \[X_T = 0 = c_1 e^{\sqrt{\frac{B}{A}} T} + c_2 e^{-\sqrt{\frac{B}{A}} T},\] \[X_t = X_0 \frac{\sinh \left( \kappa (T-t) \right)}{\sinh \left( \kappa T \right)}, \quad \kappa = \sqrt{\frac{B}{A}} = \sqrt{\frac{\lambda \sigma ^2 \bar{S}_0^2}{\eta}}.\]

Which is a funny expression, but underneath it is just a combination of exponentials.

We now have the additional $\lambda$ parameter and so plot the execution schedule for different risk aversions

A lower $\lambda$ means a higher risk tolerance, so the schedule becomes closer to the TWAP. In general, we can see that the Almgren Chriss solution is front-loaded: most of the trading is done early on in the time window.
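A minimal sketch of the schedule for a few risk aversions is below. It assumes the hyperbolic-sine form $X_t = X_0 \sinh(\kappa (T - t)) / \sinh(\kappa T)$ with $\kappa = \sqrt{\lambda \sigma^2 S_0^2 / \eta}$, which is what the Euler-Lagrange equation $\ddot{X}_t = (\lambda \sigma^2 S_0^2 / \eta) X_t$ gives. The function name and every parameter value are made up for illustration.

```julia
# Almgren-Chriss inventory path X_t = X0 * sinh(κ (T - t)) / sinh(κ T),
# with κ = sqrt(λ σ² S0² / η). Parameter values below are illustrative only.
function ac_inventory(X0, T, t; λ, σ, S0, η)
    κ = sqrt(λ * σ^2 * S0^2 / η)
    return X0 * sinh(κ * (T - t)) / sinh(κ * T)
end

X0, T = 1_000.0, 1.0
params = (σ = 0.02, S0 = 100.0, η = 0.1)

# Inventory left at the halfway point; a TWAP would have exactly X0 / 2 left.
for λ in (1e-4, 1e-2, 1.0)
    half = ac_inventory(X0, T, T / 2; λ = λ, params...)
    println("λ = ", λ, "  inventory left at T/2 = ", round(half, digits = 1))
end
```

For a tiny $\lambda$ roughly half the position is left at the halfway point, as with a TWAP; as $\lambda$ grows, more of the trade is done early.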

Summary

Ok maths over, put down your pencils and breathe. We’ve gone through the full problem set-up and shown how the TWAP minimises expected costs for a risk-neutral investor and how an exponential execution schedule minimises cost for a risk-sensitive investor.

Now we know the maths we can go on to do some interesting things.

Currency Hedging and Principal Component Analysis

2024-04-25T00:00:00+00:00

Principal component analysis (PCA) reduces a dataset to its main components. When we apply it to a dataset of different currencies it helps us understand how each currency drives the overall portfolio and what currency might be a common factor.

This post was inspired by a problem on the r/quant subreddit where someone posted their interview/take-home question.

A client is considering using SGD to (proxy) hedge their exposure to a basket of other Asian currencies. Is this likely to be effective? What analysis could you produce that would help inform their decision? The client is a US Corporate. The client is exposed to medium-term changes (say monthly) in the currency. The client has equal (USD equivalent) revenues in each Asian currency. We are not considering hedging costs for this analysis (spot-only component). The data for daily close spot values against USD for each pair is provided. Which currency pairs will it work better for? Would it work for an equally weighted currency portfolio? Would another (single) currency work better? Which correlations should we consider and how reliable are these?

This is an interesting question and not too dissimilar to the occasional question I answer in my day job. So I thought I’d run through how I might answer it.

Getting FX Data

First, we need to get some data and I’ll be using AlphaVantage to pull daily closing prices of the different currencies. I’ll calculate the log returns and save the data to cache it for future use. AlphaVantage only lets you make 25 calls a day, so each time I mucked up I got locked out for the day, delaying the analysis. We have to start from 2014 as this is the earliest common date across all currencies.

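using AlphaVantage, CSV, DataFrames, DataFramesMeta, Dates, Statistics, Plots

# Download the full daily history of USD/ccy and compute daily log returns of the close.
function _pull_data(ccy)
    println(ccy)
    res = AlphaVantage.fx_daily("USD", ccy, outputsize="full", datatype="csv")
    res = DataFrame(Dict(:Date=>Date.(res[1][:, 1]), :c=>Float64.(res[1][:,5]), :ccy => ccy));
    res = sort(res, :Date)
    res = @transform(res, :LogReturn = [0; diff(log.(:c))])
    res
end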

function pull_data(ccy)
    if isfile("$ccy.csv")
        res = CSV.read("$ccy.csv", DataFrame)
    else
        res = _pull_data(ccy)
        CSV.write("$ccy.csv", res)
    end
    res
end

ccys = ["JPY", "CNH", "SGD", "THB", "HKD", "KRW", "TWD"]
res = vcat(pull_data.(ccys)...);
res = sort(res, :Date)
res = @transform(groupby(res, :ccy), :LogReturn = [0; diff(log.(:c))])
res = @subset(res, :Date .>= Date("2014-11-24"))

Like all good blog posts, let’s start with the plot of the cumulative returns. Only HKD stands out as something different given its peg to USD.

p = plot(ylabel = "Cumulative Return")
for ccy in ccys
    plot!(p, res[res.ccy .== ccy, :].Date, cumsum(res[res.ccy .== ccy, :].LogReturn), label = ccy, lw = 2)
end
p

According to the problem, our client is long equal amounts of these Asian currencies, so it makes sense to calculate the market returns by taking the average return each day.

market = @combine(groupby(res, :Date), :LogReturn = mean(:LogReturn))
market[!, :ccy] .= "Market"
market[!, :c] .= NaN;

Which we add to the original plot.

p = plot!(p, market.Date, cumsum(market.LogReturn),
    label = "Market", color = "black", lw  = 2)

The client thinks that hedging with SGD alone is enough to protect against the overall market returns. We can see from the graph that this probably isn’t the case. But how do we recommend a better approach?

First, we will start with the correlation in returns between the different currencies. This will shed some light on how linked they are and is also simple to explain to the client.

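# One column of log returns per currency, keeping only dates where all are present.
modelData = dropmissing(unstack(res, :Date, :ccy, :LogReturn))
cr = cor(Matrix(modelData[:, [:JPY, :CNH, :SGD, :THB, :HKD, :KRW, :TWD]]))
heatmap(ccys, ccys, cr .> 0.5)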

We use a heat-map, but only highlight when two currencies have a correlation > 0.5, otherwise it’s a bit of a psychedelic nightmare.

We can see that HKD has a low correlation with most, KRW and SGD are highly correlated with each other, and KRW has a high correlation with the majority of these currencies. However, we will use the covariance matrix to analyse the best hedging portfolio rather than the correlation matrix.
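To show why the correlation matters for a proxy hedge, here is a minimal sketch on synthetic data (not the FX data from this post). It assumes a minimum-variance hedge, where the hedge ratio is cov(basket, proxy) / var(proxy) and the share of basket variance removed in sample equals the squared correlation. The series `basket` and `proxy` and all the scaling factors are made up.

```julia
using Random, Statistics

Random.seed!(1)
n = 5_000
common = randn(n)                               # shared regional factor (synthetic)
basket = 0.006 .* common .+ 0.002 .* randn(n)   # stand-in for the equal-weighted basket
proxy  = 0.004 .* common .+ 0.003 .* randn(n)   # stand-in for the single hedge currency

ρ = cor(basket, proxy)
h = cov(basket, proxy) / var(proxy)             # minimum-variance hedge ratio
hedged = basket .- h .* proxy                   # basket return after the proxy hedge
reduction = 1 - var(hedged) / var(basket)       # equals ρ^2 in sample

println("correlation = ", round(ρ, digits = 2),
        ", hedge ratio = ", round(h, digits = 2),
        ", variance removed = ", round(reduction, digits = 2))
```

The same calculation could be run on monthly returns of the real series, which is the horizon the client cares about.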

Principal Component Analysis

Principal component analysis (or PCA) is a tool that tries to find a common basis of variation in a matrix. It’s about transforming the data into uncorrelated components through linear algebra.

For this we are using the covariance matrix, so now the diagonals are the individual price series variances and the off-diagonals are the covariances between two currencies. If this were a different problem we might rescale the returns so they all had the same volatility but this would mean applying leverage, which our hypothetical customer probably wouldn’t be up for.
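As a small illustration of what "uncorrelated components" means, here is a sketch that does PCA by hand as an eigendecomposition of a covariance matrix, using only the standard library. The three return series are synthetic and the variable names are hypothetical; this is not the MultivariateStats.jl workflow used in the rest of the post.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(2)
n = 2_000
factor = randn(n)
# Three synthetic return series with different loadings on one common factor.
X = hcat(0.8 .* factor .+ 0.2 .* randn(n),
         0.5 .* factor .+ 0.2 .* randn(n),
         0.1 .* randn(n))

Σ = cov(X)                          # 3x3 covariance matrix, columns are the variables
E = eigen(Symmetric(Σ))             # eigenvalues come back in ascending order
order = sortperm(E.values, rev = true)
λs = E.values[order]
V  = E.vectors[:, order]            # columns are the principal component weights

explained = λs ./ sum(λs)           # share of total variance per component
pcs = X * V                         # component returns, uncorrelated in sample

println("variance explained: ", round.(explained, digits = 3))
println("corr(PC1, PC2) = ", round(cor(pcs[:, 1], pcs[:, 2]), digits = 6))
```

The first column of `V` plays the role of the 'market' weights and the explained shares sum to one.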

We pull out the covariance matrix

modelData = dropmissing(unstack(res, :Date, :ccy, :LogReturn))
cm = cov(Matrix(modelData[:, [:JPY, :CNH, :SGD, :THB, :HKD, :KRW, :TWD]]))

The MultivariateStats.jl package has the functions for doing PCA and the appropriate functions for pulling out the right data after fitting the PCA model.

pcaRes = fit(PCA, cm; maxoutdim=3)

Firstly the weights of all the currencies for the three principal components.

	PC1 Weights	PC2 Weights	PC3 Weights
JPY	4.96845E-06	9.11362E-06	-2.98467E-07
CNH	2.11372E-06	-1.1987E-06	-4.78571E-08
SGD	3.35545E-06	-5.17405E-07	-1.00414E-07
THB	3.21579E-06	-7.50513E-07	3.05907E-06
HKD	4.21256E-08	-7.74387E-08	-1.84514E-08
KRW	7.67389E-06	-4.39207E-06	-8.40943E-07
TWD	2.42907E-06	-2.01299E-06	-6.01965E-07

PC1 shows the weights for each currency but is unnormalised. The key thing we can see here is that HKD is orders of magnitude smaller than the others.
PC2 is long JPY and short all the others
PC3 is long THB and short all the others

Then the explained variance of the three components.

	PC1	PC2	PC3
Eigenvalues	1.15544e-10	1.08674e-10	1.05292e-11
Variance explained	0.47267	0.444567	0.0430731
Cumulative variance	0.47267	0.917237	0.96031

The first component explains 47% of the variance and including the second component takes this to 92%, with the third component adding a further 4% to take it to 96% in total. This means that this dataset can be broken down quite nicely into two principal components that explain most of the variation.

The first principal component is commonly called the ‘market’ portfolio and represents the overall combined market dynamics of the portfolio. The next portfolio (using the 2nd PC weights) is uncorrelated to the market and thus more diversified to the overall market.

In our problem then we can see that we are trying to come up with a representation of the market and use that to decide how to hedge out our currencies. So the first principal component is the most relevant.

We take these principal component weights and join them to the original dataframe to start exploring what the market portfolio looks like.

evFrame = DataFrame(Dict(:ccy => String.([:JPY, :CNH, :SGD, :THB, :HKD, :KRW, :TWD]), 
          :ev1 => eigvecs(pcaRes)[:,1],
          :ev2 => eigvecs(pcaRes)[:,2]))
sort!(evFrame, :ev1)

res = leftjoin(res, dropmissing(evFrame), on = :ccy)

evFrame = sort(evFrame, :ev1);

Then plotting the weights by currency pair

bar(evFrame.ccy, evFrame.ev1 ./ sum(evFrame.ev1), label = "Eigen Weights")

These are the weights of the different currencies of the first eigen portfolio. This combination of currencies is what we would recommend if the client was exposed to a similar basket. The key points:

The client is long these currencies through their business
They short this portfolio and thus are market-neutral

We now calculate the returns of the eigen portfolios and of the portfolios that only use the currencies with the largest 2 (and 3) weights.

evPortfolios = @combine(groupby(res, :Date), 
         :ReturnEV1 = sum(:LogReturn .* :ev1) ./ sum(:ev1), 
         :ReturnEV2 = sum(:LogReturn .* :ev2) ./ sum(:ev2));

ccy2Portfolio = @combine(groupby(res[in.(res.ccy, Ref(["KRW", "JPY"])), :], :Date), 
         :Return2Ccy = sum(:LogReturn .* :ev1) ./ sum(:ev1));

ccy3Portfolio = @combine(groupby(res[in.(res.ccy, Ref(["KRW", "JPY", "SGD"])), :], :Date), 
         :Return3Ccy = sum(:LogReturn .* :ev1) ./ sum(:ev1));

And plotting these returns

plot(market.Date, cumsum(market.LogReturn), label = "Market", color = "black", lw = 2)
plot!(evPortfolios.Date,  cumsum(evPortfolios.ReturnEV1), label = "Eigen Portfolio", lw = 2)
plot!(ccy2Portfolio.Date,  cumsum(ccy2Portfolio.Return2Ccy), label = "2 Ccy", lw =2)
plot!(ccy3Portfolio.Date,  cumsum(ccy3Portfolio.Return3Ccy), label = "3 Ccy", lw = 2)

Then finally, looking at the correlation between these portfolios

	Market Return	Market Eigen Portfolio	2nd Eigen Portfolio	KRW + JPY	KRW + JPY + SGD
Market Return	1.0	0.99	0.01	0.93	0.95
Market Eigen Portfolio	0.99	1.0	0.01	0.97	0.98
2nd Eigen Portfolio	0.01	0.01	1.0	0.11	0.08
KRW + JPY	0.93	0.97	0.11	1.0	0.99
KRW + JPY + SGD	0.95	0.99	0.08	0.99	1.0

The Eigen Portfolio 1 is most correlated with the equal-weighted portfolio.
With just KRW and JPY you get to a 93% correlation with the market.
KRW, JPY and SGD gets you to a 95% with the market.

As expected Eigen portfolio 2 is the most uncorrelated with the market.

Summary

So our final answer to the client would be:

We have a proprietary portfolio (the market eigen portfolio) that you should hedge with - this will give you the best outcome.
If you don’t want the full portfolio use a 60/40 ratio of KRW and JPY.
SGD probably isn’t a great idea and will leave you exposed.

Now, we are assuming that these weightings are stable through time, haven’t changed recently and are therefore valid for future returns too. We are also ignoring transaction costs: KRW is an NDF and more expensive to trade than a spot currency (like JPY), so this approach will break down if the client needs to hedge a significant amount.

Calibrating an Ornstein–Uhlenbeck Process

2024-03-09T00:00:00+00:00

Read enough quant finance papers or books and you’ll come across the Ornstein–Uhlenbeck (OU) process. This is a post that explores the OU process, the equations, how we can simulate such a process and then estimate the parameters.

I’ve briefly touched on mean reversion and OU processes before in my Stat Arb - An Easy Walkthrough blog post where we modelled the spread between an asset and its respective ETF. The whole concept of ‘mean reversion’ is something that comes up frequently in finance and at different time scales. It can be thought of as the first basic extension of Brownian motion: instead of things moving randomly there is now a slight structure where the process oscillates around a constant value.

The Hudson Thames group have a similar post on OU processes (Mean-Reverting Spread Modeling: Caveats in Calibrating the OU Process) and my post should be a nice complement with code and some extensions.

The Ornstein-Uhlenbeck Equation

As a continuous process, we write the change in $X_t$ as an increment in time and some noise

\[\mathrm{d}X_t = \theta (\mu - X_t) \mathrm{d}t + \sigma \mathrm{d}W_t\]

The amount it changes in time depends on the previous $X_t$ and two free parameters $\mu$ and $\theta$.

The $\mu$ is the long-term mean of the process, the level it is pulled towards
The $\theta$ is the mean reversion or momentum parameter depending on the sign.

If $\theta$ is 0 we can see the equation collapses down to a simple random walk.

If we assume $\mu = 0$, so the long-term average is 0, then a positive value of $\theta$ means we see mean reversion. Large values of $X$ mean the next change is likely to have a negative sign, leading to a smaller value in $X$.

A negative value of $\theta$ means the opposite and we end up with a large value in $X$ generating a further large positive change and the process explodes. If we discretise the process we can simulate some samples with different parameters to illustrate these two modes.

\[X_{t+1} - X_t = \theta (\mu - X_t) \Delta t + \sigma \sqrt{\Delta t} W_t\]

where $W_t \sim N(0,1)$, which is easy to write out in Julia. We can save some time by drawing the random values first and then just summing everything together.

using Distributions, Plots

function simulate_os(theta, mu, sigma, dt, maxT, initial)
    p = Array{Float64}(undef, length(0:dt:maxT))
    p[1] = initial
    w = sigma * rand(Normal(), length(p)) * sqrt(dt)
    for i in 1:(length(p)-1)
        p[i+1] = p[i] + theta*(mu-p[i])*dt + w[i]
    end
    return p
end

We have two classes of OU processes we want to simulate, a mean reverting version ($\theta > 0$) and a momentum version ($\theta < 0$), and we also want to simulate a random walk at the same time, so $\theta = 0$. We will assume $\mu = 0$ which keeps the pictures simple.

maxT = 5
dt = 1/(60*60)
vol = 0.005

initial = 0.00*rand(Normal())

p1 = simulate_os(-0.5, 0, vol, dt, maxT, initial)
p2 = simulate_os(0.5, 0, vol, dt, maxT, initial)
p3 = simulate_os(0, 0, vol, dt, maxT, initial)

plot(0:dt:maxT, p1, label = "Momentum")
plot!(0:dt:maxT, p2, label = "Mean Reversion")
plot!(0:dt:maxT, p3, label = "Random Walk")

The mean reversion (orange) hasn’t moved away from the long-term average ($\mu=0$) and the momentum has diverged the furthest from the starting point, which lines up with the name. The random walk sits in between the two, as we would expect.

Now we have successfully simulated the process we want to try and estimate the $\theta$ parameter from the simulation. We have two slightly different (but similar) methods to achieve this.

OLS Calibration of an OU Process

When we look at the generating equation we can simply rearrange it into a linear equation.

\[\Delta X = \theta \mu \Delta t - \theta \Delta t X_t + \epsilon\]

and the usual OLS equation

\[y = \alpha + \beta X + \epsilon\]

such that

\[\alpha = \theta \mu \Delta t\] \[\beta = -\theta \Delta t\]

where $\epsilon$ is the noise. So we just need a DataFrame with the difference between subsequent observations and relate that to the current observation. Just a diff and a shift.

using DataFrames, DataFramesMeta
momData = DataFrame(y=p1)
momData = @transform(momData, :diffY = [NaN; diff(:y)], :prevY = [NaN; :y[1:(end-1)]])

Then using the standard OLS process from the GLM package.

using GLM

mdl = lm(@formula(diffY ~ prevY), momData[2:end, :])
alpha, beta = coef(mdl)

theta = -beta / dt
mu = alpha / (theta * dt)

Which gives us $\mu = 0.0075, \theta = -0.3989$, so close to zero for the drift and the reversion parameter has the correct sign.

Doing the same for the mean reversion data.

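# build the same diff/shift DataFrame from the mean-reverting path
revData = DataFrame(y=p2)
revData = @transform(revData, :diffY = [NaN; diff(:y)], :prevY = [NaN; :y[1:(end-1)]])
mdl = lm(@formula(diffY ~ prevY), revData[2:end, :])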
alpha, beta = coef(mdl)

theta = -beta / dt
mu = alpha / (theta * dt)

This time $\mu = 0.001$ and $\theta = 1.2797$. So a little wrong compared to the true values, but at least the correct sign.

Does Bootstrapping Help?

It could be that we need more data, so we use the bootstrap to randomly sample from the population to give us pseudo-new draws. We use the DataFrames again and pull random rows with replacement to build out the data set. We do this sampling 1000 times.

res = zeros(1000)
for i in 1:1000
    mdl = lm(@formula(diffY ~ prevY + 0), momData[sample(2:nrow(momData), nrow(momData), replace=true), :])
    res[i] = -first(coef(mdl)/dt)
end

bootMom = histogram(res, label = :none, title = "Momentum", color = "#7570b3")
bootMom = vline!(bootMom, [-0.5], label = "Truth", lw = 2)
bootMom = vline!(bootMom, [0.0], label = :none, color = "black")

We then do the same for the reversion data.

res = zeros(1000)
for i in 1:1000
    mdl = lm(@formula(diffY ~ prevY + 0), revData[sample(2:nrow(revData), nrow(revData), replace=true), :])
    res[i] = first(-coef(mdl)/dt)
end

bootRev = histogram(res, label = :none, title = "Reversion", color = "#1b9e77")
bootRev = vline!(bootRev, [0.5], label = "Truth", lw = 2)
bootRev = vline!(bootRev, [0.0], label = :none, color = "black")

Then combining both the graphs into one plot.

plot(bootMom, bootRev, 
  layout=(2,1),dpi=900, size=(800, 300),
  background_color=:transparent, foreground_color=:black,
     link=:all)

The momentum bootstrap has worked and centred around the correct value, but the same cannot be said for the reversion plot. However, it has correctly guessed the sign.

AR(1) Calibration of an OU Process

If we continue assuming that $\mu = 0$ then we can simplify the OLS to a 1-parameter regression - OLS without an intercept. From the generating process, we can see that this is an AR(1) process - each observation depends on the previous observation by some amount.

\[\phi = \frac{\sum _i X_i X_{i-1}}{\sum _i X_{i-1}^2}\]

then the reversion parameter is calculated as

\[\theta = - \frac{\log \phi}{\Delta t}\]
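
As a short sketch of where the logarithm comes from (assuming $\mu = 0$, as above): over a step of length $\Delta t$ the continuous-time OU process has the exact AR(1) form

\[X_{t+\Delta t} = e^{-\theta \Delta t} X_t + \epsilon_t\]

so the AR(1) coefficient is $\phi = e^{-\theta \Delta t}$, and taking logs gives $\theta = -\log \phi / \Delta t$. For small $\theta \Delta t$ we have $e^{-\theta \Delta t} \approx 1 - \theta \Delta t$, which is the coefficient on $X_t$ in the discretised equation we simulated from, so the two views agree closely at the small $\Delta t$ used here.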

This gives us a simple equation to calculate $\theta$ now.

For the momentum sample:

phi = sum(p1[2:end] .* p1[1:(end-1)]) / sum(p1[1:(end-1)] .^2)
-log(phi)/dt

Gives $\theta = -0.50184$, so very close to the true value.

For the reversion sample

phi = sum(p2[2:end] .* p2[1:(end-1)]) / sum(p2[1:(end-1)] .^2)
-log(phi)/dt

Gives $\theta = 1.26$, so correct sign, but quite a way off.

Finally, for the random walk

phi = sum(p3[2:end] .* p3[1:(end-1)]) / sum(p3[1:(end-1)] .^2)
-log(phi)/dt

Produces $\theta = -0.027$, so quite close to zero.

Again, values are similar to what we expect, so our estimation process appears to be working.

Using Multiple Samples for Calibrating an OU Process

If you aren’t convinced I don’t blame you. Those point estimates above are nowhere near the actual values that simulated the data so it’s hard to believe the estimation method is working. Instead, what we need to do is repeat the process and generate many more price paths and estimate the parameters of each one.

To make things a bit more manageable code-wise though I’m going to introduce a struct that contains the parameters and allows us to simulate and estimate in a more contained manner.

struct OUProcess
    theta
    mu 
    sigma
    dt
    maxT
    initial
end

We now write specific functions for this object and this allows us to simplify the code slightly.

function simulate(ou::OUProcess)
    simulate_os(ou.theta, ou.mu, ou.sigma, ou.dt, ou.maxT, ou.initial)
end

function estimate(ou::OUProcess)
   p = simulate(ou)
   phi =  sum(p[2:end] .* p[1:(end-1)]) / sum(p[1:(end-1)] .^2)
   -log(phi)/ou.dt
end

function estimate(ou::OUProcess, N)
    res = zeros(N)
    for i in 1:N
        # estimate(ou) draws its own fresh path, so no separate simulate call is needed
        res[i] = estimate(ou)
    end
    res
end

We use these new functions to draw from the process 1,000 times and sample the parameters for each one, collecting the results as an array.

ou = OUProcess(0.5, 0.0, vol, dt, maxT, initial)
revPlot = histogram(estimate(ou, 1000), label = :none, title = "Reversion")
vline!(revPlot, [0.5], label = :none);

And the same for the momentum OU process

ou = OUProcess(-0.5, 0.0, vol, dt, maxT, initial)
momPlot = histogram(estimate(ou, 1000), label = :none, title = "Momentum")
vline!(momPlot, [-0.5], label = :none);

Plotting the distribution of the results gives us a decent understanding of how varied the samples can be.

plot(revPlot, momPlot, layout = (2,1), link=:all)

We can see the heavy-tailed nature of the estimation process, but thankfully the histograms are centred around the correct number. This goes to show how difficult it is to estimate the mean reversion parameter even in this simple setup. So for a real dataset, you need to work out how to collect more samples or radically adjust how accurate you think your estimate is.

Summary

We have progressed from simulating an Ornstein-Uhlenbeck process to estimating its parameters using various methods. We attempted to enhance the accuracy of the estimates through bootstrapping, but we discovered that the best approach to improve the estimation is to have multiple samples.

So if you are trying to fit this type of process on some real world data, be it the spread between two stocks (Statistical Arbitrage in the U.S. Equities Market), client flow (Unwinding Stochastic Order Flow: When to Warehouse Trades) or anything else you believe might be mean reverting, then understand how much data you might need to accurately model the process.

Cross Asset Skew - A Trading Strategy

2024-02-08T00:00:00+00:00

I recently listened to S7E3 of Flirting with Models which had Nick Baltas talking about Multi Asset and Multi-Strategy portfolios. Nick highlighted his work on cross-asset skew and how it can complement your typical equity factors (momentum, growth, value etc.) and is an under-explored topic in portfolio construction. After reading the original paper, Cross-Asset Skew, I decided to try and replicate the results and see whether skew comes out in the wash and produces any alpha.

In this post, I’ll go through what skew is, how it can be used as a trading strategy, and backtest the portfolio across different asset classes. We will then see if it produces any alpha ($\alpha$) or if skew is just market beta ($\beta$). I’ll then take a deeper dive into the equity performance and how it compares to the typical factors.

I’ll be working through everything in Julia (1.9) and pulling daily data from AlpacaMarkets.

using AlpacaMarkets, Dates,CSV, DataFrames, DataFramesMeta, RollingFunctions
using Plots, StatsBase
using Distributions

function parse_date(t)
   Date(string(split(t, "T")[1]))
end

function clean(df, x) 
    df = @transform(df, :Date = parse_date.(:t), 
        :Ticker = x, :NextOpen = [:o[2:end]; NaN], :LogReturn = [NaN; diff(log.(:c))])
   @select(df, :Date, :Ticker, :c, :o, :NextOpen, :LogReturn)
end

function load(etf)
   df = AlpacaMarkets.stock_bars(etf, "1Day"; startTime = now() - Year(10), limit = 10000, adjustment = "all")[1]
   clean(df, etf)
end

What is Skew?

Skew (or skewness) measures how asymmetric a distribution is around its mean value. A distribution with a longer tail to the right of the mean is positively skewed, and one with a longer tail to the left is negatively skewed.

We can demonstrate this by generating some random values from a skewed distribution (lognormal) and unskewed (normal).
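
A minimal sketch of that demonstration, using only the standard library rather than Distributions. The helper `sample_skew`, the seed and the sample size are assumptions made for illustration; histograms of the three vectors would show the left, symmetric and right tilt.

```julia
using Random, Statistics

# Sample skewness: the average cubed z-score (population standard deviation).
# `sample_skew` is a hypothetical helper, not part of any package.
function sample_skew(x)
    m = mean(x)
    s = std(x; corrected = false)
    return mean(((x .- m) ./ s) .^ 3)
end

rng = MersenneTwister(42)          # arbitrary seed, only for reproducibility
z = randn(rng, 100_000)

unskewed = 1 .+ z                  # draws from Normal(1, 1)
rightSkewed = exp.(0.5 .* z)       # draws from LogNormal(0, 0.5)
leftSkewed = -rightSkewed          # flipping the sign mirrors the distribution

skews = sample_skew.([leftSkewed, unskewed, rightSkewed])
```

The three entries of `skews` should come out negative, close to zero and positive respectively.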

Which shows the general tilt in the x-axis across the 3 different distributions.

Skew is weird in the sense that there isn’t a single way to calculate how skewed a distribution is. For our defined distributions above we can calculate the analytical values of skew and see that it is zero for the middle graph and positive (as expected) for the right-hand graph. Given that we flip the sign of the left-hand graph, that has the negative skew.

skewness.([Normal(1,1), LogNormal(0, 0.5)])

2-element Vector{Float64}:
 0.0
 1.7501896550697178

In the paper, the skew of an asset is calculated as

\[S = \frac{1}{N} \sum _{i=1} ^N \frac{(r_i - \mu ) ^3}{\sigma ^3},\]

where $\mu$ is the average and $\sigma ^2$ is the variance of the returns of an asset with a lookback window of $N$. We can look at the skewness of the SPY ETF over a 256-day rolling window using the RollingFunctions package.

spy = load("SPY")
spy = @transform(spy, :Avg = runmean(:LogReturn, 256), :Dev = runstd(:LogReturn, 256))
spy = @transform(spy, :SkewDay = ((:LogReturn .- :Avg) ./ :Dev) .^3)

spy = @transform(spy, :Skew = runmean(:SkewDay, 256))
spy = @subset(spy, .!isnan.(:Skew))
plot(spy.Date, spy.Skew, label = "SPY Skew", dpi=900, size=(800, 200))
hline!([0], color="black", label = :none)

It’s jumpy, but the jumps make sense as it’s a cubed calculation, so large values will be amplified. SPY became very negatively skewed over COVID-19 as the market corrections led to large down days. More recently it has become positively skewed as we’ve seen some larger positive returns.

Skew as a Trading Strategy

The paper believes that skew can predict future returns and that we want to be long assets with a negative skew and short assets with a positive skew. This gives it a ‘mean reversion’ explanation for future returns, so over COVID-19 when there were lots of down days, we should be buying because the movement is likely to be overblown and the market will correct higher. Likewise, large jumps up mean that it’s a positive move that is overblown and will come back down. So again, looking at the skew of SPY in recent weeks, the skew is positive therefore we would be inclined to short this ETF.

The overall strategy looks at cross-sectional skew, so how skewed an asset is relative to its peers rather than the raw skew number on a given day. The paper looks at equity indexes across countries, bond futures across different countries, different currencies, and commodities. In our replication, we are going to be using different ETFs that look at similar themes and should capture the broad cross-section of finance.

The ETF Trading Universe

The original paper uses futures data from 1990 up to 2017 to run the backtest. I will instead be using different ETFs and a much shorter timescale, just because that’s all the data I have available from my AlpacaMarkets free account using AlpacaMarkets.jl.

Blackrock is nice enough to publish this document for their different equity funds across the globe, Around the World with iShares Country ETFs, which I use to get the different country equity performance plus some broader indexes.

For the fixed income part I just try and take a cross-section of the different types of fixed income instruments available and different durations, mixing long-term, short-term, government, corporates, etc.

Commodities, again, are just an attempt to get a broad mix, and the Other class is mainly real estate and whatever other cruft comes up on the ETF database website. Finally, the currency ETFs each represent a different currency, so cover that part of the paper.

universe = [("Equity", ["SPY", "EWU", "EWJ", "INDA", "EWG", "EWL", "EWP", "EWQ", 
                        "VTI", "FXI", "EWZ", "EWY", "EWA", "EWC",
                        "EWH", "EWI", "EWN", "EWD", "EWT", "EZA", "EWW", "ENOR", "EDEN", "TUR"]),
            ("FI", ["AGG", "TLT", "LQD", "JNK", "MUB", "MBB", "IAGG", "IGOV", "EMB", "BND", "BNDX", "VCIT", "VCSH", "BSV", "SRLN"]),
            ("Commodities", ["GLD", "SLV", "GSG", "USO", "PPLT", "UNG", "DBA"]),
            ("Other", ["IYR", "REET", "USRT", "ICF", "VNQ"]),
            ("Ccy", ["UUP", "FXY", "FXE", "FXF", "FXB", "FXA", "FXC"])
           ]

We iterate through all the asset classes and pull as much daily data as possible.

allDataRaw = Array{DataFrame}(undef, length(universe))

for (j, (assetClass, etfs)) in enumerate(universe)
    println(assetClass)
    resdf = Array{DataFrame}(undef, length(etfs))
    for (i, etf) in enumerate(etfs)
        #println(etf)
        df = load(etf)
        resdf[i] = df
    end
    resdfC = vcat(resdf...)
    resdfC.AssetClass .= assetClass
    allDataRaw[j] = resdfC
end

allData = vcat(allDataRaw...);

We then add in the averages $\mu$, standard deviation $\sigma$, and calculate the skew value for that day before taking the rolling average to arrive at the overall skew measure. We need to group by each ETF (the Ticker column).

allData = groupby(allData, :Ticker)

allData = @transform(allData, :Avg = runmean(:LogReturn, 256), :Dev = runstd(:LogReturn, 256))
allData = @transform(allData, :SkewDay = ((:LogReturn .- :Avg) ./ :Dev) .^3)
allData = @transform(allData, :Skew = runmean(:SkewDay, 256))
allData = @subset(allData, .!isnan.(:Skew));

To check we’ve pulled the right data we plot the cumulative log returns.

plot(allData[allData.Ticker .== "SPY", :].Date, cumsum(allData[allData.Ticker .== "SPY", :].LogReturn), label = "SPY", 
      title="Returns", dpi=900, size=(800, 200))
plot!(allData[allData.Ticker .== "GLD", :].Date, cumsum(allData[allData.Ticker .== "GLD", :].LogReturn), label = "GLD")
plot!(allData[allData.Ticker .== "AGG", :].Date, cumsum(allData[allData.Ticker .== "AGG", :].LogReturn), label = "AGG")

Everything looks as we would expect. We can now look at the skew for these three assets.

The skews move differently and with different magnitudes. Notably, GLD has the least variable skew, but equity and bonds have a similar pattern. The paper looks at the skew of the asset on the last day of the month and uses that to rebalance the portfolio, so with a groupby and last we can pull the skew value on the last day of the month.

Building the Backtest

We need to avoid the look-ahead bias in the backtest. The portfolio weight is calculated using the last day of the month, so we observe the closing price and use that to calculate the return and update the parameters - average return, volatility, and finally the skew. This skew then goes into the weighting calculation but it is only active on the next working day, otherwise, we are getting a ‘free’ day of return.

So on the 31st of Jan, we update the weights and then do the rebalance on the 1st of Feb (assuming that’s a working day). There is also the additional cost of trading into the position. At the moment we are assuming we can trade at the previous closing price, but that is a problem to solve for another day.

allData = @transform(allData, :Month = floor.(:Date, Month(1)), :Week = floor.(:Date, Week(1)));
allData = @transform(groupby(allData, :Ticker), :NextDay = [:Date[2:end]; Date(2015)])
monthlyVals = @combine(groupby(allData, [:Month, :AssetClass, :Ticker]), 
                       :Date = last(:Date), :NextDate = last(:NextDay), 
                        :EOMSkew = last(:Skew));

We rank each asset in its respective asset class using the negative of the skew value, so the most positive skew gets the lowest rank and the most negative skew gets the highest rank. We also centre the ranks by subtracting the midpoint, $(N+1)/2$, where $N$ is the number of assets in the group, so that half of the ranks are negative and half are positive.

To come up with the portfolio weight, we want all the long positions (positive ranks) to have a total weighting of 1 and short positions (negative ranks) to have a total weighting of -1. This corresponds to being long 1 dollar and short 1 dollar so self-financed overall.

monthlyVals = groupby(monthlyVals, [:Date, :AssetClass])
monthlyVals = @transform(monthlyVals, :SkewWeightRaw = ordinalrank(-1*:EOMSkew) .- ((length(:EOMSkew) + 1) /2))
monthlyVals = groupby(monthlyVals, [:Date, :AssetClass])
# divide by the sum of the positive ranks so the long leg sums to 1 and the short leg to -1
monthlyVals = @transform(monthlyVals, :SkewWeight = :SkewWeightRaw ./ sum(max.(:SkewWeightRaw, 0)))
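
The ranking and weighting for a single asset class on a single date can be sketched without any DataFrame machinery. This is a minimal illustration, assuming a plain vector of skew values; `skew_weights` is a hypothetical helper, and it normalises by the sum of the positive centred ranks, which gives the same result as `sum(1:maximum(...))` for odd-sized groups and also keeps the long leg at exactly 1 for even-sized groups.

```julia
# Rank-based long/short weights within one asset class.
# `skew_weights` is a hypothetical helper, not from the paper or any package.
function skew_weights(skews)
    n = length(skews)
    ranks = invperm(sortperm(-skews))   # ordinal rank, 1 = most positive skew
    raw = ranks .- (n + 1) / 2          # centre the ranks on zero
    return raw ./ sum(max.(raw, 0))     # long leg sums to +1, short leg to -1
end

# Commodity skews from the table below: GLD, SLV, DBA, PPLT, GSG, UNG, USO
commoditySkews = [0.23, 0.02, -0.04, -0.07, -0.12, -0.16, -0.19]
w = skew_weights(commoditySkews)
```

Here `w` runs from -0.5 for GLD (most positive skew) up to 0.5 for USO (most negative skew), with the middle-ranked PPLT at zero.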

For example, we can look at the commodity ETFs, their latest skew values, and how those change the portfolio weights.

Date	Asset Class	Ticker	EOM Skew	SkewWeightRaw	Skew Weight
2024-02-07	Commodities	GLD	0.23	-3	-0.5
2024-02-07	Commodities	SLV	0.02	-2	-0.333
2024-02-07	Commodities	DBA	-0.04	-1	-0.167
2024-02-07	Commodities	PPLT	-0.07	0	0
2024-02-07	Commodities	GSG	-0.12	1	0.167
2024-02-07	Commodities	UNG	-0.16	2	0.333
2024-02-07	Commodities	USO	-0.19	3	0.5

The most negatively skewed ETF, USO, gets the highest positive weight and vice versa. We can also look at the weights over the period for the three example assets.

The portfolio weights for both SPY and AGG show that the last two months have been short SPY and no position in AGG. GLD has been allocated in the opposite direction to the other two, right now we are short GLD.

We join the weights to the original dataframe and forward fill the weightings to look at the daily performance. I pulled a forward fill function from https://hongtaoh.com/en/2021/06/27/julia-ffill/ and joining the portfolio weights to the daily returns allows us to understand the daily changes in the portfolios.

ffill(v) = v[accumulate(max, [i*!ismissing(v[i]) for i in 1:length(v)], init=1)]

weightings = @select(monthlyVals, :NextDate, :Ticker, :SkewWeight)
rename!(weightings,:NextDate => :Date)

allDataWeights = leftjoin(allData, weightings, on=[:Date, :Ticker]);
allDataWeights = sort(allDataWeights, :Date)
allDataWeights = @transform(groupby(allDataWeights, :Ticker), :SkewWeight2 = ffill(:SkewWeight));
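
To make the forward fill less opaque, here is the same one-liner applied to a toy vector. The weights are made up for illustration. The comprehension records the index of every non-missing entry (and 0 otherwise), and the running maximum then points each position at the most recent non-missing index.

```julia
# Same one-liner as above: carry the last non-missing value forward.
ffill(v) = v[accumulate(max, [i * !ismissing(v[i]) for i in 1:length(v)], init = 1)]

weights = [0.5, missing, missing, -0.25, missing]  # made-up monthly weights on a daily grid
filled = ffill(weights)
```

Note that `init = 1` means a vector that starts with `missing` keeps that leading `missing`, since there is no earlier value to carry forward.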

Plotting the resulting portfolios gives us an idea of their performance.

assetPortfolios = dropmissing(@combine(groupby(allDataWeights, [:Date, :AssetClass]), 
                           :PortfolioReturn = sum(:SkewWeight2 .* :LogReturn),
                           :MktReturn = mean(:LogReturn)))

p = plot(title = "Skew Portfolios")
for ac in unique(assetPortfolios.AssetClass)
    plot!(p, assetPortfolios[assetPortfolios.AssetClass .== ac, :].Date, 
             cumsum(assetPortfolios[assetPortfolios.AssetClass .== ac, :].PortfolioReturn), label =ac) 
end
hline!([0], color = "black", label = :none)
p

These are the results for each asset class. Interestingly, all of them (except Other) have a positive return as of February and most have never fallen below their starting returns. Commodities are very volatile and swung back and forth quite dramatically, whereas equities have been one-way traffic in the right direction!

We also want to combine all the asset classes to produce a single portfolio but first have to normalise the returns by the volatility so that they are equally weighted on a risk basis.

assetPortfolios = @transform(groupby(assetPortfolios, :AssetClass), :Vol = sqrt.(runvar(:PortfolioReturn, 256)))
assetPortfolios = @transform(groupby(assetPortfolios, :AssetClass), 
                             :NormReturn = 0.1*:PortfolioReturn ./ :Vol,
                             :NormMarketReturn = 0.1*:MktReturn ./ :Vol)
gcf = @combine(groupby(assetPortfolios, :Date), :Return = mean(:NormReturn), :MktReturn = mean(:NormMarketReturn));

plot(gcf.Date[2:end], cumsum(gcf.Return[2:end]), label = "Global Skew Factor", title = "Global Portfolio")
plot!(gcf.Date[2:end], cumsum(gcf.MktReturn[2:end]), label = "Global Market Return")
hline!([0], color = "black", label = :none)

Again, a positive result, well at least recently. This indicates that skew has some associated premium. Now we want to see if this is alpha or beta.

Alpha, Beta or Something Else?

It’s great that these portfolios both at an asset level and global level have ended up in the green but we want to compare the performance to the general market and see if it’s riding the market or adding something new.

This is simple enough to compare, we can look at the equal-weighted return of all the assets in the group and see how that ended up.

Again, all of the skew portfolios have outperformed the market portfolio (except the Other asset class), so this is a good indication that this skew strategy is adding something new.

A more systematic approach is to regress the portfolio return against the market return and this will give us a measure of the $\alpha$ and $\beta$ of the strategy.

\[\text{Skew Return} = \alpha + \beta \cdot \text{Market Return}\]

using GLM

for ac in unique(assetPortfolios.AssetClass)
    ols = lm(@formula(PortfolioReturn ~ MktReturn), assetPortfolios[assetPortfolios.AssetClass .== ac, :])
    println(ac)
    println(coeftable(ols))
    println(r2(ols))
end

Asset Class	$\alpha$	$p$ value	$\beta$	$p$ value	$R^2$
Equity	0.0003	0.0544	-0.01	0.4465	0.0003
FI	0.0001	0.1796	-0.05	0.0728	0.002
Commodities	0.0004	0.4799	0.113	0.0232	0.003
Other	-0.00004	0.5845	0.007	0.1690	0.001
Ccy	0.0001	0.3622	0.498	<1e-27	0.08

The first thing to note is the low $R^2$’s across the board, which is to be expected in these types of models. Generally, the $\alpha$’s are all statistically insignificant with only the equity portfolio getting close to significance which indicates that the skew factor isn’t providing ‘new returns’. Interestingly though, only commodities and currencies have a statistically significant $\beta$ which means for other asset classes the modelling is essentially noise. So whilst the lack of $\alpha$ is a problem, the lack of $\beta$ sort of makes up for it. Essentially I think this is a promising sign that there is perhaps something more to be done.

A Deeper Dive With More Equity Factors

An equity fund manager who wants to allocate to skew also needs to verify that skew is providing something unique and not a repackaging of momentum/value/growth/carry factors. This is easy enough as there are ETFs that represent these factors, so we just include them in the regression.

mtum = load("MTUM") #momentum
vtv = load("VTV") #value
vug = load("VUG") #growth
cry = load("VIG") #carry
equityFactors = vcat([mtum, vtv, vug, cry]...);

Joining these with the equity data gives us a bigger dataset to construct the OLS regression.

equity = assetPortfolios[assetPortfolios.AssetClass .== "Equity", :]

equity = leftjoin(equity, 
         unstack(@select(equityFactors, :Date, :Ticker, :LogReturn), :Date, :Ticker, :LogReturn),
         on = "Date")

coeftable(lm(@formula(PortfolioReturn ~ MktReturn + MTUM + VTV + VUG + VIG), 
equity))

	Coef.	Std. Error	t	Pr(> $\mid t \mid$)	Lower 95%	Upper 95%
(Intercept)	0.000280318	0.000180867	1.55	0.1214	-7.44597e-5	0.000635095
MktReturn	-0.300453	0.0312806	-9.61	<1e-20	-0.361811	-0.239094
MTUM	-0.0881885	0.0305466	-2.89	0.0039	-0.148107	-0.0282701
VTV	0.450562	0.0614928	7.33	<1e-12	0.329942	0.571183
VUG	0.109752	0.0358138	3.06	0.0022	0.0395015	0.180002
VIG	-0.140079	0.0739041	-1.90	0.0582	-0.285045	0.00488637

Again, no $\alpha$, significant market $\beta$, and significant momentum, value, and growth coefficients but no significance with carry. This isn’t great for the Skew factor as this regression suggests we can replicate it using the other factors, namely, it’s anti-correlated to the market and momentum and correlated with value and growth. Given it’s a mean-reversion-esque strategy this makes sense as value is generally about finding underpriced assets.

Conclusion

This has been a successful replication of the original paper, which used ETFs of different asset sectors to explore skew. We now understand that skew is a measure of how left or right-tailed a distribution is, and how it can be exploited in a trading strategy. By calculating skew across different assets and ranking the skew in asset class groups, we allocate long positions to the most negatively skewed assets and short positions to positively skewed assets. This portfolio has produced a positive return in equities, fixed income, currencies, and commodities (but not Other), and has outperformed the market portfolio. A global skew portfolio was also constructed by scaling each asset class to 10% volatility and combining the returns, which also outperformed the market.

The Other asset class was the only sector where skew didn’t work, so it would be hurting the overall skew portfolio. Going forward we would know to restrict the universe to equity, fixed income, currencies and commodities.

However, when we regressed the portfolio return onto the market returns, we found no statistically significant alphas, and significant betas only for commodities and currencies. The equity portfolio was close to having a significant alpha, but given it had the largest number of underlying assets, it could be a function of asset size.

We have neglected the trading costs and potential capacity of the overall strategy, but given its low turnover (weights only updating every month), this is probably safe to ignore until you hit the super asset manager size.

Although the results are not as conclusive as the original paper, they are on a shorter timescale and smaller universe, and do not contradict the original findings. We have shown that skew is out there and can provide a source of returns.

Going forward, refining the calculation of the skew and tuning the lookback windows might improve the results. Also, expanding the universe into more specific funds could provide better insights. At the moment, the fixed income component is too broad to pick up on the skew changes.

Exploring Causal Regularisation

2023-12-28T00:00:00+00:00

A good prediction model isn’t necessarily a good causal model. You could be missing a key variable in your dataset that is driving the underlying behavior so you end up with a good predictive model but not the correct explanation as to why things behave that way. Taking a causal approach is a tougher problem and needs an understanding of whether we have access to the right variables or we are making the right link between variables and an outcome. Causal regularisation is a method that uses machine learning techniques (regularisation!) to try and produce models that can be interpreted causally.

Regularisation is normally taught as a method to reduce overfitting: you have a big model and you make it smaller by shrinking some of the factors. Work by Janzing (papers below) argues that this can help produce better causal models too and in this blog post I will work through two papers to try and understand the process better.

I’ll work off two main papers for causal regularisation:

In truth, I am working backward. I first encountered causal regularisation in Better AB testing via Causal Regularisation where it uses causal regularisation to produce better estimates by combining a biased and an unbiased dataset. I want to take a step back and understand causal regularisation from the original papers. Using free data from the UCI Machine Learning Repository we can attempt to replicate the methods from the papers and see how causal regularisation works to produce better causal models.

As ever, I’m in Julia (1.9), so fire up that notebook and follow along.

using CSV, DataFrames, DataFramesMeta
using Plots
using GLM, Statistics

Wine Tasting Data

The wine-quality dataset from the UCI repository provides measurements of the chemical properties of wine and a quality rating from someone drinking the wine. It’s a simple CSV file that you can download (winequality) and load with minimal data wrangling needed.

We will be working with the red wine data set as that’s what both Janzing papers use.

rawData = CSV.read("wine+quality/winequality-red.csv", DataFrame)
first(rawData)

APD! Always Plotting the Data to make sure the values are something you expect. Sometimes you need a visual confirmation that things line up with what you believe.

plot(scatter(rawData.alcohol, rawData.quality, title = "Alcohol", label = :none, color="#eac435"),
     scatter(rawData.pH, rawData.quality, title = "pH", label = :none, color="#345995"),
     scatter(rawData.sulphates, rawData.quality, title= "Sulphates", label = :none, color="#E40066"),
     scatter(rawData.density, rawData.quality, title = "Density", label = :none, color="#03CEA4"), ylabel = "Quality")

By choosing four of the variables randomly we can see that some are correlated with quality and some are not.

A loose goal is to come up with a causal model that can explain the quality of the wine using the provided factors. We will change the data slightly to highlight how causal regularisation helps, but for now, let’s start with the simple OLS model.

In the paper they normalise the variables to be unit variance, so we divide by the standard deviation. We then model the quality of the wine using all the available variables.

vars = names(rawData, Not(:quality))

cleanData = deepcopy(rawData)

for var in filter(!isequal("White"), vars)
    cleanData[!, var] = cleanData[!, var] ./ std(cleanData[!, var])
end

cleanData[!, :quality] .= Float64.(cleanData[!, :quality])

ols = lm(term(:quality) ~ sum(term.(Symbol.(vars))), cleanData)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

quality ~ 1 + fixed acidity + volatile acidity + citric acid + residual sugar + chlorides + free sulfur dioxide + total sulfur dioxide + density + pH + sulphates + alcohol

Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
                           Coef.  Std. Error      t  Pr(>|t|)     Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept)           21.9652     21.1946      1.04    0.3002  -19.6071      63.5375
fixed acidity          0.043511    0.0451788   0.96    0.3357   -0.0451055    0.132127
volatile acidity      -0.194027    0.0216844  -8.95    <1e-18   -0.23656     -0.151494
citric acid           -0.0355637   0.0286701  -1.24    0.2150   -0.0917989    0.0206716
residual sugar         0.0230259   0.0211519   1.09    0.2765   -0.0184626    0.0645145
chlorides             -0.088211    0.0197337  -4.47    <1e-05   -0.126918    -0.0495041
free sulfur dioxide    0.0456202   0.0227121   2.01    0.0447    0.00107145   0.090169
total sulfur dioxide  -0.107389    0.0239718  -4.48    <1e-05   -0.154409    -0.0603698
density               -0.0337477   0.0408289  -0.83    0.4086   -0.113832     0.0463365
pH                    -0.0638624   0.02958    -2.16    0.0310   -0.121883    -0.00584239
sulphates              0.155325    0.019381    8.01    <1e-14    0.11731      0.19334
alcohol                0.294335    0.0282227  10.43    <1e-23    0.238977     0.349693
────────────────────────────────────────────────────────────────────────────────────────

The dominant factor is the alcohol amount, which is the strongest variable in predicting the quality, i.e. higher quality wines tend to have a higher alcohol content. We also note that 5 out of the 12 coefficients (counting the intercept) are deemed insignificant at the 5% level. We save these parameters and then look at the regression without the alcohol variable.

olsParams = DataFrame(Dict(zip(vars, coef(ols)[2:end])))
olsParams[!, :Model] .= "OLS"
olsParams

1×12 DataFrame

Row	alcohol	chlorides	citric acid	density	fixed acidity	free sulfur dioxide	pH	residual sugar	sulphates	total sulfur dioxide	volatile acidity	Model
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	String
1	0.294335	-0.088211	-0.0355637	-0.0337477	0.043511	0.0456202	-0.0638624	0.0230259	0.155325	-0.107389	-0.194027	OLS

cleanDataConfounded = select(cleanData, Not(:alcohol))
vars = names(cleanDataConfounded, Not(:quality))

confoundOLS = lm(term(:quality) ~ sum(term.(Symbol.(vars))), cleanDataConfounded)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

quality ~ 1 + fixed acidity + volatile acidity + citric acid + residual sugar + chlorides + free sulfur dioxide + total sulfur dioxide + density + pH + sulphates

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────────
                             Coef.  Std. Error       t  Pr(>|t|)     Lower 95%    Upper 95%
───────────────────────────────────────────────────────────────────────────────────────────
(Intercept)           189.679       14.2665      13.30    <1e-37  161.696       217.662
fixed acidity           0.299551     0.0391918    7.64    <1e-13    0.222678      0.376424
volatile acidity       -0.176182     0.0223382   -7.89    <1e-14   -0.219997     -0.132366
citric acid             0.00912711   0.0292941    0.31    0.7554   -0.0483321     0.0665863
residual sugar          0.133781     0.0189031    7.08    <1e-11    0.0967031     0.170858
chlorides              -0.107215     0.0203052   -5.28    <1e-06   -0.147043     -0.0673877
free sulfur dioxide     0.0394281    0.023462     1.68    0.0931   -0.00659172    0.0854479
total sulfur dioxide   -0.128248     0.0246854   -5.20    <1e-06   -0.176668     -0.0798287
density                -0.355576     0.0276265  -12.87    <1e-35   -0.409765     -0.301388
pH                      0.0965662    0.0261087    3.70    0.0002    0.0453551     0.147777
sulphates               0.213697     0.0191745   11.14    <1e-27    0.176087      0.251307
───────────────────────────────────────────────────────────────────────────────────────────

citric acid and free sulfur dioxide are now the only insignificant variables; the rest all appear to contribute to the quality. This means we are experiencing confounding: alcohol is the better explainer, but with alcohol removed its effect is now hiding behind these other variables.

Confounding - When a variable influences other variables and the outcome at the same time leading to an incorrect view on the correlation between the variables and outcomes.
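
As a toy illustration of this definition, here is a minimal simulated sketch (not the wine data). The variable z is a made-up confounder playing the role of alcohol, and all the coefficients are assumptions chosen for the example. Dropping z from the regression pushes its effect onto x.

```julia
using Random, LinearAlgebra

rng = MersenneTwister(42)
n = 10_000

# Hypothetical data: z is the hidden confounder, x is partly driven by z.
z = randn(rng, n)
x = z .+ randn(rng, n)
y = 1.0 .* x .+ 2.0 .* z .+ randn(rng, n)   # the true effect of x on y is 1

# Least squares with and without the confounder (no intercept, all means are zero).
fullCoef = hcat(x, z) \ y          # close to [1, 2]
confCoef = reshape(x, :, 1) \ y    # close to 2, x has absorbed part of the effect of z
```

With z included the coefficient on x is close to its true value of 1. Without z it roughly doubles, which is the same kind of shift the wine coefficients show once alcohol is removed.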

This regression after dropping the alcohol variable is incorrect and provides the wrong causal conclusion. So can we do better and get closer to the true regression coefficients using some regularisation methods?

For now, we save these incorrect parameters and explore the causal regularisation methods.

olsParamsConf = DataFrame(Dict(zip(vars, coef(confoundOLS)[2:end])))
olsParamsConf[!, :Model] .= "OLS No Alcohol"
olsParamsConf[!, :alcohol] .= NaN

olsParamsConf

1×12 DataFrame

Row	chlorides	citric acid	density	fixed acidity	free sulfur dioxide	pH	residual sugar	sulphates	total sulfur dioxide	volatile acidity	Model	alcohol
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	String	Float64
1	-0.107215	0.00912711	-0.355576	0.299551	0.0394281	0.0965662	0.133781	0.213697	-0.128248	-0.176182	OLS No Alcohol	NaN

Regularisation and Regression

Some maths. Regression is taking our variables $X$ and finding the parameters $a$ that get us closest to $Y$.

\[Y = a X\]

$X$ is a matrix, and $a$ is a vector. When we fit this to some data, the values of $a$ are free to converge to any value they want, so long as it gets close to the outcome variable. This means we are minimising the difference between $Y$ and $a X$

\[||(Y - a X)|| ^2.\]

Regularisation is the act of restricting the values $a$ can take.

For example, we can bound the sum of the absolute values of the $a$’s by a constant ($L_1$ regularisation), or bound the sum of the squares of the $a$ values by a constant ($L_2$ regularisation). In simpler terms, if we want to increase the coefficient of one parameter, we need to reduce the coefficient of a different term. Think of there being a finite amount of mass that we can allocate to the parameters: they can’t take on whatever value they like, but instead need to regulate amongst themselves. This helps reduce overfitting as it constrains how much influence any one parameter can have.

In ridge regression we penalise the $L_2$ norm, so we restrict the sum of the squares of the $a$’s while at the same time minimising the original OLS objective.

\[||(Y - a X)|| ^2 + \lambda || a || ^2.\]

So we can see how regularisation is an additional component of OLS regression. $\lambda$ is a hyperparameter that is just a number and controls how much restriction we place on the $a$ values.
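
For intuition, the ridge objective has a closed-form solution, $a = (X^T X + \lambda I)^{-1} X^T Y$, when the rows of $X$ are the observations. The snippet below is a minimal sketch of that textbook formula on synthetic data, not what MLJ does internally. The ridge_coefs helper and the simulated X and y are made up for illustration, there is no intercept, and the penalty is not scaled by the sample size.

```julia
using Random, LinearAlgebra

# Textbook ridge solution: a = (X'X + λI)^(-1) X'y (no intercept, no sample scaling).
ridge_coefs(X, y, lambda) = (X' * X + lambda * I) \ (X' * y)

rng = MersenneTwister(1)
X = randn(rng, 200, 3)                               # synthetic predictors
y = X * [1.0, -2.0, 0.5] .+ 0.1 .* randn(rng, 200)   # synthetic outcome

lambdas = [0.0, 1.0, 10.0, 100.0]

# Squared length of the fitted parameters for each penalty.
lengths = [sum(abs2, ridge_coefs(X, y, l)) for l in lambdas]
```

With lambda at zero this is plain least squares, and the squared length of the parameters shrinks as lambda grows, which is the restriction described above.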

To do ridge regression in Julia I’ll be leaning on the MLJ.jl framework and using that to build out the learning machines.

using MLJ

@load RidgeRegressor pkg=MLJLinearModels

We will take the confounded dataset (so the data where the alcohol column is deleted), partition it into train and test sets, and get started with some regularisation.

y, X = unpack(cleanDataConfounded, ==(:quality); rng=123);

train, test = partition(eachindex(y), 0.7, shuffle=true)

mdl = MLJLinearModels.RidgeRegressor()

RidgeRegressor(
  lambda = 1.0, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

We can see the hyperparameter lambda is initialised to 1.

Basic Ridge Regression

We want to know the optimal $\lambda$ value so will use cross-validation to train the model on one set of data and verify on a hold-out set before repeating. This is all simple in MLJ.jl, we define a grid of penalisations between 0 and 1 and fit the regression using cross-validation across the different lambdas. We are optimising for the best $R^2$ value.

lambda_range = range(mdl, :lambda, lower = 0, upper = 1)

lmTuneModel = TunedModel(model=mdl,
                          resampling = CV(nfolds=6, shuffle=true),
                          tuning = Grid(resolution=200),
                          range = [lambda_range],
                          measures=[rsq]);

lmTunedMachine = machine(lmTuneModel, X, y);

fit!(lmTunedMachine, rows=train, verbosity=0)
report(lmTunedMachine).best_model

RidgeRegressor(
  lambda = 0.020100502512562814, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

The best value of $\lambda$ is 0.0201. When we plot the $R^2$ against the $\lambda$ values there isn’t much of a change, just a minor inflection around the small values.

plot(lmTunedMachine)

Let’s save those parameters. This will be our basic ridge regression result that the other technique builds off.

res = fitted_params(lmTunedMachine).best_fitted_params.coefs

ridgeParams = DataFrame(res)
ridgeParams = hcat(ridgeParams, DataFrame(Model = "Ridge", alcohol=NaN))
ridgeParams

1×12 DataFrame

Row	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	Model	alcohol
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	String	Float64
1	0.190892	-0.157286	0.0410523	0.117846	-0.142458	0.0374597	-0.153419	-0.29919	0.0375852	0.232461	Ridge	NaN

Implementing Causal Regularisation

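The main result from the paper is that we first need to estimate the confounding effect $\beta$ and then choose a penalisation factor $\lambda$ so that the squared length of the ridge parameters $a_\lambda$ satisfies

\[|| a_\lambda || ^ 2 = (1-\beta) || a_{OLS} || ^ 2,\]

where $a_{OLS}$ are the parameters from the confounded OLS regression.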

So the $L_2$ norm of the ridge parameters can only be so much. In the 2nd paper, they estimate $\beta$ to be 0.8. For us, we can use the above grid search, calculate the norm of the parameters, and find which ones satisfy those criteria.

So iterate through the above results of the grid search, and calculate the L2 norm of the parameters.

mdls = report(lmTunedMachine).history

l = zeros(length(mdls))
a = zeros(length(mdls))

for (i, mdl) in enumerate(mdls)
    l[i] = mdl.model.lambda
    a[i] = sum(map( x-> x[2], fitted_params(fit!(machine(mdl.model, X, y))).coefs) .^2)
end

Plotting the results gives us a visual idea of how the penalisation works. Larger values of $\lambda$ mean the model parameters are more and more restricted.

inds = sortperm(l)
l = l[inds]
a = a[inds]

mdlsSorted = report(lmTunedMachine).history[inds]

scatter(l, a, label = :none)
hline!([(1-0.8) * sum(coef(confoundOLS)[2:end] .^ 2)], label = "Target Length", xlabel = "Lambda", ylabel = "a Length")

We search the lengths (sorted by increasing $\lambda$) for the first one that falls below the target length and save those parameters.

targetLength = (1-0.8) * sum(coef(confoundOLS)[2:end] .^ 2)
ind = findfirst(x-> x < targetLength, a)

res = fitted_params(fit!(machine(mdlsSorted[ind].model, X, y))).coefs

finalParams = DataFrame(res)
finalParams = hcat(finalParams, DataFrame(Model = "With Beta", alcohol=NaN))
finalParams

1×12 DataFrame

Row	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	Model	alcohol
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	String	Float64
1	0.0521908	-0.139099	0.0598797	0.0377729	-0.0786037	0.00654776	-0.0856938	-0.124057	0.00682623	0.11735	With Beta	NaN
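
The selection rule used above can also be sketched without MLJ. This is a minimal illustration on synthetic data, assuming the textbook closed-form ridge solution with no intercept. The helpers ridge_coefs and causal_lambda are hypothetical names, and the resulting penalty is not comparable to the MLJ lambda because the penalty here is not scaled by the sample size.

```julia
using Random, LinearAlgebra

# Textbook ridge solution (an assumption for this sketch, not the MLJ solver).
ridge_coefs(X, y, lambda) = (X' * X + lambda * I) \ (X' * y)

# Return the first λ on an ascending grid whose squared parameter length
# falls below (1 - β) times the squared length of the OLS parameters.
function causal_lambda(X, y, beta, lambdas)
    target = (1 - beta) * sum(abs2, X \ y)
    ind = findfirst(l -> sum(abs2, ridge_coefs(X, y, l)) < target, lambdas)
    return ind === nothing ? nothing : lambdas[ind]
end

rng = MersenneTwister(7)
X = randn(rng, 500, 4)                                  # synthetic predictors
y = X * [0.5, -0.3, 0.2, 0.1] .+ randn(rng, 500)        # synthetic outcome

lambdas = range(0, 5000, length = 501)
lambdaStar = causal_lambda(X, y, 0.8, lambdas)          # β = 0.8 as in the text
```

If no penalty on the grid shrinks the parameters enough, the function returns nothing, which is a sign the grid needs a larger upper bound.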

What if we don’t want to calculate the confounding effect?

Now the code to calculate $\beta$ isn’t the easiest or most straightforward to implement (which is why I took their estimate). Instead, we could take the approach from Better AB Testing via Causal Regularisation and use the test set to optimise the penalisation parameter $\lambda$ and then use that value when training the model on the train set.

Applying this method to the wine dataset isn’t a true replication of their paper, as their test and train data sets are instead two data sets, one with bias and one without like you might observe from an AB test. So it’s more of a demonstration of the method rather than a direct comparison to the Janzing method.

Again, MLJ makes this simple, we just fit the machine using the test rows to produce the best-fitting model.

lambda_range = range(mdl, :lambda, lower = 0, upper = 1)

lmTuneModel = TunedModel(model=mdl,
                          resampling = CV(nfolds=6, shuffle=true),
                          tuning = Grid(resolution=200),
                          range = [lambda_range],
                          measures=[rsq]);

lmTunedMachine = machine(lmTuneModel, X, y);

fit!(lmTunedMachine, rows=test, verbosity=0)
plot(lmTunedMachine)

report(lmTunedMachine).best_model

RidgeRegressor(
  lambda = 0.010050251256281407, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

Our best $\lambda$ is 0.01 so we retrain the same machine, this time using the training rows.

res2 = fit!(machine(report(lmTunedMachine).best_model, X, y), rows=train)

Again saving these parameters down leaves us with three methods and three sets of parameters.

finalParams2 = DataFrame(fitted_params(res2).coefs)
finalParams2 = hcat(finalParams2, DataFrame(Model = "No Beta", alcohol=NaN))

allParams = vcat([olsParams, olsParamsConf, ridgeParams, finalParams, finalParams2]...)
allParams

5×12 DataFrame

Row	alcohol	chlorides	citric acid	density	fixed acidity	free sulfur dioxide	pH	residual sugar	sulphates	total sulfur dioxide	volatile acidity	Model
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	String
1	0.294335	-0.088211	-0.0355637	-0.0337477	0.043511	0.0456202	-0.0638624	0.0230259	0.155325	-0.107389	-0.194027	OLS
2	NaN	-0.107215	0.00912711	-0.355576	0.299551	0.0394281	0.0965662	0.133781	0.213697	-0.128248	-0.176182	OLS No Alcohol
3	NaN	-0.142458	0.0410523	-0.29919	0.190892	0.0374597	0.0375852	0.117846	0.232461	-0.153419	-0.157286	Ridge
4	NaN	-0.0786037	0.0598797	-0.124057	0.0521908	0.00654776	0.00682623	0.0377729	0.11735	-0.0856938	-0.139099	With Beta
5	NaN	-0.141766	0.031528	-0.323596	0.222812	0.03869	0.048907	0.127026	0.23961	-0.153488	-0.157603	No Beta

What method has done the best at uncovering the true, unconfounded relationship?

Relative Squared Error

We have our different estimates of the parameters of the model and now want to compare these to the ‘true’ unconfounded parameters to see whether we have recovered the correct values. To do this we calculate the squared difference and normalise by the squared $L_2$ norm of the fitted parameters.

In practice, this just means we are comparing how far the fitted parameters are away from the true (unconfounded) model parameters.

allParamsLong = stack(allParams, Not(:Model))
trueParams = select(@subset(allParamsLong, :Model .== "OLS"), Not(:Model))
rename!(trueParams, ["variable", "truth"])
allParamsLong = leftjoin(allParamsLong, trueParams, on = :variable)
errorRes = @combine(groupby(@subset(allParamsLong, :variable .!= "alcohol"), :Model), 
         :a = sum((:truth .- :value) .^2),
         :a2 = sum(:value .^ 2))
errorRes = @transform(errorRes, :e = :a ./ :a2)
sort(errorRes, :e)

5×4 DataFrame

Row	Model	a	a2	e
	String	Float64	Float64	Float64
1	OLS	0.0	0.0920729	0.0
2	With Beta	0.0291038	0.0698576	0.416616
3	Ridge	0.129761	0.266952	0.486085
4	No Beta	0.157667	0.301286	0.523314
5	OLS No Alcohol	0.213692	0.349675	0.611116
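
The error measure in the table can be written as a small standalone function. The sketch below uses a hypothetical helper name, relative_sq_error, and applies it to the ‘OLS’ and ‘With Beta’ coefficients copied from the parameter table above (alcohol excluded) as a sanity check.

```julia
# Relative squared error as computed above: ||truth - estimate||² / ||estimate||².
relative_sq_error(truth, estimate) = sum(abs2, truth .- estimate) / sum(abs2, estimate)

# Coefficients copied from the "OLS" and "With Beta" rows of the parameter table,
# in the column order chlorides, citric acid, ..., volatile acidity.
olsTruth = [-0.088211, -0.0355637, -0.0337477, 0.043511, 0.0456202,
            -0.0638624, 0.0230259, 0.155325, -0.107389, -0.194027]
withBeta = [-0.0786037, 0.0598797, -0.124057, 0.0521908, 0.00654776,
            0.00682623, 0.0377729, 0.11735, -0.0856938, -0.139099]

e = relative_sq_error(olsTruth, withBeta)   # about 0.4166, matching the table
```

Note that the normalisation uses the fitted parameters rather than the true ones, so heavily shrunk estimates are divided by a smaller number.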

Using the $\beta$ estimation method gives the best model (smallest $e$), which lines up with the paper, and the magnitude of the error is also in line with the paper (they had 0.35 and 0.45 for Lasso/ridge regression respectively). The ridge regression and No Beta methods also improved on the naive OLS approach, so that indicates that there is some improvement from using these methods. The No Beta method is not a faithful reproduction of the Better AB testing paper because it requires the ‘test’ dataset to be an AB test scenario, which we don’t have from the above, so that might explain why the values don’t quite line up.

All methods improve on the naive ‘OLS No Alcohol’ parameters though, which shows this approach to causal regularisation can uncover better models if you have underlying confounding in your data.

Summary

We are always stuck with the data we are given and most of the time can’t collect more to try and uncover more relationships. Causal regularisation gives us a chance to use normal machine learning techniques to build better causal relationships by guiding what the regularisation parameters should be and using that to restrict the overall parameters. When we can estimate the expected confounding value $\beta$ we get the best results, but regular ridge regression and the Webster-Westray method also provide an improvement on just doing a naive regression. So whilst overfitting is the main driver for doing regularisation, it also brings with it some causal benefits and gets you closer to the true relationships between variables.

Another Causal Post

I’ve written about causal analysis techniques before with Double Machine Learning - An Easy Introduction. This is another way of building causal models.

Free Finance Data Sets for the Quants

2023-11-25T00:00:00+00:00

Now and then I am asked how to get started in quant finance and my advice has always been to just get hold of some data and play about with different models. The first step is to get some data and this post takes you through several different sources and hopefully gives you the launchpad to start poking around with financial data.

I’ve tried to cover different assets and frequencies to hopefully inspire the various types of quant finance out there.

High-Frequency FX Market Data

My day-to-day job is in FX so naturally, that’s where I think all the best data can be found. TrueFX provides tick-by-tick data with millisecond timestamps, so high-frequency data is available for free and across lots of different currencies. If you are interested in working out how to deal with large amounts of data efficiently (1 month of EURUSD is 600MB), this source is a good place to start.

As a demo, I’ve downloaded the USDJPY October dataset.

using CSV, DataFrames, DataFramesMeta, Dates, Statistics
using Plots

It’s a big CSV file, so this isn’t the best way to store the data. Instead, stick it into a database like QuestDB that is made for time series data.

usdjpy = CSV.read("USDJPY-2023-10.csv", DataFrame,
                 header = ["Ccy", "Time", "Bid", "Ask"])
usdjpy.Time = DateTime.(usdjpy.Time, dateformat"yyyymmdd HH:MM:SS.sss")
first(usdjpy, 4)

4×4 DataFrame

Row	Ccy	Time	Bid	Ask
	String7	DateTime	Float64	Float64
1	USD/JPY	2023-10-01T21:04:56.931	149.298	149.612
2	USD/JPY	2023-10-01T21:04:56.962	149.298	149.782
3	USD/JPY	2023-10-01T21:04:57.040	149.589	149.782
4	USD/JPY	2023-10-01T21:04:58.201	149.608	149.782

It’s simple data, just a bid and ask price with a time stamp.

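# Despite the name, :Hour rounds each tick to the nearest 10-minute bucket.
usdjpy = @transform(usdjpy, :Spread = :Ask .- :Bid, 
                            :Mid = 0.5*(:Ask .+ :Bid), 
                            :Hour = round.(:Time, Minute(10)))

# Open, close and average spread for each 10-minute bucket.
usdjpyHourly = @combine(groupby(usdjpy, :Hour), :open = first(:Mid), :close = last(:Mid), :avg_spread = mean(:Spread))
# Keep just the time of day so buckets can be compared across days.
usdjpyHourly.Time = Time.(usdjpyHourly.Hour)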

plot(usdjpyHourly.Hour, usdjpyHourly.open, lw =1, label = :none, title = "USDJPY Price Over October")

Looking at the price in 10-minute buckets over the month gives you flat periods over the weekend.

Let’s look at the average spread (ask - bid) throughout the day.

hourlyAvgSpread = sort(@combine(groupby(usdjpyHourly, :Time), :avg_spread = mean(:avg_spread)), :Time)

plot(hourlyAvgSpread.Time, hourlyAvgSpread.avg_spread, lw =2, title = "USDJPY Intraday Spread", label = :none)

We see a big spike at 10 pm because of the day roll, when the secondary markets go offline briefly, which pollutes the data a bit. Looking at just midnight to 8 pm gives a more indicative picture.

plot(hourlyAvgSpread[hourlyAvgSpread.Time .<= Time("20:00:00"), :].Time, 
     hourlyAvgSpread[hourlyAvgSpread.Time .<= Time("20:00:00"), :].avg_spread, label = :none, lw=2,
     title = "USDJPY Intraday Spread")

In October spreads have generally been wider in the later part of the day compared to the morning.

There is much more that can be done with this data across the different currencies though. For example:

How stable are correlations across currencies at different time frequencies?
Can you replicate my microstructure noise post? How does the microstructure noise change between currencies?
Price updates are irregular. What are some of their statistical properties?
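
As a starting point for the last question, here is a minimal sketch of the time between consecutive price updates. It uses the four timestamps from the sample rows printed earlier rather than the full file, and the variable names are just for illustration.

```julia
using Dates, Statistics

# Timestamps copied from the first four rows of the USDJPY sample.
times = DateTime.(["2023-10-01T21:04:56.931", "2023-10-01T21:04:56.962",
                   "2023-10-01T21:04:57.040", "2023-10-01T21:04:58.201"])

# Time between consecutive updates, converted from Millisecond to plain integers.
gapsMs = Dates.value.(diff(times))
avgGap = mean(gapsMs)
```

Even in four rows the gaps range from 31 ms to over a second. On the full dataset the same calculation would be diff(usdjpy.Time), and the distribution of those gaps is a natural first thing to plot.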

Daily Futures Market Data

Let’s zoom out a little bit now, decrease the frequency, and widen the asset pool. Futures cover many asset classes, oil, coal, currencies, metals, agriculture, stocks, bonds, interest rates, and probably something else I’ve missed. This data is daily and roll adjusted, so you have a continuous time series of an asset for many years. This means you can look at the classic momentum/mean reversion portfolio models and have a real stab at long-term trends.

The data is part of the Nasdaq data link product (formerly Quandl) and once you sign up for an account you have access to the free data. This futures dataset is Wiki Continuous Futures and after about 50 clicks and logging in, re-logging in, 2FA codes you can view the pages.

To get the data you can go through one of the API packages in your favourite language. In Julia, this means the QuandlAccess.jl package which keeps things simple.

using QuandlAccess

futuresMeta = CSV.read("continuous.csv", DataFrame)
futuresCodes = futuresMeta[!, "Quandl Code"] .* "1"

quandl = Quandl("QUANDL_KEY")

function get_data(code)
    futuresData = quandl(TimeSeries(code))
    futuresData.Code .= code
    futuresData
end
futureData = get_data.(rand(futuresCodes, 4));

We have an array of all the available contracts futuresCodes and sample 4 of them randomly to see what the data looks like.

p = []
for df in futureData
    # push! stores each plot as one element of the array
    push!(p, plot(df.Date, df.Settle, label = df.Code[1]))
end

plot(p..., layout = 4)

ABY - WTI Brent Bullet - Spread between two oil futures on different exchanges.
TZ6 - Transco Zone 6 Non-N.Y. Natural Gas (Platts IFERC) Basis - Spread between two different natural gas contracts
PG - PG&E Citygate Natural Gas (Platts IFERC) Basis - Again, spread between two different natural gas contracts
FMJP - MSCI Japan Index - Index containing Japanese stocks

I’ve managed to randomly select 3 energy futures and one stock index.

Project ideas with this data:

Cross-asset momentum and mean reversion.
Cross-asset correlations, does the price of oil drive some equity indexes?
Macro regimes, can you pick out commonalities of market factors over the years?
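
To make the first idea concrete, here is a minimal sketch of a time series momentum signal. It runs on a synthetic price series rather than the Quandl data, and the momentum_signal helper, the 20-day lookback and the return parameters are all assumptions for illustration.

```julia
using Random

rng = MersenneTwister(3)
logReturns = 0.0005 .+ 0.01 .* randn(rng, 1000)   # synthetic daily log returns
prices = 100 .* exp.(cumsum(logReturns))

# Momentum signal: +1 if the price is above its level `lookback` days ago, -1 if below.
# The first `lookback` days have no history, so the signal stays at 0.
function momentum_signal(prices, lookback)
    signal = zeros(length(prices))
    for t in (lookback + 1):length(prices)
        signal[t] = sign(prices[t] - prices[t - lookback])
    end
    return signal
end

signal = momentum_signal(prices, 20)

# The signal known at the close of day t is applied to the return from t to t+1,
# so there is no look-ahead.
stratReturns = signal[1:end-1] .* diff(log.(prices))
```

Swapping the sign of the signal gives the simplest mean reversion version, and with the real data df.Settle would take the place of prices.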

Equity Order Book Data

Out there in the wild is the FI2010 dataset which is essentially a sample of the full order book for five different stocks on the Nordic stock exchange for 10 days. You have 10 levels of prices and volumes and so can reconstruct the order book throughout the day. It is the benchmark dataset for limit order book prediction and you will see it referenced in papers that are trying to implement new prediction models. For example Benchmark Dataset for Mid-Price Forecasting of Limit Order Book Data with Machine Learning Methods references some basic methods on the dataset and how they perform when predicting the mid-price.

I found the dataset (as a Python package) here https://github.com/simaki/fi2010 but it’s just stored as a CSV which you can lift easily.

fi2010 = CSV.read(download("https://raw.githubusercontent.com/simaki/fi2010/main/data/data.csv"),DataFrame);

Update on 7/01/2024

Since posting this the above link has gone offline and the user has deleted their Github account! Instead the data set can be found here: https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649/data . I’ve not verified if it’s in the same format, so there might be some additional work going from the raw data to how this blog post sets it up. Thanks to the commenters below for pointing this out.

The data is wide (each column is a depth level of the price and volume) so I turn each into a long data set and add the level, side and variable as a new column.

fi2010Long = stack(fi2010, 4:48, [:Column1, :STOCK, :DAY])
fi2010Long = @transform(fi2010Long, :a = collect.(eachsplit.(:variable, "_")))
fi2010Long = @transform(fi2010Long, :var = first.(:a), :level = last.(:a), :side = map(x->x[2], :a))
fi2010Long = @transform(groupby(fi2010Long, [:STOCK, :DAY]), :Time = collect(1:length(:Column1)))
first(fi2010Long, 4)

The ‘book depth’ is the sum of the liquidity available at all the levels and indicates how easy it is to trade the stock. As a quick example, we can take the average of each stock per day and use that as a proxy for the ease of trading these stocks.

intraDayDepth = @combine(groupby(fi2010Long, [:STOCK, :DAY, :var]), :avgDepth = mean(:value))
intraDayDepth = @subset(intraDayDepth, :var .== "VOLUME");
plot(intraDayDepth.DAY, intraDayDepth.avgDepth, group=intraDayDepth.STOCK, 
     marker = :circle, title = "Avg Daily Book Depth - FI2010")

Stock 3 and 4 have the highest average depth, so are most likely the easiest to trade, whereas Stock 1 has the thinnest depth. Stock 2 has an interesting switch between liquid and not liquid.

So if you want to look beyond top-of-book data, this dataset provides the extra level information needed and is closer to what a professional shop is using. Better than trying to predict daily Yahoo finance mid-prices with neural nets at least.

Build Your Own Crypto Datasets

If you want to take a further step back then being able to build the tools that take in streaming data directly from the exchanges and save that into a database is another way you can build out your technical capabilities. This means you have full control over what you download and save. Do you want just the top of book every update, the full depth of the book, or just the reported trades? I’ve written about this before, Getting Started with High Frequency Finance using Crypto Data and Julia, and learned a lot in the process. Doing things this way means you have full control over the entire process and can fully understand the data you are saving and any additional quirks around the process.

Conclusion

Plenty to get stuck into and learn from. Being able to get the data and load it into an environment is always the first challenge, and learning how to do that with all these different types of data should help you understand what these types of jobs entail.

Easy Reinforcement Learning - The Multi Armed Bandit

2023-09-27T00:00:00+00:00

This is another draft that’s been sitting on my laptop and I was sitting on the Eurostar on the way to TradeTech and thought I’d try and formalise it into a blog post. This is all about reinforcement learning and a basic model that can be easily implemented in Julia. This post is me walking through and implementing the 2nd chapter of Reinforcement Learning: An Introduction.

Reinforcement learning is a pillar of machine learning and it combines the use of data and learning how to make a better decision automatically. One of the basic models in reinforcement learning is the multi-armed bandit. A bit of an anachronistic name, but the single-armed bandit refers to a casino game where you pull the lever (or push a button), some cassettes roll round and you might win a prize.

The multi-armed bandit is an extension to this type of game and means we have different levers we can pull that lead to a different reward. The reward depends on the lever pulled.

This simple mental model is surprisingly applicable to lots of different problems and it can act as a good approximation to whatever you are trying to solve. For example, let’s use an advertising example. You have multiple adverts that you display to try and get people to click through to your website. Each time a page loads you can load one advertisement, you then record how many people click on that advert and use that to decide which advert to show next. With each page load you decide, do I show the most successful advert so far or try a new advert to see how that performs? Over time you will find out which advert performs the best and show that as much as possible to get as many clicks.

A Simple Bandit

Imagine we have a multi-armed bandit machine, where we pull a lever and get a reward. The reward depends on the lever pulled, how do we learn what the best lever is?

First let’s build our bandit. We will have 5 levers and the reward will be a sample from a normal distribution where each lever will have a random mean and standard deviation.

using Plots, StatsPlots
using Distributions

nLevers = 5

rewardMeans = rand(Normal(0, 3), nLevers)
rewardSD = rand(Gamma(2, 2), nLevers)

hcat(rewardMeans, rewardSD)

5×2 Matrix{Float64}:
 -4.7724   5.88533
 -4.60967  0.627556
 -5.96987  1.14465
  8.96919  3.80253
  2.11311  4.84983

These are the parameters of the levers in our bandit, so let’s look at the distribution of the rewards.

density(rand(Normal(rewardMeans[1], rewardSD[1]), 1000), label = "Lever 1")

for i in 2:nLevers
    density!(rand(Normal(rewardMeans[i], rewardSD[i]), 1000), label = "Lever " * string(i))
end
plot!()

The plot above shows the normal distribution that each lever samples its reward from. The 4th lever looks like the best as it has the highest chance of giving a positive value and its tail reaches the largest rewards too. As we are talking about rewards, large positive values are better.

So given we have a process of pulling a lever and getting a reward, how do we learn what the best lever is and importantly as quickly as possible?

Like all good statistics problems, we start with the most basic model and start pulling levers randomly.

The Random Strategy

Just pull a random lever every time. Nothing is being learned here though and we are just demonstrating how the problem setup works. With each play we generate a random integer that corresponds to the lever, pull the lever (draw a random normal variable with mean/deviation of that lever), record what lever was pulled and the reward amount. Then repeat several times.

function random_learner(rewardMeans, rewardSD, nPlays)

    nLevers = length(rewardMeans)
    
    selectedLever = zeros(Int64, nPlays)
    rewards = zeros(nPlays)

    cumSelection = zeros(Int64, nLevers)
    cumRewards = zeros(nLevers)
    
    optimalChoice = Array{Bool}(undef, nPlays)

    bestLever = findmax(rewardMeans)[2]
    
    for i = 1:nPlays
    
        selectedLever[i] = rand(1:nLevers)
        
        optimalChoice[i] = selectedLever[i] == bestLever
        
        rewards[i] = rand(Normal(rewardMeans[selectedLever[i]], rewardSD[selectedLever[i]]))
    
        cumSelection[selectedLever[i]] += 1
        cumRewards[selectedLever[i]] += rewards[i]
    
    end
    return selectedLever, rewards, cumSelection, cumRewards, optimalChoice
end

We run this learner for 1,000 steps and look at the number of times each lever is pulled.

randomStrat = random_learner(rewardMeans, rewardSD, 1000);

histogram(randomStrat[1], label = "Number of Time Lever Pulled")

Each of the levers is pulled a roughly equal number of times, with no learning, just random pulling. Moving on, how do we learn?

Action Value Methods

Reinforcement learning is about balancing the explore/exploit set-up of the problem. We need to sample each of the levers and work out what kind of rewards they provide and then use that information to inform our next decision.

For each iteration, we randomly decide whether to pull a random lever or to use the old information and pull our best guess at the best lever. Our information in this case is the running average of the rewards from each lever. This is called a greedy learner. It’s just doing its best with what it knows and has no real ability to decide whether to explore a new lever.

The probability of choosing a random lever is called the learning rate ($\eta$) in this post and controls how often we make the perceived optimal choice. (Sutton and Barto write this probability as $\epsilon$ and call the method $\epsilon$-greedy.) A high value of $\eta$ means lots of exploring (learning), while a low value restricts the learning and means we pull the (perceived) best lever almost every time. So if we had many levers and a low learning rate it is possible that we never find the globally optimal lever and instead just stick to a locally optimal lever. This is why it is called a greedy learner: it can get stuck.

function greedy_learner(rewardMeans, rewardSD, nPlays, eta)

    nLevers = length(rewardMeans)
    
    selectedLever = zeros(Int64, nPlays)
    rewards = zeros(nPlays)

    cumSelection = zeros(Int64, nLevers)
    cumRewards = zeros(nLevers)
    
    optimalChoice = Array{Bool}(undef, nPlays)
    
    bestLever = findmax(rewardMeans)[2]

    for i = 1:nPlays

        if rand() < eta
            selectedLever[i] = rand(1:nLevers)
        else 
            q = cumRewards ./ cumSelection
            q[isnan.(q)] .= 0
            selectedLever[i] = findmax(q)[2]
        end
        
        optimalChoice[i] = selectedLever[i] == bestLever
        
        rewards[i] = rand(Normal(rewardMeans[selectedLever[i]], rewardSD[selectedLever[i]]))

        cumSelection[selectedLever[i]] += 1
        cumRewards[selectedLever[i]] += rewards[i]

    end
    return selectedLever, rewards, cumSelection, cumRewards, optimalChoice
end
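
The learner above recomputes the average reward of every lever from the cumulative sums on each play. The book also gives an equivalent incremental update, $Q_{n+1} = Q_n + (R_n - Q_n)/n$, which only needs the previous estimate. Here is a minimal sketch for a single lever, where incremental_mean and the reward values are made up for illustration.

```julia
using Statistics

# Update the estimate after each reward without storing the running sum.
function incremental_mean(rewards)
    q = 0.0
    for (n, r) in enumerate(rewards)
        q += (r - q) / n
    end
    return q
end

rewards = [1.0, 4.0, -2.0, 3.5]   # hypothetical rewards from one lever
qIncremental = incremental_mean(rewards)
qBatch = mean(rewards)
```

Both give the same estimate, so the choice between them is about bookkeeping rather than behaviour.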

Again, we can run it for 1,000 steps and we set our learning rate to 0.5.

greedyStrat = greedy_learner(rewardMeans, rewardSD, 1000, 0.5)

histogram(greedyStrat[1], label = "Number of Time Lever Pulled", legend = :topleft)

This has done what we thought, it has selected the 4th lever that we thought looked the best from the distribution. So we’ve learned something, hooray!

Varying the Learning Rate

The $\eta$ parameter was set to 0.5 above, but how does varying it change the outcome? To explore this we will do multiple runs of multiple plays of the game and also increase the number of levers. For each run, we will generate a new set of reward averages/standard deviations and run the random learner and the greedy learner with different $\eta$.

nRuns = 2000
nPlays = 1000
nLevers = 10

optimalLevel = zeros(nRuns)

randomRes = Array{Tuple}(undef, nRuns)
greedyRes = Array{Tuple}(undef, nRuns)
greedyRes05 = Array{Tuple}(undef, nRuns)
greedyRes01 = Array{Tuple}(undef, nRuns)
greedyRes001 = Array{Tuple}(undef, nRuns)
greedyRes0001 = Array{Tuple}(undef, nRuns)


for i=1:nRuns
    rewardMeans = rand(Normal(0, 1), nLevers)
    rewardSD = ones(nLevers)
   
    randomRes[i] = random_learner(rewardMeans, rewardSD, nPlays)
    greedyRes[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0)
    greedyRes05[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.5)
    greedyRes01[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.1)
    greedyRes001[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.01)
    greedyRes0001[i] = greedy_learner(rewardMeans, rewardSD, nPlays, 0.001)
    
    optimalLevel[i] = findmax(rewardMeans)[2]
    
end

For each of the runs we have the evolution of the reward, so we want to take the average of the reward on each time step and see how that evolves with each play of the game.

randomAvg = mapreduce(x-> x[2], +, randomRes) ./ nRuns
greedyAvg = mapreduce(x-> x[2], +, greedyRes) ./ nRuns
greedyAvg01 = mapreduce(x-> x[2], +, greedyRes01) ./ nRuns
greedyAvg05 = mapreduce(x-> x[2], +, greedyRes05) ./ nRuns
greedyAvg001 = mapreduce(x-> x[2], +, greedyRes001) ./ nRuns;
greedyAvg0001 = mapreduce(x-> x[2], +, greedyRes0001) ./ nRuns;

And plotting the average reward over time.

plot(1:nPlays, randomAvg, label="Random", legend = :bottomright, xlabel = "Time Step", ylabel = "Average Reward")
plot!(1:nPlays, greedyAvg, label="0")
plot!(1:nPlays, greedyAvg05, label="0.5")
plot!(1:nPlays, greedyAvg01, label="0.1")
plot!(1:nPlays, greedyAvg001, label="0.01")
plot!(1:nPlays, greedyAvg0001, label="0.001")

Good to see that all the greedy learners outperform the random learner, so that algorithm is doing something. If we focus on the greedy learners we see how the learning rate changes performance.

plot(1:nPlays, greedyAvg, label="0", legend=:bottomright, xlabel = "Time Step", ylabel = "Average Reward")
plot!(1:nPlays, greedyAvg01, label="0.1")
plot!(1:nPlays, greedyAvg001, label="0.01")
plot!(1:nPlays, greedyAvg0001, label="0.001")

This is an interesting result! When $\eta = 0$ the average reward never reaches as high as it does for the other learning rates. With $\eta = 0$ we never explore the other options; we just select what we think is the best one from history and never stray from our beliefs. This ultimately hurts us because if we don’t get the best lever on the first try then we are stuck with a suboptimal one. Likewise, when the learning rate is very low, it doesn’t get much better, so this shows there is always value in exploring the options.

Philosophically, this shows that with any procedure you need to iterate through different configurations and explore the outcomes rather than sticking with what you believe is optimal.

scatter([0, 0.5, 0.1,0.01, 0.001], 
    map(x-> mean(x[750:1000]), [greedyAvg, greedyAvg05, greedyAvg01, greedyAvg001, greedyAvg0001]),
    xlabel="Learning Rate",
    ylabel = "Converged Reward", legend=:none)

The learning rate looks like it is optimal around 0.1. You can do a grid search to see how the overall behaviour changes in terms of both the speed of convergence to the final state and how good that final reward state is.

Speed it Up - Incremental Implementation

We can improve the above implementation, saving memory and CPU cycles, by doing ‘online learning’ of the rewards and using that to drive the selection. We create one matrix $Q$, update it with the average reward of each lever and use the maximum at each iteration to select our lever if we are not exploring.
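The idea behind online learning of an average can be written down for a single lever. As a sketch of the standard identity, write $R_k$ for the $k$-th reward from that lever and $Q_{n+1}$ for the average of its first $n$ rewards (both symbols are notation introduced here for illustration):

\[Q_{n+1} = \frac{1}{n} \sum_{k=1}^{n} R_k = \frac{1}{n} \left( R_n + (n-1) Q_n \right) = Q_n + \frac{1}{n} \left( R_n - Q_n \right).\]

So each update only needs the previous estimate, the new reward and a count, rather than the full reward history.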

function greedy_learner_incremental(rewardMeans, rewardSD, nPlays, eta)

    nLevers = length(rewardMeans)
    
    selectedLever = zeros(Int64, nPlays)
    rewards = zeros(nPlays)

    cumSelection = zeros(Int64, nLevers)
    cumRewards = zeros(nLevers)

    Q = zeros((nPlays+1, nLevers))
    rewardsArray = zeros(nLevers)
    
    optimalChoice = Array{Bool}(undef, nPlays)
    
    bestLever = findmax(rewardMeans)[2]
    
    for i = 1:nPlays

        if rand() < eta
            selectedLever[i] = rand(1:nLevers)
        else 
            selectedLever[i] = findmax(Q[i,:])[2]
        end
        
        optimalChoice[i] = selectedLever[i] == bestLever
        
        reward = rand(Normal(rewardMeans[selectedLever[i]], rewardSD[selectedLever[i]]))
        rewards[i] = reward
        rewardsArray[selectedLever[i]] = reward
        
        cumSelection[selectedLever[i]] += 1
        cumRewards[selectedLever[i]] += reward

        Q[i+1, :] = Q[i, :] + (1/i) * (rewardsArray - Q[i,:])
        
    end
    return selectedLever, rewards, cumSelection, cumRewards, optimalChoice
end

Using the normal Julia benchmarking tools we can get a good idea if this rewrite has changed anything materially.

using BenchmarkTools

oldImp = @benchmark greedy_learner(rewardMeans, rewardSD, nPlays, 0.1)
newImp = @benchmark greedy_learner_incremental(rewardMeans, rewardSD, nPlays, 0.1)

judge(median(oldImp), median(newImp))

BenchmarkTools.TrialJudgement: 
  time:   -43.91% => improvement (5.00% tolerance)
  memory: -70.15% => improvement (1.00% tolerance)

It’s about 44% faster and uses 70% less memory, so a good optimisation.
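For comparison, here is a minimal sketch of the per-lever form of the incremental update, where each lever keeps a single running average and a pull count rather than a full row of $Q$ per play. The function name `incremental_mean_sketch` and the variables `Q` and `N` are made up for this illustration, and it is not the implementation benchmarked above. It draws the rewards with `randn` so that it only needs the standard library.

```julia
using Random

# Hypothetical helper for illustration: one running average and one count per lever.
function incremental_mean_sketch(rewardMeans, rewardSD, nPlays, eta; rng = Random.default_rng())
    nLevers = length(rewardMeans)
    Q = zeros(nLevers)        # running average reward of each lever
    N = zeros(Int, nLevers)   # number of times each lever has been pulled

    for _ in 1:nPlays
        # explore with probability eta, otherwise pull the lever with the best estimate
        lever = rand(rng) < eta ? rand(rng, 1:nLevers) : argmax(Q)
        reward = rewardMeans[lever] + rewardSD[lever] * randn(rng)

        N[lever] += 1
        # incremental mean: new = old + (observation - old) / count
        Q[lever] += (reward - Q[lever]) / N[lever]
    end
    return Q, N
end

Q, N = incremental_mean_sketch([0.0, 1.0, 2.0], ones(3), 5_000, 0.1; rng = MersenneTwister(1))
println("Estimated means: ", round.(Q, digits = 2))
println("Pull counts: ", N)
```

Only the pulled lever is updated on each play, and the divisor is that lever’s own pull count, so `Q` is the sample average of the rewards seen from each lever.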

Conclusion

This is the basic intro to reinforcement learning but a good foundation for how to think about these problems. The main step is going from data to decisions and how to update the decisions you make each time. You need to make sure you explore the problem space as otherwise you never know how much better some other options might be.

Modelling Soccer Goals as a Point Process

2023-08-30T00:00:00+00:00

Goals occur at random times during football matches but we can use a point process to model their occurrences and understand how they are distributed over time. This blog post goes through how to estimate this type of point process model.

Enjoy these types of posts? Then you should sign up for my newsletter.

I’ve written before about predicting the number of goals in a game and this is a complement to that post. Part of my PhD involved fitting a multidimensional Hawkes process to the times of goals scored by the home and away teams. This post isn’t as complicated as that; instead, we look at something simpler.

This is a change of language too, I’m writing R instead of Julia for once!

require(jsonlite)
require(dplyr)
require(tidyr)
require(ggplot2)
knitr::opts_chunk$set(fig.retina=2)
require(hrbrthemes)
theme_set(theme_ipsum())
extrafont::loadfonts()
require(wesanderson)

I have a dataset that contains the odds and the times of goals for many different football matches.

finalData <- readRDS("/Users/deanmarkwick/Documents/PhD/Research/Hawkes and Football/Data/allDataOddsAndGoals.RDS")

We do some wrangling of the data, converting it from the JSON format to give us a vector of each team’s goals split into whether they are home or away.

homeGoalTimes <- lapply(finalData$home.mins.goal, fromJSON)
awayGoalTimes <- lapply(finalData$away.mins.goal, fromJSON)
allGoals <- c(unlist(homeGoalTimes), unlist(awayGoalTimes))

To clean the data we need to replace the empty entries for games without goals with an empty numeric vector and also cap any goals scored in extra time at 90 minutes. We need a fixed window for the point process modeling.

replaceEmptyWithNumeric <- function(x){
  if(length(x) == 0){
    return(numeric(0))
  }else{
    return(x)
  }
}

max90 <- function(x){
  x[x > 90] <- 90
  return(x)
}

homeGoalTimesClean <- lapply(homeGoalTimes, replaceEmptyWithNumeric)
homeGoalTimesClean <- lapply(homeGoalTimesClean, max90)

awayGoalTimesClean <- lapply(awayGoalTimes, replaceEmptyWithNumeric)
awayGoalTimesClean <- lapply(awayGoalTimesClean, max90)

As the number of goals scored for each team will be proportional to the strength of the team we will use the odds of the team winning the match as a proxy for their strength. This does a good job as my previous blog post Goals from team strengths explored.

homeProbsStrengths <- finalData$PSCH
awayProbsStrengths <- finalData$PSCA

allStrengths <- c(homeProbsStrengths, awayProbsStrengths)
allGoalTimes <- c(homeGoalTimesClean, awayGoalTimesClean)

Interestingly we can do the same cleaning in dplyr easily using the case_when function.

allGoalsFrame <- data.frame(Time = allGoals)
allGoalsFrame %>% 
  mutate(TimeClean = case_when(Time > 90 ~ 90, 
                               TRUE ~ as.numeric(Time))) -> allGoalsFrame

After all that we can plot our distribution of goal times.

ggplot(allGoalsFrame, aes(x=TimeClean, y=after_stat(density))) + 
  geom_histogram(binwidth = 1) + 
  xlab("Time (mins)") + 
  ylab("Goal Density")

There are two bumps: one around 45 minutes, where goals scored during extra time in the first half are recorded, and one at 90 minutes from the 90+ minute goals.

This is what we are trying to model. We want to predict when the goals will happen based on that team’s strength, which will also control how many goals are scored.

Point Process Modelling

A point process is a mathematical model that describes when things happen in a fixed window. Our window is the 90 minutes of the football match and we want to know where the goals fall in this window.

A point process is described by its intensity $\lambda (t)$ which is proportional to the likelihood of seeing an event at time $t$. So a higher intensity, a larger chance of a goal occurring. From our plot above we can see there are two main features we want our model to capture:

The general increase in goals as the match progresses.
The spike at 90 because of extra time.

To fit this type of model we will write an intensity function $\lambda$ and optimise the parameters to maximise the likelihood.

The log-likelihood for a point process is the sum of the log intensity $\log \lambda(t)$ at each event minus the integral of the intensity function over the window

\[\mathcal{L} = \sum _{i} \log \lambda (t_i) - \int _0^T \lambda (t) \mathrm{d} t.\]

We have to specify the form of $\lambda$ with a function and parameters and then fit the parameters to the data. By looking at the data we can see the intensity appears to be increasing and we need to account for the spike at 90

\[\lambda (t) = w \beta _0 + \beta _1 \frac{t}{T} + \beta _{90} \delta (t-90),\]

where $w$ is the team strength, $T$ is 90 and $\delta (x)$ is the Dirac delta function. More on that later.

Which we can easily integrate.

\[\int _0^T \lambda(t) \mathrm{d} t = w \beta_0 T + \beta _1 \frac{T}{2} + \beta_{90}.\]
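Term by term, and taking the convention that the spike sitting on the boundary $t = T = 90$ contributes its full unit mass (an assumption that matches the $\beta_{90}$ term in the integrated intensity), the pieces are:

\[\int_0^T w \beta_0 \, \mathrm{d} t = w \beta_0 T, \qquad \int_0^T \beta_1 \frac{t}{T} \, \mathrm{d} t = \frac{\beta_1}{T} \cdot \frac{T^2}{2} = \beta_1 \frac{T}{2}, \qquad \int_0^T \beta_{90} \delta (t - 90) \, \mathrm{d} t = \beta_{90}.\]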

This gives us our likelihood function so we can move on to optimising it over our data.

Starting via Simulation

It’s always good to make sure you are on the right track by simulating the models you are exploring. Jumping straight into the real data means you are hoping your methods are correct, but starting with a known model and using the methods to recover the parameters gives you some confidence that what you are doing is correct.

There are three components to our model:

the intensity function
the integrated intensity function
the likelihood

We will also be using a Dirac delta function to represent the 90 minute spike

The Dirac Delta Function

Given our data is measured in minutes and all the goals that happen in extra time have the value of t=90, we need a sensible way to account for this mega spike. Essentially, we want something that is 1 at a single point and 0 everywhere else. That way we can assign a weight to this component in the overall model, which helps describe the data and also integrates nicely.

Now I’m a physicist by training, so my mathematical rigour around the function might not be up to scratch.

diract <- function(t, x=90){
  2*as.numeric((round(t) == x))
}

qplot(seq(0, 100, 0.1), diract(seq(0, 100, 0.1))) + 
  xlab("Time") + 
  ylab("Weight")

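As expected, a spike at 90 and 0 everywhere else. The spike has a height of 2 rather than 1 because the rounding makes it cover only the last half minute of the window (89.5 to 90), so it still integrates to 1.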

We can now write the R code for our intensity function, and then the likelihood by combining the intensity and integrated intensity.

intensityFunction <- function(params, t, winProb, maxT){
  beta0 <- params[1]
  beta1 <- params[2]
  beta90 <- params[3]
  
  int <- (winProb * beta0) + (beta1 * (t/maxT)) + (beta90*diract(t))
  int[int < 0] <- 0
  int
}

intensitFunctionInt <- function(params, maxT, winProb){
  beta0 <- params[1]
  beta1 <- params[2]
  beta90 <- params[3]
  
  beta0*winProb*maxT + (beta1*maxT)/2 + beta90
}

likelihood <- function(params, t, winProb){
  ss <- sum(log(intensityFunction(params, t, winProb, 90)))
  int <- intensitFunctionInt(params, 90, winProb)
  ss - int
}

We now combine the three functions and simulate a point process from the intensity function. We will use thinning to simulate the inhomogeneous intensity. This means generating more points than expected from a larger constant intensity, and then keeping each point with probability equal to the ratio of the true intensity to the larger intensity. For a more in-depth discussion, I’ve written about it in a previous post.
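Written out, with $\lambda_{\max}$ an upper bound on the intensity over the window and $t^*_j$ the candidate points (notation introduced here for illustration), the thinning recipe is:

\[N^* \sim \text{Poisson}(\lambda_{\max} T), \qquad t^*_j \sim \text{Uniform}(0, T), \qquad \Pr(\text{keep } t^*_j) = \frac{\lambda(t^*_j)}{\lambda_{\max}}.\]

Because $\lambda(t) \le \lambda_{\max}$ the acceptance probability always sits between 0 and 1, and the kept points have intensity $\lambda_{\max} \cdot \lambda(t) / \lambda_{\max} = \lambda(t)$.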

sim_events <- function(params, winProb){
  lambdaMax <- 1.1*intensityFunction(params, 90, winProb, 90)
  nevents <- rpois(1, lambdaMax*90)
  tstar <- runif(nevents, 0, 90)
  accept_prob <- intensityFunction(params, tstar, winProb, 90) / lambdaMax
  (sort(tstar[runif(length(accept_prob)) < accept_prob]))
}

N <- 100
testParams <- c(3, 2, 2)
testWinProb <- 1

testEvents <- replicate(N, sim_events(testParams, testWinProb))
testWinProbs <- rep_len(testWinProb, N)

trueInt <- intensityFunction(testParams, 0:90, testWinProb, 90)

As we have multiple simulated games, we want to calculate the overall likelihood across the total sample and maximise that likelihood.

alllikelihood <- function(params, events, winProbs){
  ll <- sum(vapply(seq_along(events), 
             function(i) likelihood(params, events[[i]], winProbs[[i]]), 
             numeric(1)))
  if(ll == -Inf){
    return(-1e9)
  } else {
    return(ll)
  }
}

trueLikelihood <- alllikelihood(testParams, testEvents, testWinProbs)

Simple enough to do the optimisation, chuck the function into optim and away we go.

simRes <- optim(runif(3), function(x) -1*alllikelihood(c(x[1], x[2], x[3]), 
                                             testEvents, 
                                             testWinProbs), lower = c(0,0,0), method = "L-BFGS-B")

print(simRes$par)

3.005867 1.995551 1.932193

The parameters come out almost exactly as they were specified.

simResDF <- data.frame(Time = 0:90, 
                     TrueIntensity = trueInt, 
                     EstimatedIntensity = intensityFunction(simRes$par, 0:90, testWinProb, 90))

ggplot(simResDF, aes(x=Time, y=TrueIntensity, color = "True")) + 
  geom_line() + 
  geom_line(aes(y=EstimatedIntensity, color = "Estimated")) + 
  labs(color = NULL) + 
  xlab("Time") + 
  ylab("Intensity") + 
  theme(legend.position = "bottom")

Okay, so our method is good. We’ve recovered all three factors in the intensity so well that you can hardly tell the difference between the real and estimated intensities. So we can now go on looking at our data.

Optimising over our football data

Let’s do the train/test split and fit our model on the training data.

trainInds <- sample.int(length(allGoalTimes), size = floor(length(allGoalTimes)*0.7))

goalTimesTrain <- allGoalTimes[trainInds]
strengthTrain <- allStrengths[trainInds]

goalTimesTest <- allGoalTimes[-trainInds]
strengthTest <- allStrengths[-trainInds]

We start by using a null model. This is where we will just use the constant parameter and the team strengths and see how well that fits the data.

optNull <- optim(runif(1), function(x) -1*alllikelihood(c(x[1], 0, 0), 
                                                       goalTimesTrain, 
                                                       strengthTrain), lower = c(0,0,0), method = "L-BFGS-B")
optNull

We add in the next parameter, the linear trend.

optNull2 <- optim(runif(2), function(x) -1*alllikelihood(c(x[1], x[2], 0), 
                                                       goalTimesTrain, 
                                                       strengthTrain), lower = c(0,0,0), method = "L-BFGS-B")
optNull2

We can now use all the features previously described and fit the full model across the data.

optRes <- optim(runif(3), function(x) -1*alllikelihood(x, 
                                                       goalTimesTrain, 
                                                       strengthTrain), lower = c(0,0,0), method = "L-BFGS-B")
optRes

And then just to check, let’s remove the linear parameter.

optRes2 <- optim(runif(2), function(x) -1*alllikelihood(c(x[1], 0, x[2]), 
                                                       goalTimesTrain, 
                                                       strengthTrain), lower = c(0,0,0), method = "L-BFGS-B")
optRes2

Putting all the results into a table lets us compare nicely.

Model	$\beta _0$	$\beta _1$	$\beta _{90}$
Constant	0.0039	-	-
Linear	0.0006	0.025	-
Delta	0.00096	0.022	0.05
No Linear	0.0037	-	0.06

The positive linear parameter ($\beta _1$) shows that there is an increase in probability towards the end of the match.

It is easier to compare the resultant intensity functions though.

modelFits <- data.frame(Time = 0:90)
modelFits$Null <- intensityFunction(c(optNull$par[1],0,0), modelFits$Time, 2, 90)
modelFits$Linear <- intensityFunction(c(optNull2$par ,0), modelFits$Time, 2, 90)
modelFits$Delta <- intensityFunction(optRes$par, modelFits$Time, 2, 90)
modelFits$NoLinear <- intensityFunction(c(optRes2$par[1], 0, optRes2$par[2]), modelFits$Time, 2, 90)

modelFits %>% 
  pivot_longer(!Time, names_to="Model", values_to="Intensity") -> modelFitsTidy

ggplot(modelFitsTidy, aes(x=Time, y=Intensity, color = Model)) + 
  geom_line() + 
  theme(legend.position = "bottom")

So there are interesting differences between the four models. The Delta model has a lower slope than the Linear model because it can accommodate the spike at the end. When looking at the final likelihoods from the models:

Model	Out of Sample Likelihood
Constant	-55337.35
Linear	-52268.48
Delta	-51917.7
No Linear	-54500.6

So, the best fitting model (largest likelihood) is the Delta model, so that 90-minute spike is doing some work. It also shows that the linear component contributes something to the model, as the No Linear result has a worse likelihood.

Using the likelihood to evaluate the model is only one approach though. We could go further with BIC/AIC/DIC values but given there are only three parameters in the model it probably won’t be instructive. Instead, we should look at what the simulated results from the model look like.

We go through each of the test set matches and simulate each match 100 times, taking the maximum number of goals scored. We then compare this to the maximum observed number of goals across the data set and see how the distributions compare.

This is similar to the posterior p-values method for model checking but in this case slightly different because we do not have a chain of parameters and just the optimised values.

maxGoals <- vapply(strengthTest, 
       function(x) max(replicate(100, length(sim_events(optRes$par, x)))),
       numeric(1))

actualMaxGoals <- max(vapply(allGoalTimes, length, numeric(1)))

ggplot(data = data.frame(MaxGoals = maxGoals), aes(x=MaxGoals)) + 
  geom_histogram(binwidth = 1) + 
  geom_vline(xintercept = actualMaxGoals) + 
  xlab("Maximum Number of Goals")

10 is the largest number of goals observed, and our model congregates around 5 as the maximum, but we did see 2 simulations with 10 goals and another 2 with more than 10 goals. So overall, the model can generate something that resembles reality, if only infrequently. But then again, how often do we see 10-goal games?

Conclusion and Next Steps

Overall this is a nice little model that shows the probability of a team scoring appears to increase linearly over time. We added in a delta function to account for the fact that some games go beyond 90 minutes and many goals are scored in that period. We then did some model checking by simulating using the fitted parameters and it turns out the model can generate large enough numbers of goals compared to the real data.

I’ve fitted this model by optimising the likelihood, so the next logical step would be to take a Bayesian approach and throw the model into Stan so we have a proper sample of parameters that lets us judge the uncertainty around the model a bit better. Then the next direction would be to relax the linearity of the model, throw a non-parametric approach at the data and see if anything interesting turns up. I have been trying this with my dirichletprocess package, but never managed to get a satisfying result that improved on the above. Plus with the large dataset, it was taking forever to run. Maybe a blog post for the future!

Stat Arb - An Easy Walkthrough

2023-07-15T00:00:00+00:00

Statistical arbitrage (stat arb) is a pillar of quantitative trading that relies on mean reversion to predict the future returns of an asset. Mean reversion is the belief that if a stock has risen it is more likely to fall back in the short term, which is the opposite of a momentum strategy, where a stock that has been rising is expected to continue to rise. This blog post will walk you through ‘the’ statistical arbitrage paper Statistical Arbitrage in the US Equities Market, apply it to a stock/ETF pair and then look at an intraday crypto stat arb strategy.

Enjoy these types of posts? Then you should sign up for my newsletter.

I’m using Julia 1.9 and my AlpacaMarkets.jl package gets all the data we need.

using AlpacaMarkets
using DataFrames, DataFramesMeta
using Dates
using Plots
using RollingFunctions, Statistics
using GLM

To start with we simply want the daily prices of JPM, XLF, and SPY. JPM is the stock we think will go through mean reversion, XLF is the financial sector ETF and SPY is the general market ETF.

We think that if JPM rises higher than XLF then it will soon revert and trade lower. Likewise, if JPM falls lower than XLF then we think it will soon trade higher. Our mean reversion is all about JPM around XLF. We’ve chosen XLF as it represents the general financial sector landscape, so will represent the general sector outlook more consistently than JPM on its own.

jpm = AlpacaMarkets.stock_bars("JPM", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1]
xlf = AlpacaMarkets.stock_bars("XLF", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1];
spy = AlpacaMarkets.stock_bars("SPY", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1];

We want to clean the data to format the date correctly and select the close and open columns.

function parse_date(t)
   Date(string(((split(t, "T")))[1]))
end

function clean(df, x) 
    df = @transform(df, :Date = parse_date.(:t), :Ticker = x, :NextOpen = [:o[2:end]; NaN])
   @select(df, :Date, :c, :o, :Ticker, :NextOpen)
end

Now we calculate the close-to-close log returns and format the data into a column for each asset.

jpm = clean(jpm, "JPM")
xlf = clean(xlf, "XLF")
spy = clean(spy, "SPY")
allPrices = vcat(jpm, xlf, spy)
allPrices = sort(allPrices, :Date)

allPrices = @transform(groupby(allPrices, :Ticker), 
                      :Return = [NaN; diff(log.(:c))], 
                      :ReturnO = [NaN; diff(log.(:o))],
                      :ReturnTC = [NaN; diff(log.(:NextOpen))]);

modelData = unstack(@select(allPrices, :Date, :Ticker, :Return), :Date, :Ticker, :Return)
modelData = modelData[2:end, :];

last(modelData, 4)

4 rows × 4 columns

	Date	JPM	XLF	SPY
	Date	Float64?	Float64?	Float64?
1	2023-06-30	0.0138731	0.00864001	0.0117316
2	2023-07-03	0.00799894	0.00562049	0.00114985
3	2023-07-05	-0.00661524	-0.00206703	-0.0014883
4	2023-07-06	-0.00993581	-0.00860923	-0.00786148

Looking at the cumulative returns, we can see that all three move in sync.

plot(modelData.Date, cumsum(modelData.JPM), label = "JPM")
plot!(modelData.Date, cumsum(modelData.XLF), label = "XLF")
plot!(modelData.Date, cumsum(modelData.SPY), label = "SPY", legend = :left)

The key point is that they are moving in sync with each other. Given XLF has JPM included in it, this is expected but it also presents the opportunity to trade around any dispersion between the ETF and the individual name.

The Stat Arb Modelling Process

https://math.stackexchange.com/questions/345773/how-the-ornstein-uhlenbeck-process-can-be-considered-as-the-continuous-time-anal

Let’s think simply about pairs trading. We have two securities that we want to trade if their prices change too much, so our variable of interest is

\[e = P_1 - P_2\]

and we will enter a trade if $e$ becomes large enough in either the positive or negative direction.

To translate that into a statistical problem we have two steps.

Work out the difference between the two securities
Model how the difference changes over time.

Step 1 is a simple regression of the stock vs the ETF we are trading against. Step 2 needs a bit more thought, but is still only a simple regression.

The Macro Regression - Stock vs ETF

In our data, we have the daily returns of JPM, the XLF ETF, and the SPY ETF. To work out the interdependence, it’s just a case of simple linear regression.

regModel = lm(@formula(JPM ~ XLF + SPY), modelData)

JPM ~ 1 + XLF + SPY

Coefficients:
──────────────────────────────────────────────────────────────────────────────────
                    Coef.   Std. Error       t  Pr(>|t|)   Lower 95%     Upper 95%
──────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.000188758  0.000162973    1.16    0.2469  -0.0001309   0.000508417
XLF           1.35986      0.0203485     66.83    <1e-99   1.31995     1.39977
SPY          -0.363187     0.0260825    -13.92    <1e-41  -0.414345   -0.312028
──────────────────────────────────────────────────────────────────────────────────

From the coefficients of the model, we can see that JPM = 1.36XLF - 0.36SPY, so JPM has a $\beta$ of 1.36 to the XLF ETF and a $\beta$ of -0.36 to the SPY ETF, or general market. So each day, we can approximate JPM’s return by multiplying the XLF and SPY returns by their respective $\beta$s and adding them together.

This is our economic factor model, which describes in a ‘big picture’ kind of way how the stock trades vs the general market (SPY) and its sector-specific market (XLF).
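As a quick worked example of the factor model, we can plug the rounded coefficients from the regression table and the 2023-07-06 returns from the last row of the data shown earlier into the formula. The variable names here are made up for the illustration.

```julia
# Coefficients as printed in the regression table (rounded)
alpha, betaXLF, betaSPY = 0.000188758, 1.35986, -0.363187

# Returns on 2023-07-06 from the last row of the printed data
jpm, xlf, spy = -0.00993581, -0.00860923, -0.00786148

predicted = alpha + betaXLF * xlf + betaSPY * spy   # what the factor model expects
residual = jpm - predicted                          # the part the model does not explain

println("predicted = ", round(predicted, digits = 4), ", residual = ", round(residual, digits = 4))
```

The model expects a return of roughly -0.0087 against an actual JPM return of -0.0099, leaving a residual of about -0.0013 for that day.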

What we need to do next is look at what this model doesn’t explain and try and describe that.

The Reversion Regression

Any difference around this model can be explained by the summation of the residuals over time. In the paper the sum of the residuals over time is called the ‘auxiliary process’ and this is the data behind the second regression.

plot(scatter(modelData.Date, residuals(regModel), label = "Residuals"),
       plot(modelData.Date,cumsum(residuals(regModel)),
       label = "Aux Process"),
	  layout = (2,1))

We believe the auxiliary process (cumulative sum of the residuals) can be modeled using an Ornstein-Uhlenbeck (OU) process.

An OU process is a type of stochastic differential equation that displays mean reversion behaviour. If the process falls away from its average level then it will be forced back.

\[dX = \kappa (m - X(t))dt + \sigma \mathrm{d} W\]

$\kappa$ represents how quickly the mean reversion occurs.

To fit this type of process we need to recognise that the above differential form of an OU process can be discretised to become a simple AR(1) model where the model parameters can be transformed to get the OU parameters.
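As a sketch of that mapping, assume the standard exact discretisation of the OU process over a time step $\Delta t$ (here one trading day, so $\Delta t = 1/252$), with $a$, $b$ and $\eta$ as notation for the AR(1) intercept, slope and noise:

\[X_{n+1} = a + b X_n + \eta_{n+1}, \qquad b = e^{-\kappa \Delta t}, \qquad a = m (1 - b), \qquad \text{Var}(\eta) = \sigma^2 \frac{1 - b^2}{2 \kappa}.\]

Inverting these gives the OU parameters in terms of the regression output:

\[\kappa = -\frac{\log b}{\Delta t}, \qquad m = \frac{a}{1 - b}, \qquad \sigma = \sqrt{\frac{2 \kappa \text{Var}(\eta)}{1 - b^2}}, \qquad \sigma_{\text{eq}} = \frac{\sigma}{\sqrt{2 \kappa}} = \sqrt{\frac{\text{Var}(\eta)}{1 - b^2}}.\]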

We now fit the OU process onto the cumulative sum of the residuals from the first model. If the residuals have some sort of structure/pattern then this means our original model was missing some variable that explains the difference.

X = cumsum(residuals(regModel))
xDF = DataFrame(y=X[2:end], x = X[1:end-1])
arModel = lm(@formula(y~x), xDF)

y ~ 1 + x

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                  Coef.   Std. Error       t  Pr(>|t|)     Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept)  4.41618e-6  0.000162655    0.03    0.9783  -0.000314618  0.000323451
x            0.997147    0.00186733   534.00    <1e-99   0.993484     1.00081
─────────────────────────────────────────────────────────────────────────────────

We take these coefficients and transform them into the parameters from the paper.

varEta = var(residuals(arModel))
a, b = coef(arModel)
k = -log(b)*252
m = a/(1-b)
sigma = sqrt((varEta * 2 * k) / (1-b^2))
sigma_eq = sqrt(varEta / (1-b^2))
[m, sigma_eq]

2-element Vector{Float64}:
 0.0015477568390823153
 0.08709971423424319

So $m$ gives us the average level and $\sigma_{\text{eq}}$ the appropriate scale.

Now to build the mean reversion signal. We still have $X$ as our auxiliary process which we believe is mean reverting. We now have the estimated parameters on the scale of this mean reversion so we can transform the auxiliary process by these parameters and use this to see when the process is higher or lower than the model suggests it should be.

modelData.Score = (X .- m)./sigma_eq;

plot(modelData.Date, modelData.Score, label = "s")
hline!([-1.25], label = "Long JPM, Short XLF", color = "red")
hline!([-0.5], label = "Close Long Position", color = "red", ls=:dash)

hline!([1.25], label = "Short JPM, Long XLF", color = "purple")
hline!([0.75], label = "Close Short Position", color = "purple", ls = :dash, legend=:topleft)

The red lines indicate when JPM has diverged from XLF on the negative side, i.e. we expect JPM to move higher and XLF to move lower. We enter the position if s < -1.25 (solid red line) and exit the position when s > -0.5 (dashed red line).

Buy to open if $s < -s_{bo}$ (< -1.25) Buy 1 JPM, sell Beta XLF
Close long if $s > -s_{c}$ (-0.5)

The purple lines are the same but in the opposite direction.

Sell to open if $s > s_{so}$ (>1.25) Sell 1 JPM, buy Beta XLF
Close short if $s < s_{bc}$ (<0.75)

That’s the modeling part done. We model how the stock moves based on the overall market, and then use the OU process on whatever is left over to come up with the mean reversion parameters.

So, does it make money?

Backtesting the Stat Arb Strategy

To backtest this type of model we have to roll through time and calculate both regressions to construct the signal.

A couple of new additions too

We shift and scale the returns when doing the macro regression.
The auxiliary process on the last day is always 0, which makes calculating the signal simple.
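To see why the auxiliary process always ends at 0, recall that the residuals of a least-squares regression with an intercept sum to zero, so their cumulative sum must finish at zero. Here is a small sketch on made-up data. The returns are simulated and the coefficients 0.9, -0.2 and 0.3 are arbitrary choices for the illustration, not estimates from the real data.

```julia
using LinearAlgebra, Random

rng = MersenneTwister(42)
n = 91                                   # same length as the i:(i+90) window
xlf = randn(rng, n)                      # made-up standardised returns
spy = randn(rng, n)
jpm = 0.9 .* xlf .- 0.2 .* spy .+ 0.3 .* randn(rng, n)

A = hcat(ones(n), xlf, spy)              # design matrix with an intercept column
coefs = A \ jpm                          # least-squares fit
resid = jpm .- A * coefs
aux = cumsum(resid)                      # auxiliary process

println(abs(aux[end]) < 1e-8)            # the last value is zero up to floating point error
```

Since the last value of the auxiliary process is 0, the score $(X - m)/\sigma_{\text{eq}}$ on the last day reduces to $-m/\sigma_{\text{eq}}$.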

paramsRes = Array{DataFrame}(undef, length(90:(nrow(modelData) - 90)))

for (j, i) in enumerate(90:(nrow(modelData) - 90))
    modelDataSub = modelData[i:(i+90), :]
    modelDataSub.JPM = (modelDataSub.JPM .- mean(modelDataSub.JPM)) ./ std(modelDataSub.JPM)
    modelDataSub.XLF = (modelDataSub.XLF .- mean(modelDataSub.XLF)) ./ std(modelDataSub.XLF)
    modelDataSub.SPY = (modelDataSub.SPY .- mean(modelDataSub.SPY)) ./ std(modelDataSub.SPY)
    
    macroRegr = lm(@formula(JPM ~ XLF + SPY), modelDataSub)
    auxData = cumsum(residuals(macroRegr))
    ouRegr = lm(@formula(y~x), DataFrame(x=auxData[1:end-1], y=auxData[2:end]))
    
    varEta = var(residuals(ouRegr))
    a, b = coef(ouRegr)
    k = -log(b)*252
    m = a/(1-b)
    sigma = sqrt((varEta * 2 * k) / (1-b^2))
    sigma_eq = sqrt(varEta / (1-b^2))
    
    
    paramsRes[j] = DataFrame(Date= modelDataSub.Date[end], 
                             MacroBeta_XLF = coef(macroRegr)[2], MacroBeta_SPY = coef(macroRegr)[3], MacroAlpha = coef(macroRegr)[1],
                             VarEta = varEta, OUA = a, OUB = b, OUK = k, Sigma = sigma, SigmaEQ=sigma_eq,
                             Score = -m/sigma_eq)
    
end

paramsRes = vcat(paramsRes...)
last(paramsRes, 4)

4 rows × 11 columns (omitted printing of 4 columns)

	Date	MacroBeta_XLF	MacroBeta_SPY	MacroAlpha	VarEta	OUA	OUB
	Date	Float64	Float64	Float64	Float64	Float64	Float64
1	2023-06-30	0.974615	-0.230273	1.10933e-17	0.331745	0.175358	0.830417
2	2023-07-03	0.96943	-0.228741	-5.73883e-17	0.331222	0.198176	0.826816
3	2023-07-05	0.971319	-0.230438	2.38846e-17	0.335844	0.242754	0.841018
4	2023-07-06	0.974721	-0.232765	5.09875e-17	0.331695	0.256579	0.823822

A benefit of doing it this way is that we can see how each $\beta$ in the macro regression evolves.

plot(paramsRes.Date, paramsRes.MacroBeta_XLF, label = "XLF Beta")
plot!(paramsRes.Date, paramsRes.MacroBeta_SPY, label = "SPY Beta")

Good to see they are consistent in their signs and generally don’t vary a great deal.

In the OU process, we are also interested in the speed of the mean reversion as we don’t want to take a position that is very slow to revert to the mean level.

kplot = plot(paramsRes.Date, paramsRes.OUK, label = :none)
kplot = hline!([252/45], label = "K Threshold")

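In the paper, they suggest making sure the reversion happens within half of the estimation period. As we are using 90 days, the characteristic reversion time $1/k$ should be less than 45 days, i.e. $k > 252/45$ in annualised terms. The horizontal line marks this threshold, and we only trade when $k$ is above it.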
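As a quick sanity check on the numbers, the threshold can be worked through by hand. This is a small sketch, assuming 252 trading days per year as in the code above; `reversion_days` is a hypothetical helper and the value of `b` is the `OUB` from the last row of the table printed earlier.

```julia
steps_per_year = 252          # trading days per year, as in the daily backtest
window = 90                   # estimation window in days

# Reversion time 1/k (in years) must be below half the window (in years)
k_threshold = steps_per_year / (window / 2)          # 252/45 = 5.6

# The same rule expressed on the AR(1) slope: k = -log(b) * 252 > k_threshold
b_threshold = exp(-k_threshold / steps_per_year)     # exp(-1/45), roughly 0.978

# Characteristic reversion time in days implied by an AR(1) slope b
reversion_days(b) = 1 / (-log(b))

b = 0.823822                  # OUB from the last row of the table above
k = -log(b) * steps_per_year
println("k = ", k, ", reversion time in days = ", reversion_days(b))
println("passes the speed filter: ", k > k_threshold)
```

Put differently, any window where the fitted slope is below roughly 0.978 reverts quickly enough to be traded under this rule.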

Plotting the score function also shows how the model wants to go long/short the different components over time.

splot = plot(paramsRes.Date, paramsRes.Score, label = "Score")
hline!([-1.25], label = "Long JPM, Short XLF", color = "red")
hline!([-0.5], label = "Close Long Position", color = "red", ls=:dash)

hline!([1.25], label = "Short JPM, Long XLF", color = "purple")
hline!([0.75], label = "Close Short Position", color = "purple", ls = :dash)

We run through the allocation procedure and label whether we are long (+1) or short (-$\beta$) an amount of either the stock or ETFs.

paramsRes.JPM_Pos .= 0.0
paramsRes.XLF_Pos .= 0.0
paramsRes.SPY_Pos .= 0.0

for i in 2:nrow(paramsRes)
    
    if paramsRes.OUK[i] > 252/45
    
        if paramsRes.Score[i] >= 1.25
            paramsRes.JPM_Pos[i] = -1
            paramsRes.XLF_Pos[i] = paramsRes.MacroBeta_XLF[i]
            paramsRes.SPY_Pos[i] = paramsRes.MacroBeta_SPY[i]
        elseif paramsRes.Score[i] >= 0.75 && paramsRes.JPM_Pos[i-1] == -1
            paramsRes.JPM_Pos[i] = -1
            paramsRes.XLF_Pos[i] = paramsRes.MacroBeta_XLF[i]    
            paramsRes.SPY_Pos[i] = paramsRes.MacroBeta_SPY[i]
        end

        if paramsRes.Score[i] <= -1.25
            paramsRes.JPM_Pos[i] = 1
            paramsRes.XLF_Pos[i] = -paramsRes.MacroBeta_XLF[i]   
            paramsRes.SPY_Pos[i] = -paramsRes.MacroBeta_SPY[i]
        elseif paramsRes.Score[i] <= -0.5 && paramsRes.JPM_Pos[i-1] == 1
            paramsRes.JPM_Pos[i] = 1
            paramsRes.XLF_Pos[i] = -paramsRes.MacroBeta_XLF[i] 
            paramsRes.SPY_Pos[i] = -paramsRes.MacroBeta_SPY[i]
        end
    end
        
end
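The entry and exit rules in the loop above can also be written as a small state function, which makes the thresholds easier to test in isolation. This is only a sketch: `next_position` and `run_positions` are hypothetical helper names, the scores are made up, and the ETF legs would still be set to minus the stock position times the relevant $\beta$.

```julia
# Hypothetical helper: stock position (+1 long, -1 short, 0 flat) given the
# current score, the previous position and whether k passed the speed filter.
function next_position(score, prev_pos, fast_enough;
                       open_short = 1.25, close_short = 0.75,
                       open_long = -1.25, close_long = -0.5)
    fast_enough || return 0                 # too slow to revert: stay flat
    if score >= open_short
        return -1                           # open (or keep) the short
    elseif score <= open_long
        return 1                            # open (or keep) the long
    elseif prev_pos == -1 && score >= close_short
        return -1                           # hold the short until the score drops below 0.75
    elseif prev_pos == 1 && score <= close_long
        return 1                            # hold the long until the score rises above -0.5
    end
    return 0
end

# Walk a series of scores, starting flat, assuming the speed filter always passes
function run_positions(scores)
    pos = zeros(Int, length(scores))
    for i in 2:length(scores)
        pos[i] = next_position(scores[i], pos[i-1], true)
    end
    return pos
end

scores = [0.0, 1.3, 1.0, 0.8, 0.7, -1.3, -0.6, -0.4]   # made-up scores
println(run_positions(scores))
```

The gap between the opening and closing thresholds stops the position from flipping on and off when the score hovers around a single level.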

To make sure we use the right price return, we lead the return columns by one so that we enter the position and then earn the next period’s return.

modelData = @transform(modelData, :NextJPM= lead(:JPM, 1), 
                                   :NextXLF = lead(:XLF, 1),
                                   :NextSPY = lead(:SPY, 1))

paramsRes = leftjoin(paramsRes, modelData[:, [:Date, :NextJPM, :NextXLF, :NextSPY]], on=:Date)

portRes = @combine(groupby(paramsRes, :Date), :Return = :NextJPM .* :JPM_Pos .+ :NextXLF .* :XLF_Pos .+ :NextSPY .* :SPY_Pos);
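A toy illustration of why the lead matters: the position on each row is decided using data up to that row, so it should only be credited with the following row’s return. The sketch below uses a hand-rolled `lead1` as a stand-in for the `lead` function used above, and the returns and positions are made up.

```julia
# Hypothetical stand-in for `lead` with a shift of one period
lead1(x) = [x[2:end]; missing]

ret = [0.01, 0.02, -0.01, 0.03]   # made-up per-period returns
pos = [0, 1, 1, 0]                # positions decided at the end of each period

next_ret = lead1(ret)             # return earned over the following period
pnl = pos .* next_ret             # the last entry is missing: no next return yet
total = sum(skipmissing(pnl))

println(pnl)
println(total)
```

Multiplying `pos` by `ret` directly would instead credit each position with the return that was already known when it was chosen, which is a look-ahead bias.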

plot(portRes.Date, cumsum(portRes.Return), label = "Stat Arb Return")

Sad trombone noise. This is not a great result as we’ve ended up negative over the period. However, given the paper is 15 years old, it would be very rare to still be able to make money this way now that everyone knows how to do it. Plus, I’ve only used one stock vs the ETF portfolio. You typically want to diversify and use all the stocks in the ETF, going long and short multiple single names and using the ETF as a minimal hedge.

The good thing about a negative result is that we don’t have to start considering transaction costs or other annoying things like that.

When we break out the components of the strategy, we can see that it appears to pick out the right times to go long/short JPM and SPY; it’s the hedging with the XLF ETF that is bringing the portfolio down.

plot(paramsRes.Date, cumsum(paramsRes.NextJPM .* paramsRes.JPM_Pos), label = "JPM Component")
plot!(paramsRes.Date, cumsum(paramsRes.NextXLF .* paramsRes.XLF_Pos), label = "XLF Component")
plot!(paramsRes.Date, cumsum(paramsRes.NextSPY .* paramsRes.SPY_Pos), label = "SPY Component")
plot!(portRes.Date, cumsum(portRes.Return), label = "Stat Arb Portfolio")

So whilst naively trying to trade the stat arb portfolio is probably a loss maker, there might be some value in using the model as a signal input or overlay to another strategy.

What about if we up the frequency and look at intraday stat arb?

Intraday Stat Arb in Crypto - ETH and BTC

Crypto markets are open 24 hours a day, 7 days a week, which gives that much more opportunity to build out a continuous trading model. We look back over the last year and repeat the backtesting process to see if this bears any fruit.

Once again AlpacaMarkets gives us an easy way to pull the hourly bar data for both ETH and BTC.

btcRaw = AlpacaMarkets.crypto_bars("BTC/USD", "1Hour"; startTime = now() - Year(1), limit = 10000)[1]
ethRaw = AlpacaMarkets.crypto_bars("ETH/USD", "1Hour"; startTime = now() - Year(1), limit = 10000)[1];

btc = @transform(btcRaw, :ts = DateTime.(chop.(:t)), :Ticker = "BTC")
eth = @transform(ethRaw, :ts = DateTime.(chop.(:t)), :Ticker = "ETH")

btc = btc[:, [:ts, :Ticker, :c]]
eth = eth[:, [:ts, :Ticker, :c]]

allPrices = vcat(btc, eth)
allPrices = sort(allPrices, :ts)

allPrices = @transform(groupby(allPrices, :Ticker), 
                      :Return = [NaN; diff(log.(:c))]);

modelData = unstack(@select(allPrices, :ts, :Ticker, :Return), :ts, :Ticker, :Return);
modelData = @subset(modelData, .! isnan.(:ETH .+ :BTC))

Plotting out the returns we can see they are loosely related just like the stock example.

plot(modelData.ts, cumsum(modelData.BTC), label = "BTC")
plot!(modelData.ts, cumsum(modelData.ETH), label = "ETH")

We will be using BTC as the ‘index’ and see how ETH is related.

regModel = lm(@formula(ETH ~ BTC), modelData)

ETH ~ 1 + BTC

Coefficients:
─────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       t  Pr(>|t|)    Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)  7.72396e-6  3.64797e-5    0.21    0.8323  -6.37847e-5  7.92327e-5
BTC          1.115       0.00673766  165.49    <1e-99   1.10179     1.12821
─────────────────────────────────────────────────────────────────────────────

A fairly high beta for ETH against BTC. We now use a 90-hour rolling window instead of a 90-day one.

window = 90

paramsRes = Array{DataFrame}(undef, length(window:(nrow(modelData) - window)))

for (j, i) in enumerate(window:(nrow(modelData) - window))
    modelDataSub = modelData[i:(i+window), :]
    modelDataSub.ETH = (modelDataSub.ETH .- mean(modelDataSub.ETH)) ./ std(modelDataSub.ETH)
    modelDataSub.BTC = (modelDataSub.BTC .- mean(modelDataSub.BTC)) ./ std(modelDataSub.BTC)
    
    macroRegr = lm(@formula(ETH ~ BTC), modelDataSub)
    auxData = cumsum(residuals(macroRegr))
    ouRegr = lm(@formula(y~x), DataFrame(x=auxData[1:end-1], y=auxData[2:end]))
    varEta = var(residuals(ouRegr))
    a, b = coef(ouRegr)
    k = -log(b)/((1/24)/252)
    m = a/(1-b)
    sigma = sqrt((varEta * 2 * k) / (1-b^2))
    sigma_eq = sqrt(varEta / (1-b^2))
    
    
    paramsRes[j] = DataFrame(ts= modelDataSub.ts[end], MacroBeta = coef(macroRegr)[2], MacroAlpha = coef(macroRegr)[1],
                             VarEta = varEta, OUA = a, OUB = b, OUK = k, Sigma = sigma, SigmaEQ=sigma_eq,
                             Score = -m/sigma_eq)
    
end

paramsRes = vcat(paramsRes...)

Again, looking at $\beta$ over time, we see there has been a sudden shift.

plot(plot(paramsRes.ts, paramsRes.MacroBeta, label = "Macro Beta", legend = :left), 
     plot(paramsRes.ts, paramsRes.OUK, label = "K"), layout = (2,1))

Interesting that there has been a big change in $\beta$ between ETH and BTC recently that has suddenly reverted. Ok, onto the backtesting again.

paramsRes.ETH_Pos .= 0.0
paramsRes.BTC_Pos .= 0.0

for i in 2:nrow(paramsRes)
    
    if paramsRes.OUK[i] > (252/(1/24)/45)
    
        if paramsRes.Score[i] >= 1.25
            paramsRes.ETH_Pos[i] = -1
            paramsRes.BTC_Pos[i] = paramsRes.MacroBeta[i]   
        elseif paramsRes.Score[i] >= 0.75 && paramsRes.ETH_Pos[i-1] == -1
            paramsRes.ETH_Pos[i] = -1
            paramsRes.BTC_Pos[i] = paramsRes.MacroBeta[i]     
        end

        if paramsRes.Score[i] <= -1.25
            paramsRes.ETH_Pos[i] = 1
            paramsRes.BTC_Pos[i] = -paramsRes.MacroBeta[i]   
        elseif paramsRes.Score[i] <= -0.5 && paramsRes.ETH_Pos[i-1] == 1
            paramsRes.ETH_Pos[i] = 1
            paramsRes.BTC_Pos[i] = -paramsRes.MacroBeta[i]     
        end
    end
        
end


modelData = @transform(modelData, :NextETH= lead(:ETH, 1), :NextBTC = lead(:BTC, 1))

paramsRes = leftjoin(paramsRes, modelData[:, [:ts, :NextETH, :NextBTC]], on=:ts)

portRes = @combine(groupby(paramsRes, :ts), :Return = :NextETH .* :ETH_Pos .+ :NextBTC .* :BTC_Pos);

plot(portRes.ts, cumsum(portRes.Return))

This looks slightly better. At least it is positive at the end of the testing period.

plot(paramsRes.ts, cumsum(paramsRes.NextETH .* paramsRes.ETH_Pos), label = "ETH Component")
plot!(paramsRes.ts, cumsum(paramsRes.NextBTC .* paramsRes.BTC_Pos), label = "BTC Component")
plot!(portRes.ts, cumsum(portRes.Return), label = "Stat Arb Portfolio", legend=:topleft)

Again, the components of the portfolio seem to be ok in the ETH case, but this is generally from the overall long bias. Unlike the JPM/XLF example, there isn’t much more diversification we can add that might help. We could add in more crypto assets, or an equity/gold angle, but it becomes more of an asset class arb than something truly statistical.

Conclusion

The original paper is one of those that all quants get recommended to read, and statistical arbitrage is a concept that you probably understand in theory, but doing it in practice is another question. Hopefully, this blog post gets you up to speed with the basic concepts and how to implement them. It can be boiled down to two steps.

Model as much as you can with a simple regression.
Model what’s left over as an OU process.
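The two steps can be sketched end to end on simulated data. This is a minimal illustration rather than the backtest code: it uses the standard library’s backslash solver instead of GLM’s `lm`, and the simulated series, the `ols` helper and the parameter values are all assumptions made up for the example.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(42)
n = 500
index_ret = 0.01 .* randn(n)              # simulated index returns

# Simulated mean-reverting spread; its differences are the idiosyncratic returns
spread = zeros(n)
for t in 2:n
    spread[t] = 0.9 * spread[t-1] + 0.002 * randn()
end
stock_ret = 1.2 .* index_ret .+ [0.0; diff(spread)]

ols(X, y) = X \ y                         # hypothetical helper: least squares fit

# Step 1: model as much as you can with a simple regression
X1 = [ones(n) index_ret]
alpha, beta = ols(X1, stock_ret)
resid = stock_ret .- X1 * [alpha, beta]

# Step 2: model what's left over as an OU process via an AR(1) fit
aux = cumsum(resid)
X2 = [ones(n - 1) aux[1:end-1]]
a, b = ols(X2, aux[2:end])
eta = aux[2:end] .- X2 * [a, b]

k = -log(b) * 252                         # annualised speed, assuming daily data
m = a / (1 - b)                           # long-run mean
sigma_eq = sqrt(var(eta) / (1 - b^2))     # equilibrium standard deviation
score = -m / sigma_eq                     # the last value of aux is zero

println((beta = beta, b = b, k = k, score = score))
```

The fitted `beta` and `b` should land near the simulated values of 1.2 and 0.9, which is a useful check that the pipeline is wired up correctly before pointing it at real data.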

It can work with both high-frequency and low-frequency data, so have a look at different combinations of assets and see if you have more luck than I did backtesting.

If you do end up seeing something positive, make sure you are backtesting properly!