A Seq2Seq model for ETF forecasting

Owing to the misguided belief that I can achieve the impossible, I decided to build a model with the goal of beating the stock market.

Strap in, we’re about to get rich.

Machine learning is increasingly employed by hedge funds to help mitigate risk and identify patterns and opportunities – whether for the optimisation of algo-trading strategies, fraud detection, high-frequency trading, or sentiment analysis. Arguably the most obvious, difficult, and naïve application of fintech ML is direct stock market forecasting – which sounds like the perfect place to start.

Target

First things first, we need to decide on a security to forecast. Volatility provides opportunities, but predictable volatility is even better. We need a security that swings in response to actual, reported events, and one whose trends correlate – positively or negatively – with other stocks, our hypothesis being that wider events in the market can be used to forecast a single security. SPDR GLD seems like a reasonable option: gold is such a popular hedge against global instability that its price usually moves in the opposite direction to indexes such as the DJIA or SP500, and rises in response to global crises.

Gold price per troy ounce, in pounds, 1980–2024

Although the fintech sector generates and consumes huge amounts of data, the amount of stock market data the average Joe has access to is pretty limited unless one is willing to pay for fine-grained trading histories – and even these seldom span back further than 20 years. However, as SPDR GLD (launched in 2004) is a physically backed security, it closely mirrors the actual gold price. Therefore, rather than forecast SPDR GLD directly, we can use a scaled 2-day moving average of the price of a troy ounce of gold as our target – maximising the time frame we can train on, adding a little smoothing, and hoping this is a close enough representation of the stock.
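For concreteness, here is a minimal sketch of the target construction – assuming a CSV of daily gold prices (the file name, column names, and the exact scaling factor are illustrative, not prescribed by anything above):

```python
import pandas as pd

# Daily gold price per troy ounce (file/column names are illustrative)
gold = pd.read_csv("gold_price_daily.csv", index_col="date", parse_dates=True)

# 2-day moving average for light smoothing
smoothed = gold["price"].rolling(window=2).mean().dropna()

# GLD was designed to track roughly 1/10 of an ounce of gold, so a fixed
# scaling factor is a crude but serviceable first approximation
target = smoothed / 10.0
```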

Data

We need a list of stocks and indexes that either drive, or move with or against, the price of gold. After consulting the World Gold Council and various other sources, the following were initially chosen: precious metals prices (gold, silver, platinum, palladium), SP500, DJIA, BCOM, MSCI World, GLD, NBES, EGO, EUR/USD exchange index, Brent crude oil, CLG4, 10-year US Treasury yield, US dollar index, VanEck Gold Miners Futures, emerging markets US dollar index, Gold volatility index, and the Caldara and Iacoviello geopolitical risk index.

Publicly available data is limited, but we do have access to historical daily open, close, low, and high prices for stocks, daily precious metal prices, and daily index values. We just need to be mindful that data for the various indexes became available at different points in history:

Each stock or index vs. the date its data recording began

Feature engineering

While we could just throw in all our raw data and hope for the best, this is a tough problem to learn, and helping our model identify trends and momentum won’t hurt. There is a plethora of technical indicators that algo traders use to identify potential buy/sell positions, two of the most popular being MACD and Bollinger Bands.

MACD helps gauge momentum and potential price direction. It consists of a MACD line, usually calculated by subtracting the 26-period Exponential Moving Average (EMA) from the 12-period EMA, and a signal line, a 9-period EMA of the MACD line. The difference between these two lines gives a histogram, which we can use as a feature to help identify potential reversals in momentum.

SPDR GLD closing price (blue) and its MACD (red) and signal (green) lines during 2022. MACD–signal line crossovers can be used to identify potential buy/sell positions.
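Computing this with pandas is straightforward; a sketch (the function and column names are my own):

```python
import pandas as pd

def macd_features(close: pd.Series) -> pd.DataFrame:
    """Standard MACD: 12/26-period EMAs with a 9-period signal line."""
    ema_12 = close.ewm(span=12, adjust=False).mean()
    ema_26 = close.ewm(span=26, adjust=False).mean()
    macd = ema_12 - ema_26
    signal = macd.ewm(span=9, adjust=False).mean()
    histogram = macd - signal  # the feature we feed to the model
    return pd.DataFrame({"macd": macd, "signal": signal, "histogram": histogram})
```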

Bollinger Bands are used to gauge volatility and usually consist of three bands: a 20-day simple moving average (SMA), an upper band two standard deviations above the 20-day SMA, and a lower band two standard deviations below it. The width of the band gives an indication of volatility, while the distance between the price and the lower/upper band indicates oversold/overbought conditions. We can use both the width and the relative distances to the upper and lower bands as features.

SP500 closing price (blue) and its Bollinger Band lines (green and red), September–December 2022
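Again, a short pandas sketch (normalising by price/SMA is my choice; the post only says widths and relative distances are used):

```python
import pandas as pd

def bollinger_features(close: pd.Series, window: int = 20, k: float = 2.0) -> pd.DataFrame:
    """20-day SMA with bands k standard deviations above and below."""
    sma = close.rolling(window).mean()
    std = close.rolling(window).std()
    upper = sma + k * std
    lower = sma - k * std
    return pd.DataFrame({
        "band_width": (upper - lower) / sma,    # volatility proxy
        "dist_upper": (upper - close) / close,  # distance to upper band
        "dist_lower": (close - lower) / close,  # distance to lower band
    })
```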

Data Structure

Financial markets are networks of interdependent assets, and our data is temporally sequential. We need a logical data structure that can capture dynamic relationships between features, accommodate more data as it becomes available, incorporate diverse data sources should we wish to use alternative data, and – crucially – one that I can comfortably and quickly work with: graphs!

How we structure the graph and decide connectivity probably warrants some thought – but let's start with the lazy option: a fully connected, undirected graph. We can use one node to hold precious metals prices, a node for each closing stock price with its associated MACD histogram and Bollinger Band features, and a node for any risk indicators. We pad node features to a consistent length, as sketched below.
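A minimal sketch of how one day's graph might be assembled with torch_geometric (the helper and its arguments are my own; the post doesn't prescribe an exact layout):

```python
import itertools
import torch
import torch.nn.functional as F
from torch_geometric.data import Data

def build_daily_graph(node_features: list, feature_dim: int = 5) -> Data:
    """Assemble one day's fully connected, undirected graph.

    `node_features` holds one 1-D tensor per node (precious metals,
    each stock with its indicators, risk indices); shorter tensors
    are zero-padded to `feature_dim`.
    """
    x = torch.stack([F.pad(f, (0, feature_dim - f.numel())) for f in node_features])

    # Both directions of every pair of distinct nodes, i.e. an
    # undirected graph in PyG's edge_index convention
    n = len(node_features)
    edge_index = torch.tensor(
        list(itertools.permutations(range(n), 2)), dtype=torch.long
    ).t().contiguous()

    return Data(x=x, edge_index=edge_index)
```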

Data loading

We’re dealing with a sequence of graphs, and maintaining that sequence is crucial. Furthermore, each forecast should be made from several graphs, to capture recent events from the past n days.

We can use a sliding window of, say, 10 days (= 10 graphs), which moves one timestep forward per forecast. This level of control warrants use of the `torch_geometric.data.Dataset` class, which allows us to load in and parse one graph at a time, in sequence, and append it to an array. Although each array effectively gets minibatched by the torch_geometric DataLoader, I struggled to minibatch graphs of different sizes. However, a small modification to the `torch_geometric.loader.Collater` function fixes this:

elif isinstance(elem, Sequence) and not isinstance(elem, str):
    return [Batch.from_data_list(sequence) for sequence in batch]
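For context, the dataset itself can stay very small – something along these lines (a sketch, assuming `graphs` is the full, date-ordered list of daily Data objects):

```python
from torch_geometric.data import Dataset

class SlidingWindowDataset(Dataset):
    """Serves overlapping windows of consecutive daily graphs."""

    def __init__(self, graphs, window: int = 10):
        super().__init__()
        self.graphs = graphs
        self.window = window

    def len(self) -> int:
        return len(self.graphs) - self.window + 1

    def get(self, idx: int):
        # Each sample is a list of `window` graphs; the patched Collater
        # above handles these lists when minibatching.
        return self.graphs[idx : idx + self.window]
```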

Of course, we also need a validation set – the final 365 days in the sequence should do.

Architecture

We want a model that will not only learn high-level representations of the data but can also capture long-term dependencies; essentially, we need a feature extractor and a memory gate. We’ll use a custom graph attention network (GAT) with three sequentially connected node-attention layers, transforming node features from a dimension of 5 to a final output dimension of 128, followed by a decoder with two LSTM layers and a fully connected layer returning a tensor of length 5 to forecast 5 days ahead.
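Sketched in PyTorch, the model might look like this – the intermediate layer widths, the graph-level mean pooling, and the LSTM hidden size are my assumptions; the post only fixes the 5-in/128-out encoder dimensions and the two-layer decoder:

```python
import torch
from torch import nn
from torch_geometric.nn import GATConv, global_mean_pool

class GATEncoder(nn.Module):
    """Three stacked node-attention layers (5 -> 32 -> 64 -> 128 per node),
    mean-pooled into one 128-d vector per daily graph."""
    def __init__(self, in_dim: int = 5, out_dim: int = 128):
        super().__init__()
        self.gat1 = GATConv(in_dim, 32)
        self.gat2 = GATConv(32, 64)
        self.gat3 = GATConv(64, out_dim)

    def forward(self, x, edge_index, batch):
        x = torch.relu(self.gat1(x, edge_index))
        x = torch.relu(self.gat2(x, edge_index))
        x = self.gat3(x, edge_index)
        return global_mean_pool(x, batch)   # (num_graphs, 128)

class LSTMDecoder(nn.Module):
    """Two LSTM layers over the encoded window, with a fully connected
    head producing the 5-day-ahead forecast."""
    def __init__(self, in_dim: int = 128, hidden_dim: int = 128, horizon: int = 5):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, horizon)

    def forward(self, sequences):           # (batch, window_len, 128)
        out, _ = self.lstm(sequences)
        return self.head(out[:, -1, :])     # forecast from the last timestep
```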

Training

Our training loop is relatively simple. We iterate through each batch in the epoch, and through each sliding window in the batch. Each window is fed through the encoder in sequence, stepping one day at a time. The encoded sequences are padded and packed so they can be processed as a batch by the LSTM decoder, then unpacked and indexed to retrieve the final prediction of each sliding window – since we are forecasting 5 days ahead of the window, not 5 days ahead of each step in the window. LSTM hidden states are detached between batches, and loss is calculated against the target tensors using Mean Squared Error.
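Put together with the sketches above, one training step might look roughly like this. The `loader` and the exact shapes the collater yields are assumptions, and unlike the post, this sketch reinitialises the LSTM state each batch rather than carrying and detaching it:

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

encoder, decoder = GATEncoder(), LSTMDecoder()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

for windows, targets in loader:              # targets: (batch, 5)
    optimizer.zero_grad()

    # Encode each window day by day: one 128-d vector per daily graph
    sequences = []
    for window in windows:                   # window: list of daily Data objects
        days = [
            encoder(g.x, g.edge_index,
                    torch.zeros(g.num_nodes, dtype=torch.long)).squeeze(0)
            for g in window
        ]
        sequences.append(torch.stack(days))  # (window_len, 128)

    # Pad and pack so the LSTM can process variable-length windows as a batch
    lengths = torch.tensor([s.size(0) for s in sequences])
    packed = pack_padded_sequence(
        pad_sequence(sequences, batch_first=True),
        lengths, batch_first=True, enforce_sorted=False,
    )
    out, _ = decoder.lstm(packed)
    unpacked, _ = pad_packed_sequence(out, batch_first=True)

    # Index the last real timestep of each window: we forecast 5 days
    # ahead of the whole window, not of every step within it
    last = unpacked[torch.arange(len(sequences)), lengths - 1]
    preds = decoder.head(last)               # (batch, 5)

    loss = criterion(preds, targets)
    loss.backward()
    optimizer.step()
```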

Next steps

The road to growing money on trees is long and winding, but we’ve already put in some mileage. We have constructed our sandbox environment, collated and prepared our data, built an encoder-decoder model, written our training loop, and learnt a great deal about a sector I had no prior knowledge of. Next steps will involve training and tuning the model, addressing some limitations of the approach and the data, and setting up rapid-reporting visualisations of our forecasts so we can speed up the development process.
