Machine Learning for Volatility Trading
This is the 6th story in our “Journey to Vol Trading” series, which chronicles the evolution of our trading. Keep reading for updates.
We have previously implemented a simple algorithmic strategy for Long/Short Volatility trading, and identified Volatility ETFs such as VXX and SVIX as the most efficient products for implementing those trades. We now take the last step in our journey: building a more precise and effective trading system using Machine Learning.
Machine Learning in Finance
It is helpful to first define what ML is and is not. Given the huge amount of buzz around AI recently, we believe it is important to distinguish between the two. Machine Learning takes large sets of data and builds models that make accurate predictions on new data. AI, on the other hand, refers to much more complex systems that take actions based on extensive unstructured training information. In finance, almost every problem is better suited to machine learning, whereas AI works on tasks such as teaching a robot how to walk. Examples of ML in finance include predicting fraud on credit card charges, predicting whether a loan will be paid back, or, as in our case, predicting the direction of future prices.
ML System Structures
The goal of any ML system is to take a large set of data and find patterns that are not readily apparent to a human and thus cannot be modeled by simpler methods such as algorithmic formulas. To do so, the basic structure of any ML system is Input Data, a Learning Mechanism, and Output (Prediction) Data. Many different ML structures have been proposed, with the following being the most widely used today:
- Support Vector Machines (SVM): These systems take data and, using geometric formulas (lines, quadratic or polynomial curves, etc.), separate it in many dimensions. This works well in theory, but in practice SVMs consume far too much memory to be effective on large datasets, and they were never optimized for parallel processing on GPUs because other methods became more popular. Given that our VIX Futures data reaches back to 2012 (and VXX data reaches back even further, to 2009), and we are evaluating data points at 20-second intervals, we are working with millions of rows of data, which immediately makes SVMs unfeasible.
- Neural Networks (NNs): A neural network’s Learning Mechanism at its most basic is a neuron, which outputs a value between 0 and 1. Neurons are grouped in layers, and NNs usually have multiple layers. NNs can be challenging to understand: during training, each neuron learns a portion of the “pattern,” but the result is effectively a black-box model, and there is no easy way to understand what each neuron has learned. Layers in a NN can be a bit more intuitive, as they tend to learn more global types of information. For example, in a NN that processes images, one layer may create a black-and-white “stencil” of an image, while another takes that stencil and detects specific features in it to distinguish between a person and an animal. In tasks such as predicting financial markets, the effect of each neuron or layer becomes even more obscure. We initially tested various types of NNs, including LSTMs, without achieving significant predictive accuracy. Our view is that NNs work well at recognizing complex patterns that are hard to model mathematically, but in a price prediction task they underperform other, more efficient models. The black-box nature of NNs also creates problems, because one cannot easily map what the NN is predicting in various scenarios onto actual trading knowledge.
- Decision Trees/Forests/Boosters: Decision trees are similar to SVMs in that they split the data along decision boundaries, but they do so through multiple binary choices. In fact, our original algorithmic trading system was effectively a simple decision tree (short volatility if the ratio is below 0.96, long if it is above 1.04, plus various other binary decisions to determine the correct trade action). Various improvements have been made to decision trees so they can solve a range of more complex tasks. One of the most popular implementations is called XGBoost. Essentially, it is a combination of hundreds of decision trees, each of which checks different combinations of information from the data points (features) to make an overall prediction. The details of XGBoost are beyond the scope of this story, but it is well worth reading up on. XGBoost tends to work very well on classification tasks and less well on regression (predicting a number) tasks. It is also very sensitive to feature engineering, which means you need some idea of what data is important for your problem and must prepare it so the model can interpret it easily. In our experimentation, it quickly became clear that XGBoost would be the optimal type of model for our volatility trading.
The difference between decision boundaries for SVM (a single equation), NN (a complex pattern, which cannot be represented mathematically) and XGBoost (several binary decisions) can be seen here:
Training an XGBoost model to predict volatility
Now that we have identified XGBoost as the preferred structure, we first need to decide how we want the model to make its predictions. In Machine Learning, there are really two types of predictive task: regression and classification.
Regression vs. Classification in Trading
Most of the work in finance has been focused on regression. That is intuitive, because we want to predict the future price of a stock, future, or other financial instrument. It would seem reasonable that if VXX is trading at $15 today, we would want to know whether it will next trade at $15.5 or $18. However, this approach actually misses the mark. First of all, regression tasks are not good at predicting prices far in advance. If we are trading intraday, on data at 20-second intervals, it would be exceedingly hard to train a model to give us an estimate of future prices even a day ahead, much less several days or weeks. The best a regression model can do for price prediction is to predict the price at the next interval (i.e., in the next 20 seconds) or a few intervals ahead. But is this actually useful? Prices almost always follow trends, so the VXX ETF trading at $15 right now is likely to trade very close to $15 over the next few 20-second intervals. In short, the actual predicted price becomes meaningless. Even if we set the prediction task to predict a price 1,000 intervals from now (which would be about 2 weeks), the model has a hard time actually making a prediction, because it cannot really “connect the dots” between the current price and one so far out. This applies to all of the Machine Learning structures we tried, but it was especially apparent with XGBoost (which is generally weaker at regression tasks anyway).
In our experience, what really matters for intraday trading is whether the price a certain number of intervals ahead will be higher or lower. This is a simple binary classification task that XGBoost excels at. The question then becomes what a meaningful horizon is. In our case, looking 1,000 × 20-second intervals ahead seemed to work, though that is a somewhat arbitrary number. It is equivalent to about two trading weeks and provides a meaningful answer to our key question: “What direction is volatility going in the near term?”
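To make this concrete, here is a minimal sketch (in Python, using pandas) of how such a directional label could be built from a 20-second price series. The 1,000-interval horizon and the 0/1 convention (1 = price lower at the horizon, i.e. favoring short vol) follow the discussion above, but the function and column handling are illustrative assumptions rather than our production code.

```python
import pandas as pd

def make_direction_labels(prices: pd.Series, horizon: int = 1000) -> pd.Series:
    """Binary target: 1 if the price `horizon` intervals ahead is lower (short vol),
    0 if it is higher (long vol). Rows without a future price are dropped."""
    future = prices.shift(-horizon)           # price 1,000 intervals (~2 weeks) ahead
    labels = (future < prices).astype(int)    # 1 = lower later, 0 = higher later
    return labels.iloc[:-horizon]             # last `horizon` rows have no future price

# Hypothetical usage on a frame of 20-second bars with a "close" column:
# y = make_direction_labels(bars["close"])
```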
Feature Engineering for a Volatility Predictor
Given our previous experience in volatility trading, we have some intuition of what drives the prices of VIX Futures:
- The S&P 500 index itself, and specifically its movements, such as slow upward trends or big moves down
- The VIX Index, and its mean reversion (i.e., whether it is close to its historical average of 10–15, or whether it has risen in a spike)
- The Weighted Ratio, discussed in our initial algo trading system, which models the contango or backwardation of the VIX Futures compared to the VIX index itself.
- The price of the Volatility Futures ETFs themselves (in our case VXX, since it has the longest historical data) seems like a good input as well, since ultimately we are trying to predict their movement and trade it.
Now that we have our initial set of data, it’s useful to think about how the model can interpret it. We can look at two sub-types of data:
- The actual absolute price, which can be represented by moving averages of various durations so that it is not overly sensitive to small movements
- The difference between various sets of moving averages, which provides a good indication of how quickly the price is currently moving
Let’s look at each type of data and its sub-type:
- S&P 500: It is immediately clear that the absolute price of the index is not relevant. The index has historically risen over time, so the model will not learn anything by comparing the current price to a price from 10 years ago. On the other hand, the difference (delta) between moving averages of various durations, expressed as a percentage, can intuitively be very helpful. For example, if the index has fallen 5% (the delta between a short-term moving average and a longer-term one), we would expect volatility to rise. This is true whether the index is at $5,000 today or at $10,000 at some point in the future.
- VIX: Unlike the S&P 500, the VIX is a mean-reverting volatility measure: it tends to stay around a certain level when the market is calm and then spikes during periods of increased volatility. Therefore, the absolute price of the VIX and its moving averages are useful data. In addition, deltas between the MAs are also helpful. In this case we choose to use the absolute delta instead of a percentage, because there is a noticeable difference between a change in the VIX from 10 to 15 (a relatively normal, often insignificant short-term spike) vs. 20 to 30 (the same percentage change, but larger in absolute terms and more aligned with longer-term volatility).
- Weighted Ratio: This ratio behaves similarly to the VIX, with a long-term tendency to hover around a specific level and noticeable spikes indicating long-volatility conditions. Since it is already a ratio, its deltas are expressed in absolute terms rather than percentages, for the same reason as the VIX.
- VXX: The long-term trend of VXX is obviously down, the opposite of the S&P 500 index. The same logic therefore applies: we ignore the absolute price and its moving averages, because current values of VXX do not give us any useful information. As with the index, we instead focus on the percentage delta between sets of moving averages, which indicates how quickly VXX is moving in a particular direction. For example, a large positive delta between a short-term MA and a long-term one indicates that the VXX price is currently spiking, which corresponds to a long volatility trade (buying VXX).
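As a rough illustration of the inputs described above, the sketch below builds percentage deltas for the S&P 500 and VXX, and absolute levels and deltas for the VIX and the Weighted Ratio. The window lengths and column names are placeholders; the actual windows we use are deliberately not disclosed.

```python
import pandas as pd

SHORT, LONG = 180, 20000  # e.g. ~1 hour vs. many weeks of 20-second bars (illustrative only)

def pct_delta(s: pd.Series, short: int = SHORT, long: int = LONG) -> pd.Series:
    """Percentage difference between a short and a long moving average (S&P 500, VXX)."""
    return s.rolling(short).mean() / s.rolling(long).mean() - 1.0

def abs_delta(s: pd.Series, short: int = SHORT, long: int = LONG) -> pd.Series:
    """Absolute difference between moving averages (VIX, Weighted Ratio)."""
    return s.rolling(short).mean() - s.rolling(long).mean()

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    # df is assumed to hold aligned 20-second columns: "spx", "vix", "ratio", "vxx"
    return pd.DataFrame({
        "spx_ma_delta_pct": pct_delta(df["spx"]),   # absolute index level ignored
        "vix_ma": df["vix"].rolling(SHORT).mean(),  # absolute VIX level is meaningful
        "vix_ma_delta": abs_delta(df["vix"]),
        "ratio_ma": df["ratio"].rolling(SHORT).mean(),
        "ratio_ma_delta": abs_delta(df["ratio"]),
        "vxx_ma_delta_pct": pct_delta(df["vxx"]),   # absolute VXX price ignored
    })
```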
The specific moving averages and deltas are something that requires fine-tuning and experimentation. We deliberately leave those details out, so as to encourage your own journey and system. In general, we recommend looking at moving averages that represent short-term prices (on the scale of less than an hour) as well as much longer-term ones that average prices over nearly a year. For some intuition on how these moving averages relate to prices, and thus how to design them, we recommend the excellent work of Chris Ciovacco, a licensed Investment Advisor who uses similar concepts in his own trading and provides weekly market updates with excellent illustrations. Our own work was heavily influenced by Chris, and our journey would not have happened without his inspiration and insight.
Loss Function
With any ML model, the importance of the loss function cannot be overstated, because it directly determines how the model learns to make the best predictions. For binary classifiers, the most common choices are:
- Accuracy: This simply measures the % of times that the model predicted the right answer.
- AUC: Area Under the Curve, which measures the trade-off between the true positive rate and the false positive rate. It is generally considered best practice, as it handles imbalanced datasets better (which is the case with long/short vol trading, since we already know that the majority of the time, close to 80%+, we want to be short vol).
- Custom Loss Function: Any formula that can be modeled mathematically can serve as a loss function. Ultimately, in trading, actual ROI is the result we most want to optimize. For example, in vol trading it is possible that during long stretches of short vol there would have been small, intermittent opportunities for a quick long vol trade. A model evaluated only on Accuracy or even AUC would be penalized for missing those trades. If we instead focus on ROI, those trades are not significantly accretive to performance, so a model that misses them is still quite good.
By now it should be clear that for a model that makes predictions for trading, the loss function should be the direct, real-world application of the model: a backtested trading scenario making actual trades. However, two models that produce the same final ROI can still differ, for example when one reaches it with a much smoother equity curve and shallower drawdowns.
We capture that advantage by creating our own “Area Under Equity Curve” function, referred to as “AUEC” (not to be confused with the default AUC loss metric, which looks at prediction accuracy rather than the equity curve of the model’s hypothetical trading). The area can be calculated numerically using Simpson’s rule.
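For illustration, a minimal AUEC implementation might look like the sketch below, which applies SciPy’s Simpson’s-rule integrator to a normalized equity curve. The normalization choices (start every curve at 1.0, map the test period onto [0, 1]) are assumptions made for this example.

```python
import numpy as np
from scipy.integrate import simpson

def area_under_equity_curve(equity: np.ndarray) -> float:
    """Area under a backtested equity curve, normalized so that curves of
    different lengths and starting capital are comparable."""
    equity = np.asarray(equity, dtype=float)
    normalized = equity / equity[0]               # start every curve at 1.0
    x = np.linspace(0.0, 1.0, len(normalized))    # map the test period onto [0, 1]
    return float(simpson(normalized, x=x))

# Two curves ending at the same ROI: the smoother, earlier-compounding one
# scores a higher AUEC and is the one the optimizer should prefer.
```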
With our loss function finalized, we can confidently teach XGBoost to prioritize what we want: the highest ROI achieved in the smoothest way possible. In finance, this corresponds directly to a high Sortino Ratio (the best ROI with the least downside volatility).
Regularization and Overfitting
The challenge with XGBoost is that it achieves very high accuracy on the data on which it is trained. This is because, ultimately, it can find a combination of decision trees that properly classifies almost everything it has seen. Financial markets have patterns, but it would be naive to assume they repeat in exactly the same way. Best practice, of course, dictates that we split the data we have into train and test sets, often along the recommended 80/20 ratio.
In our particular case, this poses a huge overfitting problem, because the standard train/test split assigns data points randomly. Two data points that are very close together in time can end up separated into train and test. Since we are using 20-second data, each data point’s values will be very close to those of its neighbors, so if one is in the train data, its neighbor in the test data is essentially meaningless: the model has effectively seen it already.
The solution is to create train/test splits not by random sampling but by taking large blocks of time. In our case, for example, we use the period from 2009 to 2022 as our train data and 2023 to 2024 as the test data. The data points in the test set are never too close to the train set, which means that if the model performs well on the test data, it has learned decision patterns general enough to apply to future market trends.
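A minimal sketch of such a time-block split is shown below; the cutoff date and the assumption that the features and labels carry a DatetimeIndex are illustrative.

```python
import pandas as pd

def time_block_split(X: pd.DataFrame, y: pd.Series, cutoff: str = "2023-01-01"):
    """Split by a calendar cutoff instead of random sampling, so that no test
    point sits 20 seconds away from a training point."""
    cut = pd.Timestamp(cutoff)
    train_mask = X.index < cut
    return X[train_mask], X[~train_mask], y[train_mask], y[~train_mask]

# Contrast with sklearn.model_selection.train_test_split(..., shuffle=True),
# which would scatter near-identical 20-second neighbors across both sets.
```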
Regularization: When we use the above train/test split, it becomes obvious that the default XGBoost regularization settings fail miserably. Our initial tests showed returns of 1000%+ on train data, while test ROIs underperformed our original ratio-based algorithmic system. To cure this overfitting, we tune several hyperparameters.
Frequently tuned hyperparameters
- n_estimators: specifies the number of decision trees to be boosted. If n_estimators = 1, only one tree is generated, so no boosting is at work. The default value is 100, but you can play with this number for optimal performance.
- subsample: the subsample ratio of the training sample. subsample = 0.5 means that 50% of the training data is used prior to growing each tree. The value can be any fraction; the default value is 1.
- max_depth: limits how deep each tree can grow. The default value is 6, but you can try other values if overfitting is an issue in your model.
- learning_rate (alias: eta): a regularization parameter that shrinks feature weights in each boosting step. The default value is 0.3, but people generally tune with values such as 0.01, 0.1, 0.2, etc.
- gamma (alias: min_split_loss): another regularization parameter, used for tree pruning. It specifies the minimum loss reduction required to grow a tree. The default value is 0.
- reg_alpha (alias: alpha): the L1 regularization parameter; increasing its value makes the model more conservative. Default is 0.
- reg_lambda (alias: lambda): the L2 regularization parameter; increasing its value also makes the model more conservative. Default is 1.
You can learn more about each from the excellent guide below:
Source: https://towardsdatascience.com/a-guide-to-xgboost-hyperparameters-87980c7f44a9
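For illustration only, a deliberately conservative configuration using the scikit-learn-style XGBClassifier might look like the sketch below; the values are placeholders that push each of the knobs above in the more regularized direction, not our production settings.

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=300,    # more, but shallower, trees
    max_depth=3,         # down from the default of 6
    learning_rate=0.05,  # shrink each boosting step
    subsample=0.7,       # grow each tree on 70% of the rows
    gamma=1.0,           # require a minimum loss reduction to split
    reg_alpha=0.5,       # L1 penalty
    reg_lambda=2.0,      # L2 penalty
)
# model.fit(X_train, y_train)   # X_train / y_train from the time-block split above
```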
Manually tuning each hyperparameter is time-consuming. Fortunately, a very effective package called Hyperopt allows you to automate the tuning process, repeatedly training XGBoost models with different combinations of hyperparameters within defined ranges to find the best one (defined as the combination that provides the best ROI on the unseen test data).
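A hedged sketch of what such a Hyperopt search could look like is shown below. The search ranges, the backtest_auec() scoring helper, and the X_train/X_test/y_train variables are assumptions for illustration, not our actual setup.

```python
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from xgboost import XGBClassifier

space = {
    "n_estimators":  hp.quniform("n_estimators", 100, 1000, 50),
    "max_depth":     hp.quniform("max_depth", 2, 8, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, -1),  # roughly 0.007 to 0.37
    "subsample":     hp.uniform("subsample", 0.5, 1.0),
    "gamma":         hp.uniform("gamma", 0.0, 5.0),
    "reg_alpha":     hp.uniform("reg_alpha", 0.0, 2.0),
    "reg_lambda":    hp.uniform("reg_lambda", 0.0, 5.0),
}

def objective(params):
    params["n_estimators"] = int(params["n_estimators"])
    params["max_depth"] = int(params["max_depth"])
    model = XGBClassifier(**params).fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    # Hyperopt minimizes, so return the negative of the backtested score
    return {"loss": -backtest_auec(proba, X_test), "status": STATUS_OK}

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=Trials())
```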
Smoothing Predictions
When training a model and using its predictions to make hypothetical trades, we invariably find periods where the model flip-flops between predictions. This is true even for a good model with high accuracy, but in practice it does not make for a good trading system: because of slippage, and because trends normally do not reverse back and forth, such entry/exit trades tend to lose small amounts of money every time and drag down returns (compared to simply staying in the longer-term trend and taking small, unrealized losses intermittently until the trend resumes).
To this end, we find that smoothing the predictions themselves with a moving average works very well. We then set thresholds for when we place a Long Vol (buy VXX), Flat (exit all positions), or Short Vol (buy SVIX) trade. Again, the specific moving averages we use will differ from yours, but significant smoothing is necessary. For example, if using a 20-second period for individual data points, you would want to average at least 500 periods of predictions before making trades. As for thresholds, if we take a typical classifier output (0 being strongly Long Vol and 1 being strongly Short Vol), dividing the range into thirds works well. For example, you could enter a Short Vol position when the prediction moving average is > 0.66 and a Long Vol position when it is < 0.33, with values in between indicating that no position should be taken. Making this “flat” zone too small (e.g., > 0.51 short, < 0.49 long) will again result in too much flip-flopping and defeat the purpose of using moving averages. On the other hand, making it too large (e.g., only entering Long Vol close to 0 and Short Vol close to 1) will result in missing significant moves in either direction.
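A minimal sketch of this smoothing-plus-thresholds step might look like the following; the 500-period window and the one-third thresholds mirror the example above and are starting points rather than recommendations.

```python
import pandas as pd

def predictions_to_positions(proba: pd.Series,
                             window: int = 500,
                             short_thr: float = 0.66,
                             long_thr: float = 0.33) -> pd.Series:
    """Smooth raw classifier probabilities (1 = Short Vol, 0 = Long Vol) and
    map them to a position label."""
    smoothed = proba.rolling(window).mean()
    position = pd.Series("flat", index=proba.index)   # default: no position
    position[smoothed > short_thr] = "short_vol"      # e.g. buy SVIX
    position[smoothed < long_thr] = "long_vol"        # e.g. buy VXX
    return position
```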
Ensemble Model
During our evaluation of some of the top-performing models, we noticed that among models with similar AUEC, the paths of the equity curves can differ. This is actually beneficial, because it suggests that different models can take advantage of market conditions that others cannot, or avoid losses when other models underperform. Intuitively, we would expect that taking, for example, 3 models that perform most differently from each other and averaging their predictions will result in a smoother equity curve.
When we apply this method and create an ensemble model, we indeed see an improvement in both ROI and AUEC. This appears to be because the ensemble model produces even “smoother” predictions that avoid some of the small drawdowns of the sub-models while still capturing the main trends.
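A simple way to sketch such an ensemble is to average the predicted probabilities of a few trained models, as below; the model variables are placeholders for already-trained classifiers chosen for having the least similar equity curves.

```python
import numpy as np

def ensemble_proba(models, X):
    """Average the predicted probabilities of several trained classifiers."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

# blended = ensemble_proba([model_a, model_b, model_c], X_test)
# The blended signal is then smoothed and thresholded exactly as described above.
```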
Performance of our XGBoost based Volatility Predictor
Backtested performance on the training data is not really relevant: due to XGBoost’s excellent accuracy, it is a bit “too good” to be realistic, even when maximizing regularization, early stopping, and other methods to reduce overfitting. Performance during the “test” (held-out data) period (January 2022-February 2024) comes in at a very encouraging 100+%*. We also have some intermediate live history, based on early iterations of the XGBoost Volatility Predictor (before it became an ensemble model), which started around October 2022 and was updated with incrementally better (more generalized) models through May 2023. Those results were consistent with those models’ expectations. It is therefore reasonable to believe that the current model’s predicted “test” performance will similarly be matched by live results.
Of course, there is still some “overfitting” bias even on the held-out data, because we picked 3 models that each performed quite well on that data, which is its own form of overfitting. Therefore, it really only makes sense to talk about one measure of performance: real-time performance on data that the model has never seen and was never evaluated on. That, of course, is an ongoing study which will require at least 5 years of live results to share.
This concludes our series “Journey to Vol Trading”. We hope you enjoyed the information provided and that it inspired your own interest in trading, machine learning and volatility. If you have any questions, please feel free to leave comments below and we will be happy to answer!
********** As always, we make it very clear that this is not investment advice, none of us are allowed to provide investment advice, this is research work product only and any trading decisions should be made on your own based on your own research and trade ideas. None of the performance discussed here is a guarantee of any future performance. **********