“Machine Learning: The concept that a computer program can learn and adapt to new data without human interference.” (Investopedia)
Despite this relatively dry-sounding definition, public awareness of this subfield of computer science has grown significantly in the past few years. What is the reason for this surge in interest? Perhaps the answer lies in how advances in the field have started to capture the public imagination. Machine learning is integral to internet recommendation engines, driverless cars and image recognition, and it underpinned Google DeepMind’s AlphaGo in its victory over top Go professional Lee Sedol; that feat is regarded as a landmark in the history of artificial intelligence and something many had predicted was decades away. Google DeepMind is now beginning to develop health screening tools through its Streams application.
Being a data scientist now appears to have become ‘cool’ and commands sky-high salaries, and a range of businesses are hiring to help them crunch the huge amounts of data the modern age is creating. The hedge fund industry has been moving in the same direction, with both quantitative and discretionary managers often referring to the use of machine learning techniques and the analysis of ‘big data’ in their investment process. At Aurum Research Limited (“Aurum”), we are accustomed to hearing the marketing ‘buzzword’ of the day; so is there really a revolution in systematic investing driven by these developments, or is it just marketing hype?
Are these managers really using machine learning? Do the managers have the requisite skillset and experience in the area and do they have a record of applying the concepts to financial markets? Is anyone actually making money from big data? Finally, just how should an allocator to hedge funds approach this new area?
An Introduction to Machine Learning in Hedge Funds
Historically, machine learning was often thought of as synonymous with ‘data mining’, a dirty word for many practitioners, who instead preferred a scientific, theory-driven approach in which a hypothesis about how one variable should affect another was tested empirically. Letting a computer loose to search for patterns in data, by contrast, was considered to lead to models and/or conclusions that were spurious and overfitted to the data. Overfitting occurs when a model developed on the training data is overly complex and consequently has poor predictive performance out of sample. For many years it was the hypothesis-driven approach that enjoyed the most success and was the most common among quantitative hedge funds. Today, however, thanks to advances in computing power and more rigorous statistical testing, we are seeing increased reference to purely data-driven approaches.
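The danger of overfitting can be sketched with a toy example (all numbers are invented for illustration): a model that simply memorises its training data, here a one-nearest-neighbour rule, achieves zero in-sample error yet predicts worse out of sample than a plain linear fit.

```python
# Toy illustration of overfitting (hypothetical data): a "memorising" model
# has zero training error but worse out-of-sample error than a simple linear fit.

train_x = [0, 1, 2, 3, 4, 5, 6, 7]
train_noise = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4]
train_y = [2 * x + 1 + e for x, e in zip(train_x, train_noise)]   # true line: y = 2x + 1

test_x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
test_noise = [-0.2, 0.1, 0.3, -0.1, 0.2, -0.3, 0.1]
test_y = [2 * x + 1 + e for x, e in zip(test_x, test_noise)]

def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return lambda x: a + b * x

def knn_predict(x, xs, ys):
    """1-nearest-neighbour: return the y of the closest training x (memorisation)."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
    return ys[i]

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

linear = fit_linear(train_x, train_y)
memoriser = lambda x: knn_predict(x, train_x, train_y)

knn_train_mse = mse(memoriser, train_x, train_y)   # exactly 0.0: perfect in sample
knn_test_mse = mse(memoriser, test_x, test_y)      # much worse out of sample
linear_test_mse = mse(linear, test_x, test_y)      # simple model generalises better
```

The memoriser looks flawless on the data it was built from; only the out-of-sample comparison reveals that the simpler model is the better forecaster.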
Classical machine learning can be split into supervised and unsupervised learning. Very roughly, supervised algorithms learn from labelled examples: they either fit a function that forecasts a target variable (regression) or assign observations to pre-defined categories (classification). Unsupervised algorithms, by contrast, are used more for explaining or describing data than for forecasting; in clustering, for example, the groups into which the data set is split are not pre-defined but are discovered from the data itself.
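The distinction can be made concrete with a minimal sketch on toy data: the supervised method is handed the labels, while the unsupervised method (a bare-bones 1-D k-means) must discover the grouping on its own.

```python
# Supervised vs unsupervised learning on invented 1-D data.

data   = [0.5, 1.0, 1.5, 9.0, 9.5, 10.0]
labels = ["low", "low", "low", "high", "high", "high"]  # known only to the supervised method

def centroid(points):
    return sum(points) / len(points)

# Supervised: nearest-centroid classification using the given labels.
groups = {lab: [x for x, l in zip(data, labels) if l == lab] for lab in set(labels)}
centroids = {lab: centroid(pts) for lab, pts in groups.items()}

def classify(x):
    """Assign x to the labelled group with the nearest centroid."""
    return min(centroids, key=lambda lab: abs(x - centroids[lab]))

# Unsupervised: k-means with k=2 discovers two clusters without seeing labels.
def kmeans_1d(points, c1, c2, iters=10):
    for _ in range(iters):
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = centroid(a), centroid(b)
    return sorted([c1, c2])

found_centres = kmeans_1d(data, data[0], data[-1])   # converges to [1.0, 9.5]
```

Both methods end up with the same two groups here, but only because the toy data separates so cleanly; in general the unsupervised clusters need not match any pre-conceived categories.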
Deep learning stacks multiple layers of learning on top of one another: early layers are trained to solve simple sub-tasks, and later layers combine those solutions into progressively more complex representations. This method is closer to how human intelligence works and is potentially an appropriate technique for dealing with complicated concepts and patterns.
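The layered idea can be shown with a hand-wired sketch (the weights here are set by hand rather than learned, purely to illustrate the composition): each first-layer unit solves an easy sub-task, and the second layer combines them into XOR, a function no single layer of this kind can compute alone.

```python
# Hand-wired two-layer network: easy sub-tasks composed into a harder one.

def step(z):
    return 1 if z >= 0 else 0

def neuron(inputs, weights, bias):
    """A single threshold unit: weighted sum plus bias, then a step."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor(a, b):
    h1 = neuron([a, b], [1, 1], -0.5)       # layer 1, easy task: OR
    h2 = neuron([a, b], [-1, -1], 1.5)      # layer 1, easy task: NAND
    return neuron([h1, h2], [1, 1], -1.5)   # layer 2: AND of the two sub-results

truth_table = {(a, b): xor(a, b) for a in (0, 1) for b in (0, 1)}
```

In real deep learning the weights of every layer are learned from data rather than wired by hand, but the principle of simple features feeding more complex ones is the same.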
Within each of these areas there are different techniques. In addition, there are further techniques that focus on how to handle data to minimise the risk of overfitting, with geeky-sounding names such as bagging, shrinkage estimators and cross-validation. Overall, the area represents a very large potential toolbox of statistical approaches, where different techniques are more appropriate for different types of task. However, given the increased risk of overfitting in machine learning compared with more traditional quantitative methods, such techniques must be used with caution.
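One of those safeguards, k-fold cross-validation, is simple enough to sketch in full (the data and the two candidate models are toy assumptions): the data is split into k folds, each model is fitted on k-1 folds and scored on the held-out fold, and the held-out scores are averaged.

```python
# Minimal k-fold cross-validation on invented trending data.

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in test]
        yield train, test

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.2, 2.9, 5.1, 7.0, 9.1, 10.8, 13.2, 14.9]   # roughly y = 2x + 1

def fit_mean(train_x, train_y):
    m = sum(train_y) / len(train_y)
    return lambda x: m                               # constant model ignores x

def fit_line(train_x, train_y):
    n = len(train_x)
    xb, yb = sum(train_x) / n, sum(train_y) / n
    b = (sum((x - xb) * (y - yb) for x, y in zip(train_x, train_y))
         / sum((x - xb) ** 2 for x in train_x))
    a = yb - b * xb
    return lambda x: a + b * x

def cv_mse(fit, k=4):
    """Average held-out mean-squared error over k folds."""
    errs = []
    for tr, te in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in tr], [ys[i] for i in tr])
        errs.append(sum((model(xs[i]) - ys[i]) ** 2 for i in te) / len(te))
    return sum(errs) / k

mean_cv = cv_mse(fit_mean)
line_cv = cv_mse(fit_line)
# the held-out scores correctly prefer the linear model on trending data
```

Crucially, every score is computed on data the model never saw during fitting, which is what makes the comparison an honest guard against overfitting.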
Machine Learning’s Growth in Finance
With the recent rise in awareness of machine learning and some of its applications, it may come as a surprise that machine learning more broadly isn’t particularly new at all, with the development of the first neural networks dating back to the 1950s. This raises the question: why have so many investment managers only just started utilising these techniques?
Part of the reason is the huge advance in data processing power (see diagram below), which has gone hand in hand with falling costs. This has made a number of highly data-intensive machine learning approaches far more accessible, and has also led to new participants setting up efforts in this area. A further reason is simply that, as much as the computer science community has been discussing developments in this field for some years, it is only very recently that the academic community has written papers on these techniques and their application in the world of finance. Essentially, the investment community has been slow, and it now appears to be catch-up time.
The Exponential Growth of Computing Power
The growth in ‘big data’ – data sets so large or complex that analysis has historically been constrained by computing power limitations and the inadequacy of traditional software – has clearly also been a reason for the surge in use of machine learning techniques. Indeed, the terms big data and machine learning are often used side by side on the presumption that, if one is being used, so is the other. Today there are many data vendors specialising in taking unstructured/alternative data – data not organised in a pre-defined manner, such as text, social media posts and satellite images – and using elements of machine learning to attempt to make sense of it. This processed data is then marketed to both discretionary and quantitative investment managers. It is clear to us that many of the fund managers we deal with are indirectly incorporating machine learning techniques when buying these data sets. This does not necessarily mean, however, that the manager has any skillset in machine learning.
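A deliberately crude sketch of what “making sense of” unstructured data means in practice: turning free text into a structured, numeric input. The word lists and headlines below are invented, and real vendors use far richer models than this bag-of-words count.

```python
# Toy conversion of unstructured text into a structured sentiment score.

POSITIVE = {"beats", "strong", "upgrade", "growth"}
NEGATIVE = {"misses", "weak", "downgrade", "loss"}

def sentiment(headline):
    """Count positive words minus negative words in a headline."""
    words = headline.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

headlines = [
    "Acme beats estimates on strong growth",
    "Acme misses revenue targets and a downgrade follows",
]
scores = [sentiment(h) for h in headlines]   # structured output: one number per headline
```

The point is only that the output, a number per headline, is something a quantitative process can consume, whereas the raw text is not.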
What may also be driving the increased reference to machine learning within finance is simply the perceived sexiness the buzzword adds to a manager’s sales pitch. One must be careful when evaluating a manager that proclaims, “We are using machine learning techniques”: what does this actually mean? There is some ambiguity in defining what is, and isn’t, a machine learning technique. For example, a simple linear regression model may be described as a supervised learning technique (a subset of machine learning), and many other techniques long used in statistics are also part of the machine learning toolkit; similarly, some machine learning techniques have their origins in signal processing, a branch of electrical engineering. This difficulty of definition has clearly been a major factor in allowing some managers to claim that their “portfolio construction model is now based on machine learning” while in reality using the same statistical technique they have used for many years. This is not to suggest that these managers are deliberately trying to mislead potential investors, but there is quite a chasm between someone using a technique that falls somewhere within the realm of machine learning and someone truly utilising these methods throughout their investment process while also experimenting with new techniques.
What are Managers using Machine Learning for?
There is a clear and growing trend of managers purchasing unstructured data sets that have already been exposed to machine learning techniques by the data vendors to help get the data into a more usable format. At the same time, other quantitative managers with larger infrastructures are using machine learning to process raw unstructured data sets themselves. Additionally, a recent Barclays survey[1] found that managers were increasingly using machine learning to clean traditional data sources, such as price and volume. The next most regular use of machine learning appears to be in execution systems, particularly those that are relatively high frequency.
The range of different machine learning techniques that can be applied to different parts of the investment process is fascinating. This list represents just a fraction of the potential techniques/applications:
- Principal component analysis (another example of the crossover between statistics and machine learning, and one that a range of managers have referred to for many years): used in risk management to help decipher the factors that explain a portfolio’s returns
- Kalman filters (from the world of signal processing): help adjust signal weights dynamically in response to changing market conditions
- Hidden Markov Models (also from signal processing): more useful when there is a sudden regime switch rather than a gradual change in the market[2]
- Boosting: seeks to take a weak algorithm and progressively improve its predictive power. This may be appropriate for the alpha research process
- Decision Trees/Random Forests: a decision tree estimates the probability of a future event through a sequence of simple, data-driven splits, while a random forest averages many such trees to reduce overfitting. This may be appropriate for the alpha research process
- Deep learning: seems to be more appropriate for use with unstructured data, though some managers have discussed building systems nearer what may be seen as an artificial intelligence approach. Whether deep learning can be effective for financial time series analysis is currently a large debate
- Reinforcement learning: this technique is very useful when one’s own actions affect the outcome, which makes it particularly relevant to execution models and very short-term alpha signals
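Of the techniques above, the Kalman filter is perhaps the simplest to sketch in its scalar form. The filter tracks an unobserved level from noisy observations, balancing its prior estimate against each new data point according to their relative uncertainties; the noise parameters and observations below are illustrative assumptions, not a production configuration.

```python
# Scalar Kalman filter: track a hidden level from noisy observations.

def kalman_1d(observations, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """q: process noise variance, r: observation noise variance,
    x0/p0: initial estimate and its variance."""
    x, p = x0, p0
    estimates = []
    for z in observations:
        p = p + q                 # predict: uncertainty grows between observations
        k = p / (p + r)           # Kalman gain: how much to trust the new observation
        x = x + k * (z - x)       # update the estimate toward the observation
        p = (1 - k) * p           # uncertainty shrinks after the update
        estimates.append(x)
    return estimates

obs = [1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0]   # noisy readings of a level near 1.0
est = kalman_1d(obs)   # estimates converge toward the true level
```

In a signal-weighting context the hidden level might be a signal’s time-varying effectiveness; the same predict/update recursion applies, just with more state.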
We have generally observed that managers doing genuine alpha research via machine learning methods have had the most success using traditional sources such as price and volume data. Whilst we have seen numerous presentations referencing the “growth of unstructured data sets”, so far we have not seen much evidence of managers successfully building strategies to exploit this data in a meaningful way. There is clearly, however, a significant amount of research dedicated to exploring these new data sources. Some highly specialised managers (for example, some discretionary or quantitative commodity or equity sector specialists) may have found value in specific new unstructured data sets; however, we are yet to see the investment industry successfully exploiting unstructured data in a broader context.
One should make a distinction between those funds that occasionally use a machine learning technique versus those strategies that incorporate them throughout the investment and alpha generation process. Indeed, in our experience very few funds are prepared to use machine learning in their signal generation, with this toolbox more commonly utilised for more task-specific purposes.
What should an investor expect from a machine learning investment? And what should an investor be wary of?
Our first-hand experience of analysing machine learning strategies has taught us that one of their most notable characteristics is how low the correlation is between these funds and other, more ‘traditional’ quantitative funds, including ‘price mean-reversion’ and ‘trend/momentum’ based strategies. As stated earlier, machine learning is a giant toolbox of techniques, and even when the same technique is applied to the same type of data set, a myriad of decisions must be made before an algorithm is created; the application of similar methods can therefore lead to very different results. It also follows that such funds appear to be less at risk than their contemporaries from market crowding. However, if related strategies or parts of the market experience sharp losses and simultaneously start to deleverage (as we saw in August 2007), it may still be difficult for these managers to avoid losses arising from such events.
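For readers unfamiliar with the underlying measurement, a correlation claim of this sort rests on a calculation like the following Pearson correlation; the monthly return figures here are invented purely for illustration.

```python
# Pearson correlation between two (invented) monthly return series.

def pearson(xs, ys):
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    cov = sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
    sx = sum((x - xb) ** 2 for x in xs) ** 0.5
    sy = sum((y - yb) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)   # always lies between -1 and +1

trend_fund = [0.02, 0.01, -0.01, 0.03, 0.02, -0.02, 0.01, 0.02]
ml_fund    = [0.01, -0.01, 0.02, 0.00, -0.01, 0.01, 0.02, 0.00]
rho = pearson(trend_fund, ml_fund)
```

In practice one would use many more months of data than this: with only a handful of observations the estimate itself is very noisy, which is worth bearing in mind when a manager quotes a correlation figure.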
To construct a robust alpha signal in which a manager has sufficient confidence, one needs a sufficiently large data set, with enough history to allow rigorous testing. For this reason, many quantitative practitioners favour machine learning for the creation of relatively short-term signals, since using faster data essentially creates a larger data set. There is strong evidence for this, and any advantage machine learning may have should become more pronounced the longer the history or the larger the data set; but there is no fundamental reason why a learning algorithm cannot also find relatively long-term signals, provided there is enough data for the result to be robust. Without sufficient historic data, the risk of overfitting becomes very high.
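The “faster data creates a larger data set” point can be made with back-of-envelope arithmetic. The sampling error of, say, a correlation between a signal and next-period returns shrinks roughly as 1/sqrt(n); the 1/sqrt(n) approximation and the choice of horizons below are simplifying assumptions.

```python
import math

# Same ten-year window, sampled monthly vs daily: the daily signal gives
# ~21x as many signal/return pairs, so roughly sqrt(21) ~ 4.6x less
# sampling error in the estimated relationship.

years = 10
n_monthly = years * 12        # 120 signal/return pairs
n_daily = years * 252         # 2520 signal/return pairs

se_monthly = 1 / math.sqrt(n_monthly)   # ~0.091
se_daily = 1 / math.sqrt(n_daily)       # ~0.020
improvement = se_monthly / se_daily     # ~4.6x more precise
```

This is why, for a fixed calendar history, a short-horizon strategy can support far more complex models than a long-horizon one before overfitting becomes the dominant risk.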
When dealing with managers purporting to use machine learning, investors should press them on which techniques they are using, where in the investment process those techniques are applied, and the rationale for favouring one technique over another. A major problem in analysing financial data sets is their low “signal-to-noise ratio” (i.e. it can be a challenge to identify variables that have a statistically significant effect); many consider this the main reason why some managers struggle to make money with machine learning. A manager should have sound explanations of why the techniques being used are effective at finding the signal, and should likewise be able to explain how overfitting is avoided. A manager should also be able to explain what programming languages and computing power are being utilised, as the basic tools need to be in place for the strategy to function effectively. Finally, it is essential that managers in the sector can adequately demonstrate where they gained their knowledge and what experience they have of successfully applying these techniques. As stated earlier, this sector encompasses a very wide range of tools, each with advantages and disadvantages; this arguably offers more ways for managers to make errors, particularly if they are not sufficiently experienced in the area.
Aurum has been analysing quantitative strategies for investment purposes for over 20 years; whilst we agree that there is real substance behind the hype and the buzzwords, when it comes to the application of machine learning techniques within hedge funds, one should tread carefully. Potential allocators to the space would be wise to remember that just because a pattern occurred historically does not mean it will recur, and knowing when this may be the case is just as much an art as a science.