People have used machine learning in trading for decades. Hedge funds, high-frequency trading shops, and individual traders use all sorts of strategies, from Bayesian statistics to physics-inspired models.
In my final project for Udacity’s Machine Learning Nanodegree, I investigated using machine learning to trade stocks, specifically to predict British Petroleum (BP) stock prices on the London Stock Exchange (LSE) over a 7-day period. E.g. if I gave you 10 days of stock data (and possibly some other publicly known information) leading up to today, Mon 17 Oct 2016, the model should predict stock prices for the next 7 trading days (Tues-Fri, then Mon-Wed).
My project report includes a detailed breakdown of the problem statement, datasets used, data exploration, creation and implementation of the model, evaluation and some final comments. In this post, I’ve included excerpts from five sections:
- Why trading is an interesting domain for machine learning
- Datasets used
- Implementation Summary
- Interesting Aspects of the Project
- Difficult Aspects of the Project
- Firstly, there are many non-engineered features. Counting equities alone, there are over 10,000 listed globally, which makes for at least 10,000 potential non-engineered features.
- Secondly, there are many datapoints. Even access to only daily trading information gives us roughly 30 years * 365 days = over 10,000 dates for each of many stocks. (This is an overestimate, since most exchanges close on weekends and holidays; Israel’s exchange, which trades Sunday to Thursday, is one exception.) If we were to look at intraday figures, there’s even more data: in January 2009, an average of 881,609 trades were made per day in equities on the London Stock Exchange (Source: LSE Group).
- It is also interesting because research in machine learning and statistics has affected how markets behave. There is no strategy or algorithm that will solve this problem or remain forever ‘optimal’ – if a profitable strategy is found, it may be copied by others and priced in, or counteracted, or exploited. This is more relevant to high-frequency trading than daily trading but nonetheless has an impact.
There is one primary dataset for this project and two supplementary datasets.
- The primary dataset is a CSV with all the daily stock prices from 1977 for stocks listed on the London Stock Exchange. This dataset was downloaded from Quandl.
- The first supplementary dataset is a spreadsheet listing the stocks currently listed on the London Stock Exchange, with information such as each listed company’s stock symbol and sector. This spreadsheet was downloaded from the London Stock Exchange website.
- The second supplementary dataset is a CSV with Open, High, Low, and Close data for the FTSE100 from April 1, 1984 to Sept 9, 2016. This data was scraped from Google Finance and is used for feature engineering.
Initially I used a linear regression on only the past 7 days of BP stock prices, which produced impressive results: 7-day predictions had a root mean squared percentage error (RMSPE) of 5.4%.
In this initial iteration, I performed the following steps:

1. Import the data (CSV) and format it as a Pandas DataFrame.
2. Create the features dataframe: select the features we want to use and put them into a separate dataframe.
3. Create the target dataframe (prices for the 7 days following the last date in the features).
4. Split into training and testing sets. (No shuffling, because we are dealing with time-series data.)
5. Train the chosen regressor.
6. Predict the test target.
7. Evaluate the predictions against the test target and print the evaluation metrics.
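The steps above can be sketched roughly as follows. A synthetic random-walk series stands in for the real Quandl data, and the window sizes mirror the 7-days-in, 7-days-out setup; everything else (variable names, the 80/20 split) is an illustrative assumption, not the project’s actual code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic daily closing prices; a random walk stands in for the
# real BP data, which came from a Quandl CSV.
rng = np.random.default_rng(0)
prices = 500 + np.cumsum(rng.normal(0, 2, 2000))

WINDOW, HORIZON = 7, 7  # 7 days of features, predict the next 7 days

# Steps 2-3: build (features, target) pairs from sliding windows.
X, y = [], []
for i in range(len(prices) - WINDOW - HORIZON + 1):
    X.append(prices[i : i + WINDOW])
    y.append(prices[i + WINDOW : i + WINDOW + HORIZON])
X, y = np.array(X), np.array(y)

# Step 4: chronological split -- no shuffling for time-series data,
# so the model is never trained on the future and tested on the past.
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Steps 5-7: fit, predict, evaluate with RMS percentage error.
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
rmspe = np.sqrt(np.mean(((y_test - pred) / y_test) ** 2)) * 100
print(f"RMSPE: {rmspe:.2f}%")
```

Scikit-learn’s `LinearRegression` handles the multi-output target (7 prices per row) directly, which keeps the sketch short.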
After the initial iteration, I repeated the process, first with different regressors (altering parameters, trying SVM regression) and then with new features (more days’ worth of data, GAIA data, FTSE data).
I then chose the model with the lowest mean root mean squared percentage error: a linear regression trained on 7 days of BP and FTSE data (Close, maximum High, and minimum Low prices; adjusted prices for BP, unadjusted for FTSE).
- Coming up with new features from scratch as opposed to selecting them from a given set. This led to considerable analysis paralysis, because the universe of possible features is so large.
- Collating data from different sources. I wanted to use FTSE prices that weren’t in the Quandl database I downloaded, so I wrote a Python script to scrape the data from Google Finance. I then had to combine this data with the BP price data. This was made more tedious because there were missing values when I joined the two dataframes by date, so I also had to proxy the missing values.
- A simple model turned out to be better than several more complex models. E.g. linear regression did better than SVM regression, and adding GAIA features or increasing the number of days’ worth of data both increased the RMS percentage error.
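The date join mentioned above can be sketched like this, with made-up toy prices. Forward-filling is one simple way to proxy the missing values; the rule I actually used in the project may have differed.

```python
import pandas as pd

# Two toy daily series with mismatched dates, standing in for the
# BP data and the scraped FTSE data (values are made up).
bp = pd.DataFrame(
    {"bp_close": [470.0, 472.5, 471.0, 475.2]},
    index=pd.to_datetime(["2016-09-05", "2016-09-06", "2016-09-07", "2016-09-08"]),
)
ftse = pd.DataFrame(
    {"ftse_close": [6826.0, 6846.5, 6858.7]},
    index=pd.to_datetime(["2016-09-05", "2016-09-07", "2016-09-08"]),
)

# An outer join by date keeps every trading day from either series,
# which exposes the gaps as NaNs (FTSE has no row for 2016-09-06 here).
combined = bp.join(ftse, how="outer")

# Proxy missing values by carrying the last known price forward --
# one simple choice among several (interpolation is another).
combined = combined.ffill()
print(combined)
```

The outer join makes the gaps explicit before anything is filled, so the choice of proxy is a deliberate step rather than a silent default.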
- It was hard selecting the algorithm to use for this problem.
- It seemed as though any regression algorithm could work – and there are so many of them! I dealt with this by (1) first implementing an SVM regression, to get the code down on the page and make things feel more concrete, and then (2) choosing the simplest algorithm that seemed to fit the problem and trying that.
- I was also conflicted as to whether or not I should use reinforcement learning. On the one hand, there are profits that could act as rewards; on the other hand, my trades would not affect the environment (the market).
- Putting different features together in a dataframe took effort.
- Different stocks or indices had data for different dates (e.g. some had data for 1984-04-20, some didn’t). I had to find these differences and decide what to do with the missing data.
- There were many possible features.
- The project just kept getting longer, and I hadn’t even looked through half of the features I wanted to investigate or tried different algorithms. I decided to test only a few features in this exploratory study and leave the rest for another study.
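For the date-mismatch problem above, one way to surface the differences before deciding on a fill strategy is to diff the two date indexes. A sketch with toy dates (purely illustrative, not the project’s actual calendars):

```python
import pandas as pd

# Toy date indexes standing in for two instruments' trading calendars.
a = pd.to_datetime(["1984-04-19", "1984-04-20", "1984-04-23"])
b = pd.to_datetime(["1984-04-19", "1984-04-23"])

# Dates present in one series but missing from the other.
missing_from_b = a.difference(b)
missing_from_a = b.difference(a)
print(list(missing_from_b))  # b has no row for 1984-04-20
```

Listing the mismatches first makes the decision explicit: drop those dates, forward-fill, or interpolate, instead of letting a join decide silently.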