Stock Price Prediction ML Tutorial
Predicting future stock prices using machine learning can be a daunting process, but it also offers the promise of profits that would be difficult or impossible to deliver using manual analysis or looking at graphs on a computer screen. Due to the complexity and jargon, many people find machine learning out of reach. This article explains how to:
- Obtain daily bar data for free
- Convert bar data into a machine learning friendly format.
- Run the data through the Quantized Classifier to predict future prices.
- Extend this work to support Machine Learning assisted trading.
Summary: During the period tested, 36% of all SPY bars met our goal of the market price rising by at least 1% to exit with a profit taker before it dropped by 0.5% to exit with a stop loss. This means that if you purchased the stock at random, only about 1 time out of 3 would you exit with a 1% profit before you hit a stop loss at 0.5%. The classifier was able to increase our win rate to 72.7%, or roughly 2 wins out of 3 purchases.
The Quantized Classifier trained on 1,407 SPY bars and made predictions for 351 bars ending in Jan-2017. It predicted 11 bars would meet the goal. Of those bars it was correct 72.7% of the time. This represents a lift of roughly 36 percentage points compared to trading random entry points. With minor configuration changes we can increase the precision to 84.6% with 13 trades, or get 29 trades at 51.7% accuracy.
The indicators used were primitive so results could improve with additional work.
Related Articles:
- Analyzing Predictive value of features for stock prediction
- FAQ
- Deep Learning for stock price prediction
- How can I make money with Quantized Classifier
When first learning about stock trading I learned a general rule that if you can predict price movement correctly more than 50% of the time you can make a profit trading stocks provided:
- Your losses per trade are the same size as or smaller than your wins.
- Your trades execute at the price expected.
- You can buy and sell the volume of stocks desired.
- Your combined trading costs for winning and losing trades are less than your profits.
Our general goal with machine learning is to identify when to Buy a given stock. When our wins are the same size as our losses we need the system to be correct at least 50% of the time. If the system is correct a higher percentage of the time then net profit will be higher provided we can identify a sufficient number of trades to make it worth our effort.
If the magnitude of our wins is larger than the magnitude of our losses then a system can remain profitable even when predicting with less than 50% accuracy.
For these examples we use a goal of rising to exit with a profit taker at a 1% profit before dropping to exit with a stop limit at 0.5% with a max hold of 4 days.
Any professional trader would happily accept a system that was correct 45% of the time if that system earned twice as much on every winning trade as it lost on the bad trades. There is a delicate balance between Magnitude of Win versus Magnitude of Loss, Precision and Recall.
There is a nearly infinite number of combinations, many of which will make different traders happy. This can allow a single machine learning system to support thousands of traders with unique trades that fit their risk appetite.
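The balance between win size, loss size and precision can be checked with a little arithmetic. The sketch below is only an illustration and is not part of the Quantized Classifier repository; the function names are hypothetical and trading costs and slippage are ignored. It computes the win rate at which a strategy breaks even and the average return per trade for a given precision.

```python
def break_even_precision(win_pct, loss_pct):
    """Win rate at which the expected profit per trade is zero (costs ignored)."""
    if win_pct <= 0 or loss_pct <= 0:
        raise ValueError("win_pct and loss_pct must be positive")
    return loss_pct / (win_pct + loss_pct)


def expected_return_per_trade(precision, win_pct, loss_pct):
    """Average percent gained per trade at a given win rate (costs ignored)."""
    return precision * win_pct - (1.0 - precision) * loss_pct


if __name__ == "__main__":
    # Goal used in this article: 1% profit taker, 0.5% stop loss.
    print("break even:", break_even_precision(1.0, 0.5))
    # A system that wins twice what it loses and is right 45% of the time.
    print("avg return:", expected_return_per_trade(0.45, 2.0, 1.0))
```

With a 1% profit taker and a 0.5% stop loss the break-even win rate is one third, which is why a precision well above the 36% base rate matters for this goal.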
Install the Go compiler to run the main classifier. Go is free, fast to download, easy to install and open source.
Install Python 3.5 or newer if you want to download new symbol data or use the transform scripts.
Download the Quantized Classifier repository and run the make_go.bat script with a command console open and with the current working directory set to where you unzipped the repository. This compiles the Go source code into an executable file compatible with your computer. This is explained in more detail in the main Quantized Classifier readme.
The only essential command in make_go.bat builds the classifyFiles executable, but you need to set the GOPATH environment variable to the directory where you placed the Quantized Classifier before Go will be able to find the source code.
go build src/classifyFiles.go
You can also clone the Quantized Classifier directly using Mercurial and the following command:
hg clone https://firstname.lastname@example.org/joexdobs/ml-classifier-gesture-recognition
For this example I downloaded SPY data from Yahoo for the period from 2010 to 2016. I chose this time frame because trading patterns have changed as automated trading has increased, which means that data before 2010 is likely to have different patterns. Since our Machine Learning depends on recognizing patterns, using training data that has patterns similar to our current patterns is essential.
The script to download the SPY data is yahoo-stock-download.py but it can easily be changed to download other symbols. The data file saved is SPY.csv. The baseline data is included with the repository so you only need to run the download script if you want to test against more current data.
Sample of CSV File downloaded
Date,Open,High,Low,Close,Volume,Adj Close
2010-01-04,112.3703,113.3899,111.5100,113.3300,118944600,98.2143
2010-01-05,113.2600,113.68,112.8499,113.6299,111579900,98.4743
2010-01-06,113.5199,113.9899,113.43,113.7099,116074400,98.5436
2010-01-07,113.50,114.3300,113.18,114.1900,131091100,98.9596
2010-01-08,113.8899,114.6200,113.6600,114.57,126402800,99.2889
2010-01-11,115.0800,115.1299,114.2399,114.7300,106375700,99.4276
2010-01-12,113.9700,114.2099,113.2200,113.6600,163333500,98.5003
2010-01-13,113.9499,114.9400,113.3700,114.6200,161822000,99.3323
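As a small illustration of working with this format, the sketch below parses bar data with the column names shown in the sample into numeric values. It is not the code used by yahoo-stock-download.py or stock-prep-sma.py; the function name read_bars and the dictionary keys are assumptions made for this example.

```python
import csv
import io


def read_bars(csv_text):
    """Parse daily bar CSV text (columns as in the sample above) into dicts."""
    bars = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        bars.append({
            "date": row["Date"],
            "open": float(row["Open"]),
            "high": float(row["High"]),
            "low": float(row["Low"]),
            "close": float(row["Close"]),
            "volume": int(row["Volume"]),
            "adj_close": float(row["Adj Close"]),
        })
    return bars


# Two rows copied from the sample above.
SAMPLE_BARS = """Date,Open,High,Low,Close,Volume,Adj Close
2010-01-04,112.3703,113.3899,111.5100,113.3300,118944600,98.2143
2010-01-05,113.2600,113.68,112.8499,113.6299,111579900,98.4743
"""

if __name__ == "__main__":
    for bar in read_bars(SAMPLE_BARS):
        print(bar["date"], bar["close"])
```

To read a file on disk instead, pass the result of open("SPY.csv").read() to the same function.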
Converting Bar Data to Machine Learning Data
The raw numbers from stock bars do not provide very much predictive value so we must transform the data to a form that ML classifiers can more readily use to predict future prices. Many indicators such as SMA, EMA and RSI have been invented to help humans extract patterns from how the stock prices change over time. These indicators provide data that can be useful when trying to predict future stock prices.
Some of the measurements I found that provide interesting and useful data are:
- Percentage of change compared to a 30, 60, 90, etc. day high.
- Percentage of change compared to a 30, 60, 90, etc. day low.
- Slope of change compared to some point in the past.
- Slope of change for a derived indicator for some point in the past, such as comparing the SMA(30) for the current bar to the SMA(30) 10 days ago.
If we take the amount of change between these two points and divide by the starting value, it gives us the portion of change. We can convert this to a slope by dividing by the number of days between the two bars.
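The slope calculation described above can be sketched in a few lines. This is a minimal illustration under the stated definition (portion of change divided by the number of bars); the function name is hypothetical and any scaling that stock-prep-sma.py applies to its output columns is not reproduced here.

```python
def slope_of_change(values, current_index, bars_back):
    """Portion of change per bar between an earlier bar and the current bar.

    values can be closes or any derived series such as an SMA.
    Returns None when there is not enough history or the start value is zero.
    """
    start_index = current_index - bars_back
    if bars_back <= 0 or start_index < 0:
        return None
    start = values[start_index]
    if start == 0:
        return None
    portion = (values[current_index] - start) / start
    return portion / bars_back


if __name__ == "__main__":
    closes = [100.0, 101.0, 102.0, 110.0]
    # 10% rise over 3 bars.
    print(slope_of_change(closes, 3, 3))
    # Not enough history for a 3 bar look back.
    print(slope_of_change(closes, 1, 3))
```

The same function works for a derived indicator: pass in a list of SMA(30) values and use bars_back=10 to compare the current SMA to the SMA 10 days ago.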
The number and diversity of possible measurements is nearly infinite, but our goal is not to teach people how to implement new indicators but rather to generate some data using indicators that can serve as an example of how the Quantized Classifier can predict future stock prices. Remember that better input data can improve the classifier's ability to accurately predict future prices.
One value the Quantized classifier can provide is it can help identify which indicators deliver predictive value and which ones are just noise. This form of guidance may be more valuable than the core classification capability.
For this example I chose to use the slope of the current Close value against bars in the past. I wanted to give the system some ability to detect a longer term down trend followed by a medium term counter trend followed by a short term turn around. With this in mind I had it measure the slope of change for several points in the past: 3, 6, 12, 20, 30, 60 and 90 bars.
This produced a new data set with one row for each bar in the original file, except that when using the SMA it throws away the first 30 bars of data because the SMA is not valid until you have N days of data. The machine learning files must use integer class IDs so there is a step required to map
Since we needed to split the data to allow training on the early data while reserving more recent data for testing, I had the system create spy-1p0up-0p5dn-mh4-close.test.csv and spy-1p0up-0p5dn-mh4-close.train.csv. The script that reads the SPY.csv bar file and produces the .train, .test and .class files is stock-prep-sma.py.
We always allow machine learning engines to train on a part of the data and then test how well they learned by running against part of the data they have never seen before. More accurate prediction of results on new data indicates either a good algorithm, a good set of input data or both. The amount of data used for training versus testing can be adjusted in stock-prep-sma.py. This script is only intended as an example but it could easily be extended to include other indicators and to save a map file to make it easy to map row numbers back to bar dates.
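To make the goal and the train/test split concrete, here is a minimal sketch of how a bar could be assigned class 1 or class 0 and how rows could be split chronologically. It is an illustration only and may differ from what stock-prep-sma.py actually does. The function names, the choice to enter at the close, and the rule that a bar touching both levels counts as a loss are all assumptions.

```python
def label_bar(highs, lows, closes, i, target=0.01, stop=0.005, max_hold=4):
    """Return 1 if price rises by target before falling by stop within max_hold bars.

    Returns 0 on a stop out or when max_hold passes without reaching the target,
    and None when the data ends before the outcome is known.
    """
    entry = closes[i]
    take_profit = entry * (1.0 + target)
    stop_loss = entry * (1.0 - stop)
    last = min(i + max_hold, len(closes) - 1)
    for j in range(i + 1, last + 1):
        if lows[j] <= stop_loss:
            return 0  # stop hit first; a bar touching both levels is assumed a loss
        if highs[j] >= take_profit:
            return 1
    if i + max_hold > len(closes) - 1:
        return None  # ran out of bars before the outcome was known
    return 0


def chronological_split(rows, test_portion=0.2):
    """Keep the earliest rows for training and the most recent rows for testing."""
    split_at = int(len(rows) * (1.0 - test_portion))
    return rows[:split_at], rows[split_at:]


if __name__ == "__main__":
    closes = [100.0, 100.2, 100.8, 101.0, 101.0, 101.0]
    highs = [100.5, 100.6, 101.5, 101.2, 101.2, 101.2]
    lows = [99.8, 99.7, 100.4, 100.8, 100.8, 100.8]
    print([label_bar(highs, lows, closes, i) for i in range(len(closes))])
    print(chronological_split(list(range(10))))
```

Splitting by time rather than at random keeps the test rows strictly newer than the training rows, which matches how the system would be used in live trading.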
Sample Machine Learning Input Output generated
class,symbol,datetime,sl3,sl6,sl12,sl20,sl30,sl60,sl90,sbm10,sbm20,sam10,sam20,ram20,ram30,rbm10,rbm20
0,spy,2010-02-18,88.548,57.359,14.136,-13.083,-7.979,-3.559,-2.373,0.000,13.083,67.725,67.725,0.047,0.047,0.000,0.026
0,spy,2010-02-19,42.525,64.324,5.738,-2.507,-7.534,-3.221,-2.147,0.000,2.507,61.975,61.975,0.050,0.050,0.000,0.005
0,spy,2010-02-22,27.208,46.703,10.091,8.928,-8.845,-3.191,-2.128,0.000,0.000,55.299,55.299,0.050,0.050,0.000,0.000
1,spy,2010-02-23,-33.060,27.305,26.384,0.182,-13.849,-5.177,-3.451,121.447,121.447,37.020,37.020,0.037,0.037,0.012,0.012
0,spy,2010-02-24,-9.597,16.402,32.502,6.907,-11.360,-3.691,-2.461,15.293,15.293,39.560,42.325,0.047,0.047,0.003,0.003
These rows can be mapped back to the source BAR data but there is also a command option -WriteFullCsv that will save the original records with just the predicted class updated. This option is intended to make integration with automated trading systems easier.
The default version of the converted data is included in the repository. You only need to run it again if you change the parameters in the stock-prep-sma.py or if you downloaded new data.
Known flaw: stock-prep-sma.py currently considers bars near the end of the input data set a failure because we run out of data before they rise or fall by 1%. A better solution would be to omit those bars from the Test set, because this can cause failures to be reported where the final state for that bar is not really known. This could understate the success of the engine.
Running the Classifier
set XXCWD=%cd%
cd ..\..\..\..\
classifyFiles -train=data/spy-1p0up-0p5dn-mh4-close.train.csv -test=data/spy-1p0up-0p5dn-mh4-close.test.csv -testOut=tmpout/spy-1p0up-0p5dn-mh4-close.out.csv -LoadSavedAnal=false -maxBuck=500 -IgnoreColumns=symbol,datetime
cd %XXCWD%
- -train - location it will read training data from.
- -test - location it will read test data from. This would be -class if using to predict against current data.
- -maxBuck - is how we divide data elements into groups internally in the engine.
Output from Classifier
Summary By Class
Summary By Class
Train Probability of being in any class
class=0, cnt=927 prob=0.66261613
class=1, cnt=472 prob=0.33738384
Num Train Row=1399 NumCol=8
RESULTS FOR TEST DATA
Num Test Rows=350
Total Set Precis=0.7171429
class=1 ClassCnt=112 classProb=0.32 Predicted=41 Correct=27 recall=0.24107143 Prec=0.6585366 Lift=0.33853662
class=0 ClassCnt=238 classProb=0.68 Predicted=309 Correct=224 recall=0.9411765 Prec=0.7249191 Lift=0.044919074
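The per class figures in this output can be reproduced from the counts. The sketch below is not the engine's code; the function name and dictionary keys are made up for illustration. It treats lift as precision minus the base rate of the class, which is an inference from the numbers shown above rather than something the classifier documentation states here.

```python
def class_metrics(class_count, total_rows, predicted, correct):
    """Recompute classProb, precision, recall and lift from raw counts."""
    base_rate = class_count / total_rows
    precision = correct / predicted if predicted else 0.0
    recall = correct / class_count if class_count else 0.0
    return {
        "classProb": base_rate,
        "precision": precision,
        "recall": recall,
        "lift": precision - base_rate,
    }


if __name__ == "__main__":
    # Counts taken from the class=1 line of the test results above.
    for name, value in class_metrics(112, 350, 41, 27).items():
        print(name, round(value, 4))
```

Precision answers "when the engine says buy, how often is it right", while recall answers "how many of the good bars did it find". For trading, precision usually matters more as long as enough trades are produced.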
Sample of Results by Row
This output is also saved in the file named by the -testOut parameter, but the name is changed slightly because the system generates multiple files under some conditions. The file name actually generated this time is tmpout/spy.slp30.out.sum.csv. This is the actual file you would read when using the predicted values to make trades.
ndx,bestClass,bestProb,actClass,status
0,0,4.0847554,0,ok
1,0,4.0847554,0,ok
2,0,4.0847554,0,ok
3,0,4.0847554,1,fail
4,0,4.590989,0,ok
5,1,4.1749053,0,fail
6,1,4.1749053,0,fail
7,1,6.336764,1,ok
8,1,5.7136507,1,ok
9,1,5.7171845,0,fail
10,1,5.82789,1,ok
11,1,5.7298274,1,ok
12,1,5.7298274,1,ok
13,1,4.189946,1,ok
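A downstream script can read this per row output and count how the buy predictions turned out. The sketch below is an illustration only; it assumes the column names shown in the sample and uses the sample rows inline instead of opening the real output file. The function name is hypothetical.

```python
import csv
import io

# Rows copied from the sample output above.
SAMPLE_ROWS = """ndx,bestClass,bestProb,actClass,status
0,0,4.0847554,0,ok
1,0,4.0847554,0,ok
2,0,4.0847554,0,ok
3,0,4.0847554,1,fail
4,0,4.590989,0,ok
5,1,4.1749053,0,fail
6,1,4.1749053,0,fail
7,1,6.336764,1,ok
8,1,5.7136507,1,ok
9,1,5.7171845,0,fail
10,1,5.82789,1,ok
11,1,5.7298274,1,ok
12,1,5.7298274,1,ok
13,1,4.189946,1,ok
"""


def precision_for_class(csv_text, wanted_class="1"):
    """Return (predicted, correct) counts for rows predicted as wanted_class."""
    predicted = 0
    correct = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["bestClass"] == wanted_class:
            predicted += 1
            if row["actClass"] == wanted_class:
                correct += 1
    return predicted, correct


if __name__ == "__main__":
    predicted, correct = precision_for_class(SAMPLE_ROWS)
    print(predicted, correct, correct / predicted)
```

In these 14 sample rows the engine predicted class 1 nine times and was right six times. When classifying live data with -class there is no known actual class yet, so a trading script would only look at bestClass and bestProb.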
Predict future Silver (SLV) Prices
For silver I chose a harder goal where we wanted to find bars where the price would rise by at least 1.5% before it fell by 0.3% with a max hold of 5 days.
This yields a magnitude of wins at least 500% the size of our losses which means the system can remain profitable with precision substantially below 50%.
The Classifier identified 23 out of 503 test bars that it thought would fit these criteria, of which 12 turned out to be correct for a win rate of 52%. This win rate is over 200% better than what we actually needed for this strategy to break even.
- Data download script yahoo-stock-download.py was extended to download daily silver SLV bars back to 2007 to create SLV.csv.
- Data conversion script stock-prep-sma.py was extended to create silver machine learning files using the SMA30 on close. It produces slv-1p5up-0p3dn-mh10-close.train.csv and slv-1p5up-0p3dn-mh10-close.test.csv.
- Quantized Classifier classification script: a silver demo directory was added. It still uses the classifyFiles executable but with different parameters.
- TensorFlow classification script CNNClassifyStock-SLV-1p5up0p3dnMh5.bat
Output from the classifier run SLV
Summary By Class
Train Probability of being in any class
class=1, cnt=719 prob=0.35878244
class=0, cnt=1285 prob=0.6412176
Num Train Row=2004 NumCol=8
RESULTS FOR TEST DATA
numRow=501 sucCnt=314 precis=0.62674654 failCnt=187 failPort=0.37325346
Num Test Rows=501
class=0 ClassCnt=314 classProb=0.62674654, Predicted=491 Correct=309 recall=0.98407644 Prec=0.6293279 Lift=0.002581358
class=1 ClassCnt=187 classProb=0.3732535, Predicted=10 Correct=5 recall=0.026737968 Prec=0.5 Lift=0.1267465
Finished ClassifyTestFiles()
Extending for a Trading System
Trading systems can be super complex, trading hundreds of times per day, or easy to build, providing information to assist a human trader in choosing trades. Assuming that we want to build a system that provides a human trader between 4 and 10 purchases per day, and that depending on market conditions many days can pass without a trade, things can be relatively simple. We can assume the user will use a simple stop loss order to limit losses at a specified level and will exit with a profit taker, which allows reasonable performance at reasonable levels of effort. There are people who are known to game the system and trigger stop loss orders, so it is sometimes safer to use larger stop loss margins and manually exit if the market moves adversely.
Extending the system for Human based Day trading:
- Choose several more symbols you want to trade. The SPY example delivered 43 purchases in roughly 1 year worth of bars, which is a little less than 1 trade per week. If you want 4 trades per week you will need the system to track at least 8 symbols; 15 would probably be better.
- Enhance the data download scripts to download data for the extra symbols.
- Enhance the data download scripts to only download most recent data and add it to existing bar files.
- Add more sophisticated indicators and a greater number of them to the data conversion script. Good traders should try to duplicate indicators they already know and trust. If an indicator works well for humans to predict price movements, the same indicator may also provide good input to the classifier.
- Enhance the Data conversion script to run for the additional symbols
- Modify the Data conversion script to copy all but the most recent bar to the training file while only the most recent bar is placed in the .class.csv input file. You actually need two sets of these: one that only places the most recent bar in the .class file and one that places 10% to 20% of the most recent bars in the .test file. You will need both during system tuning. These are one line shell scripts so it is easiest to copy and change the names.
- Create new classifier scripts to run against the different symbols. These are one line shell scripts so it is easiest to copy and change the names.
- Create new classifier scripts using the -class parameter instead of the -test command parameter so it produces a CSV output with the classifier's prediction for the most recent bar. You actually need two scripts for each symbol: one you need during test and configuration and one for classifying. These are one line shell scripts so it is easiest to copy and modify.
- Tweak the Data conversion script and classifier parameters to find acceptably good performance for each symbol. This is the most critical step because as you change goals and the symbols the system is analyzing you will need to find different combinations of indicators. For example, in the SPY system we look at the slope of the change in close for 3, 6, 12, 20, 30, 60 and 90 days. That system was looking for a 1% gain before a 0.5% drop, so if you wanted to predict for a 5% gain then at the very least the number of days used in the comparison would need to change.
- Write a script that reads the classifier output and summarizes the output from all the symbols and the different configurations in a human friendly version for the manual trader.
- Write a parent script to run all these scripts in one step sometime after market close and before market opens. It could also run throughout the day depending on the parameters the user chose.
- Optional: Hook the summary script so it runs automatically and has the content available in the morning for human traders.
- Optional: Hook the summary output up so the system sends an email or text to the user when there is a recommended trade.
- Optional: Reverse the detection logic to detect high risk of rapid market drops to notify user when to exit positions for all symbols the user may be holding.
- Optional: Users need to review the performance by class periodically to ensure the market has not changed and that the system is still producing good data.
It is perfectly reasonable to have several different configurations of indicators running as if they are different strategies on the same symbol through the same classifier. This can be helpful because traders can gain confidence if more than one strategy recommends buying the same symbol.
Extending this system to become a full-fledged, fully automated trading system could be a large project, including building brokerage API integration, fault tolerance and all the other features needed for automated trading. I can provide expertise and consulting services to build this system around the Quantized Classifier.
My hope is that this article will inspire some of those working on trading systems to use Quantized Classifier as a component of their solution stack. I would love to provide consulting services to help them build a production grade system around the classifier. I am also willing to provide consulting services to add features they need to Quantized classifier.
This is a super simple example. There are all kinds of possible enhancements, some in the engine and others in the indicators. For example, an indicator showing the % over the 30 day low and the % under the 30 day max might give good predictive input.
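The indicator idea mentioned above could be sketched as follows. This is a hypothetical example rather than code from the repository. The function names are made up, and the window here includes the current bar, which is a design choice and not a requirement.

```python
def pct_above_low(lows, closes, i, window=30):
    """Percent the close at bar i sits above the lowest low of the last window bars."""
    if i + 1 < window:
        return None  # not enough history yet
    lowest = min(lows[i - window + 1:i + 1])
    return (closes[i] - lowest) / lowest * 100.0


def pct_below_high(highs, closes, i, window=30):
    """Percent the close at bar i sits below the highest high of the last window bars."""
    if i + 1 < window:
        return None  # not enough history yet
    highest = max(highs[i - window + 1:i + 1])
    return (highest - closes[i]) / highest * 100.0


if __name__ == "__main__":
    highs = [11.0, 12.0, 10.0, 10.5]
    lows = [10.0, 8.0, 9.0, 9.5]
    closes = [10.5, 9.0, 9.5, 10.0]
    # A 3 bar window is used here only to keep the example small.
    print(pct_above_low(lows, closes, 3, window=3))
    print(pct_below_high(highs, closes, 3, window=3))
```

Each function would become one extra column in the machine learning CSV, with the first window bars dropped in the same way the SMA based columns drop their first 30 bars.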
All these scripts are included with our free open source classifier
My hope is that people will see what they get for free and then be willing to pay me to make enhancements to meet their needs and integrate it into their systems.
I hope this helps. Feel free to contact me with questions.
Thanks, Joe Ellsworth, Machine Learning Algorithms Scientist & Consultant.
Adding Additional Data sources
The analysis above was based on technical data derived directly from BAR data. There are other sources of data that could be used to add refinement to the predictive capacity of the system. They are roughly classed as follows.
- Company Fundamentals Data.
- Time Before and After the Next financials are released
- Important market influencers such as Fed announcements. Market swings can be particularly volatile and so far outside the norm that they will confuse machine learning statistics immediately before and following those events.
- Wars and rumors of wars. Elections and fears about elections.
- News, Commentary, Blogging, Tweets, etc. producing what is roughly classified as Sentiment data.
Some of this data is easily added to the technical data simply as a few extra columns of data in the CSV files. The machine learning classifier doesn't care where the data comes from as long as it can be represented as a number in the CSV that has a value for every row. A perfect example of this could be the company's debt to equity ratio or its rate of increase in sales over the last 2 quarters, both of which could influence future stock prices.
Market moving Events
Other aspects such as Fed announcements are harder to incorporate as a single number since nobody knows what they are going to say. This can be added to the model as a feature, but it may be easier to simply lock out trading for the few days before and after these announcements unless you are building a trading system to try to capitalize on those points of high volatility.
News and textual sentiment data can be mined and added to the engine both as a market wide sentiment and as sentiment about the company. In general sentiment mining requires a different approach, so the best way to approach it is to use a separate classifier that digests the news data and produces numbers that can be added to the existing CSV as additional data columns.
One of the greatest challenges with sentiment data is gaining access to the text containing the necessary data in a timely fashion and at a reasonable cost. Many sites contain valuable commentary that could be mined for sentiment but it is hidden behind pay walls. Other text like Twitter is free but may contain very limited value with lots of noise. When I find a free source that is worth mining I will add it as an example for the Quantized Classifier.
One of the more interesting challenges is that some authors have opinions that are more likely to be correct than others. Any sentiment mining system needs to incorporate a notion of author credibility and use it to rate sentiment from authors with greater credibility as higher influence than others.
Ultimately sentiment data can be reduced to numbers that are added to the original technical data to help boost accuracy of prediction, or it can be used to adjust acceptable risk thresholds for portfolio management. For example we could have a set of numbers such as twitterMarketRise=0.5, meaning that the Twitter feed seems to be neutral, and SeekingAlphaIBMBull=0.9, meaning the Seeking Alpha analysts as a set are very bullish, thinking IBM will rise. Since we want to consolidate many sources into a small number of values, being able to adjust for credibility of the source is critical. The total number of valid columns is only limited by our imagination, but more columns can actually hurt predictive accuracy if they contain only noise with no signal.
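One simple way to fold author credibility into a single sentiment number is a weighted average. The sketch below is a hypothetical illustration and not a feature of the Quantized Classifier. It assumes each opinion has already been scored between 0 and 1 (with 0.5 as neutral) by a separate sentiment classifier and that each author has a non-negative credibility weight.

```python
def weighted_sentiment(opinions, neutral=0.5):
    """Combine (score, credibility) pairs into one credibility weighted score.

    score is assumed to be in the range 0..1 where 0.5 is neutral.
    Returns the neutral value when there is nothing to weigh.
    """
    opinions = list(opinions)
    total_weight = sum(credibility for _, credibility in opinions)
    if total_weight <= 0:
        return neutral
    weighted_sum = sum(score * credibility for score, credibility in opinions)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # A credible bullish author outweighs a less credible bearish one.
    print(weighted_sentiment([(0.9, 3.0), (0.2, 1.0)]))
    # No opinions for this bar: fall back to neutral so every row has a value.
    print(weighted_sentiment([]))
```

Falling back to a neutral value keeps the column populated for every row, which the CSV input requires.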
SPY Goal Rise 6% before dropping by 1%
It is possible to run multiple configurations seeking different performance gains on the same symbol to provide more trading options on the same symbols. This example shows a more aggressive goal for SPY seeking a 6% gain per trade with downside risk limited to 1%.
When running multiple classifiers for a single symbol it is important to also apply portfolio risk management rules to avoid over concentration in a given symbol or segment. When multiple configurations all agree that it is a good time to buy a symbol it can be considered a positive sign.
When using multiple classifiers on the same symbol it is a good idea to use different indicators so they are not fooled by the same signal. In this instance we used the slope of the current bar against the SMA(90) of the max price. This is different than using the close without SMA above but may not be sufficiently different to avoid being fooled by the same noise.
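In this configuration we seek a 6% rise from a given SPY bar before the market drops by 1%. With a profit 6X greater than the loss, strict break-even before trading costs is 1 win out of 7 trades (about 14.3%), since one 6% win offsets six 1% losses. This article uses the slightly more conservative 16.7%, or 1 win out of 6 trades, as the required accuracy. Any precision above that level will yield a net profit.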
During the training period 4.8% of the training bars met these criteria, while 4.4% did during the test period. This means that with a randomly timed purchase roughly 21 out of 22 bars will hit the 1% stop loss before they rise by the targeted 6%.
The system predicted 1 bar would meet the criteria, which was correct, yielding 100% precision. The more aggressive analyze script predicted 4 bars with 3 correct, yielding 75%. If the future holds similar statistical patterns we could expect 9 out of 12 bars picked by the engine to meet the goal.
Line of python added to stock-prep-sma.py to produce this data file.
process("data/spy.csv", "data/spy-2p2up1-1dn-mh6-close.csv", "spy", 6, 0.022, 0.01,30,"close") # 220% gain ratio
Classification and Analysis scripts are in demo/stock/spy/6up1dn-mh10 directory
classifyFiles -train=data/spy-6up-1dn-mh10-smahigh90.train.csv -test=data/spy-6up-1dn-mh10-smahigh90.test.csv -testOut=tmpout/spy-6up-1dn-mh10-smahigh90.out.csv -maxBuck=100 -IgnoreColumns=symbol,datetime,rbm10,rbm20,sl3,sl6,sbm10,sam10,sbm20,ram30,rbm10,sl6,sl3,sl12,sl90,sl60,sam10
Results from Classification run
Class=1 is the Success condition. Class=0 is the failure condition.
Summary By Class
Train Probability of being in any class
class=0, cnt=874 prob=0.64311993
class=1, cnt=485 prob=0.35688007
Num Train Row=1359 NumCol=18
RESULTS FOR TEST DATA
Num Test Rows=339
Total Set Precis=0.6902655
class=0 ClassCnt=233 classProb=0.68731564 Predicted=338 Correct=233 recall=1 Prec=0.6893491 Lift=0.002033472
class=1 ClassCnt=106 classProb=0.31268436 Predicted=1 Correct=1 recall=0.009433962 Prec=1 Lift=0.68731564
Finished ClassifyTestFiles()
It is possible to directly access the probability by class for each row. The simplistic filter chooses the class with the highest probability and calls that the choice for that row. In reality if we are keeping rows with a 51% rating for purchase it may be worth reducing the threshold probability as low as 40% especially when classifying for a strategy that only needs 16.7% accuracy to be successful. This would increase the number of trades provided by the strategy.
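A lower buy threshold could be applied on top of the per class probabilities along these lines. This is a sketch only. It assumes the probabilities for a row have been normalized to the range 0 to 1 and are available as a dictionary keyed by class ID, which is an assumption about how the data is passed around and not a description of the classifier's output format. The function name and default threshold are also hypothetical.

```python
def choose_class(class_probs, buy_class=1, buy_threshold=0.40):
    """Pick the buy class whenever its probability clears the threshold.

    Otherwise fall back to the most probable of the remaining classes.
    class_probs maps class ID to a probability between 0 and 1.
    """
    if class_probs.get(buy_class, 0.0) >= buy_threshold:
        return buy_class
    others = {c: p for c, p in class_probs.items() if c != buy_class}
    if not others:
        return None
    return max(others, key=others.get)


if __name__ == "__main__":
    row = {0: 0.55, 1: 0.45}
    # The simplistic highest probability filter would pick class 0 here.
    print(choose_class(row, buy_threshold=0.51))
    # A 40% threshold turns the same row into a buy.
    print(choose_class(row, buy_threshold=0.40))
```

Lowering the threshold trades precision for more trades, so the result should be checked against the break-even precision for the strategy before using it.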
Please remember these examples are only intended as examples. You are encouraged to experiment, improve them and add your own creativity. I already know that we could invest man years adding more sophisticated indicators, more sophisticated entry and exit logic and building plumbing to connect to brokers. The intent is not to provide you with a full-fledged trading system but rather a nucleus you can use to build such a system. See: How can I make money with this
When I write these articles they are based on the state of the engine at a given point of time. As we add features, new indicators or fix bugs the actual results will change over time. If you see results different from those shown here when executing the samples the most probable reason is that subsequent changes caused a change in results. This is inevitable when working with any rapidly improving product.
The Directory demo/stock contains many additional configurations predicting prices for various symbols.
Several examples of using TensorFlow CNN Deep learning to predict future stock prices using the same data are available in the tlearn directory.