Investing with Machine Learning (Part 1)
Setting the Scene
Over thousands of years, philosophers, scientists, and cavemen have pushed themselves to their wits' end solving the most profound intellectual questions humans could ever fathom. Yet even the collective wisdom of past generations can't answer the following question:
"Why do my stocks drop to all-time lows the moment I buy them?"
Perhaps buying the meme stock I saw on TikTok after it had already quintupled in share price wasn't my wisest decision. Maybe it was because I recklessly bought cheap OTM call options that matured before my Prime same-day delivery arrived. Nevertheless, I was unsatisfied with my "red carpet" of negative portfolio returns, and I pondered whether there was a way to stop losing money.
The first step would involve predicting how a stock will move.
Inspiration
With renewed inspiration, I did some research and decided to put my refined skills (elementary Python and machine learning) to work and see if they could effectively predict stock prices.
As ambitious as I was, it's important to note that no model can ever be completely accurate when it comes to predicting the stock market; still, these models can provide valuable insight and potentially improve my understanding of market trends. By training my models on historical data and using them to make predictions about future stock movements, I could determine whether they might be useful as a tool for investors looking to make informed decisions about their portfolios.
Project Idea
The following summarizes several key components of my project:
- Taking a step back, I realized that running models on sector ETFs may yield more intuitive results, since errors or inconsistencies could be attributed to general macroeconomic factors (as opposed to company-specific characteristics), and might be easier to interpret.
- I chose a classification approach to better capture directional changes in share prices. Classification is also preferable to regressing on the price itself, since investors generally care more about calling the direction correctly than tracking an exact price (i.e. knowing an arbitrary stock price is unhelpful without an understanding of how the price series is moving).
- I mainly covered the Technology Select Sector SPDR Fund (XLK) since tech is typically volatile and has generally yielded strong returns.
- I used 7 years of historical price data, from October 1st, 2015 to October 1st, 2022.
Loading and Cleaning Data
The first step is to import the necessary libraries (pandas, numpy, matplotlib, datetime, etc.) and load the price history.
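A minimal sketch of this setup (using yfinance as one possible data source, and defining Target as a simple next-day up/down label):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import yfinance as yf
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# 7 years of daily XLK data: October 1st, 2015 to October 1st, 2022
data = yf.download("XLK", start="2015-10-01", end="2022-10-01")

# Target = 1 if tomorrow's close is higher than today's, else 0
data["Tomorrow"] = data["Close"].shift(-1)
data["Target"] = (data["Tomorrow"] > data["Close"]).astype(int)
data = data.dropna()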
Picking a Model
When picking a model, one must consider several factors which include, but are not limited to, the size and complexity of data, level of interpretability, and level of performance. Considering this is a classification problem, a random forest is a good fit since it is an efficient model that tends to yield low bias and moderate variance.
Methodology
Begin with a random forest, make improvements to it, and analyze if its accuracy improves.
1. Simple Random Forest
Create a random forest that uses the most recent trading year as test data and the initial 6 years as train data.
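A rough sketch of this baseline, using the raw OHLCV columns as predictors and illustrative hyperparameters:

# Train on the first 6 years, test on the most recent trading year
split = int((6/7) * len(data))
train, test = data.iloc[:split], data.iloc[split:]

predictors = ["Open", "High", "Low", "Close", "Volume"]
model = RandomForestClassifier(n_estimators=100, min_samples_split=100, random_state=1)
model.fit(train[predictors], train["Target"])

preds = model.predict(test[predictors])
print(metrics.accuracy_score(test["Target"], preds))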
2. Random Forest with Backtesting and Validation
From here, I adapted the methodology and code from the following video.
Our initial model is only tested on the last trading year, so evaluating it across multiple years of data would make the results more robust. This can be done by building a backtesting system.
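Here is a sketch of such a system: a basic predict helper plus a loop that trains on all data up to a given day and predicts the following trading year (the start and step values are illustrative):

def predict(train, test, predictors, model):
    # Fit on the training window, then predict the test window
    model.fit(train[predictors], train["Target"])
    preds = pd.Series(model.predict(test[predictors]), index=test.index, name="Predictions")
    return pd.concat([test["Target"], preds], axis=1)

def backtest(data, model, predictors, start=500, step=250):
    # Expanding window: train on rows [0, i), predict rows [i, i + step)
    all_predictions = []
    for i in range(start, data.shape[0], step):
        train, test = data.iloc[:i], data.iloc[i:i + step]
        all_predictions.append(predict(train, test, predictors, model))
    return pd.concat(all_predictions)

predictions = backtest(data, model, predictors)
print(metrics.accuracy_score(predictions["Target"], predictions["Predictions"]))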
3. Adding Rolling Averages and Horizons and Improving Prediction
Next, we can add more predictors to the random forest to determine if they improve model accuracy. An analyst may want to consider whether the stock price today is higher than it was 2 days ago, a week ago, a month ago, a year ago, or even longer. This can be done by adding time horizons for calculating rolling means. Then we compute the ratio of today's closing price to each window's rolling mean closing price.
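The ratio features used later (Close_Ratio_2 through Close_Ratio_250) can be built roughly like this, with horizons of 2 days, a week, about a quarter, and about a trading year:

horizons = [2, 5, 60, 250]
new_predictors = []

for horizon in horizons:
    # Rolling mean close over each horizon, and the ratio of today's close to it
    rolling_average = data["Close"].rolling(horizon).mean()
    ratio_column = f"Close_Ratio_{horizon}"
    data[ratio_column] = data["Close"] / rolling_average
    new_predictors.append(ratio_column)

data = data.dropna()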
# Here we add more conditions to our predict function.
# We set a custom probability threshold (0.55): the model must be more
# confident before calling an "up" day, which reduces the number of
# trading signals but improves accuracy.
def predict(train, test, predictors, model):
    model.fit(train[predictors], train["Target"])
    preds = model.predict_proba(test[predictors])[:, 1]
    preds[preds >= 0.55] = 1
    preds[preds < 0.55] = 0
    preds = pd.Series(preds, index=test.index, name="Predictions")
    combined = pd.concat([test["Target"], preds], axis=1)
    return combined
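With the ratio predictors in place, the backtest can be re-run with this stricter predict function (reusing the backtest sketch from earlier, which picks up the redefined predict automatically):

predictions = backtest(data, model, new_predictors)
print(metrics.accuracy_score(predictions["Target"], predictions["Predictions"]))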
Results Summary
1. Simple Random Forest: 46.6%
2. Random Forest with Backtesting and Validation: 48.91%
3. Improved Model: 51.04%
The prediction accuracy increased slightly after adding the backtesting system and modifying some features.
Logistic Regression as a Baseline
While tuning some parameters of the random forest improved my model a bit, I still had no comparison against other models. So I added logistic regression as a baseline, to see whether a simpler, less computationally intensive model would suffice for this purpose.
# Copy into a new data set so we don't disturb the original
# 6/7 split: train on the first 6 years, test on the most recent year
data2 = data.copy()
split = int((6/7) * len(data2))

X = data2[["Close_Ratio_2", "Close_Ratio_5", "Close_Ratio_60", "Close_Ratio_250"]]
y = data2["Target"]
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

model = LogisticRegression()
model = model.fit(X_train, y_train)
probability = pd.DataFrame(model.predict_proba(X_test))
predicted = model.predict(X_test)
# Confusion matrix to see results
print(metrics.confusion_matrix(y_test, predicted))
[[ 0 118]
[ 0 98]]
# Show accuracy
print(model.score(X_test,y_test))
0.4537037037037037
Fortunately, the time spent on the random forest was worth it, since the logistic regression (about 45.4% accuracy) didn't capture the directional predictions as well as any of the random forest models.
Comparison with Energy
In search of better results, I applied the same models to the energy sector using the Energy Select Sector SPDR Fund (XLE). For simplicity's sake, here are the results:
- Simple Random Forest: 51%
- Random Forest with Backtesting and Validation: 46.81%
- Improved Model with Horizons: 59.09%
- Logistic Regression: 51.39%
Thoughts and Interpretations
Overall, the more complex random forest model with the time horizons was the best-performing model for both the technology and energy sectors, and I believe the additional predictors are why. Given the nature of the data, logistic regression may have fallen short due to its simplicity: it is easily thrown off by outliers and struggles with correlated features, and both issues could apply here, since stock prices are volatile (especially in the tech sector) and do not necessarily follow a purely random Brownian motion; prices can be autocorrelated.
I will still have to keep my job for now...