Investing with Machine Learning (Part 1)

Setting the Scene

Over thousands of years, philosophers, scientists, and cavemen have pushed themselves to their wits' end solving the most profound intellectual questions humans could ever fathom. Yet even the collective wisdom of past generations can't answer the following question:

"Why do my stocks drop to all-time lows the moment I buy them?"

Professional graph of every stock's performance in the old clemshao portfolio

Perhaps buying the meme stock I saw on TikTok after it had already quintupled in share price wasn't my wisest decision. Maybe it was because I recklessly bought cheap OTM call options that expired before my Prime same-day delivery arrived. Nevertheless, I was unsatisfied with my "red carpet" of negative portfolio returns, and I pondered whether there was a way to stop losing money.

The first step would involve predicting how a stock will move.

Inspiration

With renewed inspiration, I did some research and decided to put my refined skills (elementary Python and machine learning) to the test and see if they could effectively predict stock prices.

As ambitious as I was, it's important to note that no model can ever be completely accurate at predicting the stock market. Still, these models can provide valuable insights and improve my understanding of market trends. By training my models on historical data and using them to predict future stock movements, I could determine whether they might serve as a tool for investors looking to make informed decisions about their portfolios.

Project Idea

The following summarizes several key components of my project:

  • Taking a step back, I realized that running models on sector ETFs may yield more intuitive results, since errors or inconsistencies could be attributed to general macroeconomic factors (as opposed to company-specific characteristics), and might be easier to interpret.
  • I chose a classification approach to better capture directional changes in share prices. Classification is also preferable to regression here, since investors generally care more about getting the direction right than about tracking an exact price (i.e., knowing an arbitrary price level is unhelpful without knowing which way the price series is moving).
  • I mainly covered the Technology Select Sector SPDR Fund (XLK) since tech is typically volatile and has generally yielded strong returns.
  • I used 7 years of historical financial data, from October 1st, 2015 to October 1st, 2022.

Loading and Cleaning Data

The first step is to import the necessary libraries (pandas, numpy, matplotlib, datetime, etc.):

import datetime as dt

import pandas as pd
import pandas_datareader.data as web

company = "XLK"
start = dt.datetime(2015, 10, 1)
end = dt.datetime(2022, 10, 1)

# Pull daily OHLCV data for XLK from Yahoo Finance
data = web.DataReader(company, 'yahoo', start, end)

I used the variable 'data' to store my 7 years of financial data for XLK

data["Tomorrow"] = data["Adj Close"].shift(-1)
data["Target"] = (data["Tomorrow"] > data["Adj Close"]).astype(int)

Creating a Target variable that is 1 if the adjusted close rose the next day and 0 if it fell
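As a quick sanity check (my addition, not part of the original walkthrough), it's worth confirming that the two classes are roughly balanced, since a lopsided target would make the precision numbers later on harder to interpret:

# Fraction of up days vs. down days over the full sample
print(data["Target"].value_counts(normalize=True))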

Picking a Model

When picking a model, one must weigh several factors, including (but not limited to) the size and complexity of the data, the level of interpretability, and the level of performance. Since this is a classification problem, a random forest is a good fit: it is an efficient model that tends to yield low bias and moderate variance.
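To make that bias/variance point concrete, here is a small illustration (my addition, on synthetic data, not part of the project itself) comparing a single decision tree to a random forest under cross-validation; the forest's fold-to-fold scores typically swing less because averaging many trees reduces variance:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, purely for illustration
X, y = make_classification(n_samples=2000, n_features=10, random_state=1)

for name, clf in [("tree", DecisionTreeClassifier(random_state=1)),
                  ("forest", RandomForestClassifier(n_estimators=200, random_state=1))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, scores.mean().round(3), scores.std().round(3))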

Methodology

Begin with a random forest, make improvements to it, and analyze if its accuracy improves.

1. Simple Random Forest

Create a random forest that uses the most recent trading year as test data and the first six years as training data.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, min_samples_split=100, random_state=1)

# ~252 trading days in a year: hold out the most recent year for testing
train = data.iloc[:-252]
test = data.iloc[-252:]

predictors = ["Adj Close", "Volume", "Open", "High", "Low"]
model.fit(train[predictors], train["Target"])

The hyperparameters weren't rigorously optimized, but quick trial and error got them close
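For what it's worth, a more systematic search could look something like the sketch below. This is my addition, not part of the original project; note that the shuffled folds of ordinary cross-validation would leak future prices into training, so a time-ordered split is used:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Time-ordered folds so each validation window comes after its training data
param_grid = {"n_estimators": [100, 200, 500],
              "min_samples_split": [50, 100, 200]}
search = GridSearchCV(RandomForestClassifier(random_state=1),
                      param_grid,
                      cv=TimeSeriesSplit(n_splits=5),
                      scoring="precision")
search.fit(train[predictors], train["Target"])
print(search.best_params_)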

2. Random Forest with Backtesting and Validation

From here, I adapted the methodology and code from the following video.

Our initial model only tests on the last trading year, so testing across multiple years of data would make the model more robust. This can be done by creating a backtesting system.

def predict(train, test, predictors, model):
    # Fit on the training window, then classify the test window
    model.fit(train[predictors], train["Target"])
    preds = model.predict(test[predictors])
    preds = pd.Series(preds, index=test.index, name="Predictions")
    combined = pd.concat([test["Target"], preds], axis=1)

    return combined

Define a predict function that fits the model on the training window and uses the predictors to classify Target on the test window

# Start with ~4 years (1000 trading days) of data, stepping forward
# ~1 year (250 trading days) at a time

def backtest(data, model, predictors, start=1000, step=250):
    all_predictions = []

    for i in range(start, data.shape[0], step):
        # Train on everything up to day i, test on the following `step` days
        train = data.iloc[0:i].copy()
        test = data.iloc[i:(i+step)].copy()
        predictions = predict(train, test, predictors, model)
        all_predictions.append(predictions)
    return pd.concat(all_predictions)

Create the train/test windows, append each window's predictions to a list, and concatenate the results

3. Adding Rolling Averages and Horizons and Improving Prediction

Next, we can add more predictors to the random forest to see whether they improve its performance. An analyst may want to know whether today's stock price is higher than it was 2 days ago, a week ago, a quarter ago, or a year ago. We can capture this by defining time horizons, calculating rolling means over each one, and taking the ratio of today's adjusted close to each window's rolling mean.

horizons = [2, 5, 60, 250]
new_predictors = []
for horizon in horizons:
    rolling_averages = data.rolling(horizon).mean()

    # Ratio of today's adjusted close to the rolling mean over this horizon
    ratio_column = f"Close_Ratio_{horizon}"
    data[ratio_column] = data["Adj Close"] / rolling_averages["Adj Close"]

    # Number of up days over this horizon (shifted so today's Target isn't leaked)
    trend_column = f"Trend_{horizon}"
    data[trend_column] = data.shift(1).rolling(horizon).sum()["Target"]

    new_predictors += [ratio_column, trend_column]

# The rolling features are undefined for the first ~250 rows, and the
# model can't handle NaNs, so drop those rows
data = data.dropna()

2-day, 5-day, 60-day, and 250-day horizons: roughly two days, one week, one quarter, and one trading year
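As a quick sanity check (my addition), it helps to peek at the last few rows after the dropna:

# Ratio columns hover around 1.0; Trend columns count up days in each window
print(data[new_predictors].tail())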

# Here we add more conditions to our predict function.
# We require more confidence before calling an "up" day (0.55 instead of 0.5),
# which flags fewer trading days but should improve precision.

def predict(train, test, predictors, model):
    model.fit(train[predictors], train["Target"])
    # Predicted probability of an up day, thresholded at 0.55
    preds = model.predict_proba(test[predictors])[:, 1]
    preds[preds >= 0.55] = 1
    preds[preds < 0.55] = 0
    preds = pd.Series(preds, index=test.index, name="Predictions")
    combined = pd.concat([test["Target"], preds], axis=1)
    return combined
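The 0.55 cutoff is a judgment call. As a rough sketch (my addition, reusing the same six-year/one-year split as step 1), sweeping the threshold shows the tradeoff between coverage and precision:

# Refit on the horizon features, holding out the last trading year
train, test = data.iloc[:-252], data.iloc[-252:]
model.fit(train[new_predictors], train["Target"])
probs = model.predict_proba(test[new_predictors])[:, 1]

for threshold in [0.50, 0.55, 0.60, 0.65]:
    buys = probs >= threshold
    # Precision among predicted "up" days (NaN if nothing was flagged)
    precision = test["Target"][buys].mean()
    print(threshold, int(buys.sum()), round(precision, 3))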

Results Summary

1. Simple Random Forest: 46.6%
from sklearn.metrics import precision_score

preds = model.predict(test[predictors])
preds = pd.Series(preds, index=test.index)

precision_score(test["Target"], preds)

Precision: 0.46551724137931033

2. Random Forest with Backtesting and Validation: 48.91%

predictions = backtest(data, model, predictors)

predictions["Predictions"].value_counts()
#1    415
#0     97

precision_score(predictions["Target"], predictions["Predictions"])

predictions["Target"].value_counts() / predictions.shape[0]
#0    0.509766
#1    0.490234

Accuracy: 0.4891566265060241 

3. Improved Model: 51.04%

predictions = backtest(data, model, new_predictors)
predictions["Predictions"].value_counts()
#1.0    386
#0.0    126

precision_score(predictions["Target"], predictions["Predictions"])

Precision: 0.5103626943005182

The precision increased slightly after adding the backtesting system and the new rolling features. For context, only about 49% of backtested days were actual up days, so the improved model's 51% is a modest edge over always predicting "up."

Logistic Regression as a Baseline

While tuning some parameters of the random forest improved my model a bit, I still had no comparison against other models. The point of a baseline is to see whether a simpler, less computationally intensive model would suffice for this purpose.

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Copy to a new data set so the original isn't disturbed
# 6/7 split: train on the first ~6 years, test on the most recent year
data2 = data.copy()
split = int((6/7)*len(data2))

X = data2[["Close_Ratio_2", "Close_Ratio_5", "Close_Ratio_60", "Close_Ratio_250"]]
y = data2["Target"]

X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

model = LogisticRegression()
model = model.fit(X_train, y_train)

probability = pd.DataFrame(model.predict_proba(X_test))
predicted = model.predict(X_test)

# Confusion matrix to see results
print(metrics.confusion_matrix(y_test, predicted))
[[  0 118]
 [  0  98]]
 
# Show accuracy
print(model.score(X_test, y_test))
0.4537037037037037

Fortunately, the time spent on the random forest was worthwhile. The confusion matrix shows that the logistic regression simply predicted "up" every single day; since it never predicted "down," its precision equals its accuracy (45.4%), which falls short of every random forest model.

Comparison with Energy

In an attempt to search for better results, I applied the same models to the energy sector using the Energy Select Sector SPDR Fund (XLE). For simplicity's sake, here are the results:

  1. Simple Random Forest: 51%
  2. Random Forest with Backtesting and Validation: 46.81%
  3. Improved Model with Horizons: 59.09%
  4. Logistic Regression: 51.39%
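For reference, these numbers come from rerunning the exact same pipeline; the only change is the ticker at the top (sketched below with the variables from earlier):

# Same pipeline, different sector ETF
company = "XLE"
data = web.DataReader(company, 'yahoo', start, end)
# ...then rebuild "Tomorrow"/"Target", the horizon features, and rerun backtest()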

Thoughts and Interpretations

Overall, the more complex random forest model with the time-horizon features was the best-performing model for both the technology and energy sectors, and I have reason to believe the additional complexity is why. Given the nature of the data, logistic regression may have fallen short due to its simplicity: it is easily thrown off by outliers and struggles with correlated features, and both issues could apply here, since stock prices are volatile (especially in the tech sector) and do not necessarily follow a random Brownian motion; prices can be autocorrelated.
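That autocorrelation claim is easy to spot-check. A minimal sketch (my addition) of the lag-1 autocorrelation of daily returns:

# Lag-1 autocorrelation of XLK's daily returns; a value far from zero
# hints that returns aren't a pure random walk
returns = data["Adj Close"].pct_change().dropna()
print(returns.autocorr(lag=1))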

I will still have to keep my job for now...