Modeling Market Capitalization Of Public US Companies Using Publicly Available Stock Metrics And Performance Data

Author: Andrew Jain
Mentor: Dr. Mohammadreza Mousavi Kalan
Lyons Township High School

Review of Literature

Predicting the stock market, especially with simpler statistical analysis methods, is challenging because it exhibits extremely complex and nonlinear trends (Sawale & Rawat, 2022). Despite inherent unpredictability, numerous studies have been conducted on the ability of AI and other statistical tools to predict the market (Lin & Lobo Marques, 2024). Prediction techniques like Machine Learning (ML), Deep Learning (DL), Neural Networks (NN), Support Vector Machines (SVM), and Sentiment Analysis have been found to perform well in this nonlinear environment. These methods come at the cost of high computational needs and low interpretability (Lin & Lobo Marques, 2024).

It remains unclear which of these prediction techniques is best. One study achieved the highest performance with SVM when forecasting stock market returns (Arrieta-Ibarra & Lobato, 2015). In contrast, a 2023 study analyzed nine different ML models for predicting the direction of Tesla’s stock price and found that the simpler method of Logistic Regression had the highest accuracy. In that study, the Random Forest model and NN also performed well (Khan et al., 2023).

Recently, more complex predictive methods like Long Short-Term Memory (LSTM) have been studied for stock market prediction (Tiwari & Chaturvedi, 2021). LSTM models have been found to outperform other ML and NN models in stock market forecasting, and they present a promising option for modeling the stock market (Sawale & Rawat, 2022).

Although advanced techniques like LSTM can make accurate predictions, they generally rely on technical analysis, with less than 25% of studies using fundamental analysis to construct their models (Lin & Lobo Marques, 2024). Creating models based on fundamental analysis is challenging because it requires explaining the reason for the movement of a stock price (Lin & Lobo Marques, 2024). However, models based on fundamental analysis can detect simple relationships between specific factors and a stock price. A study by Sohdi (2024) tested the ability of financial metrics, like growth rate, return on assets, quick ratio, and price-to-earnings ratio, to predict stock price using multiple linear regression.

The study found a significant negative correlation between quick ratio and stock price and recommended that future research examine an expanded set of variables for predicting stock price. In a separate study, Ganguli (2011) used regression to explore the relationships between fundamental accounting metrics, such as earnings, and valuation, finding a correlation between abnormal earnings, book value, and the market value of a company. The study also found that, in the presence of abnormal earnings and book value, operating cash flow does not contribute to modeling market value. Ganguli (2011) recommends additional research that includes variables beyond abnormal earnings and book value.

This project seeks to expand on the previous research exploring the relationships between accounting metrics. Specifically, the project will use linear regression to conduct fundamental analysis on publicly traded US companies. While numerous effective statistical and machine learning methods have been found to predict the stock market, many are complex, hard to interpret, and computationally intensive. Linear regression was chosen for this project because, despite its simplicity, it remains a powerful tool that can detect subtle underlying relationships. In addition, linear regression was selected for its interpretability, which is crucial for fundamental analysis. With linear regression, this study will search for connections between a company’s market capitalization and other important data metrics like earnings per share. Ideally, linear regression will discover new relationships that can be used to predict a company’s market capitalization.

Data Selection

Before fitting the first model, many features irrelevant to Market Cap, such as Change from Open, Country, and 50-Day Moving Average, were removed from the dataset. Including these features would have added significant complexity without improving accuracy, so only predictors relevant to Market Cap were kept, preserving the model's interpretability. Features with too many missing values were also removed: while some missing values are inevitable, too many would raise the error rate because there is less data to train on. After removing all unnecessary features, 15 predictors remained for use in a model.

All companies without Market Cap data were removed from the dataset, as the model cannot be trained without the response variable. This removed more than 3,300 entries, bringing the total down to roughly 6,200. Additional entries were deleted because they were missing data for the predictor Outstanding Shares or the predictor Performance by Half Year.
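A minimal sketch of this filtering step with pandas is shown below. The DataFrame is a hypothetical miniature of the real dataset, and the column names are stand-ins for those in the Kaggle data:

```python
import pandas as pd

# Hypothetical miniature dataset; values and column names are stand-ins.
df = pd.DataFrame({
    "Market Cap": [100.0, None, 250.0, 80.0],
    "Outstanding Shares": [10.0, 5.0, 20.0, None],
    "Performance by Half Year": [0.05, 0.10, None, 0.02],
    "Country": ["USA", "USA", "USA", "USA"],          # judged irrelevant to Market Cap
    "50-Day Moving Average": [9.8, 4.9, 12.1, 7.7],   # judged irrelevant to Market Cap
})

# 1. Drop the features judged irrelevant to Market Cap.
df = df.drop(columns=["Country", "50-Day Moving Average"])

# 2. Drop rows missing the response or the key predictors.
df = df.dropna(subset=["Market Cap", "Outstanding Shares", "Performance by Half Year"])
```

In this toy example only the first row survives both filters; in the real dataset the same two steps reduced roughly 9,500 entries to about 6,200.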

One feature integral to predicting Market Cap is a company’s earnings, a metric not included in the initial dataset. A model without earnings as a predictor would likely struggle to make accurate predictions. To include earnings as a feature, each company’s earnings were calculated from its price-to-earnings (P/E) ratio and its number of outstanding shares. To perform this calculation, more than 3,000 entries with missing P/E ratios had to be removed, bringing the final number of data entries down to 3,086. This is a less-than-ideal amount of data, but still enough to create an accurate model.
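The earnings derivation can be sketched as follows. Since Price divided by the P/E ratio recovers earnings per share, multiplying by outstanding shares gives total earnings (the values and column names below are synthetic stand-ins):

```python
import pandas as pd

# Hypothetical miniature dataset; values are synthetic.
df = pd.DataFrame({
    "Price": [10.0, 40.0, 25.0],
    "P/E": [5.0, None, 12.5],
    "Outstanding Shares": [100.0, 50.0, 200.0],
})

# Rows without a P/E ratio cannot be used for this calculation, so drop them first.
df = df.dropna(subset=["P/E"])

# Price / (P/E) gives earnings per share; multiplying by shares gives total earnings.
df["Earnings"] = df["Price"] / df["P/E"] * df["Outstanding Shares"]
```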

With the feature Earnings calculated and added to the dataset, a few final data metrics were dropped from the set, including Earnings Per Share, Price, Price to Earnings, and Forward Price to Earnings. The final number of predictors became 10, one of which, Industry, is a categorical predictor.

The final dataset to be used for creating the models.

Single Variable Analysis

To begin analyzing the data, a correlation heatmap was created to visually inspect relationships among the features and between each feature and Market Cap.

Note: Industry not shown in correlation heatmap.

The heatmap shows correlations between Market Cap and Earnings and between Market Cap and Outstanding Shares. Earnings and Outstanding Shares also appear to be correlated with each other, as do many of the performance statistics.
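A heatmap is a color-coded view of a pairwise correlation matrix, which can be computed with pandas. The data below is synthetic, constructed so that Earnings and Outstanding Shares drive Market Cap, mimicking the pattern described above:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: Market Cap is driven by Earnings and Outstanding Shares.
rng = np.random.default_rng(0)
earnings = rng.normal(size=300)
shares = 0.6 * earnings + rng.normal(scale=0.8, size=300)
market_cap = 2.0 * earnings + 1.0 * shares + rng.normal(scale=0.5, size=300)

df = pd.DataFrame({
    "Earnings": earnings,
    "Outstanding Shares": shares,
    "Market Cap": market_cap,
})

# Pairwise Pearson correlations; a heatmap is just this matrix rendered in
# color (e.g. seaborn.heatmap(corr, annot=True)).
corr = df.corr()
```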

Scatterplots between two features, Earnings and Beta, and Market Cap

Scatterplots were used to look for relationships between individual predictors and Market Cap. The plots above demonstrate almost no relationship between Beta and Market Cap but a positive relationship between Earnings and Market Cap.

Next, to test each feature’s ability to predict Market Cap mathematically, a model was created using simple linear regression on each predictor. Each model was trained on training data and then evaluated on a validation set. The whole dataset was split into a training set and a validation set with the train_test_split() function from the scikit-learn (sklearn) library, leaving roughly 2,400 entries for training and 600 for validation. The function parameter random_state was set to 0 for all of the single-variable analyses to ensure consistent results when re-running the program.
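The split can be sketched as follows. The arrays here are placeholders for the real feature matrix and Market Cap column; a test_size of 0.2 on 3,000 rows reproduces the roughly 2,400/600 split described above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the ~3,000 cleaned entries.
X = np.arange(3000, dtype=float).reshape(-1, 1)  # predictor matrix
y = 2.0 * X[:, 0]                                # response (Market Cap stand-in)

# 80/20 split; a fixed random_state makes the split reproducible across runs.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```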

The first feature to be tested was the categorical predictor Industry. There are 147 different possible classes a data entry can have for Industry. According to the results from training a model on Industry, almost every class was statistically insignificant in predicting Market Cap. When using the trained model with the validation set, Industry was not able to accurately predict Market Cap. In addition, a categorical variable with 147 possible categories would add significant complexity to a multi-variable model, so Industry was dropped from the dataset.

This process was repeated for every other feature: a model was trained to predict Market Cap using that predictor and then tested on the validation set. The feature Outstanding Shares proved statistically significant, with a p-value of effectively zero, and able to weakly predict Market Cap, with a testing R-squared of 0.35. The features Relative Volume and Beta had high p-values and testing R-squared values close to 0. Performance by Year and Earnings both had p-values of effectively zero, meaning they are statistically significant for predicting Market Cap. Performance by Week, Performance by Month, and Performance by Quarter all had high p-values and low R-squared values. The final predictor, Performance by Half Year, had a low nonzero p-value and a low R-squared value; it may be statistically significant, but its ability to predict Market Cap appears limited.
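The per-feature screening loop can be sketched as below. The data is synthetic: one stand-in feature ("Earnings") truly drives the response while the other ("Beta") is pure noise by construction, so the validation R-squared separates them the same way the analysis above does. (The p-values reported in the text would come from a package such as statsmodels rather than scikit-learn.)

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: "Earnings" truly drives the response, "Beta" is unrelated noise.
rng = np.random.default_rng(0)
n = 1000
features = {
    "Earnings": rng.normal(size=n),
    "Beta": rng.normal(size=n),
}
y = 3.0 * features["Earnings"] + rng.normal(scale=0.5, size=n)

# Fit a simple (one-predictor) linear regression per feature and record
# its R-squared on a held-out validation set.
scores = {}
for name, x in features.items():
    X = x.reshape(-1, 1)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    scores[name] = model.score(X_va, y_va)
```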

Multi-variable Analysis

After analyzing each feature individually, a model was created that used all of the features to predict Market Cap. When tested, this model produced a testing R-squared value of 0.624 and a training R-squared value of 0.680. According to the results from training the model, Performance by Week, Performance by Month, Performance by Quarter, Relative Volume, and Beta were all statistically insignificant features. These five predictors were removed to simplify the model, leaving just four features. The cost in accuracy of removing these predictors was minuscule, with a 0.001 drop in both the training and testing R-squared values.
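The effect of dropping insignificant predictors can be illustrated with synthetic data in which the last column is pure noise by construction; removing it barely changes the validation R-squared, mirroring the minuscule accuracy cost reported above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with four predictors; the last column has a true coefficient
# of zero, playing the role of a statistically insignificant feature.
rng = np.random.default_rng(1)
n = 1500
X_all = rng.normal(size=(n, 4))
y = (2.0 * X_all[:, 0] + 1.5 * X_all[:, 1] + 0.8 * X_all[:, 2]
     + rng.normal(scale=0.5, size=n))

X_tr, X_va, y_tr, y_va = train_test_split(X_all, y, test_size=0.2, random_state=0)

full = LinearRegression().fit(X_tr, y_tr)
r2_full = full.score(X_va, y_va)

# Refit without the irrelevant column; accuracy barely changes.
reduced = LinearRegression().fit(X_tr[:, :3], y_tr)
r2_reduced = reduced.score(X_va[:, :3], y_va)
```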

To continue improving this model, nonlinear and interaction terms were tested. Adding nonlinear terms to the model greatly decreased the model’s testing accuracy. The greatest decrease in accuracy was seen when a nonlinear Earnings term was added. Nonlinear terms were tested on all of the model’s features. Plots like the one below were used to look for a nonlinear trend in the data.

Interaction terms, by contrast, benefited the model and improved its accuracy. An interaction term between Performance by Year and Earnings improved the model’s testing accuracy and had a p-value of effectively zero, as did an interaction term between Performance by Half Year and Earnings. Adding an interaction term between Earnings and Outstanding Shares did not improve the accuracy of the model’s predictions, and the term had a high p-value, meaning it is statistically insignificant.
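A sketch of how an interaction term is added and evaluated, using synthetic data whose response genuinely contains an interaction between two stand-in predictors (named after Earnings and Performance by Year for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: the response contains a genuine interaction effect.
rng = np.random.default_rng(2)
n = 1500
earnings = rng.normal(size=n)
perf_year = rng.normal(size=n)
y = (1.0 * earnings + 0.5 * perf_year + 2.0 * earnings * perf_year
     + rng.normal(scale=0.3, size=n))

X_base = np.column_stack([earnings, perf_year])
X_int = np.column_stack([earnings, perf_year, earnings * perf_year])  # product term

# The same random_state keeps the train/validation rows aligned across designs.
Xb_tr, Xb_va, y_tr, y_va = train_test_split(X_base, y, test_size=0.2, random_state=0)
Xi_tr, Xi_va, _, _ = train_test_split(X_int, y, test_size=0.2, random_state=0)

r2_base = LinearRegression().fit(Xb_tr, y_tr).score(Xb_va, y_va)
r2_interaction = LinearRegression().fit(Xi_tr, y_tr).score(Xi_va, y_va)
```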

So far, all of the training and testing had been done with the same training and validation sets, determined by the random_state parameter of the train_test_split() function. To ensure that the model sustains its accuracy across different validation sets, a for-loop was added that iterates through multiple values of random_state, producing a different training and validation set each time. The model was trained and tested on each split, the testing R-squared value was recorded, and the mean of these values was calculated to give a more reliable estimate of the model’s accuracy. Over 200 different values of random_state, the average testing R-squared value was 0.629.
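The resampling loop can be sketched as follows (the data here is synthetic; the real analysis averaged the testing R-squared over 200 values of random_state in the same way):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, 1.0, 0.5]) + rng.normal(scale=0.4, size=1000)

# Re-split, refit, and re-score for 200 different random_state values,
# then average the testing R-squared for a more stable accuracy estimate.
r2_values = []
for state in range(200):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=state)
    model = LinearRegression().fit(X_tr, y_tr)
    r2_values.append(model.score(X_va, y_va))

mean_r2 = float(np.mean(r2_values))
```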

Standardization and Outliers

Some datasets contain outlier points, which can be outliers in the output or in the inputs. An outlier in output is a point whose output differs greatly from what its inputs would predict. An outlier in input is a point whose inputs are very different from the rest of the dataset; such points are said to have high leverage. Because high-leverage points and large-residual points can strongly influence a model’s estimated parameters, their presence can add extra error. The plots below were used to check for these outliers visually. There are quite a few points with extremely high leverage and others with Market Cap values far from what the model predicts (studentized residual values far from zero).

Because these outlier points may be affecting the model, a new model was trained on data with the outliers removed: points with leverage greater than 0.05 and points with studentized residuals above five or below negative five were dropped from the training set. This model was tested using the same for-loop to iterate through different training and validation sets. Over the same 200 values of random_state, performance decreased: the average R-squared value fell from 0.629 (for the model trained with outliers) to 0.611 (for the model trained without outliers).
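The leverage and studentized-residual filter can be computed directly from the hat matrix. The sketch below uses synthetic data with one planted high-leverage point and one planted output outlier, and applies the same thresholds described above (leverage > 0.05, |studentized residual| > 5):

```python
import numpy as np

# Synthetic data with one planted high-leverage point (index 0) and one
# planted output outlier (index 1).
rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)
x[0] = 10.0   # inputs far from the rest of the data -> high leverage
y[1] = 30.0   # output far from what its input predicts -> large residual

X = np.column_stack([np.ones(n), x])      # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
leverage = np.diag(H)                     # leverage = diagonal of the hat matrix

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
s = np.sqrt(resid @ resid / (n - X.shape[1]))
studentized = resid / (s * np.sqrt(1.0 - leverage))  # internally studentized

# Keep only points within the thresholds used in the text.
mask = (leverage <= 0.05) & (np.abs(studentized) <= 5.0)
X_clean, y_clean = X[mask], y[mask]
```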

Standardization is another method that can improve model accuracy by putting all features on an equal footing. When a dataset is standardized, each value is expressed in terms of its predictor’s mean and standard deviation. Because every feature then shares a common scale, no predictor can dominate simply by having naturally larger values.
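Standardization can be done with scikit-learn's StandardScaler. The sketch below uses two synthetic features on very different scales, stand-ins for, say, a share count in the billions and a performance percentage:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales.
rng = np.random.default_rng(5)
X = np.column_stack([
    rng.normal(1e9, 3e8, size=500),   # share-count-like magnitudes
    rng.normal(0.05, 0.02, size=500), # percentage-like magnitudes
])

# Subtract each column's mean and divide by its standard deviation,
# so every feature ends up with mean 0 and standard deviation 1.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
```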

Another model was created and tested with the data standardized, using the same for-loop and 200 values of random_state. This model produced an average R-squared value of 0.629, identical to the initial model. A fourth model was created with standardized data and with outliers removed. Tested with the same process, it produced an average testing R-squared value of 0.611, matching the model trained without outliers and without standardized data, and still lower than both models trained with outliers. The four plots below visualize the R-squared values produced by all four models.

Both models fitted with outliers included had the same, higher average R-squared of 0.629, while both models trained with the outliers removed averaged 0.611. The plots demonstrate the better performance of the models trained with outliers included.

Four plots corresponding to the four models that were created. Orange represents testing R-squared, and blue represents training R-squared. Note: the y-axis scale is different between plots.

Conclusion

This project found that Earnings, Performance by Year, Performance by Half Year, and Number of Outstanding Shares can be used to model a company’s Market Cap. It also found that Earnings and Performance by Year interact: in combination, modeled through an interaction term, they are associated with an additional increase in Market Cap. Earnings and Performance by Half Year, also modeled through an interaction term, are together associated with a decrease in Market Cap. The project found no accuracy benefit from adding a nonlinear term to the model; nonlinear terms were tested for every predictor, and accuracy decreased every time. Finally, the model did not benefit from being trained on standardized data, nor from being trained on data with outliers removed.

Outstanding Shares was an unexpected feature, and its presence in the most accurate model supported this project’s hypothesis. In theory, Market Cap is based only on Earnings and projected growth, but in practice it can be predicted using other features as well.

This research was limited by a lack of usable data points: less than one-third of the observations included in the initial dataset were suitable for the analysis. Another limitation is a lack of data from different time frames. All data was collected on December 11th, 2023, so some of this project’s conclusions may not apply to other time periods. Many patterns found in the stock market may be too subtle or complex to be captured by linear regression, and as this project only used linear regression, its ability to detect such patterns was limited.

Future research would benefit from studying a wider range of data metrics and factors in predicting company valuation. A study on data across multiple years would also expand on this project and determine if the conclusions from this research can be applied in the future.

References

Arrieta-Ibarra, I., & Lobato, I. N. (2015). Testing for Predictability in Financial Returns Using Statistical Learning Procedures. Journal of Time Series Analysis, 36(5), 672–686. https://doi.org/10.1111/jtsa.12120

Ganguli, S. K. (2011). Accounting Earning, Book Value and Cash Flow in Equity Valuation: An Empirical Study on CNX NIFTY Companies. IUP Journal of Accounting Research & Audit Practices, 10(3), 68–77.

Khan, A. H., Shah, A., Ali, A., Shahid, R., Zahid, Z. U., Sharif, M. U., et al. (2023). A performance comparison of machine learning models for stock market prediction with novel investment strategy. PLoS ONE, 18(9), e0286362. https://doi.org/10.1371/journal.pone.0286362

Larcher, J. (2023). US Stock Metrics & Performance [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/7187831

Lin, C. Y., & Lobo Marques, J. A. (2024). Stock market prediction using Artificial Intelligence: A systematic review of Systematic Reviews. Social Sciences & Humanities Open, 9. https://doi.org/10.1016/j.ssaho.2024.100864

Sawale, G. J., & Rawat, M. K. (2022). Stock Market Forecasting Using Metaheuristic LSTM Approach with Sentiment Analysis. Special Education, 2(43), 1800–1806.

Sohdi, L. R. (2024). The Influence of Growth Rate, Profitability, Liquidity, and Company Valuation on Stock Price. Jurnal Riset Akuntansi Dan Bisnis Airlangga (JRABA), 9(1), 1–23. https://doi.org/10.20473/jraba.v9i1.56477

The Finance Storyteller. (2018, November 14). Market Capitalization explained. YouTube. https://www.youtube.com/watch?v=k-Rp32j0uj8

Tiwari, S., & Chaturvedi, A. K. (2021). A Survey on LSTM-based Stock Market Prediction. Ilkogretim Online, 20(5), 1671–1677. https://doi.org/10.17051/ilkonline.2021.05.182


About the author

Andrew Jain

Andrew is a Junior in high school with strong interests in math and science. At school, he competes on the math team as well as the cross country and tennis teams. His work in statistics has been greatly assisted by a mentorship under Dr. Mohammadreza Kalan, a postdoctoral researcher at Columbia University. With the dream of using his knowledge to help the world, Andrew plans to pursue an undergraduate degree in Engineering.