Is your business going bankrupt?

Problem Introduction

No one wants their business to go bankrupt. This analysis tries to gain insight from businesses that have gone bankrupt, using a dataset found on Kaggle. The main goal of this article is to help other businesses avoid the same trends.

This dataset contains data taken from the Taiwan Economic Journal from 1999 to 2009, and was gathered by Chih-Fong Tsai and Deron Liang from a university in Taiwan. The dataset was published by Kaggle user fedesoriano. According to the dataset’s tasks, the main goal is to maximize the F1 score, but more about that in the metrics discussion later in this blog.

The main problem to solve is to identify companies that are in the process of going bankrupt or actively going bankrupt. This problem can be extended to finding the factors that contribute to a company going bankrupt. This is done by running a few iterations on the dataset and converting some features from linear space to log space because of the wide range of values in those features’ distributions.

Five models are trained on the dataset in each iteration, and their results are tracked. The models used are Logistic Regression, Decision Tree, Random Forest, XGBoost, and Support Vector Machine (SVM). After each iteration, the F1 score for each model was compared and analyzed along with the features with the highest weights.

The metric that we want to maximize is the F1 score. This metric is established in the problem definition, but it is also the clear choice given how imbalanced the classification target is, as shown in Figure 1.

The formula for F1 Score is:

F1_Score = 2 * ((precision * recall) / (precision + recall))

where Precision and Recall are defined as:

Precision = true positives /(true positives + false positives)

Recall = true positives / (true positives + false negatives)

Accuracy doesn’t make much sense in this case because a model that classified every company as not bankrupt would still receive about 97% accuracy. Recall or F1 score makes more sense because the goal is to make people aware that they might be at risk of losing their business, even if some of those warnings turn out to be false alarms. This analysis goes with the original ask of maximizing F1 score, because we also want the final predictions to have good precision.
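To make the accuracy argument concrete, here is a minimal sketch in plain Python. The counts are made up (1,000 companies, 30 of them bankrupt) and the function name is my own; it only illustrates the formulas above for a model that labels every company as not bankrupt.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical test set: 1000 companies, 30 of them bankrupt (about 3%).
n_total, n_bankrupt = 1000, 30

# A "model" that labels every company as not bankrupt.
tp, fp, fn = 0, 0, n_bankrupt
tn = n_total - n_bankrupt

accuracy = (tp + tn) / n_total
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, f1={f1:.2f}")
# prints: accuracy=0.97, recall=0.00, f1=0.00
```

The accuracy looks great while recall and F1 are both zero, which is why accuracy is a poor yardstick for a target this imbalanced.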

Figure 1: Distribution for the Bankrupt feature

Data Exploration/ Data Visualizations

The dataset contains 95 features that are all of type integer or double. This makes changing and analyzing the data fairly straightforward, so data types aren’t an issue when preparing the dataset for modeling. However, there are likely features that don’t add anything to the dataset or have any bearing on the target feature. Analyzing the distributions and adjusting these values to suit the different model types will be the best way to iterate on this dataset.

The first step was to look for features with very little or no variation, and to drop any column with only one value. One column fit this description: Net Income Flag. The next step was to view a heatmap of the features and look for those with a high correlation to the “Bankrupt?” feature. The resulting heatmap is shown in Figure 2.
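Both steps can be sketched with pandas. This is only an illustration on a tiny made-up frame, not the notebook code from the project; the values are invented and the column names are used purely as examples.

```python
import pandas as pd

# Tiny made-up frame standing in for the Kaggle CSV (values are invented).
df = pd.DataFrame({
    "Bankrupt?": [0, 0, 0, 1, 0, 1],
    "Net Income Flag": [1, 1, 1, 1, 1, 1],
    "Debt ratio %": [0.10, 0.12, 0.11, 0.35, 0.09, 0.40],
    "Cash/Total Assets": [0.30, 0.25, 0.28, 0.02, 0.33, 0.01],
})

# Drop every column that holds a single unique value.
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=constant_cols)

# Rank the remaining features by absolute correlation with the target.
corr_with_target = (
    df.corr()["Bankrupt?"]
    .drop("Bankrupt?")
    .abs()
    .sort_values(ascending=False)
)

print(constant_cols)
print(corr_with_target)
```

The same `df.corr()` matrix is what a heatmap would typically be drawn from; the plotting call is left out here to keep the sketch free of extra dependencies.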

Figure 2: Heatmap for all of the features in the dataset.

This visual shows a few features in the dataset that have a high correlation to bankruptcy, so a closer look at those features could produce more insights. The box and whisker plots in Figure 3 highlight some interesting aspects of these specific features.

Figure 3: Box and Whisker plots for each of the interesting features

It is interesting to see so many outliers in the box plots for these features. It is also notable that many of the features are continuous and scaled from 0 to 1. This holds for each of these features, and for many of the other features that were analyzed.

The final thing to consider is the distribution of each feature. Adjusting skewed distributions so that they don’t weigh down or skew the dataset would be useful for later iterations. One feature that fit this description was Total Asset Growth Rate (Figure 4).

Figure 4: Distribution of the Total Asset Growth Rate feature from the dataset.

This distribution has a large number of values between 0 and 0.001, 760 to be exact. It is interesting because it looks like a normal distribution apart from this exceedingly high count between 0 and 0.001.

The other distribution noticed in this dataset was a lognormal distribution. These distributions have a long tail with low value counts, because the distribution typically has no cap on the high end. A common example would be salaries in a large company. We see examples of this in the dataset as well, such as the Cash/Total Assets feature (Figure 5).

Figure 5: Distribution for the Cash/Total Assets feature from the dataset.

Normal distributions are handled better by various models, and for some of the selected models normality is one of the assumptions. So adjusting these distributions is worth investigating in the iterations of the modeling phase.

Models/Methodology

Five models were used for modeling. Each model was tuned with GridSearchCV from the scikit-learn library, using various model parameters. The five models used for classification are:

  1. Logistic Regression
  2. Decision Tree
  3. Random Forest
  4. XGBoost
  5. SVM

These are all traditional supervised learning models for classification. The logistic regression model fits a linear combination of the features and passes it through a logistic function, so the learned coefficients indicate which features carry the most weight on the target. SVM views the data in a multi-dimensional space, optionally transformed by a kernel, and looks for the boundary that best separates the classes. These two models stand apart from the others.

The remaining models are tree-based classification models. Decision Tree uses features to split the dataset based on various splitting functions, and at the bottom of each branch it provides a classification for that observation. Random forest builds on the same basic concept, but it contains many “weaker” learners and settles the classification by majority vote among them. You can think of it as a large group of decision trees, hence the name random forest. XGBoost is also an ensemble of decision trees, but it uses gradient boosting: the trees are built one after another, each one correcting the errors of the trees before it. Its improvements are centered around parallel processing, tree pruning, handling missing values, and regularization methods to avoid overfitting and biases in the data.

These models were selected because they suit this classification problem. Decision trees, random forests, and XGBoost are all classification models that have historically done very well in Kaggle competitions. The logistic regression model is a standard introductory model and thus quick to implement as a proof of concept. SVM is a more computationally intensive model that I added in order to cover most of the models I was familiar with.
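The majority-rule vote described for random forest can be sketched in a few lines. The votes below are made up, and the helper name is my own; a real library implementation may combine its trees differently (for example by averaging predicted probabilities), so treat this purely as an illustration of the idea.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most learners for one observation."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Hypothetical votes from five weak learners for one company (1 = bankrupt).
votes = [0, 1, 1, 0, 1]
print(majority_vote(votes))  # prints 1
```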

Preprocessing/Process Outline:

Each iteration required a little bit of preprocessing, and the dataset for the first iteration was used as a base for the last two iterations. Only some generic data processing based on EDA was needed, because the dataset was very clean and tidy to begin with. The second and third iterations used Python’s built-in math library to convert the values of selected features to log base 10.

The original dataset had the “Net Income Flag” feature dropped because all of its values were 1, so it wouldn’t add anything to the modeling. Before each iteration, the data was split, with 30% of the original dataset set aside for testing. This is largely due to the size of the data, but less could have been held out to increase training accuracy. Finally, after the models were trained, they were saved with the pickle library, and if the exported file already exists, it is loaded instead to save on training time.
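The split-and-cache workflow might look roughly like the sketch below. It runs on synthetic data, and the helper `load_or_train`, the file name, and the use of `stratify` are my own assumptions for illustration rather than the exact project code.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def load_or_train(model, X_train, y_train, path):
    """Load a pickled model from path if it exists; otherwise fit and save it."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    model.fit(X_train, y_train)
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    return model


# Synthetic, imbalanced stand-in for the bankruptcy data.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Hold out 30% of the rows for testing. Stratifying is an assumption here;
# it keeps the class ratio similar in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model_path = os.path.join(tempfile.mkdtemp(), "decision_tree.pkl")

# The first call trains and saves; the second call loads the saved file.
first = load_or_train(DecisionTreeClassifier(random_state=0), X_train, y_train, model_path)
second = load_or_train(DecisionTreeClassifier(random_state=0), X_train, y_train, model_path)
```

One thing to keep in mind with this pattern is that a cached file is reused even if the data or parameters change, so the file needs to be deleted whenever the training setup is modified.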

The first iteration on top of the base dataset converted the three columns whose names contain the currency of Taiwan (“Yuan”). When a column was converted, “log10_” was prepended to the feature name to distinguish it from the original values. The columns adjusted in this iteration were:

  1. Revenue Per Share (Yuan ¥)
  2. Operating Profit Per Share (Yuan ¥)
  3. Per Share Net profit before tax (Yuan ¥)

The second iteration on top of the base dataset converted the above columns and 53 other features to log base 10. These features are listed in the variable “log_normal_features” in the constants file in my GitHub repo for this project. They were chosen by analyzing the distributions of those features.
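As a rough sketch of the log10 conversion and renaming described above: the function name, the sample values, and especially the small `EPSILON` offset are assumptions of mine. The offset is there only so that zeros do not raise an error in `math.log10`; how zeros were handled in the actual project may differ.

```python
import math

import pandas as pd

LOG_PREFIX = "log10_"
EPSILON = 1e-9  # assumption: small offset so that zeros do not break log10


def to_log10(df, columns, prefix=LOG_PREFIX, epsilon=EPSILON):
    """Return a copy of df where each listed column is replaced by its log10 version."""
    out = df.copy()
    for col in columns:
        out[prefix + col] = out[col].apply(lambda v: math.log10(v + epsilon))
        out = out.drop(columns=[col])
    return out


# Made-up values; the column name mirrors one of the three listed above.
revenue_col = "Revenue Per Share (Yuan ¥)"
df = pd.DataFrame({
    revenue_col: [0.01, 0.1, 1.0, 100.0],
    "Bankrupt?": [1, 0, 0, 0],
})

transformed = to_log10(df, [revenue_col])
print(transformed.columns.tolist())
```

Note that `math.log10` raises a `ValueError` for negative inputs, so this sketch only suits features that are non-negative.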

Refinements/Hyperparameter Tuning

Figure 6: Example of a lognormal distribution. Here there is a distribution with a long tail, and a lot of data points compressed to the left or even to the right.

From previous experience with monetary values, a lot of the distributions are lognormal distributions (Figure 6). These distributions are not well suited to some models (e.g., logistic regression), so the data is converted to a log scale to bring the distribution closer to normal for the modeling step.

In total there were two iterations beyond the base dataset, both of which took features that were lognormally distributed and converted them to approximately normal distributions before modeling. The first consisted of converting only the values denominated in Yuan to log scale. The second cast a wider net over the distributions that appeared to be lognormal.

The parameters tuned for the respective models are as follows:

  1. Logistic Regression: penalty [l1, l2], and multi_class [auto, ovr]
  2. Decision Tree: criterion [gini, entropy], and splitter [best, random]
  3. Random Forest: n_estimators [100, 120, 140, 160, 180, 200], and criterion [gini, entropy]
  4. SVC: kernel [linear, poly, rbf, sigmoid], and max_iter [500, 1000]
  5. XGBoost: n_estimators [100, 200], learning_rate [.1, .15], and loss [deviance, exponential]

Some of these parameter choices were made because the models were taking too long to train. SVC had the max_iter parameter set because, left at its default, it trained for over a day before the process was stopped.
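A grid search over the Decision Tree parameters listed above, scored on F1, might look like the following minimal sketch. It uses synthetic data, and the `cv=5` and `random_state` settings are my own assumptions; the other models would follow the same pattern with their own grids.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bankruptcy data.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Same Decision Tree grid as in the list above.
param_grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",  # optimize the metric this analysis cares about
    cv=5,          # assumption: 5-fold cross-validation
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Setting `scoring="f1"` matters here: left at its default, the search would rank a classifier's parameter combinations by accuracy, which is the metric this analysis argues against.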

Results

For each iteration, the decision tree produced the highest F1 score on the test set, as seen in Figure 7. Given the changes to the dataset, the fact that the same model produced the highest score by a fairly wide margin in every iteration is quite surprising.

Figure 7: F1 scores for each model over the three datasets.

Across the iterations, the best F1 scores came from the second iteration, and the third iteration produced lower F1 scores than the raw base data. There were 95 features used to predict whether a company went bankrupt. When a feature’s values were converted to log scale, the converted feature replaced the original in modeling. With 95 features present in each iteration, analysis was only completed on the top ten features for each dataset.
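Picking the top ten features from a list of weights is a simple sort. The sketch below uses made-up feature names and weights, and the helper name is my own. For a fitted scikit-learn tree the weights would typically come from its `feature_importances_` attribute, though that is an assumption about how the weights in this project were extracted.

```python
def top_features(names, weights, k=10):
    """Return the k (name, weight) pairs with the largest weights, highest first."""
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical names and weights for twelve features.
names = [f"feature_{i}" for i in range(12)]
weights = [0.01, 0.20, 0.05, 0.30, 0.02, 0.03, 0.04, 0.06, 0.07, 0.08, 0.09, 0.05]

top_ten = top_features(names, weights)
for name, weight in top_ten:
    print(f"{name}: {weight:.2f}")
```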

Top ten features for the first iteration:

Figure 8: Bar chart showing the top ten features from the first round of modeling.

Top ten features for the second iteration:

Figure 9: Bar chart showing the top ten features from the second round of modeling.

Top ten features for the third iteration:

Figure 10: Bar chart showing the top ten features from the third round of modeling.

How the top ten features changed over the three iterations.

Figure 11: Feature Weights for the best model for each iteration.

In the first couple of iterations, Borrowing Dependency is clearly the feature with the most weight. Borrowing Dependency was changed in the last iteration because it was one of the features converted to log10. This changed its feature weight significantly, but it was still in the top ten features.

In the last iteration, Net Income to Stockholder's Equity became the top-weighted feature, and ROA(B) before interest and depreciation after tax continued to be in the top ten. ROA(B) was also one of two features in the top ten for all three datasets. The other feature that had a large weight and was consistently in the top ten was Non-industry income and expenditure/revenue. These features all showed interesting trends across the iterations.

Another item of note: of the eight high-correlation features noticed during the EDA portion of the project, only one ranked among the top-weighted features of the best-performing model across the project’s iterations.

This brings attention to the feature Borrowing Dependency. The dataset does not explicitly define it, but I hypothesize that it is the percent chance for the company to receive a loan from a bank or another vendor. This would be an interesting feature to investigate, because it would mean that whether a company stays afloat depends on whether it can receive loans from other companies or investors.

Conclusion/Reflection

Going bankrupt is not something that we want to happen to any company. However, after analysis and modeling on a dataset of Taiwanese companies, the Decision Tree classifier proved the most effective at determining whether a company was going to go bankrupt. The features that factor most into this are Borrowing Dependency, Net Value Growth Rate, and Non-industry income and expenditure/revenue.

This project was tough for various reasons. Chiefly, none of the features had very good definitions. This makes it very difficult to determine why the results were what they were; knowing what more of these features are would have provided more insight into how to adjust the values to prepare for modeling. The other factor that made this dataset interesting and additionally hard was that there were no NaN values. Missing values could provide additional insight and open up more iterations on the dataset.

Improvements

Reflecting on this project, I would change how many iterations the SVM model is allowed to run. I would also refine the dataset further and convert more distributions to normal distributions.

If additional time were available, I would start iterating and removing features that were not lognormal by reviewing and iterating on the results. In addition, I noted a few other features that didn’t contribute to the dataset, and I would run another iteration without these features in the training set to see whether the top features changed, and continue to filter down the number of features.

The full analysis of this data can be found on my GitHub profile found here, along with some bigger visuals than what could be captured and included in this blog post.

Data Scientist I at RiskLens. Currently living in Spokane, WA.
