Banking supervision with the use of innovative statistical techniques

Proactively monitoring and assessing the economic health of the financial system has always been a cornerstone of the work of supervisory authorities, supporting informed and timely decision making. The Bank of Greece, as the competent supervisory authority for the Greek banking system, evaluates both the riskiness of individual banks and the health of the financial system as a whole, from a macro-prudential perspective. In pursuing these objectives, the Bank of Greece could make use, inter alia, of various statistical methods along with expert judgment. In this work, we employ a series of innovative modeling techniques to predict individual bank insolvencies and generalized financial crises. Our empirical results indicate that innovative statistical techniques, i.e. Deep Learning and Machine Learning methodologies, have superior out-of-sample and out-of-time predictive performance in comparison to methods traditionally employed in finance, such as Logistic Regression, Classification Trees, and Linear Discriminant Analysis. In essence, we build one Early Warning System for bank insolvencies and another for stock market crises, which could complement the assessments performed by micro-prudential and macro-prudential authorities. In short, holistic monitoring of the resilience of the financial system would steer decision making by triggering any necessary targeted corrective actions, leading vulnerable institutions back to viable business performance and the financial system back to balanced operation.


Introduction - Banking Supervision and Innovation
The recent global financial crisis significantly disrupted economic growth and had severe socioeconomic and fiscal effects in most parts of the world. In several countries the sovereign had to step in and provide support in order to avoid the full collapse of the banking system. In Greece, the global financial crisis unveiled the large economic imbalances that had built up after euro accession, which ultimately triggered an unprecedented sovereign and banking crisis. More specifically, the Greek economy suffered from "twin" deficits, namely the fiscal and the current account deficits, which were the result of a strong fiscal expansion financed mostly by external borrowing. At the same time, private indebtedness had increased on the back of sizeable domestic credit growth. As cross-border flows dropped dramatically after the eruption of the global financial crisis in 2008, Greece remained exposed to these economic imbalances and avoided complete collapse only thanks to the economic support received from the IMF and its EU partners. The crisis also weighed on the Greek banking system, which lost access to the capital and liquidity markets and had to resort to the Emergency Liquidity Assistance of the Central Bank to address massive deposit outflows. Asset quality deteriorated significantly due to the worsened balance sheets of both corporates and households. The elevated loan loss provisions, the impact of the participation in the sovereign debt restructuring, and the subdued profitability resulted in three rounds of recapitalization for the four core banks and the resolution of several non-systemic players.
The supervisory response to the global financial crisis was immense. The Basel Committee on Banking Supervision updated the rulebook, incorporating the lessons learned from the crisis, in the so-called Basel III accord that was transposed into the EU framework via the CRR/CRD IV. The supervisory requirements increased significantly in both quantitative and qualitative terms. The macro-prudential perspective in supervision gained momentum, and several requirements regarding the recovery and resolution of financial institutions were introduced. The effort made by the supervisory authorities and the official sector aimed at building a resilient financial sector, capable of absorbing the impact of a future crisis. However, future financial crises cannot be ruled out, and we can never be sure about the shape and length of an imminent crisis. Therefore, considering the importance of early warning systems in mitigating or even preempting a financial crisis, we used innovative statistical techniques to build tools for predicting crises both at the level of individual banks and at the level of the whole financial system.
The working paper is structured as follows: Chapter 2 presents an overview of the innovative methodologies employed by the Bank of Greece, and Chapter 3 presents the relevant evaluation measures. The respective applications of the innovative methodologies are described in detail in Chapter 4 (bank insolvency prediction) and Chapter 5 (stock market crisis prediction), while conclusions and regulatory implications are discussed in Chapter 6.

Innovative Methodologies
Random Forests (RF) is a popular method for modeling classification problems. Since its inception (Breiman 2000), the method has gained significant ground and is frequently used in machine learning applications across various academic fields. To build the Random Forests considered here, we employed the "randomForest" package in R. The basic philosophy of Random Forests rests on combining three concepts: i) classification or regression decision trees; ii) bootstrap aggregation, or bagging; and iii) random subspaces. The method adopts a divide-and-conquer approach to capture non-linearities in the data and perform pattern recognition. Its core principle is that a group of "weak learners", when combined, can form a "strong predictor" model.
Support Vector Machines (henceforth SVMs) are a family of non-linear, large-margin binary classifiers. SVMs estimate a separating hyperplane that achieves maximum separability between the data of the two modeled classes (Vapnik, 1998). The main drawback of SVMs is that they are black-box models, which limits the intuition and visualization they can offer regarding the obtained results and the inference procedure.
Neural Networks constitute a well-known machine learning technique that is broadly used in credit rating classification problems. Such problems are characterized by the availability of large datasets, many explanatory variables, and the possible presence of noise in the data. Experimental results offer evidence that neural networks are able to capture complex non-linear patterns in the analyzed data. As such, it is no coincidence that the current literature offers numerous structural variations of Neural Networks, depending on the number of layers, the flow of information and the algorithms used to train them.
XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm, often offering increased efficiency, accuracy and scalability relative to RFs and NNs. It supports fitting various kinds of objective functions, including regression, classification and ranking. XGBoost offers increased flexibility, since optimization is performed over an extended set of hyperparameters, and it supports continued training of an existing model as new data arrive.
Deep learning has been an active field of research in recent years, as it has achieved significant breakthroughs in the fields of computer vision and language understanding. However, its application in the field of finance is rather limited. Our approach consists in building a multi-layer perceptron using the MXNET package of R. We postulated modern deep models that are up to five hidden layers deep and comprise various numbers of neurons. Model selection was performed using cross-validation, by maximizing the area under the curve (AUC) metric.

Evaluation Measures
Classification accuracy is the main criterion used to assess the efficacy of each method and to select the most robust one. In this section, we present a series of metrics that are broadly used by the Bank of Greece for quantifying the discriminatory power of each scoring model. In evaluating classification accuracy, we focus on the following measures:
• G-mean: The geometric mean is the square root of the product of sensitivity and specificity. This metric indicates the balance between classification performance on the majority and the minority class. Poor performance in predicting the positive cases will lead to a low G-mean value, even if the negative cases are correctly classified by the algorithm.
• LR-: The negative likelihood ratio is the ratio between the probability of predicting a case as negative when it is actually positive and the probability of predicting a case as negative when it is truly negative. A lower negative likelihood ratio means that a negative prediction is more reliable, i.e. fewer actual failures are missed, which is the main point of interest in this study as we model bank failures.
• DP: Discriminant power is a measure that summarizes sensitivity and specificity. DP values higher than 3 indicate that the algorithm distinguishes well between positive and negative cases.
• BA: The balanced accuracy is the average of sensitivity and specificity. If the classifier performs equally well on either class, this term reduces to the conventional accuracy measure. In contrast, if the conventional accuracy is high merely because the classifier takes advantage of good prediction on the majority class, the balanced accuracy will drop, thus signaling the performance issue. That is, BA does not disregard the accuracy of the model on the minority class.
• Youden's γ: Youden's index is a linear transformation of the mean of sensitivity and specificity (γ = sensitivity + specificity - 1), so its absolute level is difficult to interpret. As a general rule, a higher value of Youden's γ indicates a better ability of the algorithm to avoid misclassifying banks.
• AUC: The area under the ROC curve (Area Under Curve, AUC) summarizes the performance of a classifier in a single metric. The AUC can be estimated through various techniques, the most commonly used being the trapezoidal method. The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. In practice, the value of the AUC varies between 0.5 and 1, with a value above 0.8 denoting very good performance of the algorithm.
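As an illustration, the measures above can be computed from the four confusion-matrix counts. The sketch below is not the authors' implementation: the function names are hypothetical, and the DP formula used, (√3/π)·[ln(sens/(1-sens)) + ln(spec/(1-spec))], is the standard definition from the literature, assumed here because the text does not state it. The AUC is computed through its ranking interpretation rather than the trapezoidal method.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute the listed measures from confusion-matrix counts.

    tp/fn: positive (e.g. insolvent) cases classified correctly/incorrectly.
    tn/fp: negative (e.g. solvent) cases classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate

    def log_odds(p):
        # Undefined when a rate is exactly 0 or 1.
        return math.log(p / (1.0 - p))

    return {
        "g_mean": math.sqrt(sensitivity * specificity),
        "lr_minus": (1.0 - sensitivity) / specificity,
        # Assumed standard definition of discriminant power.
        "dp": (math.sqrt(3) / math.pi)
        * (log_odds(sensitivity) + log_odds(specificity)),
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
        "youden": sensitivity + specificity - 1.0,
    }


def auc_by_ranking(scores_pos, scores_neg):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as one half.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy example: sensitivity 0.8, specificity 0.9.
metrics = classification_metrics(tp=80, fn=20, tn=900, fp=100)
auc = auc_by_ranking([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
```

In the toy example, the balanced accuracy is 0.85 and Youden's γ is 0.70, which shows the linear relation γ = 2·BA - 1 between the two measures.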

Predicting bank insolvencies using machine learning techniques
Supervisory authorities are mandated to protect depositors' interests by ensuring that financial institutions are able to survive under business-as-usual conditions and are capable of absorbing adverse market shocks. Hence, the comprehensive assessment of the current financial condition of a bank, as well as the evaluation of its future sustainability, is the cornerstone of proactive banking supervision. To distinguish between strong and weak banks, supervisory authorities make use of early warning expert systems and/or statistical modeling techniques. The outcome of this analysis can drive the imposition of targeted regulatory measures. These measures can take the form of preemptive corrective actions that address the vulnerabilities of weaker banks and, as a result, increase their resilience on a going concern basis. On the other hand, in specific cases of likely-to-fail banks, whose return to viability is considered rather improbable, the analysis will provide the necessary evidence for the supervisor to take action from a gone concern perspective. Essentially, supervisory actions serve to retain depositors' confidence in the financial system by ensuring the soundness of individual banks or, should this be necessary, the orderly resolution of failing banks, in order to avoid any domino effect that could even trigger a systemic financial crisis.
In recent decades, various statistical methodologies have been used to aggregate bank-specific information into a single figure in order to distinguish between solvent and insolvent financial institutions. These classification methods range from simple Discriminant Analysis (Altman 1968 and Cox 2014) and Logit/Probit regressions (Ohlson 1980, Cole and Wu 2014) to advanced machine learning techniques, conditional inference trees and Neural Networks (Messai & Gallali 2015). At the same time, other novel modeling approaches such as Random Forests (RF) (Breiman 2000) have not been employed up to now in the problem of assessing bank failures, even though these models have become very popular for classification problems in recent years.
In this work we employ a series of innovative techniques for predicting bank insolvencies, namely Random Forests, Support Vector Machines, Neural Networks and Random Forests of Conditional Inference Trees, and we benchmark their results against widely employed techniques such as Logistic Regression and Linear Discriminant Analysis.

Literature Review
There is an extensive literature on the methods and analyses used to predict bank default. Messai and Gallali (2015), applying discriminant analysis, logistic regression and artificial intelligence methods, and Cole and Wu (2014), who focused on time-varying hazard models and probit models, supported the view that CAMELS risk ratios are the most relevant and significant factors in predicting a bank default. The former also pointed out that the neural network method performed better than the other models.
Cole and White (2010) examined the defaults of US commercial banks that occurred in 2009 by analyzing supervisory indicators as well as additional portfolio variables, such as real-estate loans and mortgages, which proved to be important early warning indicators. Cox and Wang (2014) also focused on supervisory indicators, while incorporating risk factors that had been overlooked by the literature prior to the US financial crisis of 2007-2009. Mayes and Stremmel (2014) incorporated supervisory indicators and macroeconomic variables in the framework of Logistic Regression and discrete survival time analysis methods. Betz et al. (2013) combined supervisory indicators with country-level data in order to improve the performance of the model in terms of Type I error and out-of-sample validation over different forecast horizons. Poghosyan and Čihák (2009) used supervisory indicators together with other factors related to depositor discipline, contagion effects among banks, the macroeconomic environment, banking market concentration and the financial market. Their results show that indicators related to capitalization, asset quality and profitability can effectively identify weak banks.

Data Collection and Variable Selection
We collected information on non-failed entities, failed entities, and entities that received state assistance from the database of the Federal Deposit Insurance Corporation (FDIC), an independent agency created by the US Congress in order to maintain stability and public confidence in the financial system. The collected information covers all US banks, and the adopted definition of a default event in this dataset includes all bank failures and assistance transactions of all FDIC-insured institutions. Under the proposed framework, each entity is categorized as either solvent or insolvent based on the indicators provided by the FDIC.
The dataset covers the 2008-2014 period, i.e. seven years of quarterly information, resulting in a dataset with more than 175,000 records. The selected time period seems to approximate a full economic cycle in terms of the evolution of the default rate. Figure 1 shows the number of records included in each observation quarter and the corresponding default rate. Default rates were significantly higher in the first half of the sample than in the second half. Specifically, default rates follow an increasing trend in the 2008-2009 period, peaking at 2.5% in the third quarter of 2009, and a decreasing trend thereafter. They seem to have flattened out in 2013 and decreased further during 2014, reaching 0.1% in the fourth quarter of 2014.
The dataset was split into three parts (Figure 2). Our original development sample, which we call "Full in-sample", contains 101,641 observations that can be divided into 100,068 solvent and 1,573 insolvent cases. The imbalanced nature of our dataset, which presents a preponderance of solvent banks (i.e. good cases), does not facilitate the training of complex techniques. To this end, we created a new training sample (called "Short in-sample"), including a randomly chosen 10% of the good cases and all the bad cases. The final training sample used to develop our models thus contains 10,001 good cases and 1,572 bad cases, i.e. 11,573 observations in total. For the purpose of fine-tuning the parameters of the random forest and neural network specifications, we further divide the short in-sample dataset equally into training and validation sub-samples (50% each). In short, the term "Short in-sample" refers to the more balanced dataset, while the term "Full in-sample" refers to the sample that includes all the good cases. The "Out-of-sample" dataset refers to the 20% of randomly selected observations covering the years 2008-2012. Finally, the "Out-of-time sample" refers to the data for the years 2013-2014.
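The construction of the "Short in-sample" dataset can be sketched as follows: keep every insolvent record, draw a random 10% of the solvent ones, and split the result equally into training and validation parts. This is a minimal illustration on toy data, not the authors' code; the record layout, function names and random seed are assumptions.

```python
import random


def build_short_in_sample(records, good_fraction=0.10, seed=0):
    """Keep all insolvent ("bad") records and a random share of solvent ("good") ones."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    bad = [r for r in records if r["insolvent"]]
    good = [r for r in records if not r["insolvent"]]
    kept_good = rng.sample(good, round(good_fraction * len(good)))
    sample = bad + kept_good
    rng.shuffle(sample)  # mix good and bad cases before splitting
    return sample


def split_half(sample):
    """Divide the sample equally into training and validation sub-samples."""
    mid = len(sample) // 2
    return sample[:mid], sample[mid:]


# Toy data (hypothetical): 20 insolvent and 1,000 solvent bank-quarter records.
records = [{"id": i, "insolvent": i < 20} for i in range(1020)]
short_sample = build_short_in_sample(records)
train, valid = split_half(short_sample)
```

Undersampling the majority class raises the share of bad cases seen during training (here from about 2% to about 17%), which is the rebalancing effect the text describes. Performance should still be assessed on samples that keep the original class proportions.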

Results
In terms of accuracy, we focus on the out-of-sample and out-of-time performance of the employed specifications. In the out-of-sample results (Table 1), RFs are the best across almost all performance measures, while logistic regression, which ranks second, also seems to be an adequate tool for assessing bank failure probability. Regarding out-of-time performance, presented in Table 1, Random Forests and Neural Networks again provide the best fit, with the former exhibiting marginally better performance in 5 criteria and better performance in 1 criterion relative to the latter. Logistic regression performs poorly in the out-of-time period, as it shows the worst performance in 6 out of 8 criteria. In summary, the proposed RF rating system exhibits higher discriminatory power than the considered benchmark models when the skewness of the data is taken into account. More importantly, the performance of RF is more stable and more consistent across all test samples, resulting in lower performance variability.

An innovative forecasting framework for stock market crisis events
The analysis of interdependence and contagion in financial markets is a challenging topic for supervisory authorities. This is especially true in times of financial turmoil, as investors and policy makers have a strong interest in knowing whether and how a crisis propagates between markets and countries. Our approach provides a forecasting mechanism for the probability of a stock market crash event over various time frames. It combines different machine learning algorithms in modeling data from 39 countries that cover a large spectrum of economies. More precisely, we leverage the merits of a series of techniques including Classification Trees, Support Vector Machines, Random Forests, Neural Networks, Extreme Gradient Boosting, and Deep Neural Networks.

Literature Review
The use of Machine Learning techniques in the development of early warning systems for financial crises is rather limited in the existing literature. Cuneyt et al. (2014) developed three different early warning systems, based on artificial neural networks (ANN), decision trees, and logistic regression, and tested them on the Turkish economy; artificial neural networks yielded the best performance in their analyses. Atsalakis et al. (2016) focused on 1-day stock market forecasting, specifically during stressed periods, and employed a neuro-fuzzy modeling methodology. Oztekin et al. (2016) also focused on the prediction of daily stock prices; their work deployed and integrated adaptive neuro-fuzzy inference systems, artificial neural networks, and support vector machines. Dopke et al. (2017) implemented boosted regression trees for predicting recessions. Finally, Dabrowski et al. (2016) investigated dynamic Bayesian network models and showed that they can provide significantly more precise early warnings than logistic regression.

Data collection and processing
A crisis "event" for each country was identified when the daily return of the stock index was below the first percentile of the empirical distribution of returns. The initial empirical distribution of returns was calculated from the stock index returns of the first 200 observations, covering the period 10/01/1996 - 15/10/1996. For each subsequent record, the empirical distribution of returns was re-calculated in order to incorporate the new observation, and an event was identified if the return was below the first percentile of the new empirical distribution. Thus, for the latest observation in the sample (i.e. 15/12/2017), the empirical distribution of returns was based on the 10/01/1996 - 14/12/2017 period. Figure 4 illustrates the number of countries exhibiting an extreme stock market fall during the selected 22-year period. We observe at least 10 events with global impact, the most severe being the global economic crisis of 2008. The events are identified based on the number of stock market exceedances (returns below the first percentile of the empirical distribution) across the 39 countries in the sample (co-exceedances). More recently, i.e. from 2011 to 2017, the global market seems more stable, as less variability is observed, even though some events are still present.
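The expanding-window rule described above can be sketched for a single stock index as follows. This is an illustrative reading of the text rather than the authors' code. The function name is hypothetical, the percentile is computed with the default interpolation of Python's statistics.quantiles, and the sketch assumes that the first 200 observations only seed the distribution and are not themselves tested for events.

```python
import random
import statistics


def flag_crisis_days(returns, initial_window=200):
    """Flag day t when its return is below the first percentile of the
    returns observed up to and including day t (expanding window)."""
    events = [False] * len(returns)
    for t in range(initial_window, len(returns)):
        history = returns[: t + 1]  # re-estimated with the new observation
        first_percentile = statistics.quantiles(history, n=100)[0]
        events[t] = returns[t] < first_percentile
    return events


# Toy series (hypothetical): small Gaussian daily returns with one sharp fall.
rng = random.Random(1)
toy_returns = [rng.gauss(0.0, 0.01) for _ in range(400)]
toy_returns[300] = -0.20  # a 20% one-day drop
toy_events = flag_crisis_days(toy_returns)
```

Because the threshold is re-estimated every day, an observation is judged only against information available at that date, which avoids look-ahead bias in the event definition.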
Having identified the "events" occurring in each of the 39 countries in our sample on the basis of the daily movement of the corresponding stock indices, we proceeded to aggregate events at a regional level, i.e. America, Asia, Europe, and Global. The purpose of these binary variables is to capture the "significant events" within a region, i.e. events that had a collective impact on many stock markets in the region. The selected thresholds were determined in relation to the number of stock markets in each region. Specifically, we postulated the following thresholds:
• America: At least 3 country-specific events per day.
• Asia: At least 6 country-specific events per day.
• Europe: At least 8 country-specific events per day.
• Global: At least 2 regions are in crisis mode on a given day.
Based on the above outcomes, we created two classes of predictive binary variables. The first one measures whether there is a significant event on the next working day (Glob 1), both on a regional basis (America, Asia, Europe) and globally; the second one measures whether there is a significant event during the next 20 working days (Glob 20). To fit the developed machine learning models, we use the binary variables for a global crash as our dependent variables, whereas the binary target variables pertaining to each region (America, Asia and Europe) serve as lagged independent predictors.
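The regional aggregation and the construction of the two targets can be sketched as below. The thresholds are those listed above; everything else (function names, the input layout, and the treatment of the last days of the sample, whose forward window is incomplete) is an assumption made for illustration.

```python
# Thresholds from the text: minimum number of country-level events per day.
REGION_THRESHOLDS = {"America": 3, "Asia": 6, "Europe": 8}


def regional_flags(daily_counts):
    """daily_counts: one dict per day mapping region -> number of country events."""
    return [
        {region: day[region] >= k for region, k in REGION_THRESHOLDS.items()}
        for day in daily_counts
    ]


def global_flags(regional, min_regions=2):
    """A global event requires at least `min_regions` regions in crisis that day."""
    return [sum(day.values()) >= min_regions for day in regional]


def forward_target(flags, horizon):
    """1 if any event occurs within the next `horizon` days, excluding today.

    Near the end of the sample the window is shorter than `horizon`
    (assumption: such days are kept here; in practice they may be dropped).
    """
    return [int(any(flags[t + 1 : t + 1 + horizon])) for t in range(len(flags))]


# Toy input (hypothetical counts of country-level events per day).
toy_days = [
    {"America": 0, "Asia": 1, "Europe": 2},  # calm day
    {"America": 3, "Asia": 6, "Europe": 1},  # two regions in crisis
    {"America": 4, "Asia": 0, "Europe": 0},  # one region only
    {"America": 0, "Asia": 0, "Europe": 0},
    {"America": 3, "Asia": 7, "Europe": 9},  # three regions in crisis
]
toy_global = global_flags(regional_flags(toy_days))
glob_1 = forward_target(toy_global, horizon=1)
glob_20 = forward_target(toy_global, horizon=20)
```

In this toy example the 20-day target is 1 on most days while the 1-day target is 1 only on the days immediately preceding a global event, which illustrates why the two horizons present different class balances.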
The explanatory covariates employed in the study encompass information from stock, bond, currency, and commodity markets, along with credit spreads and volatility indices. The covariates and their respective transformations are shown in detail in Part II of the Appendix. This variable generation process led to a set of almost 2700 potential predictors to be tested in the developed machine learning models.

Variable selection
The initially constructed dataset comprises an enormous number of independent variables, which is clearly disproportionate to the size of the dataset, as we are dealing with around 2700 variables over around 5400 days. Fitting a machine learning model to such a large number of independent variables (relative to the size of the dataset) is bound to suffer from the so-called curse of dimensionality. That is, the fitted classifier may seem to yield very good performance on the training dataset but generalize very poorly, yielding a catastrophically low performance on the test data. Thus, to ensure good performance of our model, we need to implement a robust independent variable (feature) selection stage, so as to limit the number of features used to those that are strictly necessary. Apart from increasing the generalization capabilities of the fitted models, such a reduction is also important for increasing the computational efficiency of the explored machine learning algorithms.
Figure 5 provides an overview of the adopted feature selection procedure. It comprises three phases. In the first phase, we employ three methodologies that independently assign importance to the available features: Boruta, LASSO, and a qualitative criteria-driven filter method. In the second phase, a balanced score is produced for each variable. In the third phase, we impose a heuristically determined cut-off score and discard all features that do not reach this score. In this way, a total of 131 explanatory variables are eventually retained.
The Boruta algorithm is based on a postulated Random Forest model. Based on the inferences of this Random Forest, features are removed from the training set, and model training is performed again. Boruta infers the importance of each independent variable (feature) in the obtained predictive outcomes by creating shadow features, i.e. randomly shuffled copies of the original features, and comparing the importance of each original feature against that of the shadow features.
On the other hand, LASSO is a regression model that penalizes the absolute size of the model coefficients (an L1 penalty) in its objective function, shrinking some of them exactly to zero as a means of excluding irrelevant variables from the model. One of the most important features of LASSO is its ability to cope with high numbers of independent variables (features) relative to the available training observations, which is pertinent in the context of our study. We performed the LASSO analysis using the GLMNET package in R, which offers a very fast way to select the best model using both cross-validation and the Bayesian Information Criterion (BIC).
Finally, the employed qualitative criteria-driven method consists of evaluating the individual correlation of each feature with the dependent variables. The selections produced by the above methods are combined by applying weighted average scoring. Specifically, each selection method assigns each variable a value of 1 or 0, depending on whether the variable is selected by that method or not. Each of these values is then multiplied by the weight assigned to the corresponding selection method. In our study, Boruta outputs are assigned a slightly higher weight due to the method's extensive analysis of the features in the dataset. The final score obtained for each explanatory variable lies in the interval from 0 to 4. To obtain the final selection, a cut-off score of 3 is applied, yielding a narrowed group of 131 candidate variables.
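The weighted scoring step can be sketched as follows. The text does not report the actual weights, so the values below are purely hypothetical: they were chosen only to satisfy the stated properties (Boruta weighted slightly higher, scores ranging from 0 to 4, cut-off of 3). The feature names are likewise placeholders.

```python
# Hypothetical weights: Boruta slightly higher, total equal to 4.
WEIGHTS = {"boruta": 1.5, "lasso": 1.25, "correlation": 1.25}
CUT_OFF = 3.0


def combined_score(votes):
    """votes: dict mapping method -> 1 if the method selected the feature, else 0."""
    return sum(WEIGHTS[method] * votes[method] for method in WEIGHTS)


def select_features(all_votes, cut_off=CUT_OFF):
    """Retain the features whose weighted score reaches the cut-off."""
    return [
        name
        for name, votes in all_votes.items()
        if combined_score(votes) >= cut_off
    ]


# Placeholder features and selection outcomes.
toy_votes = {
    "feature_a": {"boruta": 1, "lasso": 1, "correlation": 1},  # score 4.0
    "feature_b": {"boruta": 1, "lasso": 1, "correlation": 0},  # score 2.75
    "feature_c": {"boruta": 0, "lasso": 1, "correlation": 1},  # score 2.5
}
selected = select_features(toy_votes)
```

With these assumed weights only features chosen by all three methods reach the cut-off; a different weighting (for example 2, 1 and 1) would also retain features chosen by Boruta and one other method, so the outcome depends on weights that the text does not specify.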

Results
We evaluate the predictive performance of the developed methods in the dataset covering the years 2011-2017; in the following, we refer to this part of the dataset as the "Out-of-time" sample.
As we observe in Table 3, the MXNET algorithm provides the best empirical performance at both horizons (1 day and 20 days). It is followed by the XGBoost methodology in the case of the 20-day horizon, and by the Neural Network in the case of the 1-day horizon. Hence, MXNET deep neural networks offer superior predictive accuracy on the test sample in both the 1-day and the 20-day forecasting setup. Another remark is that, by moving from simple neural networks to deep networks, we are able to infer richer and subtler dynamics from the data, thus increasing our capacity to model nonlinearities and cross-correlations among financial market variables.
Summarizing the results across all metrics in the test sample, the MXNET system exhibits higher discriminatory power than all the considered benchmark models when the skewness of the data is taken into account. At this point, it is important to stress that an unanticipated crash in the global stock markets may come at a much higher cost for the economy than a false alarm. Hence, it is crucial for supervisory purposes to achieve the maximum possible accuracy in predicting imminent crises via an Early Warning System for economic and financial crises.
Table 2: Validation Measures - Dependent variables concern a stock market crisis occurring on a 20-day horizon (Glob 20) and a 1-day horizon (Glob 1). AUROC (Area Under the ROC Curve), G-mean (Geometric Mean), LP (positive likelihood ratio), LR- (negative likelihood ratio), DP (Discriminant Power), Youden's index, BA (Balanced Accuracy)

Further, we present in Figures 6 and 7 the ROC curves corresponding to the models analyzed. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the modeling approach. The ROC curve of the deep neural networks lies above those of all the considered competitors for both dependent variables (pertaining to the 1-day and 20-day horizons). Hence, we obtain further strong evidence supporting the efficacy and generalization capacity of the proposed deep learning system.

Conclusions - Regulatory Implications
Our empirical results indicate that innovative statistical techniques, i.e. Deep Learning and Machine Learning methodologies, have significant predictive power. These models can be used by micro-prudential supervisors as a complement to their existing tools, notably the Supervisory Review and Evaluation Process (SREP). Macro-prudential supervisors could also benefit from using these models to predict stock market crises, while also taking into consideration that a systemic crisis may be triggered by the collapse of individual banks.