Korean Finance Association

Predicting Default through Machine Learning-Based Analysis of Corporate Credit Information

Minchan Song, Manager, NICE D&B
Doojin Ryu, Professor, Department of Economics, Sungkyunkwan University

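This study predicts corporate default by analyzing the corporate credit information provided by the Credit Information Sample DB Remote Analysis System. We partition the analysis targets by business-entity type and design experiments according to how the corporate credit information in the sample DB is used. We also group a range of machine learning techniques by how they estimate parameters, into parametric, non-parametric, and semi-parametric methodologies, and compare their predictive performance. Using data processed by loan and delinquency type improved the performance of each machine learning model relative to using the raw sample DB data, whereas adding corporate borrower characteristics and technology credit rating information did not improve model performance. In every segment, the deep neural network, a semi-parametric method, performed best, while non-tree-based non-parametric methods showed low recall and were therefore unsuited to default prediction. Classification performance improved when semi-parametric methods were used in place of the parametric methods conventionally used in practice. This is the first study to attempt corporate distress prediction using a sample DB constructed from actual corporate credit information, and it offers direction on data use and model building for lending institutions and credit bureaus that work with corporate credit information.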

Keywords: Machine Learning, Distress Risk, Credit Transaction Information, Default Prediction, Sample DB Remote Analysis System

Predicting Loan Delinquency by Analyzing Sample DB with Machine Learning

Minchan Song
Doojin Ryu

This paper investigates how well corporate default can be predicted using loan-sample data from the Korea Credit Information Service's financial big data open system (CreDB). Corporate lending increases a financial institution's credit exposure. Because the measured credit risk feeds into the pricing and structure of loan products, it is an essential determinant of the institution's profit structure. From a risk-management perspective, predicting delinquency from loan data is necessary for 5,000 Korean financial institutions. Several earlier studies forecast bankruptcy for listed companies that disclose financial and stock price information. This study increases practical utility by extending the analysis to individual entrepreneurs and small and medium-sized enterprises (SMEs). It also presents representative big data results by using the loan, delinquency, and technology credit information of approximately 1.1 million corporations, about 20% of the almost 5.6 million domestic sole proprietors and non-listed corporations. The loan data include ten monthly loan type codes and eleven overdue reason codes. Prediction targets are separated into individual and corporate entrepreneurs, and the analyses are further divided by whether the processed dataset is used. For efficient analysis, the data dimension was reduced by changing the table structure through nested iterative operations while expanding the variable composition from a table consisting of N rows to one column. To reflect the characteristics of the data as fully as possible, exploratory data analysis and feature engineering were performed to process the data. The nine classification models trained in this study are divided into four groups according to how they estimate parameters.
Group 1 consists of logistic regression and linear discriminant analysis, which are parametric methods. Group 2 consists of several non-parametric algorithms that rely on distance calculations for model learning. Group 3 consists of tree-based algorithms, which are also non-parametric methods. Group 4 consists of the semi-parametric method, a deep neural network. However, of the 438,697 corporations in total, only 810 defaulted, accounting for only 0.2% of the prediction targets, so the target distribution is severely imbalanced. For this reason, the imbalanced data were undersampled before model fitting. Bias in the sampled training and validation data is minimized by performing K-fold cross-validation with K=5. The results indicate a significant effect on classification performance when the processed data are used, but no significant effect when the loan owner's characteristics are included. Moreover, tech-credit rating (TCB) information has no meaningful effect regardless of the type of corporation. Classification with the deep neural network (DNN), a semi-parametric method, achieves the best binary classification performance. Non-parametric, non-tree-based models are not appropriate for analyzing loan data. The DNN showed the highest classification performance across all analyses and entrepreneur classifications performed in this study. The neural network used in this study consists of 14 hidden layers. Following the baseline neural network design, the sigmoid function was applied to the activation function's initial value, the ReLU function was applied to the hidden layers, and optimization was performed with the Adam optimizer.
In particular, because the analysis uses credit transaction information based on the credit information of all financial institutions in Korea, it may alleviate the information asymmetry that individual credit institutions face regarding their risk-management targets. In addition, the parametric methodologies used in classical studies and most widely in practice showed lower average classification performance across the major segments than the semi-parametric methodologies, with a performance gap of up to 16 percent. This paper suggests a direction for using loan-sample data and serves as foundational research for financial institutions that use loan data for credit risk management. Further research on semi-parametric methodologies for corporate credit information analysis is needed.

Keywords: Corporate Loan Data, Distress Risk, Machine Learning, Predicting Loan Delinquency, Sample DB Remote Analysis System
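
The abstract mentions undersampling the imbalanced data and then applying K-fold cross-validation with K=5, without giving the exact procedure. The following is a minimal sketch of one way such a step could look, using only the Python standard library. The 1:1 class ratio, the purely random selection of non-default cases, the stratified fold assignment, and the function names `undersample` and `stratified_kfold` are all assumptions made for illustration and are not taken from the paper. Only the counts (438,697 corporations, 810 defaults) come from the abstract.

```python
import random
from collections import Counter


def undersample(labels, ratio=1.0, seed=0):
    """Keep every default (label 1) and a random subset of non-defaults (label 0).

    `ratio` is the number of non-defaults kept per default; the 1:1 default
    is an assumption, not a value reported in the abstract.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(round(len(pos) * ratio)))
    return sorted(pos + rng.sample(neg, n_keep))


def stratified_kfold(indices, labels, k=5, seed=0):
    """Yield (train, validation) index lists with classes spread evenly over k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i in indices if labels[i] == cls]
        rng.shuffle(members)
        # Deal the shuffled members of each class round-robin into the folds.
        for j, idx in enumerate(members):
            folds[j % k].append(idx)
    for j in range(k):
        val = sorted(folds[j])
        train = sorted(i for m, fold in enumerate(folds) if m != j for i in fold)
        yield train, val


# Toy label vector built from the counts quoted in the abstract.
N_TOTAL, N_DEFAULT = 438_697, 810
labels = [1] * N_DEFAULT + [0] * (N_TOTAL - N_DEFAULT)

balanced = undersample(labels)
splits = list(stratified_kfold(balanced, labels, k=5))
class_counts = Counter(labels[i] for i in balanced)
```

Under these assumptions each of the five validation folds holds an equal share of defaults and non-defaults, so a model's recall on the default class can be averaged over folds without any fold being empty of positives.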
