Yukako Toko (ytoko@nstac.go.jp)
National Statistics Center, Japan
Shinya Iijima (siijima@nstac.go.jp)
National Statistics Center, Japan
Mika Sato-Ilic (msato@nstac.go.jp)
National Statistics Center, Japan / University of Tsukuba
Abstract
In our previous study, we developed a supervised multiclass classifier for autocoding. The classifier assigns classification codes based on reliability scores. The purpose of this study is to generalize the reliability score in order to achieve more accurate classification.
The previously defined reliability score considers both the uncertainty from data (probability measure) and the uncertainty from the latent classification structure in data (fuzzy measure), which improves the accuracy of our method. However, this reliability score still has two problems. First, it has not been generalized. When applying our classifier in practice, we must account for the variability of the observed data, so the reliability score needs to be generalized. Second, it does not consider the frequency of each object (or feature), that is, the sum of frequencies over the codes for that object (or feature) in the training dataset. As a result, infrequent objects (or features) can have a significant influence, which sometimes leads the classifier to misclassify data.
To overcome these problems, we propose a generalized reliability score that uses the idea of the T-norm in statistical metric space and considers the frequency of each object (or feature) over the codes in the data. In addition, we investigate the robustness of the proposed classifier based on the generalized reliability score to show that its classification accuracy can be guaranteed.
The proposed algorithm is implemented in R to improve its versatility.
Keywords: Coding, Machine learning, Overlapping classification, Generalization of reliability score
JEL Classification: C38