
      Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks

      Preprint

          Abstract

          Classification tasks are usually analysed and improved through new model architectures or hyperparameter optimisation, but the underlying properties of datasets are discovered only on an ad hoc basis as errors occur. However, understanding the properties of the data is crucial in perfecting models. In this paper we analyse exactly which characteristics of a dataset best determine how difficult that dataset is for the task of text classification. We then propose an intuitive measure of difficulty for text classification datasets which is simple and fast to calculate. We show that this measure generalises to unseen data by comparing it to state-of-the-art datasets and results. This measure can be used to analyse the precise source of errors in a dataset and allows fast estimation of how difficult a dataset is to learn. We searched for this measure by training 12 classical and neural-network-based models on 78 real-world datasets, then used a genetic algorithm to discover the best measure of difficulty. Our difficulty-calculating code ( https://github.com/Wluper/edm ) and datasets ( http://data.wluper.com ) are publicly available.
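The abstract does not state the final formula, so as a hedged illustration, here is one dataset characteristic of the kind such a measure could combine: class imbalance, sketched as the total-variation distance between the empirical label distribution and the uniform distribution. The function name and normalisation are assumptions for this sketch, not the paper's definition.

```python
from collections import Counter

def class_imbalance(labels):
    """Divergence of the empirical label distribution from uniform.

    Returns 0.0 for a perfectly balanced dataset; heavily skewed
    datasets approach 1.0 (illustrative proxy only).
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    if k < 2:
        return 0.0  # a single observed class is degenerate, not imbalanced
    # Total variation distance to the uniform distribution over k classes,
    # rescaled so strongly skewed label distributions approach 1.
    dev = sum(abs(c / n - 1 / k) for c in counts.values())
    return dev / (2 * (1 - 1 / k))
```

A balanced two-class dataset scores 0.0, while a 9:1 split scores 0.8, giving a fast, model-free signal about one source of task difficulty.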

          Most cited references

          Curriculum learning


            Pearson Correlation Coefficient


              SMOTE: Synthetic Minority Over-sampling Technique

              An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
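The interpolation step described above can be sketched in a few lines. This is a simplified illustration of the idea, not the reference implementation: the function name, the fixed neighbour count, and the use of small numeric tuples are assumptions of this sketch.

```python
import math
import random

def smote_like(minority, k=2, n_new=4, seed=0):
    """Create synthetic minority-class points by interpolating between a
    real point and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class itself.
        neighbours = sorted(
            (p for p in minority if p != x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Each synthetic point lies on the line segment between two real minority examples, so the minority region of feature space is filled in rather than merely duplicated, which is what distinguishes this over-sampling from replication.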

                Author and article information

                Published: 05 November 2018
                Article ID: 1811.01910 (arXiv identifier)
                Record ID: b07be42b-9532-410d-b1e8-06b805d64ade
                License: http://creativecommons.org/licenses/by/4.0/
                Published version: ACL, CoNLL (K18-1037), 22, 380–391 (2018)
                Comments: 27 pages, 6 tables, 3 figures (submitted for publication in June 2018), CoNLL 2018
                Subject categories: cs.CL, cs.AI, cs.LG, cs.NE
                Keywords: Theoretical computer science, Neural & evolutionary computing, Artificial intelligence
