Inviting an author to review:
Find an author and click ‘Invite to review selected article’ near their name.
Search for authorsSearch for similar articles
2
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      DocSCAN: Unsupervised Text Classification via Learning from Neighbors

      Preprint
      ,

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN). For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels. Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels. On five topic classification benchmarks, we improve on various unsupervised baselines by a large margin. In datasets with relatively few and balanced outcome classes, DocSCAN approaches the performance of supervised classification. The method fails for other types of classification, such as sentiment analysis, pointing to important conceptual and practical differences between classifying images and texts.

          Related collections

          Author and article information

          Journal
          09 May 2021
          Article
          2105.04024
          1e10b8ed-ce52-4eeb-a736-d9798ff046e3

          http://creativecommons.org/licenses/by/4.0/

          History
          Custom metadata
          cs.CL

          Theoretical computer science
          Theoretical computer science

          Comments

          Comment on this article