      Is Open Access

      Quality indices for topic model selection and evaluation: a literature review and case study

      research-article


          Abstract

          Background

          Topic models are a class of unsupervised machine learning models that facilitate summarization, browsing and retrieval of large unstructured document collections. This study reviews several methods for assessing the quality of unsupervised topic models estimated using non-negative matrix factorization. Techniques for topic model validation have been developed across disparate fields. We synthesize this literature, discuss the advantages and disadvantages of different techniques for topic model validation, and illustrate their usefulness for guiding model selection on a large clinical text corpus.

          Design, setting and data

          Using a retrospective cohort design, we curated a text corpus containing 382,666 clinical notes collected between 01/01/2017 and 12/31/2020 from primary care electronic medical records in Toronto, Canada.

          Methods

          Several topic model quality metrics have been proposed to assess different aspects of model fit. We explored the following metrics: reconstruction error, topic coherence, rank biased overlap, Kendall’s weighted tau, partition coefficient, partition entropy and the Xie-Beni statistic. Depending on context, cross-validation and/or bootstrap stability analysis were used to estimate these metrics on our corpus.
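          To make three of the listed metrics concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a factorization X ≈ WH, where W holds non-negative document-topic weights and H holds topic-term weights, and it assumes that partition coefficient and partition entropy are computed on row-normalized W, treated as soft topic memberships. All function names are hypothetical.

```python
import math


def reconstruction_error(X, W, H):
    """Frobenius norm of X - WH, with matrices given as lists of rows."""
    squared = 0.0
    for i, x_row in enumerate(X):
        for j, x in enumerate(x_row):
            approx = sum(W[i][k] * H[k][j] for k in range(len(H)))
            squared += (x - approx) ** 2
    return math.sqrt(squared)


def normalize_rows(weights):
    """Rescale each document's topic weights to sum to 1.

    Assumption: an all-zero row is mapped to uniform memberships.
    """
    normalized = []
    for row in weights:
        total = sum(row)
        if total > 0:
            normalized.append([w / total for w in row])
        else:
            normalized.append([1.0 / len(row)] * len(row))
    return normalized


def partition_coefficient(memberships):
    """Mean sum of squared memberships: 1/K when diffuse, 1 when crisp."""
    return sum(sum(u * u for u in row) for row in memberships) / len(memberships)


def partition_entropy(memberships):
    """Mean Shannon entropy of memberships: 0 when crisp, log(K) when diffuse."""
    total = 0.0
    for row in memberships:
        total -= sum(u * math.log(u) for u in row if u > 0)
    return total / len(memberships)


if __name__ == "__main__":
    W = [[2.0, 0.0], [1.0, 1.0]]   # toy document-topic weights
    H = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # toy topic-term weights
    X = [[2.0, 0.0, 2.0], [1.0, 1.0, 1.0]]  # equals WH exactly
    U = normalize_rows(W)
    print(reconstruction_error(X, W, H))
    print(partition_coefficient(U), partition_entropy(U))
```

          In this sketch a higher partition coefficient and a lower partition entropy both indicate that documents load on few topics. Reconstruction error is shown in-sample here; the cross-validated variant described above would be evaluated on held-out data.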

          Results

          Cross-validated reconstruction error favored large topic models (K ≥ 100 topics) on our corpus. Stability analysis using topic coherence and the Xie-Beni statistic also favored large models (K = 100 topics). Rank biased overlap and Kendall’s weighted tau favored small models (K = 5 topics). Few model evaluation metrics suggested mid-sized topic models (25 ≤ K ≤ 75) as being optimal. However, human judgement suggested that mid-sized topic models produced expressive low-dimensional summarizations of the corpus.
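          Rank biased overlap, one of the metrics named above, compares two ranked lists while weighting agreement near the top more heavily. The sketch below is an illustration under stated assumptions, not the authors' code: it uses a truncated form of the measure, renormalized so that identical lists score 1, and it assumes each list contains no duplicates. The function name, the persistence parameter `p`, and the example word lists are hypothetical.

```python
def rank_biased_overlap(ranking_a, ranking_b, p=0.9):
    """Truncated, renormalized rank biased overlap of two ranked lists.

    At each depth d the agreement is the share of items common to both
    top-d prefixes; depth d is weighted by p**(d - 1), with 0 < p < 1.
    Dividing by (1 - p**depth) makes identical lists score exactly 1.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    depth = min(len(ranking_a), len(ranking_b))
    if depth == 0:
        return 0.0
    seen_a, seen_b = set(), set()
    weighted = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        weighted += p ** (d - 1) * agreement
    return (1.0 - p) * weighted / (1.0 - p ** depth)


if __name__ == "__main__":
    # Hypothetical top words for one topic in two bootstrap replicates.
    replicate_1 = ["pain", "knee", "injury", "xray", "physio"]
    replicate_2 = ["knee", "pain", "injury", "physio", "swelling"]
    print(rank_biased_overlap(replicate_1, replicate_2))
```

          In a stability analysis, a score like this could be averaged over matched topic pairs from repeated model fits, so values near 1 would indicate that the top-ranked words of each topic are reproduced across fits.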

          Conclusions

          Topic model quality indices are transparent quantitative tools for guiding model selection and evaluation. Our empirical illustration demonstrated that different topic model quality indices favor models of different complexity and may not select models that align with human judgment. This suggests that different metrics capture different aspects of model goodness of fit. A combination of topic model quality indices, coupled with human validation, may be useful in appraising unsupervised topic models.

          Supplementary Information

          The online version contains supplementary material available at 10.1186/s12911-023-02216-1.


                Author and article information

                Contributors
                christopher.meaney@utoronto.ca
                Journal
                BMC Med Inform Decis Mak
                BMC Medical Informatics and Decision Making
                BioMed Central (London )
                1472-6947
                22 July 2023
                2023
                Volume: 23
                Article number: 132
                Affiliations
                [1] Department of Family and Community Medicine, University of Toronto, 500 University Ave, Toronto, ON M5G1V7, Canada (GRID grid.17063.33; ISNI 0000 0001 2157 2938)
                [2] Institute of Health Policy, Management and Evaluation, ICES, University of Toronto, Toronto, Canada (GRID grid.17063.33; ISNI 0000 0001 2157 2938)
                [3] Dalla Lana School of Public Health, University of Toronto, Toronto, Canada (GRID grid.17063.33; ISNI 0000 0001 2157 2938)
                Author information
                http://orcid.org/0000-0002-5429-5233
                Article
                2216
                10.1186/s12911-023-02216-1
                10362613
                © The Author(s) 2023

                Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

                History
                Received: 31 May 2022
                Accepted: 22 June 2023
                Funding
                Funded by: CIHR
                Award ID: FDN 143303
                Categories
                Research Article
                Custom metadata
                © BioMed Central Ltd., part of Springer Nature 2023

                Bioinformatics & Computational biology
                non-negative matrix factorization, topic model, internal validation, cross-validation, stability analysis, clinical text data, electronic medical record
