      Identifying and evaluating clinical subtypes of Alzheimer’s disease in care electronic health records using unsupervised machine learning

      research-article

          Abstract

          Background

          Alzheimer’s disease (AD) is a highly heterogeneous disease with diverse trajectories and outcomes observed in clinical populations. Understanding this heterogeneity can enable better treatment, prognosis and disease management. Studies to date have mainly used imaging or cognition data and have been limited in data breadth and sample size. Here we use electronic health records (EHR) to examine the clinical heterogeneity of AD patients, applying multiple clustering methods to identify and characterise disease subgroups that are clinically actionable.

          Methods

          We identified AD patients in primary care EHR from the Clinical Practice Research Datalink (CPRD) using a previously validated rule-based phenotyping algorithm. We extracted and included a range of comorbidities, symptoms and demographic features as patient features. We evaluated four different clustering methods (k-means, kernel k-means, affinity propagation and latent class analysis) to cluster Alzheimer’s disease patients. We compared clusters on clinically relevant outcomes and evaluated each method using measures of cluster structure, stability, efficiency of outcome prediction and replicability in external data sets.
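To make the "cluster structure" part of this evaluation concrete, the following is a minimal sketch of k-means scored with the mean silhouette width, written in plain Python. It is an illustration under stated assumptions, not the authors' pipeline: the six patients and the four binary feature names are hypothetical, the deterministic farthest-first seeding is a simplification chosen for reproducibility, and the helper names (`farthest_first`, `kmeans`, `mean_silhouette`) are invented for this example.

```python
from math import dist


def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from every centroid chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette width; singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical binary features per patient:
# (depression, anxiety, diabetes, hypertension)
patients = [
    (1, 1, 0, 0), (1, 1, 0, 1), (1, 0, 0, 0),
    (0, 0, 1, 1), (0, 0, 1, 0), (0, 1, 1, 1),
]
labels = kmeans(patients, k=2)
score = mean_silhouette(patients, labels)
```

In practice the same score could be computed for several candidate values of k and compared, which is one common way to choose the number of clusters; the other evaluation measures named above (stability, outcome prediction, external replication) would need separate procedures.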

          Results

          We identified 7,913 AD patients with a mean age of 82, of whom 66.2% were female. We included 21 features in our analysis. We observed 5, 2, 5 and 6 clusters in k-means, kernel k-means, affinity propagation and latent class analysis respectively. K-means produced the most consistent results across the four evaluative measures. One cluster appeared consistently in three of the four methods: it was composed predominantly of female patients with younger disease onset (43% between ages 42–73) who were diagnosed with depression and anxiety and who progressed more quickly than the average across other clusters.

          Conclusion

          Each clustering approach produced substantially different clusters, and k-means performed best of the four methods on the four evaluative criteria. However, the consistent appearance of one particular cluster across three of the four methods suggests that a distinct disease subtype may be present and merits further exploration. Our study underlines the variability of the results obtained from different clustering approaches and the importance of systematically evaluating different approaches for identifying disease subtypes in complex EHR.
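One common way to quantify how much two clustering approaches agree on the same patients is the adjusted Rand index (ARI). The abstract does not state which agreement measure, if any, the authors used, so the sketch below is only an illustration: the function name and the two label lists are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items.
    1.0 means identical partitions (up to relabelling); values near 0 mean
    agreement no better than chance."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Pairs of items placed together by both partitions, by A, and by B.
    pairs_both = sum(
        comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values()
    )
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Hypothetical cluster assignments for the same eight patients
# from two different methods.
method_one = [0, 0, 0, 1, 1, 1, 2, 2]
method_two = [1, 1, 1, 0, 0, 2, 2, 2]
agreement = adjusted_rand_index(method_one, method_two)
```

Because the ARI ignores the label values themselves, it can compare methods that return different numbers of clusters, as the four methods here did.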

          Supplementary Information

          The online version contains supplementary material available at 10.1186/s12911-021-01693-6.

                Author and article information

                Contributors
                nonie.alexander.16@ucl.ac.uk
                Journal
                BMC Medical Informatics and Decision Making (BMC Med Inform Decis Mak)
                BioMed Central (London)
                ISSN 1472-6947
                8 December 2021
                2021; 21: 343
                Affiliations
                [1] Institute of Health Informatics, University College London, London, UK (GRID grid.83440.3b; ISNI 0000000121901201)
                [2] Health Data Research UK, London, UK (GRID grid.507332.0)
                [3] Centre for Medical Image Computing, Department of Computer Science, University College London, London, UK (GRID grid.83440.3b; ISNI 0000000121901201)
                [4] UCL Institute of Neurology, University College London, London, UK (GRID grid.83440.3b; ISNI 0000000121901201)
                [5] Alan Turing Institute, London, UK (GRID grid.499548.d; ISNI 0000 0004 5903 3632)
                [6] Department of Radiology and Nuclear Medicine, Amsterdam University Medical Centers, Amsterdam, The Netherlands (GRID grid.509540.d; ISNI 0000 0004 6880 3010)
                Article
                1693
                10.1186/s12911-021-01693-6
                8653614
                34879829
                49794c97-74aa-4d41-9fc9-de012d51b5aa
                © The Author(s) 2021

                Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

                History
                Received: 22 June 2021
                Accepted: 15 November 2021
                Funding
                Funded by: Medical Research Council (FundRef http://dx.doi.org/10.13039/501100000265); Award ID: MR/R502248/1
                Funded by: Engineering and Physical Sciences Research Council (FundRef http://dx.doi.org/10.13039/501100000266); Award ID: EP/M020533/1
                Funded by: brc
                Funded by: Health Data Research UK
                Categories
                Research
                Bioinformatics & Computational biology
                Keywords: clustering, EHR, Alzheimer's disease, subtyping, k-means
