      Leveraging social media data to study disease and treatment characteristics of Hodgkin’s lymphoma using natural language processing methods

          Abstract

          Background

          The use of social media platforms in health research is increasing, yet their application in studying rare diseases is limited. Hodgkin’s lymphoma (HL) is a rare malignancy with a high incidence in young adults. This study evaluates the feasibility of using social media data to study the disease and treatment characteristics of HL.

          Methods

          We used the X (formerly Twitter) API v2, accessed through the developer portal, to download posts (formerly tweets) published from January 2010 to October 2022. Annotation guidelines were developed from the literature, and a limited number of posts were manually reviewed to identify the classes and attributes (characteristics) of HL discussed on X and to create a gold standard dataset. This dataset was subsequently used to train, test, and validate a Named Entity Recognition (NER) Natural Language Processing (NLP) application.
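As an illustration of the download step described above, the following is a minimal sketch of how request parameters for a v2 full-archive search could be assembled. It only builds the parameters and performs no network call. The endpoint URL reflects the v2 documentation as it stood during the study period; the search query, the requested fields, and the exact day boundaries of the time window are assumptions for illustration, since the abstract does not state them.

```python
from datetime import datetime, timezone

# v2 full-archive search endpoint as documented during the study period.
SEARCH_ALL_URL = "https://api.twitter.com/2/tweets/search/all"


def build_search_params(query, start, end, max_results=100, next_token=None):
    """Build query parameters for one page of a full-archive search.

    `query` is a hypothetical search string; the study's actual query
    is not given in the abstract.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # ISO 8601 timestamps in UTC
    params = {
        "query": query,
        "start_time": start.astimezone(timezone.utc).strftime(fmt),
        "end_time": end.astimezone(timezone.utc).strftime(fmt),
        "max_results": max_results,
        "tweet.fields": "id,text,created_at,lang",
    }
    if next_token is not None:
        # Pagination token returned by the previous page, if any.
        params["next_token"] = next_token
    return params


# Assumed window: start of January 2010 up to (not including) November 2022.
params = build_search_params(
    '"hodgkin lymphoma" lang:en',  # hypothetical query
    datetime(2010, 1, 1, tzinfo=timezone.utc),
    datetime(2022, 11, 1, tzinfo=timezone.utc),
)
```

The resulting dictionary could be passed to any HTTP client together with a bearer token; access tiers and rate limits are outside the scope of this sketch.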

          Results

          After data preparation, 80,811 posts were collected: 500 for annotation guideline development, 2,000 for NLP application development, and the remaining 78,311 for deploying the application. We identified nine classes related to HL, such as HL classification, etiopathology, stages and progression, and treatment. The treatment class and HL stages and progression were the most frequently discussed, with 20,013 (25.56%) posts mentioning HL’s treatments and 17,177 (21.93%) mentioning HL stages and progression. The model exhibited robust performance, achieving 86% accuracy and an 87% F1 score. The etiopathology class demonstrated excellent performance, with 93% accuracy and a 95% F1 score.
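The counts in the Results paragraph can be checked with a few lines of arithmetic. The sketch below uses only the figures reported above; the variable names are illustrative. It suggests that the reported percentages use the 78,311 deployment posts as the denominator rather than all 80,811 collected posts, which is an inference from the numbers and not something the abstract states explicitly.

```python
# Post counts as reported in the Results paragraph.
TOTAL_POSTS = 80_811
GUIDELINE_POSTS = 500        # annotation guideline development
DEVELOPMENT_POSTS = 2_000    # NLP application development
DEPLOYMENT_POSTS = 78_311    # posts the application was deployed on

TREATMENT_POSTS = 20_013     # posts mentioning HL treatments
STAGE_POSTS = 17_177         # posts mentioning HL stages and progression


def share(count, denominator):
    """Return count as a percentage of denominator, rounded to 2 decimals."""
    return round(100 * count / denominator, 2)


# The three subsets should add up to the total number of collected posts.
splits_sum = GUIDELINE_POSTS + DEVELOPMENT_POSTS + DEPLOYMENT_POSTS

# Percentages relative to the deployment set.
treatment_pct = share(TREATMENT_POSTS, DEPLOYMENT_POSTS)
stage_pct = share(STAGE_POSTS, DEPLOYMENT_POSTS)

print(splits_sum == TOTAL_POSTS, treatment_pct, stage_pct)
```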

          Discussion

          The NLP application displayed high efficacy in extracting and characterizing HL-related information from social media posts, as evidenced by the high F1 score. Nonetheless, the data presented limitations in distinguishing between patients, providers, and caregivers and in establishing the temporal relationships between classes and attributes. Further research is necessary to bridge these gaps.
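For readers unfamiliar with the F1 score cited above, it is the harmonic mean of precision and recall. The following minimal sketch computes it from true-positive, false-positive, and false-negative counts. The counts in the example are hypothetical and are not taken from the study, and the abstract does not say whether the authors scored at the token or entity level.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    tp: entities correctly predicted
    fp: entities predicted but absent from the gold standard
    fn: gold-standard entities the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts chosen only to illustrate the calculation.
precision, recall, f1 = precision_recall_f1(tp=87, fp=13, fn=13)
```

Because F1 is a harmonic mean, it drops sharply when either precision or recall is low, which is why it is often reported alongside accuracy for entity extraction tasks.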

          Conclusion

          Our study demonstrated the potential of social media as a valuable preliminary research source for understanding the characteristics of rare diseases such as Hodgkin’s lymphoma.

          Author Summary

          This study explores the potential of using the social media platform X (formerly Twitter) to study Hodgkin’s Lymphoma (HL), a rare cancer prevalent among young adults. By accessing posts from January 2010 to October 2022 through the X API, we collected 80,811 posts to analyze disease-related discussions. We developed a Named Entity Recognition (NER) Natural Language Processing (NLP) tool to categorize posts into various HL-related topics, such as disease classification, progression stages, and treatments. The most commonly discussed topics in the posts were HL treatments and disease progression. The NER tool performed well, with an overall accuracy of 86% and an F1 score of 87%, and with accuracy and F1 scores reaching 93% and 95%, respectively, for the etiopathology class. These results indicate that social media can serve as a valuable platform for gathering preliminary data on rare diseases like HL. However, the study also recognized challenges in differentiating posts by patients, caregivers, or providers and in pinpointing the timing of the discussed events, which indicates where further improvement is needed.

                Author and article information

                Contributors
                Roles: Data curation; Formal analysis; Methodology; Project administration; Validation; Visualization; Writing – original draft
                Roles: Data curation; Formal analysis; Software; Validation
                Roles: Conceptualization; Supervision; Writing – review & editing
                Roles: Conceptualization; Supervision; Writing – review & editing
                Roles: Conceptualization; Supervision; Writing – original draft
                Roles: Conceptualization; Methodology; Software; Supervision; Validation; Writing – review & editing
                Roles: Conceptualization; Data curation; Formal analysis; Methodology; Project administration; Resources; Software; Supervision; Validation; Writing – review & editing
                Role: Editor
                Journal
                PLOS Digital Health (PLOS Digit Health)
                Public Library of Science (San Francisco, CA, USA)
                ISSN: 2767-3170
                Published: 19 March 2025
                Volume: 4
                Issue: 3
                Article: e0000765
                Affiliations
                [1] Department of Pharmaceutical Systems and Policy, School of Pharmacy, West Virginia University, Morgantown, West Virginia, United States of America
                [2] Real World Evidence, OPEN Health Evidence & Access, United States of America
                [3] Department of Pharmacotherapy, College of Pharmacy, University of North Texas Health Sciences Center, Fort Worth, Texas, United States of America
                [4] Department of Health Services Administration and Policy, College of Public Health, Temple University, Philadelphia, Pennsylvania, United States of America
                Author notes

                The authors have declared that no competing interests exist.

                Author information
                https://orcid.org/0000-0002-6719-3228
                Article
                PDIG-D-24-00157
                10.1371/journal.pdig.0000765
                11922232
                40106471
                72b04029-28f1-42f5-9bd7-fbd157d1aee3
                © 2025 Siddiqui et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 17 April 2024
                Accepted: 23 January 2025
                Page count
                Figures: 3, Tables: 3, Pages: 19
                Funding
                The author(s) received no specific funding for this work.
                Categories
                Research Article
                Computer and Information Sciences > Information Technology > Natural Language Processing
                Medicine and Health Sciences > Oncology > Cancers and Neoplasms > Hematologic Cancers and Related Disorders > Lymphoma > Hodgkin Lymphoma
                Medicine and Health Sciences > Hematology > Hematologic Cancers and Related Disorders > Lymphoma > Hodgkin Lymphoma
                Social Sciences > Sociology > Communications > Social Communication > Social Media
                Computer and Information Sciences > Network Analysis > Social Networks > Social Media
                Social Sciences > Sociology > Social Networks > Social Media
                Medicine and Health Sciences > Clinical Genetics > Stem Cell Therapy
                Medicine and Health Sciences > Oncology > Cancer Treatment
                Medicine and Health Sciences > Oncology > Cancer Treatment > Radiation Therapy
                Medicine and Health Sciences > Clinical Medicine > Clinical Oncology > Radiation Therapy
                Medicine and Health Sciences > Oncology > Clinical Oncology > Radiation Therapy
                Medicine and Health Sciences > Diagnostic Medicine > Cancer Detection and Diagnosis
                Medicine and Health Sciences > Oncology > Cancer Detection and Diagnosis
                Medicine and Health Sciences > Oncology > Cancer Treatment > Cancer Immunotherapy
                Medicine and Health Sciences > Clinical Medicine > Clinical Immunology > Immunotherapy > Cancer Immunotherapy
                Biology and Life Sciences > Immunology > Clinical Immunology > Immunotherapy > Cancer Immunotherapy
                Medicine and Health Sciences > Immunology > Clinical Immunology > Immunotherapy > Cancer Immunotherapy
                Custom metadata
                Data underlying this research were obtained from X. Access can be requested on the X platform using the following link: https://developer.x.com/en/use-cases/do-research. In addition, the code used in developing the manuscript can be accessed through the Zenodo repository at the following DOI: https://doi.org/10.5281/zenodo.14523003.