The use of social media platforms in health research is increasing, yet their application in studying rare diseases is limited. Hodgkin’s lymphoma (HL) is a rare malignancy with a high incidence in young adults. This study evaluates the feasibility of using social media data to study the disease and treatment characteristics of HL.
We used the X (formerly Twitter) API v2, accessed through the developer portal, to download posts (formerly tweets) published from January 2010 to October 2022. Annotation guidelines were developed from the literature, and a limited number of posts were manually reviewed to identify the classes and attributes (characteristics) of HL discussed on X and to create a gold standard dataset. This dataset was then used to train, test, and validate a Named Entity Recognition (NER) Natural Language Processing (NLP) application.
After data preparation, 80,811 posts were collected: 500 for annotation guideline development, 2,000 for NLP application development, and the remaining 78,311 for deploying the application. We identified nine classes related to HL, such as HL classification, etiopathology, stages and progression, and treatment. Treatment and stages and progression were the most frequently discussed classes: of the 78,311 deployment posts, 20,013 (25.56%) mentioned HL treatments and 17,177 (21.93%) mentioned HL stages and progression. Overall, the model achieved 86% accuracy and an 87% F1 score. The etiopathology class performed particularly well, with 93% accuracy and a 95% F1 score.
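The arithmetic behind these figures can be checked with a minimal sketch. The post counts below are the ones reported above; the F1 helper uses the standard definition, and the true-positive, false-positive, and false-negative counts passed to it are purely hypothetical, chosen only to illustrate the formula, since the abstract does not report them.

```python
# Post counts reported in the abstract.
GUIDELINE_POSTS = 500       # annotation guideline development
DEVELOPMENT_POSTS = 2_000   # NLP application development
DEPLOYMENT_POSTS = 78_311   # posts the application was deployed on

total_posts = GUIDELINE_POSTS + DEVELOPMENT_POSTS + DEPLOYMENT_POSTS


def share_of_deployment(mentions: int) -> float:
    """Percentage of deployment posts that mention a class, rounded to 2 places."""
    return round(100 * mentions / DEPLOYMENT_POSTS, 2)


treatment_share = share_of_deployment(20_013)
stages_share = share_of_deployment(17_177)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical counts (not from the study), used only to show the formula.
example_f1 = f1_score(tp=87, fp=13, fn=13)

print(total_posts, treatment_share, stages_share, example_f1)
```

The reported percentages are consistent with the 78,311 deployment posts as the denominator, not the full 80,811 collected.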
The NLP application displayed high efficacy in extracting and characterizing HL-related information from social media posts, as evidenced by the high F1 score. Nonetheless, the data presented limitations in distinguishing between patients, providers, and caregivers and in establishing the temporal relationships between classes and attributes. Further research is necessary to bridge these gaps.
This study explores the potential of using the social media platform X (formerly Twitter) to study Hodgkin’s lymphoma (HL), a rare cancer with a high incidence among young adults. Using the X API, we collected 80,811 posts published from January 2010 to October 2022 to analyze disease-related discussions. We developed a Named Entity Recognition (NER) Natural Language Processing (NLP) tool to categorize posts into HL-related topics, such as disease classification, stages and progression, and treatments. The most commonly discussed topics were HL treatments and disease progression. The NER tool proved effective, with an overall accuracy of 86% and F1 score of 87%, and class-level scores reaching 93% accuracy and a 95% F1 score for etiopathology. These results suggest that social media can serve as a valuable platform for gathering preliminary data on rare diseases like HL. However, the study also found it difficult to tell whether posts came from patients, caregivers, or providers and to pinpoint the timing of the events discussed, indicating that further refinement is needed.