Adapting historical clinical genetic test records for anonymised data linkage: obstacles and opportunities

Abstract

Introduction

Cystic fibrosis (CF) heterozygotes (also known as ‘carriers’) are people who have one mutated copy of the CFTR gene. Research into the health risks of CF carriers has been limited by a lack of large cohorts tested for CF carrier status, although routine clinical testing does identify CF carriers in the population. The resulting test records also contain large amounts of clinical information, making them a valuable research resource both for identifying CF carriers and for providing data not found elsewhere.

Methods

Following governance approvals, we adapted 30 years' worth of CF genetic testing records generated by the All-Wales Medical Genomics Service (AWMGS) and submitted them to the Secure Anonymised Information Linkage (SAIL) Databank for anonymised linkage.

Results

Unexpected obstacles meant that only a minimal amount of clinical information could be annotated ahead of linkage. The raw data were highly heterogeneous because the records were collected longitudinally and originated in clinical practice, which made standardisation difficult. Moreover, the presence of unique identifiers in the clinical data violated the separation principle, so manual annotation was required to produce a cleaned dataset. Explicit identification of patients or their relatives throughout the records complicated split-file anonymisation.
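
As a rough illustration of the split-file idea mentioned in these results, the sketch below divides one structured record into an identifier part and a clinical part that share only an arbitrary row key. The field names (`name`, `dob`, `nhs_number`, `result`) and the `split_record` helper are hypothetical; they are not taken from the AWMGS records or from the SAIL submission process.

```python
import uuid

# Hypothetical field names; the real record layout is not described in the abstract.
IDENTIFIER_FIELDS = {"name", "dob", "nhs_number"}


def split_record(record, key=None):
    """Split one record into (identifier_part, clinical_part).

    Both parts carry the same arbitrary row key, so they can be matched up
    again once the identifiers have been replaced by an anonymised linking field.
    """
    key = key or uuid.uuid4().hex
    identifiers = {"row_key": key}
    clinical = {"row_key": key}
    for field, value in record.items():
        target = identifiers if field in IDENTIFIER_FIELDS else clinical
        target[field] = value
    return identifiers, clinical


record = {
    "name": "Example Person",   # made-up values for illustration only
    "dob": "1980-01-01",
    "nhs_number": "0000000000",
    "result": "heterozygous",
}
ids, clin = split_record(record, key="r1")
```

A structured split of this kind only works when identifiers sit in known fields. It does nothing about identifiers embedded in free text, which is the obstacle these results describe.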

Conclusion

Extracting useful information from historical clinical genetic test records is a significant challenge with both technical and governance aspects. The mixing of unique identifiers with clinical data in heterogeneous, unstructured free text, combined with a lack of automated tools, meant that manual annotation was required to adhere to the separation principle. As such, only a minimum of the available clinical data could be annotated within the project timeline, and mutually exclusive access to the identifiable and pseudonymised data meant that annotations could not later be validated. Future efforts to link clinical genetic test records for research must consider these challenges in their approach.
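
To make the free-text problem concrete, the following sketch flags one kind of identifier-like token: ten-digit numbers that pass the Modulus 11 check used for NHS numbers. The abstract does not say which identifiers appeared in the records, so the choice of NHS numbers, the regular expression and the function names are all assumptions made for illustration. A filter like this would at most support manual annotation; it would not catch names or descriptions of relatives.

```python
import re

# Candidate NHS numbers: 10 digits, optionally grouped 3-3-4 with spaces or hyphens.
# Illustrative assumption: the abstract does not name the identifier types involved.
CANDIDATE = re.compile(r"\b(\d{3})[ -]?(\d{3})[ -]?(\d{4})\b")


def is_valid_nhs_number(digits):
    """Return True if a 10-digit string passes the Modulus 11 check."""
    if len(digits) != 10 or not digits.isdigit():
        return False
    # Weight the first nine digits by 10, 9, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number is invalid.
    return check != 10 and check == int(digits[9])


def flag_candidates(text):
    """Return (start, end, digits) for each checksum-valid candidate in text."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = "".join(match.groups())
        if is_valid_nhs_number(digits):
            hits.append((match.start(), match.end(), digits))
    return hits
```

Checking the digit sum keeps ordinary ten-digit strings such as laboratory reference numbers from being flagged in most cases, although some will pass the check by chance.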

Author and article information

Journal

Journal ID (nlm-ta): Int J Popul Data Sci

Journal ID (iso-abbrev): Int J Popul Data Sci

Journal ID (publisher-id): IJPDS

Title: International Journal of Population Data Science

Publisher: Swansea University

ISSN (Electronic): 2399-4908

Publication date (Electronic, pub): 20 February 2025

Publication date (Electronic, collection): 2023

Volume: 8

Issue: 5

Electronic Location Identifier: 2924

Affiliations

[1] Wales Gene Park, Division of Cancer and Genetics, Cardiff University, Canolfan Iechyd Genomig Cymru / Wales Genomic Health Centre, Cardiff Edge Business Park, Longwood Drive, Whitchurch, Cardiff, CF14 7YU, UK

[2] Centre for Trials Research, Cardiff University, Cardiff, CF14 4XN, UK

[3] All-Wales Medical Genomics Service, Canolfan Iechyd Genomig Cymru / Wales Genomic Health Centre, Cardiff Edge Business Park, Longwood Drive, Whitchurch, Cardiff, CF14 7YU, UK

[4] Centre for Medical Education, Cardiff University, Cardiff, CF14 4XN, UK

Author notes

[*] Corresponding author: Robert T. Maddison, maddisonr@cardiff.ac.uk

Statement on conflicts of interest: The authors have no conflicts to declare.

Article

Publisher ID: 8:5:2924

DOI: 10.23889/ijpds.v8i5.2924

PMC ID: 11922013

PubMed ID: 40110575

SO-VID: 9d754dfe-40f7-4b89-9f04-22ae06b8fb58

License:

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
