
      CogStack - experiences of deploying integrated information retrieval and extraction services in a large National Health Service Foundation Trust hospital

      research-article


          Abstract

          Background

          Traditional health information systems are generally devised to support clinical data collection at the point of care. However, as the significance of the modern information economy expands in scope and permeates the healthcare domain, there is an increasing urgency for healthcare organisations to offer information systems that address the expectations of clinicians, researchers and the business intelligence community alike. Amongst other emergent requirements, the principal unmet need might be defined as the 3R principle (right data, right place, right time) to address deficiencies in organisational data flow while retaining the strict information governance policies that apply within the UK National Health Service (NHS). Here, we describe our work on creating and deploying a low-cost structured and unstructured information retrieval and extraction architecture within King’s College Hospital, the management of governance concerns and the associated use cases and cost-saving opportunities that such components present.

          Results

          To date, our CogStack architecture has processed over 300 million lines of clinical data, making it available for internal service improvement projects at King’s College London. On generated data designed to simulate real world clinical text, our de-identification algorithm achieved up to 94% precision and up to 96% recall.
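To make the reported precision and recall figures concrete, the following is a minimal sketch of how such scores are commonly computed for a de-identification task. It is not the CogStack evaluation code: the span representation, the exact-match criterion, the function name `precision_recall` and the example offsets are all assumptions chosen for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation: a sensitive-text span as (start, end) character offsets.
Span = Tuple[int, int]


def precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Return (precision, recall) using exact span matching.

    precision = correctly masked spans / all spans the algorithm masked
    recall    = correctly masked spans / all spans that should have been masked
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative (made-up) offsets: four identifiers annotated in a note,
# four spans masked by the algorithm, three of them correct.
gold = {(0, 10), (25, 33), (40, 52), (60, 70)}
predicted = {(0, 10), (25, 33), (40, 52), (80, 85)}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In de-identification, recall is usually the more safety-critical of the two measures, because every missed span is an identifier left in the text, whereas a drop in precision only means that some harmless text was masked unnecessarily.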

          Conclusion

          We describe a toolkit which we feel is of huge value to the UK (and beyond) healthcare community. It is the only open source, easily deployable solution designed for the UK healthcare environment, in a landscape populated by expensive proprietary systems. Solutions such as these provide a crucial foundation for the genomic revolution in medicine.


                Author and article information

                Contributors
                richgjackson@gmail.com
                ismailemrekartoglu@gmail.com
                clive.stringer@nhs.net
                g.gorrell@sheffield.ac.uk
                angus.roberts@sheffield.ac.uk
                x.song@sheffield.ac.uk
                honghan.wu@kcl.ac.uk
                asha.agrawal@nhs.net
                k.lui@ucl.ac.uk
                t.groza@garvan.org.au
                lewsley@nhs.net
                doug.northwood@nhs.net
                amos.folarin@kcl.ac.uk
                robert.stewart@kcl.ac.uk
                richard.j.dobson@kcl.ac.uk
                Journal
                BMC Med Inform Decis Mak
                BMC Medical Informatics and Decision Making
                BioMed Central (London)
                1472-6947
                25 June 2018
                2018; 18: 47
                Affiliations
                [1] ISNI 0000 0001 2322 6764, GRID grid.13097.3c, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, 16 De Crespigny Park, London, SE5 8AF, UK
                [2] ISNI 0000 0000 9439 0839, GRID grid.37640.36, South London and Maudsley NHS Foundation Trust, Denmark Hill, London, SE5 8AZ, UK
                [3] ISNI 0000 0004 0391 9020, GRID grid.46699.34, King’s College Hospital, Denmark Hill, London, SE5 9RS, UK
                [4] ISNI 0000 0004 1936 9262, GRID grid.11835.3e, University of Sheffield, Western Bank, Sheffield, S10 2TN, UK
                [5] ISNI 0000 0001 2190 1201, GRID grid.83440.3b, Farr Institute of Health Informatics Research, UCL Institute of Health Informatics, University College London, London, WC1E 6BT, UK
                [6] ISNI 0000 0000 9983 6924, GRID grid.415306.5, Garvan Institute of Medical Research, Sydney, NSW 2010, Australia
                [7] InterDigital Communications, 64 Great Eastern Street, 1st Floor, London, EC2A 3QR, UK
                [8] ISNI 0000 0004 1936 7988, GRID grid.4305.2, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, EH16 4UX, UK
                Author information
                http://orcid.org/0000-0002-3278-8547
                Article
                Article number: 623
                DOI: 10.1186/s12911-018-0623-9
                PMCID: PMC6020175
                PMID: 29941004
                8b4b565c-53db-4058-880f-bef5aa7068db
                © The Author(s) 2018

                Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License( http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver( http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 20 March 2017
                Accepted: 1 June 2018
                Funding
                Funded by: Wellcome Trust (FundRef http://dx.doi.org/10.13039/100004440)
                Award ID: MR/K006584/1
                Funded by: UK Infrastructure for Large-scale Clinical Genomics Research
                Award ID: MC_PC_14089
                Funded by: National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust
                Funded by: National Institute for Health Research (NIHR) Biomedical Research Centre at University College London Hospital
                Funded by: NHS England Enablement
                Funded by: Horizon 2020 Framework Programme (FundRef http://dx.doi.org/10.13039/100010661)
                Award ID: 644753
                Categories
                Software

                Bioinformatics & Computational biology
                elasticsearch, electronic health records, information extraction, clinical informatics, natural language processing
