      Is Open Access

      PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph

      research-article


          Abstract

          The use of comparative genomics for functional, evolutionary, and epidemiological studies requires methods to classify gene families in terms of occurrence in a given species. These methods usually lack multivariate statistical models to infer the partitions and the optimal number of classes, and do not account for genome organization. We introduce a graph structure to model pangenomes in which nodes represent gene families and edges represent genomic neighborhood. Our method, named PPanGGOLiN, partitions nodes using an Expectation-Maximization algorithm based on a multivariate Bernoulli Mixture Model coupled with a Markov Random Field. This approach takes into account the topology of the graph and the presence/absence of genes in pangenomes to classify gene families into persistent, cloud, and one or several shell partitions. By analyzing the partitioned pangenome graphs of isolate genomes from 439 species and metagenome-assembled genomes from 78 species, we demonstrate that our method is effective in estimating the persistent genome. Interestingly, it shows that the shell genome is a key element for understanding genome dynamics, presumably because it reflects how genes present at intermediate frequencies drive adaptation of species, and that its proportion in genomes is independent of genome size. The graph-based approach proposed by PPanGGOLiN is useful to depict the overall genomic diversity of thousands of strains in a compact structure and provides an effective basis for very large scale comparative genomics. The software is freely available at https://github.com/labgem/PPanGGOLiN.
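          The two ingredients named in the abstract, a graph of gene families linked by genomic neighborhood and a Bernoulli mixture fitted by Expectation-Maximization, can be illustrated with a small sketch. This is not PPanGGOLiN's implementation: it fits a plain multivariate Bernoulli mixture, leaves out the Markov Random Field coupling entirely, and does not select the number of classes. The function names, the toy genomes, and the deterministic initialization are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def build_pangenome_graph(genomes):
    """genomes: dict genome_id -> list of gene-family ids in genomic order.

    Returns (nodes, edges). nodes maps each family to the set of genomes
    carrying it; edges maps an unordered family pair to the number of
    genomes in which the two families are adjacent.
    """
    nodes = defaultdict(set)
    edge_genomes = defaultdict(set)
    for gid, families in genomes.items():
        for fam in families:
            nodes[fam].add(gid)
        for a, b in zip(families, families[1:]):
            if a != b:
                edge_genomes[frozenset((a, b))].add(gid)
    return dict(nodes), {e: len(g) for e, g in edge_genomes.items()}


def presence_matrix(nodes, genome_ids):
    """Rows are families (sorted), columns are genomes, entries are 0/1."""
    families = sorted(nodes)
    return families, [[1 if g in nodes[f] else 0 for g in genome_ids]
                      for f in families]


def fit_bernoulli_mixture(X, n_classes=3, n_iter=50, eps=1e-6):
    """Plain EM for a multivariate Bernoulli mixture (no MRF smoothing).

    Class 0 starts with the highest presence probability, so it tends to
    capture the most widespread families (illustrative initialization).
    """
    n, d = len(X), len(X[0])
    pi = [1.0 / n_classes] * n_classes
    theta = [[(n_classes - k) / (n_classes + 1)] * d for k in range(n_classes)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities, computed in log space.
        resp = []
        for x in X:
            logp = [math.log(pi[k]) + sum(
                        math.log(theta[k][j] if x[j] else 1.0 - theta[k][j])
                        for j in range(d))
                    for k in range(n_classes)]
            top = max(logp)
            w = [math.exp(lp - top) for lp in logp]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: update mixing proportions and per-genome probabilities.
        for k in range(n_classes):
            nk = sum(r[k] for r in resp)
            pi[k] = max(nk / n, eps)
            for j in range(d):
                t = sum(r[k] * x[j] for r, x in zip(resp, X)) / max(nk, eps)
                theta[k][j] = min(max(t, eps), 1.0 - eps)
    labels = [max(range(n_classes), key=lambda k: r[k]) for r in resp]
    return labels, pi, theta


# Toy input: three genomes sharing families A, B, C, with D and E rare.
toy = {"g1": ["A", "B", "C", "D"],
       "g2": ["A", "B", "C", "E"],
       "g3": ["A", "B", "C"]}
nodes, edges = build_pangenome_graph(toy)
families, X = presence_matrix(nodes, sorted(toy))
labels, pi, theta = fit_bernoulli_mixture(X, n_classes=2)
```

          With two classes on this toy input, the widespread families end up in one class and the rare ones in the other. According to the abstract, the actual method additionally uses the graph topology, so that neighboring families influence each other's partition.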

          Author summary

          Microorganisms have the greatest biodiversity and the longest evolutionary history on Earth. At the genomic level, this is reflected by a highly variable gene content, even among organisms from the same species, which explains the ability of microbes to be pathogenic or to grow in specific environments. We developed a new method called PPanGGOLiN which accurately represents the genomic diversity of a species (i.e. its pangenome) using a compact graph structure. Based on this pangenome graph, we classify genes by a statistical method according to their occurrence in the genomes. This method allowed us to build pangenomes even for uncultivated species at an unprecedented scale. We applied our method to all available genomes in databanks in order to depict the overall diversity of hundreds of species. Overall, our work enables microbiologists to explore and visualize pangenomes like a subway map.


                Author and article information

                Contributors
                Roles: Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing
                Roles: Data curation, Formal analysis, Investigation, Software, Writing – original draft, Writing – review & editing
                Role: Data curation
                Role: Data curation
                Role: Data curation
                Roles: Data curation, Writing – review & editing
                Role: Data curation
                Role: Writing – review & editing
                Role: Writing – review & editing
                Role: Writing – review & editing
                Roles: Methodology, Writing – review & editing
                Roles: Methodology, Writing – review & editing
                Roles: Writing – original draft, Writing – review & editing
                Roles: Conceptualization, Supervision, Writing – original draft, Writing – review & editing
                Role: Editor
                Journal
                PLoS Computational Biology (PLoS Comput Biol)
                Publisher: Public Library of Science (San Francisco, CA, USA)
                ISSN: 1553-734X, 1553-7358
                Published: 19 March 2020
                Volume 16, Issue 3, e1007732
                Affiliations
                [1] LABGeM, Génomique Métabolique, CEA, Genoscope, Institut François Jacob, Université d’Évry, Université Paris-Saclay, CNRS, Evry, France
                [2] Microbial Evolutionary Genomics, Institut Pasteur, CNRS, UMR3525, Paris, France
                [3] Sorbonne Université, Collège doctoral, Paris, France
                [4] Laboratoire de Probabilités, Statistique et Modélisation, Sorbonne Université, Université de Paris, Centre National de la Recherche Scientifique, Paris, France
                [5] Laboratoire de Mathématiques et Modélisation d’Evry, UMR CNRS 8071, Université d’Evry Val d’Essonne, Evry, France
                CPERI, GREECE
                Author notes

                The authors have declared that no competing interests exist.

                [¤a]

                Current address: Hub de Bioinformatique et Biostatistique - Département Biologie Computationnelle, Institut Pasteur, USR 3756 CNRS, Paris, France

                [¤b]

                Current address: PathoQuest SAS, BioPark – bâtiment B, 11 rue Watt, 75013 Paris, France

                Author information
                http://orcid.org/0000-0002-0970-9361
                http://orcid.org/0000-0002-5656-4708
                http://orcid.org/0000-0002-7826-3316
                http://orcid.org/0000-0003-1490-7271
                http://orcid.org/0000-0003-4797-6185
                http://orcid.org/0000-0002-3905-1054
                http://orcid.org/0000-0002-2342-9729
                http://orcid.org/0000-0001-7704-822X
                http://orcid.org/0000-0001-6648-0332
                Article
                Manuscript ID: PCOMPBIOL-D-19-02015
                DOI: 10.1371/journal.pcbi.1007732
                PMCID: PMC7108747
                PMID: 32191703
                e1b3b3c4-a2ba-43d2-987c-6517aaf52b82
                © 2020 Gautreau et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 19 November 2019
                Accepted: 12 February 2020
                Page count
                Figures: 7, Tables: 0, Pages: 27
                Funding
                Funded by: Agence Nationale de la Recherche
                Award ID: ANR-16-CE12-29
                Funded by: Agence Nationale de la Recherche
                Award ID: ANR-11-INBS-0013
                Funded by: Agence Nationale de la Recherche
                Award ID: ANR-10-INBS-09-08
                This research was supported in part by the IRTELIS and Phare PhD programs of the French Alternative Energies and Atomic Energy Commission (CEA) for GG and AB respectively, the French Government "Investissements d’Avenir" programs (namely FRANCE GENOMIQUE [ANR-10-INBS-09-08], the INSTITUT FRANÇAIS DE BIOINFORMATIQUE [ANR-11-INBS-0013], and the Agence Nationale de la Recherche [Projet ANR-16-CE12-29 for EPCR]). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences > Computational Biology > Comparative Genomics
                Biology and Life Sciences > Genetics > Genomics > Comparative Genomics
                Biology and Life Sciences > Genetics > Genomics
                Biology and Life Sciences > Computational Biology > Genome Evolution
                Biology and Life Sciences > Genetics > Genomics > Genome Evolution
                Biology and Life Sciences > Evolutionary Biology > Molecular Evolution > Genome Evolution
                Biology and Life Sciences > Synthetic Biology > Synthetic Genomics
                Engineering and Technology > Synthetic Biology > Synthetic Genomics
                Biology and Life Sciences > Taxonomy
                Computer and Information Sciences > Data Management > Taxonomy
                Physical Sciences > Mathematics > Applied Mathematics > Algorithms
                Research and Analysis Methods > Simulation and Modeling > Algorithms
                Biology and Life Sciences > Computational Biology > Genome Analysis
                Biology and Life Sciences > Genetics > Genomics > Genome Analysis
                Research and Analysis Methods > Database and Informatics Methods > Biological Databases > Genomic Databases
                Biology and Life Sciences > Computational Biology > Genome Analysis > Genomic Databases
                Biology and Life Sciences > Genetics > Genomics > Genome Analysis > Genomic Databases
                Custom metadata
                vor-update-to-uncorrected-proof: 2020-03-31
                Data availability: Archaeal and bacterial genomes were downloaded from the NCBI FTP server (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank) on 17 April 2019. Metagenome-Assembled Genomes were downloaded from https://opendata.lifebit.ai/table/SGB. All analyses described here were run using PPanGGOLiN software (version 1.0). PPanGGOLiN source code is freely available from https://github.com/labgem/PPanGGOLiN under a CeCILL license. All relevant data are within the manuscript and its Supporting Information files.

                Quantitative & Systems biology
