13
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Draft Macronuclear Genome Sequence of the Ruminal Ciliate Entodinium caudatum

      brief-report

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          The transcriptionally active macronucleus of a ruminal ciliate, Entodinium caudatum MZG-1, was sequenced using the Illumina MiSeq and Oxford Nanopore MinION platforms. This is the first draft macronuclear genome sequence of a ruminal protozoon, and the genomic information will provide useful insight into the metabolism, physiology, and ecology of ruminal ciliates.

          ABSTRACT

          The transcriptionally active macronucleus of a ruminal ciliate, Entodinium caudatum MZG-1, was sequenced using the Illumina MiSeq and Oxford Nanopore MinION platforms. This is the first draft macronuclear genome sequence of a ruminal protozoon, and the genomic information will provide useful insight into the metabolism, physiology, and ecology of ruminal ciliates.

          Related collections

          Most cited references15

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information

          Background The recent introduction of the Pacific Biosciences RS single molecule sequencing technology has opened new doors to scaffolding genome assemblies in a cost-effective manner. The long read sequence information is promised to enhance the quality of incomplete and inaccurate draft assemblies constructed from Next Generation Sequencing (NGS) data. Results Here we propose a novel hybrid assembly methodology that aims to scaffold pre-assembled contigs in an iterative manner using PacBio RS long read information as a backbone. On a test set comprising six bacterial draft genomes, assembled using either a single Illumina MiSeq or Roche 454 library, we show that even a 50× coverage of uncorrected PacBio RS long reads is sufficient to drastically reduce the number of contigs. Comparisons to the AHA scaffolder indicate our strategy is better capable of producing (nearly) complete bacterial genomes. Conclusions The current work describes our SSPACE-LongRead software which is designed to upgrade incomplete draft genomes using single molecule sequences. We conclude that the recent advances of the PacBio sequencing technology and chemistry, in combination with the limited computational resources required to run our program, allow to scaffold genomes in a fast and reliable manner.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            A first look at the Oxford Nanopore MinION sequencer.

            Oxford Nanopore's third-generation single-molecule sequencing platform promises to decrease costs for reagents and instrumentation. After a 2-year hiatus following the initial announcement, the first devices have been released as part of an early access program. We explore the performance of this platform by resequencing the lambda phage genome, and amplicons from a snake venom gland transcriptome. Although the handheld MinION sequencer can generate more than 150 megabases of raw data in one run, at most a quarter of the resulting reads map to the reference, with less than average 10% identity. Much of the sequence consists of insertion/deletion errors, or is seemingly without similarity to the template. Using the lambda phage data as an example, although the reads are long, averaging 5 kb, at best 890 ± 1932 bases per mapped read could be matched to the reference without soft clipping. In the course of a 36 h run on the MinION, it was possible to resequence the 48 kb lambda phage reference at 16× coverage. Currently, substantially larger projects would not be feasible using the MinION. Without increases in accuracy, which would be required for applications such as genome scaffolding and phasing, the current utility of the MinION appears limited. Library preparation requires access to a molecular laboratory, and is of similar complexity and cost to that of other next-generation sequencing platforms. The MinION is an exciting step in a new direction for single-molecule sequencing, though it will require dramatic decreases in error rates before it lives up to its promise.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              The Oxytricha trifallax Macronuclear Genome: A Complex Eukaryotic Genome with 16,000 Tiny Chromosomes

              Introduction Oxytricha trifallax is a distinctive ciliate [1]—an ancient lineage of protists named for their coats of cilia. Like all ciliates, Oxytricha has two types of nuclei: a micronucleus, a germline nucleus that is largely transcriptionally inactive during vegetative growth, and a macronucleus, which is the transcriptionally active somatic nucleus [2]. Compared to the micronucleus, Oxytricha's macronucleus is massively enlarged due to ∼2,000-fold [2] amplification resulting from two rounds of DNA amplification [3] during development. In the model ciliates Oxytricha trifallax, Tetrahymena thermophila, and Paramecium tetraurelia, varying amounts of micronuclear DNA are deleted (including the “internally eliminated sequences,” or IESs, interspersed between “macronuclear destined sequences,” or MDSs) during conjugation or autogamy (two forms of sexual development) to give rise to the information-rich macronuclear genome (Figure 1). A much larger fraction of the Oxytricha micronuclear genome—∼96% of the micronuclear complexity [2]—is eliminated during the macronuclear formation than in the oligohymenophoreans, Tetrahymena and Paramecium (which both eliminate ∼30% of their micronuclear genomes [4],[5]). The most remarkable difference in macronuclear development between Oxytricha and the two oligohymenophoreans is that the micronuclear-encoded MDSs that give rise to the macronuclear chromosomes may be nonsequential, or even in different orientations in the micronuclear genome [6]. Consequently, unlike the oligohymenophoreans, Oxytricha needs to unscramble its micronuclear genome during macronuclear development. 10.1371/journal.pbio.1001473.g001 Figure 1 Development of the Oxytricha macronuclear genome from the micronuclear genome. During conjugation of Oxytricha cells, segments of the micronuclear genome (MDSs) are excised and stitched together to form the nanochromosomes of the new macronuclear genome, and the remainder of the micronuclear genome is eliminated (including the IESs interspersed between MDSs). The old macronuclear genome is also degraded during development. The segments that are stitched together may be either in order (e.g., forming nanochromosome 1, on the left) or out of order or inverted (e.g., forming the two forms of nanochromosome 2), in which case they need to be “unscrambled.” Two rounds of DNA amplification produce nanochromosomes at an average copy number of ∼1,900 [2]. Alternative fragmentation of DNA during nanochromosome development may also occur, irrespective of unscrambling, giving rise to longer (2a) and shorter (2b) nanochromosome isoforms. The mature nanochromosomes are capped on both ends with telomeres. Two fundamental differences distinguish Oxytricha's macronuclear chromosomes from those of Tetrahymena and Paramecium: Oxytricha's chromosomes are tiny (“nanochromosomes,” with a mean length ∼3.2 kb reported in this study), each typically encoding just a single gene with a minimal amount of surrounding non-protein-coding DNA [7], and they are differentially amplified (Figure 2 and 3) [8],[9]. In some cases, alternative fragmentation of macronuclear-destined micronuclear DNA produces different nanochromosome isoforms (Figure 2) 10–12, which may be present at very different levels of amplification (differing by as much as 10-fold [13]). Gene expression and nanochromosome copy number may be moderately correlated [14]. Macronuclear chromosomes in all the model ciliates segregate by amitosis during cellular replication (without a mitotic spindle) [15],—a process that may lead to allelic fixation 17–20. In ciliates with nanochromosomes, major fluctuations of nanochromosome copy number [8] may arise, since copy number is unregulated during normal cellular replication [21]. Theoretical models propose that these fluctuations are a cause of senescence in these ciliates [22]. In contrast to the lack of copy number regulation during cellular replication, both genetic [23]–[25] and epigenetic mechanisms [9],[26] may influence chromosome copy number during sexual development in ciliates. 10.1371/journal.pbio.1001473.g002 Figure 2 Comparison of key ciliate macronuclear genomes. The phylogeny represents the bootstrap consensus of 100 replicates from PhyML (with the HKY85 substitution model) based on a MUSCLE multiple sequence alignment of 18S rRNA genes from seven ciliates (Oxytricha trifallax—FJ545743; Stylonychia lemnae—AJJRB310497; Euplotes crassus—AJJRB310492; Nyctotherus ovalis—AJ222678; Tetrahymena thermophila—M10932; Ichthyophthirius multifiliis—IMU17354; and Paramecium tetraurelia—AB252009) rooted with two other alveolates (Perkinsus marinus—X75762 and Plasmodium falciparum—NC_004325). All bootstrap values are ≥80, except for the node between Nyctotherus and Oxytricha/Stylonychia/Euplotes, which has a boostrap value of 60. Euplotes and Nyctotherus both have nanochromosomes, like Oxytricha. Other than the genome statistics for Oxytricha trifallax, which were determined in this study, table statistics were obtained from the following sources: a - [2], b - [22],[116], c - [117], d - [99], e - [94], f - [56] (the number of chromosomes is an estimate), g -[118], h - [119], i - [120], j- [64] (for a single stage of the Ichthyophthirius life cycle), k - [121], l - [69], m - [122]. Table statistics for Perkinsus marinus are for the current assembly deposited in GenBank (GCA_000006405.1). 10.1371/journal.pbio.1001473.g003 Figure 3 Key features of Oxytricha protein-coding nanochromosomes. Representative nanochromosome features are not drawn to scale, but their lengths are indicated. UTR, untranslated region; UTS, untranscribed region. 3′ UTRs and the subtelomeric signal overlap. The subtelomeric base composition bias signal found on either end of the nanochromosome is shown above the nanochromosome diagram. As a consequence of the extensive fragmentation of the Oxytricha macronuclear genome, each macronucleus possesses tens of millions of telomeres, an abundance that enabled the first isolations of telomere end-binding proteins [27],[28]. Oxytricha also has micronuclear transposons bearing telomeric repeats (C4A4) that resemble those of nanochromosomes. These telomere-bearing elements, or TBE transposons [29], play an important role in macronuclear genome development [30]. The exact site of telomere addition may vary for some nanochromosome ends [31] and is followed by a roughly 50 bp subtelomeric region of biased base composition with an approximately 10 bp periodicity of the bias (Figure 3) [32],[33]. We report on the assembly and analysis of the first Oxytricha macronuclear genome, from the reference JRB310 strain. During and after assembly, we have addressed a number of challenges arising from the unusual structure of this genome, which we discuss. We focus on the most interesting and unique biological characteristics of this genome and place them in the context of the characteristics of the other sequenced ciliate macronuclear genomes. Results and Discussion Macronuclear Genome Assemblies To assemble the Oxytricha macronuclear genome for the type strain—JRB310 [1]—we chose to build upon three assemblies, from ABySS [34], IDBA [35], and PE-Assembler [36]/SSAKE [37], based on Illumina sequences, and supplemented by a Sanger/454 assembly. To combine these assemblies, we developed a specialized meta-assembly pipeline (see Materials and Methods and Figure S1). Current genome assembly strategies for second-generation sequence data often employ multiple, hybrid strategies to overcome the experimental biases leading to low sequence coverage in particular genomic regions and repetitive DNA [38]. Since Oxytricha's macronuclear genome was expected to have a low repeat content [2], repetitive DNA was expected to be a relatively insignificant issue, and thus even greedy genome assemblers were able to produce useful preliminary assemblies. However, unlike conventional genomes, the Oxytricha macronuclear genome provides assembly challenges by virtue of its fragmented architecture, variable processing (“alternative chromosome fragmentation”), and nonuniform nanochromosome copy number. We resolved these challenges during and after assembly. The initial 454/Sanger genome assemblies contained a mixture of bacterial DNA, mitochondrial DNA, and up to two additional alleles other than those expected from the strain we originally proposed to sequence (JRB310) due to accidental contamination by a commonly used strain in our lab (JRB510—a complementary mating type), whereas the Illumina assemblies were produced from purified macronuclear DNA from the type strain (JRB310) alone. Given these contamination issues, we built our final assembly with the Illumina assemblies as the primary data source, rather than the 454/Sanger assemblies, to maintain the purity of our final assembly. This excluded virtually all bacterial and mitochondrial contamination in our final assembly since very few contigs in the Illumina assemblies derive from these sources (sucrose gradient purification of macronuclei eliminated almost all such contaminants) that could potentially be extended by the 454/Sanger data. We also kept JRB510 allelic data to a minimum in our final assembly, (i) by preferring best extensions, which were most likely to be from the more similar JRB310 derived contigs or reads, either from the Illumina assemblies or from the Sanger/454 assemblies and raw Sanger data (see Materials and Methods), and (ii) by the sequence consensus majority rule (the most abundant base at each position from the assembled contigs) of the CAP3 assembler [39] used to combine the three Illumina assemblies versus one 454/Sanger assembly during contig construction. The conclusion that our final assembly strategy succeeded in keeping JRB510 allelic information to a minimum is supported by matches to the three pure JRB310 Illumina starting assemblies. Our final assembly is 82.0% covered by identical BLAT matches ≥100 bp long to one of these pure JRB310 assemblies and 92.5% covered by 99.5% identity, ≥100 bp BLAT matches to these assemblies (note that consensus building by CAP3 may result in alternating selection of JRB310 polymorphisms from the original assemblies, and hence even meta-contigs assembled from the pure JRB310 assemblies may have BLAT matches that differ from all the original assemblies). We chose to keep alleles apart by applying moderately strict criteria for merging during our meta-assembly (e.g., by merging contigs with overlaps at least 40 bp long and ≥99% identical with CAP3 [39]; see Materials and Methods). However, in order to maximize the number of complete nanochromosomes in our assembly, we collapsed some alleles (i.e., producing “quasi-nanochromosomes” derived from two alleles; see “Extensive Genome Homozygosity and High SNP Heterozygosity”). Merging of contigs is also complicated by alternative fragmentation, which affects ∼10% of the nanochromosomes and may either result in collapsing or splitting of nanochromosome isoforms (see “Extensive Alternative Nanochromosome Fragmentation”). We discriminated between homozygous and heterozygous nanochromosomes after assembly (see the next section). We have not attempted to determine the haplotypes of the heterozygous contigs due to computational complications arising from both alternative nanochromosome fragmentation and variable representation of alleles (which need not be 1∶1). A comparison of the Oxytricha macronuclear genome assemblies and meta-assembly is given in Table 1 (also see Tables S11–S17 for the progressive improvements in the genome assembly through the successive steps of our meta-assembly approach shown in Figure S1). Since the size selection used in the construction of our paired-end (PE) sequence library results in poor sequence coverage for a span of approximately 160 bp, roughly 100 bp from the telomeric ends (see Materials and Methods), the incorporation of single-end (SE) sequence data allowed ABySS to assemble far more contigs with telomeric sequences than either IDBA or the PE-Assembler/SSAKE assembler combination, neither of which could use SE and PE sequences simultaneously (Table 1). The ABySS assembly is larger (78.0 Mb) than the other assemblies (47.8 Mb for PE-Asm/SSAKE and 57.7 Mb for IDBA) and also more complete, as evidenced by a higher fraction of reads that can be mapped to this assembly (Table 1). The ABySS assembly also incorporates a substantially higher proportion of telomeric reads than the other two assemblers (91.8% for ABySS versus 45.4% for PE-Asm/SSAKE and 42.4% for IDBA) and contains a larger proportion of telomere-containing contigs (66.5% versus 18.3% for PE-Asm/SSAKE and 8.6% for IDBA) and longer contigs (mean length of 2,273 bp for ABySS versus 2,090 bp for PE-Asm/SSAKE and 1,204 bp for IDBA). Consequently, the ABySS assembler produced almost an order of magnitude more full-length nanochromosomes than the other two assemblers. Even though the ABySS assembly incorporates more telomeric reads than the other assemblies, it also excludes a higher proportion of telomeric reads than nontelomeric reads (92% telomeric PE reads versus 98% total PE reads map to the assembly; telomeric reads comprise ∼13% of all the reads). While the ABySS assembly appears to be the most complete, the majority of its contigs (81.3%) are still missing one or both telomeres. 10.1371/journal.pbio.1001473.t001 Table 1 Comparison of Oxytricha macronuclear genome assemblies. Assembler PE-Asm/SSAKE IDBA ABySS Final Assembly Assembly size (bp) 47,753,834 57,684,531 78,039,140 67,172,481 Number of contigs 22,840 47,882 34,330 22,450 Number of telomeres 4,883 3,929 30,071 38,724 N50 (bp) 2,579 1,938 3,473 3,736 Mean contig length (bp) 2,090 1,204 2,273 2,982 2-telomere contigs 683 262 6,421 15,993 Mean 2-telomere contig length (bp) 3,070 3,148 3,243 3,187 Number of 1-telomere contigs 3,497 3,376 16,399 5,303 Mean 1-telomere contig length (bp) 2,773 2,260 2,182 2,694 Number of 0-telomere contigs 18,660 44,244 11,510 1,154 Mean 0-telomere contig length (bp) 1,927 1,112 1,861 1,655 Number of multitelomere contigs 20 28 776 1,279 Mean multitelomere contig length (bp) 3,438 3,436 5,049 4,208 Raw PE read coverage 78.7% 87.2% 98.0% 98.0% PE telomeric read coverage 45.4% 42.4% 91.8% 91.0% Raw SE read coverage 72.5% 80.5% 88.2% 88.5% SE telomeric read coverage 29.0% 27.5% 64.4% 69.6% The 2-telomere contigs have both 5′ (CCCCAAAACCCC; with degenerate bases—see Materials and Methods) and 3′ (GGGGTTTTGGGG; with degenerate bases) telomeric repeats. Note that 2-telomere contigs are mostly complete nanochromosomes but may also be alternatively fragmented nanochromosomes with one or more additional missing ends and that multitelomere contigs may be either alternatively fragmented nanochromosomes or nanochromosomes with internal telomere-like repeats. Raw read coverage is calculated from LAST (default parameters; version 159; contig telomeres were masked) matches (≥70 bp long and ≥90% identical) to the assemblies. Read coverage was calculated for the total high quality PE sequence data set and one of the three lanes of SE sequence data. Of the PE reads 13% were telomere bearing, as opposed to 4.7% of the SE reads. The initial meta-assembly of the ABySS, IDBA, and PE-Asm assemblies yielded a modest improvement in the total number of full-length nanochromosomes relative to ABySS alone, with the ratio of full-length nanochromosomes to contigs increasing from 21% to 24% of the total number of contigs. Since the meta-assembly was still highly fragmented, and our aim was to assemble full-length nanochromosomes with complete genes, we developed a strategy that consisted of two rounds of extension of nontelomeric contig ends and reassembly (see Materials and Methods). This strategy produced an assembly where the majority (77%) of the contigs had both 5′ and 3′ telomeres. For the final meta-assembly, the average contig length (2,982 bp) is longer than that of any of the original assemblies (2,273 for ABySS versus 2,090 for PE-Asm/SSAKE and 1,204 for IDBA) and the read coverage is as high as or higher than the most complete ABySS assembly (98.0% coverage for PE reads for ABySS and the final assembly; 88.2% SE read coverage for ABySS and 88.5% SE read coverage for the final assembly). However, a larger fraction of telomeric reads than nontelomeric reads still do not map to the final assembly (e.g., 9.0% of telomeric PE reads versus 2.0% of all PE reads do not map to the assembly), indicating that some telomeric regions are still missing from the final assembly. Extensive Genome Homozygosity and High SNP Heterozygosity Since there are currently no published effective population size estimates for Oxytricha trifallax, we wanted to obtain an estimate from allelic diversity of the macronuclear genome. Furthermore, current estimates of effective population size for other free-living model ciliates, Paramecium and Tetrahymena, differ [40]–[42], so additional estimates from other species will be necessary to determine if there are general trends in population size within ciliates. Given our assembly conditions, we expect many allelic sequences to co-assemble, but visual inspection of reads mapped to the final assembly suggested that a substantial fraction of the genome is homozygous. A trivial explanation for this observed homozygosity is that the micronuclear genome regions (MDS) that form the nanochromosomes are homozygous, however it is also possible that a combination of other factors may be responsible for some of the observed homozygosity. These factors include both nanochromosomal allelic drift, arising from stochastic nanochromosome segregation during amitosis (during normal cellular replication) and nanochromosomal allelic selection, both of which could lead, in principle, to haplotype fixation (also known as “allelic assortment”), as well as allelic biases introduced during conjugation. Some well-studied macronuclear nanochromosomes, including the 81 locus [43], and the nanochromosomes encoding the telomere end-binding proteins α and β (Contig22209.0 and Contig22260.0) are homozygous in the Oxytricha JRB310 strain upon which our reference macronuclear genome assembly is based. Knowledge of the fraction of homozygous and heterozygous nanochromosomes is also necessary to obtain a reasonable estimate of macronuclear genome size. To determine nanochromosomal homozygosity, we focused on nonalternatively fragmented nanochromosomes in order to avoid both ambiguous alignments and possible false identification of heterozygosity due to the presence of alternative telomere locations. Of the nonalternatively fragmented nanochromosomes, 66% (7,487 out of 11,297) had no substantial BLAT [44] nonself matches (≥100 bp and ≥90% identical; default BLAT parameters) to any other contig in the final assembly (“matchless” nanochromosomes). Matchless nanochromosomes with variants present at ≥0.5% of the positions were considered heterozygous (see Materials and Methods for our precise definition of heterozygosity); otherwise they were considered to be homozygous. Note that our read mapping cutoff may reduce polymorphism estimates by filtering out reads with more polymorphisms (see Text S1, “Read Mapping Rationale”). These criteria overestimate low frequency variants at lower coverage sites and underestimate higher frequency variants at lower coverage sites (Figure S2) but comprise a small proportion of all the sites (e.g., 11.8% of sites identified as heterozygous are at 20–40× coverage). By these criteria, the well-characterized Oxytricha actin I [45],[46] nanochromosome (Contig19101.0) was correctly classified as heterozygous, with variants at 0.89% of the examined positions (12/1,350). Mismapped reads do not affect the identification of the heterozygous sites in this nanochromosome, since only two of the reads mapped to this nanochromosome map to any of the other contigs. For the actin I nanochromosome, the frequency of allelic variants varies from just over 5% (three positions) to 13.1% (mean 8.7%) corresponding to a roughly 1∶11 ratio of the minor∶major allele. Of the matchless nanochromosomes, 63% have sequence variants identified at 0%–0.5% of positions; hence, 37% of matchless nanochromosomes are classified as heterozygous by this criterion. Read mapping appears to be adequately sensitive to detect most SNPs (single nucleotide polymorphisms), since the mean number of reads per bp mapped to heterozygous and homozygous nanochromosomes is almost identical (see “Nanochromosome Copy Number Is Nonuniform”). Approximately 42% of nanochromosomes are homozygous when the proportion of putative homozygous nanochromosomes is calculated from the total number of matchless and matched nanochromosomes. The high levels of macronuclear genome homozygosity agree with preliminary observations of micronuclear sequence data for this strain (Chen et al., unpublished), suggesting that the majority of nanochromosomal homozygosity may derive from homozygosity in the micronuclear genome, rather than other possible factors (allelic assortment and/or developmental biases). Once the micronuclear genome is more complete, it will be possible to assess how much these factors have contributed to the observed homozygosity. Nevertheless, the abundance of homozygous nanochromosomes in the final assembly (∼42%) suggests that the wild-isolate JRB310 [47] may actually be substantially inbred and that this inbreeding arose at its source. Deleterious inbreeding effects may contribute to the complexity of Oxytricha trifallax mating types [47]. It will be interesting to determine whether the two “promiscuous” Oxytricha strains (JRB27 and JRB51 [47]) that mate with the broadest set of other mating types are less inbred. Sequence polymorphisms are abundant in the Oxytricha macronuclear genome: excluding the homozygous nanochromosomes (which may have arisen from inbreeding), the mean SNP heterozygosity is 4.0% (SD = 1.8%; Figure 4). From mapped reads, for heterozygous matchless nanochromosomes, mean SNP heterozygosity is 3.1% (SD = 1.3%), and for heterozygous matched nanochromosomes, it is 4.6% (SD = 1.9%; Figure 4). For alignments of heterozygous matched nanochromosomes (with BLAT matches ≥100 bp and ≥90% identical to each other) produced by MUSCLE (for nanochromosome pairs where one of the nanochromosomes is no more than 10% longer than the other and for alignments with ≤15% differences), mean SNP heterozygosity is 3.0% (SD = 2.5%). These estimates of SNP heterozygosity indicate that assembly has masked a substantial amount of allelic variation. Similar statistics were obtained for the subset of the final assembly's nanochromosomes that were present in the pure JRB310 strain ABySS assembly (e.g., heterozygous matchless mean SNP heterozygosity is 2.8%, SD = 1.2%, and for MUSCLE alignments of heterozygous matched nanochromosomes, mean heterozygosity is 3.2%, SD = 2.9%). Hence it is unlikely that any residual JRB510 strain allelic information in our final assembly has had a considerable effect upon our estimates of heterozygosity. Nevertheless, these are first estimates from a complex genome assembly, complicated by homozygosity due to potential inbreeding, and hence inferences based upon them (including our subsequent population size estimate) should be treated with caution until better estimates from the micronuclear genome and additional strains become available. 10.1371/journal.pbio.1001473.g004 Figure 4 Nanochromosomal SNP heterozygosity. The green histogram (left) corresponds to SNP heterozygosity estimated from mapped reads (see Materials and Methods) for “matchless” nanochromosomes (which have no non-self contig matches to the genome assembly) and includes homozygous nanochromosomes. The red histogram (right) corresponds to SNP heterozygosity estimated from mapped reads for “matched” nanochromosomes. The orange histogram (center) corresponds to SNP diversity assessed from pairwise alignments of matched nanochromosomes. The smallest bin is 0–0.005 (0%–0.5%) heterozygosity. At 4-fold synonymous sites from heterozygous nanochromosomes with matches (see Materials and Methods), the mean SNP heterozygosity is 8.3% (SD = 9.4%). This underestimates sequence diversity at 4-fold synonymous sites, since pairwise alignments of contigs underestimate SNP heterozygosity at all sites (see previous paragraph). If we apply a correction for the missing SNP heterozygosity based on our overall estimate of SNP heterozygosity, we obtain an estimate of 11.1% mean SNP heterozygosity at 4-fold synonymous sites [8.3%×(4.0%/3.0%)]. This 4-fold synonymous site SNP heterozygosity is very high—even higher than the current SNP heterozygosity record holder, Ciona savignyi, which has 8.0% 4-fold synonymous site mean SNP heterozygosity [48]. These high levels of SNP heterozygosity suggest that Oxytricha trifallax has a large effective population size. Assuming a mutation rate μ ∼10−9 per base per generation, as in Snoke et al. 2006 [42], for nucleotide diversity at 4-fold synonymous sites and π4S = 4Neμ, we estimate an effective population size of 2.6×107. This effective population size is on the same order estimated for P. tetraurelia (using silent site diversity, πS, which will yield a smaller population size estimate than one based on π4S) [42]. However, this estimate of the P. tetraurelia effective population size may be an overestimate due to incorrect classification of species within the Paramecium aurelia species complex and may be closer to the order of 106 [40]. In contrast, the T. thermophila effective population size is estimated to be considerably smaller than Oxytricha's, at Ne = 7.5×105 for πS = 0.003 (with μ = 10−9) [41],[42]. In laboratory culture conditions, Oxytricha trifallax tends to replicate asexually and rarely conjugates (resulting in meiotic recombination). Conjugation in the laboratory is induced by starvation as long as cells of compatible mating types are available. However, we do not know the frequency of conjugation relative to replication of Oxytricha trifallax in its natural environment. The relationships between the frequency of asexual reproduction, and additional population genetic factors arising from asexuality, such as the variance of asexual and sexual reproductive contributions, are complex and can result in increases or decreases in estimates of effective population size [49],[50]. As a result, effective population size estimates for Oxytricha should be treated with caution until these factors are better understood. As indicated for the well-characterized actin I locus, which has a roughly 1∶11 ratio of minor∶major allelic variant, there may be substantial deviations from the expected 1∶1 ratio for the two possible allelic variants at each site. For matchless nanochromosomes, we found that the distribution of median nanochromosomal variant frequency (i.e., the median frequency of putative allelic polymorphisms) is bimodal, with one mode close to the expectation at 40%–45% and the other at 5%–10% (Figure 5; since the lower peak is bounded by the cutoff we chose to assess variants, the true lower peak may be lower than this). The bimodality of this distribution persists even if we only consider nanochromosomes where the mean coverage of the variant sites is high (e.g., at mean coverage of ≥110× at variant sites). Though some deviation from the 1∶1 variant ratio might result from allele-specific read mapping biases [51], given the relatively relaxed read mapping parameters we used (see Materials and Methods, “Read Mapping and Variant Detection”), the two variant frequency modes differ too much to explain the lower mode's existence. Instead, the deviation from the expected ratio may indicate that allelic assortment has occurred, or that there are developmentally specific allelic biases. Since a high proportion of nanochromosomes deviate substantially from the 1∶1 expected allelic ratio, it is also possible that allelic assortment has occurred for some nanochromosomes, which may contribute to the observed abundance of homozygous nanochromosomes. Nanochromosomes that deviate the most from the expected 1∶1 allelic ratio tend to have lower mean SNP heterozygosities, which likely reflects the diminished ability to detect SNPs with more distorted variant frequencies (Figure 5). 10.1371/journal.pbio.1001473.g005 Figure 5 Nanochromosomal variant frequencies. (A) Normalized to form a probability density (cumulative frequency of 1) and (B) unnormalized median nanochromosomal variant frequencies for six increasing ranges of mean SNP heterozygosity. Variant frequencies were determined for nanochromosomes with no non-self matches to the genome assembly (the same nanochromosomes underlying the SNP heterozygosity histogram for “matchless” nanochromosomes in Figure 4), with variant positions called at the same minimum variant frequency as that used to determine potentially heterozygous sites (5% for sites with ≥20× read coverage). To exclude potentially paralogous mapped reads, we only analyzed nanochromosomes with ≤4 reads mapped to other contigs (using all nanochromosomes does not substantially change the form of the distributions). Variant frequency bins are labeled by their lower bounds. Variant frequencies ≥40 bp from either nanochromosome end were counted to avoid possible incorrect variant calling resulting from telomeric bases that were not masked (due to sequencing errors). Analysis of Assembly Redundancy and Estimation of Macronuclear Genome Size We desired an estimate for the total haploid Oxytricha trifallax macronuclear genome size since it is unknown. To obtain a reasonable estimate, we needed to determine the extent of redundancy in our genome assembly. As judged by visual inspection of our original assemblies, the main sources of redundancy are (i) the two alleles from the partially diallelic genome (see “Extensive Genome Homozygosity and High SNP Heterozygosity”), (ii) alternative nanochromosome fragmentation (see “Extensive Alternative Nanochromosome Fragmentation”), (iii) erroneous base calling that may result from high copy number regions and relatively abundant sequencing errors, and (iv) paralogous genes. The assembly with the most redundancy—from ABySS (Figure S3)—has approximately half of its contigs with nonself matches that are identical or almost identical (matches that are ≥100 bp and ≥99% identical). Visual inspection of the ABySS assembly revealed that much of the redundancy arose from the combined effect of high copy number DNA and sequencing errors. Our assembly strategy eliminated most of the redundancy from erroneous base calling, because it collapsed regions that are nearly identical. A small quantity of additional redundancy may have also been introduced by the inclusion of non-reference (JRB510) allelic sequences from the Sanger/454 genome assemblies, though the strategy we used prefers the inclusion of reference allelic sequences (see Materials and Methods, “Princeton Illumina Assembly and TGI Sanger/454 Assembly Integration”). Though some redundancy remains in our final genome assembly, this is counteracted by ∼1/4 of the nanochromosomes that have had their alleles collapsed during the assembly (see “Extensive Genome Homozygosity and High SNP Heterozygosity”). Given the ∼42% estimate of nanochromosomal homozygosity, we estimate that the haploid number of nanochromosomes is ∼15,600 [from 15,993+1,279 two- and multitelomere contigs and 5,303 single-telomere contigs (Table 1); we also estimate that ∼10% of nanochromosomes are alternatively fragmented (see “Extensive Alternative Nanochromosome Fragmentation”)]. With a mean nanochromosomal length of ∼3.2 kb, we estimate that the haploid macronuclear genome size is 50 Mb, which is similar to earlier experimental estimates [52],[53]. The Macronuclear Genome Encodes All the Genes Necessary for Vegetative Growth Traditional assessments of genome completeness are not very meaningful in Oxytricha because they usually measure genomes of uniform coverage with relatively long chromosomes. In two key ways, the Oxytricha macronuclear genome assembly is more similar to a de novo transcript assembly than to a conventional genome assembly: it contains multiple nanochromosome isoforms produced by alternative nanochromosome fragmentation (see “Extensive Alternative Nanochromosome Fragmentation”), and it is an assembly of nonuniformly amplified DNA (see “Nanochromosome Copy Number Is Nonuniform”). Unlike RNA transcripts, nanochromosome levels remain relatively stable during asexual growth [22], and variation of nanochromosome copy number is considerably lower than that of transcripts, so we are able to completely sample the genome's DNA over time. Simple genome metrics indicate that our assembly is largely complete. Firstly, we have sequenced the genome to a substantial depth: we have >62× haploid coverage of the genome assembly by Illumina 100 bp PE reads and >48× haploid coverage by SE reads. Secondly, nearly all high-quality reads map to our final assembly (98% of high quality PE reads) and the majority of contigs (71.3%) represent complete nanochromosomes, with only 5.1% of the contigs missing both telomeres and 23.6% missing one telomere (Table 1). Finally, our 50 Mb haploid genome assembly size estimate is similar to an earlier estimate of ∼55 Mb for the DNA complexity of the Oxytricha macronucleus [2]. To assess genome completeness we analyzed the completeness of two specific, functionally related gene sets—encoding ribosomal proteins and tRNAs—and one general gene data set in Oxytricha. All of these measures of completeness indicate that the macronuclear genome assembly is essentially complete. Firstly, the final genome assembly contains all 80 of the standard eukaryotic ribosomal proteins (32 small subunit and 48 large subunit proteins). Secondly, the Oxytricha macronuclear genome has a haploid complement of ∼59 unique tRNA nanochromosomes (including a selenocysteine tRNA on Contig21859.0). These tRNAs are sufficient to translate all of its codons if wobble position anticodon rules [54] are accounted for. As judged by searches of tRNAdb [55], codons without cognate tRNAs in Oxytricha are either absent or rare in other eukaryotes. Furthermore, with the exception of a Tetrahymena glycine tRNA that has a CCC anticodon [56], Oxytricha's tRNAs share the same anticodons as Tetrahymena. We also assessed the completeness of the macronuclear genome by searches of predicted proteins against 248 “core eukaryotic genes” (CEGs: defined by KOGS [57] based on the complete protein catalogs of H. sapiens, D. melanogaster, C. elegans, A. thaliana, S. cerevisiae, and S. pombe [58]). Using a strategy similar to that used to assess completeness during analyses of ncRNAs in Oxytricha [59], and based on the CEGMA analysis strategy [58], we searched for all the core proteins using a match coverage of ≥70% of the mean CEG sequence length, accumulated over the span of the query CEG sequence matches from BLASTP (BLAST+ [60]; with at least one of the matches for each query having an E-value≤1e-10). Of our predicted proteins 231 had substantial sequence similarity to the core eukaryotic protein sequences. The numbers of core eukaryotic proteins we found in the latest Tetrahymena (223) and Paramecium (230) gene predictions were similar to those found in Oxytricha (within the limits of the sensitivity of the searches we used and possible gene prediction failures). However, since CEGS are defined for just five genomes of animals, fungi, and one plant and exclude a diversity of other eukaryotes, the true set of CEGs may be somewhat smaller than this. Given the great evolutionary divergences of ciliates from these eukaryotes (possibly in excess of 1.5 billion years ago [61]), it is also possible that the BLAST criteria employed by CEGMA are not sufficiently sensitive to detect more distant ciliate homologs. This suggests that the predicted proteomes of all three ciliates are largely complete. Given that the deep divergences of ciliates might prevent detection of their homologs to the remaining 17 CEGS without matches, we attempted more sensitive searches at the domain level using HMMER3 [62]. We assigned Pfam domains to each KOG if the domains were best Pfam hits to the majority of the members of each KOG. Using domain searches, 13 of the remaining 17 core proteins had matches in Pfam-A (with domain, full-sequence E-values 62× coverage of the PE reads. Mindful of these limitations, within the sequenced JRB310 clonal population of cells, nanochromosome copy number does not appear to vary as much as gene transcription. The most highly amplified nanochromosome, encoding the 18S, 5.8S, and 28S rRNA (Contig451.1), has a copy number that is ∼56× the mean of nonalternatively fragmented nanochromosomes, yet its transcripts typically yield more than 90% of the RNA in our non-poly(A)-selected RNA-seq samples. There is a roughly 2-fold difference between the most highly amplified nanochromosome and the next most highly amplified nanochromosome, encoding the 5S rRNA (Contig14476.0/Contig17968.0; quasi-allelic contigs). 10.1371/journal.pbio.1001473.g007 Figure 7 Nanochromosome copy number variation. (A) Relative nanochromosome copy number distribution (number of telomere-less reads/bp of nonsubtelomeric nanochromosome; see Materials and Methods) for homozygous matchless, heterozygous matchless, and heterozygous matching nanochromosomes. The mean copy number of the combined homozygous matchless and heterozygous matchless nanochromosomes is indicated by a dashed line at 0.94, with dotted lines corresponding to an interval of ∼1.3σ (∼0.12 to 1.76) either side of the mean, which includes ∼90% of all nanochromosomes. (B) Relative nanochromosome copy number of nonalternatively fragmented (with a single, directional fragmentation site per nanochromosome) versus alternatively fragmented nanochromosomes measured by the number of telomeric reads per nanochromosome. (C) Relative nanochromosome copy number of nonalternatively fragmented chromosomes versus nonalternatively fragmented chromosomes encoding ribosomal proteins and tRNAs. Since the method we used to estimate nanochromosome copy number combines reads from both possible alleles for heterozygous nanochromosomes, it is necessary to map the reads sensitively to avoid exclusion of reads and to minimize incorrect mapping in order to obtain accurate estimates (see “Genome Homozygosity and SNP Heterozygosity”). Our mapping procedure seems to be appropriate for matchless nanochromosomes, since there is no substantial difference in copy number distributions for homozygous and heterozygous matchless nanochromosomes (mean copy number of 0.93, SD = 0.61, and 0.97, SD = 0.67, respectively; Figure 7A). However, for heterozygous nanochromosomes with matches, the mean nanochromosome copy number is lower (0.81; SD = 0.59; Figure 7A) than for matchless nanochromosomes. This is likely because some of the nanochromosomes with matches exhibit higher heterozygosity regions than matchless heterozygous nanochromosomes (>6% mean SNP heterozygosity; see “Genome Homozygosity and SNP Heterozygosity”) and the mapping criteria (≥94% read identity to the mapped contig) eliminated some of the more heterozygous reads. To assess nanochromosome copy number of alternatively fragmented versus nonalternatively fragmented nanochromosomes, we examined the relationship between the number of telomeric reads and the number of nontelomeric reads per bp of the nanochromosomes (see Materials and Methods). We found that there was a good correlation between telomeric reads from either end of the nonalternatively fragmented nanochromosomes (Figure S6B) with r = 0.90. However, there are examples where the number of reads from each nanochromosome end differs substantially (e.g., Contig22209.0 and Contig5780.0 from Table S3). This may indicate the failure to extend the ends of some nanochromosomes completely or that the ends derive from the DNA of the nonreference strain (JRB510) and are relatively divergent with few reads from the reference DNA mapped to them (e.g., Contig5780.0). Alternatively, there may be experimental biases that skew the numbers of reads mapped to the two ends (e.g., Contig22209.0, which has JRB310 telomeric reads mapped to the nanochromosome end with fewer reads but no reads extending further, even with relaxed read mapping parameters). The correlation between the number of reads per bp and the number of telomeric reads per nanochromosome is also strong (r = 0.89; Figure S6A), indicating that assessment of telomeric reads alone is appropriate for large-scale analyses of nanochromosome copy number. Furthermore, our estimates of relative nanochromosome copy number, either via reads per bp or the number of telomeric reads per contig, are in good agreement with those obtained by qPCR (Table S3; Figure S7). For relative nanochromosome copy number measured by telomeric reads, the mean number of telomeric reads per alternatively fragmented nanochromosome with a single (directional) alternative fragmentation site (i.e., only two nanochromosome isoforms) is 2.4 times (885 reads, SD = 768 reads) that of nonalternatively fragmented nanochromosomes (363 reads; SD = 290 reads; K-S one-sided test D = 0.59 and p value nonalternatively fragmented nanochromosome copy number; Figure 7B). It follows that the DNA of the shorter alternative nanochromosome isoforms is even more highly amplified than that of nonalternatively fragmented nanochromosomes. The greater amplification of alternatively fragmented nanochromosomes relative to nonalternatively fragmented nanochromosomes supports a model of net overamplification of specific alternatively fragmented nanochromosomes isoforms rather than a model of net destruction. The higher amplification of alternatively fragmented nanochromosomes may indicate a commensal DNA relationship between two genes, arising when one of the genes benefits from the amplification signal of a more highly amplified nanochromosome isoform bearing another gene. This relationship requires no functional association between the genes on alternatively fragmented nanochromosomes, consistent with our general observations (e.g., no specific functional associations between nonribosomal genes and ribosomal genes on alternatively fragmented nanochromosomes). For nonalternatively fragmented nanochromosomes, the ribosomal protein-encoding nanochromosomes are ∼3.9× more highly amplified than nonribosomal protein nanochromosomes, and tRNA-encoding nanochromosomes are ∼3.6× more highly amplified than non-tRNA-encoding nanochromosomes (Figure 7C; for ribosomal versus nonribosomal nanochromosomes: K-S one-sided test D = 0.64 and p value nonribosomal nanochromosome copy number; for tRNA versus non-tRNA nanochromosomes: K-S one-sided test D = 0.62 and p value non-tRNA nanochromosome copy number). Similarly, the ribosomal protein- and tRNA-encoding nanochromosome isoforms arising from alternative fragmentation are typically overamplified relative to the isoforms that encode other genes (50/54 alternatively fragmented ribosomal nanochromosomes and 25/28 alternatively fragmented tRNA nanochromosomes; Figure S8). Given the modest variation in nanochromosome copy number, most notably the limited overamplification of nanochromosomes encoding highly expressed genes (rRNAs, tRNAs, and ribosomal proteins), even if a strong correlation exists between nanochromosome copy number and transcription levels, copy number may only be a modest contributor to the final RNA and protein expression levels. Regulation of expression at the transcriptional/posttranscriptional level may be essential to buffer the variation in DNA copy number that arises during extended periods of vegetative growth. Nanochromosome Length Variation Oxytricha nanochromosomes range in length from ∼500 bp to 66 kb, with a mean size of ∼3.2 kb (Figure 8). Few nanochromosomes were assembled at either extremity of the length distribution, with just 32 shorter than 600 bp long and 61 longer than 15 kb, consistent with observations of macronuclear DNA on electrophoretic gels [2],[70]. While the mean length of two-telomere nanochromosomes in the final Oxytricha macronuclear genome assembly is ∼3.2 kb (Table 1), the true average length of nanochromosomes is shorter than this because the longest isoform of alternatively fragmented nanochromosomes is the one that tends to be assembled. On electrophoretic gels, Oxytricha nanochromosomes are visibly longer than those of Euplotes [2],[71], which we propose is primarily a consequence of the lack of alternative fragmentation in Euplotes (inspection of mapped reads to our preliminary Euplotes crassus assembly indicated no signs of alternative fragmentation; unpublished data). The longest isoforms of alternatively fragmented nanochromosomes average 5.0 kb (SD = 2.4 kb), while nonalternatively fragmented nanochromosomes have a mean length of 3.0 kb (SD = 2.4 kb; Figure 8). The mean length of the shortest nanochromosome isoforms produced by alternative fragmentation is 2.4 kb (SD = 1.6 kb). For single-gene nonalternatively fragmented nanochromosomes, the mean nanochromosome length is 2.2 kb (SD = 1.0 kb). 10.1371/journal.pbio.1001473.g008 Figure 8 Length distributions of alternatively and nonalternatively fragmented nanochromosomes. The shortest nanochromosome isoforms produced from single (directional) alternative fragmentation sites are labeled as “Short isoform.” The histograms show normalized frequencies for 1,587 alternatively fragmented nanochromosomes and 15,219 nonalternatively fragmented nanochromosomes. Alternatively fragmented nanochromosomes have at least one strongly supported (≥10 Illumina reads) alternative fragmentation site >250 bp from either end of the nanochromosome (these nanochromosomes are >500 bp long). The shortest assembled nanochromosome (Contig20269.0) is a mere 248 bp, excluding the telomeric sequences. Though we were unable to identify any ORFs or any ncRNAs on this nanochromosome by RFAM searches, we found two matching RNA-seq PE reads, suggesting that there is expression from this nanochromosome. The shortest nanochromosome (Contig19982.0) with a known protein is 469 bp (excluding the telomeres) and encodes a 98 aa ThiS/MoaD family protein, while the shortest ncRNA-bearing nanochromosome we found is 540 bp (excluding telomeres) and encodes tRNA-Gln(CUG) (see Text S1, “Analysis of Short Protein and ncRNA-Encoding Nanochromosomes”). Searches for shorter possible nanochromosomes in the Illumina and Sanger reads did not reveal additional plausible nanochromosome candidates (Text S1, “Reads Containing Both Putative Telomeric Repeats Are Not Genuine Nanochromosomes”). The longest nanochromosomes (>15 kb) typically encode a single large structural protein (Table S4), such as dynein heavy chain proteins (e.g., Contig354.1). None of the 20 longest nanochromosomes are alternatively fragmented. Seven of these 20 nanochromosomes contain multiple predicted genes (up to a maximum of four); however, all but one of these gene predictions are oriented head-to-tail, consistent with the possibility that their predictions may have been incorrectly split. Hence most of the longest nanochromosomes are likely still single-gene nanochromosomes. One ∼20 kb nanochromosome (Contig289.1) does indeed contain multiple genes, since it encodes a Pkinase domain (PF00069) protein on the opposite strand to two predicted PAS_9 domain (PF14326) proteins (though these latter two proteins may also be incorrectly split). Six of the longest nanochromosomes encode single proteins with no detectable Pfam domains (Pfam-A 26; independent E-value 2,000 aa). The longest nanochromosome (Contig7580.0) is 66 kb (65,957 bp; excluding telomeres) and encodes a single giant protein (“Jotin,” after a Norse giant) with BLASTP best hits to Titin-like genes in the NCBI nrdb (see Text S1, “Characterization of the Jotin Protein”). We note that this single-gene nanochromosome is comparable in size to the entire, relatively large and gene-rich ∼70 kb Oxytricha mitochondrial genome [65], which was largely eliminated by the sucrose gradient isolation of macronuclei (see Materials and Methods). The Oxytricha Jotin ORF is 64,614 bp. AUGUSTUS predicts four short introns (117, 151, 77, and 63 bp), two of which are supported by and one of which conflicts with RNA-seq reads. This gene's entire coding sequence is well supported by pooled RNA-seq reads (covered from end to end). Gene Predictions The gene prediction software, AUGUSTUS, predicted complete genes on 15,387 of the complete nanochromosomes we surveyed (96%) and 91% of the final assembly's contigs. Examination of three developmental time points (0, 10, and 20 h after initiation of conjugation) confirms transcription of 97% of Oxytricha nanochromosomes (94% of all contigs). AUGUSTUS predicts genes on 94% of nanochromosomes with expression evidence. Most Oxytricha nanochromosomes (80%) contain single genes, consistent with earlier studies (Figure S9) [2],[33]. Alternatively fragmented nanochromosomes tend to encode more genes per nanochromosome: only 15% of alternatively fragmented nanochromosomes have single gene predictions, versus 90% of all nonalternatively fragmented nanochromosomes. Roughly half (48%) of multigene nanochromosomes have alternative fragmentation. All nanochromosomes with five or more (maximum eight) predicted genes are alternatively fragmented (Figure S9), and only two nonalternatively fragmented nanochromosomes encode four genes. The nanochromosome with the largest number of separate gene products (Contig8800.0; ∼6.8 kb) is alternatively fragmented, with a shorter ∼3.5 kb nanochromosome isoform that encodes 12 C/D snoRNAs [59] and a putative protein-coding gene encoded by the remainder of the full-length isoform. Key properties of Oxytricha's gene predictions are consistent with a pilot survey [33], including relative AT-richness (34% GC) with noncoding regions that are more AT-rich than coding regions (e.g., introns are 23.6% GC), and 1.6 introns per gene (Table 2). Oxytricha gene lengths (mean length 1,839 bp excluding UTRs) are similar to those predicted for Tetrahymena [56]. 10.1371/journal.pbio.1001473.t002 Table 2 Properties of gene predictions. Feature Number Mean (bp) Min (bp) Max (bp) %GC Genes 17,040 1,839 150 65,451 34.0 Exons 43,759 661 3 45,409 34.2 Introns 26,719 90 28 549 23.6 5′ upstream of start codon 11,439 266 35 5,398 26.1 3′ downstream of stop codon 11,439 210 10 3,859 26.7 Prediction features were obtained for complete nanochromosomes (14,388 in total) only. Gene lengths are from the start to stop codons and exclude UTRs. Up- and downstream regions were only determined for single-gene nanochromosomes and include the 5′ and 3′ UTRs. %GC estimates exclude telomeric bases. Possible Functional Differences in the Predicted Proteomes of Ciliates With and Without Genome Unscrambling Functional differences between the model ciliates may have evolved in numerous ways, given their tremendous divergence. Here we focus on two key differences: absence/presence of protein domains in specific ciliates and expansions of protein families at the level of protein domain. We were particularly interested in comparing the protein domains present in either ciliates with gene scrambling (Oxytricha) or that lack evidence of gene scrambling (Paramecium, Tetrahymena, and Euplotes; Figure 2; Tables S5 and S6; also see Table S7 for genes found in Paramecium and Tetrahymena that are absent in Oxytricha), since many species-specific proteins appear associated with macronuclear genome differentiation. Examples include the transposases in Oxytricha (micronuclear-limited TBE transposases) [30], Tetrahymena and Paramecium (piggyBac transposase—“PiggyMac”) [72],[73], and the Paramecium RNA binding Nowa proteins [74]. Since we were interested in DNA rearrangement, we searched for differences in the nucleic acid binding and nucleic acid metabolism domain content between the ciliates with and without evidence of extensive gene scrambling (see Materials and Methods). We identified 43 such nucleic-acid-related domains that are present in Oxytricha (“Oxytricha-specific” domains) but absent from both Tetrahymena and Paramecium (Table S5 and Table S6). Domesticated macronuclear transposases The most striking difference in the nucleic-acid-associated protein domains of Oxytricha and Paramecium/Tetrahymena is the number of Oxytricha-specific transposon-like or transposon-associated domains—27 out of 83 Pfam domain matches (Table S5 and Table S6). All of the proteins that possess these transposase domains, with the exception of the proteins with the DDE_Tnp_1 and DDE_Tnp_1_3 domains (whose genes are on telomere-lacking contigs, Contig6077.0 and Contig4212.0, and whose best GenBank BLASTP matches are all bacterial), have typical Oxytricha glutamine codon usage (including UAG and UAA codons), which precludes bacterial contamination. We discovered three types of Oxytricha-specific macronuclear-encoded transposase-like domains: Phage_integrase, DDE_Tnp_IS1595, and MULE. All of these domains belong to proteins encoded on complete nanochromosomes. Since the nanochromosomes encoding either DDE transposase domain (DDE_Tnp_IS1595 or MULE) lack terminal inverted repeats characteristic of transposons with these types of transposases, they appear to be domesticated versions, like the macronucleus-encoded PiggyMac transposases in Paramecium and Tetrahymena [72],[73]. The DDE_Tnp_IS1595 and MULE domains are the two most abundant nucleic-acid-related, Oxytricha-specific protein domains, present in 11 and 9 distinct proteins (Figure 9). 10.1371/journal.pbio.1001473.g009 Figure 9 Transposase-like domains of proteins found in Oxytricha but neither Paramecium nor Tetrahymena. Proteins are shown with black lines with a scale in amino acids indicated above the longest protein. Protein names are to the left of the protein diagrams. Domain coordinates are the Pfam domain envelope coordinates. Representative domains are given their Pfam names, with transposase-like domains shown in bold. Gene expression levels are log2[10,000×normalized RNA-seq counts (see Text S1; Supporting Materials and Methods) divided by CDS length (in bp)] before (“fed”) and during conjugation (0–60 h). The MULE domain name derives from the class of transposons known as Mutator-like transposable elements, which are widely distributed among angiosperms [75]. The domain is found in transposases that regulate the activity of the most mutagenic plant transposon, Mutator (reviewed in [75]). Domesticated MULE transposases are present in other eukaryotes, having most notably given rise to the FAR1 domain-containing FAR1 and FHY3 transcription factors involved in regulating light signaling in Arabidopsis [76],[77]. Within ciliates, the FAR1 domain is specific to Oxytricha and is encoded on proteins that both contain (Contig14154.0.g51) or lack (Contig18814.0.g95) MULE domains. The dinoflagellate Perkinsus marinus, a sister clade to the ciliates [133], contains 21 MULE domain-containing proteins, just one of which also contains a predicted SWIM domain (C5KSG6_9ALVE), like Oxytricha (Contig14154.0.g51), and no proteins with the FAR1 domain. The DDE catalytic motif is present in some of the Oxytricha MULE domains, at a similar spacing to that found in other MULE domains [78], suggesting that these proteins are functional. We also found two MULE matches (independent E-value 8 h) at 4°C, then centrifuged at 16,000 rcf for 30 min, and washed twice, for 10 min in 70% ethanol, before resuspension in the kit's elution buffer. Genomic shotgun libraries were prepared for different size fractions of Oxytricha macronuclear DNA using standard methods employed at The Genome Institute (TGI) for both Sanger and 454 sequencing (see Text S1, “Preparation of Nanochromosome DNA for Sanger/454 Sequencing” to “454 Sequencing of DNA from Nanochromosome Size Fractions”), while a special method was developed for the construction and 454 sequencing of paired telomeric ends (Text S1, “Whole Nanochromosome Telomere-Based Library Construction”). Both PE and SE Illumina libraries were prepared at Princeton University from pure JRB310 Macronuclear DNA using standard Illumina kits (Text S1, “llumina Genomic Library Construction and Sequencing”). Genome Assembly We developed a meta-assembly method (Figure S1) to build a reference genome assembly that is primarily derived from Illumina sequence data (Princeton Illumina assembly) but also takes advantage of an earlier Sanger/454 hybrid assembly (TGI 2.1.8 assembly). The data that were used for each of the assemblies are summarized in Table S8. TGI 2.1.8 Sanger/454 assembly In total we obtained ∼900 Mb of ABI3730 Sanger reads from whole genome shotgun (WGS) sequencing of Oxytricha macronuclear DNA (Tables S8, S9, S10). To coassemble Sanger and 454 data, the PCAP (2.x series assemblies) [106] and Newbler (6.0–9.0 assemblies) assemblers were used. Four hundred and fifty-four reads were added to the PCAP assemblies with the add454Reads.perl script from Consed [107],[108]. The Newbler assemblies coassembled Sanger and 454 reads. The PCAP assemblies were improved by Consed's autoedit utility and manually edited after visual inspection of the assemblies in Consed. The final TGI assembly (2.1.8) was produced by manually merging the 2.1.7 PCAP assembly and Newbler 9.0 contigs and scaffolds. We removed a small quantity of vector sequences not masked in the raw Sanger reads from the 2.1.8 assembly (∼5.4 Mb from 138 Mb) using cross_match (-minmatch 10 -minscore 15) [109] postassembly. Both the PCAP and Newbler assemblies are available at http://trifallax.princeton.edu/cms/databases/raw-data/genome/mac/assembly/WUGSC/ and additional notes about the production of these assemblies are at http://trifallax.princeton.edu/cms/raw-data/genome/mac/assembly/WUGSC/README.txt. Princeton Illumina assembly and TGI Sanger/454 assembly integration We selected high-quality reads ≥100 bp long for assembly, trimmed by TQS_fastq.py (from the SSAKE package [37]) with the parameters -t 20 (phred quality threshold) -c 40 (min number of bases passing quality threshold). Quality trimming produced 4.8 Gb of lllumina GAIIX SE reads and 6.2 Gb of Illumina HiSeq2000 PE reads (mean outer paired distance 362 bp, SD = 52 bp). Only PE reads were assembled with the PE-Assembler [36] version “pe_asm_hg18” (default parameters) and IDBA version 0.18 (default parameters and –mink = 60 –maxk = 90 –scaffold) [35], as neither of these assemblers was designed to use a mixture of PE and SE reads. Since the PE sequence coverage was poor over a ∼160 bp span, ∼100 bp from the end of the nanochromosomes, due to the size selection procedure used to create the Illumina PE library (e.g., Figure S8), we extended the PE-Assembler contigs with SE reads using SSAKE (v3.7) [37] in TASR mode [110]. Extension of the PE-assembler contigs yielded only a modest improvement, from 20 full-length nanochromosomes to 741, and from 999 single-telomere contigs to 3,933. We assembled both the SE and PE reads with ABySS version 1.2.7 (parameters: k = 50, n = 10; parameters were chosen to maximize the number of full-length nanochromosomes in the assembly) [34]. We split the scaffolds created by ABySS, where spans between the contigs were unresolved (i.e., filled with “N” characters; affecting approximately ∼10% of the nanochromosomes; typically with one unresolved span per scaffold), since the presence of these spans hinders subsequent meta-assembly. We produced a meta-assembly from the three assembly programs by assembling them with the CAP3 assembler with strict overlap parameters (-o 40 -p 99) (Figure S1) [39]. After the meta-assembly with CAP3, there was a large fraction of incompletely assembled nanochromosomes (Table S11) and so we developed a strategy of two successive end-extensions and re-assembly to increase the proportion of complete nanochromosomes (Figure S1, Table S11–S17). We attempted to extend every non-telomere-bearing contig end by finding 100 bp contig or read BLAST matches to the ends of the contigs from each of the original assemblies. In the following order of priority, we extended the contigs with three different data sources: (a) contigs from the CAP3 Illumina meta-assembly, (b) contigs from the 454/Sanger assembly (2.1.8 assembly), and (c) high-quality, vector filtered Sanger sequence reads. In the case of the 454/Sanger contigs and Sanger reads, we may have created hybrid contigs with JRB510 polymorphisms, rather than the desired JRB310 polymorphisms, but this is mitigated by the application of the majority rule during meta-assembly with CAP3 since three Illumina JRB310 assemblies were used versus one Sanger/454 JRB310/510 assembly. To integrate the contig extensions from the three different assemblies, the longest end-extension for each possible end was selected. The contigs used for extensions were selected based on the highest identity match (≥94% identity). We desired a range of match identities to accommodate both sequencing and assembly errors as well as allelic rate variation. This resulted in a fraction of “quasi-allelic” contigs when closely related alleles were merged. The end extension conditions we chose rely on the assumption that there are relatively few close paralogs in the Oxytricha macronuclear genome and that close paralogs are typically more divergent than the most divergent alleles (at a 400 bp from contig ends are hexagonally binned. The x-axis units are an estimate of relative nanochromosome copy number based on the total number of mapped Illumina reads, both telomeric and nontelomeric, per bp for each contig. On the y-axis, reads per fragmentation site were calculated for a 100 bp window centered on the site with the most reads supporting the alternative fragmentation site. (A) 454-telomeric end reads (2,792 contigs; 3,634 sites); (B) Illumina telomeric reads (3,331 contigs; 4,392 sites). (TIFF) Click here for additional data file. Table S1 Location of alternative fragmentation sites relative to inter- and intracoding sequence regions for two-gene nanochromosomes. Alternative fragmentation sites with decreasing numbers of supporting telomeric reads are shown in three successive columns. To exclude conventional TASs, only alternative fragmentation sites at least 100 bp away from either end of the contig were selected. Nanochromosomes with single alternative fragmentation sites were selected. AUGUSTUS gene predictions were used to determine inter-/intra-CDS regions. Similar trends were found for 454 telomeric reads. %GC was determined for a 50 bp window either side of alternative fragmentation sites. (RTF) Click here for additional data file. Table S2 Location of alternative fragmentation sites relative to coding and noncoding sequence regions for single-gene nanochromosomes. Alternative fragmentation sites with decreasing numbers of supporting telomeric reads are shown in three successive columns. To exclude conventional TASs, only alternative fragmentation sites at least 100 bp away from either end of the contig were selected. Nanochromosomes with single alternative fragmentation sites were selected. CDS/non-CDS regions were determined from the AUGUSTUS gene predictions. Similar trends were observed for 454 telomeric reads (not shown). %GC was determined for a 50 bp window either side of alternative fragmentation sites. (RTF) Click here for additional data file. Table S3 Estimates of nanochromosome copy number. Only the rRNA nanochromosome in this table is alternatively fragmented (with a site at 634 bp supported by 11 reads and two sites at 1,253 bp and 6,077 bp supported by a single read). 5′- and 3′-telomeric reads refer to reads that are mapped either to the 5′ or 3′ end as it is oriented in the genome assembly. (RTF) Click here for additional data file. Table S4 Large predicted proteins. The 20 longest nanochromosomes, excluding cases that appear to be redundant (i.e., quasi-alleles), are shown. None of these nanochromosomes is alternatively fragmented. Nanochromosome lengths include telomeres. Protein domain names are abbreviations from Pfam-A (version 26). Semicolons separate predicted protein lengths and protein domain architectures. (RTF) Click here for additional data file. Table S5 Oxytricha nucleic-acid-associated protein domains not found in Paramecium and Tetrahymena. aJudging from multiple sequence alignments, domain appears to exist in Paramecium (GSPATP00020413001) and Tetrahymena (TTHERM_00721450) but was not detected by hmmscan (HMMER3) bindependent E-value greater than the threshold (0.001), but domain exists (e.g., protein TTHERM_01211800 in Tetrahymena). cProtein encoded on contigs, with no telomeric repeats, that are likely bacterial contaminants dPredicted protein for an incompletely assembled part of a previously characterized mitochondrial plasmid encoding a viral/organellar type DNA polymerase [69]. eIndependent E-value greater than the threshold (0.001) used, but domain exists in Paramecium [124] and Tetrahymena. fIndependent E-value greater than the threshold (0.001) used, but sequence alignments of proteins containing N-terminal domain (TFIIA_gamma_N) to homologs from Oxytricha suggest that hmmscan failed to detect this domain in Tetrahymena and Paramecium. (RTF) Click here for additional data file. Table S6 Oxytricha putative nucleic-acid-associated protein domains (not annotated in pfam2go) not found in Paramecium and Tetrahymena. Domains in this table are considered to have putative nucleic-acid-related functions based on Pfam descriptions and literature cited for these domains. aProteins encoded on contigs with no telomeric repeats. bProtein is truncated due to an incorrect intron prediction; this protein is an allelic variant of the protein Contig14154.0.g51, which is the correctly predicted allelic variant. cProtein encoded on contigs with no telomeric repeats that are likely bacterial contaminants. (RTF) Click here for additional data file. Table S7 Nucleic-acid-associated protein domains found in both Paramecium and Tetrahymena but not Oxytricha. Domains marked with * are present in translated ORFs, but were not originally detected as AUGUSTUS failed to predict them. Protein IDs are given for Tetrahymena. (RTF) Click here for additional data file. Table S8 Data sources for genome assemblies. Data for the genome assemblies incorporated in the final meta-assembly may be downloaded from http://dx.doi.org/10.5061/dryad.d1013 [132]. (RTF) Click here for additional data file. Table S9 Genomic and RNA libraries Sanger sequenced on ABI3730 sequencers. (RTF) Click here for additional data file. Table S10 454 genomic DNA libraries. Short read archive data can be downloaded from http://www.ncbi.nlm.nih.gov/sra. (RTF) Click here for additional data file. Table S11 Meta-contig statistics after first CAP3 assembly before extension. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to greater than two ends on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are given in bp. (RTF) Click here for additional data file. Table S12 Meta-contig statistics after first extension. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to greater than two ends on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are given in bp. (RTF) Click here for additional data file. Table S13 Meta-contig statistics after CAP3 reassembly of extended contigs. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to greater than two ends on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are given in bp. (RTF) Click here for additional data file. Table S14 Meta-contig statistics after second extension. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to greater than two ends on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are given in bp. (RTF) Click here for additional data file. Table S15 Meta-contig statistics after CAP3 reassembly of second round of extended contigs. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to greater than two ends on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are given in bp. (RTF) Click here for additional data file. Table S16 Meta-contig statistics after chimera splitting and end trimming. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to greater than two ends on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are given in bp. (RTF) Click here for additional data file. Table S17 Meta-contig statistics for the final CAP3 assembly. “Single” refers to an SE being complete (≥1 5′ or 3′ telomeres). “Both” refers to one or more telomeres on both ends of the contig (≥1 5′ and ≥1 3′ ends). “Multiple” refers to greater than two ends on either end of the contig (≥2 5′ or ≥2 3′ ends). All lengths are given in bp. (RTF) Click here for additional data file. Table S18 Properties of intergenic regions. Prediction features were obtained for complete nanochromosomes (14,388 in total) only. Intergenic regions are between start and stop codons, including UTRs. Alternative fragmentation sites are those that are strongly supported by Illumina telomeric reads. %GC estimates exclude telomeric bases. Intergenic regions are subdivided according to whether they have a site of alternative fragmentation within the region or not. (RTF) Click here for additional data file. Table S19 Missing Moco biosynthesis enzymes in ciliates. (RTF) Click here for additional data file. Table S20 Small ribosomal proteins. Gene identifiers are given as contig identifiers with a gene suffix beginning with “g” followed by a number (which is arbitrary in this context). Only proteins ≤100 aa with domains found in Pfam 26.0 with an E-value 10 kb) fosmid library construction, p. 38. Whole nanochromosome telomere-based library construction, p. 39. Illumina genomic library construction and sequencing, p. 43. Read mapping rationale, p. 43. Determination of pN/pS values, p. 45. Genome assembly validation and redundancy analysis, p. 45. Classification of strongly and weakly supported alternative fragmentation sites, p. 47. Determination of sequences surrounding telomere addition sites, p. 48. RNA isolation, NuGEN cDNA synthesis and Illumina sequencing, p. 48. RNA-seq mapping and read counting, p. 50. Gene prediction, p. 52. Length determination of “untranscribed” and untranslated regions, p. 55. Protein domain identification and GO term selection, p. 55. tRNA searches, p. 57. Euplotes crassus culturing, DNA isolation and preliminary macronuclear genome assembly, p. 57. Supporting References, p. 59. (RTF) Click here for additional data file.
                Bookmark

                Author and article information

                Contributors
                Role: Editor
                Journal
                Microbiol Resour Announc
                Microbiol Resour Announc
                ga
                mra
                MRA
                Microbiology Resource Announcements
                American Society for Microbiology (1752 N St., N.W., Washington, DC )
                2576-098X
                12 July 2018
                July 2018
                : 7
                : 1
                : e00826-18
                Affiliations
                [a ]Department of Animal Sciences, The Ohio State University, Columbus, Ohio, USA
                [b ]Molecular and Cellular Imaging Center, Ohio Agricultural Research and Development Center, The Ohio State University, Wooster, Ohio, USA
                [c ]Department of Plant Pathology, The Ohio State University, Wooster, Ohio, USA
                University of California, Riverside
                Author notes
                Address correspondence to Z. Yu, yu.226@ 123456osu.edu .

                Citation Park T, Wijeratne S, Meulia T, Firkins J, Yu Z. 2018. Draft macronuclear genome sequence of the ruminal ciliate Entodinium caudatum. Microbiol Resour Announc 7:e00826-18. https://doi.org/10.1128/MRA.00826-18.

                Author information
                https://orcid.org/0000-0002-4480-4524
                https://orcid.org/0000-0002-6165-8522
                Article
                MRA00826-18
                10.1128/MRA.00826-18
                6211341
                f7404069-da52-47dd-a386-18c1de74efbd
                Copyright © 2018 Park et al.

                This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.

                History
                : 5 June 2018
                : 11 June 2018
                Page count
                Figures: 0, Tables: 0, Equations: 0, References: 21, Pages: 2, Words: 1443
                Funding
                Funded by: U.S. Department of Agriculture (USDA), https://doi.org/10.13039/100000199;
                Award ID: 2012-67015-19437
                Award Recipient : Award Recipient :
                Categories
                Genome Sequences
                Custom metadata
                July 2018

                Comments

                Comment on this article