
      MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus

      Bioinformatics
      Oxford University Press


          Abstract

          Motivation: We have implemented a coalescent simulation program for a structured population with selection at a single diploid locus. The program includes the functionality of the simulator ms to model population structure and demography, but adds a model for deme- and time-dependent selection using forward simulations. The program can be used, e.g. to study hard and soft selective sweeps in structured populations or the genetic footprint of local adaptation. The implementation is designed to be easily extendable and widely deployable. The interface and output format are compatible with ms. Performance is comparable even with selection included.

          Availability: The program is freely available from http://www.mabs.at/ewing/msms/ along with manuals and examples. The source is freely available under a GPL type license.

          Contact: gregory.ewing@univie.ac.at

          Supplementary information: Supplementary data are available at Bioinformatics online.
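Because the abstract states that the output format is compatible with ms, downstream scripts written for ms output should in principle work on msms output as well. The following is a minimal sketch of such a reader, assuming the plain ms text layout (a `//` line per replicate, then `segsites:`, `positions:` and one 0/1 string per sampled chromosome). The names `Replicate`, `parse_ms_output` and `site_frequencies` are hypothetical, and the sample text is hand-written for illustration rather than real program output. Lines the sketch does not recognise are skipped.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Replicate:
    """One simulated sample: site positions and 0/1 haplotype strings."""
    positions: List[float] = field(default_factory=list)
    haplotypes: List[str] = field(default_factory=list)


def parse_ms_output(text: str) -> List[Replicate]:
    """Parse ms-style text into a list of replicates (illustrative only)."""
    replicates: List[Replicate] = []
    current: Optional[Replicate] = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("//"):
            # A '//' line opens a new replicate.
            current = Replicate()
            replicates.append(current)
        elif current is None or not line:
            # Skip the header before the first '//' and any blank lines.
            continue
        elif line.startswith("segsites:"):
            # The count is implied by the positions line, so it is not stored.
            continue
        elif line.startswith("positions:"):
            current.positions = [float(x) for x in line.split()[1:]]
        elif set(line) <= {"0", "1"}:
            # 0 = ancestral allele, 1 = derived allele at each site.
            current.haplotypes.append(line)
    return replicates


def site_frequencies(rep: Replicate) -> List[int]:
    """Count derived alleles ('1') at each segregating site."""
    return [
        sum(h[i] == "1" for h in rep.haplotypes)
        for i in range(len(rep.positions))
    ]


# Hand-written sample in ms layout (assumption: not real simulator output).
EXAMPLE = """\
ms 4 2 -t 5.0
27011 55392 8453

//
segsites: 3
positions: 0.1210 0.4573 0.8801
010
110
001
010

//
segsites: 0
"""

if __name__ == "__main__":
    for rep in parse_ms_output(EXAMPLE):
        print(len(rep.haplotypes), rep.positions, site_frequencies(rep))
```

The per-site counts returned by `site_frequencies` are the raw material for summaries such as the site frequency spectrum. This sketch handles only the 0/1 haplotype alphabet and ignores any extra lines a simulator may print.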


                Author and article information

                Journal
                Bioinformatics
                Oxford University Press
                ISSN: 1367-4803, 1367-4811
                15 August 2010
                30 June 2010
                Volume: 26
                Issue: 16
                Pages: 2064-2065
                Affiliations
                1 Department of Mathematics, University of Vienna, Nordbergstrasse 15, A-1090 Vienna, Austria and 2 Max F. Perutz Laboratories, Dr. Bohrgasse 9, A-1030 Vienna, Austria
                Author notes
                * To whom correspondence should be addressed.

                Associate Editor: Jeffrey Barrett

                Article
                Publisher ID: btq322
                DOI: 10.1093/bioinformatics/btq322
                PMCID: PMC2916717
                PMID: 20591904
                bb0ff30d-ecf0-46c6-b526-b9baf9dcd29d
                © The Author(s) 2010. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 9 April 2010
                Revised: 4 June 2010
                Accepted: 10 June 2010
                Categories
                Applications Note
                Genetics and Population Analysis

                Bioinformatics & Computational biology
