Abstract
Molecular sequences obtained at different sampling times from populations of rapidly evolving pathogens and from ancient subfossil and fossil sources are increasingly available with modern sequencing technology. Here, we present a Bayesian statistical inference approach to the joint estimation of mutation rate and population size that incorporates the uncertainty in the genealogy of such temporally spaced sequences by using Markov chain Monte Carlo (MCMC) integration. The Kingman coalescent model is used to describe the time structure of the ancestral tree. We recover information about the unknown true ancestral coalescent tree, population size, and the overall mutation rate from temporally spaced data, that is, from nucleotide sequences gathered at different times, from different individuals, in an evolving haploid population. We briefly discuss the methodological implications and show what can be inferred, in various practically relevant states of prior knowledge. We develop extensions for exponentially growing population size and joint estimation of substitution model parameters. We illustrate some of the important features of this approach on a genealogy of HIV-1 envelope (env) partial sequences.
ONE of the most significant developments in population genetics modeling in recent times was the introduction of coalescent or genealogical methods (Kingman 1982a,b). The coalescent is a stochastic process that provides good approximations to the distribution of ancestral histories that arise from classical forward-time models such as the Fisher-Wright (Fisher 1930; Wright 1931) and Moran population models. The explicit use of genealogies to estimate population parameters allows the nonindependence of sampled sequences to be accounted for. (“Genealogy” and “tree” are used interchangeably throughout. In both cases we are referring to a collection of edges, nodes, and node times that together completely specify a rooted history.) Many coalescent-based estimation methods focus on a single genealogy (Fu 1994; Nee et al. 1995; Pybus et al. 2000) that is typically obtained using standard phylogenetic methods. However, there is often considerable uncertainty in the reconstructed genealogy. To allow for this uncertainty it is necessary to compute the average likelihood of the population parameters of interest. The calculation involves integrating over genealogies distributed according to the coalescent (Griffiths and Tavare 1994; Kuhner et al. 1995). We can carry out this integration for some models of interest, using Monte Carlo methods. Importance-sampling algorithms have been developed to estimate the population parameter Θ = 2N_{e}μ (Griffiths and Tavare 1994; Stephens and Donnelly 2000), migration rates (Bahlo and Griffiths 2000), and recombination (Griffiths and Marjoram 1996; Fearnhead and Donnelly 2001). Metropolis-Hastings Markov chain Monte Carlo (MCMC; Metropolis et al. 1953; Hastings 1970) has been used to obtain sample-based estimates of Θ (Kuhner et al. 1995), exponential growth rate (Kuhner et al. 1998), migration rates (Beerli and Felsenstein 1999, 2001), and recombination (Kuhner et al. 2000).
In addition to developments in coalescent-based population genetic inference, sequence data sampled at different times are now available from both rapidly evolving viruses, such as human immunodeficiency virus (HIV; Holmes et al. 1992; Wolinsky et al. 1996; Rodrigo et al. 1999; Shankarappa et al. 1999), and from ancient DNA sources (Hanni et al. 1994; Leonard et al. 2000; Loreille et al. 2001; Barnes et al. 2002; Lambert et al. 2002). Such temporally spaced data provide the potential to observe the accumulation of mutations over time and thus estimate mutation rate (Drummond and Rodrigo 2000; Rambaut 2000). In fact, it is even possible to estimate variation in the mutation rate over time (Drummond et al. 2001). This leads naturally to the more general problem of simultaneous estimation of population parameters and mutation parameters from temporally spaced sequence data (Rodrigo and Felsenstein 1999; Rodrigo et al. 1999; Drummond and Rodrigo 2000; Drummond et al. 2001).
In this article we estimate population and mutation parameters, dates of divergence, and tree topology from temporally spaced sequence data, using sample-based Bayesian inference. The important novelties in the inference are the data type (i.e., temporally sampled sequences), the relatively large number of unknown model parameters, and the MCMC sampling procedures, not the Bayesian framework itself. The coalescent gives the expected frequency with which any particular genealogy arises under the Fisher-Wright population model. The coalescent may then be treated either as part of the observation process defining the likelihood of population parameters or as the prior distribution for the unknown true genealogy. In either case we must integrate the likelihood over the state space of the coalescent. Bayesian and purely likelihood-based population genetic inference use the same reasoning and software up to the point where prior distributions are given for the parameters of the coalescent and mutation processes.
Are there then any important difficulties or advantages in a Bayesian approach over a purely likelihood-based approach? The principal advantage is the possibility of quantifying the impact of prior information on parameter estimates and their uncertainties. The new difficulty is to represent different states of prior knowledge of the parameters of the coalescent and mutation processes as probability densities. However, such prior elicitation is often instructive. In the absence of prior information, researchers frequently choose to use noninformative/improper priors for the parameters of interest. Such an approach may be problematic and can result in improper posterior distributions. There exist a number of important cases in the literature in which knowledgeable authors inadvertently analyze a meaningless, improper posterior distribution. Why then do we choose to treat improper priors in this article? We are developing and testing inferential and sampling methods. These methods become more difficult to apply, and the sampling problem significantly harder, as the amount of information in the prior is reduced. We therefore treat the “worst case” prior that might naturally arise. Since this prior is improper, we are obliged to check that the posterior is proper. However, when confronted with a specific analysis, detailed biological knowledge should be encoded in the prior distributions wherever possible.
Although Bayesian reasoning has frequently been applied to phylogenetic inference (Yang and Rannala 1997; Thorne et al. 1998; Mau et al. 1999; Huelsenbeck et al. 2000), it has thus far been the exception in population genetic inference (Wilson and Balding 1998).
In this article, we begin with a description of the models we use. We then give the overall structure of the inferential framework, followed by an overview of how MCMC is carried out. We mention extensions of the basic inference that allow for (1) deterministically varying populations and (2) estimation of substitution parameters. Finally, we illustrate our methods with a group of studies of a sample of HIV-1 envelope (env) sequences and a second group of studies of synthetic sequence data.
Kingman coalescent with temporally offset leaves: In this section we define the coalescent density for the constant-sized Fisher-Wright population model. In extensions we give the corresponding density for the case of a population with deterministic exponential growth. It is assumed genealogies are realized by the Kingman coalescent process. Our time units in this article are “calendar units before the present” [e.g., days before present (BP)], where the present is the time of the most recent leaf and set to zero. Let ρ denote the number of calendar units per generation and θ = N_{e}ρ. The scale factor θ converts “coalescent time” to calendar time and is one of two key objects of our inference. Note that we do not estimate ρ and N_{e} separately, only their product.
Consider a rooted binary tree g with n leaf nodes and n − 1 ancestral nodes. For node i, let t_{i} denote the age of that node in calendar units. Node labels are numerically increasing with age so i > j implies t_{i} ≥ t_{j}. Let I denote the set of leaf node labels and let Y denote the set of ancestral node labels. There is one leaf node i ∈ I associated with each individual in the data. These individuals are selected, possibly at different times, from a large background population. An edge ⟨i, j⟩, i > j of g represents an ancestral lineage. Going back in time, an ancestral node i ∈ Y corresponds to a coalescence of two ancestral lineages. The root node, with label i = 2n − 1, represents the most recent common ancestor (MRCA) of all leaves. Let t_{I} be the times of the leaves and t_{Y} be the divergence times of the ancestral nodes. Let E_{g} denote the edge set of g, so that g = (E_{g}, t_{Y}) specifies a realization of the coalescent process. For given n and t_{I}, let Γ denote the class of all coalescent trees (E_{g}, t_{Y}) with n leaf nodes having fixed ages t_{I}. The ages t_{Y} are subject to the obvious parent-child age order constraint. The element of measure in Γ is dg = dt_{n+1} … dt_{2n−1} with counting measure over distinct topologies associated with the distinguishable leaves.
The probability density for a tree, f_{G}(g | θ), g ∈ Γ, is computed as follows. Let k_{i} denote the number of lineages present in the interval of time between the node i − 1 and the node i. The coalescent process generates g = (E_{g}, t_{Y}) with probability density
f_{G}(g | θ) = θ^{−(n−1)} exp(−∑_{i=2}^{2n−1} k_{i}(k_{i} − 1)(t_{i} − t_{i−1})/(2θ)). (1)
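As a concrete illustration of this density, the following sketch computes its logarithm for a genealogy summarized by its leaf ages and coalescence ages, under the model as described here (each coalescence contributes a factor 1/θ; an interval with k extant lineages contributes exp(−k(k − 1)Δt/(2θ))). The function and variable names are ours, not taken from any implementation discussed later.

```python
import math

def log_coalescent_density(leaf_times, coal_times, theta):
    """Log density of a serially sampled Kingman coalescent genealogy.

    leaf_times: ages t_I of the n leaves (calendar units before present).
    coal_times: ages t_Y of the n - 1 coalescent (ancestral) nodes.
    theta:      the scale factor N_e * rho converting coalescent to calendar time.

    Going back in time, a leaf adds a lineage and a coalescence removes one.
    """
    # Sort all node ages; +1 marks a leaf (lineage gained, looking backward),
    # -1 marks a coalescence (lineage lost).
    events = sorted([(t, +1) for t in leaf_times] + [(t, -1) for t in coal_times])
    log_f, k, t_prev = 0.0, 0, events[0][0]
    for t, step in events:
        # Survival factor for the interval with k extant lineages.
        log_f -= k * (k - 1) * (t - t_prev) / (2.0 * theta)
        if step == -1:
            log_f -= math.log(theta)  # factor 1/theta per coalescence
        k += step
        t_prev = t
    return log_f
```

For two contemporaneous leaves this reduces to the familiar exponential density (1/θ)e^{−t/θ} for the pairwise coalescence time t.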
Mutation: We use the standard finite-sites selection-neutral likelihood framework (Felsenstein 1981) with a general time-reversible (GTR) substitution model (Rodriguez et al. 1990). However, as we are considering genealogies in calendar units (or generations) as opposed to mutations, we take some space to develop notation.
Associated with each leaf node i ∈ I there is a nucleotide sequence D_{i} = (D_{i,1}, D_{i,2}, …, D_{i,s}, …, D_{i,L}) of some fixed length L, say. Nucleotide base characters D_{i,s}, i ∈ I, s = 1, 2, …, L take values in the set C = {A, C, G, T}. An additional gap character, ϕ, indicates missing data. Let D = (D_{1}, D_{2}, …, D_{n})^{T} denote the n × L matrix of sequences associated with the tree leaves, and let D_{A} denote the (n − 1) × L matrix of unknown sequences associated with the ancestral nodes. The data are D together with t_{I}, that is, the n sequences observed in the leaf individuals and the n ages at which those individual sequences were taken. Let 𝒟 = C^{(n−1)L} denote the set of all possible ancestral sequence matrices D_{A}. Consider a site s = 1, 2, …, L in the nucleotide sequence of an individual. The character at site s mutates in forward time according to a Poisson jump process with 4 × 4 rate matrix Q. Here, Q_{i,j} is the instantaneous rate for the transition from character i to character j, with the coding A = 1, C = 2, G = 3, T = 4. We assume mutations are independent between sites. Let π = (π_{A}, π_{C}, π_{G}, π_{T}) be a 1 × 4 vector of base frequencies, corresponding to the stationary distribution of the mutation process, πQ = (0, 0, 0, 0).
The matrix Q is parameterized in terms of a symmetric “relative rate” matrix R, with Q_{c,c′} = R_{c,c′}π_{c′} for c ≠ c′, diagonal entries chosen so that each row of Q sums to zero, and Q rescaled so that the expected number of mutations per site per unit time, −∑_{c}π_{c}Q_{c,c}, equals one.
The conversion factor μ is the second of the two principal objects of our inference. In addition to μ, the relative rates, R, may be estimated. We have found that wherever it is feasible to estimate the scale parameters μ and θ, our data are informative about the elements of R. We return to inference for relative rates in extensions.
We now write the likelihood for μ. Consider an edge ⟨i, j⟩ ∈ E_{g} of tree g. The individual associated with node j is a direct descendant of the individual associated with node i. However, the sequences D_{i} and D_{j} may differ if mutations have occurred in the interval. Let e^{Q} denote the 4 × 4 matrix exponential of Q. In the standard finite-sites selection-neutral likelihood framework Pr{D_{j,s} = c′ | D_{i,s} = c} = [e^{Qμ(t_{i}−t_{j})}]_{c,c′} for c, c′ ∈ C. The probability for any particular set of sequences D, D_{A} to be realized at the nodes of a given tree is
Pr{D, D_{A} | g, μ} = ∏_{s=1}^{L} π_{D_{2n−1,s}} ∏_{⟨i,j⟩∈E_{g}} [e^{Qμ(t_{i}−t_{j})}]_{D_{i,s},D_{j,s}},
and summing over all possible ancestral sequences D_{A} (efficiently, by the pruning algorithm of Felsenstein 1981) gives the likelihood
Pr{D | g, μ} = ∑_{D_{A}} Pr{D, D_{A} | g, μ}. (5)
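The following sketch illustrates this machinery: it builds a rate matrix Q from a symmetric R and base frequencies π, then evaluates the transition probabilities e^{Qμ(t_{i}−t_{j})}. The parameterization Q_{c,c′} = R_{c,c′}π_{c′} and the normalization to one expected mutation per site per unit time are our assumptions (consistent with the use of μ as a separate conversion factor); names are illustrative.

```python
import numpy as np

def gtr_rate_matrix(R, pi):
    """Build a GTR rate matrix Q (numpy arrays): off-diagonal Q[c, c'] =
    R[c, c'] * pi[c'], rows summing to zero, rescaled so that the expected
    number of mutations per site per unit time equals one."""
    Q = R * pi[None, :]
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    rate = -np.dot(pi, np.diag(Q))  # expected mutations per site per unit time
    return Q / rate

def transition_probs(Q, mu, dt, n_terms=60):
    """P = exp(Q * mu * dt), computed by scaling and squaring with a
    truncated Taylor series (adequate for a well-scaled 4 x 4 matrix)."""
    A = Q * (mu * dt)
    s = max(0, int(np.ceil(np.log2(max(float(np.abs(A).sum(axis=1).max()), 1e-12)))) + 1)
    A = A / 2.0**s
    P, term = np.eye(4), np.eye(4)
    for k in range(1, n_terms):
        term = term @ A / k   # next Taylor term A^k / k!
        P = P + term
    for _ in range(s):
        P = P @ P             # undo the scaling by repeated squaring
    return P
```

With R all ones and uniform π this reduces to the Jukes-Cantor model, for which the diagonal of P is 0.25 + 0.75·e^{−4μΔt/3}, a convenient check.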
Bayesian inference for scale parameters: We now consider Bayesian inference for scale parameters μ and θ. Both of these quantities take a real positive value. The joint posterior density, h_{MΘG}(μ, θ, g | D), for the scale parameters and genealogy, is given in terms of the likelihood and coalescent densities above and two additional densities, f_{M}(μ) and f_{Θ}(θ). These functions quantify prior information about the scale parameters. Let Z be an unknown normalizing constant. The posterior is then
h_{MΘG}(μ, θ, g | D) = Z^{−1} Pr{D | g, μ} f_{G}(g | θ) f_{M}(μ) f_{Θ}(θ). (6)
Consider now the densities f_{M}(μ) and f_{Θ}(θ). In any particular application these functions will be chosen to summarize available prior knowledge of scale parameters. It is common practice to avoid the problem of prior elicitation and attempt to construct a “noninformative” prior. This notion is poorly defined, since a prior may be noninformative with respect to some hypotheses, but informative with respect to others. Nevertheless, we illustrate sample-based Bayesian inference under a prior that contains little information. We do this for two reasons. First, we wish to give our sampling instruments a thorough workout. From this point of view an improper prior is the best choice. Second, when carrying out Bayesian inference, it is necessary to test the sensitivity of conclusions to changes in the state of prior knowledge. What conclusions would a person in a state close to ignorance reach from these data? The improper prior we consider represents ignorance of a rather natural kind. People using our methods will very likely want to consider this particular state of knowledge, along with others that are more representative of their own.
In our case μ and θ are both scale parameters (for time). The Jeffreys prior, f(z) ∝ 1/z, z > 0, invariant under scale transformations z → az, and the uniform prior on z > 0 are candidates for f_{M}(μ) and f_{Θ}(θ). If f_{M} ∝ 1/μ, f_{Θ} ∝ 1/θ, and f_{G}(g | θ) and Pr{D | g, μ} are as given in Equations 1 and 5, then it may be shown that the posterior density in Equation 6 is not finitely normalizable. We may nevertheless consider ratios of posterior densities. But that means the only feasible Bayesian inference, at least under the uniform, improper prior, is exactly frequentist inference. We cannot treat the parameters of interest as random variables. Suppose fixed upper limits μ ≤ μ* and
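The scale invariance of the Jeffreys prior mentioned here can be verified in one line; substituting z = aw for any fixed a > 0 gives

```latex
f_Z(z) \propto \frac{1}{z}, \qquad z = a w
\;\Longrightarrow\;
f_W(w) = f_Z(a w)\,\left|\frac{dz}{dw}\right| \propto \frac{1}{a w}\cdot a = \frac{1}{w},
```

so the density has the same 1/z form on every scale of measurement.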
MARKOV CHAIN MONTE CARLO FOR EVOLUTIONARY PARAMETERS
The posterior density h_{MΘG} is a complicated function defined on a space of high dimension (between 30 and 40 in the examples that follow). We summarize the information it contains by computing the expectations, over h_{MΘG}, of various statistics of interest. These expectations are estimated using samples distributed according to h_{MΘG}. We use MCMC to gather the samples we need. MCMC and importance sampling are part of a family of Monte Carlo methods that may be used individually or in concert to solve the difficult integration problems that arise in population genetic inference. Earlier work on this subject is cited in the Introduction. Figure 1 shows a cartoon of two proposal mechanisms used. See the appendix for details of the proposal mechanisms and MCMC integration performed.
As always in MCMC, it is not feasible to test for convergence to equilibrium. MCMC users are obliged to test for stationarity as a proxy. We make three basic tests. First, we check that results are independent of the starting state using 10 independent runs with very widely dispersed initializations. Second, we visually inspect output traces. These should contain no obvious trend. Third, we check that the MCMC output contains a large number of segments that are effectively independent of one another, independent, at least, in the distribution determined empirically by the MCMC output. Let ρ_{f}(k) give the autocorrelation at lag k for some function f of the MCMC output. Let γ_{f} denote the asymptotic standard deviation of some estimate of ρ_{f}(k), formed from the MCMC output. Large lag autocorrelations should fall off to zero and remain within O(γ_{f}) of zero, as discussed by Geyer (1992). Note that in the examples section, these standards are not uniformly applied. The first two analyses pass all three checks. The last two analyses pass the first test. Here we are displaying the limitations of our MCMC algorithm. However, we believe the convergence is adequate for the points we make. In the appendix, Convergence and standard errors describes the integrated autocorrelation time (IACT) and effective sample size (ESS) measures used to test the efficiency of our sampler.
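The autocorrelation-based part of these diagnostics can be sketched as follows. This is a simplified variant (truncating the autocorrelation sum at the first non-positive term, in the spirit of Geyer's initial-sequence estimators), not the exact estimator used here; names are illustrative.

```python
import random

def autocorr(x, k):
    """Sample autocorrelation of the list x at lag k (biased estimator,
    normalized by the lag-0 autocovariance)."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x) / n
    ck = sum((x[t] - m) * (x[t + k] - m) for t in range(n - k)) / n
    return ck / c0

def effective_sample_size(x):
    """ESS = N / IACT with IACT = 1 + 2 * sum_k rho(k), summing lags until
    the first non-positive autocorrelation."""
    n = len(x)
    iact = 1.0
    for k in range(1, n):
        r = autocorr(x, k)
        if r <= 0.0:
            break
        iact += 2.0 * r
    return n / iact
```

For an AR(1) chain with coefficient φ the true IACT is (1 + φ)/(1 − φ), so a chain with φ = 0.9 should report roughly a twentyfold reduction in effective sample size relative to independent draws.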
The MCMC algorithm we used was implemented twice, more or less independently: by A. Drummond in Java and by G. K. Nicholls in MATLAB. This allowed us to compare results and proved very useful in debugging some of the more complex proposal mechanism combinations. To minimize programming burden, one of our implementations (that of G. K. Nicholls, in MATLAB) was partial, so only runs with fixed population size and fixed R could be compared. This is discussed more extensively in Implementation issues in the appendix.
EXTENSIONS
Extending the framework of the Introduction and MCMC for evolutionary parameters to include deterministically varying models of population history and estimation of relative rate parameters is straightforward. Let Φ = (0, ∞)^{5} be the state space for the relative rates of R above the diagonal and excluding R_{G↔T}. Let s = (μ, θ, g, r, R), and let h_{S}(s | D) denote the posterior density for
EXAMPLES
In this section, we illustrate our methods on two HIV-1 env datasets and a series of synthetic datasets of comparable size.
HIV-1 env data: The method was first tested on HIV-1 partial envelope sequences obtained from a single patient over five sampling occasions spanning ~3 years: an initial sample (day 0) followed by additional samples after 214, 671, 699, and 1005 days. Details of this dataset have been published previously (Rodrigo et al. 1999). An important feature of these data is that monotherapy with Zidovudine was initiated on day 409 (Drummond et al. 2001) and continued during the remainder of the study. The total dataset consists of 60 sequences from these five time points. The length of the alignment is 660 nucleotides. Gapped columns were included in the analysis. The evidence for recombination seems to be negligible in this dataset (Rodrigo et al. 1999) and recombination is ignored for the purposes of illustrating our method. Rough estimates of N_{e} may be obtained by assuming a generation length of ρ = 1 day per generation (Rodrigo et al. 1999). However, we emphasize that we estimate N_{e}ρ only in this work. The dataset was split into two subsets for separate analysis. One contained all pretreatment sequences (28 sequences), and the other contained all sequences after treatment commenced (32 sequences; henceforth called posttreatment). The rationale behind this split is that both (1) population size and (2) mutation rate per unit time may be affected by a replication inhibitor such as Zidovudine. In all of the analyses, base frequencies were fixed to empirically determined values; however, inference of these would have been trivial. Two analyses are undertaken on each dataset. The pretreatment data are strongly informative for all parameters estimated. The results are robust to the choice of priors and MCMC convergence is quick. In contrast, the posttreatment data are only weakly informative for the μ, θ, and t_{root} parameters; the results are sensitive to the choice of prior; and MCMC convergence is very slow.
Pretreatment data, constant population size, HKY substitution: In this first analysis of the pretreatment dataset, we fit the HKY substitution model and assume a constant population size. We are estimating μ, θ, g, and κ. We illustrate our methods using uniform prior distributions on μ and θ, an upper limit on mutation rate of μ* = 1, a lower limit on N_{e}ρ of θ* = 1, and a very conservative upper limit on t_{root} of t* = 10^{7} days. Ten MCMC runs were made, with starting values for mutation rate distributed on a log scale from 5 × 10^{−3} down to 10^{−7} mutations/site/day. This range greatly exceeds the range of values supported by the posterior. To test MCMC convergence on tree topologies, each of the 10 MCMC runs was started on a random tree drawn from a coalescent distribution with population size equal to 1000 (in exploratory work we initialize on a sUPGMA or neighbor-joining topology). The 10 Markov chain simulations were run for 2,000,000 steps and the first 100,000 steps were discarded as burn-in. Each run took ~4 hr on a machine with a 700 MHz Pentium III processor. The mean IACT of the mutation rate parameter was 4190, giving an ESS of ~450 per simulation. Table 1 presents parameter estimates for all 10 runs, illustrating close concordance between runs. Note also that the variability, between runs, of estimated means is in line with standard errors estimated within runs. This is a consistency check on our estimation of the IACT. Figures 2 and 3 show the marginal posterior density of μ and θ for each of the 10 runs. In all 10 runs the consensus tree computed from the MCMC output was the same, despite the fact that the starting trees were drawn randomly (data not shown). Combining the output of all 10 runs, the 95% highest posterior density (HPD) intervals for the mutation rate and t_{root} are, respectively, [4.20, 8.28] × 10^{−5} mutations per site per day, and [580, 1040] days.
Pretreatment data, exponential growth, general substitution model: In this second analysis of the pretreatment dataset, we fit the general time-reversible substitution model, with exponential growth of population size. We are estimating μ, θ, g, r, R_{A↔C}, R_{A↔G}, R_{A↔T}, R_{C↔G}, and R_{C↔T}. This is the most parameter-rich model we fit. To assess the convergence characteristics of this analysis we ran 10 independent runs of 3,000,000 cycles, each starting with an independent random tree topology (the mean IACT for μ was 7955, giving an ESS of 358 per run). Figure 4 shows the 10 estimates of the marginal posterior density of mutation rate. Table 2 shows parameter estimates for each of the 10 runs. Convergence is still achieved with the extra parameters.
Compare the distribution of summary statistics under the two models described here and in Pretreatment data, constant population size, HKY substitution. Given the nature of infection of HIV-1, it seems likely that an exponential growth rate assumption is more accurate. Estimated 95% HPD intervals for the growth rate r, [1.09 × 10^{−3}, 6.65 × 10^{−3}], exclude small growth rates, corroborating this view. The 95% HPD intervals for the mutation rate and t_{root} are, respectively, [3.61, 8.11] × 10^{−5} mutations per site per day and [570, 1090] days. Compare these with the model in Pretreatment data, constant population size, HKY substitution. The change in model has minimal effect (<10%) on the posterior mean mutation rate.
Posttreatment: The posttreatment data are analyzed twice under the HKY substitution model with constant population size. The first analysis uses the same priors as the first pretreatment analysis. In contrast to the pretreatment dataset, the mutation rate of the posttreatment dataset is difficult to estimate. This is illustrated in Figures 5 and 6, in which the marginal posterior densities of μ and θ estimated from 10 independent MCMC runs, each 5,000,000 cycles long, are compared. We were unable to compute an IACT for each run, so we are unable to compare within- and between-run variability. However, the between-run concordance visible in Figure 5 justifies the following statement. The posttreatment mutation rate shows one mode at ~2.8 × 10^{−5} mutations/site/day with a second mode on the lower boundary. The data determine a diffuse, and bimodal, marginal posterior on μ. One of the modes is associated with states (μ, θ, g) with physically unrealistic root times (greater than the age of the patient). These are allowed, if we are not prepared to assert some restriction on t_{root}. This behavior also occurs when we use a Jeffreys prior on the mutation rate (data not shown). It reflects a real property of the data, namely that states of low μ and large t_{root} are not well distinguished from otherwise identical states of larger μ and smaller t_{root}.
In the second posttreatment analysis, we revise the upper limit on t_{root} downwards, from 10^{7} to t* = 3650, a value more representative of actual prior knowledge for this dataset. The new limit, set 3 years before seroconversion occurred in the infected patient, is still conservative. Here we explored the prior belief that HIV infection most often originates from a small, homogeneous population and then subsequently accumulates variation. This prior effectively assumes that all viruses in an infected individual share a common ancestor at most as old as the time of infection of the host. The estimated 95% HPD interval for the mutation rate was [1.16, 4.27] × 10^{−5} mutations/site/day, markedly down from the pretreatment mutation rate. Figure 7 depicts the resulting unimodal marginal posterior density for mutation rate, showing that the spurious mode has been eliminated. Again, no IACT was computed. However, between-run variability was much improved over Figures 5 and 6. Information about t_{root} has been converted into information about mutation rates and population size.
Simulated sequence data: To test the ability of our inference procedure to recover accurate estimates of parameters from the above HIV-1 dataset we undertook four simulation studies. In each experiment we generated 100 synthetic datasets. For experiment 1, the posterior estimates of θ, μ, and κ obtained from the pretreatment dataset in Pretreatment data, constant population size, HKY substitution were used to generate 100 coalescent trees and then simulate sequences on each of the resulting trees. The synthetic data were generated under a constant-size population model with the HKY mutation model but analyzed under an exponentially growing population model and a GTR mutation model. In the second experiment, 100 synthetic datasets were generated using the pretreatment parameter estimates in Pretreatment data, exponential growth, general substitution model as the true values. In this case the models for simulation and inference are matched. Synthetic data were generated under an exponentially growing population model and a GTR mutation model. In both experiments 1 and 2 uniform bounded priors were used for all parameters. Experiments 3 and 4 differed from experiments 1 and 2 only in that we used Jeffreys' (1946) prior for scale parameters (mutation rate, population size, and relative rates).
All datasets had the same number of sequences (28), the same sampling times (0 and 214 days), and the same sequence length (660) as the pretreatment dataset. Table 3 shows that the true values are successfully recovered (i.e., fall within the 95% HPD interval) ≥90% of the time in all cases except for the relative rate parameters in experiment 1. Even in the most complex model we fit, we recover true parameter values. The overparameterization present in experiments 1 and 3 does not seem problematic for estimating mutation rate, θ, or growth rate. These results suggest that inference of biologically realistic growth rates is quite feasible. Of the parameters of interest, the relative rates performed most poorly. This is predominantly because the uniform prior on relative rates introduces metric factors that inflate the densities. In experiment 1, when the true value of a relative rate parameter was not within the 95% HPD interval (which occurred 75 times out of 500), it was almost always overestimated (74 out of 75 times). Furthermore, conditioning on a transversion (R_{G↔T} = 1), a rare event, may also have an impact. However, experiments 3 and 4 demonstrate that the use of a Jeffreys prior for these and other scale parameters results in >90% recovery of all parameters. We are not aiming to prescribe any particular noninformative prior. Our choice of uniform prior in earlier experiments is deliberately crude. However, it allows us to lay out the methodology with as little emphasis as possible on prior elicitation. The reader should undertake this process for a specific problem.
DISCUSSION
We have described Bayesian coalescent-based methods to estimate and assess the uncertainty in mutation parameters, population parameters, tree topology, and dates of divergence from aligned temporally spaced sequence data. The sample-based Bayesian framework allows us to bring together information of different kinds to reduce uncertainty in the objects of the inference. Much of the hard work is in designing, implementing, and testing a suitable Monte Carlo algorithm. We found a suite of MCMC updates that do the job.
We have analyzed two contrasting HIV-1 datasets and 400 synthetic datasets to illustrate the main features of our methods. The results of the three HIV-1 env data subsections show that a robust summary of parameter-rich models, including the joint estimation of mutation rate and population size, is possible for some moderate-sized datasets. The pretreatment data restrict the set of plausible parameter values to a comparatively small range. For this dataset, useful results can be obtained from a state of ignorance about physically plausible outcomes. This situation is in contrast to the situation illustrated in the Posttreatment section. For this dataset, prior ignorance implies posterior ambiguity, in the form of a bimodal posterior distribution for the mutation rate. One of these modes is supported by genealogies conflicting with very basic current ideas about HIV population dynamics. We modify the coalescent prior on genealogies to account for this prior knowledge, restricting the most recent common ancestor to physically realistic values. The ambiguity in mutation rate is removed. Similar results could be obtained in a likelihood-based analysis of the posttreatment data, since the prior information amounts to an additional hard constraint on the root time of the coalescent genealogy.
There is some redundancy in the set of MCMC updates we used, in the sense that the limiting distribution of the MCMC is unaltered if we remove the scaling update (move 1) or the Wilson-Balding update (move 2; see appendix for details of these moves). However, these two update types are needed in practice. There are two timescales in MCMC, time to equilibrium and mixing time in equilibrium. The scaling move sharply reduces mixing time in equilibrium. The Wilson-Balding update is needed to bring the equilibrium time to acceptable values. We have seen MCMC simulations, minus the Wilson-Balding move, in which an apparently stationary Monte Carlo process undergoes a sudden and unheralded mean shift at ~2,000,000 updates. This problem was picked up at the debugging stage, in comparisons between our two MCMC implementations. Subsequent simulation has shown that the genealogies explored in the first 2,000,000 updates of that simulation were just one of the tree clusters supported by the target distribution.
The methods presented here reduce to those of Felsenstein and coworkers (Kuhner et al. 1995) in the case of a uniform prior on Θ = 2N_{e}μ, a fixed R, a fixed μ, and contemporaneous data, if instead of summarizing results using 95% HPD interval estimates, we use the mode and curvature of the posterior density for Θ to recover the maximum-likelihood estimate (MLE) and its associated confidence interval.
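This reduction rests on a standard identity: under a flat prior the posterior is proportional to the (integrated) likelihood, so its mode and curvature recover the MLE and the usual asymptotic confidence interval,

```latex
h(\Theta \mid D) \propto L(\Theta; D)
\quad\Longrightarrow\quad
\hat{\Theta} = \arg\max_{\Theta} L(\Theta; D),
\qquad
\widehat{\mathrm{SE}}(\hat{\Theta}) \approx
\left[-\frac{\partial^{2} \log L}{\partial \Theta^{2}}\bigg|_{\Theta=\hat{\Theta}}\right]^{-1/2}.
```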
A distinction can be made between a dataset, like the pretreatment dataset, for which there is strong statistical information about mutation rates (we refer to populations from which such datasets may be obtained as “measurably evolving”), and a dataset, like the posttreatment data, in which the statistical signal is weak. In both of these datasets the familiar parameter Θ = 2N_{e}μ is in fact well determined by the data (not shown above), so that MCMC convergence in Θ is quick. However, it is only in the pretreatment data that this parameter can be separated easily into its two factors. This is related to the well-known problem of identifiability for population size and mutation rate. We can see that temporally spaced data may or may not contain information that allows us to separate these two factors. In this particular example, lineages of the posttreatment viruses branch from those of the pretreatment viral population. Consequently a more appropriate analysis for this dataset would allow for a change of mutation rate and/or population size over the genealogy of the entire set of sequences. In the case of mutation rate this has already been demonstrated within a likelihood framework (Drummond et al. 2001). In a Bayesian analysis, coalescence of posttreatment lineages with pretreatment lineages will tend to limit the age of the most recent common ancestor of the posttreatment data, so that the pretreatment lineages will play the role of the reduced upper bound
A software package called molecular evolutionary population inference (MEPI), developed using the phylogenetic analysis library (PAL; Drummond and Strimmer 2001), implementing the described method and further extensions (codon position rate heterogeneity, etc.), is available from http://www.cebl.auckland.ac.nz/mepi/index.html.
Acknowledgments
We gratefully acknowledge two anonymous reviewers for helpful comments that much improved the manuscript. In addition, A.D. thanks A. Ferreira. A.D. was supported by a New Zealand Foundation for Research, Science and Technology Bright Futures scholarship. Research by A.G.R. and A.D. was also supported by National Institutes of Health grant GM59174.
APPENDIX: MCMC DETAILS AND MOVE TYPES
We describe the Markov chain Monte Carlo algorithm for temporally spaced sequence data, including the proposal mechanisms used.
Denote by Ω_{MΘG} the space [0, ∞) × [0, ∞) × Γ of all possible (μ, θ, g) values. Let
We now describe a Monte Carlo algorithm realizing a Markov chain X_{n}, n = 0, 1, 2, … with states
Suppose X_{n} = x. A value for X_{n+1} is computed using a Metropolis-Hastings algorithm. Define a set of random operations on the state. A given move may alter one or more of μ, θ, and g. Label the different move types m = 1, 2, …, M. The random operation with label m, acting on state x, generates state x′ with probability density q_{m}(x′|x), say. Let a ∧ b equal a if a < b and b otherwise, let a ∨ b equal a if a > b and b otherwise, and let
Proposal mechanisms: In this section we describe the proposal mechanisms (moves) and their acceptance probabilities. In each move, X_{n} = x, with x = (μ, θ, (E_{g}, t_{Y})). For each node i let parent(i) ∈ Y denote the label of the node ancestral to i and connected to i by an edge. We obtain a compact notation if we treat Y and g as if Y contained a notional parent(root) node with t_{parent(root)} = ∞, as we did in Equation 4. Also, we now drop the convention that node labels increase with age.
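Before describing the individual moves, the accept/reject skeleton they all share can be sketched as follows. This is a generic illustration in Python, not the authors' implementation; it assumes each move function either fails (returning None) or returns a candidate state together with the log Hastings ratio log q_m(x|x′) − log q_m(x′|x).

```python
import math
import random

def mh_step(x, moves, log_posterior):
    """One Metropolis-Hastings update.  A move type m is chosen at
    random; the move either fails (the chain stays at x) or proposes a
    candidate x', which is accepted with probability
    min(1, posterior ratio * Hastings ratio)."""
    move = random.choice(moves)
    proposal = move(x)
    if proposal is None:              # failed move: X_{n+1} = X_n
        return x
    x_new, log_hastings = proposal
    log_alpha = log_posterior(x_new) - log_posterior(x) + log_hastings
    if math.log(random.random()) < min(0.0, log_alpha):
        return x_new                  # accept the candidate state
    return x                          # reject: chain stays at x
```

A symmetric proposal, such as the random walk of move 5, returns a log Hastings ratio of 0, so the acceptance probability reduces to the posterior ratio alone.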
Let dx = dμ dθ dg in
Scaling move: Label this move m = 1. Let a real constant β > 1 be given. For β^{−1} ≤ δ ≤ β, let x → δx denote the transformation
The move is as follows. Choose δ ~ Unif(β^{−1}, β) and set x′ = δx. If
Wilson-Balding move: Label this move m = 2. A random subtree is moved to a new branch. This move is based on the branch-swapping move of Wilson and Balding (1998). The SPR move in PAUP* (Swofford 1999) is similar. However, the move below acts on a rooted tree and maintains all node ages except one.
Two nodes, i, j ∈ I ∪ Y, are chosen uniformly at random without replacement. Let jp = parent(j) and ip = parent(i). If t_{jp} ≤ t_{i}, or if ip = j or ip = jp, then the move fails and we set X_{n+1} = x. Given i and j, the candidate state x′ = (μ, θ, g′) is generated in the following way. Let ĩ denote the child of ip that is not i, and let ipp = parent(ip), the grandparent of i. Reconnect node ip so that it is a child of jp and a parent of j; that is, set
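The bookkeeping of this move can be sketched on a dict-based tree. This is a simplified sketch, not the paper's implementation: the uniform draw for the new age of ip and the interval-ratio Hastings factor are assumptions modeled on common implementations of this move, and for simplicity any proposal involving the root simply fails (the paper instead handles the root via the notional parent at infinite age).

```python
import math
import random

def children(parent, node):
    """All nodes whose parent is `node`."""
    return [c for c, p in parent.items() if p == node]

def wilson_balding(parent, t, root):
    """Sketch of a Wilson-Balding-style move.  The genealogy is a dict
    `parent` (node -> parent, root -> None) plus node ages `t`.  Node
    ip = parent(i) is detached and reattached on the branch above j,
    with a new age drawn uniformly on the legal interval (assumed).
    Returns (new_parent, new_t, log_hastings), or None if the move
    fails."""
    candidates = [n for n in parent if n != root]
    i, j = random.sample(candidates, 2)
    ip, jp = parent[i], parent[j]
    if t[jp] <= t[i] or ip == j or ip == jp:
        return None                 # move fails: chain stays at x
    if parent[ip] is None:
        return None                 # ip is the root: skipped in this sketch
    i_sib = next(c for c in children(parent, ip) if c != i)   # the node called i-tilde
    ipp = parent[ip]                # grandparent of i
    lo_new = max(t[i], t[j])
    new_age = random.uniform(lo_new, t[jp])
    # assumed Hastings factor: ratio of destination to source interval lengths
    log_hastings = math.log(t[jp] - lo_new) - math.log(t[ipp] - max(t[i], t[i_sib]))
    new_parent = dict(parent)
    new_parent[i_sib] = ipp         # close the gap left by removing ip
    new_parent[ip] = jp             # ip becomes a child of jp ...
    new_parent[j] = ip              # ... and the new parent of j
    new_t = dict(t)
    new_t[ip] = new_age
    return new_parent, new_t, log_hastings
```

Note that the time constraint t_{jp} > t_{i} automatically rules out reattachments that would create a cycle, since any node below i is younger than i.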
Subtree exchange: Label this move m = 3. Choose a node i ∈ I ∪ Y. Let ip = parent(i), jp = parent(ip), and let j denote the child of jp that is not ip. If node i is the root or a direct child of the root, or t_{ip} < t_{j}, then the move fails and we set X_{n+1} = x. Given i and j, the candidate state x′ = (μ, θ, g′) is generated in the following way. Swap nodes i and j, setting
The subtree exchange above is a local operation. In a second version of this move we choose node j uniformly at random over the whole tree.
Node age move: Label this move m = 4. Choose an internal node, i ∈ Y, uniformly at random. Let ip = parent(i) and let j and k be the two children of i [so i = parent(j) and i = parent(k), j ≠ k]. If i is not the root, choose a new time
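For a non-root node, the constraints just stated bound the new age between the older child and the parent. A minimal sketch, assuming the new age is drawn uniformly on that interval (a symmetric proposal, so no Hastings correction is needed; the paper's treatment of the root case is not shown):

```python
import random

def propose_node_age(t_parent, t_child_j, t_child_k):
    """Propose a new age for a non-root internal node i, drawn
    uniformly on the interval allowed by its parent ip and its older
    child (the uniform draw is an assumption).  The proposal is
    symmetric, so its Hastings ratio is 1."""
    lo = max(t_child_j, t_child_k)   # i cannot be younger than either child
    hi = t_parent                    # i cannot be older than its parent
    return random.uniform(lo, hi)
```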
Random walk moves for θ and μ: Label this move m = 5. The random walk update to θ is as follows. Let a real constant w_{θ} > 0 be given. Choose δ ~ Unif(−w_{θ}, w_{θ}) and set x′ = (μ, θ + δ, g). If
Implementation, convergence checking, and debugging: Convergence and standard errors: The efficiency of our Markov sampler, as a tool for estimating the mean of a given function f, is measured by calculating from the output τ_{f} = 1 + 2Σ_{k≥1} ρ_{f}(k), the integrated autocorrelation time (IACT) of f, where ρ_{f}(k) is the lag-k autocorrelation of f. Dividing the run length by τ_{f}, we get the number of “effectively independent” samples in the run (the number of independent samples required to achieve the same precision for estimation of the mean of f). We call this the effective sample size (ESS). Better MCMC algorithms have smaller IACTs and thus larger ESSs, though it may be necessary to measure τ in units of CPU time to make a really useful comparison. One will typically want to run the Markov chain for at least a few hundred times the IACT, to test convergence and to obtain reasonably stable marginal histograms. Note first that we do not know the IACT when we set the MCMC running; exploratory runs are needed. Second, a statement like “we ran the MCMC for 10^{6} updates, discarding the first 10^{4}” is worthless without some accompanying measurement of an IACT or equivalent. This point is made in Sokal (1989). The summation cutoff in the estimate for the IACT, τ_{f}, is determined using a monotone sequence estimator (Geyer 1992). The IACTs we get for our MCMC algorithms suggest that analysis of large datasets (50–100 sequences and 500–1000 nucleotides) is feasible with current desktop computers. Examples may be found in Table 2. The inverse of the IACT of a given statistic is its “mixing rate.” Statistics with small mixing rates are called the “slow modes” of an MCMC algorithm. The mutation rate μ was the slowest mode among those we checked, and we therefore present IACTs for that statistic in our examples.
Implementation issues: In this section we discuss debugging and MCMC efficiency of our two implementations. We compare expectations computed in the coalescent with estimates obtained from MCMC output. Standard errors are obtained from estimates of the corresponding IACT. Consider a tree with four leaves, two at time zero and two offset τ time units to greater age. Consider simulation in the coalescent, with no data. The expectation of t_{root} is
For problems involving data, such expectations are not available. However, an MCMC algorithm with several different move types may be tested for consistency. The equilibrium is the posterior distribution of μ, θ, and g and should not alter as we vary the proportions in which move types are used to generate candidate states. For example, move 2 (Wilson-Balding) is irreducible on its own, while moves 3 and 4 (subtree exchange and node-age move) form another irreducible group. We fix a small synthetic dataset and compare the output of two MCMC runs: one generated using move 2 alone and the other using moves 3 and 4 together.
We now turn to questions of MCMC efficiency. Each update has a number of parameters. These are adjusted, by trial and error for each analysis, so that the MCMC is reasonably efficient. An ad hoc adaptive scheme, based on monitoring acceptance rates, and akin to that described in Larget and Simon (1999), was used. The samples used in output analysis are taken from the final portion of the run, in which these parameters are fixed. The scaling and Wilson-Balding updates are particularly effective.
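An acceptance-rate-driven adjustment of this kind can be sketched as follows. The target rate and adjustment factor below are illustrative choices, not values from the paper; as noted above, tuning must finish before the samples used for output analysis are collected, since adaptation changes the chain's transition kernel.

```python
def tune_width(width, accepted, proposed, target=0.25, factor=1.2):
    """Ad hoc tuning of a proposal width from the observed acceptance
    rate over a batch of `proposed` candidate states.  The target rate
    (0.25) and multiplicative factor (1.2) are illustrative
    assumptions."""
    rate = accepted / proposed
    if rate > target:
        return width * factor   # accepting too often: take bolder steps
    return width / factor       # accepting too rarely: take smaller steps
```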
We have experimented with a range of other moves. However, while it is easy to think up computationally demanding updates with good mixing rates per MCMC update, we have focused on developing a set of primitive moves with a good mixing rate per CPU second. In our experience simple moves may have low acceptance rates, but they are easy to implement accurately and are rapidly evaluated, and they may give good mixing rates when measured in CPU seconds. Larget and Simon (1999) have given an effective MCMC scheme for a similar problem. We did not use their scheme, as its natural data structure did not fit well with our other operators. A second update, which may be useful to us in the future, would use the importance-sampling process of Stephens and Donnelly (2000) to construct an independence-sampling update.
Because of the explicit nature of MCMC inference, the details of a particular analysis, including the proposal mechanisms, the chain length, the evolutionary model, and the prior distributions, can be quite difficult to keep track of. One of us (A. Drummond) developed an XML data format to describe phylogenetic/population genetic analyses. This enables the user to write down the details of an analysis in a human-readable format that can also be used as the input for the computer program. For the more visually inclined, a graphical user interface (GUI) was developed that can generate the XML input files, given a NEXUS or PHYLIP alignment.
Footnotes

Communicating editor: J. Hein
 Received December 15, 2001.
 Accepted March 12, 2002.
 Copyright © 2002 by the Genetics Society of America