Bayesian inference in phylogeny

Bayesian inference in phylogeny
Classification	Evolutionary biology
Subclassification	Molecular phylogenetics
Optimally search criteria	Bayesian inference
	v; t; e;

Bayesian inference of phylogeny combines the information in the prior and in the data likelihood to create the so-called posterior probability of trees, which is the probability that the tree is correct given the data, the prior and the likelihood model. Bayesian inference was introduced into molecular phylogenetics in the 1990s by three independent groups: Bruce Rannala and Ziheng Yang in Berkeley,^[1]^[2] Bob Mau in Madison,^[3] and Shuying Li in University of Iowa,^[4] the last two being PhD students at the time. The approach has become very popular since the release of the MrBayes software in 2001,^[5] and is now one of the most popular methods in molecular phylogenetics.

Bayesian inference of phylogeny background and bases

Bayesian inference refers to a probabilistic method developed by Reverend Thomas Bayes based on Bayes' theorem. Published posthumously in 1763 it was the first expression of inverse probability and the basis of Bayesian inference. Independently, unaware of Bayes' work, Pierre-Simon Laplace developed Bayes' theorem in 1774.^[6]

Bayesian inference or the inverse probability method was the standard approach in statistical thinking until the early 1900s before RA Fisher developed what's now known as the classical/frequentist/Fisherian inference. Computational difficulties and philosophical objections had prevented the widespread adoption of the Bayesian approach until the 1990s, when Markov Chain Monte Carlo (MCMC) algorithms revolutionized Bayesian computation.

The Bayesian approach to phylogenetic reconstruction combines the prior probability of a tree P(A) with the likelihood of the data (B) to produce a posterior probability distribution on trees P(A|B).^[7] The posterior probability of a tree will be the probability that the tree is correct, given the prior, the data, and the correctness of the likelihood model.

MCMC methods can be described in three steps: first using a stochastic mechanism a new state for the Markov chain is proposed. Secondly, the probability of this new state to be correct is calculated. Thirdly, a new random variable (0,1) is proposed. If this new value is less than the acceptance probability the new state is accepted and the state of the chain is updated. This process is run thousands or millions of times. The number of times a single tree is visited during the course of the chain is an approximation of its posterior probability. Some of the most common algorithms used in MCMC methods include the Metropolis–Hastings algorithms, the Metropolis-Coupling MCMC (MC³) and the LOCAL algorithm of Larget and Simon.

Metropolis–Hastings algorithm

One of the most common MCMC methods used is the Metropolis–Hastings algorithm,^[8] a modified version of the original Metropolis algorithm.^[9] It is a widely used method to sample randomly from complicated and multi-dimensional distribution probabilities. The Metropolis algorithm is described in the following steps:^[10] ^[11]

An initial tree, T_i, is randomly selected.
A neighbour tree, T_j, is selected from the collection of trees.
The ratio, R, of the probabilities (or probability density functions) of T_j and T_i is computed as follows: R = f(T_j)/f(T_i)
If R ≥ 1, T_j is accepted as the current tree.
If R < 1, T_j is accepted as the current tree with probability R, otherwise T_i is kept.
At this point the process is repeated from Step 2 N times.

The algorithm keeps running until it reaches an equilibrium distribution. It also assumes that the probability of proposing a new tree T_j when we are at the old tree state T_i, is the same probability of proposing T_i when we are at T_j. When this is not the case Hastings corrections are applied. The aim of Metropolis-Hastings algorithm is to produce a collection of states with a determined distribution until the Markov process reaches a stationary distribution. The algorithm has two components:

A potential transition from one state to another (i → j) using a transition probability function q_i,j
Movement of the chain to state j with probability α_i,j and remains in i with probability 1 – α_i,j.^[2]

Metropolis-coupled MCMC

Metropolis-coupled MCMC algorithm (MC³) ^[12] has been proposed to solve a practical concern of the Markov chain moving across peaks when the target distribution has multiple local peaks, separated by low valleys, are known to exist in the tree space. This is the case during heuristic tree search under maximum parsimony (MP), maximum likelihood (ML), and minimum evolution (ME) criteria, and the same can be expected for stochastic tree search using MCMC. This problem will result in samples not approximating correctly to the posterior density. The (MC³) improves the mixing of Markov chains in presence of multiple local peaks in the posterior density. It runs multiple (m) chains in parallel, each for n iterations and with different stationary distributions $\pi _{j}(.)\$ , $j=1,2,\ldots ,m\$ , where the first one, $\pi _{1}=\pi \$ is the target density, while $\pi _{j}\$ , $j=2,3,\ldots ,m\$ are chosen to improve mixing. For example, one can choose incremental heating of the form:

\pi _{j}(\theta )=\pi (\theta )^{1/},\ \ \lambda >0,

so that the first chain is the cold chain with the correct target density, while chains $2,3,\ldots ,m$ are heated chains. Note that raising the density $\pi (.)$ to the power $1/T\$ with $T>1\$ has the effect of flattening out the distribution, similar to heating a metal. In such a distribution, it is easier to traverse between peaks (separated by valleys) than in the original distribution. After each iteration, a swap of states between two randomly chosen chains is proposed through a Metropolis-type step. Let $\theta ^{(j)}\$ be the current state in chain $j\$ , $j=1,2,\ldots ,m\$ . A swap between the states of chains $i\$ and $j\$ is accepted with probability:

\alpha ={\frac {\pi _{i}(\theta ^{(j)})\pi _{j}(\theta ^{(i)})}{\pi _{i}(\theta ^{(i)})\pi _{j}(\theta ^{(j)})}}\

At the end of the run, output from only the cold chain is used, while those from the hot chains are discarded. Heuristically, the hot chains will visit the local peaks rather easily, and swapping states between chains will let the cold chain occasionally jump valleys, leading to better mixing. However, if $\pi _{i}(\theta )/\pi _{j}(\theta )\$ is unstable, proposed swaps will seldom be accepted. This is the reason for using several chains which differ only incrementally.

An obvious disadvantage of the algorithm is that $m\$ chains are run and only one chain is used for inference. For this reason, $\mathrm {MC} ^{3}\$ is ideally suited for implementation on parallel machines, since each chain will in general require the same amount of computation per iteration.

LOCAL algorithm of Larget and Simon

The LOCAL algorithms^[13] offers a computational advantage over previous methods and demonstrates that a Bayesian approach is able to assess uncertainty computationally practical in larger trees. The LOCAL algorithm is an improvement of the GLOBAL algorithm presented in Mau, Newton and Larget (1999)^[14] in which all branch lengths are changed in every cycle. The LOCAL algorithms modifies the tree by selecting an internal branch of the tree at random. The nodes at the ends of this branch are each connected to two other branches. One of each pair is chosen at random. Imagine taking these three selected edges and stringing them like a clothesline from left to right, where the direction (left/right) is also selected at random. The two endpoints of the first branch selected will have a sub-tree hanging like a piece of clothing strung to the line. The algorithm proceeds by multiplying the three selected branches by a common random amount, akin to stretching or shrinking the clothesline. Finally the leftmost of the two hanging sub-trees is disconnected and reattached to the clothesline at a location selected uniformly at random. This would be the candidate tree.

Suppose we began by selecting the internal branch with length $t_{8}\$ that separates taxa $A\$ and $B\$ from the rest. Suppose also that we have (randomly) selected branches with lengths $t_{1}\$ and $t_{9}\$ from each side, and that we oriented these branches. Let $m=t_{1}+t_{8}+t_{9}\$ , be the current length of the clothesline. We select the new length to be $m^{\star }=m\exp(\lambda (U_{1}-0.5))\$ , where $U_{1}\$ is a uniform random variable on $(0,1)\$ . Then for the LOCAL algorithm, the acceptance probability can be computed to be:

{\frac {h(y)}{h(x)}}\times {\frac {{m^{\star }}^{3}}{m^{3}}}\

Assessing convergence

To estimate a branch length $t$ of a 2-taxon tree under JC, in which $n_{1}$ sites are unvaried and $n_{2}$ are variable, assume exponential prior distribution with rate $\lambda \$ . The density is $p(t)=\lambda e^{-\lambda t}\$ . The probabilities of the possible site patterns are: