Hidden Markov Models and Phylogenetic Tree Construction using Markov Chain Monte Carlo

Some discussion and follow-up from BIO 502 lecture

Why should I care?
This question should really be translated "What questions do Markov chains and hidden Markov models help me answer?"

General answer: They are tools that allow us to statistically model processes in our world. That could be the weather, speech, DNA sequences, phylogeny, etc.

Some concrete answers: They are used to predict genes (HMM) and construct phylogenetic trees (MCMC).

What is the difference between a Markov chain and a hidden Markov model?
Let's start with the similarities:
 * 1) Both are statistical models that have states.
 * 2) The execution of the model consists of moving from one state to another (possibly the same) state. This movement is stochastic; thus, they are both statistical models.

Now onto some differences:
 * 1) The output/observations of a Markov chain are the sequence of states that were visited. Think back to the example we used in class that had three states: Sunny, Cloudy, and Rainy. The observation/output of that model was the states themselves.
 * 2) The output/observations of a hidden Markov model are not the states. There are separate output/observation probabilities for each state.
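To make the first point concrete, here is a minimal Python sketch of the Sunny/Cloudy/Rainy Markov chain. The transition probabilities are invented for illustration (the ones from lecture may have differed); the key point is that the visited states themselves are the output:

```python
import random

# States of the weather Markov chain; the output IS the state sequence.
STATES = ["Sunny", "Cloudy", "Rainy"]

# Hypothetical transition probabilities (each row sums to 1).
TRANSITIONS = {
    "Sunny":  {"Sunny": 0.7, "Cloudy": 0.2, "Rainy": 0.1},
    "Cloudy": {"Sunny": 0.3, "Cloudy": 0.4, "Rainy": 0.3},
    "Rainy":  {"Sunny": 0.2, "Cloudy": 0.4, "Rainy": 0.4},
}

def simulate_chain(start, n_steps, rng=random):
    """Walk the chain for n_steps; the visited states are the observations."""
    state, path = start, [start]
    for _ in range(n_steps):
        probs = TRANSITIONS[state]
        state = rng.choices(list(probs), weights=probs.values())[0]
        path.append(state)
    return path
```

Running `simulate_chain("Sunny", 10)` produces an 11-state path such as `["Sunny", "Sunny", "Cloudy", ...]`; that path is both the trajectory and the observation.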

Thought question: How could you turn our Sunny, Cloudy, Rainy model into a hidden Markov model? This really boils down to determining what is hidden about the weather. Disclaimer: I know very little about how the weather is actually predicted, but I could imagine a model where the states are things like low pressure, high pressure, low humidity, etc. The output would still be Sunny, Cloudy, or Rainy, but now the states are more conceptual (hidden).
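Here is one hypothetical sketch of that thought question in Python. All state names and probabilities are invented; the point is that the hidden states (pressure systems) are never observed directly, and each one has its own emission distribution over Sunny/Cloudy/Rainy:

```python
import random

# Invented hidden states: conceptual atmospheric conditions.
HIDDEN = ["HighPressure", "LowPressure"]

# Hypothetical transition probabilities between hidden states.
TRANSITIONS = {
    "HighPressure": {"HighPressure": 0.8, "LowPressure": 0.2},
    "LowPressure":  {"HighPressure": 0.3, "LowPressure": 0.7},
}

# Each hidden state has its own output/observation distribution.
EMISSIONS = {
    "HighPressure": {"Sunny": 0.7, "Cloudy": 0.25, "Rainy": 0.05},
    "LowPressure":  {"Sunny": 0.1, "Cloudy": 0.4,  "Rainy": 0.5},
}

def simulate_hmm(start, n_steps, rng=random):
    """Return (hidden_path, observations); only the observations are visible."""
    state = start
    hidden, obs = [], []
    for _ in range(n_steps):
        hidden.append(state)
        emis = EMISSIONS[state]
        obs.append(rng.choices(list(emis), weights=emis.values())[0])
        trans = TRANSITIONS[state]
        state = rng.choices(list(trans), weights=trans.values())[0]
    return hidden, obs
```

Compare with the Markov chain: here `obs` is what you see, while `hidden` is the path you would have to infer.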

OK. Great. But why did we dive into all of the math and computation?
My witty answer (or an attempt at one): Math leads to understanding.

More concrete answer: How can you know what questions a hidden Markov model or a Markov chain can answer without at least an intuitive understanding of the computational and mathematical concepts behind them?

An e-mail conversation
Initial prompt: This may be a trivial question, but I had to run after class and wasn't able to ask you about it. You mentioned training HMMs with (pure?) numerical algorithms, but I thought they were trained with actual (known) sequence data. Are both methods used? Also, in terms of sequence assembly, looking over the papers from this semester, evolutionary relationships were heavily focused on distance-based methods (maximum parsimony); I didn't catch a whole lot of MCMC methods mentioned. Are these methods more frequently used for distantly related taxa? (i.e. They lack direct sequence alignments, and may have more re-arrangements, therefore need more sequence-level data.)

My response: You're right. We need the sequences (observations) in order to train our HMM. The training algorithm (numerical method) tunes the HMM so that the model conforms to the sequences. For example, if we are training a model to predict the weather in Seattle and it never predicts rain, that is a bad model.
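To connect the two halves of that answer: a trainer such as Baum-Welch tunes the parameters to raise the probability the model assigns to the known training sequences, and that probability is computed with the forward algorithm. A minimal sketch of the forward algorithm follows; the two-state toy model (Wet/Dry) and its probabilities are invented for illustration:

```python
def forward_probability(obs, states, start_p, trans_p, emit_p):
    """P(observation sequence | model), via the forward algorithm."""
    # alpha[s] = P(observations so far AND currently in hidden state s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

# Invented toy model: Wet/Dry hidden states emitting observed weather.
STATES = ["Wet", "Dry"]
START = {"Wet": 0.5, "Dry": 0.5}
TRANS = {"Wet": {"Wet": 0.7, "Dry": 0.3}, "Dry": {"Wet": 0.4, "Dry": 0.6}}
EMIT = {"Wet": {"Rainy": 0.8, "Sunny": 0.2}, "Dry": {"Rainy": 0.1, "Sunny": 0.9}}

likelihood = forward_probability(["Rainy", "Rainy", "Sunny"], STATES, START, TRANS, EMIT)
```

A training run would repeatedly adjust TRANS and EMIT to increase this likelihood over the set of known sequences; a Seattle model that never predicts rain would score badly here.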

In regards to MCMC vs. distance-based methods, here is a summary of what Li et al. say in the paper:

Distance-based methods generally perform poorly in simulation studies, although they do perform well under additive models as the number of sites approaches infinity. They are useful for large data sets because of their computational efficiency.
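For a sense of what "distance-based" means in practice, here is a sketch of the simplest ingredient: the proportion of differing sites between two aligned sequences, plus the standard Jukes-Cantor correction for unseen multiple substitutions. The model is standard; the example sequences are made up:

```python
import math

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    assert len(a) == len(b), "sequences must be aligned to equal length"
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jc69_distance(a, b):
    """Jukes-Cantor corrected distance: d = -(3/4) ln(1 - 4p/3)."""
    p = p_distance(a, b)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

A method like neighbor-joining then builds a tree from the matrix of such pairwise distances, which is why it is so fast compared to searching tree space.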

Maximum parsimony has been by far the most widely used method. It tends to work best when the mutation rate is low and may fail badly when the mutation rate is high and branches are heterogeneous in length.
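The "counting changes" idea behind parsimony can be sketched with Fitch's small-parsimony algorithm, which scores one site on one fixed rooted binary tree (the full method repeats this over all sites and searches over trees). The example trees and characters below are invented:

```python
def fitch_count(tree):
    """Minimum number of character changes on a rooted binary tree (one site).

    A tree is either a leaf state like "A", or a (left, right) tuple.
    Returns (possible_root_states, change_count).
    """
    if isinstance(tree, str):            # leaf: the observed character
        return {tree}, 0
    left_set, left_n = fitch_count(tree[0])
    right_set, right_n = fitch_count(tree[1])
    common = left_set & right_set
    if common:                           # states agree: no extra change needed
        return common, left_n + right_n
    return left_set | right_set, left_n + right_n + 1
```

For example, `fitch_count((("A", "A"), ("A", "G")))` returns `({"A"}, 1)`: one change suffices to explain the single G. Parsimony then prefers whichever tree minimizes this count summed over sites.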

Maximum likelihood evaluates the likelihood that the given evolutionary model will yield the observed sequences. Bootstrap methods must be applied to determine if the global maximum is reached.
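As a toy version of "evaluating the likelihood that the evolutionary model yields the observed sequences," here is a sketch that fits a single branch length between two sequences under the Jukes-Cantor model by grid search. The site counts are invented; real ML tree inference does this jointly over a whole tree, which is why it is so CPU-hungry:

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of n_diff mismatching sites out of n_sites when two
    sequences are separated by Jukes-Cantor distance d."""
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))  # P(a site differs)
    return n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)

def ml_distance(n_sites, n_diff, step=1e-4, d_max=5.0):
    """Crude grid search for the maximum-likelihood distance."""
    grid = (step * i for i in range(1, int(d_max / step) + 1))
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))
```

For this one-branch case the grid answer agrees with the closed-form Jukes-Cantor distance; for whole trees no closed form exists, so the likelihood surface must be searched numerically.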

All of these methods produce only one or a few estimates of the true phylogeny, without providing any measure of the variability of the estimates. That's where MCMC really shines.

Andy's follow-up response: Yes, this is a very good overview of the different approaches. One of the reasons you see tons of NJ trees is that the number of possible trees to search (evaluate) goes up exponentially with the number of taxa (sequences) in your aligned data matrix. So very quickly, above say 10 multiply aligned sequences, you require massive CPU to evaluate all the possible (global) tree space that is often needed to eliminate getting stuck in local optima. So NJ is a v. quick and dirty way to deal with large data sets that you can also bootstrap v. quickly.

Max Parsimony evaluates the large tree space but of course assumes no underlying evolutionary model of character change (although it does accommodate differential character weighting schemes, which is not the same thing as a model-based approach) and just counts changes, picking the minimal number of changes as the criterion for selecting the "best" tree. This is the preferred approach when you are dealing with characters like comparative anatomy where there is no good underlying quantitative and consistent model of character change. With DNA sequences, we can do much better with model-based inference such as Max Likelihood, where you are actually fitting the best tree, given your data and the underlying model. So v. different from MP. Like MP, it takes a lot of CPU time to do bootstrap resampling of your original estimates if you have more than a handful of taxa, so this is why most people skip this and just use distance methods unless they have access to running parallel jobs on a computing cluster.

And Paul's point that MCMC shines is why Bayesian inference is now consuming the field of molecular systematics. Because the Bayesian approach is also based on the likelihood mathematics, but allows you to improve your estimates (posterior probabilities) based on input of some prior evidence (priors), it can be ground out for hundreds of millions of cycles of inference until the log likelihoods of your tree topologies (posteriors) converge. The result is that you generate hundreds of millions of trees with log likelihood values that you can subsample at confidence levels above a certain desired posterior probability (such as 80%) and then plot the consensus of those millions of trees in a manner that is analogous to the bootstrap. However, in this case your consensus is statistically derived directly from Bayes' theorem, and not from resampling your original data matrix with the limitations of pseudoreplication (such as in the bootstrap).

So MCMC rocks, but usually you need to tell the software what models and parameters you want to use, how many cycles you want to run, what priors you are assuming, etc. Thus you have to know what's under the hood to drive the car intelligently!
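A peek under that hood: the core of MCMC is the Metropolis accept/reject step. Here is a toy sampler for a single number; real phylogenetics software applies the same loop to tree topologies and branch lengths, with proposal moves that rearrange the tree. The target distribution and step size below are illustrative stand-ins, not anything from a real analysis:

```python
import math
import random

def metropolis(log_posterior, start, n_steps, step_size=0.5, rng=random):
    """Toy Metropolis sampler: random-walk proposals, accepted with
    probability min(1, posterior ratio). The returned samples approximate
    draws from the posterior distribution."""
    x, log_p = start, log_posterior(start)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_posterior(proposal)
        # Work in log space: accept if log(u) < log posterior ratio.
        if math.log(rng.random()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples
```

With `log_posterior = lambda x: -0.5 * x * x` (a standard normal, up to a constant), the chain's samples cluster around 0; summarizing millions of such samples is the same idea as taking the consensus of millions of sampled trees, and the model, priors, and cycle count are exactly the knobs the user must set.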

What phylogeny programs are available?
Great list and resource: http://evolution.genetics.washington.edu/phylip/software.html

The why and what of Bayesian inference
[[Media:Spring_2012_Bayesian_Inference.ppt|Bayesian Inference Presentation]]