Understanding through Discussion


Author Topic:   Discussion of Phylogenetic Methods
herebedragons
Member (Idle past 876 days)
Posts: 1517
From: Michigan
Joined: 11-22-2009


Message 1 of 5 (775410)
01-01-2016 1:01 PM


Introduction
A phylogeny is a hypothesis about the evolutionary history of a group of taxa. Since the phylogeny we present is a hypothesis, we want to know how well it is supported compared to other hypotheses. Thus, the various phylogenetic methods have been developed to provide researchers with ways to evaluate those hypotheses and determine which hypothesis is the best. In other words, phylogenetic programs don't just build trees; more importantly, they evaluate them so that researchers can present the best-supported hypothesis.
The first thing that needs to be clarified is what is meant by the tree space. The tree space is all possible topologies that a particular combination of taxa could produce. The number of possible bifurcating, rooted trees for a given number of taxa m is given by the formula:
(2m - 3)! / [2^(m-2) (m - 2)!]
So, for just 10 taxa, there are 34,459,425 possible trees in the tree space, which demonstrates that the tree space becomes extremely large with even a small number of taxa and thus makes it all but impossible to evaluate ALL the various trees within the tree space.
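For anyone who wants to check that number, here is a minimal Python sketch of the count of rooted, bifurcating trees. The function name rooted_tree_count is just something I made up for illustration; the arithmetic is the formula above, with the 2 raised to the power (m - 2).

```python
from math import factorial


def rooted_tree_count(m: int) -> int:
    """Number of rooted, bifurcating trees for m labelled taxa.

    Implements (2m - 3)! / (2^(m-2) * (m - 2)!).
    """
    if m < 2:
        raise ValueError("need at least 2 taxa")
    # Integer division is exact here because the count is always a whole number.
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))


# Show how quickly the tree space grows with the number of taxa.
for m in (3, 4, 5, 10, 20):
    print(m, rooted_tree_count(m))
```

Running it for 10 taxa reproduces the 34,459,425 figure, and going to 20 taxa shows why an exhaustive evaluation is out of the question.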
The assumption is that one of these 34,459,425 trees represents the true evolutionary history of the 10 taxa in question. However, since we can never know for sure which tree is the TRUE tree, what we want to do is propose our best hypothesis as to which tree best represents the true evolutionary history of the taxa. How we do that is by specifying an optimality criterion and then finding the tree that has the best score under that criterion (the fewest steps for parsimony, the highest likelihood for maximum-likelihood).
Again, I think the point that a phylogeny is not (or should not be) presented as the true evolutionary history of a set of taxa cannot be overemphasized. What a phylogeny presents is our best estimate of the evolutionary history of a set of taxa. We evaluate how confident we are in that estimate by the type of optimality criterion used and the statistical support for the topology.
For example, if a phylogeny of 20 taxa were presented based on 200 nucleotide characters optimized by parsimony, I would have almost no confidence that the hypothesis was correct; in fact, I would pretty much dismiss it as worthless. However, if those same taxa were evaluated using 5000 nucleotide characters from 4 genes optimized by maximum-likelihood with bootstrap support values >90% on more than 3/4 of the branches, I would be very confident in the hypothesis. It is really about confidence levels, which is why newer phylogenetic methods, such as maximum-likelihood and Bayesian, rely so heavily on statistical models.
Genomicus requested that I discuss the Bayesian method since it is probably the least understood method and the most difficult conceptually. Bayesian analysis and maximum-likelihood (abbr. ML) are pretty much the standard for phylogenetic analysis these days, and researchers will often present the results of both analyses. The other methods are falling out of use, but still have some limited applications where Bayes and ML are not appropriate. However, I think it is important to cover these other methods before diving into Bayesian methods because they explain some key concepts that are needed in order to understand Bayesian concepts. The next post will cover this introductory material and provide the background for the discussion of Bayesian methods.
HBD
(ABE: Biological Evolution I suppose)

Whoever calls me ignorant shares my own opinion. Sorrowfully and tacitly I recognize my ignorance, when I consider how much I lack of what my mind in its craving for knowledge is sighing for... I console myself with the consideration that this belongs to our common nature. - Francesco Petrarca
"Nothing is easier than to persuade people who want to be persuaded and already believe." - another Petrarca gem.
Ignorance is a most formidable opponent rivaled only by arrogance; but when the two join forces, one is all but invincible.

Replies to this message:
 Message 2 by herebedragons, posted 01-01-2016 1:15 PM herebedragons has not replied
 Message 3 by Admin, posted 01-01-2016 2:07 PM herebedragons has replied

herebedragons
Member (Idle past 876 days)
Posts: 1517
From: Michigan
Joined: 11-22-2009


Message 2 of 5 (775411)
01-01-2016 1:15 PM
Reply to: Message 1 by herebedragons
01-01-2016 1:01 PM


Introduction to Phylogenetic Methods
In this post, I want to give brief explanations of phylogenetic methods of neighbor-joining (NJ), parsimony, and maximum-likelihood (ML). I am only going to cover the key principles involved (especially as they will be applicable to the discussion of Bayesian analysis) and the main advantages and disadvantages of the method. If anyone would like more information about a specific method, please ask.
Neighbor-joining
Neighbor-joining is the most widely used distance-based method. Distances are calculated from pairwise comparisons of the sequences, and the tree is built so that the taxa that are closest genetically are placed as most closely related. Distance measurements can be corrected using different nucleotide substitution models. NJ trees can be evaluated by bootstrapping (bootstrapping will be covered at the end of this post), which provides a measure of statistical support.
Advantages:
One of the major advantages is that NJ is very fast. Even very large datasets take only a few minutes as opposed to hours or even days with other methods. Another major advantage is that NJ will return a single, best tree (best as far as NJ analysis can determine). Why this is an advantage will become more apparent as the other methods are discussed.
Disadvantages:
The major disadvantage is that NJ treats genetic distance as directly proportional to relatedness, which is not necessarily the case. (The stricter assumption that all lineages evolve at the same rate belongs to UPGMA, another distance method; NJ itself allows rates to differ among lineages.) Also, distance analyses have no way to take intermediate steps into consideration but can only consider distances between terminal taxa. Thus NJ has no way to consider the possibilities of reversals and parallel changes in determining relatedness.
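To make the distance step concrete, here is a rough Python sketch of the pairwise comparison that a distance method starts from. It only computes the distances; it does not do the neighbor-joining clustering itself. The toy alignment and the function names are made up for illustration, and the Jukes-Cantor (JC69) correction is just one example of a substitution model that could be used.

```python
from itertools import combinations
from math import log


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc69_distance(p: float) -> float:
    """Jukes-Cantor correction: estimated substitutions per site from p."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the JC69 correction")
    return -0.75 * log(1 - 4 * p / 3)


# Hypothetical toy alignment, invented purely for this example.
alignment = {
    "taxon1": "ACGTACGTAC",
    "taxon2": "ACGTACGTAT",
    "taxon3": "ACGAACCTAT",
    "taxon4": "TCGAACCTGT",
}

# Every pair of terminal taxa gets one distance; this matrix is all NJ sees.
for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
    p = p_distance(seq_a, seq_b)
    print(f"{name_a}-{name_b}: p = {p:.2f}, JC69 = {jc69_distance(p):.3f}")
```

Notice that the corrected distance is always a bit larger than the raw proportion of differences, because the model allows for multiple changes at the same site. Notice also that once the sequences are boiled down to this matrix, the individual characters are gone, which is the point about not being able to consider intermediate steps.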
Parsimony
Parsimony, in my opinion, is the most unreliable phylogenetic method. It relies on the assumption that evolution occurs in the fewest steps possible, which is often a faulty assumption. In order to determine the most parsimonious tree, the phylogenetic program must search the tree space and evaluate every tree for the number of steps and select the tree that has the fewest. However, with any more than a few taxa, the tree space becomes so large that it is practically impossible to search every tree within the tree space (PAUP limits a full search to 15 taxa, IIRC). So instead, we use a technique called a heuristic search.
A heuristic search is a systematic method of searching the tree space. It begins with a randomly selected tree and through branch swapping, looks for an optimal tree. This optimal tree is a local optimum, since the search does not cover the entire tree space. The process then repeats a number of times, each starting at a different, random place in the tree space. I am not aware of a standard to determine how many times this process should be replicated, but 100 random addition replicates is common. The idea is that this should provide sufficient coverage to find the globally optimized tree.
Advantages:
The main advantage that I see for parsimony analysis is when evaluating taxa for which DNA data is unavailable, such as fossil species. Parsimony is a close enough approximation in this case (although I think ML and Bayesian analyses can be used with morphological data and so would be a better choice than parsimony).
Disadvantages:
As already mentioned, parsimony is a weak assumption when it comes to molecular evolution and since there are more rigorous methods available it seems pointless to use parsimony analysis on molecular data.
Another major disadvantage is that parsimony analysis often returns multiple most parsimonious reconstructions. My own analysis of 110 taxa and 2100 nucleotide characters returned 2048 most parsimonious reconstructions! How do you choose which tree to present since there is no way to favor one reconstruction over another? What you end up doing is creating a consensus tree, so the tree that is presented is not even an actual tree but an artefactual representation of multiple trees.
Another disadvantage is that since the tree space was searched heuristically and not completely, there is a possibility that even though the search found hundreds of most parsimonious reconstructions, there is a tree with fewer steps somewhere in the tree space. Without evaluating every tree in the tree space, there is no way of knowing the most parsimonious tree has been found. A sufficient number of sequence addition replicates helps to ensure that enough of the tree space has been covered to minimize this problem, but of course, that adds time to the analysis.
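For those who want to see what "counting steps" means in practice, here is a small Python sketch that scores a tree by the minimum number of changes it requires. I am using Fitch's algorithm for the counting, which is one standard way to do it and not necessarily what any particular program does internally. The nested-tuple tree format, the function names and the toy alignment are all assumptions made up for the example.

```python
def fitch_steps(tree, states):
    """Return (state_set, steps) for one character on a rooted binary tree.

    tree   -- a taxon name (leaf) or a 2-tuple of subtrees
    states -- dict mapping taxon name to its observed character state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a state, so no extra change is needed.
        return shared, left_steps + right_steps
    # No state in common: at least one change must occur at this node.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, alignment):
    """Total number of steps over all characters in the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_steps(tree, column)[1]
    return total


# Hypothetical 4-taxon, 4-character alignment, invented for illustration.
alignment = {"t1": "AAGC", "t2": "AAGT", "t3": "CCGT", "t4": "CCTT"}

# The three ways of pairing up four taxa.
candidates = [
    (("t1", "t2"), ("t3", "t4")),
    (("t1", "t3"), ("t2", "t4")),
    (("t1", "t4"), ("t2", "t3")),
]
for tree in candidates:
    print(tree, tree_length(tree, alignment))
```

With only four taxa every topology can be scored, so the tree with the fewest steps is known for certain. A heuristic search does the same scoring but only on the trees it happens to visit, which is why a shorter tree can go unnoticed.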
Maximum-likelihood
The maximum-likelihood approach asks: what is the probability of the observed data given an evolutionary model and a phylogenetic tree? Using an evolutionary model that defines the probability of different nucleotide substitutions (such as the probability of an A --> T or a C --> G, etc.), the probability for a site is the sum of the probabilities of every possible reconstruction of ancestral states. The probability for the full tree is the product of the probabilities at each of the sites.
Consider a case of 4 taxa where character 1 is C C A G for taxa 1, 2, 3 & 4 respectively and the topology shown in the figure below where taxa 1 and 2 are sister taxa and there are 2 unobserved ancestral states. In order to determine the likelihood for this tree, we calculate the probability for every possible combination of ancestral character states (4 x 4 = 16 combinations for 2 ancestral nodes). The sum of all these probabilities is the probability for site 1, and its negative natural log, -ln( ), is the likelihood score for that site. Now repeat that procedure for every site and sum the -ln( ) values of all characters; that is the likelihood score for the full tree, and the lower the score, the better the tree explains the data.
As you can imagine, these calculations can be very, very computationally demanding, and in principle they need to be done for every possible topology. In practice, like parsimony, maximum-likelihood uses a heuristic method to search the tree space.
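Here is a rough Python sketch of the per-site calculation for the 4-taxon example, in case it helps to see the sum over ancestral states written out. Several things in it are my own assumptions and not part of any particular program: I use the Jukes-Cantor (JC69) model because it is the simplest substitution model, I set every branch length to 0.1 substitutions per site, and the function names are made up. Real programs also estimate branch lengths and model parameters, which this sketch does not attempt.

```python
from itertools import product
from math import exp, log

BASES = "ACGT"


def jc69_prob(a: str, b: str, t: float) -> float:
    """JC69 probability of ending at base b, starting from a, over branch length t."""
    e = exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def site_likelihood(tips: str, t: float = 0.1) -> float:
    """Probability of one site on the tree where taxa 1 and 2 are sisters.

    tips -- observed states of taxa 1-4 in order, e.g. "CCAG"
    t    -- branch length, assumed equal on all five branches
    """
    s1, s2, s3, s4 = tips
    total = 0.0
    # x is the ancestor of taxa 1 and 2, y the ancestor of taxa 3 and 4.
    # All 4 x 4 = 16 combinations of ancestral states are summed.
    for x, y in product(BASES, repeat=2):
        total += (
            0.25                    # equal base frequencies under JC69
            * jc69_prob(x, s1, t)
            * jc69_prob(x, s2, t)
            * jc69_prob(x, y, t)    # internal branch joining the two ancestors
            * jc69_prob(y, s3, t)
            * jc69_prob(y, s4, t)
        )
    return total


def neg_log_likelihood(columns, t: float = 0.1) -> float:
    """Sum of -ln(site probability) over all sites."""
    return sum(-log(site_likelihood(col, t)) for col in columns)


print(site_likelihood("CCAG"))
# Hypothetical three-site dataset, invented for illustration.
print(neg_log_likelihood(["CCAG", "AAAA", "GGTT"]))
```

Even in this toy case each site takes 16 products, and that is for a single topology with fixed branch lengths. Add more taxa and the number of ancestral combinations multiplies by 4 for every additional internal node, which gives some feel for why ML runs take so long.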
Advantages:
Maximum-likelihood is a very rigorous method that considers possible ancestral states and can be bootstrapped for very good statistical confidence. Although it is possible that there will be more than 1 topology with the best likelihood value, typically ML analysis will return a single, best tree. This is an advantage because the maximum-likelihood tree can be presented as the favored hypothesis.
Disadvantages:
The major disadvantage is that ML is extremely time consuming and requires tremendous computational resources. For my project of 110 taxa, I figured it was going to take about 7 days to run 50 replicates in a heuristic search plus another 7 - 14 days for 50 bootstrap replicates. Newer ML programs have been developed that are considerably faster; RAxML and PhyML are a couple of examples. I was able to complete my analysis in about 4 hours (as opposed to 14+ days for the heuristic search) for 100 bootstrap replicates using PhyML. Honestly though, I am not really all that confident in the results. Not many of the branches had good support, whereas with other methods most of them were quite well supported.
Bootstrapping
Bootstrapping is a widely used method that can provide a measure of support for the branches of a phylogenetic tree. In order to create a bootstrap replicate, a new dataset (of the same size as the original) is created by randomly choosing characters (with replacement) from the original dataset. For example, a bootstrap replicate made from a dataset with 10 characters (numbered 1 - 10) might include characters 1, 3, 4, 4, 7, 8, 8, 8, 9, 10. So, characters 2, 5 & 6 are not represented in the replicate while 4 & 8 are represented multiple times. The new dataset is then analyzed the same way the original dataset was. This process is repeated the specified number of times. A consensus of the trees generated from the replicates is created and the bootstrap support value for a branch is the percentage of times that particular branch appears in the set of replicate trees.
The reasoning behind this process is that if a branch is well supported there should be a significant number of characters that support that topology and by selecting characters at random, the strength of the phylogenetic signal should be detectable. The downside to this process is that not only is the tree that will be presented a consensus, but it is also created from artificial data.
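The resampling step is easy to show in code, so here is a minimal Python sketch of it. The toy alignment, the function names and the way clades are represented (as frozensets of taxon names) are all assumptions I made for the example. The tree-building step for each replicate is left out, since that would be whichever method was used on the original dataset.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    Returns the new alignment and the list of column indices (0-based) drawn.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    resampled = {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }
    return resampled, columns


def support_values(replicate_clades):
    """Percentage of replicate trees in which each clade appears.

    replicate_clades -- one set of clades per replicate tree, each clade
                        being a frozenset of taxon names
    """
    counts = {}
    for clades in replicate_clades:
        for clade in clades:
            counts[clade] = counts.get(clade, 0) + 1
    n = len(replicate_clades)
    return {clade: 100.0 * c / n for clade, c in counts.items()}


# Hypothetical 10-character alignment, invented for illustration.
toy_alignment = {
    "taxon1": "ACGTACGTAC",
    "taxon2": "ACGTACGTAT",
    "taxon3": "ACGAACCTAT",
}
rng = random.Random(2016)  # fixed seed so the example is repeatable
replicate, drawn = bootstrap_replicate(toy_alignment, rng)
print("characters drawn:", sorted(i + 1 for i in drawn))
print(replicate)
```

The list of characters drawn will typically show some characters repeated and others missing, just as in the 1, 3, 4, 4, 7, 8, 8, 8, 9, 10 example. Once a tree has been built from each replicate, support_values turns the collection of replicate trees into the percentages that get written on the branches.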
------------
This post should provide the background to the principles and theories that led to the development of Bayesian analysis. I realize that there is a lot of information here but I tried to be as brief as I felt I could be. Hopefully it is all relatively clear but I expect that in my attempt at brevity, I left some explanations vague or unclear. I would appreciate questions, comments or discussion regarding anything related to this topic. I would also expect that my comments about parsimony may be controversial and might generate some discussion which would also be welcomed.
I will come back to discuss Bayesian analysis as soon as possible.
HBD


This message is a reply to:
 Message 1 by herebedragons, posted 01-01-2016 1:01 PM herebedragons has not replied

Admin
Director
Posts: 13013
From: EvC Forum
Joined: 06-14-2002
Member Rating: 1.9


Message 3 of 5 (775413)
01-01-2016 2:07 PM
Reply to: Message 1 by herebedragons
01-01-2016 1:01 PM


Re: Introduction
I'd like to make sure this can be easily understood by rank laypeople.
herebedragons writes:
A phylogeny is a hypothesis about the evolutionary history of a group of taxa...
About "phylogeny," is this the way biologists actually use the term? I'm asking because when I look "phylogeny" up in the dictionary it seems like the above could be more easily understood if it were phrased like this: "A phylogenetic tree is a specific hypothesis about the evolutionary history of a group of taxa..."
Thus, the various phylogenetic methods have been developed to provide researchers with ways to evaluate those hypotheses and determine which hypothesis is the best.
A natural objection might be that selecting the best among a bunch of poor hypotheses is not of much value. What tells us that the better hypotheses have a fair chance of being true? You go on to describe some evaluation criteria like optimality, but giving a name to a criteria in this case explains little.
For example, if a phylogeny of 20 taxa were presented based on 200 nucleotide characters optimized by parsimony,..
"Nucleotide characters" means groups of three nucleotides that program for amino acids? Or do you just mean individual nucleotides?
However, if those same taxa were evaluated using 5000 nucleotide characters from 4 genes optimized by maximum-likelihood with bootstrap support values >90%...
You might be descending into jargon here.
I don't have time now to tackle your next post.

--Percy
EvC Forum Director

This message is a reply to:
 Message 1 by herebedragons, posted 01-01-2016 1:01 PM herebedragons has replied

Replies to this message:
 Message 4 by herebedragons, posted 01-01-2016 3:41 PM Admin has seen this message but not replied

herebedragons
Member (Idle past 876 days)
Posts: 1517
From: Michigan
Joined: 11-22-2009


Message 4 of 5 (775424)
01-01-2016 3:41 PM
Reply to: Message 3 by Admin
01-01-2016 2:07 PM


Re: Introduction
About "phylogeny," is this the way biologists actually use the term?
A phylogeny is the evolutionary history of a group of organisms and the result of a phylogenetic analysis. A phylogenetic tree is a graphic representation of a phylogeny. The phylogeny itself is the hypothesis about the evolutionary history of a group of taxa. I could try to clarify that better.
What tells us that the better hypotheses have a fair chance of being true? You go on to describe some evaluation criteria like optimality, but giving a name to a criteria in this case explains little.
I explain the optimality criteria more in depth in the next post.
"Nucleotide characters" means groups of three nucleotides that program for amino acids? Or do you just mean individual nucleotides?
This issue would need to be resolved during alignment. Once the alignment is done the phylogenetic analysis treats each individual nucleotide as a separate character. The alignment is critical to any phylogenetic analysis but I wasn't sure there would be interest in a prolonged discussion about alignment, so I was trying to gloss over it.
You might be descending into jargon here.
I could probably delete that whole paragraph as it was just meant to be a quick example about how confidence affects our conclusions. I could save that for later.
I don't have time now to tackle your next post.
I will wait to make any corrections until I get your comments on that post.
HBD


This message is a reply to:
 Message 3 by Admin, posted 01-01-2016 2:07 PM Admin has seen this message but not replied

Admin
Director
Posts: 13013
From: EvC Forum
Joined: 06-14-2002
Member Rating: 1.9


Message 5 of 5 (775516)
01-02-2016 11:23 AM


Thread Copied to Biological Evolution Forum
Thread copied to the Discussion of Phylogenetic Methods thread in the Biological Evolution forum; this copy of the thread has been closed.


Copyright 2001-2023 by EvC Forum, All Rights Reserved
