Understanding through Discussion


  
Author Topic:   Discussion of Phylogenetic Methods
herebedragons
Member
Posts: 1513
From: Michigan
Joined: 11-22-2009


(2)
Message 1 of 288 (775512)
01-01-2016 1:01 PM


Introduction
A phylogeny is a hypothesis about the evolutionary history of a group of taxa and since the phylogeny we present is a hypothesis, we want to know how well our hypothesis is supported compared to other hypotheses. Thus, the various phylogenetic methods have been developed to provide researchers with ways to evaluate those hypotheses and determine which hypothesis is the best. In other words, phylogenetic programs don't just build trees but more importantly, they evaluate them so that researchers can present the most well supported hypothesis.

The first thing that needs to be clarified is what is meant by the tree space. The tree space is all possible topologies that a particular combination of taxa could produce. The number of possible bifurcating, rooted trees for a given number of taxa m is given by the formula:

(2m - 3)! / [2^(m-2) (m - 2)!]

So, for just 10 taxa, there are 34,459,425 possible trees in the tree space, which demonstrates that the tree space becomes extremely large with even a small number of taxa and thus makes it all but impossible to evaluate ALL the various trees within the tree space.
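As a quick check on that figure, here is a minimal Python sketch of the count of rooted, bifurcating trees, (2m - 3)! divided by 2^(m-2) (m - 2)!. The function name `rooted_tree_count` is just an illustrative choice, not something from a phylogenetics package.

```python
from math import factorial


def rooted_tree_count(m: int) -> int:
    """Number of rooted, bifurcating trees for m labelled taxa (m >= 2)."""
    if m < 2:
        raise ValueError("need at least 2 taxa")
    # (2m - 3)! / [2^(m-2) * (m - 2)!]; the division is always exact.
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))


# Print the size of the tree space for a few taxon counts.
for m in (3, 4, 5, 10, 20):
    print(m, rooted_tree_count(m))
```

The count for 20 taxa already has more than 20 digits, which is why exhaustive searches are only practical for very small datasets.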

The assumption is that one of these 34,459,425 trees represents the true evolutionary history of the 10 taxa in question. However, since we can never know for sure which tree is the “TRUE” tree, what we want to do is propose our best hypothesis as to which tree best represents the true evolutionary history of the taxa. How we do that is by specifying an optimality criterion and then finding the tree that scores best under that criterion.

Again, I don’t think the point that a phylogeny is not (or should not be) presented as the “true” evolutionary history of a set of taxa can be overemphasized. What a phylogeny presents is our best estimate of the evolutionary history of a set of taxa. We evaluate how confident we are in that estimate by the type of optimality criterion used and the statistical support for the topology.

For example, if a phylogeny of 20 taxa were presented based on 200 nucleotide characters optimized by parsimony, I would have almost no confidence that the hypothesis was correct; in fact, I would pretty much dismiss it as worthless. However, if those same taxa were evaluated using 5000 nucleotide characters from 4 genes optimized by maximum-likelihood with bootstrap support values >90% on more than 3/4 of the branches, I would be very confident in the hypothesis. It is really about confidence levels, which is why newer phylogenetic methods, such as maximum-likelihood and Bayesian, rely so heavily on statistical models.

Genomicus requested that I discuss the Bayesian method since it is probably the least understood method - and the most difficult conceptually. Bayes and maximum-likelihood (abbr. ML) are pretty much the standard for phylogenetic analysis these days and researchers will often present the results of both analyses. The other methods are falling out of use, but still have some limited applications where Bayes and ML are not appropriate. However, I think it is important to cover these other methods before diving into Bayesian methods because they explain some key concepts that are needed in order to understand Bayesian concepts. The next post will cover this introductory material and provide the background for the discussion of Bayesian methods.

HBD

(ABE: Biological Evolution I suppose)

Edited by herebedragons, : No reason given.


Whoever calls me ignorant shares my own opinion. Sorrowfully and tacitly I recognize my ignorance, when I consider how much I lack of what my mind in its craving for knowledge is sighing for... I console myself with the consideration that this belongs to our common nature. - Francesco Petrarca

"Nothing is easier than to persuade people who want to be persuaded and already believe." - another Petrarca gem.

Ignorance is a most formidable opponent rivaled only by arrogance; but when the two join forces, one is all but invincible.


Replies to this message:
 Message 2 by herebedragons, posted 01-01-2016 1:15 PM herebedragons has not yet responded
 Message 3 by Admin, posted 01-01-2016 2:07 PM herebedragons has responded
 Message 24 by vaporwave, posted 12-18-2016 9:47 AM herebedragons has not yet responded

  
herebedragons
Message 2 of 288 (775513)
01-01-2016 1:15 PM
Reply to: Message 1 by herebedragons
01-01-2016 1:01 PM


Introduction to Phylogenetic Methods
In this post, I want to give brief explanations of three phylogenetic methods: neighbor-joining (NJ), parsimony, and maximum-likelihood (ML). I am only going to cover the key principles involved (especially as they will be applicable to the discussion of Bayesian analysis) and the main advantages and disadvantages of each method. If anyone would like more information about a specific method, please ask.

Neighbor-joining

Neighbor-joining is the most widely used distance-based method. Distances are calculated from pairwise comparisons of the sequences, and the tree is built so that the taxa that are closest genetically are placed as most closely related. Distance measurements can be corrected using different nucleotide substitution models. NJ trees can be evaluated by bootstrapping (bootstrapping will be covered at the end of this post), which allows some statistical power.
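To make the "pairwise distance" step concrete, here is a small Python sketch. It computes the raw proportion of differing sites between two aligned sequences and then applies the Jukes-Cantor (JC69) correction as one example of a substitution-model correction. The three sequences are made up for illustration, and JC69 is my choice of model here, not necessarily what any particular NJ program uses by default.

```python
from itertools import combinations
from math import log


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance; only defined for p < 0.75."""
    if p >= 0.75:
        raise ValueError("JC69 correction is undefined for p >= 0.75")
    return -0.75 * log(1 - 4 * p / 3)


# Hypothetical aligned sequences, for illustration only.
seqs = {"t1": "CGTGGGAAA", "t2": "CGGGGAAAA", "t3": "CGGGGAAAT"}

for (name_a, a), (name_b, b) in combinations(seqs.items(), 2):
    p = p_distance(a, b)
    print(name_a, name_b, round(p, 3), round(jc69_distance(p), 3))
```

The corrected distance is always a little larger than the raw proportion because the model allows for multiple changes at the same site. A distance matrix like this is the only input NJ sees.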

Advantages:
One of the major advantages is that NJ is very fast. Even very large datasets take only a few minutes as opposed to hours or even days with other methods. Another major advantage is that NJ will return a single, best tree (“best” as far as NJ analysis can determine). Why this is an advantage will become more apparent as the other methods are discussed.

Disadvantages:
The major disadvantage is that NJ reduces the data to pairwise distances and assumes that genetic distance reflects relatedness, which is not necessarily the case, particularly when lineages evolve at very different rates. Also, distance analyses have no way to take intermediate steps into consideration but can only consider distances between terminal taxa. Thus NJ has no way to consider the possibilities of reversals and parallel changes in determining relatedness.

Parsimony

Parsimony, in my opinion, is the most unreliable phylogenetic method. It relies on the assumption that evolution occurs in the fewest steps possible, which is often a faulty assumption. In order to determine the most parsimonious tree, the phylogenetic program must search the tree space and evaluate every tree for the number of steps and select the tree that has the fewest. However, with any more than a few taxa, the tree space becomes so large that it is practically impossible to search every tree within the tree space (PAUP limits a full search to 15 taxa, IIRC). So instead, we use a technique called a heuristic search.

A heuristic search is a systematic method of searching the tree space. It begins with a randomly selected tree and through branch swapping, looks for an optimal tree. This optimal tree is a local optimum, since the search does not cover the entire tree space. The process then repeats a number of times, each starting at a different, random place in the tree space. I am not aware of a standard to determine how many times this process should be replicated, but 100 random addition replicates is common. The idea is that this should provide sufficient coverage to find the globally optimized tree.
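The one operation a parsimony search repeats for every tree it visits is "count the steps on this tree." Here is a rough Python sketch of that step using Fitch's counting method, which is a standard textbook approach and only my assumption about how to illustrate it; I am not claiming this is what PAUP does internally. Trees are written as nested tuples and the 4-taxon alignment is made up.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, steps) for one site on a binary tree."""
    if isinstance(tree, str):            # a tip: its observed state, no steps
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_site(left, states)
    right_set, right_steps = fitch_site(right, states)
    shared = left_set & right_set
    if shared:                           # children agree: no extra step needed
        return shared, left_steps + right_steps
    # children disagree: take the union and count one change
    return left_set | right_set, left_steps + right_steps + 1


def parsimony_score(tree, alignment):
    """Total number of steps over all sites of an alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Hypothetical data: 4 taxa, 3 characters.
alignment = {"t1": "CCA", "t2": "CCG", "t3": "ATA", "t4": "ATG"}
tree_a = (("t1", "t2"), ("t3", "t4"))
tree_b = (("t1", "t3"), ("t2", "t4"))

print(parsimony_score(tree_a, alignment), parsimony_score(tree_b, alignment))
```

On this toy data `tree_a` needs 4 steps and `tree_b` needs 5, so `tree_a` is the more parsimonious of the two. A heuristic search would apply the same scoring to each rearranged tree and keep the best one it finds.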

Advantages:
The main advantage that I see for parsimony analysis is when evaluating taxa for which DNA data is unavailable, such as fossil species. Parsimony is a close enough approximation in this case (although I think ML and Bayesian analyses can be used with morphological data and so would be a better choice than parsimony).

Disadvantages:
As already mentioned, parsimony is a weak assumption when it comes to molecular evolution and since there are more rigorous methods available it seems pointless to use parsimony analysis on molecular data.

Another major disadvantage is that parsimony analysis often returns multiple most parsimonious reconstructions. My own analysis of 110 taxa and 2100 nucleotide characters returned 2048 most parsimonious reconstructions! How do you choose which tree to present since there is no way to favor one reconstruction over another? What you end up doing is creating a consensus tree, so the tree that is presented is not even an actual tree but an artefactual representation of multiple trees.

Another disadvantage is that since the tree space was searched heuristically and not completely, there is a possibility that even though the search found hundreds of “most parsimonious” reconstructions, there is a tree with fewer steps somewhere in the tree space. Without evaluating every tree in the tree space, there is no way of knowing the most parsimonious tree has been found. A sufficient number of sequence addition replicates helps to ensure that enough of the tree space has been covered to minimize this problem, but of course, that adds time to the analysis.

Maximum-likelihood

The maximum-likelihood approach asks “What is the probability of the observed data given an evolutionary model and a phylogenetic tree?” Using an evolutionary model that defines the probability of different nucleotide substitutions (such as what is the probability of an A --> T or a C --> G, etc.), the probability for a site is the sum of the probabilities of every possible reconstruction of ancestral states. The probability for the full tree is the product of the likelihoods at each of the sites.

Consider a case of 4 taxa where character 1 is C C A G for taxa 1, 2, 3 & 4 respectively and the topology shown in the figure below where taxa 1 and 2 are sister taxa and there are 2 unobserved ancestral states. In order to determine the likelihood for this tree, we calculate the probability for every possible combination of ancestral character states (4 x 4 = 16 combinations for the 2 ancestral nodes). The sum of all these probabilities is the likelihood for site 1, and its negative natural logarithm, -ln(L), is the score reported for that site. Now repeat that procedure for every site and sum the -ln(L) values of all characters; that sum is the -ln likelihood score for the full tree (the smaller the score, the better the tree).
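For anyone who wants to see the arithmetic, here is a minimal Python sketch of the site calculation for the 4-taxon case. It rests on several assumptions of mine that are not in the example above: the Jukes-Cantor (JC69) model, equal base frequencies of 0.25, and the same branch length (0.1 substitutions per site) on all five branches. Real programs estimate branch lengths and model parameters instead of fixing them.

```python
from itertools import product
from math import exp, log

BASES = "ACGT"


def jc69_prob(x: str, y: str, t: float) -> float:
    """JC69 probability that state x becomes state y along a branch of length t."""
    e = exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def site_likelihood(tips: str, t: float = 0.1) -> float:
    """Likelihood of one site for taxa 1-4 on the topology ((1,2),(3,4)).

    x is the ancestor of taxa 1 and 2, y the ancestor of taxa 3 and 4;
    the sum runs over all 16 combinations of the two ancestral states.
    """
    s1, s2, s3, s4 = tips
    total = 0.0
    for x, y in product(BASES, repeat=2):
        total += (0.25                                   # frequency of state x
                  * jc69_prob(x, s1, t) * jc69_prob(x, s2, t)
                  * jc69_prob(x, y, t)                   # internal branch
                  * jc69_prob(y, s3, t) * jc69_prob(y, s4, t))
    return total


def neg_log_likelihood(sites, t: float = 0.1) -> float:
    """Sum of -ln(site likelihood) over all sites: the score for the tree."""
    return -sum(log(site_likelihood(s, t)) for s in sites)


print(site_likelihood("CCAG"))
print(neg_log_likelihood(["CCAG", "CCAA", "GGGG"]))
```

Each string passed in is one character (column) of the alignment, listed in taxon order. Comparing topologies means repeating this with the tips rearranged and keeping the tree with the lowest -ln score.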

As you can imagine, these calculations can be very, very computationally demanding and they need to be done for every possible topology. Like parsimony, maximum-likelihood uses a heuristic method to search the tree space.

Advantages:
Maximum-likelihood is a very rigorous method that considers possible ancestral states and can be bootstrapped for very good statistical confidence. Although it is possible that there will be more than 1 topology with the best likelihood value, typically ML analysis will return a single, best tree. This is an advantage because the maximum-likelihood tree can be presented as the favored hypothesis.

Disadvantages:
ML is extremely time consuming and requires tremendous computational resources. For my project of 110 taxa, I figured it was going to take about 7 days to run 50 replicates in a heuristic search plus another 7 - 14 days for 50 bootstrap replicates. Newer ML programs have been developed that are considerably faster; RAxML and PhyML are a couple of examples. I was able to complete my analysis in about 4 hours (as opposed to 14+ days for the heuristic search) for 100 bootstrap replicates using PhyML. Honestly though, I am not really all that confident in the results. Not many of the branches had good support, whereas by other methods most of them were quite well supported.

Bootstrapping

Bootstrapping is a widely used method that can provide a measure of support for the branches of a phylogenetic tree. In order to create a bootstrap replicate, a new dataset (of the same size as the original) is created by randomly choosing characters (with replacement) from the original dataset. For example, a bootstrap replicate made from a dataset with 10 characters (numbered 1 - 10) might include characters 1, 3, 4, 4, 7, 8, 8, 8, 9, 10. So, characters 2, 5 & 6 are not represented in the replicate while 4 & 8 are represented multiple times. The new dataset is then analyzed the same way the original dataset was. This process is repeated the specified number of times. A consensus of the trees generated from the replicates is created and the bootstrap support value for a branch is the percentage of times that particular branch appears in the set of replicate trees.
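The resampling step is simple enough to sketch in a few lines of Python. The alignment below is made up, the function names are mine, and the "trees" in the support calculation are reduced to hand-written sets of clades, since building a tree from each replicate is the job of whichever method (NJ, parsimony, ML) is being bootstrapped.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample characters (columns) with replacement, keeping the same size."""
    n_chars = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {taxon: "".join(seq[c] for c in cols)
            for taxon, seq in alignment.items()}


def support(replicate_clades, clade):
    """Percentage of replicate trees in which the given clade appears."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return 100.0 * hits / len(replicate_clades)


# Hypothetical alignment: 3 taxa, 10 characters.
alignment = {"t1": "ACGTACGTAC", "t2": "ACGTTCGTAA", "t3": "TCGAACGGAC"}
rng = random.Random(42)          # fixed seed so the run is repeatable
print(bootstrap_alignment(alignment, rng))

# Clades recovered from 4 imaginary replicate trees.
replicates = [
    {frozenset({"t1", "t2"})},
    {frozenset({"t1", "t2"})},
    {frozenset({"t1", "t3"})},
    {frozenset({"t1", "t2"})},
]
print(support(replicates, frozenset({"t1", "t2"})))
```

In the imaginary example the clade (t1, t2) shows up in 3 of 4 replicates, so it would get a bootstrap value of 75%.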

The reasoning behind this process is that if a branch is well supported there should be a significant number of characters that support that topology and by selecting characters at random, the strength of the phylogenetic signal should be detectable. The downside to this process is that not only is the tree that will be presented a consensus, but it is also created from artificial data.
------------

This post should provide the background to the principles and theories that led to the development of Bayesian analysis. I realize that there is a lot of information here but I tried to be as brief as I felt I could be. Hopefully it is all relatively clear but I expect that in my attempt at brevity, I left some explanations vague or unclear. I would appreciate questions, comments or discussion regarding anything related to this topic. I would also expect that my comments about parsimony may be controversial and might generate some discussion which would also be welcomed.

I will come back to discuss Bayesian analysis as soon as possible.

HBD




This message is a reply to:
 Message 1 by herebedragons, posted 01-01-2016 1:01 PM herebedragons has not yet responded

Replies to this message:
 Message 6 by Genomicus, posted 01-03-2016 7:59 AM herebedragons has not yet responded
 Message 8 by Tanypteryx, posted 01-05-2016 3:31 PM herebedragons has responded

  
herebedragons
Message 4 of 288 (775515)
01-01-2016 3:41 PM
Reply to: Message 3 by Admin
01-01-2016 2:07 PM


Re: Introduction
About "phylogeny," is this the way biologists actually use the term?

A phylogeny is the evolutionary history of a group of organisms and the result of a phylogenetic analysis. A phylogenetic tree is a graphic representation of a phylogeny. The phylogeny itself is the hypothesis about the evolutionary history of a group of taxa. I could try to clarify that better.

What tells us that the better hypotheses have a fair chance of being true? You go on to describe some evaluation criteria like optimality, but giving a name to a criteria in this case explains little.

I explain the optimality criteria more in depth in the next post.

"Nucleotide characters" means groups of three nucleotides that program for amino acids? Or do you just mean individual nucleotides?

This issue would need to be resolved during alignment. Once the alignment is done the phylogenetic analysis treats each individual nucleotide as a separate character. The alignment is critical to any phylogenetic analysis but I wasn't sure there would be interest in a prolonged discussion about alignment, so I was trying to gloss over it.

You might be descending into jargon here.

I could probably delete that whole paragraph as it was just meant to be a quick example about how confidence affects our conclusions. I could save that for later.

I don't have time now to tackle your next post.

I will wait to make any corrections until I get your comments on that post.

HBD




This message is a reply to:
 Message 3 by Admin, posted 01-01-2016 2:07 PM Admin has acknowledged this reply

  
herebedragons
Message 10 of 288 (775865)
01-06-2016 12:21 AM
Reply to: Message 8 by Tanypteryx
01-05-2016 3:31 PM


Re: Introduction to Phylogenetic Methods
Good questions Tanypteryx. The simple answer is... it depends.

Essentially they are questions about alignment, as is Taq's in the next post. A good alignment (including choosing appropriate gene regions) is critical to a good phylogenetic study. I plan to spend some time later on alignments. But for now I will give some brief answers to your questions.

How many regions would you want to sequence for the family phylogeny?

Four seems to be the minimum number of gene regions for an acceptable phylogenetic study; I have seen 8, but I can't think of any studies that have used more than that, so I would say 4 to 8 genes would be good.

If you were going to try to create a phylogeny for a family that contains 11 species in 8 genera, how would you decide which regions of the genome to sequence?

First, I would find out what other researchers are using for closely related species; this would be a good starting point. There are also some genes that are commonly used that would probably work for most Eukaryotes.

- Ribosomal RNA (rRNA) genes, which include the 5.8S, 28S and 18S subunits as well as the internal transcribed spacers (ITS1 and ITS2). The subunits are transcribed into RNA but not translated into proteins. The internal spacers are also transcribed but snipped out before the subunits are assembled.

- Translation Elongation Factor 1-alpha (EF-1a), RNA polymerase II subunits RPB1 and RPB2, Beta-tubulin and Histone H3 are some common nuclear genes used.

- Mitochondrial genes such as cytochrome c oxidase (e.g. subunit I, COI) and the 16S rRNA.

Do you want the regions to be non-coding or regulatory?

We should frame this question in a different way... Do we want the regions to be highly conserved or highly variable? (Non-coding regions tend to be highly variable and coding regions tend to be conserved.) For organisms that are very closely related, highly conserved regions will have few informative characters; in other words, too many of the subjects will have identical sequences. Conversely, regions that are highly variable will be virtually impossible to align in distantly related species. So, the type of region you choose depends on the species being studied.

Often genes are combinations of both types, coding and non-coding (exons and introns respectively). The coding regions are able to align well (since they are conserved) and the non-coding regions provide the phylogenetically informative characters. The figure below shows the ITS region, a widely used gene region. Using the primers ITS1 and ITS4 produces a fragment that contains part of the 18S and 28S subunits (coding), the entire 5.8S subunit (coding) and the two non-coding spacers between them.

How many regions to try and construct a genetic clock?

I am not really sure about this. I know there needs to be a way to calibrate the clock, meaning there needs to be a known time frame and a known number of substitutions. I also know that different genes evolve at different rates, so a molecular clock may only consider one gene. Other than that, I am not very familiar with the molecular clock techniques.

How many regions to determine the amount of genetic diversity, within and between populations of a single species?

Genetic diversity studies are somewhat different than phylogenetics and I am not that familiar with them yet. However, I am expecting to start a diversity study using microsatellites later this spring. So maybe more information later.

HBD




This message is a reply to:
 Message 8 by Tanypteryx, posted 01-05-2016 3:31 PM Tanypteryx has acknowledged this reply

  
herebedragons
Message 11 of 288 (775866)
01-06-2016 12:49 AM
Reply to: Message 9 by Taq
01-05-2016 5:38 PM


Re: Introduction
Substitutions in the first two bases of a codon are much more likely to cause a detrimental mutation and be selected against. Mutations in the 3rd base may not change the amino acid sequence at all.

The question I have for HBD is if synonymous and non-synonymous mutations are weighted differently in this method.

I am not aware of an alignment program that weights the third codon position differently. However, what they do is convert the codons to amino acids and then align the amino acids. A scoring matrix is used to weight the amino acid pairings, where identical amino acids (the result of synonymous substitutions) have high scores and non-synonymous substitutions have lower scores that depend on how similar the two amino acids are (the alignment algorithm tries to maximize the alignment score). Below is an example of such a matrix (BLOSUM62).

HBD




This message is a reply to:
 Message 9 by Taq, posted 01-05-2016 5:38 PM Taq has responded

Replies to this message:
 Message 12 by Taq, posted 01-06-2016 12:03 PM herebedragons has responded

  
herebedragons
Message 13 of 288 (776125)
01-08-2016 11:43 PM
Reply to: Message 12 by Taq
01-06-2016 12:03 PM


Re: Introduction
Perhaps I am being overly critical or getting the terminology wrong, but in my understanding a synonymous mutation is one that results in the same amino acid.

No, you are exactly right. I just didn't do a good enough job explaining how it is handled.

Maximum-likelihood calculations do not take into account codon position, but it is not necessary to do so; there is a probability associated with an A mutating into a G regardless of its position. Whether the mutation results in a viable or a non-functioning protein will depend on what position the mutation occurred in but for the purposes of determining phylogenetic relationships, the important thing is that the mutation occurred and that it can be aligned properly.

The important part of the process is aligning the sequences so that the codons do line up and third codon positions are compared to third codon positions (as well as the other 2 positions, but you specifically asked about the third position).

Consider the following protein coding sequences that we want to align and infer their relationship:

1 - CGT GGG AAA
2 - CGG GGA AAA

If you tried to align them like this, the algorithm might insert a gap so that the nucleotides line up better. It could look like this:

1 - CGT GGG AAA
2 - CG- GGG AAA A

which would be an inappropriate alignment. Instead, the codons are kept together and converted into the appropriate amino acids and then aligned. So, after converting to amino acids the alignment looks like this

1 - Arg - Gly - Lys
2 - Arg - Gly - Lys

The chart I presented describes how the alignment algorithm determines whether to align 2 non-synonymous substitutions or insert a gap. (Synonymous mutations would be substituting the same amino acid eg.: Trp --> Trp = 11). And you are exactly right, a polar amino acid is more likely to align with another polar amino acid than it is to a hydrophobic amino acid.

Once the alignment is done, it is converted back into codons and then the phylogenetic algorithm evaluates the sequences base by base, treating each base as an individual character. So our example becomes

char#123456789
1 - CGTGGGAAA
2 - CGGGGAAAA

The likelihood is calculated for character 1 (likelihood of C --> C); then character 2 (G --> G); character 3 (T --> G); etc... Since each character is evaluated independently, its position within the codon is irrelevant.
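Here is the same example as a short Python sketch, in case it helps to see the two views of the data side by side: codons translated to amino acids for alignment, then individual bases compared as separate characters for the phylogenetic analysis. The codon table only contains the five codons that occur in the example, and the function names are just mine.

```python
# Only the codons that appear in the example (standard genetic code).
CODON_TABLE = {"CGT": "Arg", "CGG": "Arg", "GGG": "Gly", "GGA": "Gly", "AAA": "Lys"}


def translate(seq: str) -> list:
    """Translate an in-frame nucleotide sequence into amino acids."""
    return [CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3)]


def differing_characters(a: str, b: str) -> list:
    """1-based positions at which two aligned sequences differ."""
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


seq1 = "CGTGGGAAA"
seq2 = "CGGGGAAAA"

print(translate(seq1), translate(seq2))   # what the aligner compares
print(differing_characters(seq1, seq2))   # what the tree-building step sees
```

Both sequences translate to Arg - Gly - Lys, so the amino acid alignment is trivial, while at the nucleotide level characters 3 and 6 differ. Those two characters are the only ones that carry any signal for the tree.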

I hope that better explains how coding sequences are handled in phylogenetics. It looks like I need to plan on explaining the whole alignment process better since it is such an important part of the whole process and seemingly not well understood.

HBD

Edited by herebedragons, : typo




This message is a reply to:
 Message 12 by Taq, posted 01-06-2016 12:03 PM Taq has not yet responded

  
herebedragons
Message 15 of 288 (776592)
01-16-2016 4:57 PM


Bayesian Inference I
Maximum-likelihood evaluates trees based on the probability that evolution would produce the observed data. Bayesian inference evaluates trees based on their posterior probability, which is the probability that the tree is true given the data, based on an evolutionary model and taking into consideration the prior probability. To get a feel for how the Bayesian method works let’s look at an example.

You are given a bag of coins and told that 1/2 of the coins are fair (50% heads) and 1/2 are biased (75% heads). One coin is selected from the bag at random and you wish to consider the two alternative hypotheses that either the coin is fair or it is biased. The coin is flipped 10 times and each time it falls on heads. Now we can apply likelihood to determine whether our data (10 heads) supports one hypothesis over the other.

The likelihood is the probability that the data would have arisen from a given hypothesis, i.e. the coin is fair. With 10 heads the likelihood is the product of the probabilities of each toss, or 0.5^10 = 0.00098. This does not mean there is a 0.1% chance the coin is fair; it means there is a 0.1% chance of this specific outcome for a fair coin. That is, there is a 0.001 probability of the data given the specific hypothesis.

The probability of the data given the hypothesis that the coin is biased is 0.75^10 = 0.0563, which is still a really low number and tells us that even under the biased hypothesis this outcome is unlikely. What counts, however, is the comparison of the likelihoods of the competing hypotheses. The ratio of likelihoods is 0.0563/0.00098 = (0.75/0.5)^10, or about 57.7, so the data are roughly 58 times as probable under the biased hypothesis as under the fair hypothesis. This likelihood ratio is usually expressed as a natural logarithm and is a measure of support for one hypothesis over another. In this example, ln(57.7) = 4.05, which is well above the commonly used threshold of 2.0 (which approximates the traditional P < 0.05 confidence level). We would conclude that the data strongly support the conclusion that the coin is biased.
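The numbers in the coin example are easy to reproduce. This is a minimal Python sketch; the function name `likelihood` is mine, and it gives the probability of one specific sequence of tosses (no binomial coefficient), which is all that is needed because that factor would be the same under both hypotheses and cancel in the ratio.

```python
from math import log


def likelihood(p_heads: float, heads: int, tosses: int) -> float:
    """Probability of one specific sequence with the given number of heads."""
    return p_heads ** heads * (1 - p_heads) ** (tosses - heads)


fair = likelihood(0.5, 10, 10)      # Pr(10 heads | fair coin)
biased = likelihood(0.75, 10, 10)   # Pr(10 heads | biased coin)
ratio = biased / fair

print(round(fair, 5), round(biased, 4))
print(round(ratio, 1), round(log(ratio), 2))
```

Changing the number of heads (say 7 out of 10) shows how quickly the support for the biased hypothesis drops off.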

With likelihood we are able to deduce that the hypothesis that the coin is biased is more likely to produce the observed data than the hypothesis that the coin is fair, but we did not calculate the actual probability that the coin was biased. Bayes' theorem states that the probability of a hypothesis given some data is equal to the probability of the data given the hypothesis (the likelihood) times the prior probability, divided by the total probability of the data (summed over all hypotheses):

Pr(H | D) = Pr(D | H) x Pr(H) / Pr(D)

Where:

Pr(H | D) is the probability of the hypothesis given the data (the posterior probability)

Pr(D | H) is the probability of the data given the hypothesis (the likelihood)

Pr(H) is the probability of the hypothesis (the prior probability)

Pr(D) is the probability of the data (total probability)

Now we can apply this theorem to our coin flipping example to determine what the probability is that we have selected a biased coin given the data of flipping 10 heads in a row. Using the same model as before, we can determine Pr(D | H) (the likelihood), which is 0.75^10 or 0.0563. Pr(H), the prior probability that the coin is biased, is 0.5, so the numerator is 0.0563 x 0.5 = 0.0281.

The denominator is a bit more difficult to determine. Here, there are only 2 possible hypotheses: that the coin is fair or that it is biased. There is a 0.5 chance it is fair, and if it is, the probability of getting 10 heads is 0.5^10. There is also a 0.5 chance the coin is biased, and if it is, the probability of getting 10 heads is 0.75^10. Summing these gives the total probability of the data: 0.5 x (0.5^10 + 0.75^10) = 0.0286. This means there is a 2.9% chance of selecting a coin at random and then obtaining 10 heads in 10 tosses.

Combining these numbers gives us 0.0281 / 0.0286 = 0.98 or 98%. This means that given the data, the priors and the model, there is a 98% chance that the coin is biased and only a 2% chance that the coin is fair. By accounting for the observation of flipping 10 heads, the probability that a biased coin was selected went from 0.5 to 0.98. The notable aspect of the Bayesian approach is that the starting information matters. If the original sample from which the coin was selected had contained only 1% biased coins, the posterior probability would be only 0.37.

So in this case, even after observing 10 heads, it would still be better to bet against the coin being biased.
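Both posterior values can be checked with a few lines of Python. This is just a sketch of the two-hypothesis coin example (fair at 50% heads versus biased at 75% heads); the function name and its default arguments are mine.

```python
def posterior_biased(prior_biased: float, heads: int = 10,
                     p_fair: float = 0.5, p_biased: float = 0.75) -> float:
    """Posterior probability that the coin is biased after `heads` straight heads."""
    like_biased = p_biased ** heads              # Pr(D | biased)
    like_fair = p_fair ** heads                  # Pr(D | fair)
    numerator = like_biased * prior_biased       # likelihood x prior
    # Total probability of the data, summed over both hypotheses.
    total = numerator + like_fair * (1 - prior_biased)
    return numerator / total


print(round(posterior_biased(0.5), 2))    # half the coins in the bag are biased
print(round(posterior_biased(0.01), 2))   # only 1% of the coins are biased
```

Trying a few other priors makes the point nicely: the same 10 heads can leave you anywhere from nearly certain to quite doubtful, depending on what was in the bag to begin with.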

Now we need to apply these principles to phylogenetics. The data corresponds to a character state matrix and the hypotheses correspond to the alternative tree topologies. Thus Bayes’ theorem takes the form:

Pr(Tree | Data) = Pr(Data | Tree) x Pr(Tree) / Pr(Data)

The prior probability of a particular tree is the probability that among all possible tree topologies it is the correct one. If we believed that all trees were equally likely, then we could assign a flat prior, where the prior probability of a tree equals one divided by the number of trees. The probability of the data given the tree is the likelihood, calculated as described earlier for maximum-likelihood.

The difficulty comes in calculating the probability of the data, which requires a summation over all possible tree topologies. Maximum-likelihood and parsimony can assign a score to a tree in isolation, but a Bayesian posterior probability cannot be assigned to a single tree without taking into account all possible trees. Instead of calculating the posterior probability for each individual tree directly, Bayesian analysis uses a technique known as Markov chain Monte Carlo (MCMC) to compare the relative posterior probabilities of different trees, which is a concept I will discuss in the next post. The probability of the data, Pr(Data) (the denominator of the Bayesian equation), will be the same across all possible trees and will therefore cancel out when the ratio is calculated.
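To spell out that cancellation, here is the ratio of posterior probabilities for two trees written out in LaTeX. T_1 and T_2 are just my labels for any two topologies, and D stands for the data; the shared denominator Pr(D) drops out, so only likelihoods and priors are needed to compare the trees.

```latex
\frac{\Pr(T_1 \mid D)}{\Pr(T_2 \mid D)}
  = \frac{\Pr(D \mid T_1)\,\Pr(T_1) \,/\, \Pr(D)}
         {\Pr(D \mid T_2)\,\Pr(T_2) \,/\, \Pr(D)}
  = \frac{\Pr(D \mid T_1)\,\Pr(T_1)}{\Pr(D \mid T_2)\,\Pr(T_2)}
```

With a flat prior, Pr(T_1) = Pr(T_2), and the ratio reduces to the likelihood ratio of the two trees.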

I am going to have to stop here and discuss the next part of the process later. I apologize that this post was so theoretical and I thought about not bringing it up, but I felt it is actually an important part of the Bayesian analysis and without this background understanding, it may not make as much sense as to why the method has developed the way it has. I am hoping that it becomes clearer how Bayes' theorem influenced the development of the methodology and why the process works the way it does.

HBD

Reference:

Most of this post was based on material presented in (a highly recommended book, by the way):

Baum, D. A., Smith, S. D. (2013). Tree thinking: An introduction to phylogenetic biology. Roberts and Company Publishers, Greenwood Village, CO., USA.

Edited by herebedragons, : spelling


Whoever calls me ignorant shares my own opinion. Sorrowfully and tacitly I recognize my ignorance, when I consider how much I lack of what my mind in its craving for knowledge is sighing for... I console myself with the consideration that this belongs to our common nature. - Francesco Petrarca

"Nothing is easier than to persuade people who want to be persuaded and already believe." - another Petrarca gem.

Ignorance is a most formidable opponent rivaled only by arrogance; but when the two join forces, one is all but invincible.


Replies to this message:
 Message 16 by Dr Adequate, posted 01-17-2016 2:14 AM herebedragons has responded

  
herebedragons


Message 19 of 288 (776636)
01-17-2016 5:42 PM
Reply to: Message 17 by RAZD
01-17-2016 8:57 AM


Re: Bayesian Inference I
Well I would think that the probability of the first two of these

   |          |          |
  / \        / \        /|\
 /\  |      |  /\      / | \

would be the same but different from the probability of third, (which I would expect to be lower).

Dr. A is right; only bifurcating trees are evaluated for likelihood.

The assumption is that all phylogenies can be resolved given enough information, so the tree space will only contain bifurcating, fully resolved trees. Tree number 3 above would represent an unresolved trichotomy, not a claim that three lineages diverged from the same ancestor at once. This would occur when there are not enough differences between two or more taxa to completely resolve the relationship.

The question regarding prior probabilities is: is there any reason to favor tree 1 over tree 2 before even analyzing the data? If there is, then a prior probability could be assigned that gives tree 1 an increased posterior probability.

HBD




This message is a reply to:
 Message 17 by RAZD, posted 01-17-2016 8:57 AM RAZD has responded

Replies to this message:
 Message 21 by RAZD, posted 01-18-2016 8:54 AM herebedragons has responded

  
herebedragons


Message 20 of 288 (776643)
01-17-2016 7:47 PM
Reply to: Message 16 by Dr Adequate
01-17-2016 2:14 AM


Re: Bayesian Inference I
Why don't we believe that?

We do consider that all trees have an equal prior probability unless we have information or prior knowledge of the system that indicates otherwise. For example, if we knew two taxa were sister taxa, we could include constraints that favor topologies that include those taxa as sister groups (or as a monophyletic group).

We can also assign priors that reflect our prior knowledge of substitution rates (molecular clock rates), tree age, and branch lengths. So if we had prior knowledge that two lineages diverged a certain amount of time ago based on fossil evidence, we would want to favor topologies that reflect that divergence time. The posterior probability doesn't just take into account topology but also other factors that could influence the phylogeny. MrBayes (probably the most widely used Bayesian inference program) includes 34 prior parameters that can be set.

I am not all that familiar with all these subtleties (like exactly how and when to use them), but the point is that Bayesian theory allows us to take into account prior knowledge of a system and that prior knowledge can affect the outcome of the analysis.

HBD




This message is a reply to:
 Message 16 by Dr Adequate, posted 01-17-2016 2:14 AM Dr Adequate has not yet responded

  
herebedragons


Message 22 of 288 (776687)
01-18-2016 12:08 PM
Reply to: Message 21 by RAZD
01-18-2016 8:54 AM


Re: Bayesian Inference I
I would think the first is more probable than the second as I would not expect all the "activity" to be on one branch.
      
       |          vs           |
       a                       a
      / \                     / \
     /   \                   /   \
    /     \                 b     \
   /       b               / \     \
  c       / \             c   \     \
 / \     /   \           / \   \     \
d   e   f     g         d   e   f     g

Thoughts?

Well, let's think about what this means phylogenetically. In the figure on the left, both lineages have diverged at roughly the same rate. In the figure on the right, lineage 'd' has diverged at a much higher rate than has lineage 'g'. What a priori belief about evolution or about this system would cause us to believe that the first tree is more likely? One possibility would be that taxa 'd' and 'e' are Australian taxa and 'f' and 'g' are North American taxa. This prior knowledge may cause us to want to give more weight to the first tree.

However, I am not sure how you would decide how to set the prior in this case. Is it 100% certain that given the above prior knowledge that tree 1 is the correct tree? No, it is definitely not 100% certain. It is quite plausible that a North American species, 'f', is more closely related to the Australian species 'd' and 'e' than it is to another N. American species 'g'.

Not that I have read a truly representative sample of the literature regarding Bayesian inference, but my impression is that priors are not typically applied to topology but to models of evolution (maybe Genomicus can weigh in on this and say if that is his impression as well). The maths behind these models can be rather complicated and honestly, I am just not familiar enough with some of those finer points. It is something I need to understand better, so maybe I can talk about models more at a later time.

For now, I think the important "take-away" is that Bayesian statistics allows us to incorporate our prior knowledge of a system into the calculation of the posterior probability. Because of the way the formula is structured, the posterior probability becomes the probability that our hypothesis is true. Maximum-likelihood does not calculate the probability of the hypothesis, but the probability that the hypothesis could produce the given data - which is kind of backwards from what we really want to know. We want to know if our hypothesis is correct or what the probability is that it is correct. That is the big advantage of Bayesian statistics over ML.

Bottom line here, I would consider all topologies to have equal prior probability. At this point, I don't think I could legitimately give weight to one topology over another because I just wouldn't know how to determine how much weight to assign.

HBD




This message is a reply to:
 Message 21 by RAZD, posted 01-18-2016 8:54 AM RAZD has responded

Replies to this message:
 Message 23 by RAZD, posted 01-18-2016 12:45 PM herebedragons has not yet responded

  
herebedragons


(1)
Message 226 of 288 (796137)
12-23-2016 9:38 AM
Reply to: Message 212 by Taq
12-22-2016 1:01 PM


The purpose of phylogenetics
caffeine writes:

I'm taking issue with the point I've seen Taq and others make more than once on these forums, that phylogenetics by itself is a test of common ancestry. Since we don't reject common ancestry when we cannot produce a well supported phylogeny, it seems dishonest to say we're testing evolution this way.

But we do produce well supported phylogenies all over the place. I have already cited the cytochrome c example.

I agree with Caffeine on this, and we have had this discussion before, but I have been too busy to participate much and really didn't finish up our last exchange. By itself, phylogenetics is not a true test of common ancestry. Here is my reasoning:

1) When we do a phylogenetic analysis we don't have a null hypothesis - especially one that says that there is no common ancestor for the taxa. A true test of common ancestry would involve a hypothesis such as "these taxa are related by common ancestry" and a null hypothesis such as "these taxa are not related by common ancestry." Instead the question that a phylogeny asks is "What is the BEST hypothesis as to how these taxa are related by descent."

2) As much as it sounds wrong, common ancestry IS a basic assumption of phylogenetics. But before someone like vaporwave completely misunderstands this statement, I am using assumption in the way scientists use it: An assumption is a premise that must be true in order for your conclusions to be accurate. It is NOT something taken for granted, or taken without evidence, or taken by faith, or a wild-ass guess... that is not how we use assumptions in science. You must always be ready to justify your assumptions and sometimes even test them.

3) The job of a phylogenetic program is to create phylogenetic trees... that's what it does. You will NEVER get a result of "No suitable tree exists." no matter what data set you use. You may have a lot of unresolved branches, but we often do with real biological data anyway. A phylogenetic program builds and evaluates phylogenetic trees from a given data set - that's it.

4) Phylogenetic signal or phylogenetic support is a statistical methodology that is used like other statistical methodologies - essentially they state how likely the data is to be non-random. When you see 95% support for a particular branch, that means there is a 5% chance that the data is just random and only appears to fit that pattern (phylogenetic support is a little more nuanced than that, but that is essentially what the support value tells us). No tree has 100% support for all branches, it just doesn't happen. In fact, I would say that it is more typical for 1/3 or more of the branches to have support values below the reporting threshold (usually 70%) and when they don't report them they can be anywhere below that threshold, even say 10%.

5) Related to #4, what threshold of phylogenetic signal or phylogenetic support would cause the researcher to conclude that there is no common ancestor between the two taxa? There is NOTHING in the tree itself that would lead you to conclude there is no common ancestor. I think this is the strongest point for my case.

The key to my argument is that phylogenetic trees are not support for common ancestry by themselves, but must be coupled with other data to be useful. And that is where phylogenetics provides real support for common ancestry - we can produce meaningful reconstructions!

For example, I am certain that I could develop a data set for cars that would result in a decent phylogram. But it would be completely meaningless from any evolutionary perspective. For instance, it might group a 1927 Ford Model T with a 2010 Ford Focus because they are both Fords, have 4-cylinder engines and the same horsepower rating. But that is completely meaningless from an evolutionary standpoint - that is, they are not related in time or space.
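
Here is a rough sketch of the car example, and of the earlier point that a tree-building program never answers "No suitable tree exists." It is not how likelihood or Bayesian programs work: it uses plain average-linkage clustering on the number of differing characters as a simple stand-in. The character matrix, the character names and the two trucks are all hypothetical.

```python
import random
from itertools import combinations

# Hypothetical characters (1 = present, 0 = absent):
# is_ford, four_cylinder, passenger_car, is_truck, has_airbags
cars = {
    "Ford Model T (1927)": (1, 1, 1, 0, 0),
    "Ford Focus (2010)":   (1, 1, 1, 0, 1),
    "Truck X":             (0, 0, 0, 1, 1),
    "Truck Y":             (0, 0, 0, 1, 0),
}


def hamming(a, b):
    """Number of characters in which two taxa differ."""
    return sum(x != y for x, y in zip(a, b))


def leaves(tree):
    """Flatten a nested-tuple tree into a list of leaf names."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def cluster_distance(t1, t2, matrix):
    """Average pairwise distance between the leaves of two clusters."""
    pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
    return sum(hamming(matrix[a], matrix[b]) for a, b in pairs) / len(pairs)


def build_tree(matrix):
    """Join the two closest clusters until one tree remains; it never refuses."""
    clusters = list(matrix)  # start with each taxon as its own cluster
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda p: cluster_distance(p[0], p[1], matrix))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append((a, b))
    return clusters[0]


car_tree = build_tree(cars)
print(car_tree)

# Pure noise still yields a fully resolved tree.
rng = random.Random(0)
noise = {f"taxon{i}": tuple(rng.randint(0, 1) for _ in range(8)) for i in range(6)}
noise_tree = build_tree(noise)
print(noise_tree)
```

The Model T and the Focus come out as sister "taxa", and the random matrix gets a fully resolved tree as well. Neither result says anything about descent, which is why the tree by itself cannot be the test.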

Biological phylogenies, on the other hand, produce meaningful hypotheses that can be used to make predictions and further develop understanding. For example, it seems just about as daft to group hippos and whales together as it does to group the model T and the Ford Focus together, unless you know more about the biology of the organisms. The whale - hippo relationship also makes specific predictions about what intermediate forms should be found and where we expect to find them in relation to time and space.

Phylogenetics also makes predictions about physiological characteristics of closely related taxa as opposed to more distantly related taxa. We use this information to make investigating biological function more productive. I used a phylogenetic approach to look for disease resistance in a dry bean diversity panel. It improved my success rate 4-fold, although I never really found anything that was suitably resistant.

Vaporwave mentioned that molecular phylogenies were overturning long-standing morphological phylogenies. In this he is largely correct, especially in plants and fungi. The problem with morphology is that it can be very difficult to know which traits are evolutionarily important and should therefore be chosen to construct phylogenies. Molecular work has made those relationships much less subjective (there is still some subjective nature to molecular work, but much, much less so than with morphological data). Phylogenetics has given us significant insight into how evolutionary mechanisms work. Again, they lead to hypotheses about evolutionary mechanisms that can be tested and verified.

Bottom line: being able to construct phylogenetic trees from biological data is not in and of itself a true test of common ancestry. It is the meaningful insights that come from phylogenetic reconstructions and the resulting hypothesis testing that provides support for common ancestry.

Maybe this is semantics, or a highly technical perspective on the subject, but I think it is important to recognize the weaknesses and limitations of our methodologies as well as the strengths. For the most part you are doing great at explaining the strengths of phylogenetics and have given good examples. Good job on the protein alignment - I thought about doing something like that but it was just too time consuming.

And just to be clear to all, despite my minor disagreement with Taq on this, vaporwave's overall assessment of phylogenetics, as presented on this thread, is uninformed and just plain wrong, even though he may have some of the general points more or less right. This is so typical of creationists who get their biological training from creationist websites and books rather than learning any real biology.

Merry Christmas and have a happy New Year...

HBD




This message is a reply to:
 Message 212 by Taq, posted 12-22-2016 1:01 PM Taq has not yet responded

Replies to this message:
 Message 233 by vaporwave, posted 12-23-2016 3:39 PM herebedragons has not yet responded

  
herebedragons


(1)
Message 227 of 288 (796140)
12-23-2016 9:51 AM
Reply to: Message 225 by RAZD
12-23-2016 8:36 AM


Re: metaphysics and morphology and macroevolution
no alternate hypothesis or theory provides the detail explanation for the observed objective empirical evidence that evolution theory provides. No alternative hypothesis\theory has made testable predictions that don't falsify them, or they have failed entirely to make testable predictions.

To me, this is by far the strongest point in favor of evolutionary theory. I do feel that evolutionary theory has some weak areas... areas that are pretty much biological black boxes, and I would love for a revised or new theory to come out that could fill in those gaps. And I would especially love to be the one who made those revisions. But for now the ToE is the absolute BEST explanation for the diversity of life on earth that we have. There isn't even a close second - there isn't even a semi-viable alternative.

HBD




This message is a reply to:
 Message 225 by RAZD, posted 12-23-2016 8:36 AM RAZD has acknowledged this reply

  
herebedragons


(2)
Message 242 of 288 (796181)
12-24-2016 8:55 AM
Reply to: Message 241 by RAZD
12-24-2016 8:35 AM


Re: templates and peripheral features
Because it sure looks to me like you are saying "independently evolved from existing DNA from their common therian ancestor by mutation and selection" while trying desperately to make it sound like not-evolution.

Yea, this tactic has always puzzled me and it's pretty common. Just throw in a few buzz words like 'within a kind', 'designed', 'adapted' (instead of evolved) and 'irreducibly complex' and you have the creationist idea of a complete dismantling of the ToE.

Like this:

Independently adapted within a kind from a commonly designed DNA template through modification of an existing irreducibly complex system

Have a Merry Christmas and a happy New Year RAZD

HBD




This message is a reply to:
 Message 241 by RAZD, posted 12-24-2016 8:35 AM RAZD has acknowledged this reply

  
herebedragons


(2)
Message 283 of 288 (796550)
12-31-2016 9:34 AM
Reply to: Message 282 by Dr Adequate
12-30-2016 5:30 PM


Re: What Design Actually Looks Like
Good post, Dr. A

You don't get to mix and match. That's evolution for you. 'Cos it produces nested hierarchies.

Another way to put this is that evolution puts constraints on future evolutionary processes. In other words, once on a particular lineage or branch, future generations are constrained to be part of that lineage - they cannot jump lineages. This was really the big insight provided by the Lenski long-term experiment - that each evolutionary step determines the range of possibilities for future evolutionary processes, i.e. constraints. Design has no such constraints. Sure, a designer could have designed life in such a way (in fact, personally I believe the "designer" did design life to evolve, i.e. used evolution to design life), but there are no constraints on the process and designed objects can "jump lineages." And as you have aptly demonstrated, the evidence from human designs does not support the design hypothesis ("nested templates"?) in biological systems.

HBD

Edited by herebedragons, : typo




This message is a reply to:
 Message 282 by Dr Adequate, posted 12-30-2016 5:30 PM Dr Adequate has not yet responded

  
herebedragons


(2)
Message 284 of 288 (796553)
12-31-2016 10:04 AM
Reply to: Message 279 by vaporwave
12-30-2016 1:06 PM


Re: we have motive (survival) means (evolution) and opportunity (proximity)
Your initial premise was that phylogenetics is not a test of common ancestry, yet you continue to discuss how phylogenetics supports design and how problems with phylogenetics point away from common ancestry.

It is important that we test our hypotheses against competing hypotheses. At this point, common ancestry is a well established and widely accepted part of evolutionary theory. When we test hypotheses about descent, we test opposing theories that all involve common descent. However, on this thread, you wish to argue that design is a better hypothesis than common descent. What we need to do is compare the design hypothesis to the hypothesis of common descent.

Since you are claiming that phylogenetics is not an adequate test to compare these hypotheses because they would give the same results regardless of the process, how would you propose to test these competing theories? Just claiming that common descent has some problem areas where the relationships are uncertain or ambiguous is not sufficient to support your design hypothesis - design is NOT the null hypothesis.

The test needs to be designed so that we can compare which hypothesis or model explains the data better. Remember that a theory is a framework that explains why we observe a particular pattern or phenomenon. Evolutionary theory is the best explanation of why we observe nested hierarchies in biological systems. Your claim is that "nested templates" provides a better framework for understanding this pattern/phenomenon. Provide us a test that can be used to directly compare these two models.

HBD




This message is a reply to:
 Message 279 by vaporwave, posted 12-30-2016 1:06 PM vaporwave has not yet responded

  

Copyright 2001-2018 by EvC Forum, All Rights Reserved
