This is, of course, relevant to that topic, but it turns out there is enough information and resources on the web to have some fun with this. This topic is to spin off this area alone from the convergent topic.
I have a rough idea of what is going on with the CLUSTALW program, but I would like a 3 or 4 paragraph explanation, if WK (or someone else) could give that. The help is obviously aimed only at those "in the business" and is heavily jargon loaded.
I presume that the program attempts to "line up" the input sequences as best as it can. It then calculates a percentage of match. (The PIM "percentage identity matrix").
I would like some hint of what it means, how it may be correctly used and how it may be incorrectly used. What anomalies might one create?
Why is the percentage called a percentage identity matrix?
In Message 301 (one of the last messages in the precursor thread to the one Ned referenced in the OP) I talked about the external similarities of the marsupial and placental mole. I also made this prediction:
I would expect (he says sticking his neck out way beyond his knowledge!) that the marsupial mole is genetically much more similar to another marsupial - probably any marsupial - than it is to the placental mole.
With all the info WK has posted (and I think we should all offer him our thanks for showing us this great new toy :)) I figured I could actually test out my prediction, and I offer up my attempt for WK or any other expert to point out if I've screwed up or to comment on my results.
There is one very strange thing that makes me wonder if I messed something up - more on that later.
First the animals I picked. Obviously there are the placental and marsupial moles, then I needed another marsupial to compare against and, although not included in my prediction, I decided to include another placental mammal. This was because I expected the two placentals to be closer to each other than to either of the marsupials.
Since this was all about convergent evolution (the two moles) I wanted to pick the other animals to be fairly different from moles. For the marsupial I picked our old friend the thylacine and for the placental I picked the giraffe - both quite different to moles I think you'll agree!
While investigating placental moles I discovered there are lots of them. The one which seems to most closely resemble the marsupial mole is one of the golden moles - I went for the Cape golden mole. For good measure I also threw in the European mole.
So I ended up with this list (I got the Latin names from Googling):
Marsupial Mole (Notoryctes typhlops)
Cape Golden Mole (Chrysochloris asiatica)
European Mole (Calcochloris obtusirostris)
Giraffe (Giraffa camelopardalis)
Thylacine (Thylacinus cynocephalus)
Using Entrez I got the following results for Cytochrome b:
>Marsupial Mole Cytochrome b
MVNLRKTHPLMKIINHSFIDLPAPSNISAWWNFGSLLGICLIIQILTGLFLAMHYTSDTYTAFSSVAHIC
RDVNYGWLIRNLHANGASMFFMCLFLHVGRGIYYGSYLYKETWNIGVILLLTVMATAFVGYVLPWGQMSF
WGATVITNLLSAIPYIGTTLVEWIWGGFSVDKATLTRFFAFHFILPFIIAALAIVHLIFLHETGSNNPSG
INPDADKIPFHPYYTIKDALGLLFLLLLLLSLALFSPDLLGDPDNFSPANPLNTPPHIKPEWYFLFAYAI
LRSIPNKLGGVLALLASIMILLIIPLLHTSNQRSMTFRPISQILYWILAANLLVLTWIGGQPVEQPFIII
GQLASILYFLLIILLMPLAGLFENYMLEPKW
>Cape Golden Mole Cytochrome b
MTNIRKTHPLLKIINHSFIDLPAPSNISAWWNFGSLLGLCLIIQILTGLFLAMHYTSDTSTAFSSVTHIC
RDVNNGWLIRYLHANGASMFFICLFTHVGRGIYYGSYLFLETWNIGIILLFAVMATAFMGYVLPWGQMSF
WGATVITNLLSAIPYIGTTLVEWIWGGFSVDKATLTRFFAFHFILPFIVAALTMVHLLFLHETGSNNPSG
LNSDADKIPFHPYYTVKDLLGVLMLLLFLLLLTLFSPDLLGDPDNYIPANPLNTPPHIKPEWYFLFAYAI
LRSIPNKLGGVLALVFSILILAAFPLLHMSHQRSLMFRPLSQCMFWILVADLFTLTWIGGQPVEHPFIII
GQLASILYFTIILVLMPVSSMIENRLLKW
>European Mole Cytochrome b
MTNIRKTHPLMKIVNSSFIDLPAPSNISSWWNFGSLLGICLILQILTGLFLAMHYTSDTMTAFSSVTHIC
RDVNYGWLIRYLHANGASMFFICLFLHVGRGLYYGSYMFMETWNIGVLLLFAVMATAFMGYVLPWGQMSF
WGATVITNLLSAIPYIGTDLVEWIWGGFSVDKATLTRFFAFHFILPFIIAALAGVHLLFLHETGSNNPSG
LSSDTDKIPFHPYYTIKDILGALILIMALSSLVLFSPDLLGDPDNYIPANPLNTPPHIKPEWYFLFAYAI
LRSIPNKLGGVLALVFSILVLALMPFLHTSKQRSMMFRPISQCLFWLLVADLFTLTWIGGQPVEHPFIII
GQLASILYFALILMLMPLASLMENNLLKW
>Giraffe Cytochrome b
MINIRKSHPLMKIVNNALIDLPAPSNISSWWNFGSLLGICLILQILTGLFLAMHYTPDTTTAFSSVTHIC
RDVNYGWIIRYMHANGASMFFICLFMHVGRGLYYGSYTFLETWNIGVILLFTVMATAFMEYVLPWGQMSF
WGATVITNLLSAIPYIGTNLVEWIWGGFSVDKATLTRFFAFHFILPFIIMALTMVHLLFLHETGSNNPMG
IPSDMDKIPFHPYYTIKDILGALLLILVLMLLVLFTPDLLGDPDNYTPANPLNTPPHIKPEWYFLFAYAI
LRSIPNKLGGVLALVLSILILIFMPLLHTSKQRSMMFRPFSQCLFWILVADLLTLTWIGGQPVEHPFIII
GQLASIMYFLIILVLMPVTSAIQNNLLKW
>Thylacine Cytochrome b
MIIMRKTHPLLKTINHSFIDLPAPSNISAWWNFGSLLGICLVIQILTGLFLAMHYTSDTSTAFSSVAHIC
RDVNYGWLIRNLHANGASMFFMCLFLHVGRGIYYGSYLYKETWNIGVILLLTVMATAFVGYVLPWGQMSF
WGATVITNLLSAIPYIGTTLAEWVWGGFAVDKATLTRFFAFHFILPSIVTARATVHLLFLHETGSNNPSG
INPDSDKIPFHPYYTIKDALGLMLLLLPLLPLALFSPDLLGDPDNFSPANPLNTPPHIKPEWYFLFAYAI
LRSIPNKLGGVLALLASILILLIIPLLHTSNQRSMMFRPISQTLFWILAANLLTLTWIGGQPVEQPFIII
GQLAIILYFLLIVVLMPLAGLLENYMLEPKW
I then put these into CLUSTALW and got these results:
Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: Marsupial 381 aa
Sequence 2: Cape 379 aa
Sequence 3: European 379 aa
Sequence 4: Giraffe 379 aa
Sequence 5: Thylacine 381 aa
Start of Pairwise alignments
Aligning...
This shows the two most closely related are 1 and 5 (Marsupial Mole and Thylacine), but the second most closely related are 3 and 4 (European Mole and Giraffe). I would have expected the two placental moles to be more closely related to each other than either of them is to the giraffe.
Maybe this is just an artifact of not having enough data sets?
So we found one instance where the "pair" between the marsupial (golden mole) and the placental mole are more closely related (87%) than the 2 marsupial species of moles (83.3%) and the placental mole with the other Marsupial mole (83.6%). But it does seem a little bizarre.
So the most similar animals are the Marsupial mole and Thylacine (the Tasmanian wolf, remember). The second, third and fourth closest relationships are between the three placental animals, and the remaining values are for the various permutations of one placental and one marsupial.
Note that this indicates that the two kinds of placental mole are both more closely related to the giraffe than they are to the marsupial mole.
The only odd thing is that the European mole appears to be more closely related to the giraffe than it is to the Cape golden mole - everything else is in line with what I predicted.
What exactly are you having trouble with? ClustalW is a fantastic program, but most biologists that use it from time to time don't bother to look up the help given that most of the parameters are optimized for the average user's needs. The program does, however, provide a number of parameters to optimize for the specific protein/gene that you are trying to align. For example, if you are doing an intraspecific (within a species) comparison, you might want to give more weight to a neutral nucleotide substitution (one that doesn't change the amino acid sequence) than if you were doing an interspecific comparison. My advice would be not to worry about the myriad options that ClustalW provides - you're not publishing your 'results' and you're likely to mess things up if you change one of the parameters without knowing a WHOLE lot about the comparison you're trying to make.
And the PIM: "percentage identity" = some self-important person's jargon for percentage similarity "matrix" = the % is based on the matrix of parameters that you have provided: gap length penalty, etc.
Oh, and you might want to actually check the results of the alignment - see if the program has gotten the sequences to line up right or if it has inserted some gaps for no good reason. As fantastically complex as this thing is, it sometimes (read, often) makes mistakes. It's still a vast improvement over trying to do it yourself - the Western mind was not meant to look at a sequence of ACTG and see anything other than CAT or TAG or CATTAG.
The ClustalW web site I linked to in the previous thread was chosen mainly because it has a relatively simple interface, the tree drawing was very simple to use and because it was pretty fast.
Probably the most used portal to ClustalW is the one at http://www.ebi.ac.uk/clustalw/ which is hosted by the European Bioinformatics Institute. The EBI site has the full text of a paper which goes into some detail on the various alignment methods used by the ClustalW program.
Anyone really keen might be advised to download a copy of ClustalX, which should also allow you to try some bootstrapping; the download site has versions for a wide variety of OSs. The NJ-plot program is also downloadable and will allow you to fiddle around with the tree a bit more, displaying bootstrap values and branch lengths, or changing your outgroup if you have one.
Some relevant highlights from the paper.
On the distance matrix and pairwise alignments :-
The scores are calculated as the number of k-tuple matches (runs of identical residues, typically 1 or 2 long for proteins or 2 to 4 long for nucleotide sequences) in the best alignment between two sequences minus a fixed penalty for every gap. We now offer a choice between this method and the slower but more accurate scores from full dynamic programming alignments using two gap penalties (for opening or extending gaps) and a full amino acid weight matrix. These scores are calculated as the number of identities in the best alignment divided by the number of residues compared (gap positions are excluded). Both of these scores are initially calculated as percent identity scores and are converted to distances by dividing by 100 and subtracting from 1.0 to give number of differences per site. We do not correct for multiple substitutions in these initial distances. In figure 1 we give the 7x7 distance matrix between the 7 globin sequences calculated using the full dynamic programming method.
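To make the last part of that quote concrete, here is a rough Python sketch of the "identities divided by residues compared, gaps excluded" score and its conversion to a distance. This is not ClustalW's own code. It assumes the two sequences have already been aligned (same length, with "-" marking a gap), and the function names and the two short fragments are made up purely for illustration.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity between two already-aligned sequences.

    Positions where either sequence has a gap are skipped, as in the
    quoted description ("gap positions are excluded").
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # gap positions are not counted at all
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


def identity_to_distance(pid):
    """Divide by 100 and subtract from 1.0: differences per site."""
    return 1.0 - pid / 100.0


def identity_matrix(aligned):
    """Table of pairwise percent identities for a dict of name -> aligned sequence."""
    names = list(aligned)
    return {(p, q): percent_identity(aligned[p], aligned[q])
            for p in names for q in names}


# Two invented, pre-aligned fragments (not real alignment output).
pid = percent_identity("MTNIRK-HPL", "MVNLRKTHPL")
print(round(pid, 2), round(identity_to_distance(pid), 4))
```

Here 9 positions are compared (the gapped one is skipped) and 7 of them match, so the score is about 77.8% and the distance about 0.22. Filling a table with these values for every pair of sequences is one way to picture a percent identity matrix.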
A further worthwhile exercise might be to get the corresponding nucleotide sequences for the proteins we have been looking at and run them through the same procedure.
This message has been edited by Wounded King, 06-09-2005 06:13 AM
Here is a tree based on Cytochrome B protein sequence data from a number of species, including some insects, bacteria and a plant. I decided to use Arabidopsis as the outgroup since it was the only plant sequence I put in.
As well as the sequences I already had for the marsupials, I got a number of sequences by using the Homologene database on Entrez. It pulls up homologues of a gene from a number of different species and will display multiple alignments of the protein products of those genes. It will also display them all in FASTA format, making it very simple to import them into Clustal.
I then ran the data through a local version of ClustalX and used Treeview to produce the tree with the bootstrap values on.
One downside to Clustal for bootstrapping, compared to a more dedicated program like those in the Phylip suite, is that it doesn't produce a record of the trees generated by the bootstrap. ClustalX will only add the bootstrap information to the tree while Phylip's seqboot program will actually generate a file containing the 1000 trees generated.
The description of bootstrapping I gave in the previous thread was totally inaccurate. A more accurate description of the process is given by the author of the Phylip suite of programs:-
Joseph Felsenstein writes:
The bootstrap. Bootstrapping was invented by Bradley Efron in 1979, and its use in phylogeny estimation was introduced by me (Felsenstein, 1985b; see also Penny and Hendy, 1985). It involves creating a new data set by sampling N characters randomly with replacement, so that the resulting data set has the same size as the original, but some characters have been left out and others are duplicated. The random variation of the results from analyzing these bootstrapped data sets can be shown statistically to be typical of the variation that you would get from collecting new data sets. The method assumes that the characters evolve independently, an assumption that may not be realistic for many kinds of data.
In other words a dataset of-
Might become -
When the first column is sampled 3 times, the second once, the third once, the fourth once, the sixth twice, the seventh once, the eighth once, the ninth twice, the eleventh twice and the twelfth once.
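For anyone who wants to play with the resampling step itself, here is a minimal Python sketch of what Felsenstein describes: pick N columns at random with replacement from an alignment of N columns, so some columns are repeated and others dropped. It only builds one resampled data set and does no tree building. The function name and the three short rows are invented for illustration and are not real alignment output.

```python
import random


def bootstrap_alignment(rows, rng):
    """Resample alignment columns with replacement.

    rows: list of equal-length aligned sequences (one string per species).
    rng:  a random.Random instance, passed in so runs can be repeated.
    Returns (new_rows, picks), where picks lists the chosen column indices.
    """
    if not rows or any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("rows must be non-empty and of equal length")
    n = len(rows[0])
    # Draw n column indices; the same index may come up more than once.
    picks = [rng.randrange(n) for _ in range(n)]
    # Use the same picks for every row so whole columns stay together.
    new_rows = ["".join(row[i] for i in picks) for row in rows]
    return new_rows, picks


# Invented toy alignment of 12 columns.
toy = ["MTNIRKTHPLMK", "MVNLRKTHPLLK", "MINIRKSHPLMK"]
resampled, picks = bootstrap_alignment(toy, random.Random(1))
print(sorted(picks))
for row in resampled:
    print(row)
```

Repeating this many times (1000 is the figure mentioned above for seqboot) and building a tree from each resampled data set is what gives the bootstrap values on the branches.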
I'm going in to hospital on Friday morning for a few days but when I get back I'm going to have a go and see if I can produce something like it and then extend it to include the moles and giraffe from the example I posted earlier (and maybe a mole-rat as well).
This is mostly just for my own entertainment, but also because the result I got, where a giraffe shows as more closely related to one placental mole than that mole is to a different placental mole, is bugging me!
For many years their place within the Marsupials was hotly debated, some workers regarding it as an offshoot of the Diprotodontia (the order to which most living marsupials belong), others noting similarities to a variety of other creatures, and making suggestions that, in hindsight, appear bizarre. A 1989 review of the early literature, slightly paraphrased, states:
When Stirling (1888) initially was unable to find the epipubic bones in Marsupial Moles, speculation was rife: the Marsupial Mole was a monotreme, it was the link between monotremes and marsupials, it had its closest affinities with the (placental) golden moles, it was convergent with edentates, it was a polyprotodont diprotodont, and so on. The mystery was not helped by the complete silence of the fossil record. On the basis that marsupial moles have some characteristics in common with almost all other marsupials, they were eventually classified as an entirely separate order: the Notoryctemorphia. Molecular level analysis in the early 1980s showed that the marsupial moles are not closely related to any of the living marsupials, and that they appear to have followed a separate line of development for a very long time, at least 50 million years.
What molecular data? This is the absolute rock bottom of scholarship: no reference, no suggestion of what the molecule/s in question were. As far as evidence goes, our scant review is considerably more compelling than this utterly useless reference.
But even were that not the case there are other published studies that support the marsupial mole being genetically closer to the other marsupials.
Studies on the interphotoreceptor retinoid binding protein group Notoryctes with other marsupials. Bear in mind that, since the marsupial mole is blind, this gene is virtually functionless in it (Springer, et al., 1997).
WK, I am not saying encyclopedia articles are correct, but they usually tell you what the majority academic opinion is.
To me, it's useful to see that. Obviously I don't always agree with it, but it sheds light on what the experts in that field are thinking. Unfortunately, experts in the field can still be, and often are, wrong.