EvC Forum: Sequence comparisons (Bioinformatics?)

Topic: Sequence comparisons (Bioinformatics?)

Wounded King

Member

Posts: 4149

From: Cincinnati, Ohio, USA

Joined: 04-09-2003

Message 9 of 42 (215568)
06-09-2005 5:41 AM

Other Clustal sites and basic workings

I chose the ClustalW web site I linked to in the previous thread mainly because it has a relatively simple interface, its tree drawing is very simple to use, and it is pretty fast.

Probably the most used portal to ClustalW is the one at EMBL-EBI (Bioinformatics Tools for Multiple Sequence Alignment), which is hosted by the European Bioinformatics Institute. The EBI site has the full text of a paper which goes into some detail on the various alignment methods used by the ClustalW program.

Anyone really keen might be advised to download a copy of ClustalX, which should also allow you to try some bootstrapping; the download site has versions for a wide variety of OSs. The NJ-plot program is also downloadable and will allow you to fiddle around with the tree a bit more, displaying bootstrap values and branch lengths, or changing your outgroup if you have one.

Some relevant highlights from the paper.

On the distance matrix and pairwise alignments :-

The scores are calculated as the number of k-tuple matches (runs of identical residues, typically 1 or 2 long for proteins or 2 to 4 long for nucleotide sequences) in the best alignment between two sequences minus a fixed penalty for every gap. We now offer a choice between this method and the slower but more accurate scores from full dynamic programming alignments using two gap penalties (for opening or extending gaps) and a full amino acid weight matrix. These scores are calculated as the number of identities in the best alignment divided by the number of residues compared (gap positions are excluded). Both of these scores are initially calculated as percent identity scores and are converted to distances by dividing by 100 and subtracting from 1.0 to give number of differences per site. We do not correct for multiple substitutions in these initial distances. In figure 1 we give the 7x7 distance matrix between the 7 globin sequences calculated using the full dynamic programming method.
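The conversion described in the quoted passage can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not ClustalW's own code: the function names `percent_identity` and `identity_to_distance` and the toy aligned sequences are assumptions made up for the example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned positions, skipping any column that has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gap positions are excluded from the comparison
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * identical / compared


def identity_to_distance(pid):
    """Divide by 100 and subtract from 1.0 to get differences per site."""
    return 1.0 - pid / 100.0


# Toy pre-aligned pair (hypothetical data, not taken from the paper).
seq_a = "ACGT-CCTTG"
seq_b = "ACCTGGC-TG"

pid = percent_identity(seq_a, seq_b)
dist = identity_to_distance(pid)
print(f"{pid:.1f}% identity -> distance {dist:.3f}")
```

Here 8 positions are compared (two columns contain a gap) and 6 of them match, so the pair scores 75% identity and a distance of 0.25. As the paper says, no correction for multiple substitutions is applied at this stage.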

A further worthwhile exercise might be to get the corresponding nucleotide sequences for the proteins we have been looking at and run them through the same procedure.

TTFN,

This message has been edited by Wounded King, 06-09-2005 06:13 AM

Wounded King

Message 11 of 42 (215577)
06-09-2005 8:37 AM

Here is a tree based on Cytochrome B protein sequence data from a number of species, including some insects, bacteria and a plant. I decided to use Arabidopsis as the outgroup since it was the only plant sequence I put in.

As well as the sequences I already had for the marsupials I got a number of sequences by using the Homologene database on Entrez which pulls up homologues of a gene from a number of different species and will display multiple alignments of the protein products of those genes, and also will display them all in FASTA format making it very simple to import into Clustal.

I then ran the data through a local version of ClustalX and used Treeview to produce the tree with the bootstrap values on.

One downside to Clustal for bootstrapping, compared to a more dedicated set of programs like those in the Phylip suite, is that it doesn't produce a record of the trees generated by the bootstrap. ClustalX will only add the bootstrap information to the tree, while in Phylip the seqboot program writes out the resampled data sets and the tree-building programs run on them leave you with a file containing all 1000 trees.

The description of bootstrapping I gave in the previous thread was totally inaccurate. A more accurate description of the process is given by the author of the Phylip suite of programs:-

Joseph Felsenstein writes:

The bootstrap. Bootstrapping was invented by Bradley Efron in 1979, and its use in phylogeny estimation was introduced by me (Felsenstein, 1985b; see also Penny and Hendy, 1985). It involves creating a new data set by sampling N characters randomly with replacement, so that the resulting data set has the same size as the original, but some characters have been left out and others are duplicated. The random variation of the results from analyzing these bootstrapped data sets can be shown statistically to be typical of the variation that you would get from collecting new data sets. The method assumes that the characters evolve independently, an assumption that may not be realistic for many kinds of data.

In other words a dataset of-

ACGTGCCTTGAGTGT
ACCTGGCTTGAAAGT

Might become -

AAACGTCCCTTTAAG
AAACCTGGCTTTAAA

This happens when the first column is sampled 3 times, the second once, the third once, the fourth once, the sixth twice, the seventh once, the eighth once, the ninth twice, the eleventh twice and the twelfth once. The fifth, the tenth and the last three columns are not sampled at all.

311102112021000
_______________
ACGTGCCTTGAGTGT
ACCTGGCTTGAAAGT
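The resampling in this example can be replayed with a short Python sketch. The function names `resample_by_counts` and `bootstrap_counts` are hypothetical, and this is not how Clustal or Phylip implement it internally; it just shows the idea of drawing N columns with replacement and keeping the same columns across every sequence.

```python
import random


def resample_by_counts(alignment, counts):
    """Rebuild each sequence, repeating column i exactly counts[i] times."""
    return ["".join(ch * n for ch, n in zip(seq, counts)) for seq in alignment]


def bootstrap_counts(n_columns, rng):
    """Draw n_columns columns at random with replacement; return per-column counts."""
    counts = [0] * n_columns
    for _ in range(n_columns):
        counts[rng.randrange(n_columns)] += 1
    return counts


alignment = ["ACGTGCCTTGAGTGT", "ACCTGGCTTGAAAGT"]

# Replay the exact sampling counts given in the post.
post_counts = [int(c) for c in "311102112021000"]
replay = resample_by_counts(alignment, post_counts)
print(replay)  # ['AAACGTCCCTTTAAG', 'AAACCTGGCTTTAAA']

# One genuinely random pseudo-replicate (seeded so the run is repeatable).
rng = random.Random(0)
random_counts = bootstrap_counts(len(alignment[0]), rng)
pseudo = resample_by_counts(alignment, random_counts)
print(random_counts, pseudo)
```

The sketch keeps the sampled columns in their original left-to-right order, as the example above does. That should not matter for methods that treat each column independently, which is the same assumption Felsenstein flags in the quote.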

TTFN,

edited by AdminJar to downsize image.

This message has been edited by AdminJar, 06-10-2005 12:17 PM

Replies to this message:
	Message 12 by MangyTiger, posted 06-09-2005 2:35 PM		Wounded King has not replied

Wounded King

Message 14 of 42 (215813)
06-10-2005 6:13 AM

Reply to: Message 13 by randman
06-09-2005 8:46 PM

What molecular data? This is the absolute rock bottom of scholarship: no reference, no suggestion of what the molecule/s in question were. As far as evidence goes, our scant review is considerably more compelling than this utterly useless reference.

But even were that not the case there are other published studies that support the marsupial mole being genetically closer to the other marsupials.

Studies on the interphotoreceptor retinoid binding protein group Notoryctes with other marsupials (Springer, et al., 1997). Bear in mind that since the marsupial mole is blind, this gene is virtually functionless in the marsupial mole.

TTFN,

This message is a reply to:
	Message 13 by randman, posted 06-09-2005 8:46 PM		randman has replied

Replies to this message:
	Message 15 by randman, posted 06-10-2005 1:13 PM		Wounded King has not replied

Wounded King

Message 22 of 42 (216095)
06-11-2005 5:30 AM

Reply to: Message 17 by randman
06-11-2005 2:26 AM

Re: Turtle, Kangaroo, and Rattlesnake

Please post the whole of the FASTA sequences, or at least provide sufficient information so that other people can get hold of the necessary sequence data. In fact it might be a good idea if, from now on, people gave us the accession numbers of the sequences they run.

One major problem here is that Cytochrome C is not 34 amino acids in length.

I tried to run the same alignment using the following sequences

>Grey Kangaroo CytC
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLNGIFGRKTGQAPGFTYTDANKNKGIIWGEDTLMEYLEN
PKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
>Snapping Turtle CytC
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLNGLIGRKTGQAEGFSYTEANKNKGITWGEETLMEYLEN
PKKYIPGTKMIFAGIKKKAERADLIAYLKDATSK
>Rattlesnake Cytochrome C
GDVEKGKKIFSMKCGTCHTVEEGGKHKTGPNLHGLFGRKTGQAVGYSYTAANKNKGIIWGDDTLMEYLEN
PKKYIPGTKMVFTGLKSKKERTDLIAYLKEATAK

As you can see, the CytC sequences are 104 aa in length. The sequence accessions are P68517 for the rattlesnake, P00022 for the turtle and P00014 for the kangaroo.

I'm not sure how you ended up with only 34 amino acids for each species, but I suspect that you are not using the FASTA format correctly. It looks as if you are losing all of the sequence in the first line of amino acids in my input data and only getting the last 34 amino acids.

That aside the results are in line with yours.

Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: Grey Kangaroo 104 aa
Sequence 2: Snapping turtle 104 aa
Sequence 3: Rattlesnake 104 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 89
Sequences (2:3) Aligned. Score: 79
Sequences (1:3) Aligned. Score: 79
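As a rough cross-check on these scores, the three sequences can be compared position by position in Python. This is a sketch under a stated assumption: because all three sequences are 104 aa long, it treats them as already aligned with no gaps, rather than running a real pairwise alignment as ClustalW does. The function name `percent_identity_ungapped` is made up for the example.

```python
from itertools import combinations

# Sequences copied from the FASTA records above (the two lines of each record joined).
sequences = {
    "Grey Kangaroo": (
        "GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLNGIFGRKTGQAPGFTYTDANKNKGIIWGEDTLMEYLEN"
        "PKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE"
    ),
    "Snapping Turtle": (
        "GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLNGLIGRKTGQAEGFSYTEANKNKGITWGEETLMEYLEN"
        "PKKYIPGTKMIFAGIKKKAERADLIAYLKDATSK"
    ),
    "Rattlesnake": (
        "GDVEKGKKIFSMKCGTCHTVEEGGKHKTGPNLHGLFGRKTGQAVGYSYTAANKNKGIIWGDDTLMEYLEN"
        "PKKYIPGTKMVFTGLKSKKERTDLIAYLKEATAK"
    ),
}


def percent_identity_ungapped(seq_a, seq_b):
    """Percent of positions that match, assuming equal length and no gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this shortcut only works for equal-length sequences")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


scores = {}
for (name_a, seq_a), (name_b, seq_b) in combinations(sequences.items(), 2):
    scores[(name_a, name_b)] = percent_identity_ungapped(seq_a, seq_b)

for (name_a, name_b), pid in scores.items():
    print(f"{name_a} vs {name_b}: {pid:.1f}% identical")
```

Truncated to whole numbers, the three values come out as 89, 79 and 79, in line with the scores in the log. Whether ClustalW itself truncates or rounds when it reports a score is an assumption here, not something taken from its documentation.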

TTFN,

This message is a reply to:
	Message 17 by randman, posted 06-11-2005 2:26 AM		randman has not replied

Wounded King

Message 23 of 42 (216096)
06-11-2005 5:34 AM

Reply to: Message 19 by randman
06-11-2005 2:48 AM

Re: here's one with CytoB

Here is the problem!

Your FASTA formatting is all wrong.

The correct format is

>Name
Sequence

i.e.

>Grey Kangaroo Cytochrome C
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLNGIFGRKTGQAPGFTYTDANKNKGIIWGEDTLMEYLEN
PKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE

You need a line break after the name and only one '>' right at the start of the name for each sequence. At the moment you are losing big chunks of sequence data.
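To make the format concrete, here is a minimal FASTA reader in Python. It is a sketch for illustration only; `parse_fasta` is a hypothetical helper, not part of Clustal or any library. It treats each line starting with '>' as the name of a new record and joins every following line into that record's sequence, which is why a sequence wrapped over two lines still comes out at full length.

```python
def parse_fasta(text):
    """Return {name: sequence}, joining all sequence lines under each '>' header."""
    records = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


fasta_text = """>Grey Kangaroo Cytochrome C
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLNGIFGRKTGQAPGFTYTDANKNKGIIWGEDTLMEYLEN
PKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
"""

records = parse_fasta(fasta_text)
for name, seq in records.items():
    print(name, len(seq))

# Length of the final sequence line on its own.
last_line_length = len(fasta_text.strip().splitlines()[-1])
print(last_line_length)
```

The joined kangaroo record is 104 residues long, while its last line on its own is 34 residues. That is the same length as the truncated sequences in the earlier run, which fits the idea that only the final line of each record was being read.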

TTFN,

This message is a reply to:
	Message 19 by randman, posted 06-11-2005 2:48 AM		randman has not replied

Wounded King

Message 24 of 42 (216098)
06-11-2005 5:53 AM

Reply to: Message 21 by Modulous
06-11-2005 5:03 AM

Re: here's one with CytoB

It would be a very good idea to check out the source of the data in a case where there are severe discrepancies in the lengths of the proteins.

A lot of the sequences in GenBank are only partial coding sequences. For instance your 85aa sequence for the snapping turtle is only a partial coding sequence. There is a fuller, though still not complete, CDS based version here.

This sort of partial data is bound to affect the quality of any analysis you perform. To check the source data for your amino acid sequence click on the hyperlink at the 'DBSOURCE' entry in the GenPept view.

TTFN,

This message is a reply to:
	Message 21 by Modulous, posted 06-11-2005 5:03 AM		Modulous has replied

Replies to this message:
	Message 25 by Modulous, posted 06-11-2005 6:36 AM		Wounded King has not replied

Wounded King

Message 28 of 42 (216370)
06-12-2005 9:06 AM

XP compatible version of ClustalX

This is for Mark24 since the other thread in which we were doing sequence analysis has been closed.

There is a Windows version of ClustalX which should be compatible with XP, here. It comes bundled with the NJ-plot program as well, which you could use instead of Treeview.

TTFN,

This message has been edited by Wounded King, 06-12-2005 09:07 AM

Replies to this message:
	Message 29 by mark24, posted 06-12-2005 9:40 AM	Wounded King has not replied
	Message 32 by mark24, posted 06-12-2005 4:00 PM	Wounded King has replied

Wounded King

Message 34 of 42 (216518)
06-13-2005 2:33 AM

Reply to: Message 32 by mark24
06-12-2005 4:00 PM

Input format for ClustalX

The easiest form of input is simply to make a plain text file with a set of FASTA data like those we have been using previously. You can get GenBank or GenPept to display the DNA/protein sequences in FASTA format and just copy and paste them into a txt file. You can even get a set of FASTA files through the Homologene database.

TTFN,

This message is a reply to:
	Message 32 by mark24, posted 06-12-2005 4:00 PM		mark24 has replied

Replies to this message:
	Message 35 by mark24, posted 06-13-2005 3:27 AM		Wounded King has not replied

Wounded King

Message 37 of 42 (216526)
06-13-2005 4:58 AM

If people are really interested in bioinformatics then I would definitely recommend downloading the Phylip suite of programs. It is fairly fiddly and technical to use all of the programs to do exactly what you want but the analyses produced are considerably more powerful and sophisticated than the sort of things we have been doing with Clustal in terms of phylogenetics.

TTFN,

Replies to this message:
	Message 40 by derwood, posted 07-10-2005 4:00 PM		Wounded King has not replied

Wounded King

Message 38 of 42 (220954)
06-30-2005 12:49 PM

Another nice new program has just been released: V2 of Jalview is out now. While it is not the best program for doing your actual alignments, it allows you to visualise pre-existing alignments quite nicely and does some quite cool things with the trees, although it won't let me specify an outgroup for some reason. It also lets you run sequences through ClustalW and will display your data as a principal component analysis plot.

Some quite nice features to play around with.

TTFN,

Wounded King

Message 42 of 42 (232274)
08-11-2005 11:54 AM

That is not dead which can eternal lie....

*Bump*

Just bringing this up to the surface since we have a number of new faces on the board at the moment, and also 'cos Modulous was mentioning not being able to find it with search at the moment.

TTFN,

Date format: mm-dd-yyyy

Timezone: ET (US)
