Topic: Bioinformaticians, harken
crashfrog Member (Idle past 1489 days) Posts: 19762 From: Silver Spring, MD Joined: |
So I'm doing some independent study for this lab at UNL, it's the lab where my wife got her PhD, and they're kind of a low-budget agricultural entomology operation so the genetics work they do is all these techniques from like the 90's - RFLP, AFLP, RAPDs, and the like. The software tools they have are also pretty low-rent and I'm trying to cook up something to fill a hole in the tool-chain - they don't have any way to interconvert their data between the various flatfile formats that their phylogenetics software uses besides editing and formatting them by hand.
So, I'm working on a Java app to do that for them. But the other major thing is that I'm also trying to replace this antiquated bootstrap coefficient of variation program they use called "DBOOT". This guy Coelho wrote it in 2001, and since he was Brazilian or something, all its documentation and output is in Portuguese. Plus, it runs like ass and you have to babysit it - it does all these computations on the event dispatch thread, so if Windows tries to take window focus or activate the screen saver, it'll crash. I'm working on implementing the same algorithm and I'm close, but I'm not quite there yet.
I'm hoping somebody with some mainstream bioinformatics knowledge can help me out - nobody in the lab is tech-savvy or really even statistics-savvy; they're using these methods because that's how they get papers published. They know enough to interpret the output but they don't really know the calculations themselves. I'm not one to talk; I have zero stats background. I found this paper that seems pretty on-point:
http://www.springerlink.com/content/h343922283473447/ and it was a big help. It's AFLP data, so basically what's being produced is populations of individuals and a boolean array of restriction loci - 1 for "band present" and 0 for "band absent." Something like:
Samp1 1100101010101001011111110101111010...
The analysis, as far as I understand it, is a significance test - how many loci do they need to be at or under 5% coefficient of variation. There's a bootstrapping component as well, so what's actually being compared are subsamples, where a fake population is assembled from a number of random Nth loci across the population. So I'm figuring out the pairwise genetic distance (i.e. between individuals i and j) using the measure in the paper I found, which is
$$d_{ij} = \frac{n_d}{n_d + n_s}$$
where $n_d$ is the number of discordant loci and $n_s$ is the number of loci that are the same. That creates a pairwise table of genetic distances; the average and standard deviation of those are taken, and the ratio of std dev to average as a percent is the "percent coefficient of variation." That's done for a number of bootstrap replications - say, 1000 - and then the resulting coefficients of variation are averaged to produce a single mean coefficient of variation for some X number of loci. If you graph it, it looks like this:
[graph: mean coefficient of variation against number of loci sampled]
I'm running tests of my code vs the Coelho 2001 program and I'm not getting the same results, and the ideas of CoV, mean, and std dev are pretty straightforward, so I think I'm not using the same measure of genetic distance. I know there are quite a few measurements used in the literature but I don't know enough to sort through them; plus, most of them are GD measures between populations, not between individuals within populations. My fear is that this lab is so far behind the biological mainstream that I'm not even talking about this stuff the right way.
If there's anybody reading this who's better versed in bioinformatics than I am, which would not be hard, I'd appreciate a pointer. I realize this is probably pretty trivial stuff for a lot of people here, but I'd really like to leave these people with a useful tool. They did right by my wife and they're our friends and we owe them. They were seriously spending hours at a time, laboriously reformatting text files, adding or taking out tabs, and so on. Can you imagine having to put tabs in between the loci, when you have a population of 100 samples each with 300 loci? When I showed my wife how to block select in TextPad, she cried. They're seriously in the dark ages over here. Help me help them!
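The procedure described in this post can be sketched roughly as follows. This is only an illustration under stated assumptions: the distance is taken to be the fraction of discordant loci, replicates whose mean distance is zero are skipped rather than counted, and every class and method name is made up for the example (none of it comes from DBOOT).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Sketch only: every name in this class is hypothetical, not taken from DBOOT. */
final class BootstrapCovSketch {

    /** Turns a band string such as "110010" into booleans (true = band present). */
    static boolean[] parse(String bands) {
        boolean[] out = new boolean[bands.length()];
        for (int k = 0; k < out.length; k++) out[k] = bands.charAt(k) == '1';
        return out;
    }

    /** Assumed distance: discordant loci divided by loci compared. */
    static double distance(boolean[] a, boolean[] b, int[] loci) {
        int discordant = 0;
        for (int locus : loci) if (a[locus] != b[locus]) discordant++;
        return (double) discordant / loci.length;
    }

    /** Population standard deviation over mean; NaN when the mean is zero. */
    static double coefficientOfVariation(List<Double> values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        double mean = sum / values.size();
        if (mean == 0.0) return Double.NaN;
        double squares = 0.0;
        for (double v : values) squares += (v - mean) * (v - mean);
        return Math.sqrt(squares / values.size()) / mean;
    }

    /** CoV of the pairwise distances (one per pair, i < j) for one set of loci. */
    static double covOfReplicate(boolean[][] samples, int[] loci) {
        List<Double> distances = new ArrayList<>();
        for (int i = 0; i < samples.length; i++)
            for (int j = i + 1; j < samples.length; j++)
                distances.add(distance(samples[i], samples[j], loci));
        return coefficientOfVariation(distances);
    }

    /** Mean percent CoV over bootstrap draws of lociPerDraw loci, with replacement. */
    static double meanPercentCov(boolean[][] samples, int lociPerDraw,
                                 int replicates, Random rng) {
        int totalLoci = samples[0].length;
        double sum = 0.0;
        int counted = 0;
        for (int r = 0; r < replicates; r++) {
            int[] loci = new int[lociPerDraw];
            for (int k = 0; k < lociPerDraw; k++) loci[k] = rng.nextInt(totalLoci);
            double cov = covOfReplicate(samples, loci);
            // Assumption: undefined replicates (mean distance of zero) are skipped.
            if (!Double.isNaN(cov)) { sum += cov; counted++; }
        }
        return counted == 0 ? Double.NaN : 100.0 * sum / counted;
    }

    public static void main(String[] args) {
        boolean[][] samples = { parse("110010"), parse("101001"), parse("011100") };
        Random rng = new Random(42);
        for (int n = 1; n <= 6; n++)
            System.out.println(n + " loci: " + meanPercentCov(samples, n, 1000, rng));
    }
}
```

How undefined replicates are handled (skipped, counted as zero, or redrawn) changes the average at small loci counts, so it is one of the places where two implementations could plausibly disagree.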
Wounded King Member Posts: 4149 From: Cincinnati, Ohio, USA Joined: |
Does DBOOT do the actual similarity/distance calculations or does it just produce bootstrapped sets of data? Looking at a couple of papers that use DBOOT (Lima et al., 2002; Bruel et al., 2006 (PDF)), they both seem to use the Jaccard-similarity coefficient (JSC), although they each calculate it slightly differently, but it isn't clear if that is the metric that DBOOT itself uses.
So for the JSC for these two samples:
Sample i: 111100001010010000111100001
Sample j: 100111001000100100010010000
JSCij = a/(a+b+c) = 4/(4+8+5) = 0.235
Where a is the number of polymorphic bands present in both individuals, b is the number of bands present in i and absent in j, and c is the number of bands present in j and absent in i. Bands absent in both individuals are ignored.
The distance would be the complement of the JSC: 1 - 0.235 = 0.765
The difference between this method and the simple matching based one you use is that it does not count matching 0 values towards similarity. Using your method I believe the distance measure would be ~0.481, considerably more similar. Perhaps using the JSC would give you results more in line with the DBOOT program.
TTFN, WK
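A minimal sketch of the two calculations in the worked example above, using the same a, b, c counts plus d for bands absent in both. The class and method names are hypothetical, and returning NaN when neither sample has any band is a choice made for the example.

```java
/** Sketch: hypothetical helper names, following the worked example above. */
final class BandCoefficients {

    /** Returns {a, b, c, d}: present in both, only in i, only in j, absent in both. */
    static int[] counts(String i, String j) {
        if (i.length() != j.length()) {
            throw new IllegalArgumentException("samples must have the same number of loci");
        }
        int[] n = new int[4];
        for (int k = 0; k < i.length(); k++) {
            boolean inI = i.charAt(k) == '1';
            boolean inJ = j.charAt(k) == '1';
            if (inI && inJ) n[0]++;
            else if (inI) n[1]++;
            else if (inJ) n[2]++;
            else n[3]++;
        }
        return n;
    }

    /** 1 - a/(a+b+c); shared absences (d) are ignored. NaN if no bands at all. */
    static double jaccardDistance(String i, String j) {
        int[] n = counts(i, j);
        int denominator = n[0] + n[1] + n[2];
        return denominator == 0 ? Double.NaN : 1.0 - (double) n[0] / denominator;
    }

    /** (b+c)/(a+b+c+d); shared absences count as agreement. */
    static double simpleMatchingDistance(String i, String j) {
        int[] n = counts(i, j);
        return (double) (n[1] + n[2]) / i.length();
    }

    public static void main(String[] args) {
        String sampleI = "111100001010010000111100001";
        String sampleJ = "100111001000100100010010000";
        System.out.println("Jaccard distance:         " + jaccardDistance(sampleI, sampleJ));
        System.out.println("Simple matching distance: " + simpleMatchingDistance(sampleI, sampleJ));
    }
}
```

For the two samples above this gives a = 4, b = 8, c = 5, d = 10, so the distances come out at 13/17 (about 0.765) and 13/27 (about 0.481).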
crashfrog Member (Idle past 1489 days) Posts: 19762 From: Silver Spring, MD Joined: |
"Does DBOOT do the actual similarity/distance calculations or does it just produce bootstrapped sets of data?"
My understanding is that it actually does the distance calculations, takes the CoV of all those pairwise distances, averages that out for 1000 bootstraps, and produces a table of sampled loci vs percent mean coefficient of variation. Later today I can post some of its output if that would help.
"Looking at a couple of papers that use DBOOT (Lima et al., 2002; Bruel et al., 2006 (PDF)), they both seem to use the Jaccard-similarity coefficient (JSC), although they each calculate it slightly differently, but it isn't clear if that is the metric that DBOOT itself uses."
First let me say that you've already been tremendously helpful. As best as I can tell, DBOOT supports three measures of genetic distance: Jaccard, "Dice", and "Simple coincidence" (this is the best translation I'm able to provide out of Portuguese). In the lab we usually set it for simple coincidence, so that's the measure of distance I'm hoping to use. I'll try it with Jaccard this afternoon and see if my results are more congruent (I can easily implement Jaccard based on what you've told me).
Let me ask you about some of my assumptions. When I produce these pairwise gd's, I'm only producing one per pair of samples - my distance matrix is the bottom triangle, in other words - and I'm not producing anything for the genetic distance of a sample compared to itself. Does that seem right, or should I factor in those missing zeroes? Should I be populating the entire (for N samples) NxN matrix?
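On the triangle-versus-full-matrix question, a small sketch may help show what each choice does to the numbers. It says nothing about which layout DBOOT uses; the class name, method names, and the 3x3 distances are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch with made-up names and data: three ways to flatten a distance matrix. */
final class DistanceMatrixLayouts {

    /** Population standard deviation over mean. */
    static double cov(List<Double> values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        double mean = sum / values.size();
        double squares = 0.0;
        for (double v : values) squares += (v - mean) * (v - mean);
        return Math.sqrt(squares / values.size()) / mean;
    }

    /** One value per unordered pair (i > j), no self-comparisons. */
    static List<Double> lowerTriangle(double[][] d) {
        List<Double> out = new ArrayList<>();
        for (int i = 0; i < d.length; i++)
            for (int j = 0; j < i; j++) out.add(d[i][j]);
        return out;
    }

    /** Every cell of the NxN matrix, optionally including the zero diagonal. */
    static List<Double> fullMatrix(double[][] d, boolean includeDiagonal) {
        List<Double> out = new ArrayList<>();
        for (int i = 0; i < d.length; i++)
            for (int j = 0; j < d.length; j++)
                if (includeDiagonal || i != j) out.add(d[i][j]);
        return out;
    }

    public static void main(String[] args) {
        double[][] d = {
            {0.0, 0.2, 0.4},
            {0.2, 0.0, 0.6},
            {0.4, 0.6, 0.0},
        };
        System.out.println("lower triangle:         " + cov(lowerTriangle(d)));
        System.out.println("full, no diagonal:      " + cov(fullMatrix(d, false)));
        System.out.println("full, with diagonal 0s: " + cov(fullMatrix(d, true)));
    }
}
```

Duplicating every pair (full matrix without the diagonal) leaves the mean and the population standard deviation unchanged, so the CoV matches the triangle. Adding the N self-distances of zero lowers the mean and always raises the CoV when the distances are non-negative and not all zero.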
crashfrog Member (Idle past 1489 days) Posts: 19762 From: Silver Spring, MD Joined: |
Well, using Jaccard distance didn't have any effect, and adding the genetic identities into the distance matrix just made the output worse.
Here's the result of a test with some real AFLP data, out of DBOOT:
Coeficientes de similaridade do tipo Coincidencia Simples
no. bootstrp  no. locos  media     variancia  CV(%)      q(.25)     q(.75)
100           1          0.768242  0.176263   55.136050  48.676203  61.122746
100           2          0.726154  0.104887   44.963057  40.198032  49.481374
100           3          0.757674  0.061953   33.002239  29.778414  35.919554
100           4          0.738755  0.046589   29.374278  26.410113  32.236603
100           5          0.732846  0.038112   26.785541  24.915408  29.052878
.......
100           104        0.735349  0.001890   5.934687   5.336402   6.549305
100           105        0.730809  0.001839   5.892220   5.338761   6.395837
100           106        0.731902  0.001736   5.712152   5.161226   6.238547
100           107        0.735075  0.001828   5.839822   5.152048   6.566232
100           108        0.735866  0.001772   5.746364   5.175141   6.376444
(In English, the header reads "Similarity coefficients of the Simple Coincidence type"; the columns are number of bootstraps, number of loci, mean, variance, CV(%), and the .25 and .75 quantiles.)
And here's the output from my program, using simple coincidence as the measure of genetic distance:
1    1.2950240644057558
2    1.1301652942940503
3    0.9738933281933625
4    0.7637411673990612
5    0.7044007586822684
6    0.6547023347988583
7    0.6109443407519263
8    0.5645575156637022
....
102  0.22954182190558442
103  0.23119214072191788
104  0.22949154294159135
105  0.23229729410772165
106  0.23080978926726237
107  0.23033780210424287
108  0.22856838102290847
Obviously one difference is that DBOOT reports in percent coefficient of variation and my program doesn't, as yet. But even accounting for that, DBOOT values are clearly approaching a CV of about 5.5%, and mine are getting to 22% and stopping. I'll post my code in the next message; I'd appreciate any questions or comments anyone might have.
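One thing that might be worth ruling out, offered as an assumption rather than a diagnosis: the DBOOT header says "Coeficientes de similaridade" (similarity coefficients), and a CV taken over similarities s is a different number from a CV taken over distances 1 - s. The standard deviation is the same either way, but the mean changes. The sketch below plugs the mean and variance from the 108-loci DBOOT row into both; treating that variance as the variance of the pairwise coefficients is a guess about what DBOOT prints, and the class and method names are made up.

```java
/** Sketch: CV of similarities versus CV of the matching distances (1 - similarity). */
final class SimilarityVsDistanceCv {

    /** Percent coefficient of variation from a mean and a variance. */
    static double percentCv(double mean, double variance) {
        return 100.0 * Math.sqrt(variance) / mean;
    }

    public static void main(String[] args) {
        // Values copied from the 108-loci row of the DBOOT output above.
        double meanSimilarity = 0.735866;
        double variance = 0.001772;

        // Var(1 - s) equals Var(s), but the mean becomes 1 - mean(s).
        double cvOfSimilarity = percentCv(meanSimilarity, variance);
        double cvOfDistance = percentCv(1.0 - meanSimilarity, variance);

        System.out.println("CV over similarities: " + cvOfSimilarity);
        System.out.println("CV over distances:    " + cvOfDistance);
    }
}
```

With those inputs the similarity-based figure lands near 5.7% and the distance-based one near 16%. Neither matches the 22-23% plateau exactly, so this would account for part of the gap at most.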
crashfrog Member (Idle past 1489 days) Posts: 19762 From: Silver Spring, MD Joined: |
    package bootsie;

    import java.util.ArrayList;
    import java.util.Iterator;

    /**
     *
     * @author Crashfrog
     */
    public abstract class MathCore {
        //catch-all class of static methods for these statistical tests.

        public static double doubleCoV(ArrayList<Double> numbers){
            //determine coefficient of variation for an array of double-precision
            //floating-point values.
            double cov = 0.0;
            double stdDev = 0.0;
            double mean = MathCore.doubleMean(numbers);
            for (Double value: numbers){ //for each value in numbers
                stdDev += Math.pow((value - mean), 2); //(value - mean) squared
            }
            //population standard deviation (divides by N, not N - 1)
            stdDev = Math.sqrt(stdDev / (double) numbers.size());
            if (mean == 0.0) {
                cov = Double.NaN; //return a non-number (NotANumber)
            } else {
                cov = stdDev / mean;
            }
            return cov;
        }

        public static double doubleMean(ArrayList<Double> numbers){
            //determine arithmetic mean for an array of double-precision floating-
            //point values.
            double mean = 0.0;
            for (Double value: numbers){ //for each value in numbers
                mean += value;
            }
            mean = mean / (double) numbers.size();
            return mean;
        }

        public static Double getCoVOfOneBootstrap(PopulationMatrixModel data, int bootstrapVal){
            //a method to produce a random-with-replacement sample of N loci from the population
            ArrayList<Integer> picks = new ArrayList<>();
            int lociSize = data.getLength();
            for (int i = 0; i < bootstrapVal; i++){
                picks.add((int) (Math.random() * lociSize)); //random locus index, autoboxed
            }
            PopulationMatrix bootstrap = data.getBootstrap(picks);
            bootstrap.populateGeneticDistanceMatrix();
            return MathCore.doubleCoV(bootstrap.getGeneticDistances());
        }

        public static void bootstrapCoefficientOfVariance(PopulationMatrixModel data, BootstrapMonitor monitor){
            monitor.startingOp();
            ArrayList ...
    [remainder of listing missing]

The primary method is MathCore.bootstrapCoefficientOfVariance, which takes as arguments a population on which to do the analysis and a monitor object that keeps track of the progress of the computation (so the UI can show a progress bar).
MathCore.bootstrapCoVTest is a method to run a short "test" computation for purposes of estimating how long the whole enchilada will take.
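Since the complaint about DBOOT is that it computes on the event dispatch thread, here is one way the work could be kept off it while still reporting progress. This is a sketch under assumptions, not the code from the listing: `ProgressListener` is a hypothetical stand-in (it is not the `BootstrapMonitor` class mentioned above), and the per-loci computation is passed in as a function so the example stays self-contained.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntToDoubleFunction;

/** Sketch: run one computation per loci count on a worker thread. */
final class BackgroundBootstrapSketch {

    /** Hypothetical progress callback; not the BootstrapMonitor from the post. */
    interface ProgressListener {
        void progress(int done, int total);
    }

    /** Computes a value for each loci count 1..maxLoci off the calling thread. */
    static Future<double[]> runAsync(ExecutorService pool, int maxLoci,
                                     IntToDoubleFunction covForLociCount,
                                     ProgressListener listener) {
        return pool.submit(() -> {
            double[] results = new double[maxLoci];
            for (int n = 1; n <= maxLoci; n++) {
                results[n - 1] = covForLociCount.applyAsDouble(n);
                listener.progress(n, maxLoci); // called on the worker thread
            }
            return results;
        });
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Placeholder computation (1/n) standing in for a real bootstrap CoV.
            Future<double[]> future = runAsync(pool, 5, n -> 1.0 / n,
                    (done, total) -> System.out.println(done + "/" + total));
            double[] results = future.get();
            System.out.println(results.length + " results computed");
        } finally {
            pool.shutdown();
        }
    }
}
```

In a Swing UI the listener would typically hand the update to the event dispatch thread with `SwingUtilities.invokeLater`, or the whole thing could be written as a `SwingWorker`; the plain executor is used here only to keep the example free of GUI code.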
crashfrog Member (Idle past 1489 days) Posts: 19762 From: Silver Spring, MD Joined:
Thanks in part to WK's help, I've completed what I consider the first release of "Bootsie", a Java program for estimating the coefficient of variation of AFLP data via bootstrapping. (Note - the code presented in the above messages is now well out of date.) I mention it here in case anyone is interested or would find it useful for their own work. Bootsie is released under the terms of the "copyfree" Apache License 2.0 and is available at
http://code.google.com/p/bootsie
If you or your lab does population genetics via AFLP, and uses Popgene, Arlequin, or Ntsys, and would like to replace ASG Coelho's "DBOOT" with something that doesn't crash Windows if you look at it funny, then I invite you to give this a try. I'm happy to receive whatever bug reports, suggestions, or comments you may have either in this thread or as an email or PM.