Author Topic:   Bioinformaticians, harken
crashfrog
Member (Idle past 1467 days)
Posts: 19762
From: Silver Spring, MD
Joined: 03-20-2003


Message 1 of 6 (636899)
10-11-2011 11:28 PM


So I'm doing some independent study for a lab at UNL. It's the lab where my wife got her PhD, and they're kind of a low-budget agricultural entomology operation, so the genetics work they do is all techniques from the '90s - RFLP, AFLP, RAPDs, and the like. The software tools they have are also pretty low-rent, and I'm trying to cook up something to fill a hole in the tool-chain: they don't have any way to interconvert their data between the various flatfile formats that their phylogenetics software uses, besides editing and formatting them by hand.
So, I'm working on a Java app to do that for them. The other major thing is that I'm also trying to replace an antiquated bootstrap coefficient-of-variation program they use called "DBOOT". A guy named Coelho wrote it in 2001, and since he was Brazilian or something, all its documentation and output is in Portuguese. Plus, it runs like ass and you have to babysit it - it does all its computations on the event dispatch thread, so if Windows tries to take window focus or activate the screen saver, it'll crash.
I'm working on implementing the same algorithm and I'm close, but I'm not quite there yet. I'm hoping somebody with some mainstream bioinformatics knowledge can help me out - nobody in the lab is tech-savvy or really even statistics-savvy, they're using these methods because that's how they get papers published. They know enough to interpret the output but they don't really know the calculations themselves. I'm not one to talk; I have zero stats background.
I found this paper that seems pretty on-point:
http://www.springerlink.com/content/h343922283473447/
and it was a big help.
It's AFLP data so basically what's being produced is populations of individuals and a boolean array of restriction loci - 1 for "band present" and 0 for "band absent." Something like:
Samp1    1100101010101001011111110101111010...
The analysis, as far as I understand it, is a significance test - how many loci do they need in order to be at or under a 5% coefficient of variation? There's a bootstrapping component as well, so what's actually being compared are subsamples, where a fake population is assembled by drawing N loci at random (with replacement) and taking those same loci from every individual in the population.
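The resampling step described above can be sketched roughly as follows. This is only an illustration under assumptions: the class name `LocusResampler`, the `int[][]` layout (rows are individuals, columns are loci), and the seeded `Random` are all hypothetical and are not taken from DBOOT or from any existing tool.

```java
import java.util.Random;

/** Hypothetical helper: bootstrap resampling of AFLP loci (columns). */
final class LocusResampler {

    /** Parses a band string such as "1100101" into an array of 0/1 values. */
    static int[] parse(String loci) {
        int[] row = new int[loci.length()];
        for (int k = 0; k < row.length; k++) {
            row[k] = loci.charAt(k) - '0';
        }
        return row;
    }

    /**
     * Draws nLoci column indices at random, with replacement, and returns a
     * new matrix holding those same columns for every individual (row).
     */
    static int[][] resample(int[][] bands, int nLoci, Random rng) {
        int totalLoci = bands[0].length;
        int[] picks = new int[nLoci];
        for (int k = 0; k < nLoci; k++) {
            picks[k] = rng.nextInt(totalLoci); // the same locus may be drawn twice
        }
        int[][] out = new int[bands.length][nLoci];
        for (int s = 0; s < bands.length; s++) {
            for (int k = 0; k < nLoci; k++) {
                out[s][k] = bands[s][picks[k]];
            }
        }
        return out;
    }
}
```

The important detail is that one set of picks is shared by every individual, so each column of the subsample is still a real locus scored across the whole population.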
So I'm figuring out the pairwise genetic distance (i.e. between individuals i and j) using the measure in the paper I found, which is
$$GD_{ij} = \frac{N_{d}}{N_{d} + N_{s}}$$
where $N_{d}$ is the number of discordant loci and $N_{s}$ is the number of loci that are the same. That creates a pairwise table of genetic distances; the average and standard deviation of those are taken, and the ratio of std dev to average, as a percent, is the "percent coefficient of variation." That's done for a number of bootstrap replications - say, 1000 - and then the resulting coefficients of variation are averaged to produce a single mean coefficient of variation for some X number of loci. If you graph it, it looks like this:
[Figure: mean percent coefficient of variation plotted against number of loci sampled]
I'm running tests of my code vs the Coelho 2001 program and I'm not getting the same results. The ideas of CoV, mean, and std dev are pretty straightforward, so I think I'm not using the same measure of genetic distance. I know there are quite a few measures used in the literature, but I don't know enough to sort through them; plus, most of them are GD measures between populations, not between individuals within populations.
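For reference, the per-replicate calculation described above (one distance per pair of individuals, then std dev over mean as a percent) might look like the sketch below. The class and method names are hypothetical, and the choice of the population standard deviation (dividing by n) is an assumption; DBOOT may well divide by n - 1 or differ in other ways.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: percent CV of pairwise simple-matching distances. */
final class PairwiseCv {

    /** Simple-matching distance: discordant loci / all compared loci. */
    static double simpleMatchingDistance(int[] a, int[] b) {
        int mismatch = 0;
        for (int k = 0; k < a.length; k++) {
            if (a[k] != b[k]) {
                mismatch++;
            }
        }
        return (double) mismatch / a.length;
    }

    /** One distance per unordered pair (i < j); no self-comparisons. */
    static List<Double> pairwiseDistances(int[][] bands) {
        List<Double> distances = new ArrayList<>();
        for (int i = 0; i < bands.length; i++) {
            for (int j = i + 1; j < bands.length; j++) {
                distances.add(simpleMatchingDistance(bands[i], bands[j]));
            }
        }
        return distances;
    }

    /** Percent CV using the population std dev; NaN if the mean is zero. */
    static double percentCv(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.size();
        double sumSquares = 0.0;
        for (double v : values) {
            sumSquares += (v - mean) * (v - mean);
        }
        double stdDev = Math.sqrt(sumSquares / values.size());
        return 100.0 * stdDev / mean;
    }
}
```

For three individuals scored `1100`, `1000`, and `0011`, the pairwise distances are 0.25, 1.0, and 0.75, which gives a CV of roughly 46.8%.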
My fear is that this lab is so far behind the biological mainstream that I'm not even talking about this stuff the right way. If there's anybody reading this who's better versed in bioinformatics than I am, which would not be hard, I'd appreciate a pointer. I realize this is probably pretty trivial stuff for a lot of people here, but I'd really like to leave these people with a useful tool. They did right by my wife and they're our friends and we owe them. They were seriously spending hours at a time, laboriously reformatting text files, adding or taking out tabs, and so on. Can you imagine having to put tabs in between the loci, when you have a population of 100 samples each with 300 loci? When I showed my wife how to block select in TextPad, she cried. They're seriously in the dark ages over here. Help me help them!
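As an illustration of the reformatting chore described above, here is a minimal sketch that puts a tab between the sample name and every locus. The input layout (a name, some whitespace, then a run of 0/1 characters) is assumed from the example line earlier in this post; the real file formats used by the lab's software may differ, and the class name `TabDelimiter` is hypothetical.

```java
/** Hypothetical sketch: tab-delimit one line of band data. */
final class TabDelimiter {

    /** Turns "Samp1    110010" into "Samp1\t1\t1\t0\t0\t1\t0". */
    static String tabDelimit(String line) {
        // Split into the sample name and everything after the first whitespace run.
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("expected '<name> <loci>': " + line);
        }
        StringBuilder sb = new StringBuilder(parts[0]);
        // Drop any whitespace inside the loci, then emit one tab per locus.
        for (char c : parts[1].replaceAll("\\s+", "").toCharArray()) {
            sb.append('\t').append(c);
        }
        return sb.toString();
    }
}
```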

Replies to this message:
 Message 2 by Wounded King, posted 10-12-2011 6:43 AM crashfrog has replied

  
Wounded King
Member
Posts: 4149
From: Cincinnati, Ohio, USA
Joined: 04-09-2003


Message 2 of 6 (636910)
10-12-2011 6:43 AM
Reply to: Message 1 by crashfrog
10-11-2011 11:28 PM


Bootstrap's Bootstraps
Does DBOOT do the actual similarity/distance calculations or does it just produce bootstrapped sets of data? Looking at a couple of papers that use DBOOT (Lima et al., 2002; Bruel et al., 2006 (PDF)), they both seem to use the Jaccard similarity coefficient (JSC), although they each calculate it slightly differently, but it isn't clear if that is the metric that DBOOT itself uses.
So for the JSC for these two samples:
Sample i:111100001010010000111100001
Sample j:100111001000100100010010000
$JSC_{ij} = \frac{a}{a+b+c} = \frac{4}{4+8+5} = 0.235$
Where a is the number of polymorphic bands present in both individuals, b is the number of bands present in i and absent in j, and c is the number of bands present in j and absent in i. Bands absent in both individuals are ignored.
The distance would be the complement of the JSC: 1 - 0.235 = 0.765
The difference between this method and the simple matching based one you use is that it does not count matching 0 values towards similarity. Using your method I believe the distance measure would be ~0.481, considerably more similar.
Perhaps using the JSC would give you results more in line with the DBOOT program.
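The arithmetic above can be checked with a short sketch. The class and method names here are hypothetical; only the two band strings and the definitions of a, b, and c come from this post (d, the count of bands absent in both, is added for the simple matching case).

```java
import java.util.Locale;

/** Hypothetical sketch: Jaccard vs. simple matching on the two samples above. */
final class BandDistances {

    /** Returns {a, b, c, d}: present in both, only in i, only in j, absent in both. */
    static int[] counts(String si, String sj) {
        int[] n = new int[4];
        for (int k = 0; k < si.length(); k++) {
            boolean inI = si.charAt(k) == '1';
            boolean inJ = sj.charAt(k) == '1';
            if (inI && inJ) {
                n[0]++;
            } else if (inI) {
                n[1]++;
            } else if (inJ) {
                n[2]++;
            } else {
                n[3]++;
            }
        }
        return n;
    }

    /** 1 - a/(a+b+c); shared absences are ignored. */
    static double jaccardDistance(int[] n) {
        return 1.0 - (double) n[0] / (n[0] + n[1] + n[2]);
    }

    /** (b+c)/(a+b+c+d); shared absences count as agreement. */
    static double simpleMatchingDistance(int[] n) {
        return (double) (n[1] + n[2]) / (n[0] + n[1] + n[2] + n[3]);
    }

    public static void main(String[] args) {
        int[] n = counts("111100001010010000111100001",
                         "100111001000100100010010000");
        System.out.println(String.format(Locale.ROOT,
                "a=%d b=%d c=%d d=%d", n[0], n[1], n[2], n[3])); // a=4 b=8 c=5 d=10
        System.out.println(String.format(Locale.ROOT,
                "Jaccard distance = %.3f", jaccardDistance(n))); // 0.765
        System.out.println(String.format(Locale.ROOT,
                "Simple matching distance = %.3f", simpleMatchingDistance(n))); // 0.481
    }
}
```

The ten loci where the band is absent in both samples are what pull the simple matching distance down relative to Jaccard.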
TTFN,
WK

This message is a reply to:
 Message 1 by crashfrog, posted 10-11-2011 11:28 PM crashfrog has replied

Replies to this message:
 Message 3 by crashfrog, posted 10-12-2011 8:51 AM Wounded King has not replied

  
crashfrog


Message 3 of 6 (636927)
10-12-2011 8:51 AM
Reply to: Message 2 by Wounded King
10-12-2011 6:43 AM


Re: Bootstrap's Bootstraps
Does DBOOT do the actual similarity/distance calculations or does it just produce bootstrapped sets of data?
My understanding is that it actually does the distance calculations, takes the CoV of all those pairwise distances, averages that out for 1000 bootstraps, and produces a table of sampled loci vs percent mean coefficient of variation. Later today I can post some of its output if that would help.
Looking at a couple of papers that use DBOOT (Lima et al., 2002; Bruel et al., 2006 (PDF)), they both seem to use the Jaccard-similarity coefficient (JSC), although they each calculate it slightly differently, but it isn't clear if that is the metric that DBOOT itself uses.
First let me say that you've already been tremendously helpful. As best as I can tell, DBOOT supports three measures of genetic distance: Jaccard, "Dice", and "Simple coincidence" (this is the best translation I'm able to provide out of Portuguese). In the lab we usually set it for simple coincidence, so that's the measure of distance I'm hoping to use. I'll try it with Jaccard this afternoon and see if my results are more congruent (I can easily implement Jaccard based on what you've told me).
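For comparison, here is a sketch of the three coefficients side by side, written in terms of the counts a (band present in both), b and c (present in only one), and d (absent in both). The Jaccard formula is the one given in the previous message; the Dice and simple matching formulas are the standard textbook definitions, and it is an assumption that DBOOT's "Dice" and "simple coincidence" options correspond to them exactly. The enum name is hypothetical.

```java
/** Hypothetical sketch: three band-similarity coefficients from counts a, b, c, d. */
enum SimilarityCoefficient {
    JACCARD {
        @Override
        double similarity(int a, int b, int c, int d) {
            return (double) a / (a + b + c); // shared absences ignored
        }
    },
    DICE {
        @Override
        double similarity(int a, int b, int c, int d) {
            return 2.0 * a / (2.0 * a + b + c); // shared presences weighted double
        }
    },
    SIMPLE_MATCHING {
        @Override
        double similarity(int a, int b, int c, int d) {
            return (double) (a + d) / (a + b + c + d); // shared absences count as matches
        }
    };

    abstract double similarity(int a, int b, int c, int d);

    /** Distance is taken as the complement of similarity. */
    double distance(int a, int b, int c, int d) {
        return 1.0 - similarity(a, b, c, d);
    }
}
```

With the counts from the worked example in the previous message (a=4, b=8, c=5, d=10), the similarities come out to about 0.235 (Jaccard), 0.381 (Dice), and 0.519 (simple matching).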
Let me ask you about some of my assumptions. When I produce these pairwise gd's, I'm only producing one per pair of samples - my distance matrix is the bottom triangle, in other words - and I'm not producing anything for the genetic distance of a sample compared to itself. Does that seem right, or should I factor in those missing zeroes? Should I be populating the entire (for N samples) NxN matrix?
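To make the question concrete, the sketch below computes the coefficient of variation both ways for a toy 3x3 distance matrix: once over the lower triangle only, and once over the full NxN matrix including the zero diagonal. The class name and the toy numbers are hypothetical, and nothing here says which convention DBOOT follows; it only shows how much the choice matters.

```java
/** Hypothetical sketch: CV over the lower triangle vs. the full matrix. */
final class TriangleVsFull {

    /** CV (std dev / mean) using the population std dev. */
    static double cv(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        double mean = sum / v.length;
        double sumSquares = 0.0;
        for (double x : v) {
            sumSquares += (x - mean) * (x - mean);
        }
        return Math.sqrt(sumSquares / v.length) / mean;
    }

    /** One value per unordered pair: the N(N-1)/2 entries below the diagonal. */
    static double[] lowerTriangle(double[][] m) {
        int n = m.length;
        double[] out = new double[n * (n - 1) / 2];
        int k = 0;
        for (int i = 1; i < n; i++) {
            for (int j = 0; j < i; j++) {
                out[k++] = m[i][j];
            }
        }
        return out;
    }

    /** All N*N entries, including the zero self-distances on the diagonal. */
    static double[] fullMatrix(double[][] m) {
        int n = m.length;
        double[] out = new double[n * n];
        int k = 0;
        for (double[] row : m) {
            for (double x : row) {
                out[k++] = x;
            }
        }
        return out;
    }
}
```

For distances of 0.2, 0.4, and 0.6 among three samples, the lower triangle gives a CV of about 0.41, while the full matrix gives about 0.87: the diagonal zeros drag the mean down and widen the spread.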

This message is a reply to:
 Message 2 by Wounded King, posted 10-12-2011 6:43 AM Wounded King has not replied

Replies to this message:
 Message 4 by crashfrog, posted 10-12-2011 4:04 PM crashfrog has replied

  
crashfrog


Message 4 of 6 (636968)
10-12-2011 4:04 PM
Reply to: Message 3 by crashfrog
10-12-2011 8:51 AM


Output, compared
Well, using Jaccard distance didn't have any effect, and adding the genetic identities into the distance matrix just made the output worse.
Here's the result of a test with some real AFLP data, out of DBOOT:
Coeficientes de similaridade do tipo Coincidencia Simples
no. bootstrp   no. locos       media   variancia       CV(%)      q(.25)      q(.75)
         100           1    0.768242    0.176263   55.136050   48.676203   61.122746
         100           2    0.726154    0.104887   44.963057   40.198032   49.481374
         100           3    0.757674    0.061953   33.002239   29.778414   35.919554
         100           4    0.738755    0.046589   29.374278   26.410113   32.236603
         100           5    0.732846    0.038112   26.785541   24.915408   29.052878
.......

         100         104    0.735349    0.001890    5.934687    5.336402    6.549305
         100         105    0.730809    0.001839    5.892220    5.338761    6.395837
         100         106    0.731902    0.001736    5.712152    5.161226    6.238547
         100         107    0.735075    0.001828    5.839822    5.152048    6.566232
         100         108    0.735866    0.001772    5.746364    5.175141    6.376444
And here's the output from my program, using simple coincidence as the measure of genetic distance:
1	1.2950240644057558
2	1.1301652942940503
3	0.9738933281933625
4	0.7637411673990612
5	0.7044007586822684
6	0.6547023347988583
7	0.6109443407519263
8	0.5645575156637022
....
102	0.22954182190558442
103	0.23119214072191788
104	0.22949154294159135
105	0.23229729410772165
106	0.23080978926726237
107	0.23033780210424287
108	0.22856838102290847
Obviously one difference is that DBOOT reports in percent coefficient of variation and my program doesn't, as yet. But even accounting for that, DBOOT values are clearly approaching a CV of about 5.5%, and mine are getting to 22% and stopping. I'll post my code in the next message; I'd appreciate any questions or comments anyone might have.
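One possibility that may be worth checking, offered as an assumption and not as a confirmed diagnosis: the DBOOT header reads "Coeficientes de similaridade" (similarity coefficients) and its "media" column sits near 0.73, whereas a distance on the same data would average around 0.27. Since similarity is 1 minus distance, the two have the same standard deviation but different means, so their coefficients of variation differ. The sketch below, with hypothetical names and made-up numbers, shows the size of that effect.

```java
/** Hypothetical sketch: CV of distances vs. CV of the matching similarities. */
final class CvOfComplement {

    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        return sum / v.length;
    }

    /** Population standard deviation (divides by n). */
    static double stdDev(double[] v) {
        double m = mean(v);
        double sumSquares = 0.0;
        for (double x : v) {
            sumSquares += (x - m) * (x - m);
        }
        return Math.sqrt(sumSquares / v.length);
    }

    static double percentCv(double[] v) {
        return 100.0 * stdDev(v) / mean(v);
    }

    /** Converts distances to similarities (or back): 1 - x for each value. */
    static double[] complement(double[] v) {
        double[] out = new double[v.length];
        for (int k = 0; k < v.length; k++) {
            out[k] = 1.0 - v[k];
        }
        return out;
    }
}
```

For made-up distances of 0.20, 0.25, 0.30, and 0.35, the CV is about 20.3%, while the CV of the corresponding similarities (0.80, 0.75, 0.70, 0.65) is about 7.7%, even though the spread is identical.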

This message is a reply to:
 Message 3 by crashfrog, posted 10-12-2011 8:51 AM crashfrog has replied

Replies to this message:
 Message 5 by crashfrog, posted 10-12-2011 4:10 PM crashfrog has not replied

  
crashfrog


Message 5 of 6 (636969)
10-12-2011 4:10 PM
Reply to: Message 4 by crashfrog
10-12-2011 4:04 PM


The primary math methods
package bootsie;

import java.util.ArrayList;
import java.util.Iterator;

/**
 *
 * @author Crashfrog
 */
public abstract class MathCore {
    //catch-all class of static methods for these statistical tests.

    public static double doubleCoV(ArrayList<Double> numbers){
        //determine coefficient of variation (std dev / mean) for a list of
        //double-precision floating-point values. Uses the population standard
        //deviation (divides by N, not N - 1).
        double cov = 0.0;
        double stdDev = 0.0;
        double mean = MathCore.doubleMean(numbers);
        for (Double value: numbers){ //for each value in numbers
            stdDev += Math.pow((value - mean), 2); //(value - mean) squared
        }
        stdDev = Math.sqrt(stdDev / (double) numbers.size());

        if (mean == 0.0) {
            cov = Double.NaN; //return a non-number (NotANumber)
        } else {
            cov = stdDev / mean;
        }
        return cov;

    }

    public static double doubleMean(ArrayList<Double> numbers){
        //determine arithmetic mean for a list of double-precision floating-
        //point values. An empty list yields NaN (0.0 / 0).
        double mean = 0.0;
        for (Double value: numbers){ //for each value in numbers
            mean += value;
        }
        mean = mean / (double) numbers.size();
        return mean;
    }

    public static Double getCoVOfOneBootstrap(PopulationMatrixModel data, int bootstrapVal){
        //draw bootstrapVal loci at random, with replacement, from the population
        //and return the CoV of the pairwise genetic distances in that subsample
        ArrayList<Integer> picks = new ArrayList<>();
        int lociSize = data.getLength();
        for (int i = 0; i < bootstrapVal; i++){
            picks.add((int) (Math.random() * lociSize)); //index in [0, lociSize)
        }
        PopulationMatrix bootstrap = data.getBootstrap(picks);
        bootstrap.populateGeneticDistanceMatrix();
        return MathCore.doubleCoV(bootstrap.getGeneticDistances());

    }

    public static void bootstrapCoefficientOfVariance(PopulationMatrixModel data, BootstrapMonitor monitor){
        //for each subsample size i (1 locus up to all loci), average the CoV
        //over data.numBootstraps bootstrap replicates
        monitor.startingOp();
        ArrayList<Double> covResultsArray = new ArrayList<>();
        int lociSize = data.getLength();
        for (int i = 1; i <= lociSize; i++){
            ArrayList<Double> coefficients = new ArrayList<>();
            int n;
            for (n = 1; n <= data.numBootstraps; n++){
                Double cov = MathCore.getCoVOfOneBootstrap(data, i);
                if (!cov.isNaN()){ //NaN means the mean distance was zero
                    coefficients.add(cov);
                    monitor.completeOneOp();
                } else {
                    //if cov was not a number, don't count this bootstrap.
                    //Note: this retries forever if every replicate yields NaN
                    //(e.g. if all samples are identical at every locus).
                    n--;
                }
            }
            double meanCoV = MathCore.doubleMean(coefficients);
            covResultsArray.add(meanCoV);

        }

        data.coefficientsOfVariation = covResultsArray;
        monitor.completeAllOps();
    }

    public static ArrayList<Double> bootstrapCovTest(PopulationMatrixModel data, int numTests) {
        //estimator function: runs numTests bootstraps at half the total loci.
        //Returns actual CoV values to evade compiler optimization.
        ArrayList<Double> covArray = new ArrayList<>();
        for (int n = 0; n < numTests; n++) { //exactly numTests iterations
            double cov = MathCore.getCoVOfOneBootstrap(data, data.getLength() / 2);
            covArray.add(cov);
        }
        return covArray;
    }

    public static double simpleGeneticDistance(DataSample a, DataSample b){
        //simple coincidence genetic distance;
        //GD_ij = sum(i != j) / (sum(i = j) + sum(i != j))
        //assumes DataSample.iterator() returns an Iterator<Byte>
        double geneticDistance = 0.0;
        Iterator<Byte> ia = a.iterator();
        Iterator<Byte> ib = b.iterator();
        double match = 0.0;
        double mismatch = 0.0;
        while (ia.hasNext() && ib.hasNext()){
            Byte i = ia.next();
            Byte j = ib.next();
            //System.out.println(i + " and " + j);
            if (i == -1 || j == -1){
                //do nothing; ignore loci where data cannot be compared
            } else {
                if (i.equals(j)){
                    match++;
                } else {
                    mismatch++;
                }
            }

        }
        //NaN (0.0 / 0.0) if no loci could be compared
        geneticDistance = mismatch / (mismatch + match);
        return geneticDistance;
    }

    public static double jaccardGeneticDistance(DataSample a, DataSample b){
        //complement of Jaccard's similarity: 1 - a/(a+b+c) = (b+c)/(a+b+c)
        //assumes DataSample.iterator() returns an Iterator<Byte>
        double geneticDistance = 0.0;
        Iterator<Byte> ia = a.iterator();
        Iterator<Byte> ib = b.iterator();
        double match = 0.0;
        double mismatch = 0.0;
        while (ia.hasNext() && ib.hasNext()){
            Byte i = ia.next();
            Byte j = ib.next();
            //System.out.println(i + " and " + j);
            if (i == -1 || j == -1){
                //do nothing; ignore loci where data cannot be compared
            } else if (i == 0 && j == 0){
                //do nothing; Jaccard ignores bands absent in both samples
            } else {
                if (i == 1 && j == 1){
                    match++;
                } else {
                    mismatch++;
                }
            }

        }
        //NaN (0.0 / 0.0) if the samples share no comparable present bands
        geneticDistance = mismatch / (mismatch + match);
        return geneticDistance;
    }
}
The primary method is MathCore.bootstrapCoefficientOfVariance, which takes as arguments a population on which to do the analysis and a monitor object that keeps track of the progress of the computation (so the UI can show a progress bar). MathCore.bootstrapCovTest is a method to run a short "test" computation for purposes of estimating how long the whole enchilada will take.
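The monitor idea can be sketched as below, with the work running on a background thread so a GUI's event dispatch thread stays free. The three callback names (startingOp, completeOneOp, completeAllOps) match the calls in the code above; everything else, including the `ProgressMonitor` interface, `CountingMonitor`, and `BackgroundRun`, is a hypothetical stand-in and not the actual BootstrapMonitor class.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the monitor type used by MathCore. */
interface ProgressMonitor {
    void startingOp();
    void completeOneOp();
    void completeAllOps();
}

/** Thread-safe monitor that a UI could poll to update a progress bar. */
final class CountingMonitor implements ProgressMonitor {
    private final AtomicInteger done = new AtomicInteger();
    private volatile boolean finished;

    @Override public void startingOp() { done.set(0); finished = false; }
    @Override public void completeOneOp() { done.incrementAndGet(); }
    @Override public void completeAllOps() { finished = true; }

    int completed() { return done.get(); }
    boolean isFinished() { return finished; }
}

final class BackgroundRun {

    /** Runs `ops` units of placeholder work on a worker thread. */
    static CountingMonitor runInBackground(int ops) {
        CountingMonitor monitor = new CountingMonitor();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> job = pool.submit(() -> {
                monitor.startingOp();
                for (int i = 0; i < ops; i++) {
                    // a real run would compute one bootstrap replicate here
                    monitor.completeOneOp();
                }
                monitor.completeAllOps();
            });
            job.get(); // blocks here for simplicity; a GUI would poll the monitor instead
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("background job failed", e);
        } finally {
            pool.shutdown();
        }
        return monitor;
    }
}
```

In a Swing application, `javax.swing.SwingWorker` is the usual way to get the same separation with progress updates delivered back on the event dispatch thread.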

This message is a reply to:
 Message 4 by crashfrog, posted 10-12-2011 4:04 PM crashfrog has not replied

  
crashfrog


Message 6 of 6 (641023)
11-15-2011 12:00 PM


Bootsie
Thanks in part to WK's help, I've completed what I consider the first release of "Bootsie", a Java program for estimating the coefficient of variation of AFLP data via bootstrapping. (Note - the code presented in the above messages is now well out of date.) I mention it here in case anyone is interested or would find it useful for their own work. Bootsie is released under the terms of the "copyfree" Apache License 2.0 and is available at
http://code.google.com/p/bootsie
If you or your lab does population genetics via AFLP, and uses Popgene, Arlequin, or Ntsys, and would like to replace ASG Coelho's "DBOOT" with something that doesn't crash Windows if you look at it funny, then I invite you to give this a try. I'm happy to receive whatever bug reports, suggestions, or comments you may have either in this thread or as an email or PM.

  