Bayesian Analysis to Describe Genomic Evolution by Rearrangement

Version 1.02 beta, June 11, 2004

Copyright © 2004 by Bret Larget & Don Simon


There is no guarantee that the algorithms here will correctly sample from the desired posterior distributions in a finite run. While theory says that the long-run frequencies will converge to the desired posterior probabilities, inferences from runs that are too short can be biased if the chain has not yet moved far enough from its initial state to be near stationarity. Obtaining consistent results from several different starting points with different random seeds is a minimum criterion to have confidence that the numerical results provided by BADGER are close to their analytically determined values.
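The consistency criterion above can be made concrete with a small sketch. This is not part of BADGER and does not read any BADGER file format: it assumes you have already extracted, for each run, a mapping from clade name to estimated posterior probability. The function names, the clade labels, and the tolerance of 0.05 are all hypothetical choices made for illustration.

```python
def clade_spread(runs):
    """Map each clade to the (min, max) of its estimated probability across runs.

    `runs` is a list of dicts, one per run, mapping clade name -> probability.
    A clade that never appears in a run is treated as probability 0.0 there.
    """
    clades = set().union(*(run.keys() for run in runs))
    spread = {}
    for clade in clades:
        values = [run.get(clade, 0.0) for run in runs]
        spread[clade] = (min(values), max(values))
    return spread


def inconsistent_clades(runs, tolerance=0.05):
    """Return the clades whose estimates differ across runs by more than tolerance.

    The default tolerance is an arbitrary illustrative value, not a BADGER setting.
    """
    spread = clade_spread(runs)
    return sorted(c for c, (lo, hi) in spread.items() if hi - lo > tolerance)


# Hypothetical estimates from three independent runs.
example_runs = [
    {"A+B": 0.95, "C+D": 0.60},
    {"A+B": 0.93, "C+D": 0.41},
    {"A+B": 0.96},  # clade C+D was never sampled in this run
]
flagged = inconsistent_clades(example_runs)
print(flagged)
```

A clade that is flagged here suggests that at least one run has not converged, so the estimates should not yet be trusted. An empty result is necessary but not sufficient evidence of convergence.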

We generally follow a procedure like the following when analyzing new data sets.

  1. Complete one short run using all the tree proposal algorithms that are appropriate for the number of taxa in the data, running long enough to reach approximate stationarity.
  2. Graph the .lpd file. (The free software Gnuplot or packages such as R or MATLAB® are good choices.)
  3. Look at the .out file.
  4. Do several (at least five) runs from random starts for several thousand cycles each. The genrc program is useful for generating run control files for the different runs. These runs will also show how much clock time each cycle takes.
  5. Plot all .lpd files to make sure the plateaus are in about the same place. Determine a common number of cycles to discard for all runs so that each run is well into its plateau.
  6. Compare the minimum total number of inversions found for each run. These are stored as the last line of the .min files. Ideally, they should be the same for all runs.
  7. Run summarize on the different .top files (discarding many initial sample points) to check if the same tree topologies and clades are being selected at roughly the same posterior probabilities.
  8. Examine the .sum files after running summarize to see if named clades are reasonably defined. Use chart to create a comparison chart of the common clades across the .top files. Resummarize if necessary. These short runs are more for timing purposes than for gathering results. If the results are inconsistent between the different runs (as indicated by differences in the frequencies of common clades in the chart), which is likely, longer runs may be necessary.
  9. Look at transition tables between subtree topologies to see whether mixing is adequate and similar across the different runs.
  10. If at this point you find that the runs are converging to very different places, you may wish to start at a non-random tree found by another method. It may be that the algorithms in BADGER are inadequate for your data.
  11. If all seems in order, it is time to do several (at least five) longer runs to save for inference. These runs may be on the order of millions to hundreds of millions of cycles depending on the size of the data, the time to do the runs, and computational resource limits. The sample rate should be as small as possible, based on storage limitations. The ultimate size of the files you produce should be the limiting factor. The files may be compressed later or discarded after the essential summarization has occurred. The runs should be long enough to achieve the accuracy you desire. You can judge time requirements from the previous exploratory runs.
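Step 5 asks for a common number of cycles to discard so that every run is well into its plateau. The procedure above does this by eye from the plots. The sketch below is a crude numerical stand-in, not BADGER's method: it assumes each run's log posterior trace has already been loaded into a list of numbers (no .lpd file format is assumed here), and it treats the second half of each trace as the plateau. The function names and the example values are hypothetical.

```python
def plateau_start(trace):
    """Return the first sample index whose value reaches the range of the tail.

    The tail is the second half of the trace, taken here as a stand-in for
    the plateau. This is a heuristic and will mislead if the run is still
    climbing during its second half.
    """
    tail = trace[len(trace) // 2:]
    threshold = min(tail)
    for index, value in enumerate(trace):
        if value >= threshold:
            return index
    return len(trace)


def common_burn_in(traces):
    """Return one burn-in (in samples) that puts every run inside its plateau."""
    return max(plateau_start(trace) for trace in traces)


# Hypothetical log posterior traces from two runs, one value per sample.
run_a = [-500, -400, -300, -210, -200, -205, -198, -202]
run_b = [-600, -204, -201, -199, -203, -200, -202, -201]
burn_in = common_burn_in([run_a, run_b])
print(burn_in)
```

The result counts sample points, not cycles, so multiply by the sampling interval to get cycles. Since the aim is for each run to be well into its plateau, it is reasonable to discard more than this number suggests, and to confirm the choice against the plots.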

This page was most recently updated on June 29, 2004.