Distribution Analysis by Non-linear Fitting of Integrated Probabilities
What's it for?
Analysis of the form of the parent probability distribution of a data set.
What's its advantage?
DANFIP analysis gives a better picture of the form of the parent distribution than does "histogramming".
It extracts the components of multi-component distributions.
The reduced Chi-square of the fit can be used to choose between models.
It allows good estimates of population parameters (such as the mean and standard deviation) for truncated distributions.
How do you do it?
1) Take a random collection of datum values.
2) Sort them in increasing order.
3) "Swap" the X and Y axes, so that X = datum value and Y = the rank position in the increasing order of values.
Note: This plot (orange symbols) is called the empirical cumulative distribution function, or eCDF for short. Its Y axis scale represents the number of datum points.
If the eCDF is divided by the number of datum points, the Y axis runs from 0 to 1 and the curve has the shape and scale of the integrated probability function of the parent distribution.
Alternatively, if an integrated probability distribution is multiplied by the number of datum values in a sample, it has the shape and scale of the eCDF (green symbols).
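The construction above can be sketched in a few lines of Python. This is only an illustration: the function name ecdf and the five sample values are made up for the example and are not part of the original analysis.

```python
def ecdf(values):
    """Return (x, rank, fraction) lists for the empirical CDF of `values`."""
    x = sorted(values)                    # sort in increasing order
    n = len(x)
    rank = list(range(1, n + 1))          # Y = rank position, 1..n
    fraction = [r / n for r in rank]      # Y divided by n, ending at 1
    return x, rank, fraction


# Made-up values, for illustration only.
sample = [251.2, 233.7, 268.9, 247.5, 259.1]
x, rank, fraction = ecdf(sample)
for xi, ri, fi in zip(x, rank, fraction):
    print(xi, ri, fi)
```

Plotting x against rank gives the eCDF in counts; plotting x against fraction gives the version scaled to end at 1.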
For Gaussian (normal) probabilities, this presents a problem since the integrated form of the probability distribution function has no closed-form expression. However, Hastings' approximation formula can approximate the shape of the integrated probability distribution function for any given values of the mean and standard deviation.
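The formula itself is not reproduced in this text, so the exact variant used here is not known. As a hedged illustration, the sketch below uses one widely quoted polynomial approximation credited to Hastings (listed as formula 26.2.17 in Abramowitz and Stegun's handbook, with a stated absolute error below 7.5e-8). The function name normal_cdf_hastings is an assumption made for this example.

```python
import math

# Constants of the five-term polynomial approximation
# (Abramowitz & Stegun 26.2.17, credited there to Hastings).
P = 0.2316419
B = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)


def normal_cdf_hastings(x, mean=0.0, sd=1.0):
    """Approximate the integrated normal probability at x."""
    z = (x - mean) / sd
    t = 1.0 / (1.0 + P * abs(z))
    # b1*t + b2*t^2 + ... + b5*t^5, evaluated by Horner's rule
    poly = t * (B[0] + t * (B[1] + t * (B[2] + t * (B[3] + t * B[4]))))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    upper_tail = pdf * poly               # area beyond |z|
    return 1.0 - upper_tail if z >= 0 else upper_tail


# Integrated probability at the cutoff of 270 for mean 250, S.D. 20
print(normal_cdf_hastings(270.0, mean=250.0, sd=20.0))
```

Multiplying the returned value by the number of datum values gives a model curve on the same scale as the eCDF.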







When the meter pegs!
Sometimes, measured values are truncated at some lower or higher limit. This occurs often with analog instruments that have fixed readout scales. Sometimes the baseline drifts off scale on the low side or measurements unexpectedly exceed the maximum. Thus, a data set has a number of values recorded either as off scale or assigned to the lowest or highest limiting value.
Truncated data give biased calculated means and standard deviations, even if the parent population is Gaussian.
In this case the sample from the first figure is truncated by assigning the value 270 to any value >270. The simulation is from a population with a mean of 250 and an S.D. of 20. The original sample of 200 values gives a mean of 250.4 and an S.D. of 19.6. The truncated sample with all data included gives 248.5 and 16.3. If the truncated values are ignored, the estimates are worse (245.0 and 14.8).
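The direction of this bias is easy to reproduce with a small simulation. The sketch below is an assumption-laden stand-in, not the original simulation: the seed is arbitrary, so the printed numbers will differ from those quoted above, but the ordering of the three results should be the same.

```python
import random
import statistics


def summarize(values):
    """Return (mean, sample standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)


random.seed(1)                                  # arbitrary seed, for repeatability only
full = [random.gauss(250.0, 20.0) for _ in range(200)]

LIMIT = 270.0
clipped = [min(v, LIMIT) for v in full]         # off-scale values recorded as 270
dropped = [v for v in full if v <= LIMIT]       # off-scale values ignored

for label, data in (("original", full), ("clipped", clipped), ("dropped", dropped)):
    m, s = summarize(data)
    print(f"{label:9s} n={len(data):3d} mean={m:6.1f} S.D.={s:5.1f}")
```

Clipping can only pull values down toward the limit, so both the mean and the S.D. shrink. Dropping the off-scale values lowers the mean further.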
For DANFIP analysis, the data values that have been assigned the limiting value are not processed, but the eCDF of the remaining data is fitted WITH THE ORIGINAL SAMPLE SIZE to preserve some of the information. This gives a very good estimate of the parent population parameters (249.7 and 19.4).
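A minimal sketch of such a fit follows, under several assumptions that are not taken from the original: the function names are hypothetical, math.erf stands in for the Hastings approximation, and a crude step-halving coordinate search replaces a proper non-linear least-squares routine. The seed is arbitrary, so the printed estimates will not match the figures quoted above.

```python
import math
import random
import statistics


def normal_cdf(x, mean, sd):
    """Integrated normal probability (math.erf used in place of Hastings)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def fit_censored_ecdf(values, limit):
    """Least-squares fit of n_total * CDF to the eCDF of values below `limit`."""
    n_total = len(values)                           # the ORIGINAL sample size
    kept = sorted(v for v in values if v < limit)   # pegged values are not processed
    ranks = range(1, len(kept) + 1)

    def sse(mean, sd):
        return sum((r - n_total * normal_cdf(x, mean, sd)) ** 2
                   for x, r in zip(kept, ranks))

    # Start from the (biased) statistics of the retained data.
    params = [statistics.mean(kept), statistics.stdev(kept)]
    step = params[1]
    best = sse(*params)
    for _ in range(10000):                          # safety cap on iterations
        if step < 1e-6:
            break
        improved = False
        for i in (0, 1):                            # 0 = mean, 1 = S.D.
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                if trial[1] <= 0:                   # S.D. must stay positive
                    continue
                value = sse(*trial)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            step /= 2.0                             # refine the search
    return params[0], params[1]


random.seed(1)                                      # arbitrary seed
sample = [min(random.gauss(250.0, 20.0), 270.0) for _ in range(200)]
mean_fit, sd_fit = fit_censored_ecdf(sample, 270.0)
print(f"fitted mean={mean_fit:.1f} S.D.={sd_fit:.1f}")
```

The key design point is that the model curve is scaled by n_total rather than by the number of retained values. The count of off-scale readings therefore still constrains the fit even though their individual values are unknown.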
John E. Wampler
Department of Biochemistry & Molecular Biology
Life Sciences Building
University of Georgia
Athens, GA 30602-7229
Phone (706) 542-1573
FAX (706) 542-1738
e-mail: wampler@bmb.uga.edu