## About this lesson

Data sets are often displayed as distributions. Different distribution shapes indicate different underlying physical phenomena, so the ability to recognize a distribution helps in identifying process performance issues.

## Quick reference

### Classes of Distribution

### When to use

Visualizing a data set is often a more effective way to explain its characteristics than a table of numbers. In addition, each class of distribution has specific characteristics that dictate which type of hypothesis test is appropriate for that data.

### Instructions

#### Discrete Data Distributions

Discrete data can take only a limited number of values, and there is no meaningful data between the data categories. These distributions are therefore always shown as histograms, where each category or bucket is represented by a vertical bar whose height is the count of data elements in it.

*Binomial Distribution*

This is a distribution for data where each item can take only one of two states, such as pass/fail. It relies on a fixed batch size. The horizontal axis is the count of items in one of the states per batch. The vertical axis is normally the percentage of batches in which that count occurs. An example is the number of defective items in a batch.
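
The bar heights of a binomial histogram can be computed directly. The following is a minimal sketch, not part of the original lesson: the batch size of 20 and the 5% defect rate are assumed values chosen only for illustration, and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k items in the 'fail' state in a batch of n items."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Assumed example values: batches of 20 items, each with a 5% chance of failing.
batch_size, defect_rate = 20, 0.05
pmf = [binomial_pmf(k, batch_size, defect_rate) for k in range(batch_size + 1)]

# Horizontal axis: count per batch. Vertical axis: share of batches with that count.
for k in range(6):
    print(f"{k} defective: {100 * pmf[k]:5.1f}% of batches")
```

The fixed batch size is what bounds the horizontal axis: the count can never exceed `batch_size`.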

*Poisson Distribution*

This distribution is a count of occurrences of an event within a fixed interval, usually a time period such as a day. The horizontal axis is the count of the instances that occurred in that time period. An example is the number of phone calls in a day.
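
As a rough sketch of how such a histogram arises, the probability of each count can be computed from the average count per period. The average of 4 calls per day is an assumption made up for this example, and `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial


def poisson_pmf(k: int, mean_count: float) -> float:
    """Probability of exactly k occurrences in one period, given the average count."""
    return mean_count**k * exp(-mean_count) / factorial(k)


# Assumed example value: an average of 4 phone calls per day.
calls_per_day = 4.0

# Text histogram: one '#' per percentage point of days with that many calls.
for k in range(10):
    bar = "#" * round(100 * poisson_pmf(k, calls_per_day))
    print(f"{k:2d} calls | {bar}")
```

Unlike the binomial case, there is no fixed upper limit on the count, so the right tail keeps going with ever smaller bars.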

*Geometric Distribution*

This distribution is again used with data items that have only two states. In this case, instead of counting the number in a batch, it is based upon when the state first changes. The horizontal axis is a timeline divided into periods. The vertical axis shows the probability that the state first changes during that period. An example is plotting the day of the month on which rainfall or snowfall first occurs.
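
A minimal sketch of the rainfall example follows. It assumes, purely for illustration, that each day has the same independent 20% chance of rain; `geometric_pmf` is a hypothetical helper name.

```python
def geometric_pmf(day: int, p: float) -> float:
    """Probability that the first occurrence falls on the given day (1, 2, 3, ...)."""
    # The event must not occur on the previous (day - 1) days, then occur on this day.
    return (1 - p) ** (day - 1) * p


# Assumed example value: a 20% chance of rain on any given day.
chance_of_rain = 0.2

for day in range(1, 11):
    print(f"first rain on day {day:2d}: {geometric_pmf(day, chance_of_rain):.3f}")
```

Under these assumptions the first bar is always the tallest, and each later bar is smaller by the same factor.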

#### Continuous Data Distributions

Continuous data can take on an infinite number of values. Between any two data values, there is another data value that could be detected if the measurement system were able to discriminate accurately at that level of fraction or decimal. The plots are characterized by a smooth curve, not histogram bars. In all these plots, the horizontal axis is the measured value and the vertical axis shows how likely values near that point are.

*Normal Distribution*

This is the bell-shaped curve that represents common cause or random variation. It is symmetric, peaked in the center and the tails approach zero. This is normally our desired distribution for analysis because we know that it represents random variation around the process performance.
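
The symmetry and the central peak can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library; the process mean of 50 and standard deviation of 2 are assumed values for illustration only, and `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

# Assumed example process: mean 50, standard deviation 2.
process = NormalDist(mu=50.0, sigma=2.0)


def share_within(n_sd: float) -> float:
    """Share of values expected within n_sd standard deviations of the mean."""
    half_width = n_sd * process.stdev
    return process.cdf(process.mean + half_width) - process.cdf(process.mean - half_width)


# Symmetric: points equally far on either side of the mean have the same height.
print(process.pdf(48.0), process.pdf(52.0))

# Peaked in the center, with tails that approach zero.
for n_sd in (1, 2, 3):
    print(f"within {n_sd} standard deviation(s): {100 * share_within(n_sd):.1f}%")
```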

*Uniform Distribution*

This is a horizontal line: the vertical value is essentially equal for all horizontal axis values within the range of the data. This represents the case where every value in that range is equally likely.

*Bi-modal Distribution*

This is normally an asymmetric curve. There are two (or more) peaks. This represents the case where there are multiple processes embedded in the data. These need to be separated and each process analysed individually.
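
The sketch below simulates how two embedded processes produce two peaks, and how separating them recovers two single-peaked data sets. Everything in it is assumed for illustration: the two hypothetical machines, their means of 10 and 14, the standard deviation of 0.5, and the fixed random seed.

```python
import random
from collections import Counter
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Two hypothetical machines feeding the same data set (assumed values).
machine_a = [random.gauss(10.0, 0.5) for _ in range(500)]
machine_b = [random.gauss(14.0, 0.5) for _ in range(500)]
combined = machine_a + machine_b

# Bucket the combined measurements to the nearest whole unit and draw a text plot.
counts = Counter(round(x) for x in combined)
for value in range(8, 17):
    print(f"{value:2d} | {'#' * (counts[value] // 10)}")

# Analysing each process on its own gives one peak per machine.
print(f"machine A mean: {mean(machine_a):.2f}")
print(f"machine B mean: {mean(machine_b):.2f}")
```

In practice the source of each data point is often not recorded, so the separation usually has to be done by finding the factor (machine, shift, supplier and so on) that splits the peaks.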

*Exponential Distribution*

This is an asymmetric curve. One end starts at a point on the vertical axis and the other end of the curve approaches, but never reaches, a value of zero. A typical physical phenomenon that follows this pattern is the failure rate of a product or system that is subject to infant mortality.
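
A minimal sketch of the curve shape follows. The rate of 0.5 is an assumed value, and `exponential_pdf` is a hypothetical helper name.

```python
from math import exp


def exponential_pdf(x: float, rate: float) -> float:
    """Height of the exponential curve at x (zero for negative x)."""
    if x < 0:
        return 0.0
    return rate * exp(-rate * x)


# Assumed example value for the rate constant.
rate = 0.5

# The curve starts at 'rate' on the vertical axis and decays toward zero.
for x in (0, 1, 2, 4, 8, 16):
    print(f"x = {x:2d}: {exponential_pdf(x, rate):.5f}")
```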

*Log-normal Distribution*

This is also an asymmetric curve. Both ends of the curve are at zero. However, one end quickly shoots up and then the curve slowly decays back to zero. This is also a commonly occurring pattern in the real world. For instance, machine downtime follows this pattern: every repair takes some finite amount of time, which produces the major spike, and some repairs take much longer, which produces the long tail.
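
The quick rise and slow decay can be seen by evaluating the log-normal density at a few points. This is a sketch under assumed parameters (`mu = 0` and `sigma = 0.5`, read here as repair times in hours); `lognormal_pdf` is a hypothetical helper name.

```python
from math import exp, log, pi, sqrt


def lognormal_pdf(x: float, mu: float, sigma: float) -> float:
    """Height of the log-normal curve at x (zero at and below x = 0)."""
    if x <= 0:
        return 0.0
    return exp(-((log(x) - mu) ** 2) / (2 * sigma**2)) / (x * sigma * sqrt(2 * pi))


# Assumed example parameters for repair time in hours.
mu, sigma = 0.0, 0.5
peak_at = exp(mu - sigma**2)  # location of the highest point of the curve

for hours in (0.1, 0.5, peak_at, 1.5, 3.0, 6.0):
    print(f"{hours:5.2f} h: {lognormal_pdf(hours, mu, sigma):.4f}")
```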

*Weibull Distribution*

The Weibull curve is actually a family of curves that can take on many shapes, including ones that match an exponential curve or resemble a log-normal or even a normal curve. The actual shape varies based upon the constants in the Weibull equation. This equation has proven very effective at modelling reliability in complex systems. The constants are based upon the system design parameters.
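
The following sketch shows how one constant changes the shape of the curve. It uses the common two-parameter form of the Weibull equation with a `shape` and a `scale` constant; the specific values are assumptions for illustration, and `weibull_pdf` is a hypothetical helper name.

```python
from math import exp


def weibull_pdf(x: float, shape: float, scale: float) -> float:
    """Height of the two-parameter Weibull curve at x.

    This sketch returns 0 at and below x = 0 and ignores the x = 0 edge case.
    """
    if x <= 0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1) * exp(-(z**shape))


scale = 4.0  # assumed example value

# shape < 1: steep decay; shape = 1: exponential curve; shape > 1: a single hump.
for shape in (0.5, 1.0, 3.0):
    heights = [weibull_pdf(x, shape, scale) for x in (0.5, 2.0, 3.5, 8.0)]
    print(f"shape {shape}: " + "  ".join(f"{h:.4f}" for h in heights))
```

With `shape = 1.0` the formula reduces to an exponential curve with rate `1 / scale`, which is one way to see why the exponential is a member of the Weibull family.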

*PDF and CDF*

The final topic in distributions is the PDF and CDF displays. PDF stands for Probability Density Function. This is the type of display used for all the curves described earlier in this reference guide. The height of the curve shows the relative likelihood (the probability density) of a data point occurring at that value of the horizontal axis. The higher the point, the more density at that point of the distribution.

CDF stands for Cumulative Distribution Function. It shows the probability that a point in the distribution will have occurred at or below that level of the horizontal axis. In a CDF, the curve always starts at zero on the left end, because no data points lie below the lowest value, and ends at one on the right end, because all data points have occurred by then and the probability is 100%.

If the distribution is a uniform distribution, the PDF is a flat line (as described above) and the CDF is a straight diagonal line going from zero to one. If the underlying distribution is a normal curve when shown as a PDF, it is an S curve when shown as a CDF. The slope starts very shallow, when small changes are occurring on the left tail of the normal curve. The slope becomes steep in the center, where the normal curve peaks, and then becomes shallow again as the horizontal axis approaches the right tail of the normal curve. The CDF of an exponential curve starts at zero, rises steeply at first and then flattens out as it approaches the value of 1.
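
The three CDF shapes described above can be tabulated side by side. This is a sketch under assumed parameters: a uniform range of 0 to 10, a normal curve with mean 5 and standard deviation 1.5, and an exponential rate of 0.5. `uniform_cdf` and `exponential_cdf` are hypothetical helper names; `NormalDist` comes from the Python standard library.

```python
from math import exp
from statistics import NormalDist


def uniform_cdf(x: float, low: float, high: float) -> float:
    """Straight diagonal line from 0 at 'low' to 1 at 'high'."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def exponential_cdf(x: float, rate: float) -> float:
    """Steep rise from 0 that flattens out as it approaches 1."""
    return 0.0 if x < 0 else 1.0 - exp(-rate * x)


normal = NormalDist(mu=5.0, sigma=1.5)  # assumed example parameters

print(" x  uniform  normal  exponential")
for x in range(0, 11):
    print(
        f"{x:2d}  {uniform_cdf(x, 0, 10):7.2f}  {normal.cdf(x):6.2f}  "
        f"{exponential_cdf(x, 0.5):11.2f}"
    )
```

Reading down each column shows the diagonal line, the S curve and the early steep rise, and every column only ever increases.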

### Hints & tips

- If the graph is a bar graph (histogram), it is discrete data; if it is a smooth curve, it is continuous data.
- PDF and CDF show the same information, just with different ways of expressing the vertical scale values. A PDF value applies to that specific horizontal scale value. A CDF value covers all the horizontal scale values up to and including that point.
