Lecture 4

Measures of Central Tendency (Location)

Remark

A frequency table or a histogram provides a visual summary of the data. Numerical summaries are even better, because they can be used in further analysis. With enough numerical statistics we can also capture most of the information about the shape of the distribution, and then build and validate a statistical model. A statistical model of the data is a powerful tool for making predictions and drawing conclusions. We start with the most basic numerical summaries, the measures of central tendency.

Arithmetic Mean = Average

Definition 1

Let $Y_1, Y_2, \dots, Y_n$ be observations on a (quantitative) variable in a sample of size $n$. The arithmetic mean (mean, average) is defined as $$ \overline{Y}=\frac{1}{n}\left(Y_1+\dots+Y_n\right)=\frac{1}{n} \sum_{i=1}^n Y_i=\frac{1}{n} \sum Y $$ Here and for the rest of the course, we will use the symbol $\sum$ to denote a sum.

Example 1

Ladybug data

$$ \begin{array}{cccccccccc} 7.6 & 7.2 & 8.6 & 7.0 & 8.6 \\ 6.6 & 8.6 & 7.8 & 8.6 & 7.6 \\ 7.8 & 8.8 & 7.6 & 9.4 & 7.2 \\ 8.2 & 8.8 & 9.0 & 7.2 & 7.6 \\ 8.8 & 8.2 & 7.2 & 8.4 & 7.8 \end{array}$$

Ladybug average size

$$\overline{Y} = \frac{1}{25}(7.6+\dots+7.8)=8.008 \mathrm{~mm} $$
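The computation in Example 1 can be checked with a few lines of Python. This is only a sketch: the list name `ladybug_mm` and the helper `arithmetic_mean` are chosen here for illustration and are not part of the course material.

```python
# Ladybug sizes in mm, copied row by row from Example 1.
ladybug_mm = [
    7.6, 7.2, 8.6, 7.0, 8.6,
    6.6, 8.6, 7.8, 8.6, 7.6,
    7.8, 8.8, 7.6, 9.4, 7.2,
    8.2, 8.8, 9.0, 7.2, 7.6,
    8.8, 8.2, 7.2, 8.4, 7.8,
]


def arithmetic_mean(values):
    """Return the sum of the values divided by the number of values."""
    return sum(values) / len(values)


n = len(ladybug_mm)
y_bar = arithmetic_mean(ladybug_mm)
print(f"n = {n}, mean = {y_bar:.3f} mm")
```

The printed mean should agree with the value of $\overline{Y}$ in Example 1 to three decimal places.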

Weighted Average

Definition 2

Weighted averages are used when the values carry different weight (importance).

$$ \overline{Y}_w=\frac{ \sum w_i \cdot Y_i}{ \sum w_i} \; ; \quad \text { where } w_i \text { is the weight of the } i^{th} \text{ value. }$$

Example 2

You have three samples from the same population. Here are the sample averages $Y_i$ and the sample sizes $n_i$. $$ \begin{array}{c|ccc} Y_i & 3.85 & 5.21 & 4.70\\ \hline n_i & 12 & 5 & 8 \end{array}$$

Then, using the sample sizes as weights ($w_i = n_i$), the weighted average is $$\overline{Y}_w = \frac{(12)(3.85)+(5)(5.21)+(8)(4.70)}{12+5+8}=\frac{109.85}{25}\approx 4.39$$ This is of course equivalent to adding up all the original measurements and computing the overall sample mean.
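As a cross-check of Definition 2 on the data of Example 2, here is a minimal Python sketch. The function name `weighted_mean` and the variable names are illustrative assumptions; the sample sizes play the role of the weights $w_i$.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * y_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * y for w, y in zip(weights, values)) / total_weight


sample_means = [3.85, 5.21, 4.70]  # sample averages from Example 2
sample_sizes = [12, 5, 8]          # sample sizes, used as weights

y_bar_w = weighted_mean(sample_means, sample_sizes)
unweighted = sum(sample_means) / len(sample_means)

print(f"weighted mean   = {y_bar_w:.3f}")
print(f"unweighted mean = {unweighted:.3f}")
```

Printing the unweighted mean next to the weighted one shows how much the larger samples pull the result toward their own averages.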

Geometric Mean

Definition 3

Geometric means are used for growth rates.

$$ GM_Y=\sqrt[n]{Y_1\cdot Y_2\cdots Y_n } = \sqrt[n]{\prod_{i=1}^n \,Y_i} $$ Here and for the rest of the course, we will use the symbol $\prod$ to denote a product of values.

Example 3

Say that the weekly growth factors for the weight of a fawn over five consecutive weeks are $$ 1.03,\, 0.98,\, 1.09,\, 1.12,\, 1.08 $$ Each value is the ratio of the weight at the end of a week to the weight at the start of that week, so the gains are multiplicative, not additive. We can also interpret them as percentages (in the same way as interest rates). Thus in the first week the fawn gained $3\%$ in weight, in the second week it lost $2\%$, and so on.

The geometric mean is $$ GM_Y=\sqrt[5]{(1.03)\cdot(0.98)\cdots (1.08) }=1.0588 $$ Thus the fawn gained $5.88\%$ in weight per week on average.
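The geometric mean of Example 3 can be reproduced with the short Python sketch below. The names `geometric_mean` and `weekly_factors` are illustrative. Averaging logarithms is one common way to compute an $n$-th root of a product, and it is mathematically the same as the formula in Definition 3.

```python
import math


def geometric_mean(factors):
    """Return the n-th root of the product of n positive factors."""
    if any(f <= 0 for f in factors):
        raise ValueError("all factors must be positive")
    # exp(mean of logs) equals the n-th root of the product.
    return math.exp(sum(math.log(f) for f in factors) / len(factors))


weekly_factors = [1.03, 0.98, 1.09, 1.12, 1.08]  # data from Example 3

gm = geometric_mean(weekly_factors)
total_growth = math.prod(weekly_factors)  # growth over all five weeks

print(f"geometric mean        = {gm:.4f}")
print(f"total 5-week growth   = {total_growth:.4f}")
print(f"geometric mean ** 5   = {gm ** 5:.4f}")
```

The last two printed values agree, which is the defining property: growing by the geometric mean every week gives the same total growth as the five actual weekly factors.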

Harmonic Mean

Definition 4

Harmonic means are commonly used to compute an average speed of a process, when the times to complete this process are known. $$ \frac{1}{H_Y}=\frac{1}{n}\sum \frac{1}{Y} $$

Example 4

Suppose that the times (in hours) it takes a migratory bird to cover $100\,\mathrm{km}$ are $$ 6.5, \, 8.3, \, 5.1, \, 12.7 $$

Then the average number of hours to cover $100\,\mathrm{km}$ is $$ \begin{aligned} \frac{1}{H_Y}&=\frac{1}{4}\left(\frac{1}{6.5}+\frac{1}{8.3}+\frac{1}{5.1}+\frac{1}{12.7}\right)\\ &\\ \Rightarrow\quad H_Y&=7.28 \mathrm{~hrs} \end{aligned} $$
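The following Python sketch recomputes the harmonic mean of Example 4 and shows how it relates to speed. The names `harmonic_mean`, `hours_per_100km` and `speeds` are illustrative. The code relies on the fact that a time of $t$ hours per $100\,\mathrm{km}$ corresponds to a speed of $100/t$ km/h.

```python
def harmonic_mean(values):
    """Return the reciprocal of the arithmetic mean of the reciprocals."""
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    return len(values) / sum(1 / v for v in values)


hours_per_100km = [6.5, 8.3, 5.1, 12.7]  # data from Example 4

h = harmonic_mean(hours_per_100km)

# Speed in km/h for each observed time, and the arithmetic mean of the speeds.
speeds = [100 / t for t in hours_per_100km]
mean_speed = sum(speeds) / len(speeds)

print(f"harmonic mean of times     = {h:.2f} h")
print(f"arithmetic mean of speeds  = {mean_speed:.2f} km/h")
print(f"time for 100 km at that speed = {100 / mean_speed:.2f} h")
```

The first and last printed values coincide: the harmonic mean of the times is the time needed for $100\,\mathrm{km}$ at the arithmetic mean of the four speeds.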

Median

Definition 5

The median is the $50^{th}$ percentile of an ordered distribution of values. In other words, half the values are smaller than the median and half the values are larger than the median.

Example 5

$\begin{array}{|l|ccccc|} \hline \text{Dataset 1} & 14 & 16 & 18 & 19 & 23 \\ \hline \end{array}$ $$ \Rightarrow \quad M=18 $$
$\begin{array}{|l|cccc|}\hline \text{Dataset 2} & 14 & 16 & 18 & 22\\ \hline \end{array}$ $$ \Rightarrow \quad M=\frac{16+18}{2}=17 $$ When the number of observations is even, the median is the average of the two middle values.
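The rule used in Example 5 (middle value for an odd number of observations, average of the two middle values for an even number) can be written as a short Python function. This is a sketch; the name `median` is chosen for illustration, and Python's standard `statistics.median` follows the same rule.

```python
def median(values):
    """Return the middle value of the sorted data.

    For an even number of observations, return the average
    of the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


m1 = median([14, 16, 18, 19, 23])  # Dataset 1, five observations
m2 = median([14, 16, 18, 22])      # Dataset 2, four observations
print(m1, m2)
```

Sorting first matters: the datasets in Example 5 happen to be listed in increasing order, but the function does not assume that.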

Remark

For a skewed distribution, you should report the median instead of the mean as a measure of central tendency. This is because the mean is heavily influenced by outliers, while the median is not. The following example illustrates this point.

Example 6

Average daily temperatures over the seven days of a week


$ \begin{array}{|l|ccccccc|} \hline \text{Dataset 1: } & 2 & -3 & -1 & 4 & 0 & -5 & -6 \\ \hline \end{array} $ $$\Rightarrow \quad \overline{Y}=-1.3 \,;\; M=-1 $$
$ \begin{array}{|l|ccccccc|} \hline \text{Dataset 2: } & 2 & -23 & -1 & 4 & 0 & -5 & -6 \\ \hline \end{array} $ $$\Rightarrow \quad \overline{Y}=-4.1 \,;\; M=-1 $$ The extreme value $-23$ has a large influence on the mean, but no influence on the median.
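The effect of the extreme value in Example 6 can be explored with Python's standard `statistics` module. The variable names `week_1` and `week_2` are illustrative; the data are the two datasets of the example.

```python
import statistics

week_1 = [2, -3, -1, 4, 0, -5, -6]   # Dataset 1
week_2 = [2, -23, -1, 4, 0, -5, -6]  # Dataset 2: -3 replaced by -23

for label, temps in (("Dataset 1", week_1), ("Dataset 2", week_2)):
    mean = statistics.mean(temps)
    med = statistics.median(temps)
    print(f"{label}: mean = {mean:.1f}, median = {med}")

# How far each summary moves when the single extreme value is introduced.
mean_shift = statistics.mean(week_2) - statistics.mean(week_1)
median_shift = statistics.median(week_2) - statistics.median(week_1)
print(f"shift in mean = {mean_shift:.2f}, shift in median = {median_shift}")
```

Changing one observation by $20$ moves the mean by $20/7$, while the median stays where it was because the middle value of the ordered data is unchanged.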

Remark

Other measures of position include quartiles and percentiles. For example, the 90th percentile is the value such that 90% of the values are smaller than it, while 10% of the values are larger than it.
Also used are the deciles (10th, 20th, 30th, etc. percentiles) and the quartiles (25th, 50th, and 75th percentiles): $Q_1=\text{first quartile}, \, Q_2=\text{median}, \, Q_3=\text{third quartile}$. The quartiles and the percentiles will be handled in Excel.
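For readers who want to experiment outside Excel, quartiles and deciles can also be computed with Python's standard `statistics.quantiles` function. This is a sketch on the ladybug data of Example 1, assuming the `"inclusive"` interpolation method. Software packages use slightly different interpolation rules, so results may differ slightly from one tool or function to another.

```python
import statistics

# Ladybug sizes in mm, copied row by row from Example 1.
ladybug_mm = [
    7.6, 7.2, 8.6, 7.0, 8.6,
    6.6, 8.6, 7.8, 8.6, 7.6,
    7.8, 8.8, 7.6, 9.4, 7.2,
    8.2, 8.8, 9.0, 7.2, 7.6,
    8.8, 8.2, 7.2, 8.4, 7.8,
]

# n=4 returns the three cut points Q1, Q2 (median), Q3.
# method="inclusive" treats the smallest and largest observations
# as the 0th and 100th percentiles (an assumption made for this sketch).
q1, q2, q3 = statistics.quantiles(ladybug_mm, n=4, method="inclusive")

# n=10 returns the nine deciles; the last one is the 90th percentile.
deciles = statistics.quantiles(ladybug_mm, n=10, method="inclusive")
p90 = deciles[-1]

print(f"Q1 = {q1:.1f}, Q2 = {q2:.1f}, Q3 = {q3:.1f}, 90th percentile = {p90:.1f}")
```

Here $Q_2$ coincides with the median of the 25 observations, as the definition of the second quartile requires.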

Remark

The mode is the most frequently occurring value, but it will not be too useful for us.
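One reason the mode is of limited use can be seen by computing it for the ladybug data of Example 1. The sketch below uses `statistics.multimode` and `collections.Counter` from the Python standard library; the variable names are illustrative.

```python
from collections import Counter
from statistics import multimode

# Ladybug sizes in mm, copied row by row from Example 1.
ladybug_mm = [
    7.6, 7.2, 8.6, 7.0, 8.6,
    6.6, 8.6, 7.8, 8.6, 7.6,
    7.8, 8.8, 7.6, 9.4, 7.2,
    8.2, 8.8, 9.0, 7.2, 7.6,
    8.8, 8.2, 7.2, 8.4, 7.8,
]

# multimode returns every value that attains the highest frequency.
modes = multimode(ladybug_mm)
counts = Counter(ladybug_mm)

for value in modes:
    print(f"{value} mm occurs {counts[value]} times")
```

For this sample three different sizes tie for most frequent, so the mode does not single out one central value.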

Looking ahead

Amazingly enough, we will be able to use sample data to draw conclusions about whole populations. For example, say we want to estimate the average weight $\mu$ of the population of Yellow Perch in the Lake of Two Mountains. Then $\mu$ is a population parameter and is not measurable in practice, since it would be impossible to catch and weigh every single Yellow Perch. Proceeding pragmatically, we take a sample and compute the sample mean $\overline{Y}$.
The value of $\overline{Y}$ depends on the sample, but when averaged over all possible samples of the same size, the mean of the sample means is equal to $\mu$. We call $\overline{Y}$ an unbiased estimator for $\mu$. Moreover, for large enough samples, $\overline{Y}$ will tend to be close to $\mu$. Later in the course we will learn how to quantify this closeness. In this way we will be able to use sample data to draw conclusions about whole populations.
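The statement that the mean of the sample means equals $\mu$ can be verified exactly on a tiny example. The sketch below uses a made-up population of six weights (hypothetical numbers, not real Yellow Perch data) and assumes simple random sampling without replacement, so that every sample of size 3 is equally likely. Exact fractions are used to avoid rounding effects.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population of six weights in grams (made-up numbers).
population = [112, 95, 140, 128, 101, 150]
sample_size = 3

# Population mean mu.
mu = Fraction(sum(population), len(population))

# Mean of every possible sample of the given size (without replacement).
sample_means = [
    Fraction(sum(sample), sample_size)
    for sample in combinations(population, sample_size)
]
mean_of_sample_means = sum(sample_means) / len(sample_means)

print(f"number of possible samples = {len(sample_means)}")
print(f"population mean mu         = {float(mu):.2f}")
print(f"mean of the sample means   = {float(mean_of_sample_means):.2f}")
print(f"smallest sample mean       = {float(min(sample_means)):.2f}")
print(f"largest sample mean        = {float(max(sample_means)):.2f}")
```

Individual sample means vary from sample to sample, as the smallest and largest values show, but their average over all possible samples equals $\mu$ exactly.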