Lecture 13

Chi-Square Test for Independence

Remark

The Chi-Square Test for Independence is a statistical test used to determine whether two categorical variables are related. The test uses the $\chi^2$ statistic to assess whether the difference between the observed and expected frequencies in a contingency table is statistically significant. The expected frequencies are calculated under the assumption that the two variables are independent.

For a $\chi^2$ test for independence, the null hypothesis is the statement that the two variables under consideration are independent. The alternative hypothesis is the statement that the two variables are dependent. The $\chi^2$ value defined next is computed from the data and is used to decide whether to reject the null hypothesis in favor of dependence between the two variables.

Formula

The data for the test for independence is organized in a contingency table with $r$ rows and $c$ columns. The value in row $i$ and column $j$ is denoted by $O_{ij}$.

The marginal counts in the table are used to calculate the expected frequency for each table cell under the assumption of independence. With $n=\sum_{i,j} O_{ij}$ denoting the total count, the expected frequency is the product of the row $i$ total and the column $j$ total, divided by $n$:$$ E_{ij}=\frac{1}{n}\left( \sum_{k=1}^{c} O_{ik} \right) \left( \sum_{k=1}^{r} O_{kj}\right) $$The test statistic for the $\chi^2$ test for independence is given by:$$ \chi^2=\sum_{i, j} \frac{\left (O_{i j}-E_{i j}\right)^2}{E_{i j}}$$with $(r-1)(c-1)$ degrees of freedom. We reject $H_0$ if $\chi^2$ is too large, that is, if it exceeds the critical value of the $\chi^2$ distribution with $(r-1)(c-1)$ degrees of freedom at the chosen level of significance.
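The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `expected_frequencies` and `chi_square_statistic` and the small 2×2 table are hypothetical and chosen only for this example.

```python
def expected_frequencies(observed):
    """Expected counts under independence: (row total * column total) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Return the chi-square statistic and its degrees of freedom."""
    expected = expected_frequencies(observed)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


# Hypothetical 2x2 table of observed counts (not taken from the lecture).
table = [[10, 20], [30, 40]]
expected = expected_frequencies(table)
chi2, df = chi_square_statistic(table)
print(expected)
print(chi2, df)
```

The table is stored as a list of rows, so `zip(*observed)` yields its columns and both sets of marginal totals come from the same data.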

The $p$-value for the test is the probability of observing a test statistic at least as large as the one we observed, assuming the null hypothesis is true.
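The lecture obtains $p$-values from software. As one possible sketch of what such software computes, the upper-tail probability of the $\chi^2$ distribution can be evaluated with the standard power series for the regularized lower incomplete gamma function. The function name `chi2_sf` is hypothetical, and the series is an assumption chosen for simplicity: it is adequate for moderate values of the statistic but loses precision for very small $p$-values.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a = df / 2.0
    z = x / 2.0
    # Series: gamma(a, z) = z^a * e^(-z) * sum_{n>=0} z^n / (a (a+1) ... (a+n)).
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    # Divide by Gamma(a) on the log scale to get the regularized value P(a, z).
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower


# With 2 degrees of freedom the upper tail is exactly exp(-x / 2),
# which gives a quick sanity check of the series.
print(chi2_sf(2.0, 2), math.exp(-1.0))
```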

Example 1

The following table shows the distribution of bison in Yellowstone National Park by age and location.$$ \begin{array}{l|c|c|c|c} \text{Age} & \text{Lamar} & \text{Nez Percé} & \text{Firehole} & \text{Total} \\ \hline \text{Calf} & 13 & 13 & 15 & 41 \\ \hline \text{Yearling} & 10 & 11 & 12 & 33 \\ \hline \text{Adult} & 34 & 28 & 30 & 92 \\ \hline \text{Total} & 57 & 52 & 57 & 166\end{array}$$

Is the age distribution independent of the location in the park? Test at the $\alpha=0.05$ level of significance.

Solution

Null hypothesis, $H_0$: Age distribution is independent of location.
Alternative hypothesis, $H_1:$ Age distribution and location are dependent.

The expected frequency for each cell is computed by multiplying its row and column marginal counts and dividing by the total count. For example, for the calves in Lamar, the expected frequency is:$$ E_{11}=\frac{41\times 57}{166}=14.08 $$The remaining expected frequencies are computed similarly.$$ \begin{array}{l|c|c|c|c} \text{Age} & \text{Lamar} & \text{Nez Percé} & \text{Firehole} & \text{Total} \\ \hline \text{Calf} & 14.08 & 12.84 & 14.08 & 41 \\ \hline \text{Yearling} & 11.33 & 10.34 & 11.33 & 33 \\ \hline \text{Adult} & 31.59 & 28.82 & 31.59 & 92 \\ \hline \text{Total} & 57 & 52 & 57 & 166\end{array}$$Next we compute the $\chi^2$ statistic using the observed frequencies from the data table and the expected frequencies.$$ \begin{aligned} \chi^2 &=\frac{(13-14.08)^2}{14.08}+\cdots+\frac{(30-31.59)^2}{31.59} \\ &\\ &=0.670355, \quad \quad df=(3-1)(3-1)=4 \end{aligned} $$Using software (e.g. Excel) we find: $p\text{-value}= 0.9549 > \alpha = 0.05$.

We fail to reject $H_0$. These data do not provide enough evidence to refute the claim that location and age distribution of bison in Yellowstone are independent.
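The calculation in Example 1 can be checked with a short script. This is a sketch under one stated assumption: instead of calling a statistics package, it uses the closed form of the $\chi^2$ upper tail for 4 degrees of freedom, $e^{-x/2}(1+x/2)$, which does not apply to other degrees of freedom. The variable names are illustrative.

```python
import math

# Observed counts from the bison table
# (rows: Calf, Yearling, Adult; columns: Lamar, Nez Percé, Firehole).
observed = [
    [13, 13, 15],
    [10, 11, 12],
    [34, 28, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: (row total * column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Closed form of the chi-square upper tail, valid only for df = 4.
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

print(f"chi2 = {chi2:.4f}, df = {df}, p-value = {p_value:.4f}")
```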

Example 2

Consider the following data from a Myers-Briggs personality test and the occupation of the test takers.$$ \begin{array}{l|c|c|c} \text{Occupation} & \text{Extroverted} & \text{Introverted} & \text{Total} \\ \hline \text{Clergy} & 62 & 45 & 107 \\ \text{Medical Doctor} & 68 & 94 & 162 \\ \text{Lawyer} & 56 & 81 & 137 \\ \hline \text{Total} & 186 & 220 & 406 \end{array}$$Use a $\chi^2$ test at the $0.05$ level of significance to determine whether the listed occupations and personality traits are independent.

Solution

Null hypothesis, $H_0$: Occupation and personality are independent.
Alternative hypothesis, $H_1:$ Occupation and personality are dependent.

We start by computing the expected frequencies for each cell in the table under the assumption of independence. These expected frequencies are computed by multiplying the row and column marginal counts and dividing by the total count.$$ \begin{array}{l|c|c|c} \text{Occupation} & \text{Extroverted} & \text{Introverted} & \text{Total} \\ \hline \text{Clergy} & 49.02 & 57.98 & 107 \\ \text{Medical Doctor} & 74.22 & 87.78 & 162 \\ \text{Lawyer} & 62.76 & 74.24 & 137 \\ \hline \text{Total} & 186 & 220 & 406\end{array}$$The $\chi^2$ statistic is computed using the observed and expected frequencies.$$ \begin{aligned} \chi^2 &=\frac{(62-49.02)^2}{49.02}+\cdots+\frac{(81-74.24)^2}{74.24} \\ &\\ &=8.65, \quad \quad df=(3-1)(2-1)=2 \end{aligned} $$Using software we find: $p\text{-value}= 0.0132 < \alpha=0.05$.

Since the $p$-value is less than the level of significance, we reject $H_0$ and accept $H_1$. We conclude that occupation and personality type are dependent.
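As a check on Example 2, the sketch below recomputes the expected counts and lists each cell's term in the $\chi^2$ sum, which shows which cells depart most from independence. It assumes the closed form of the $\chi^2$ upper tail for 2 degrees of freedom, $e^{-x/2}$, in place of a statistics package. The variable names are illustrative.

```python
import math

# Observed counts from the occupation table
# (rows: Clergy, Medical Doctor, Lawyer; columns: Extroverted, Introverted).
rows = ["Clergy", "Medical Doctor", "Lawyer"]
cols = ["Extroverted", "Introverted"]
observed = [
    [62, 45],
    [68, 94],
    [56, 81],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
contributions = {}
for i, row_name in enumerate(rows):
    for j, col_name in enumerate(cols):
        e = row_totals[i] * col_totals[j] / n    # expected count under independence
        term = (observed[i][j] - e) ** 2 / e     # this cell's share of chi-square
        contributions[(row_name, col_name)] = term
        chi2 += term
        print(f"{row_name:15s} {col_name:12s} E = {e:6.2f}  term = {term:.3f}")

df = (len(rows) - 1) * (len(cols) - 1)

# Closed form of the chi-square upper tail, valid only for df = 2.
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.2f}, df = {df}, p-value = {p_value:.4f}")
```

In this table the two Clergy cells contribute the largest terms to the sum.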