How do the observed frequencies of a categorical variable, such as the number of cylinders (cyl), compare with their expected proportions (distribution) in the population? Here, we’ll be working from the mtcars data set to examine our
difference from the population expectation of the percentage of four-,
six-, and eight-cylinder (cyl) cars on the
road. We know from Kelley
Blue Book (2024) that around 43.4 percent of cars have 4-cylinder
engines, 32.8 percent of cars have 6-cylinder engines, and 20.4 percent
of cars have 8-cylinder engines (with the remaining 3.4 percent as
“other”). Removing the “others”, these percentages are adjusted to:
44.93 percent as 4-cylinder, 33.95 percent as 6-cylinder, and 21.12
percent as 8-cylinder engines.
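That adjustment is just a renormalization over the three remaining categories; a quick base-R sketch reproduces the figures:

```r
# Kelley Blue Book (2024) shares, with the 3.4 percent "other" category dropped
kbb <- c(four = .434, six = .328, eight = .204)

# renormalize the three remaining categories so they sum to 1
adjusted <- kbb / sum(kbb)
round(adjusted, 4)
#>   four    six  eight
#> 0.4493 0.3395 0.2112
```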
The Chi Square Goodness of Fit test (\(X^2\)) examines the difference between the observed frequencies within each category of our variable and the population/expected frequencies for each category of our variable, to determine if our observed frequencies are extremely different from the expectation.
The assumptions for the Chi Square are…
* Independence of observations: cases (observations) are not related to or dependent upon each other, a case can’t have more than one attribute, and there are no ties between observations. Examine the data collection strategy to see if there are linkages between observations.
Because each car in mtcars appears only once, we have met the assumption of independence of observations. In the vannstats
package, I have included the tab function which returns the
crosstabs of observed and expected frequencies. To check if you’ve met
the assumption of normality (i.e., fewer than 20% of cells in the crosstab of expected frequencies fall below \(n=5\)), you use the following:
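The tab call itself did not survive formatting here; as a base-R stand-in (the exact tab output format may differ), the expected counts can be checked directly:

```r
# observed counts of 4-, 6-, and 8-cylinder cars in mtcars
observed <- table(mtcars$cyl)            # 11, 7, 14

# expected counts under the adjusted population proportions
proportions <- c(.4493, .3395, .2112)
expected <- sum(observed) * proportions  # 14.3776 10.8640 6.7584

# assumption check: fewer than 20% of cells may fall below an expected count of 5
mean(expected < 5) < 0.20
#> [1] TRUE
```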
Here, we have met the assumption of normality: fewer than 20% of cells in the expected frequency crosstab have fewer than 5 expected counts. In fact, no cell has fewer than 5.
The calculation for the Chi Square is:
\(X^2 = \sum \frac{(O - E)^2}{E}\) or \(X^2 = \sum \frac{(f_o - f_e)^2}{f_e}\) or \(X^2 = \sum \frac{(O_i - E_i)^2}{E_i}\) or \(X^2 = \sum \frac{(f_{o_i} - f_{e_i})^2}{f_{e_i}}\)
where \(O\) (also written \(f_o\), \(O_i\), or \(f_{o_i}\)) is the observed frequency in each category, and \(E\) (also written \(f_e\), \(E_i\), or \(f_{e_i}\)) is the expected frequency in each category.
In addition, the degrees of freedom (\(df\)) for the test is…
* \(df = (k - 1)\)
where \(k\) is the number of categories of the variable.
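Plugging our mtcars counts into these formulas by hand reproduces the statistic that chisq.test reports:

```r
observed    <- c(11, 7, 14)            # 4-, 6-, and 8-cylinder counts in mtcars
proportions <- c(.4493, .3395, .2112)  # adjusted population proportions
expected    <- sum(observed) * proportions

# X^2 = sum( (O - E)^2 / E )
x2 <- sum((observed - expected)^2 / expected)
# df = k - 1
df <- length(observed) - 1

round(x2, 4)  # 9.9271
df            # 2
```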
For the Chi Square Goodness of Fit test, the vector of observed counts is supplied first, followed by the vector of expected population proportions (p):
counts <- c(11,7,14)
proportions <- c(.4493,.3395,.2112)
chisq.test(counts, p = proportions, correct = FALSE)
##
## Chi-squared test for given probabilities
##
## data: counts
## X-squared = 9.9271, df = 2, p-value = 0.006988
In the output above, we see the \(X^2\)-obtained value (9.9271), the degrees of freedom (2), and the p-value (0.006988, which is less than our set alpha level of .05).
To interpret the findings, we report the following information:
“Using the Chi Square Goodness of Fit test (\(X^2\)), I reject/fail to reject the null hypothesis that there is no difference between the population/expected frequencies and our obtained frequencies, \(X^2(?) = ?, p ? .05\)” For our example, we reject the null hypothesis: \(X^2(2) = 9.93, p < .05\).
The calculation for Yates’ Chi Square is:
\(X^{2}_{Yates'} = \sum \frac{(|f_o - f_e| - 0.5)^2}{f_e}\) or \(X^{2}_{Yates'} = \sum \frac{(|f_{o_i} - f_{e_i}| - 0.5)^2}{f_{e_i}}\)
To employ Yates’ Continuity Correction when we have violated the assumption of normality, simply update the chi.sq option to correct = TRUE.
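Note that base R’s chisq.test only applies the Yates correction to 2x2 tables, so for a goodness-of-fit layout the corrected statistic can also be computed directly from the formula above (a sketch of what the correct = TRUE option does):

```r
observed    <- c(11, 7, 14)            # 4-, 6-, and 8-cylinder counts in mtcars
proportions <- c(.4493, .3395, .2112)  # adjusted population proportions
expected    <- sum(observed) * proportions

# X^2_Yates = sum( (|O - E| - 0.5)^2 / E )
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
round(x2_yates, 4)  # about 8.34, smaller (more conservative) than the uncorrected 9.9271
pchisq(x2_yates, df = length(observed) - 1, lower.tail = FALSE)  # p-value
```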
After finding a significant result in your omnibus/overall chi square test, you can identify where the differences lie by requesting post-hoc comparisons with the post = T option in chi.sq:
## Call:
## chi.sq(df = data1, var1 = vs, var2 = am, post = T)
##
## Pearson's Chi-squared test:
##
## χ² Critical χ² df p-value
## 0.90688 3.84100 1 0.3409
##
##
## Post-Hoc Test w/ Bonferroni Adjustment:
## Comparing: am-vs
##
## Standardized Residual (Z) p-value
## 0-0 0.9523 1
## 0-1 -0.9523 1
## 1-0 -0.9523 1
## 1-1 0.9523 1
And finally, we can see where the significantly different comparisons (between observed and expected) lie, using the Bonferroni-adjusted p-values returned by the chi.sq function. As seen above, none of the cell comparisons is significant (all adjusted p-values equal 1), consistent with the non-significant omnibus test.