gang) and their race
(race)?
Here, we’ll be working from the Defendants2025 data set to examine the relationship between whether a defendant has a gang charge (gang: a nominal variable measured as “yes” or “no”) and the defendant’s race (race: a nominal variable for each racial category).
The Chi Square test (\(X^2\)) examines the association between two nominal/ordinal variables to see whether the relationship observed in the sample reflects a true relationship that we could expect to find in the population. The test also tells us whether a category (attribute) of one variable varies by the categories of another variable.
The test is called the test of independence because it really tests the absence of association between (independence of) two variables.
The assumptions for the Chi Square are…
Cases (observations) are not related to or dependent upon each other. A case can’t have more than one attribute, and there are no ties between observations. Examine the data collection strategy to see if there are linkages between observations.
Because the Defendants2025 data have been randomly sampled, we have met the assumption of independence of observations.
When the assumption of normality is violated, Yates’ continuity correction can be applied with correct = TRUE.
In the vannstats package, I have included the tab function, which returns the crosstabs of observed and expected frequencies. To check whether you’ve met the assumption of normality (i.e. fewer than 20% of cells in the crosstab of expected frequencies fall below \(n=5\)), you use the following:
## $`Observed Frequencies`
##       race: asian black latine other white Total
## gang: no       68   315    673    34   344  1434
##       yes       2   120    144     0    38   304
##       Total    70   435    817    34   382  1738
##
## $`Expected Frequencies`
##       race:    asian     black   latine     other     white Total
## gang: no    57.75604 358.91254 674.0955 28.052934 315.18297  1434
##       yes   12.24396  76.08746 142.9045  5.947066  66.81703   304
##       Total 70.00000 435.00000 817.0000 34.000000 382.00000  1738
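As a rough cross-check on the crosstabs above, the expected frequencies can be rebuilt in base R from the observed counts alone. This is a minimal sketch, not the internals of the tab function: the observed matrix is typed in by hand from the output, and the object names (observed, expected, share_below_5) are my own.

```r
# Observed counts copied by hand from the Observed Frequencies crosstab
observed <- matrix(
  c(68, 315, 673, 34, 344,
     2, 120, 144,  0,  38),
  nrow = 2, byrow = TRUE,
  dimnames = list(gang = c("no", "yes"),
                  race = c("asian", "black", "latine", "other", "white"))
)

# Expected count for each cell = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Proportion of cells whose expected count falls below 5
share_below_5 <- mean(expected < 5)

print(round(expected, 3))
print(share_below_5)
```

If share_below_5 comes back under 0.20, the rule of thumb described above is satisfied.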
Here, we have met the assumption of normality: fewer than 20% of cells in the 2x5 Expected Frequencies crosstab have an expected count below 5. In fact, no cell has an expected count below 5.
The calculation for the Chi Square is:
\(X^2 = \sum \frac{(f_o - f_e)^2}{f_e}\) or \(X^2 = \sum \frac{(f_{o_i} - f_{e_i})^2}{f_{e_i}}\)
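where…
* \(f_o\) = the observed frequency in each cell of the crosstab
* \(f_e\) = the expected frequency in each cell of the crosstab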
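In addition, the degrees of freedom (\(df\)) for the test are…
* \(df = (r-1)(c-1)\)
where…
* \(r\) = the number of rows (categories of the first variable) in the crosstab
* \(c\) = the number of columns (categories of the second variable) in the crosstab
For the gang by race crosstab, \(df = (2-1)(5-1) = 4\).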
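To make the formula concrete, here is a small base R sketch that applies it cell by cell to the gang by race counts. It assumes the observed counts are entered by hand from the crosstab; the names chi_sq and deg_free are illustrative only and are not part of vannstats.

```r
# Observed counts copied by hand from the Observed Frequencies crosstab
observed <- matrix(
  c(68, 315, 673, 34, 344,
     2, 120, 144,  0,  38),
  nrow = 2, byrow = TRUE,
  dimnames = list(gang = c("no", "yes"),
                  race = c("asian", "black", "latine", "other", "white"))
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi Square: sum over all cells of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

print(round(chi_sq, 3))
print(deg_free)
```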
For Chi Square, within the chi.sq function, the dependent
variable is listed first and the independent variable is listed
second.
## Call:
## chi.sq(df = data1, var1 = gang, var2 = race)
##
## Pearson's Chi-squared test:
##
##     χ² Critical χ² df   p-value
## 63.385       9.488  4 5.632e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the output above, we see the \(X^2\)-obtained value (63.385), the degrees of freedom (4), and the p-value (5.632e-13, or .0000000000005632, which is less than our set alpha level of .05).
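If you want to double-check these values outside of vannstats, base R’s chisq.test and qchisq should give the same obtained value, critical value, degrees of freedom, and p-value. This is a sketch under the assumption that the observed counts are entered by hand from the crosstab and that alpha is .05; it is a cross-check, not a description of how chi.sq is implemented.

```r
# Observed counts copied by hand from the Observed Frequencies crosstab
observed <- matrix(
  c(68, 315, 673, 34, 344,
     2, 120, 144,  0,  38),
  nrow = 2, byrow = TRUE,
  dimnames = list(gang = c("no", "yes"),
                  race = c("asian", "black", "latine", "other", "white"))
)

# Pearson's Chi Square test without a continuity correction
result <- chisq.test(observed, correct = FALSE)

# Critical value at alpha = .05 for the test's degrees of freedom
alpha <- 0.05
critical <- qchisq(1 - alpha, df = unname(result$parameter))

print(result)
print(round(critical, 3))

# Reject the null hypothesis when the obtained value exceeds the critical value
print(unname(result$statistic) > critical)
```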
To interpret the findings, we report the following information:
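“Using the Chi Square test of independence (\(X^2\)), I reject/fail to reject the null hypothesis that there is no association between variable 1 and variable 2, in the population, \(X^2(?) = ?, p ? .05\)”
For this example: “Using the Chi Square test of independence (\(X^2\)), I reject the null hypothesis that there is no association between gang charge and race, in the population, \(X^2(4) = 63.385, p < .05\)”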
The calculation for Yates’ Chi Square is:
\(X^{2}_{Yates'} = \sum \frac{(|f_o - f_e| - 0.5)^2}{f_e}\) or \(X^{2}_{Yates'} = \sum \frac{(|f_{o_i} - f_{e_i}| - 0.5)^2}{f_{e_i}}\)
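The Yates formula can also be applied by hand in base R, which shows how subtracting 0.5 from each absolute difference shrinks the statistic. This is only a sketch of the formula as written above: base R’s chisq.test applies its continuity correction to 2x2 tables only, so the correction is computed manually here, and I am not claiming this is exactly what chi.sq does internally. The names pearson and yates are illustrative.

```r
# Observed counts copied by hand from the Observed Frequencies crosstab
observed <- matrix(
  c(68, 315, 673, 34, 344,
     2, 120, 144,  0,  38),
  nrow = 2, byrow = TRUE,
  dimnames = list(gang = c("no", "yes"),
                  race = c("asian", "black", "latine", "other", "white"))
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Uncorrected Pearson Chi Square, for comparison
pearson <- sum((observed - expected)^2 / expected)

# Yates' correction: subtract 0.5 from each absolute difference before squaring
yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

print(round(c(pearson = pearson, yates = yates), 3))
```

The corrected value is smaller than the uncorrected one, which makes the test more conservative.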
To employ Yates’ Continuity Correction when we have violated the assumption of normality, simply add the option correct = TRUE to the chi.sq call, like this: chi.sq(df = data1, var1 = gang, var2 = race, correct = TRUE)
After finding a significant result in your omnibus/overall chi square test, to identify where the differences lie, you can run a post-hoc significance test.
We can see where the significantly different comparisons (between observed and expected) are, in both table and visual (plot) format, using Bonferroni-adjusted p-values. These can be requested from the chi.sq function with the post and plot options, using the following:

## Call:
## chi.sq(df = data1, var1 = gang, var2 = race, post = T, plot = T)
##
## Pearson's Chi-squared test:
##
##     χ² Critical χ² df   p-value
## 63.385       9.488  4 5.632e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Post-Hoc Test w/ Bonferroni Adjustment:
## Comparing: race-gang
##
##            Standardized Residual (Z)   p-value
## asian-no                      3.2899 0.0100219 *
## asian-yes                    -3.2899 0.0100219 *
## black-no                     -6.4008 1.546e-09 ***
## black-yes                     6.4008 1.546e-09 ***
## latine-no                    -0.1386 1.0000000
## latine-yes                    0.1386 1.0000000
## other-no                      2.7114 0.0670021 .
## other-yes                    -2.7114 0.0670021 .
## white-no                      4.3939 0.0001113 ***
## white-yes                    -4.3939 0.0001113 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As seen above, asian defendants are more likely to not be given a gang charge (and less likely to be given a gang charge), black defendants are more likely to be given a gang charge (and less likely to not be given a gang charge), and white defendants are more likely to not be given a gang charge (and less likely to be given a gang charge). These are the cells where the difference between the observed and expected counts of being given (yes) and not being given (no) a gang charge is statistically significant.
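The post-hoc values reported above can be approximated in base R. This sketch assumes the residuals are the adjusted standardized residuals returned by chisq.test (its stdres component) and that the Bonferroni adjustment multiplies each two-sided p-value by the number of cells (10), capped at 1. Those assumptions reproduce the printed values, but they are my reading of the output rather than a statement of how chi.sq is written.

```r
# Observed counts copied by hand from the Observed Frequencies crosstab
observed <- matrix(
  c(68, 315, 673, 34, 344,
     2, 120, 144,  0,  38),
  nrow = 2, byrow = TRUE,
  dimnames = list(gang = c("no", "yes"),
                  race = c("asian", "black", "latine", "other", "white"))
)

result <- chisq.test(observed, correct = FALSE)

# Adjusted standardized residuals (Z), one per cell
z <- result$stdres

# Two-sided p-value for each residual
p_raw <- 2 * pnorm(-abs(z))

# Bonferroni adjustment: multiply by the number of cells, cap at 1
p_adj <- p_raw * length(z)
p_adj[p_adj > 1] <- 1

print(round(z, 4))
print(signif(p_adj, 4))

# Cells whose adjusted p-value falls below alpha = .05
print(p_adj < 0.05)
```

A positive residual means more cases were observed in that cell than expected, and a negative residual means fewer.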