

Univariate/Descriptive Statistics

There are various ways to call a number of univariate statistics in R. As social scientists, the main univariate statistics we are concerned with are the mean, median, standard deviation, minimum, maximum, and range. The stock R installation comes with the summary function, which, unfortunately, does not provide all of these measures. Therefore, we use the describe function from the psych package. We can call univariate statistics for both the full data set and a specific variable.

First, let’s load the packages as libraries

library(MASS)
library(psych)
library(vannstats)

And create the data1 object out of the Defendants2025 data.

data1 <- Defendants2025

For the full data set, we can call univariate statistics as such…

describe(data1)
##              vars    n     mean       sd   median  trimmed      mad     min   max    range skew
## id              1 1738   869.50   501.86   869.50   869.50   644.19    1.00  1738  1737.00 0.00
## age             2 1738    35.39    11.36    34.00    34.74    11.86   17.00    69    52.00 0.49
## race*           3 1738     3.13     1.14     3.00     3.09     1.48    1.00     5     4.00 0.48
## race_binary*    4 1738     1.22     0.41     1.00     1.15     0.00    1.00     2     1.00 1.35
## charge*         5 1738     3.29     1.57     4.00     3.23     1.48    1.00     6     5.00 0.02
## gang*           6 1738     1.17     0.38     1.00     1.09     0.00    1.00     2     1.00 1.71
## priors          7 1738     1.63     1.32     1.00     1.50     1.48    0.00     6     6.00 0.74
## gun*            8 1738     1.42     0.49     1.00     1.40     0.00    1.00     2     1.00 0.34
## risk_score      9 1738     4.51     2.78     4.35     4.42     3.41    0.01    10     9.99 0.20
## bail           10 1738 22572.21 16200.18 20000.00 21764.73 18532.50 1000.00 50000 49000.00 0.29
## perkins*       11 1738     1.32     0.47     1.00     1.28     0.00    1.00     2     1.00 0.76
##              kurtosis     se
## id              -1.20  12.04
## age             -0.36   0.27
## race*           -0.65   0.03
## race_binary*    -0.17   0.01
## charge*         -0.90   0.04
## gang*            0.92   0.01
## priors           0.34   0.03
## gun*            -1.89   0.01
## risk_score      -1.08   0.07
## bail            -1.14 388.59
## perkins*        -1.43   0.01

Whereas, for a specific variable, we can call univariate statistics as such…

describe(data1$risk_score)
##    vars    n mean   sd median trimmed  mad  min max range skew kurtosis   se
## X1    1 1738 4.51 2.78   4.35    4.42 3.41 0.01  10  9.99  0.2    -1.08 0.07

In addition, we can call univariate statistics for a variable but broken out by groups/categories of another variable. Note: this is the first step towards bivariate statistics (looking at the relationship between two variables). We do this by using the describeBy function, where we list the main variable first, and the grouping/category variable second… as such…

describeBy(data1$risk_score, data1$race)
## 
##  Descriptive statistics by group 
## group: asian
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis  se
## X1    1 70 3.14 2.54   2.36    2.77 1.44 0.04 9.22  9.18 1.22      0.3 0.3
## ------------------------------------------------------------------------------ 
## group: black
##    vars   n mean   sd median trimmed  mad  min max range skew kurtosis   se
## X1    1 435 4.82 2.82   4.74     4.8 3.53 0.02  10  9.98 0.03    -1.22 0.14
## ------------------------------------------------------------------------------ 
## group: latine
##    vars   n mean   sd median trimmed  mad  min max range skew kurtosis  se
## X1    1 817 4.82 2.81   4.96     4.8 3.31 0.01  10  9.99 0.01    -1.11 0.1
## ------------------------------------------------------------------------------ 
## group: other
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 34 3.69 2.64   3.44    3.45 2.71 0.01 9.98  9.97 0.68    -0.22 0.45
## ------------------------------------------------------------------------------ 
## group: white
##    vars   n mean   sd median trimmed mad  min  max range skew kurtosis   se
## X1    1 382  3.8 2.53   3.42    3.58 2.7 0.03 9.98  9.95 0.61    -0.36 0.13

Above, we can see that the mean risk score differs by the race of the defendant (e.g. black and latine individuals have, on average, higher risk scores).

A Note on Skewness and Kurtosis

Skewness is the measure of how close or far a distribution is from symmetry (the normal curve). While it summarizes clustering of scores along the X-axis, with regard to the position of the mode, median, and mean, skewness is also concerned with length/width of the tails of the distribution, relative to one another.

Kurtosis is sometimes referred to as a measure of the peakedness of the distribution, and how different the distribution is from mesokurtic (e.g. middle kurtosis, or the normal curve). Statisticians have argued that kurtosis is, more appropriately, a measure of the height/thickness of the tails of the distribution.

Skewness ranges from \(-\infty\) to \(\infty\). The sign indicates the type of skew, with \(-\) indicating negative skewness, \(+\) indicating positive skewness, and 0 indicating no skew… (AKA symmetry, AKA the normal curve). The cutoffs for skewness are as follows:

High: \(x \geq |1|\)

Moderate: \(|1| \gt x \geq |.5|\)

Low: \(|.5| \gt x \geq 0\)

Statisticians have developed a kurtosis measure that represents excess kurtosis beyond the normal curve (although typical kurtosis ranges from 1 to \(+ \infty\)). This excess kurtosis measure ranges from \(-2\) to \(+ \infty\). Using this metric, negative values represent platykurtic distributions and positive values indicate leptokurtic distributions. Distributions close to a kurtosis value of 0 are considered mesokurtic. We use cutoffs to indicate types of kurtosis, as follows…

Platykurtic: \(-2 \leq x \lt 0\); or \(x \lt 0\)

Mesokurtic: \(x \approx 0\)

Leptokurtic: \(+ \infty \geq x \gt 0\); or \(x \gt 0\)
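Under the hood, these are moment-based measures, and they can be computed directly in base R. Below is a minimal sketch (the skew_kurt function name is ours, not from psych; psych offers several estimator types, so its describe output can differ slightly from this version):

```r
# A minimal sketch of moment-based skewness and excess kurtosis in base R.
# psych::describe uses a similar moment approach, though exact values can
# differ slightly depending on which estimator type is used.
skew_kurt <- function(x) {
  x <- x[!is.na(x)]                  # drop missing cases
  n <- length(x)
  m <- mean(x)
  s <- sqrt(sum((x - m)^2) / n)      # population-style SD (divide by n)
  skewness <- sum((x - m)^3) / (n * s^3)
  excess_kurtosis <- sum((x - m)^4) / (n * s^4) - 3  # subtract 3 for "excess"
  c(skew = skewness, kurtosis = excess_kurtosis)
}

skew_kurt(c(1, 2, 3, 4, 5))   # symmetric values: skew of 0, negative (platykurtic) kurtosis
```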

Calling Specific Univariate Statistics

Beyond using the describe function, you can call singular desired univariate statistics. Here, we’ll ask for a specific univariate statistic, one at a time, for the risk_score variable.

Below, we’ve added the option , na.rm=T (alternatively, , na.rm=TRUE). This tells R that if data or observations are missing (NA) for the variable we’re working with, it should still calculate the statistic for the non-missing cases by removing those missing cases (NAs).
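To see what na.rm does, consider a small hypothetical vector with one missing value:

```r
# A small illustration of na.rm, using a hypothetical vector with an NA
x <- c(2, 4, NA, 6)

mean(x)               # returns NA, because one case is missing
mean(x, na.rm = TRUE) # returns 4, the mean of the non-missing cases
```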

mean(data1$risk_score, na.rm=T)
## [1] 4.506024
median(data1$risk_score, na.rm=T)
## [1] 4.35
sd(data1$risk_score, na.rm=T)
## [1] 2.784066
min(data1$risk_score, na.rm=T)
## [1] 0.01
max(data1$risk_score, na.rm=T)
## [1] 10
range(data1$risk_score, na.rm=T)
## [1]  0.01 10.00

Standardized Scores (Z-Scores) and Interval Estimates around Means (CI)

Z-Score

Recall that Z-scores are standardized scores – how close or far an observation’s score is from the mean, in standard deviation units. These are relative scores because the standard deviation (as well as the mean) incorporates information from all other observations.

The calculation for Z-scores is:

\(Z = \frac{(X - \mu)}{\sigma}\)

But, the above calculation relies on the population parameters \(\mu\) (the population mean) and \(\sigma\) (the population standard deviation), which we often do not have information on. Instead, the calculation for each observation’s Z-score is:

\(Z_{i} = \frac{(X_{i} - \bar{X})}{SD}\)

where…

  • \(X_{i}\) is the raw score for a given observation
  • \(\bar{X}\) is the mean for all observations
  • \(SD\) is the standard deviation for all observations
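The formula translates directly into base R. Here is a minimal sketch with hypothetical scores (note that R’s sd() uses the sample, n − 1, denominator):

```r
# A sketch of the Z-score formula by hand, using a hypothetical vector of scores
scores <- c(2, 4, 6, 8, 10)
raw <- 8                                 # the raw score of interest

z <- (raw - mean(scores)) / sd(scores)   # (X_i - X-bar) / SD
z                                        # positive: raw score is above the mean
```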

For example, if we wanted to calculate a Z-score for an individual who received a risk score of 5, relative to all other defendants in the Defendants2025 data set, we would use the formula:

\(Z_{Individual} = \frac{(5 - \bar{X}_{Defendants2025\$risk_score})}{SD_{Defendants2025\$risk_score}}\)

Luckily, I’ve created a Z-score calculation function, z.calc(), for calculating a z-score for a value, given the mean and standard deviation for a variable within a data frame (or a list of values). The function takes three arguments:

  1. the data frame (or a values list)
  2. variable name for the variable of interest
  3. raw score

as such…

z.calc(data1, risk_score, 5)
## Raw Score      Mean   Z Score 
## 5.0000000 4.5060242 0.1774297

This indicates that the risk score for the individual is .177 standard deviation units above the mean.

Confidence Intervals

Recall that confidence intervals provide a range of values around the sample mean within which, at a given level of confidence, we expect the true population mean to fall.

The calculation for confidence intervals is:

\(CI = \mu \pm Z {\sigma}\)

…or, more appropriately, the calculation for a given confidence interval, based on a given confidence level (CL) is:

\(CI_{CL} = \mu \pm Z_{CL}{\sigma}\)

where…

  • \(CL\) represents the Confidence Level you’re interested in using (e.g. 95%, 99%, 99.9%, etc.)
  • \(Z_{CL}\) represents the Z-score associated with that Confidence Level you’re using (e.g. 95%, 99%, 99.9%, etc.)

For example, for the 99.9% CI, we would have an associated Z-score (\(Z_{CL}\)) of \(Z_{99.9} = 3.29\), such that, the CI calculation would be:

\(CI_{99.9} = \mu \pm 3.29{\sigma}\)

However, the above formula relies on \(\sigma\), which is an unknown population parameter – the standard deviation of the variable in the population. Our best guess of that population parameter is the standard deviation statistic from our sample, \(SD\), but our sample standard deviation is based on fewer cases than the standard deviation from the population. As such, we need to adjust the size of \(SD\) based on our sample size, creating a new value we can plug in in place of \(\sigma\), called the Standard Error of the Mean: \(SE = \frac{SD}{\sqrt N}\). Moreover, because we are relying on data from a sample, we also need to use the sample mean \(\bar{X}\) in place of the population mean \(\mu\). Thus, our new confidence interval calculation becomes:

\(CI_{CL} = \bar{X} \pm Z_{CL}{\frac{SD}{\sqrt N}}\)

or….

\(CI_{CL} = \bar{X} \pm Z_{CL}{SE}\)
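This formula, too, can be computed by hand in base R. A minimal sketch for a 95% CI with hypothetical values (qnorm(0.975) returns the two-tailed Z associated with the 95% confidence level, about 1.96):

```r
# A sketch of the 95% confidence interval formula by hand
x <- c(2, 4, 6, 8, 10)

se <- sd(x) / sqrt(length(x))   # standard error of the mean: SD / sqrt(N)
z95 <- qnorm(0.975)             # Z for the 95% CL (two-tailed), ~1.96

ci <- c(lower = mean(x) - z95 * se,
        upper = mean(x) + z95 * se)
ci
```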

Below, we’ve added a CI calculation function, ci.calc(), for calculating a confidence interval, for a given variable within a data frame (or a list of values), for a given confidence level.

  1. the data frame (or a values list)
  2. variable name for the variable of interest
  3. confidence level

as such…

ci.calc(data1, risk_score, 95)
##       Mean   CI lower   CI upper Std. Error 
## 4.50602417 4.37513291 4.63691542 0.06678125

Above, we see the mean for the risk_score variable (within the data1 data frame), the lower and upper bounds for the 95 percent confidence level, and the standard error.

Beyond this, we could read in a values list or a variable within a data frame using the dollar sign operator. However, when doing this, you should explicitly name the confidence level argument (cl=). For example, if we wanted to calculate the 99 percent confidence interval for the age variable from the data1 data frame, we could run it as such:

ci.calc(data1$age, cl=99)
##       Mean   CI lower   CI upper Std. Error 
## 35.3935558 34.6917239 36.0953877  0.2724503

Assessing Normality by Using Visualizations

Histograms

In addition, you can create a visual representation (plot) of univariate data using a histogram. For quickly plotting the histogram of one variable, and to overlay a normal curve on our histogram, we can use the hst function from the vannstats package.

The hst function for plotting one variable (e.g. not broken out by categories of another variable) takes two arguments:

  1. the data set name
  2. variable name for the variable of interest

as such…

hst(data1, risk_score) 

However, as we begin to move into analyzing bivariate relationships, we may find it necessary to visualize histograms by breaking them out by levels or categories of different variables.

To plot the histogram for risk score broken out by race, use the same hst function, from the vannstats package, and simply add a third (and even up to a fourth) argument:

  1. the data set name
  2. variable name for the variable of interest
  3. (first) grouping variable name
  4. (second) grouping variable name
hst(data1, risk_score, race)

Boxplots (Box-and-Whisker Plots)

Boxplots also provide a visual representation of the normality of a distribution. The boxplot has a box, a line through the box, two whiskers on either end of the box, and sometimes dots/points outside the whiskers. Below, we get a sense of what each part of the boxplot represents…

  • Bottom (or left end) whisker represents the lowest point that falls within 1.5x the size of the interquartile range (IQR) below the first quartile. If there is an actual data point at that value, then the bottom whisker falls at that point. However, if there is not an actual data point there, the bottom whisker is pulled inward to fall at the closest but less extreme data point.
  • Bottom (or left end) of the box represents the first quartile (the 25th percentile case)
  • Middle line (or dot) inside the box represents the median, also known as the second quartile (the 50th percentile case)
  • Top (or right end) of the box represents the third quartile (the 75th percentile case)
  • Top (or right end) whisker represents the highest point that falls within 1.5x the size of the interquartile range (IQR) above the third quartile. If there is an actual data point at that value, then the top whisker falls at that point. However, if there is not an actual data point there, the top whisker is pulled inward to fall at the closest but less extreme data point.
  • Outside dots represent outliers - extreme high or extreme low values for that variable.

To tell if a variable is normally-distributed using the box-and-whisker plot, generally, we want to see that there is some distance between the box and the end of the whiskers, that the box isn’t pushed too close to either whisker, that the median line (dot) is near the center of the box, and that there aren’t many outliers (dots) on the outside of the whiskers.
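The whisker "fences" described above can be computed by hand in base R. A minimal sketch with hypothetical data containing one extreme value:

```r
# A sketch of the whisker "fence" calculation: whiskers extend to the most
# extreme data points within 1.5 x IQR of the box; anything beyond is an outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 20)   # hypothetical data with one extreme value

q1 <- quantile(x, 0.25)              # first quartile (bottom of the box)
q3 <- quantile(x, 0.75)              # third quartile (top of the box)
iqr <- IQR(x)                        # interquartile range: q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers                             # 20 falls outside the upper fence
```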

To plot a boxplot, we use the box function, from the vannstats package. The function takes two arguments, if you do not want to break it out by values of another variable:

  1. the data set name
  2. variable name for the variable of interest
box(data1, risk_score)

Further, this function takes a maximum of four arguments:

  1. the data set name
  2. variable name for the variable of interest
  3. (first) grouping variable name
  4. (second) grouping variable name

To break the above boxplot out by categories of race, we can do the following…

box(data1, risk_score, race)

Normal Q-Q (Quantile-Quantile) Plots

Much as in the above, we want to assess whether or not our variable follows the normal distribution. As such, the quantile-quantile plot is a visual tool to help us figure out if the empirical distribution of our variable fits (or rather, comes from) a theoretical normal distribution.

We fit a plot of our data/variable (usually on the Y-axis) against "theoretical data" that should occur if the data came from a normal distribution (e.g. Expected Normal on the X-axis). If our data actually fit a normal curve, then the dots on the plot should follow a straight line, or be reasonably close to the line plotted.
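Base R offers a similar plot through qqnorm and qqline, which is useful to know if the vannstats package is unavailable. A minimal sketch with simulated normal data:

```r
# A base R sketch of a normal Q-Q plot: sample quantiles against theoretical ones
set.seed(123)            # make the simulated draw reproducible
x <- rnorm(200)          # hypothetical data drawn from a normal distribution

qqnorm(x)                # plot sample quantiles against expected normal quantiles
qqline(x)                # reference line; normally-distributed data hug this line
```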

Below, we can assess normality to determine whether our variable follows a normal distribution, using the qq function, from the vannstats package. The function takes two arguments, if you do not want to break it out by values of another variable:

  1. the data set name
  2. variable name for the variable of interest
qq(data1, risk_score)

This function also takes a maximum of four arguments:

  1. the data set name
  2. variable name for the variable of interest
  3. (first) grouping variable name
  4. (second) grouping variable name

As such, we can break this plot out by another grouping variable:

qq(data1, risk_score, race)

Working with Random, Normally-Distributed Data

R also has a number of functions that work to create random data. To create random, normally-distributed data, use the rnorm function, which takes a maximum of three arguments. It should look something like this rnorm(100,0,1), where the first number (here, 100) represents the number of cases or data points you want in your random normally-distributed data. The second argument/number (here 0) is the mean that you want your data to have. The third number/argument (here 1) is the standard deviation that you want your data to have.

Note that the rnorm function takes a maximum of three arguments – and a minimum of one argument (the number of cases/data points). The default settings for the rnorm function are a mean of 0 and a standard deviation of 1. This means that rnorm(100) and rnorm(100,0,1) will output similar means and standard deviations. Similar, not exactly the same, because these data are randomly generated, so the individual values will vary from run to run, but each draw will have a mean close to 0 and a standard deviation close to 1.
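Because rnorm draws are random, your output will differ from run to run (and from the output shown below). If you want a reproducible draw, set the random seed first with set.seed:

```r
# set.seed makes a random draw reproducible: the same seed yields the same values
set.seed(42)
a <- rnorm(100)

set.seed(42)
b <- rnorm(100)

identical(a, b)   # TRUE: same seed, same "random" draws
mean(a)           # close to 0, the default mean
```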

rnorm(100,0,1)
rnorm(100)

Obviously, you can alter the number of cases involved.

rnorm(10)
##  [1] -0.14488111 -0.57975057  0.02006747 -0.83625386 -0.35970658  0.44323770 -0.39154097  1.26166084
##  [9]  1.24540723  1.02743704

or…

rnorm(100)
##   [1]  0.032042533  0.847863299 -1.580563413 -0.415532577 -0.548261651 -2.329345394  0.222095920
##   [8]  1.096984229 -0.542043002  0.428050941 -0.945893513  0.755656728 -1.937079465 -2.181363692
##  [15]  0.752385396  0.429976679 -0.018578176  0.793694827  1.354059310  0.299377437  0.001717725
##  [22] -2.571636836 -0.668411799  0.811501261 -2.149940088 -0.301154686  0.401072666 -0.732361816
##  [29]  1.389551413  0.114179416  1.196188594  1.411471956 -0.198398512  1.353542415 -0.724306341
##  [36] -0.372137077  1.169337804 -0.179329127  0.675750985  1.638190139 -1.215783126  0.495680843
##  [43]  0.413418443  0.084317808  0.167262586  1.628998686  0.419778218  2.009996710  1.603152718
##  [50] -1.405103936 -1.299849854 -0.667499969  0.852673359 -0.196380432  1.470803060 -0.578263775
##  [57]  0.112010163 -0.715055014 -0.005484123 -0.502836273  0.857198447  0.882938178 -0.271313735
##  [64] -0.109952075  0.538547689  0.937061266 -0.663090377  0.322027874  2.028634165  0.675625418
##  [71]  1.062971419  0.932352740  0.040927013  0.223922723  1.292370160  1.298917213  0.571497014
##  [78] -0.086988446 -1.784697810 -0.530940903 -0.055134935  1.135993128  1.438840663  0.073921430
##  [85] -0.354285516 -0.484101990 -1.454350516  1.021180641  1.200366145 -0.614246502  1.691440064
##  [92]  0.939101659  0.691172256 -0.447739225  0.090860029 -0.197322605 -1.521370910  0.750077531
##  [99] -0.082076557  1.104979577

or even…

rnorm(1000)

(output suppressed)

You can also use the assignment operator <- to assign the values of the rnorm function to an object:

ten<-rnorm(10)
hundred<-rnorm(100)
thousand<-rnorm(1000)

Then you can run univariate statistics on those data, and even create a histogram for the data:

mean(thousand)
## [1] -0.001852478
median(thousand)
## [1] -0.00798082
sd(thousand)
## [1] 0.9826502
min(thousand)
## [1] -2.671512
max(thousand)
## [1] 3.412189
range<-max(thousand)-min(thousand)
range
## [1] 6.083701

Finally, you can plot the histograms to see how they differ when altering the number of cases:

hst(ten)

hst(hundred)

hst(thousand)

So now you know that the more cases/data points you have, the more your data will mimic the normal distribution (bell curve).