There are various ways to call a number of univariate statistics in
R. As social scientists, the main univariate statistics we are concerned
with are the mean, median, standard deviation, minimum, maximum, and
range. The stock R program comes with the summary function, which,
unfortunately, does not provide all of these measures. Therefore,
we use the describe
function from the psych
package. We can call univariate statistics for both the full data set
and a specific variable.
First, let’s load the packages as libraries and create the data1 object out of the Defendants2025 data.
For the full data set, we can call univariate statistics as such…
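Assuming the psych package is installed, the call is simply describe() on the data frame. A minimal self-contained sketch, using a small hypothetical data frame standing in for Defendants2025:

```r
# Load psych for the describe() function (assumes the package is installed)
library(psych)

# Toy stand-in for the Defendants2025 data (hypothetical values)
data1 <- data.frame(
  age        = c(22, 35, 41, 29, 58, 33),
  risk_score = c(3.2, 6.1, 4.5, 2.0, 7.8, 5.5)
)

# Univariate statistics for every variable in the data frame
describe(data1)
```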
## vars n mean sd median trimmed mad min max range skew
## id 1 1738 869.50 501.86 869.50 869.50 644.19 1.00 1738 1737.00 0.00
## age 2 1738 35.39 11.36 34.00 34.74 11.86 17.00 69 52.00 0.49
## race* 3 1738 3.13 1.14 3.00 3.09 1.48 1.00 5 4.00 0.48
## race_binary* 4 1738 1.22 0.41 1.00 1.15 0.00 1.00 2 1.00 1.35
## charge* 5 1738 3.29 1.57 4.00 3.23 1.48 1.00 6 5.00 0.02
## gang* 6 1738 1.17 0.38 1.00 1.09 0.00 1.00 2 1.00 1.71
## priors 7 1738 1.63 1.32 1.00 1.50 1.48 0.00 6 6.00 0.74
## gun* 8 1738 1.42 0.49 1.00 1.40 0.00 1.00 2 1.00 0.34
## risk_score 9 1738 4.51 2.78 4.35 4.42 3.41 0.01 10 9.99 0.20
## bail 10 1738 22572.21 16200.18 20000.00 21764.73 18532.50 1000.00 50000 49000.00 0.29
## perkins* 11 1738 1.32 0.47 1.00 1.28 0.00 1.00 2 1.00 0.76
## kurtosis se
## id -1.20 12.04
## age -0.36 0.27
## race* -0.65 0.03
## race_binary* -0.17 0.01
## charge* -0.90 0.04
## gang* 0.92 0.01
## priors 0.34 0.03
## gun* -1.89 0.01
## risk_score -1.08 0.07
## bail -1.14 388.59
## perkins* -1.43 0.01
Whereas, for a specific variable, we can call univariate statistics as such…
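For a single variable, the same describe() function is called on the column (via the dollar sign operator in the document; here sketched with a toy vector):

```r
library(psych)  # for describe(); assumes the package is installed

# Toy stand-in for data1$risk_score (hypothetical values)
risk_score <- c(3.2, 6.1, 4.5, 2.0, 7.8, 5.5)

# Univariate statistics for one variable only
describe(risk_score)
```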
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1738 4.51 2.78 4.35 4.42 3.41 0.01 10 9.99 0.2 -1.08 0.07
In addition, we can call univariate statistics for a variable but
broken out by groups/categories of another variable. Note, this is the
first step towards bivariate statistics (looking at the relationship
between two variables). We do this by using the describeBy function, where we
list the main variable first, and the grouping/category variable second, as such…
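The call takes the form describeBy(main variable, grouping variable). A self-contained sketch with hypothetical toy data (the real output below uses the full race variable):

```r
library(psych)  # for describeBy(); assumes the package is installed

# Toy data: a numeric variable and a grouping variable (hypothetical values)
risk_score <- c(3.2, 6.1, 4.5, 2.0, 7.8, 5.5)
race       <- c("black", "white", "black", "white", "black", "white")

# Main variable first, grouping/category variable second
describeBy(risk_score, race)
```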
##
## Descriptive statistics by group
## group: asian
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 70 3.14 2.54 2.36 2.77 1.44 0.04 9.22 9.18 1.22 0.3 0.3
## ------------------------------------------------------------------------------
## group: black
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 435 4.82 2.82 4.74 4.8 3.53 0.02 10 9.98 0.03 -1.22 0.14
## ------------------------------------------------------------------------------
## group: latine
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 817 4.82 2.81 4.96 4.8 3.31 0.01 10 9.99 0.01 -1.11 0.1
## ------------------------------------------------------------------------------
## group: other
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 34 3.69 2.64 3.44 3.45 2.71 0.01 9.98 9.97 0.68 -0.22 0.45
## ------------------------------------------------------------------------------
## group: white
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 382 3.8 2.53 3.42 3.58 2.7 0.03 9.98 9.95 0.61 -0.36 0.13
Above, we can see that the mean risk score differs by the race of the defendant (e.g. black and latine individuals have, on average, higher risk scores).
Skewness is the measure of how close or far a distribution is from symmetry (the normal curve). While it summarizes clustering of scores along the X-axis, with regard to the position of the mode, median, and mean, skewness is also concerned with length/width of the tails of the distribution, relative to one another.
Kurtosis is sometimes referred to as a measure of the peakedness of the distribution, and how different the distribution is from mesokurtic (e.g. middle kurtosis, or the normal curve). Statisticians have argued that kurtosis is, more appropriately, a measure of the height/thickness of the tails of the distribution.
Skewness ranges from \(-\infty\) to \(\infty\). The sign indicates the type of skew, with \(-\) indicating negative skewness, \(+\) indicating positive skewness, and 0 indicating no skew… (AKA symmetry, AKA the normal curve). The cutoffs for skewness are as follows:
High: \(|x| \geq 1\)
Moderate: \(.5 \leq |x| \lt 1\)
Low: \(0 \leq |x| \lt .5\)
Statisticians have developed a kurtosis measure that represents excess kurtosis beyond the normal curve (although typical kurtosis ranges from 1 to \(+ \infty\)). This excess kurtosis measure ranges from \(-2\) to \(+ \infty\). Using this metric, negative values represent platykurtic distributions and positive values indicate leptokurtic distributions. Distributions close to a kurtosis value of 0 are considered mesokurtic. We use cutoffs to indicate types of kurtosis, as follows…
Platykurtic: \(-2 \leq x \lt 0\); or \(x \lt 0\)
Mesokurtic: \(x \approx 0\)
Leptokurtic: \(+ \infty \geq x \gt 0\); or \(x \gt 0\)
Beyond using the describe function, you can call
singular desired univariate statistics. Here, we’ll ask for a specific
univariate statistic, one at a time, for the risk_score variable.
Below, we’ve added the option na.rm=T (alternatively, na.rm=TRUE), meaning that if
data or observations are missing (NA) for the variables we’re working
with, we still want R to calculate the statistic for the non-missing
cases by removing those missing cases (NAs).
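In the document these are called on data1$risk_score; a self-contained sketch with a toy vector containing a missing value (the outputs below come from the full data):

```r
# Toy vector with a missing value (stands in for data1$risk_score)
x <- c(2.5, 4.0, NA, 6.5, 7.0)

mean(x, na.rm = TRUE)    # mean of the non-missing cases
median(x, na.rm = TRUE)
sd(x, na.rm = TRUE)
min(x, na.rm = TRUE)
max(x, na.rm = TRUE)
range(x, na.rm = TRUE)   # returns the minimum and maximum together
```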
## [1] 4.506024
## [1] 4.35
## [1] 2.784066
## [1] 0.01
## [1] 10
## [1] 0.01 10.00
Recall that Z-scores are standardized scores – how close or far an observation’s score is from the mean, in standard deviation units. These are relative scores because the standard deviation (as well as the mean) incorporates information from all other observations.
The calculation for Z-scores is:
\(Z = \frac{(X - \mu)}{\sigma}\)
But the above calculation relies on the population parameters \(\mu\) (the population mean) and \(\sigma\) (the population standard deviation), which we often do not have information on. Instead, the calculation for each observation’s Z-score is:
\(Z_{i} = \frac{(X_{i} - \bar{X})}{SD}\)
where \(X_{i}\) is the observation’s raw score, \(\bar{X}\) is the sample mean, and \(SD\) is the sample standard deviation…
For example, if we wanted to calculate a Z-score for an individual who received a risk score of 5, relative to all other defendants in the Defendants2025 data set, we would use the formula:
\(Z_{Individual} = \frac{(5 - \bar{X}_{Defendants2025\$risk_score})}{SD_{Defendants2025\$risk_score}}\)
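In base R, that formula translates directly; a sketch with hypothetical toy scores standing in for the risk_score variable:

```r
# Toy risk scores (hypothetical stand-in for Defendants2025$risk_score)
scores <- c(2.1, 4.8, 6.3, 3.9, 5.4)

# Z-score for an individual raw score of 5:
# distance from the sample mean, in standard deviation units
z <- (5 - mean(scores)) / sd(scores)
z
```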
Luckily, I’ve created a Z-score calculation function, z.calc(), for calculating a
z-score for a value, given the mean and standard deviation for a
variable within a data frame (or a list of values).
The z.calc function takes three arguments:
data frame (or a values list)
variable name for the variable of interest
raw score
as such…
## Raw Score Mean Z Score
## 5.0000000 4.5060242 0.1774297
This indicates that the risk score for the individual is .177 standard deviation units above the mean.
As noted above, Z-scores are standardized scores: how close or far an observation’s score is from the mean, in standard deviation units. Confidence intervals build on this logic, pairing a Z-score with a chosen confidence level.
The calculation for confidence intervals is:
\(CI = \mu \pm Z {\sigma}\)
…or, more appropriately, the calculation for a given confidence interval, based on a given confidence level (CL) is:
\(CI_{CL} = \mu \pm Z_{CL}{\sigma}\)
where \(\mu\) is the population mean, \(\sigma\) is the population standard deviation, and \(Z_{CL}\) is the Z-score associated with the given confidence level…
For example, for the 99.9% CI, we would have an associated Z-score (\(Z_{CL}\)) of \(Z_{99.9} = 3.29\), such that, the CI calculation would be:
\(CI_{99.9} = \mu \pm 3.29{\sigma}\)
However, the above formula relies on \(\sigma\), an unknown population parameter: the standard deviation of the variable in the population. Our best guess of that parameter is the standard deviation statistic from our sample, \(SD\), but our sample standard deviation is based on fewer cases than the standard deviation from the population. As such, we need to adjust the size of \(SD\) based on our sample size, creating a new value we can plug in in place of \(\sigma\), called the Standard Error of the Mean: \(SE = \frac{SD}{\sqrt N}\). Moreover, because we are relying on data from a sample, we also need to use the sample mean \(\bar{X}\) in place of the population mean \(\mu\). Thus, our new confidence interval calculation becomes:
\(CI_{CL} = \bar{X} \pm Z_{CL}{\frac{SD}{\sqrt N}}\)
or….
\(CI_{CL} = \bar{X} \pm Z_{CL}{SE}\)
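That formula can be computed by hand in base R; a sketch with hypothetical toy values, using the familiar Z of 1.96 for the 95% confidence level:

```r
# Toy values (hypothetical stand-in for a variable in a data frame)
x  <- c(3.1, 4.7, 5.2, 2.8, 6.0, 4.4, 3.9, 5.5)
n  <- length(x)
se <- sd(x) / sqrt(n)          # standard error of the mean

# 95% confidence interval: Z for the 95% level is 1.96
lower <- mean(x) - 1.96 * se
upper <- mean(x) + 1.96 * se
c(mean = mean(x), lower = lower, upper = upper, se = se)
```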
Below, we’ve added a CI calculation function, ci.calc(), for calculating a
confidence interval, for a given variable within a data frame (or a list
of values), for a given confidence level.
The ci.calc function takes three arguments:
data frame (or a values list)
variable name for the variable of interest
confidence level
as such…
## Mean CI lower CI upper Std. Error
## 4.50602417 4.37513291 4.63691542 0.06678125
Above, we see the mean for the risk_score variable
(within the data1 data frame), the lower and upper bounds
for the 95 percent confidence level, and the standard error.
Beyond this, we could read in a values list or a variable within a
data frame using the dollar sign operator. However, when doing this, you
should explicitly specify the confidence level argument. For example,
if we wanted to calculate the 99 percent confidence interval for the
age variable from the data1 data frame, then
you could also run it as such:
## Mean CI lower CI upper Std. Error
## 35.3935558 34.6917239 36.0953877 0.2724503
In addition, you can create a visual representation (plot) of
univariate data using a histogram. For quickly plotting the histogram of
one variable, and to overlay a normal curve on our histogram, we can use
the hst function from the
vannstats package.
The hst function for
plotting one variable (e.g. not broken out by categories of another
variable) takes two arguments:
data set name
variable name for the variable of interest
as such…
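With vannstats loaded, the call would presumably take the form hst(data1, risk_score). If you only have base R available, a sketch of the same idea (a density histogram with an overlaid normal curve, using hypothetical toy data) looks like this:

```r
# Base-R sketch of what hst() produces: a density histogram with a
# normal curve overlaid (toy data standing in for risk_score)
set.seed(1)
x <- rnorm(500, mean = 4.5, sd = 2.8)
m <- mean(x)   # compute before curve(), where x means the plotting grid
s <- sd(x)
hist(x, freq = FALSE, main = "risk_score", xlab = "risk_score")
curve(dnorm(x, mean = m, sd = s), add = TRUE)
```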

However, as we begin to move into analyzing bivariate relationships, we may find it necessary to visualize histograms by breaking them out by levels or categories of different variables.
To plot the histogram for risk score broken out by race, use the same
hst function, from the
vannstats package, and simply add a third (and even up to a
fourth) argument:
data set name
variable name for the variable of interest
(first) grouping variable name
(second) grouping variable name
Boxplots also provide a visual representation of the normality of a distribution. The boxplot has a box, a line through the box, two whiskers on either end of the box, and sometimes dots/points outside the whiskers. Below, we get a sense of what each part of the boxplot represents…
To tell if a variable is normally-distributed using the box-and-whisker plot, generally, we want to see that there is some distance between the box and the end of the whiskers, that the box isn’t pushed too close to either whisker, that the median line (dot) is near the center of the box, and that there aren’t many outliers (dots) on the outside of the whiskers.
To plot a boxplot, we use the box function, from the
vannstats package. The function takes two arguments, if you
do not want to break it out by values of another variable:
data set name
variable name for the variable of interest
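The box() call above takes the data set name and variable name; a base-R sketch of the same plot, using boxplot() on hypothetical toy data:

```r
# Base-R boxplot sketch (toy values standing in for risk_score)
risk_score <- c(2.1, 4.8, 6.3, 3.9, 5.4, 7.2, 1.5, 4.0)
boxplot(risk_score, main = "risk_score")
```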
Further, this function takes a maximum of four arguments:
data set name
variable name for the variable of interest
(first) grouping variable name
(second) grouping variable name
To break the above boxplot out by categories of race, we can do the following…

Much as in the above, we want to assess whether or not our variable follows the normal distribution. As such, the quantile-quantile plot is a visual tool to help us figure out if the empirical distribution of our variable fits (or rather, comes from) a theoretical normal distribution.
We fit a plot of our data/variable (usually on the Y-axis) against “theoretical data” that should occur if the data came from a normal distribution (e.g. Expected Normal on the X-axis). If our data actually fit a normal curve, then the dots on the plot should follow a straight line, or be reasonably close to the line plotted.
Below, we can assess normality to determine whether our variable
follows a normal distribution, using the qq function, from the
vannstats package. The function takes two arguments, if you
do not want to break it out by values of another variable:
data set name
variable name for the variable of interest
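Base R offers the same kind of plot through qqnorm() and qqline(); a sketch with hypothetical toy data:

```r
# Base-R quantile-quantile plot sketch: sample quantiles (Y-axis)
# against theoretical normal quantiles (X-axis), with a reference line
set.seed(1)
x <- rnorm(200)   # toy data standing in for the variable of interest
qqnorm(x)         # the dots
qqline(x)         # the straight reference line
```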
This function also takes a maximum of four arguments:
data set name
variable name for the variable of interest
(first) grouping variable name
(second) grouping variable name
As such, we can break this plot out by another grouping variable:

R also has a number of functions that work to create random data. To
create random, normally-distributed data, use the rnorm function, which takes a
maximum of three arguments. It should look something like this: rnorm(100,0,1), where the first
number (here, 100)
represents the number of cases or data points you want in your random
normally-distributed data. The second argument/number (here 0) is the mean that you want your
data to have. The third number/argument (here 1) is the standard deviation that
you want your data to have.
Note that the rnorm
function takes a maximum of three arguments – and it takes a minimum of
one argument (the number of cases/data points). The default settings for
the rnorm function are a mean of 0 and a standard deviation of 1. This means that rnorm(100) and rnorm(100,0,1) will output
similar means and standard deviations. Similar, not the exact same,
because these data are randomly generated, so the computed mean and
standard deviation will vary from sample to sample, hovering around 0
and 1, respectively.
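The equivalence of the two calls can be checked directly (set.seed is added here only to make the random draws reproducible):

```r
set.seed(42)           # for reproducible random draws
a <- rnorm(100)        # defaults: mean = 0, sd = 1
b <- rnorm(100, 0, 1)  # identical settings, spelled out
c(mean(a), sd(a))
c(mean(b), sd(b))      # similar, but not identical, to a
```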
Obviously, you can alter the number of cases involved.
## [1] -0.14488111 -0.57975057 0.02006747 -0.83625386 -0.35970658 0.44323770 -0.39154097 1.26166084
## [9] 1.24540723 1.02743704
or…
## [1] 0.032042533 0.847863299 -1.580563413 -0.415532577 -0.548261651 -2.329345394 0.222095920
## [8] 1.096984229 -0.542043002 0.428050941 -0.945893513 0.755656728 -1.937079465 -2.181363692
## [15] 0.752385396 0.429976679 -0.018578176 0.793694827 1.354059310 0.299377437 0.001717725
## [22] -2.571636836 -0.668411799 0.811501261 -2.149940088 -0.301154686 0.401072666 -0.732361816
## [29] 1.389551413 0.114179416 1.196188594 1.411471956 -0.198398512 1.353542415 -0.724306341
## [36] -0.372137077 1.169337804 -0.179329127 0.675750985 1.638190139 -1.215783126 0.495680843
## [43] 0.413418443 0.084317808 0.167262586 1.628998686 0.419778218 2.009996710 1.603152718
## [50] -1.405103936 -1.299849854 -0.667499969 0.852673359 -0.196380432 1.470803060 -0.578263775
## [57] 0.112010163 -0.715055014 -0.005484123 -0.502836273 0.857198447 0.882938178 -0.271313735
## [64] -0.109952075 0.538547689 0.937061266 -0.663090377 0.322027874 2.028634165 0.675625418
## [71] 1.062971419 0.932352740 0.040927013 0.223922723 1.292370160 1.298917213 0.571497014
## [78] -0.086988446 -1.784697810 -0.530940903 -0.055134935 1.135993128 1.438840663 0.073921430
## [85] -0.354285516 -0.484101990 -1.454350516 1.021180641 1.200366145 -0.614246502 1.691440064
## [92] 0.939101659 0.691172256 -0.447739225 0.090860029 -0.197322605 -1.521370910 0.750077531
## [99] -0.082076557 1.104979577
or even…
You can also use the assignment operator <- to assign the values of the
rnorm function to an
object:
Then you can run univariate statistics on those data, and even create a histogram for the data:
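The calls producing output like the lines below might look like this (the object name my.data and the seed are hypothetical; note that max minus min gives the range as a single number, whereas range() returns both endpoints):

```r
set.seed(7)
my.data <- rnorm(100000)      # hypothetical object of random normal data

mean(my.data)
median(my.data)
sd(my.data)
min(my.data)
max(my.data)
max(my.data) - min(my.data)   # the width of the range, as one number
hist(my.data)
```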
## [1] -0.001852478
## [1] -0.00798082
## [1] 0.9826502
## [1] -2.671512
## [1] 3.412189
## [1] 6.083701