The OLS regression examines the predictive relationship between some independent variable(s) and an interval-ratio dependent variable. The test tells us about the effect (slope) of any independent (X) variable on an interval-ratio dependent (Y) variable. In particular, the regression equation looks at how values of an X variable “predict” a specific Y value.
For this example, the OLS regression works well because we’re looking
at how variation in a person’s graduate school GPA (gpa, an
interval-ratio variable ranging from 2.5 to 4.3) can be
predicted/explained by variation in four other pre-graduate school
variables on which students were assessed:
ar, an interval-ratio variable ranging from 2.5 to 5.0; grev, an interval-ratio variable ranging from 480 to 720; greq, an interval-ratio variable ranging from 500 to 655; and mat, an interval-ratio variable ranging from 55 to 85.
In total, we have 60 individuals. Their respective scores are listed below:

GPA: 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3, 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3, 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3, 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3
Average Recommender Rating: 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0, 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0, 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0, 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0.
GRE Verbal Score: 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610, 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610, 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610, 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610.
GRE Quantitative Score: 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600, 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600, 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600, 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600.
Miller Analogies Test Score: 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85, 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85, 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85, 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85.
As in the Intro to
R vignette, we can create an object out of a list of numbers using
the concatenate c
function.
Knowing that we have five variables, we have to read in the variables separately (listing the values for each observation). To do so, we can use the following code:
gpa <- c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3, 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3, 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3, 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3)
ar <- c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0, 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0, 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0, 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)
grev <- c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610, 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610, 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610, 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610)
greq <- c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600, 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600, 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600, 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600)
mat <- c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85, 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85, 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85, 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85)
The first number in each list corresponds with the first observation, the second number with the second observation, and so on. For example, the first observation in the gpa list is 3.2, which corresponds with the first observation in the mat list, 65.
Next, to appropriately prepare the data for analysis, we have to
merge these five lists. To merge, as in the Intro to R vignette, we
can use the data.frame
function.
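As a runnable sketch of the merge: note that each variable’s 15 distinct scores repeat four times across the 60 students, so rep() is used here to keep the chunk short while reproducing the full vectors read in above.

```r
# Each vector's 15 distinct scores repeat four times across the 60 students,
# so rep(..., times = 4) reproduces the full 60-observation vectors
gpa  <- rep(c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3), times = 4)
ar   <- rep(c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0), times = 4)
grev <- rep(c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610), times = 4)
greq <- rep(c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600), times = 4)
mat  <- rep(c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85), times = 4)

# Merge the five vectors into one data frame, one row per student
data <- data.frame(gpa, ar, grev, greq, mat)
```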
Now we can call the data…
## gpa ar grev greq mat
## 1 3.2 2.7 540 625 65
## 2 4.1 4.5 680 575 75
## 3 3.0 2.5 480 520 65
## 4 2.6 3.1 520 545 55
## 5 3.7 3.6 490 520 75
## 6 4.0 4.3 535 655 65
## 7 4.3 4.6 720 630 75
## 8 2.7 3.0 500 500 75
## 9 3.6 4.7 575 605 65
## 10 4.1 3.4 690 555 75
## 11 2.7 3.7 545 505 55
## 12 2.9 2.6 515 540 55
## 13 2.5 3.1 520 520 55
## 14 3.0 2.7 710 585 65
## 15 3.3 5.0 610 600 85
## 16 3.2 2.7 540 625 65
## 17 4.1 4.5 680 575 75
## 18 3.0 2.5 480 520 65
## 19 2.6 3.1 520 545 55
## 20 3.7 3.6 490 520 75
## 21 4.0 4.3 535 655 65
## 22 4.3 4.6 720 630 75
## 23 2.7 3.0 500 500 75
## 24 3.6 4.7 575 605 65
## 25 4.1 3.4 690 555 75
## 26 2.7 3.7 545 505 55
## 27 2.9 2.6 515 540 55
## 28 2.5 3.1 520 520 55
## 29 3.0 2.7 710 585 65
## 30 3.3 5.0 610 600 85
## 31 3.2 2.7 540 625 65
## 32 4.1 4.5 680 575 75
## 33 3.0 2.5 480 520 65
## 34 2.6 3.1 520 545 55
## 35 3.7 3.6 490 520 75
## 36 4.0 4.3 535 655 65
## 37 4.3 4.6 720 630 75
## 38 2.7 3.0 500 500 75
## 39 3.6 4.7 575 605 65
## 40 4.1 3.4 690 555 75
## 41 2.7 3.7 545 505 55
## 42 2.9 2.6 515 540 55
## 43 2.5 3.1 520 520 55
## 44 3.0 2.7 710 585 65
## 45 3.3 5.0 610 600 85
## 46 3.2 2.7 540 625 65
## 47 4.1 4.5 680 575 75
## 48 3.0 2.5 480 520 65
## 49 2.6 3.1 520 545 55
## 50 3.7 3.6 490 520 75
## 51 4.0 4.3 535 655 65
## 52 4.3 4.6 720 630 75
## 53 2.7 3.0 500 500 75
## 54 3.6 4.7 575 605 65
## 55 4.1 3.4 690 555 75
## 56 2.7 3.7 545 505 55
## 57 2.9 2.6 515 540 55
## 58 2.5 3.1 520 520 55
## 59 3.0 2.7 710 585 65
## 60 3.3 5.0 610 600 85
The assumptions for the regression are…
In addition, the previously discussed assumptions for other tests (independence of observations) are implied, since all of these tests require random samples. Beyond this, the OLS regression requires an interval-ratio outcome variable.
A common rule of thumb is \(N \geq 50 + 8k\), where \(k\) is the number of independent variables included in the regression model. With \(N = 60\) and \(k = 4\), the rule requires \(N \geq 82\); therefore, we have violated (not met) the assumption of adequate sample size.
In almost all cases, I would advise not proceeding with the regression model; however, given that this is an example, I will proceed.

To identify outliers, simply look at the boxplots for each variable in the model (Y and all Xs) to see how outlying these outliers are. In most cases, outliers should remain in the data; strong justification is needed for removing outlying cases.
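A sketch of that boxplot check in code (the data object is rebuilt here so the chunk runs on its own; the rep() shortcut relies on each variable’s 15 distinct scores repeating four times):

```r
# Rebuild the data frame (15 distinct scores per variable, repeated 4 times)
gpa  <- rep(c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3), times = 4)
ar   <- rep(c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0), times = 4)
grev <- rep(c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610), times = 4)
greq <- rep(c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600), times = 4)
mat  <- rep(c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85), times = 4)
data <- data.frame(gpa, ar, grev, greq, mat)

# One boxplot per variable in the model (Y and all Xs), side by side
par(mfrow = c(1, 5))
for (v in names(data)) boxplot(data[[v]], main = v)
par(mfrow = c(1, 1))  # reset the plotting layout
```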
[Boxplots for gpa, ar, grev, greq, and mat appear here.]
we have met the assumption of absence of outliers.
Interestingly, the boxplot for the GRE Verbal variable shows the median closer to the 25th percentile. Equally interesting, the boxplot for the Miller Analogies Test variable is missing a lower whisker, indicating that the lowest extreme case is similar to (or the same as) the 25th percentile case.

Multicollinearity: independent variables that are (more) highly correlated with one another (compared to their correlation with the DV).
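A correlation matrix like the one below can be produced with cor() (the data object is rebuilt here so the chunk is self-contained, with columns ordered to match the printed matrix):

```r
# Rebuild the data frame (15 distinct scores per variable, repeated 4 times)
gpa  <- rep(c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3), times = 4)
ar   <- rep(c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0), times = 4)
grev <- rep(c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610), times = 4)
mat  <- rep(c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85), times = 4)
greq <- rep(c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600), times = 4)
data <- data.frame(gpa, ar, grev, mat, greq)

# Pairwise correlations among all variables, rounded to two decimals
round(cor(data), 2)
```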
## gpa ar grev mat greq
## gpa 1
## ar 0.62 1
## grev 0.58 0.41 1
## mat 0.6 0.52 0.43 1
## greq 0.61 0.51 0.47 0.27 1
None of the correlations among the independent variables (ar, grev, mat, greq) is above a correlation coefficient of \(r \approx .90\). Therefore, we have met the assumption of absence of multicollinearity.

Singularity: when the independent variables included are (together) all possible subsets of a measure also included in the model.
we have met the assumption of absence of singularity.
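Linearity, normality, and homoskedasticity are typically assessed with a residuals-versus-fitted plot; a minimal sketch (the data object is rebuilt so the chunk runs on its own):

```r
# Rebuild the data frame (15 distinct scores per variable, repeated 4 times)
gpa  <- rep(c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3), times = 4)
ar   <- rep(c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0), times = 4)
grev <- rep(c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610), times = 4)
greq <- rep(c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600), times = 4)
mat  <- rep(c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85), times = 4)
data <- data.frame(gpa, ar, grev, greq, mat)

# Fit the model, then plot residuals against fitted values:
# look for curvature (linearity), even spread about zero (normality),
# and constant spread across fitted values (homoskedasticity)
fit <- lm(gpa ~ ar + grev + mat + greq, data = data)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # the zero-distance-from-Y-hat line
```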
we have met the assumptions of linearity, normality, and homoskedasticity.
Linearity is met given that the residuals do not exhibit a non-linear
(e.g. curvilinear) relationship about the 0 distance (from \(\hat{Y}\)) line. Normality is met given
that the residuals do not have a hard stop on either side of the line –
that is, they are evenly distributed about the 0 distance (from \(\hat{Y}\)) line. Finally, homoskedasticity
is met given that the residuals are evenly distanced from the 0 distance
(from \(\hat{Y}\)) line at all values
of \(\hat{Y}\) – as exemplified by the lack of “fanning out” on one end.

The calculation for the regression is:
\(\hat{Y} = b_0 + b_1X_1 + b_2X_2 + b_3X_3 + b_4X_4\)
Where…
For regression, within the lm function, which stands for linear model, the dependent variable is listed first, followed by a tilde (~), and the independent variables are listed after it, separated by + signs.
This may seem confusing, so it’s best to wrap our lm function in a summary call…
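The full call looks like this (the data object is rebuilt so the chunk runs on its own):

```r
# Rebuild the data frame (15 distinct scores per variable, repeated 4 times)
gpa  <- rep(c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3), times = 4)
ar   <- rep(c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0), times = 4)
grev <- rep(c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610), times = 4)
greq <- rep(c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600), times = 4)
mat  <- rep(c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85), times = 4)
data <- data.frame(gpa, ar, grev, greq, mat)

# Dependent variable before the tilde, independent variables after it
summary(lm(gpa ~ ar + grev + mat + greq, data = data))
```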
##
## Call:
## lm(formula = gpa ~ ar + grev + mat + greq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7876 -0.2297 0.0069 0.2868 0.5260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.738107 0.640989 -2.712 0.00892 **
## ar 0.144233 0.076185 1.893 0.06360 .
## grev 0.001524 0.000708 2.152 0.03580 *
## mat 0.020896 0.006438 3.246 0.00200 **
## greq 0.003998 0.001234 3.240 0.00203 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3693 on 55 degrees of freedom
## Multiple R-squared: 0.6405, Adjusted R-squared: 0.6143
## F-statistic: 24.49 on 4 and 55 DF, p-value: 1.128e-11
To interpret the findings, we report the following information:
The test used
The variables used in the full model
For significant variables, how a variable’s slope affects the outcome
The amount of variance in the outcome explained by the combination of IVs.
grev, mat, and greq are all
positively and significantly related to the outcome variable. First, GRE
Verbal score, net of all other variables, is significant and positively
related to GPA, such that for every 1-unit increase in a person’s GRE
Verbal score, there is an associated (predicted) 0.001524-unit
increase in their GPA. Considering the Miller Analogies Test,
which is significant and positively related to GPA, net of all other
variables in the model, for every 1-unit increase in a person’s Miller
Analogies Test score, there is an associated 0.020896-unit
increase in their GPA. Finally, GRE Quantitative score is
significant and positively related to GPA, when controlling for all
other variables, such that for every 1-unit increase in a person’s GRE
Quantitative score, there is an associated 0.003998-unit
increase in their GPA. Beyond this, we see that the
Recommender’s Average Rating for a student is non-significant in the
model, and is therefore unrelated to the outcome, net of all other
variables. Across the full model (ar, grev, mat, and greq), the model fit statistic, the \(R^2\), is .6405. This indicates that 64.05 percent of the variation in a person’s GPA is explained by the
combination of their average ratings, their GRE Verbal Score, their GRE
Quantitative Score, and their score on the Miller Analogies Test.
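To see the fitted equation in action, here is a quick hand prediction for the first student (ar = 2.7, grev = 540, mat = 65, greq = 625), plugging the coefficients printed above into the regression equation:

```r
# y-hat = b0 + b_ar*ar + b_grev*grev + b_mat*mat + b_greq*greq,
# using the coefficient estimates from the summary() output
yhat <- -1.738107 + 0.144233 * 2.7 + 0.001524 * 540 +
        0.020896 * 65 + 0.003998 * 625
round(yhat, 2)  # predicted GPA of about 3.33, close to the observed 3.2
```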