Can we predict/explain variation in a person’s graduate school GPA from a variety of pre-graduate school factors?


What is the Regression?

The OLS regression examines the predictive relationship between one or more independent variables and an interval-ratio dependent variable. The test tells us about the effect (slope) of each independent (X) variable on an interval-ratio dependent (Y) variable. In particular, the regression equation looks at how values of the X variable(s) “predict” a specific Y value.

For this example, the OLS regression works well because we’re looking at how variation in a person’s graduate school GPA (gpa, an interval-ratio variable ranging from 2.5 to 4.3) can be predicted/explained by variation in four other pre-graduate school variables on which students were assessed:

summary(lm(gpa ~ ar + grev + mat + greq, data=data))


Load the Necessary Stuff

library(MASS)
library(psych)
library(vannstats)


Reading in the Data

In total, we have 60 individuals. Below, we list their respective scores. The data are as follows:

GPA: 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3, 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3, 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3, 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3

Average Recommender Rating: 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0, 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0, 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0, 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0.

GRE Verbal Score: 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610, 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610, 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610, 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610.

GRE Quantitative Score: 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600, 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600, 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600, 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600.

Miller Analogies Test Score: 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85, 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85, 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85, 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85.


As in the Intro to R vignette, we can create an object out of a list of numbers using the concatenate (c) function.

Knowing that we have five variables, we have to read in the variables separately (listing the values for each observation). To do so, we can use the following code:

gpa <- c(3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3, 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3, 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3, 3.2, 4.1, 3.0, 2.6, 3.7, 4.0, 4.3, 2.7, 3.6, 4.1, 2.7, 2.9, 2.5, 3.0, 3.3)

ar <- c(2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0, 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0, 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0, 2.7, 4.5, 2.5, 3.1, 3.6, 4.3, 4.6, 3.0, 4.7, 3.4, 3.7, 2.6, 3.1, 2.7, 5.0)

grev <- c(540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610, 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610, 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610, 540, 680, 480, 520, 490, 535, 720, 500, 575, 690, 545, 515, 520, 710, 610)

greq <- c(625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600, 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600, 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600, 625, 575, 520, 545, 520, 655, 630, 500, 605, 555, 505, 540, 520, 585, 600)

mat <- c(65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85, 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85, 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85, 65, 75, 65, 55, 75, 65, 75, 75, 65, 75, 55, 55, 55, 65, 85)

Here, the first number in each list corresponds with the first observation, the second number with the second observation, and so on. For example, the first observation in the gpa list is 3.2, which corresponds with the first observation in the mat list, 65.

Next, to appropriately prepare the data for analysis, we have to combine these five lists. To combine them, as in the Intro to R vignette, we can use the data.frame function.

data <- data.frame(gpa, ar, grev, greq, mat)

Now we can call the data…

data
##    gpa  ar grev greq mat
## 1  3.2 2.7  540  625  65
## 2  4.1 4.5  680  575  75
## 3  3.0 2.5  480  520  65
## 4  2.6 3.1  520  545  55
## 5  3.7 3.6  490  520  75
## 6  4.0 4.3  535  655  65
## 7  4.3 4.6  720  630  75
## 8  2.7 3.0  500  500  75
## 9  3.6 4.7  575  605  65
## 10 4.1 3.4  690  555  75
## 11 2.7 3.7  545  505  55
## 12 2.9 2.6  515  540  55
## 13 2.5 3.1  520  520  55
## 14 3.0 2.7  710  585  65
## 15 3.3 5.0  610  600  85
## 16 3.2 2.7  540  625  65
## 17 4.1 4.5  680  575  75
## 18 3.0 2.5  480  520  65
## 19 2.6 3.1  520  545  55
## 20 3.7 3.6  490  520  75
## 21 4.0 4.3  535  655  65
## 22 4.3 4.6  720  630  75
## 23 2.7 3.0  500  500  75
## 24 3.6 4.7  575  605  65
## 25 4.1 3.4  690  555  75
## 26 2.7 3.7  545  505  55
## 27 2.9 2.6  515  540  55
## 28 2.5 3.1  520  520  55
## 29 3.0 2.7  710  585  65
## 30 3.3 5.0  610  600  85
## 31 3.2 2.7  540  625  65
## 32 4.1 4.5  680  575  75
## 33 3.0 2.5  480  520  65
## 34 2.6 3.1  520  545  55
## 35 3.7 3.6  490  520  75
## 36 4.0 4.3  535  655  65
## 37 4.3 4.6  720  630  75
## 38 2.7 3.0  500  500  75
## 39 3.6 4.7  575  605  65
## 40 4.1 3.4  690  555  75
## 41 2.7 3.7  545  505  55
## 42 2.9 2.6  515  540  55
## 43 2.5 3.1  520  520  55
## 44 3.0 2.7  710  585  65
## 45 3.3 5.0  610  600  85
## 46 3.2 2.7  540  625  65
## 47 4.1 4.5  680  575  75
## 48 3.0 2.5  480  520  65
## 49 2.6 3.1  520  545  55
## 50 3.7 3.6  490  520  75
## 51 4.0 4.3  535  655  65
## 52 4.3 4.6  720  630  75
## 53 2.7 3.0  500  500  75
## 54 3.6 4.7  575  605  65
## 55 4.1 3.4  690  555  75
## 56 2.7 3.7  545  505  55
## 57 2.9 2.6  515  540  55
## 58 2.5 3.1  520  520  55
## 59 3.0 2.7  710  585  65
## 60 3.3 5.0  610  600  85
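Before moving to diagnostics, a quick sanity check can confirm the data frame was built as intended. Since the psych package is already loaded, one optional way to do this is:

```r
# confirm dimensions: should be 60 observations (rows) and 5 variables (columns)
dim(data)

# optional descriptive statistics (mean, sd, min, max, etc.) for each variable,
# using the describe function from the psych package loaded earlier
describe(data)
```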


Assumptions and Diagnostics for Regression

The assumptions for the regression are…

  • Adequate Sample Size
  • Absence of Outliers
  • Absence of Multicollinearity and Singularity
  • Linearity, Normality, and Homoskedasticity (Homogeneity of Variance)

In addition, the previously discussed assumption for other tests (independence of observations) is implied, since all of these tests require random samples. Beyond this, the OLS regression requires an interval-ratio outcome variable.

1. Adequate Sample Size
  • According to Green (1991), as cited in Tabachnick and Fidell (2006), adequate sample size is determined by the modified equation \(N \geq 50 + 8(k)\)

Where \(k\) is the number of independent variables included in the regression model.

  • Given that we have four IVs/predictor variables, the minimum number of cases to be adequate is 82 (\(82 = 50 + 8(4)\)). Therefore, with only 60 observations in the data set, we do not have enough cases to adequately run the regression model. That is, we have violated (not met) the assumption of adequate sample size. In almost all cases, I would advise against proceeding with the regression model; however, given that this is an example, I will proceed.
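The sample-size arithmetic above can be checked directly in R:

```r
# Green's (1991) rule of thumb: N >= 50 + 8k, where k = number of IVs
k <- 4
n_required <- 50 + 8 * k
n_required
# [1] 82

# with 60 observations, the data fall short of the required 82
nrow(data) >= n_required
# [1] FALSE
```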


2. Absence of Outliers

To identify outliers, simply look at the boxplots for each variable in the model (Y and all Xs) to see “how outlying” these outliers are. In most cases, outliers should remain in the data; removing outlying cases requires strong justification.

box(data, gpa)

box(data, ar)

box(data, grev)

box(data, greq)

box(data, mat)

  • We can see from the boxplots that the distributions of the variables are relatively normal, with no outliers. Therefore, we have met the assumption of absence of outliers. Interestingly, in the boxplot for the GRE Verbal variable, the median is closer to the 25th percentile. Equally interesting, the boxplot for the Miller Analogies Test variable is missing a lower whisker, indicating that the lowest extreme case is similar to (or the same as) the 25th percentile case.


3. Multicollinearity and Singularity
  • Multicollinearity: Independent variables that are more highly correlated with one another than with the dependent variable.

    • Check the correlation matrix for variables.
cormat(data, gpa ~ ar + grev + mat + greq)
##       gpa   ar grev  mat greq
## gpa     1                    
## ar   0.62    1               
## grev 0.58 0.41    1          
## mat   0.6 0.52 0.43    1     
## greq 0.61 0.51 0.47 0.27    1
  • We can see from the correlation matrix that none of the bivariate relationships between the independent variables (ar, grev, mat, greq) are above a correlation coefficient of \(r \approx .90\). Therefore, we have met the assumption of absence of multicollinearity.


  • Singularity: Occurs when independent variables included in the model are, together, subsets of another measure also included in the model.

    • Look at the items and determine if they are subsets of other items also included.


  • Based on the data, none of the independent variables are subsets of one another. While the GRE Verbal and the GRE Quantitative are subsets of the larger GRE, they would only count as singularity if we also included an overall or summative GRE score that was made up of the scores for the GRE Verbal and the GRE Quantitative. As such, we have met the assumption of absence of singularity.
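Beyond inspecting the correlation matrix, one common supplementary check for multicollinearity is the variance inflation factor (VIF). A sketch using the car package (not loaded above, so this is an optional extra) follows; VIF values above roughly 10 are conventionally taken to indicate problematic multicollinearity.

```r
# optional supplementary check: variance inflation factors
# requires the car package (install.packages("car") if not installed)
library(car)
vif(lm(gpa ~ ar + grev + mat + greq, data = data))
```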


4. Linearity, Normality, and Homoskedasticity
  • Linearity: Variables move together in a linear fashion.
  • Normality: Variables are normally-distributed.
  • Homoskedasticity: Homogeneity of Variance - the variances of the variables are similar (variance ratios no greater than about 10:1, or 3:1 for standard deviations).
    • Visual inspection of the residuals plot shows whether the relationship is linear and normal, with similar variances. The plot should have points that extend beyond both sides of the 0 line (normality), should not have a U or inverted-U shape in the points (linearity), and should not have a funnel shape, in which points are tightly clustered near the 0 line at one end of the plot and widely dispersed along the y-axis at the other end (homoskedasticity).
vannstats::residplot(data, gpa ~ ar + grev + mat + greq)

  • Based on the residuals plot (the difference between the actual \(Y\) and the \(\hat{Y}\)), we see that we have met the assumptions of linearity, normality, and homoskedasticity. Linearity is met given that the residuals do not exhibit a non-linear (e.g. curvilinear) relationship about the 0 distance (from \(\hat{Y}\)) line. Normality is met given that the residuals do not have a hard stop on either side of the line – that is, they are evenly distributed about the 0 distance (from \(\hat{Y}\)) line. Finally, homoskedasticity is met given that the residuals are evenly distanced from the 0 distance (from \(\hat{Y}\)) line at all values of \(\hat{Y}\) – as exemplified by the lack of “fanning out” on one end.
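If the vannstats package is unavailable, a comparable residuals-versus-fitted plot can be produced in base R (a minimal sketch of the same diagnostic):

```r
# base-R alternative to residplot: plot residuals against fitted values
fit <- lm(gpa ~ ar + grev + mat + greq, data = data)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # the 0 distance (from Y-hat) line
```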


The Regression Calculation

The calculation for the Regression is:

\(\hat{Y} = b_0 + b_1X_1 + b_2X_2 + \ldots + b_kX_k\)

Where \(\hat{Y}\) is the predicted value of the dependent variable, \(b_0\) is the Y-intercept (constant), and each \(b_k\) is the slope (coefficient) for independent variable \(X_k\).


Running the Regression

For Regression, within the lm function (which stands for linear model), the dependent variable is listed first, before the tilde (~), and the independent variables are listed after it.

lm(gpa ~ ar + grev + mat + greq, data=data)

Run on its own, lm prints only the coefficients, so it’s best to wrap our lm function in a summary call…

summary(lm(gpa ~ ar + grev + mat + greq, data=data))
## 
## Call:
## lm(formula = gpa ~ ar + grev + mat + greq, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7876 -0.2297  0.0069  0.2868  0.5260 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.738107   0.640989  -2.712  0.00892 **
## ar           0.144233   0.076185   1.893  0.06360 . 
## grev         0.001524   0.000708   2.152  0.03580 * 
## mat          0.020896   0.006438   3.246  0.00200 **
## greq         0.003998   0.001234   3.240  0.00203 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3693 on 55 degrees of freedom
## Multiple R-squared:  0.6405, Adjusted R-squared:  0.6143 
## F-statistic: 24.49 on 4 and 55 DF,  p-value: 1.128e-11
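Substituting the estimated coefficients from the output above into the regression equation gives the fitted model (rounded):

\[
\hat{gpa} = -1.738 + 0.144(ar) + 0.0015(grev) + 0.0209(mat) + 0.0040(greq)
\]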

To interpret the findings, we report the following information: