LAB 7:
Correlation and Linear Regression Analyses
The main purpose of this lab is to be able to use and correctly interpret the results of the following:
Scattergram (Scatter Plot)
Correlation analysis
Simple linear regression
Scatter plot
Is used to assess (visually) whether the relationship between two quantitative variables is linear or nonlinear
Is used to explore the direction of the relationship between the two variables (positive, negative or no relationship)
It can also be used to explore the data values (for example, to check whether there are any outliers)
Introduction
See Chapter 5 section 5.3 for more details
Correlation Analysis:
The correlation coefficient (r) computed from the sample data measures the strength and the direction of a linear relationship between two variables.
The null hypothesis:
Ho : There is no linear relationship between the 2 variables
The alternative hypothesis:
Ha : There is a linear relationship between the 2 variables
The correlation coefficient ranges from –1 to +1. When there is no linear relationship between the variables, or only a weak one, the value of the correlation coefficient will be close to 0.
No correlation: as x increases, no definite shift in y.
Positive correlation: as x increases, y increases.
Negative correlation: as x increases, y decreases
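The lab computes r in SPSS, but the calculation itself can be sketched in a few lines of Python. This is a minimal illustration only: the function name pearson_r and the data values are assumptions made up for the example, not part of the lab dataset.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only
hours = [1, 2, 3, 4, 5]
score_up = [52, 55, 61, 64, 70]     # y rises as x rises
score_down = [70, 64, 61, 55, 52]   # y falls as x rises

r_pos = pearson_r(hours, score_up)
r_neg = pearson_r(hours, score_down)
print(round(r_pos, 3), round(r_neg, 3))
```

The first pair of variables gives a value of r near +1 and the second a value near –1, matching the positive and negative cases described above.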
Things to remember
Correlation coefficient cutoff points
0 to +0.29: little or no association
+0.30 to +0.49: weak positive association
+0.50 to +0.69: medium positive association
+0.70 to +1.00: strong positive association
0 to –0.29: little or no association
–0.30 to –0.49: weak negative association
–0.50 to –0.69: medium negative association
–0.70 to –1.00: strong negative association
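The cutoff points can be turned into a small lookup, sketched here in Python. The function name describe_correlation is hypothetical, and the handling of values that fall between the listed ranges (for example 0.295) is an assumption: the sketch treats each range as running up to the start of the next one.

```python
def describe_correlation(r):
    """Label a correlation coefficient using the lab's cutoff points."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    size = abs(r)
    if size < 0.30:
        return "little or no association"
    if size < 0.50:
        strength = "weak"
    elif size < 0.70:
        strength = "medium"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} association"

print(describe_correlation(0.898))   # strong positive association
print(describe_correlation(-0.35))   # weak negative association
```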
Simple Linear Regression:
Is used to predict a single dependent variable (response) based on a single independent (predictor) variable
The null hypothesis:
Ho : The slope is zero; there is no linear relationship between the 2 variables
The alternative hypothesis:
Ha : The slope is not zero; there is a linear relationship between the 2 variables
For simple linear regression, the coefficient of determination (R²) is the square of the correlation coefficient.
The coefficient of determination (R²) ranges from 0 to +1. When there is no linear relationship between the variables, or only a weak one, the value of R² will be close to 0.
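The link between r and R² can be checked numerically. The sketch below, on made-up data, computes r directly and computes R² separately as the share of variation explained by the least-squares line (1 minus residual sum of squares over total sum of squares). The data values and variable names are assumptions for illustration only.

```python
import math

# Hypothetical data, invented for illustration only
x = [2, 4, 6, 8, 10]
y = [11, 14, 21, 22, 30]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)   # total sum of squares of y
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Correlation coefficient
r = sxy / math.sqrt(sxx * syy)

# Least-squares line y = a + b*x and its residual sum of squares
b = sxy / sxx
a = mean_y - b * mean_x
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Coefficient of determination: proportion of variation in y explained by x
r_squared = 1 - sse / syy

print(f"r = {r:.4f}, r squared = {r * r:.4f}, R squared = {r_squared:.4f}")
```

The last two printed values agree, which is the relationship stated above for simple linear regression.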
y = a + bx
y = dependent (predicted) variable
x = independent (predictor) variable
a = y-intercept (constant)
b = slope (regression coefficient) of the line
Assumptions:
Normality
Equal variances
Independence
Linear relationship
Regression analysis establishes a regression equation for predictions
For a given value of x, we can predict a value of y
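As a rough sketch of how a regression equation is established and then used for prediction, the Python below fits y = a + bx by ordinary least squares and predicts y for a new x. The function names fit_line and predict and the data are hypothetical; SPSS performs the same fit internally and also reports standard errors and p-values, which this sketch omits.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope (regression coefficient)
    a = mean_y - b * mean_x    # y-intercept (constant)
    return a, b

def predict(a, b, x_new):
    """Predicted y for a given x from the regression equation."""
    return a + b * x_new

# Hypothetical data lying exactly on the line y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

a, b = fit_line(x, y)
print(a, b, predict(a, b, 6))   # 1.0 2.0 13.0
```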
Correlation
Is used to measure the strength and the direction of a linear association between two quantitative variables
Simple Linear Regression
Is used to predict a single dependent variable (response) based on a single independent (predictor) variable.
Scattergram (Scatter Plot)
Is used to assess (visually) the relationship between two quantitative variables (linear vs. nonlinear).
Hypothetical Example for Simple Regression
Question: Is revision time a good predictor of exam performance?
Answer:
Since we have two quantitative (not repeated) variables, simple linear regression should be used to answer this research question.
Step 1: Scatter plot
Dependent variable (response)? Exam performance
Independent variable (predictor)? Revision time
Can you see any relationship between revision time and exam performance? Positive or Negative?
To Obtain Scatter Plot:
From the menus choose:
Graphs
Legacy Dialogs
Scatter/Dot
· In the Scatterplot dialog, select the icon for simple scatter
· Select Define
· Select Exam performance as the Y-axis variable and Revision time as the X-axis variable
· Click OK
Question: Is revision time a good predictor of exam performance? (continued)
Answer:
To Obtain an Individual 95% Confidence Interval on the Scatter Plot:
After scattergram is drawn, do the following:
· Double click on the graph (which puts you in the chart editor window)
· Click on any one of the data points (this will highlight all the data points)
· From the menus choose:
Click on the chart menu
Select elements
Fit line at total
· Then check individual 95% confidence interval
· Click apply and close the properties box
· Close the chart editor
Can you see any relationship between Revision Time and Exam Performance? Yes, Positive relationship
Answer:
Step 2: Correlation and simple linear regression analysis
To Obtain Statistics for correlation and simple linear regression
From the menus choose:
Analyze
Regression
Linear
· Select exam performance as the dependent variable
· Select revision time as the independent variable
· Click Statistics and select Estimates and 95% CI; also uncheck Model fit and check Descriptives
· Click OK
Model Summary (b)
Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .898a  .806      .805               9.676
a. Predictors: (Constant), Revision Time
b. Dependent Variable: Exam Performance

Coefficients (a)
Model             Unstandardized B  Std. Error  Standardized Beta  t        Sig.
1  (Constant)     –207.810          8.365                          –24.843  .000
   Revision Time  5.356             .128        .898               41.952   .000
a. Dependent Variable: Exam Performance
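How to read the output:
Correlation coefficient: R = .898
Coefficient of determination (R²): R Square = .806, so 80.6% of the variation in exam performance (the dependent variable) is explained by revision time (the independent variable)
Sig. for Revision Time (.000): answers the question of whether there is a significant linear relationship or not
Slope: B for Revision Time = 5.356
Intercept (constant): B for (Constant) = –207.810
The regression equation: Exam Performance = –207.810 + 5.356 × (Revision Time)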
Results:
Correlation Analysis
There is a significant strong positive linear relationship between revision time and exam performance (Evidence: Correlation Coefficient (r) = 0.898)
Regression Analysis
80.6 % of the variation in exam performance is explained by revision time
Regression equation: exam performance = –207.81 + 5.36 × revision time
Conclusion: Revision time is a useful predictor of exam performance
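The reported figures can be cross-checked and used for a prediction with a few lines of Python. The coefficients below are copied from the SPSS output above; the function name and the revision time of 60 are hypothetical (the slides do not state the units of either variable), so the predicted value is only an illustration of plugging a number into the equation.

```python
# Values taken from the SPSS output in this lab
INTERCEPT = -207.810   # B for (Constant)
SLOPE = 5.356          # B for Revision Time
SLOPE_SE = 0.128       # Std. Error of the slope
R = 0.898              # correlation coefficient

def predicted_exam_performance(revision_time):
    """Apply the fitted regression equation y = a + b*x."""
    return INTERCEPT + SLOPE * revision_time

# R squared is the square of r in simple linear regression
r_squared = R ** 2

# t statistic for the slope is B divided by its standard error; with these
# rounded inputs it comes out slightly below the 41.952 SPSS reports
t_slope = SLOPE / SLOPE_SE

print(round(r_squared, 3))                        # 0.806
print(round(t_slope, 1))                          # 41.8
print(round(predicted_exam_performance(60), 2))   # 113.55 (hypothetical input)
```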
[Figure: scatter plot of Exam Performance (Y axis) against Revision Time (X axis) with the fitted regression line Y = a + bX, where Exam Performance = –207.81 + 5.356 × (Revision Time)]
Variable       β*    Std. Error  P-value
Revision time  5.36  0.13        <0.001
Dependent variable: Exam Performance
Intercept = –207.81
R² = 0.806
Name __________________
Objectives: Be able to correctly use and interpret:
1. Scattergram (scatter plot)
2. Correlation analysis
3. Regression analysis
A SCATTERGRAM is a good way to get a feel for any relationship which may exist between two different quantitative variables. OUTLIERS can also often be spotted with a Scattergram.
1. In the dataset Assignment_7_SP15.sav (a part of CORN1 dataset), examine a Scattergram between weight at baseline and height.
· From the menus, choose:
Graphs
Legacy Dialogs
Scatter/Dot
· In the Scatterplot dialog, select the icon for simple scatter.
· Select Define.
· Select weight at baseline as the Y-axis variable and height as the X-axis variable
· Click OK
After scattergram is drawn, do the following:
· Double click on graph (which puts you in the chart editor window)
· Click on any one of the data points (this will highlight all the data points)
· From the menus, choose:
Click on the chart menu
Select elements
Fit line at total
· Then check individual 95% confidence interval.
· Click apply and close properties box
· Close chart editor
Answer the following questions:
a) Dependent variable? _________________
b) Independent variable? ________________
c) Can you see any relationship between height and weight? Positive or Negative?
Note: Correlation and regression analysis are run at the same time
2. CORRELATION ANALYSIS can give us a more complete picture of the LINEAR relationship between these two variables.
a) H0: ________________________________________________________________________________________________________________________________________________________________________________________________________________________
· From the menus choose:
Analyze
Regression
Linear…
· Select weight as the dependent variable and height as the independent variable.
b) What is the CORRELATION COEFFICIENT (r) = _______? This is a measure of the strength and direction of the linear relationship between the dependent variable and the independent variable.
c) Is there a strong relationship between weight and height? _____ Evidence____ ?
3. REGRESSION ANALYSIS is used to predict the dependent variable based on the independent variable.
a) REGRESSION COEFFICIENT (unstandardized B)= ________? This is the slope of the regression line.
b) CONSTANT = __________? This is the point at which the regression line crosses the Y axis (also called “intercept”).
c) Write the equation of the REGRESSION LINE which best relates weight and height.
Recall y = a + bx
d) Use the above equation to predict your own weight.
e) What percent of the variation in weight is explained by height? Explain possible reasons for its accuracy (or inaccuracy) at predicting your weight.
Variable  β*  Std. Error  P-value



Height  
Dependent variable: Weight at Baseline  
Intercept =  
R² =

Conclusion: