Definition and Interpretation
What is R Squared?
Coefficient of determination. Measures proportion of variance in dependent variable explained by independent variables. Indicator of model fit quality. Values range 0 to 1, sometimes expressed as percentage.
Historical Context
Builds on regression work by Francis Galton in the 1880s and the correlation coefficient formalized by Karl Pearson in the 1890s. The term "coefficient of determination" was introduced by Sewall Wright in 1921. Fundamental in linear regression diagnostics.
Conceptual Meaning
Explained variance ratio. Quantifies how much variability in outcome is captured by predictors. Higher values indicate better explanatory power.
Calculation and Formula
Basic Formula
Ratio of explained sum of squares to total sum of squares. Denotes fraction of variance accounted for by model.
R² = 1 - (SS_res / SS_tot)

where:
- SS_res = Σ(yᵢ - ŷᵢ)² (Residual Sum of Squares)
- SS_tot = Σ(yᵢ - ȳ)² (Total Sum of Squares)
- yᵢ = observed values
- ŷᵢ = predicted values
- ȳ = mean of observed values

Alternate Expression
Also expressed as square of Pearson correlation coefficient in simple linear regression.
R² = (correlation(y, ŷ))²

Components Explained
SS_tot: total variability in data. SS_res: variability unexplained by model. SS_reg = SS_tot - SS_res: variability explained by regression.
| Term | Definition |
|---|---|
| SS_tot | Total Sum of Squares, total variance in observed data |
| SS_res | Residual Sum of Squares, variance unexplained by model |
| SS_reg | Regression Sum of Squares, variance explained by model |
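The decomposition above translates directly into code. A minimal sketch in Python, assuming numpy is available (the function name `r_squared` is illustrative):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot for observed y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)       # SS_res: variation unexplained by model
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # SS_tot: total variation around the mean
    return 1.0 - ss_res / ss_tot
```

For example, `r_squared([3, 5, 7, 9], [2.8, 5.1, 6.9, 9.2])` reproduces the worked example later in this article.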
Interpretation of Values
Value Range
Ranges between 0 and 1 for least-squares fits with an intercept. 0 means the model explains none of the variance; 1 means a perfect fit, all variance explained. Evaluated on new data, or for models fit without an intercept, R² can fall below 0.
Common Thresholds
Values above 0.7 indicate good fit in many fields. Values below 0.3 suggest weak model explanatory power. Context-dependent.
Percentage Explanation
R² × 100 gives percentage of variance explained by model. Example: R²=0.85 means 85% variance explained.
Types of R Squared
Simple R Squared
Derived from simple linear regression with one predictor. Equals square of correlation coefficient between observed and predicted.
Multiple R Squared
Used in multiple regression with several predictors. Measures collective explanatory power.
Pseudo R Squared
Used in models without least squares estimation, e.g. logistic regression. Various definitions exist (McFadden, Cox & Snell).
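McFadden's version compares the fitted model's log-likelihood to that of an intercept-only (null) model. A sketch using scikit-learn on a toy dataset (the data and variable names here are illustrative assumptions, not from the text):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Illustrative toy binary data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# log_loss(..., normalize=False) is the negative log-likelihood summed over cases
ll_model = -log_loss(y, model.predict_proba(X), normalize=False)
p = y.mean()  # the null model predicts the base rate for every case
ll_null = -log_loss(y, np.column_stack([np.full(len(y), 1 - p),
                                        np.full(len(y), p)]), normalize=False)

mcfadden_r2 = 1.0 - ll_model / ll_null  # McFadden's pseudo-R^2
```

Unlike ordinary R², McFadden's pseudo-R² rarely approaches 1 even for well-fitting models.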
Limitations and Misinterpretations
Overfitting Susceptibility
In-sample R² never decreases when predictors are added, regardless of their relevance. Can be misleading without adjustment.
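This inflation is easy to demonstrate: appending pure-noise columns to the design matrix cannot lower the in-sample R². A sketch with scikit-learn on assumed toy data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: one real predictor plus noise
rng = np.random.default_rng(42)
n = 100
x = rng.normal(size=(n, 1))
y = 2.0 * x[:, 0] + rng.normal(size=n)

r2_base = LinearRegression().fit(x, y).score(x, y)

# Append five pure-noise predictors carrying no signal
X_noisy = np.hstack([x, rng.normal(size=(n, 5))])
r2_noisy = LinearRegression().fit(X_noisy, y).score(X_noisy, y)

# r2_noisy >= r2_base, even though the extra columns are irrelevant
```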
Does Not Imply Causation
High R² does not confirm causal relationship. Only measures association strength.
Not Absolute Measure of Model Quality
Must be complemented by residual analysis, significance tests, and domain knowledge.
Adjusted R Squared
Purpose
Corrects R² for number of predictors and sample size. Penalizes model complexity.
Formula
Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - p - 1)]

where:
- n = number of observations
- p = number of predictors

Interpretation
May decrease if new predictors do not improve model. Preferred over raw R² for model comparison.
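The adjustment is a one-line computation. A minimal sketch (the function name is illustrative):

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Example: R^2 = 0.75 with n = 50 observations and p = 2 predictors
adj = adjusted_r_squared(0.75, 50, 2)  # slightly below 0.75
```

For any p ≥ 1, the adjusted value is at most the raw R², with the penalty growing as p approaches n.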
R Squared vs Correlation Coefficient
Relationship
In simple linear regression, R² equals square of Pearson correlation coefficient (r).
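This identity can be checked numerically with a small least-squares fit (the data below are an illustrative assumption):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation r
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# r**2 and r2 agree up to floating-point error
```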
Differences in Multiple Regression
With several predictors, no single pairwise correlation summarizes fit. Instead, R² equals the squared correlation between observed and fitted values, known as the multiple correlation coefficient R; in this sense R² generalizes explained variance.
Interpretation Nuances
Correlation measures linear association; R² measures explained variance fraction. Both important, different roles.
Use Cases in Regression Analysis
Model Evaluation
Primary metric for assessing goodness of fit. Guides model refinement and variable selection.
Predictive Accuracy
Indicates expected predictive performance on similar data. Useful in forecasting and risk modeling.
Comparing Models
Used to compare nested and non-nested models. Adjusted R² preferred for penalized comparison.
Computation Examples
Simple Linear Regression Example
Dataset: observed y = [3, 5, 7, 9], predicted ŷ = [2.8, 5.1, 6.9, 9.2]
Calculate:

SS_tot = Σ(yᵢ - ȳ)² = (3-6)² + (5-6)² + (7-6)² + (9-6)² = 9 + 1 + 1 + 9 = 20
SS_res = Σ(yᵢ - ŷᵢ)² = (3-2.8)² + (5-5.1)² + (7-6.9)² + (9-9.2)² = 0.04 + 0.01 + 0.01 + 0.04 = 0.10
R² = 1 - (0.10 / 20) = 1 - 0.005 = 0.995

Multiple Regression Example
Model with predictors X1, X2 predicting y. SS_tot = 100, SS_res = 25.
R² = 1 - (25 / 100) = 0.75 (75% of variance explained)

| Example | R² Value | Interpretation |
|---|---|---|
| Simple Regression | 0.995 | Excellent fit, nearly all variance explained |
| Multiple Regression | 0.75 | Good fit, substantial variance explained |
Software Implementations
R
The summary(lm()) output includes R² and adjusted R² automatically. Packages such as caret compute R² for other model types.
Python
Scikit-learn's LinearRegression().score() method returns R². Statsmodels provides detailed regression summaries including R².
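A minimal scikit-learn sketch showing both routes (the toy data here are an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Illustrative data with two predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X, y)
r2_via_score = model.score(X, y)               # R^2 on the training data
r2_via_metric = r2_score(y, model.predict(X))  # same value via the metric function
```

`r2_score` is the more general entry point, since it accepts predictions from any model.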
SPSS and SAS
Regression procedures output R², adjusted R², and related diagnostics by default.