Definition and Interpretation

What is R Squared?

Coefficient of determination. Measures proportion of variance in dependent variable explained by independent variables. Indicator of model fit quality. Values range 0 to 1, sometimes expressed as percentage.

Historical Context

Builds on Francis Galton's work on regression in the 1880s; Karl Pearson formalized the correlation coefficient in 1896. The term "coefficient of determination" is usually credited to Sewall Wright (1921). Fundamental in linear regression diagnostics.

Conceptual Meaning

Explained variance ratio. Quantifies how much variability in outcome is captured by predictors. Higher values indicate better explanatory power.

Calculation and Formula

Basic Formula

Ratio of explained sum of squares to total sum of squares. Denotes fraction of variance accounted for by model.

R² = 1 - (SS_res / SS_tot)

where:
SS_res = Σ(yᵢ - ŷᵢ)²  (Residual Sum of Squares)
SS_tot = Σ(yᵢ - ȳ)²  (Total Sum of Squares)
yᵢ = observed values
ŷᵢ = predicted values
ȳ = mean of observed values
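
A minimal Python sketch of this formula (NumPy assumed; the function name r_squared is illustrative):

```python
import numpy as np

def r_squared(y, y_hat):
    """Compute R² = 1 - SS_res / SS_tot."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot
```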

Alternate Expression

Also expressed as square of Pearson correlation coefficient in simple linear regression.

R² = (correlation(y, ŷ))²
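
A quick numerical check of this identity, assuming NumPy and synthetic data; for an ordinary least-squares line with intercept, R² equals the squared Pearson correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

slope, intercept = np.polyfit(x, y, 1)   # ordinary least-squares line
y_hat = slope * x + intercept

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]              # Pearson r between x and y
print(np.isclose(r2, r ** 2))            # True: R² = r² in simple OLS
```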

Components Explained

SS_tot: total variability in data. SS_res: variability unexplained by model. SS_reg = SS_tot - SS_res: variability explained by regression.

Term   | Definition
SS_tot | Total Sum of Squares; total variance in observed data
SS_res | Residual Sum of Squares; variance unexplained by model
SS_reg | Regression Sum of Squares; variance explained by model

Interpretation of Values

Value Range

For least-squares models with an intercept, evaluated on the training data, R² ranges between 0 and 1. 0 means the model explains none of the variance; 1 means a perfect fit with all variance explained. In other settings (out-of-sample evaluation, models without an intercept), R² can be negative.

Common Thresholds

Rules of thumb vary by field: values above 0.7 are often read as a good fit, values below 0.3 as weak explanatory power. These thresholds are context-dependent; acceptable values differ widely across disciplines.

Percentage Explanation

R² × 100 gives percentage of variance explained by model. Example: R²=0.85 means 85% variance explained.

Types of R Squared

Simple R Squared

Derived from simple linear regression with one predictor. Equals square of correlation coefficient between observed and predicted.

Multiple R Squared

Used in multiple regression with several predictors. Measures collective explanatory power.

Pseudo R Squared

Used in models without least squares estimation, e.g. logistic regression. Various definitions exist (McFadden, Cox & Snell).
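
A sketch of McFadden's pseudo-R² (1 - LL_model / LL_null), assuming scikit-learn and synthetic data; all variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary-outcome data (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# Log-likelihoods of the fitted model and the intercept-only (null) model
ll_model = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
p0 = y.mean()
ll_null = np.sum(y * np.log(p0) + (1 - y) * np.log(1 - p0))

print(1 - ll_model / ll_null)   # McFadden's pseudo-R²
```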

Limitations and Misinterpretations

Overfitting Susceptibility

Never decreases when predictors are added, even irrelevant ones. Can be misleading without adjustment.

Does Not Imply Causation

High R² does not confirm causal relationship. Only measures association strength.

Not Absolute Measure of Model Quality

Must be complemented by residual analysis, significance tests, and domain knowledge.

Adjusted R Squared

Purpose

Corrects R² for number of predictors and sample size. Penalizes model complexity.

Formula

Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - p - 1)]

where:
n = number of observations
p = number of predictors
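
A direct translation into Python (the helper name is illustrative):

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r_squared(0.75, n=100, p=2))   # ≈ 0.745
```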

Interpretation

May decrease if new predictors do not improve model. Preferred over raw R² for model comparison.

R Squared vs Correlation Coefficient

Relationship

In simple linear regression, R² equals square of Pearson correlation coefficient (r).

Differences in Multiple Regression

With several predictors there is no single predictor-outcome correlation to square. R² instead equals the square of the multiple correlation coefficient, the correlation between observed and fitted values, and so generalizes explained variance.
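
A sketch of this relationship, assuming scikit-learn and synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

r2 = model.score(X, y)                  # R²
r_mult = np.corrcoef(y, y_hat)[0, 1]    # multiple correlation coefficient
print(np.isclose(r2, r_mult ** 2))      # True
```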

Interpretation Nuances

Correlation measures linear association; R² measures explained variance fraction. Both important, different roles.

Use Cases in Regression Analysis

Model Evaluation

Primary metric for assessing goodness of fit. Guides model refinement and variable selection.

Predictive Accuracy

Gives a rough indication of predictive performance on similar data; out-of-sample validation is more reliable. Useful in forecasting and risk modeling.

Comparing Models

Used to compare nested and non-nested models. Adjusted R² preferred for penalized comparison.

Computation Examples

Simple Linear Regression Example

Dataset: observed y = [3, 5, 7, 9], predicted ŷ = [2.8, 5.1, 6.9, 9.2]

Calculate:

SS_tot = Σ(yᵢ - ȳ)² = (3-6)² + (5-6)² + (7-6)² + (9-6)² = 9 + 1 + 1 + 9 = 20
SS_res = Σ(yᵢ - ŷᵢ)² = (3-2.8)² + (5-5.1)² + (7-6.9)² + (9-9.2)² = 0.04 + 0.01 + 0.01 + 0.04 = 0.10
R² = 1 - (0.10 / 20) = 1 - 0.005 = 0.995
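
The same arithmetic, reproduced in Python (NumPy assumed):

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.1, 6.9, 9.2])

ss_tot = np.sum((y - y.mean()) ** 2)   # 20.0
ss_res = np.sum((y - y_hat) ** 2)      # 0.10
print(1 - ss_res / ss_tot)             # 0.995
```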

Multiple Regression Example

Model with predictors X1, X2 predicting y. SS_tot = 100, SS_res = 25.

R² = 1 - (25 / 100) = 0.75 (75% variance explained)

Example             | R² Value | Interpretation
Simple Regression   | 0.995    | Excellent fit, nearly all variance explained
Multiple Regression | 0.75     | Good fit, substantial variance explained

Software Implementations

R

The summary(lm()) output reports R² ("Multiple R-squared") and adjusted R² automatically. Packages such as caret compute R² for other model types.

Python

Scikit-learn's LinearRegression().score() method returns R². Statsmodels provides detailed regression summaries including R².
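
A short usage sketch with synthetic data (both libraries assumed installed):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

# scikit-learn: score() returns R²
print(LinearRegression().fit(X, y).score(X, y))

# statsmodels: R² and adjusted R² from the fitted results
res = sm.OLS(y, sm.add_constant(X)).fit()
print(res.rsquared, res.rsquared_adj)
```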

SPSS and SAS

Regression procedures output R², adjusted R², and related diagnostics by default.
