Definition and Overview

Concept

Correlation coefficient quantifies the degree to which two variables co-vary linearly. Values range from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear relationship.

Purpose

Measures strength and direction of association. Used to identify predictive variables, confirm hypotheses, and summarize data relationships.

Historical Context

Developed by Francis Galton and Karl Pearson in the late 19th century. Foundation for modern regression and multivariate statistics.

"Correlation does not imply causation, but it is fundamental to understanding relationships in data." -- Karl Pearson

Types of Correlation Coefficients

Pearson Product-Moment Correlation

Measures linear relationship between two continuous variables. Assumes normality and homoscedasticity.

Spearman’s Rank Correlation

Non-parametric measure based on ranked data. Captures monotonic relationships, robust to outliers.

Kendall’s Tau

Non-parametric, measures ordinal association. Based on concordant and discordant pairs.
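The concordant/discordant counting can be sketched in pure Python. This computes tau-a, which ignores tie corrections (tau-b adjusts for ties); the function name is illustrative:

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs (no tie correction)."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:          # pair ordered the same way on both variables
            concordant += 1
        elif s < 0:        # pair ordered oppositely
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau_a([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))  # 9 concordant, 1 discordant -> 0.8
```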

Point-Biserial and Phi Coefficients

Point-Biserial: correlation between continuous and binary variable. Phi: association between two binary variables.

Summary Table

Coefficient | Data Type             | Assumptions                             | Range
Pearson r   | Continuous            | Normality, linearity, homoscedasticity  | -1 to +1
Spearman ρ  | Ordinal/nonparametric | Monotonicity                            | -1 to +1
Kendall τ   | Ordinal/nonparametric | Monotonicity                            | -1 to +1

Properties and Interpretation

Range and Significance

Range: -1 ≤ r ≤ 1. Positive values indicate direct association; negative values indicate inverse association. Magnitude indicates strength.

Strength Guidelines

Common heuristic: |r| < 0.3 = weak; 0.3 ≤ |r| < 0.5 = moderate; ≥ 0.5 = strong correlation.

Symmetry

Correlation coefficient is symmetric: r(X,Y) = r(Y,X).

Unitless Measure

Scale-free metric; unaffected by changes of units or any other positive linear transformation of either variable. Multiplying one variable by a negative constant reverses the sign of r but not its magnitude.

Effect of Outliers

Highly sensitive to outliers, especially Pearson’s r. Can distort interpretation.
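A minimal demonstration of this sensitivity, with made-up data: one aberrant point turns a perfect positive correlation negative (pearson_r is a direct transcription of the product-moment formula):

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

clean = ([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])                 # perfectly linear
with_outlier = ([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, -10])  # one aberrant point appended

print(pearson_r(*clean))         # 1.0
print(pearson_r(*with_outlier))  # negative (about -0.44): a single point flipped the sign
```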

Calculation Methods

Pearson Correlation Formula

r = Σ((xᵢ - x̄)(yᵢ - ȳ)) / √[Σ(xᵢ - x̄)² Σ(yᵢ - ȳ)²]

Stepwise Procedure

1. Compute the means x̄ and ȳ.
2. Calculate the deviations (xᵢ - x̄) and (yᵢ - ȳ).
3. Sum the products of paired deviations (the covariance numerator).
4. Compute the denominator from the sums of squared deviations.
5. Divide the numerator by the denominator.
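The steps above map directly onto code; a minimal pure-Python sketch (function name illustrative), verified against Example 1 below:

```python
import math

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n      # step 1: means
    dx = [xi - mean_x for xi in x]               # step 2: deviations
    dy = [yi - mean_y for yi in y]
    num = sum(a * b for a, b in zip(dx, dy))     # step 3: covariance numerator
    den = math.sqrt(sum(a * a for a in dx) *     # step 4: sums of squares denominator
                    sum(b * b for b in dy))
    return num / den                             # step 5: ratio

print(pearson_r([2, 4, 6, 8, 10], [1, 3, 5, 7, 9]))  # 1.0
```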

Spearman’s Rank Correlation

ρ = 1 - (6 Σ dᵢ²) / (n(n² - 1))

dᵢ = difference between ranks of paired observations.
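A sketch of the rank-difference formula in pure Python, assuming no ties (with ties, average ranks and a tie correction are needed); names are illustrative:

```python
def spearman_rho(x, y):
    """Spearman's rho via the rank-difference formula (valid when there are no ties)."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        r = [0] * len(values)
        for pos, idx in enumerate(order, start=1):
            r[idx] = pos                     # rank 1 = smallest value
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_sq / (n * (n * n - 1))

print(spearman_rho([10, 20, 30, 40], [1, 8, 27, 64]))  # 1.0: monotonic though nonlinear
```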

Computational Considerations

Sample size n impacts variance and confidence intervals. Use unbiased estimators for population inference.

Assumptions and Conditions

Linearity

Relationship between variables must be approximately linear for Pearson’s r to be valid.

Normality

Variables should be approximately normally distributed for significance testing of Pearson’s r.

Homoscedasticity

Constant variance of residuals across values of X is required.

Independence

Observations must be independent across sample.

Monotonicity for Nonparametric

Spearman’s and Kendall’s coefficients assume monotonic, not necessarily linear, relationships.

Applications in Regression Analysis

Variable Selection

Used to identify predictor variables with strong linear associations to dependent variable.

Multicollinearity Detection

High inter-correlations among predictors indicate multicollinearity risk, complicating regression interpretation.
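A quick screen computes pairwise correlations among predictors and flags high-|r| pairs; the 0.9 threshold and the data are arbitrary illustrations:

```python
import math
from itertools import combinations

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

predictors = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 12],  # exact multiple of x1 -> r = 1
    "x3": [3, 1, 4, 1, 5, 9],    # only weakly related to x1
}

# Flag predictor pairs whose |r| exceeds an (arbitrary) 0.9 threshold
flagged = [(a, b) for a, b in combinations(predictors, 2)
           if abs(pearson_r(predictors[a], predictors[b])) > 0.9]
print(flagged)  # [('x1', 'x2')]
```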

Goodness of Fit

In simple linear regression, the squared correlation coefficient (r² = R²) gives the proportion of variance in the dependent variable explained by the model.

Model Diagnostics

Residual correlation analysis detects model misspecification or omitted variables.

Exploratory Data Analysis

Initial data examination to reveal potential relationships and guide modeling strategies.

Limitations and Misinterpretations

Non-Causality

Correlation does not prove causation; confounding variables may exist.

Nonlinearity

Correlation coefficient fails to detect nonlinear relationships even if strong.
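Even a perfect deterministic relationship can yield r = 0 when it is nonlinear; for y = x² over a symmetric range:

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]   # y is a deterministic function of x

print(pearson_r(x, y))      # 0.0: the linear measure misses the parabola entirely
```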

Outliers

Extreme values can inflate or deflate correlation estimates misleadingly.

Restricted Range

Limited variability reduces correlation magnitude, underestimating true association.
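The effect is easy to show with synthetic data: the same noisy linear trend correlates strongly over a wide x-range but more weakly over a narrow slice, because the noise stays fixed while the signal variance shrinks (data constructed for illustration):

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

x_full = list(range(20))
# Linear trend plus fixed alternating "noise" of +/-1
y_full = [x + (1 if i % 2 == 0 else -1) for i, x in enumerate(x_full)]

# Restrict the sample to 8 <= x <= 12
pairs = [(x, y) for x, y in zip(x_full, y_full) if 8 <= x <= 12]
x_sub, y_sub = zip(*pairs)

r_full = pearson_r(x_full, y_full)
r_sub = pearson_r(list(x_sub), list(y_sub))
print(r_full, r_sub)  # the restricted-range r is noticeably smaller
```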

Spurious Correlations

Correlation may arise by chance, especially in large datasets or multiple testing contexts.

Numerical Examples

Example 1: Pearson Correlation

Data points: X = [2, 4, 6, 8, 10], Y = [1, 3, 5, 7, 9]

Calculation:

Mean X = 6, Mean Y = 5

Numerator: Σ(xᵢ - 6)(yᵢ - 5) = (-4)(-4) + (-2)(-2) + 0 + (2)(2) + (4)(4) = 16 + 4 + 0 + 4 + 16 = 40

Denominator: √[Σ(xᵢ - 6)² × Σ(yᵢ - 5)²] = √(40 × 40) = 40

r = 40 / 40 = 1 (perfect positive correlation)

Example 2: Spearman Rank Correlation

Raw data: X = [1, 2, 3, 4, 5], Y = [5, 6, 7, 8, 7]

A common error is to treat the raw Y values as ranks. Plugging them in with dᵢ = X - Y gives Σ dᵢ² = 16 + 16 + 16 + 16 + 4 = 68 and ρ = 1 - (6 × 68) / (5 × (25 - 1)) = 1 - 3.4 = -2.4, which lies outside [-1, 1] and signals invalid input.

Ranking Y properly (ties receive average ranks; the two 7s share rank (3 + 4)/2 = 3.5):

i | X rank | Y rank | dᵢ = X rank - Y rank | dᵢ²
1 | 1      | 1      | 0                    | 0
2 | 2      | 2      | 0                    | 0
3 | 3      | 3.5    | -0.5                 | 0.25
4 | 4      | 5      | -1                   | 1
5 | 5      | 3.5    | 1.5                  | 2.25

Σ dᵢ² = 3.5, n = 5

ρ = 1 - (6 × 3.5) / (5 × (25 - 1)) = 1 - 21/120 ≈ 0.825

Note: with ties, the rank-difference formula is only approximate; the exact value is obtained by applying Pearson's formula to the ranks. The flawed first attempt illustrates the importance of verifying that inputs are valid ranks.

Computing Correlation Coefficients in Software

R Language

Function: cor(x, y, method = "pearson"|"spearman"|"kendall"). Handles NA with use parameter.

Python (SciPy, Pandas)

SciPy: scipy.stats.pearsonr(x, y), spearmanr(x, y). Pandas: DataFrame.corr(method='pearson')

Excel

Functions: =CORREL(array1, array2) or =PEARSON(array1, array2) for Pearson. Spearman can be obtained by applying CORREL to ranks computed with RANK.AVG, or via add-ins.

SPSS

Analyze > Correlate > Bivariate; select Pearson, Spearman, or Kendall tau.

Interpretation Aids

Software provides p-values, confidence intervals, and scatterplots for comprehensive analysis.

Advanced Topics

Confidence Intervals and Hypothesis Testing

Fisher’s z-transformation stabilizes variance for interval estimation and significance testing.
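A sketch of the interval computation in pure Python, assuming bivariate normality and n > 3; the 95% level is hardcoded via the critical value z ≈ 1.96:

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation via Fisher's z."""
    z = math.atanh(r)                 # z = 0.5 * ln((1 + r) / (1 - r))
    se = 1 / math.sqrt(n - 3)         # standard error of z
    lo = math.tanh(z - z_crit * se)   # back-transform endpoints to the r scale
    hi = math.tanh(z + z_crit * se)
    return lo, hi

print(fisher_ci(0.6, 50))  # interval around 0.6, asymmetric on the r scale
```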

Robust Correlation Measures

Techniques like biweight midcorrelation reduce outlier influence; used in bioinformatics.

Correlation in Multivariate Analysis

Correlation matrices underpin PCA, factor analysis, cluster analysis.

Nonlinear Correlation Measures

Distance correlation and mutual information detect nonlinear dependencies missed by Pearson’s r.

Bootstrapping Correlations

Nonparametric resampling to estimate sampling distribution and variability of correlation coefficients.
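A minimal percentile-bootstrap sketch (resampling (x, y) pairs with replacement; seed, resample count, and data are arbitrary illustrations):

```python
import math
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for Pearson's r."""
    rng = random.Random(seed)
    n, stats = len(x), []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]     # resample pairs with replacement
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(xs)) > 1 and len(set(ys)) > 1:      # skip zero-variance resamples
            stats.append(pearson_r(xs, ys))
    stats.sort()
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]   # noisy increasing trend
print(bootstrap_ci(x, y))
```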

References

  • Pearson, K. "Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia." Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, vol. 187, 1896, pp. 253-318.
  • Spearman, C. "The proof and measurement of association between two things." The American Journal of Psychology, vol. 15, no. 1, 1904, pp. 72-101.
  • Kendall, M.G. "A New Measure of Rank Correlation." Biometrika, vol. 30, no. 1/2, 1938, pp. 81-93.
  • Fisher, R.A. "Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population." Biometrika, vol. 10, no. 4, 1915, pp. 507-521.
  • Wilcox, R.R. "Introduction to Robust Estimation and Hypothesis Testing." Academic Press, 3rd edition, 2012.