Definition and Basic Concepts

Random Variables and Dependence

Random variables: measurable functions from the sample space to the real numbers. Dependence: statistical association between variables; the absence of independence. Correlation: a measure of linear dependence between two variables.

Expectation and Joint Distribution

Expectation (E): weighted average over probability space. Joint distribution: determines joint behavior of two variables.

Correlation Concept

Correlation quantifies tendency of two variables to vary together. Positive correlation: variables increase or decrease simultaneously. Negative correlation: one variable increases as the other decreases. Zero correlation: no linear association.

"Correlation does not imply causation." -- Anonymous

Covariance: Foundation of Correlation

Definition of Covariance

Covariance (Cov) defined as the expectation of product of centered variables:

Cov(X,Y) = E[(X - E[X])(Y - E[Y])]
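
As a minimal sketch (hypothetical data, using NumPy), the definition can be evaluated directly by centering each variable and averaging the product:

```python
import numpy as np

# Hypothetical sample treated as the full population
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Covariance as the mean of the product of centered variables
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_xy)  # 2.5
```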

Interpretation of Covariance

Positive covariance: variables tend to increase together. Negative covariance: inverse relation. Zero covariance: no linear relationship.

Units and Scale Dependence

Covariance units: product of units of X and Y. Scale-dependent: magnitude affected by variable units, limiting direct interpretability.

Pearson Correlation Coefficient

Definition

Pearson correlation (ρ or r) normalizes covariance by product of standard deviations:

ρ = Cov(X,Y) / (σ_X σ_Y)

Range and Scale

Range: -1 ≤ ρ ≤ 1. Dimensionless measure. ρ = ±1 indicates perfect linear relationship. ρ = 0 indicates no linear correlation.

Calculation from Data

Sample correlation uses sample means and standard deviations. Formula:

r = Σ (x_i - x̄)(y_i - ȳ) / [√Σ (x_i - x̄)² * √Σ (y_i - ȳ)²]
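
A direct transcription of this formula into pure Python might look as follows (the function name and data are illustrative, not from any particular library):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation via the centered-sum formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - x_bar) ** 2 for x in xs)) * \
          math.sqrt(sum((y - y_bar) ** 2 for y in ys))
    return num / den

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```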

Properties of Correlation

Symmetry

ρ(X,Y) = ρ(Y,X). Correlation is symmetric by definition.

Unitless Measure

Scale-invariant: unaffected by linear transformations of variables.
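
This invariance is easy to check empirically. The sketch below (simulated data; invariance holds exactly only for positive scale factors, since a negative factor flips the sign of ρ) compares ρ before and after affine transformations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)

r_original = np.corrcoef(x, y)[0, 1]
# Affine transforms a*x + b with a > 0 leave rho unchanged
r_scaled = np.corrcoef(3.0 * x + 7.0, 0.5 * y - 2.0)[0, 1]
assert abs(r_original - r_scaled) < 1e-9
```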

Linearity Sensitivity

Measures linear association only. Nonlinear dependence may yield ρ near zero.

Bounds and Extremes

Bounds: ±1. Achieved only by perfect linear functions.

Relation to Independence

Independence ⇒ zero correlation. Converse not always true.

Interpretation and Significance

Strength of Association

Magnitude of ρ indicates strength. Guidelines: 0–0.3 weak, 0.3–0.7 moderate, >0.7 strong.

Direction

Sign of ρ indicates positive or negative trend.

Statistical Significance

Hypothesis testing: null hypothesis ρ=0, significance evaluated via t-test or permutation.
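
For the t-test, the standard statistic is t = r·√(n − 2) / √(1 − r²) with n − 2 degrees of freedom. A minimal sketch (example values r = 0.5, n = 30 are hypothetical):

```python
import math

def correlation_t_stat(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# e.g. r = 0.5 observed from n = 30 observations
t = correlation_t_stat(0.5, 30)
print(round(t, 2))  # 3.06
```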

Effect of Outliers

Outliers may inflate or deflate ρ significantly.

Contextual Considerations

Practical significance depends on domain context and sample size.

Calculation Methods

Analytical Calculation

Using known distributions: compute E[X], E[Y], E[XY], variances.

Sample Estimators

Sample means and variances replace population parameters for data sets.

Matrix Form

Correlation matrix: symmetric matrix of pairwise correlations for multivariate data.
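
With NumPy, the full matrix of pairwise correlations can be obtained in one call (the simulated data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=(100, 3))  # 100 observations, 3 variables

# rowvar=False: columns are variables, rows are observations
R = np.corrcoef(data, rowvar=False)

# The result is symmetric with a unit diagonal
assert np.allclose(R, R.T)
assert np.allclose(np.diag(R), 1.0)
```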

Computational Algorithms

Efficient incremental algorithms for streaming data.
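
One such scheme is a Welford-style single-pass update, sketched below (the class name is illustrative): each new pair is folded into running means and centered sums in O(1) time and memory, without storing the data.

```python
class OnlineCorrelation:
    """Single-pass correlation using Welford-style updates."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0   # running sum of squared deviations of x
        self.m2_y = 0.0   # running sum of squared deviations of y
        self.c_xy = 0.0   # running sum of cross deviations

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x          # deviation from the old mean
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)   # old deviation * new deviation
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def correlation(self):
        return self.c_xy / (self.m2_x * self.m2_y) ** 0.5

oc = OnlineCorrelation()
for pair in [(1, 2), (2, 4), (3, 6), (4, 8)]:
    oc.update(*pair)
print(oc.correlation())  # 1.0
```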

Software Implementations

Widely available in statistical packages (R, Python, MATLAB).

Types of Correlation Coefficients

Pearson Correlation

Measures linear relationship between continuous variables.

Spearman Rank Correlation

Nonparametric measure using rank-transformed data, sensitive to monotonic relationships.

Kendall’s Tau

Based on concordant and discordant pairs; robust to outliers.

Point-Biserial Correlation

Correlation between one continuous and one binary variable.

Polychoric and Tetrachoric Correlations

Estimate correlation for ordinal or dichotomous variables assuming latent normality.

Limitations and Pitfalls

Nonlinearity

Correlation may miss nonlinear dependence.

Confounding Variables

Spurious correlations caused by lurking variables.

Range Restriction

Limited data range can underestimate correlation.

Outliers and Influential Points

Outliers distort correlation estimates.

Causality Misinterpretation

High correlation does not imply causation.

Applications in Probability and Statistics

Exploratory Data Analysis

Identify associations and variable relationships.

Regression Analysis

Basis for linear regression modeling and diagnostics.

Portfolio Theory

Correlations determine diversification benefits in finance.

Signal Processing

Correlation functions detect similarity and patterns.

Machine Learning

Feature selection and redundancy assessment.

Worked Examples

Example 1: Simple Dataset

Given X = {1,2,3,4}, Y = {2,4,6,8}, compute Pearson correlation.

Solution: E[X] = 2.5, E[Y] = 5; Cov(X,Y) = E[XY] - E[X]E[Y]

E[XY] = (1*2 + 2*4 + 3*6 + 4*8)/4 = (2 + 8 + 18 + 32)/4 = 60/4 = 15
Cov(X,Y) = 15 - (2.5)(5) = 15 - 12.5 = 2.5
Var(X) = E[X²] - (E[X])² = (1² + 2² + 3² + 4²)/4 - (2.5)² = (1+4+9+16)/4 - 6.25 = 7.5 - 6.25 = 1.25
Var(Y) = E[Y²] - (E[Y])² = (2² + 4² + 6² + 8²)/4 - 25 = (4+16+36+64)/4 - 25 = 30 - 25 = 5
σ_X = √1.25 ≈ 1.118, σ_Y = √5 ≈ 2.236
ρ = Cov(X,Y) / (σ_X σ_Y) = 2.5 / (1.118 * 2.236) ≈ 1.0
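
The same arithmetic can be checked with a short script, restating the example's data and using population (divide-by-n) moments:

```python
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
n = len(x)

e_x = sum(x) / n                               # 2.5
e_y = sum(y) / n                               # 5.0
e_xy = sum(a * b for a, b in zip(x, y)) / n    # 15.0
cov = e_xy - e_x * e_y                         # 2.5
var_x = sum(a * a for a in x) / n - e_x ** 2   # 1.25
var_y = sum(b * b for b in y) / n - e_y ** 2   # 5.0
rho = cov / (var_x ** 0.5 * var_y ** 0.5)
print(rho)  # 1.0
```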

Example 2: Interpretation of Zero Correlation

Variables X uniform on [-1,1], Y = X². Compute ρ(X,Y).

Solution: Although Y depends on X, linear correlation is zero due to symmetry.
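
A numerical sketch (approximating the uniform distribution with a symmetric grid on [-1, 1]) shows the covariance vanishing despite the perfect functional dependence:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # symmetric grid on [-1, 1]
y = x ** 2                         # Y is completely determined by X

# E[X] = 0 and E[XY] = E[X^3] = 0 by symmetry, so Cov(X, Y) = 0
cov = np.mean(x * y) - np.mean(x) * np.mean(y)
print(abs(cov) < 1e-9)  # True
```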

Summary Tables

Correlation Coefficient Ranges and Interpretation

Correlation (|ρ|) | Strength | Description
0.0 – 0.3         | Weak     | Minimal linear association
0.3 – 0.7         | Moderate | Noticeable linear relationship
0.7 – 1.0         | Strong   | High linear dependence

Common Correlation Coefficients and Use Cases

Coefficient   | Data Type          | Sensitivity | Notes
Pearson       | Continuous         | Linear      | Inference assumes bivariate normality
Spearman      | Ordinal/Continuous | Monotonic   | Nonparametric
Kendall's Tau | Ordinal/Continuous | Monotonic   | Robust to outliers

Key Formulas

Covariance

Cov(X,Y) = E[(X - μ_X)(Y - μ_Y)] = E[XY] - μ_X μ_Y 

Pearson Correlation Coefficient

ρ = Cov(X,Y) / (σ_X σ_Y) = [E[XY] - μ_X μ_Y] / (σ_X σ_Y) 

Sample Correlation Coefficient

r = Σ (x_i - x̄)(y_i - ȳ) / [√Σ (x_i - x̄)² * √Σ (y_i - ȳ)²] 

Spearman Rank Correlation

ρ_s = 1 - (6 Σ d_i²) / [n(n² - 1)]

where d_i = difference in ranks of x_i and y_i
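
A pure-Python sketch of this formula (function name illustrative; valid only when there are no tied values, which the d_i formula assumes):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation via the d_i formula (assumes no ties)."""
    n = len(xs)

    def ranks(vals):
        order = sorted(range(n), key=lambda i: vals[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# A monotone but nonlinear relationship gives rho_s = 1
print(spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))  # 1.0
```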

Kendall’s Tau

τ = (Number of concordant pairs - Number of discordant pairs) / [n(n-1)/2] 
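
The pair-counting definition translates directly into code. This sketch implements the tau-a variant above, which assumes no tied values (function name and data are illustrative):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs (no ties)."""
    n = len(xs)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:          # pair ordered the same way in x and y
            concordant += 1
        elif s < 0:        # pair ordered oppositely
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# 2 concordant pairs, 1 discordant pair out of 3: tau = 1/3
print(kendall_tau([1, 2, 3], [1, 3, 2]))
```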

References

  • Fisher, R.A., "Correlation Between Relatives on the Supposition of Mendelian Inheritance," Transactions of the Royal Society of Edinburgh, vol. 52, 1918, pp. 399-433.
  • Spearman, C., "The Proof and Measurement of Association Between Two Things," The American Journal of Psychology, vol. 15, 1904, pp. 72-101.
  • Kendall, M.G., "A New Measure of Rank Correlation," Biometrika, vol. 30, 1938, pp. 81-93.
  • Hotelling, H., "Simplified Calculation of the Correlation Coefficient," Journal of Educational Psychology, vol. 22, 1931, pp. 143-148.
  • Gnanadesikan, R., "Methods for Statistical Data Analysis of Multivariate Observations," Wiley, 1997, pp. 120-145.