Definition and Basic Concepts
Random Variables and Dependence
Random variables: measurable functions from sample space to real numbers. Dependence: non-independence implying statistical association. Correlation: measure of linear dependence between two variables.
Expectation and Joint Distribution
Expectation (E): weighted average over probability space. Joint distribution: determines joint behavior of two variables.
Correlation Concept
Correlation quantifies tendency of two variables to vary together. Positive correlation: variables increase or decrease simultaneously. Negative correlation: one variable increases as the other decreases. Zero correlation: no linear association.
"Correlation does not imply causation." -- Anonymous
Covariance: Foundation of Correlation
Definition of Covariance
Covariance (Cov) defined as the expectation of product of centered variables:
Cov(X,Y) = E[(X - E[X])(Y - E[Y])]
Interpretation of Covariance
Positive covariance: variables tend to increase together. Negative covariance: inverse relation. Zero covariance: no linear relationship.
Units and Scale Dependence
Covariance units: product of units of X and Y. Scale-dependent: magnitude affected by variable units, limiting direct interpretability.
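As a concrete illustration of the definition, covariance can be computed directly from paired observations; this is a minimal Python sketch using hypothetical height (cm) and weight (kg) data.

```python
def covariance(xs, ys):
    """Population covariance: E[(X - E[X])(Y - E[Y])]."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Hypothetical data: heights in cm, weights in kg
heights = [160, 170, 180, 190]
weights = [55, 65, 80, 90]
print(covariance(heights, weights))  # 150.0, in units of cm * kg
```

Because the result carries units of cm·kg, its magnitude is hard to interpret on its own, which is exactly what motivates normalizing it into a correlation coefficient.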
Pearson Correlation Coefficient
Definition
Pearson correlation (ρ or r) normalizes covariance by product of standard deviations:
ρ = Cov(X,Y) / (σ_X σ_Y)
Range and Scale
Range: -1 ≤ ρ ≤ 1. Dimensionless measure. ρ = ±1 indicates perfect linear relationship. ρ = 0 indicates no linear correlation.
Calculation from Data
Sample correlation uses sample means and standard deviations. Formula:
r = Σ (x_i - x̄)(y_i - ȳ) / [√Σ (x_i - x̄)² * √Σ (y_i - ȳ)²]
Properties of Correlation
Symmetry
ρ(X,Y) = ρ(Y,X). Correlation is symmetric by definition.
Unitless Measure
Scale-invariant: unchanged under positive linear transformations (aX + b with a > 0); a negative scale factor flips the sign of ρ.
Linearity Sensitivity
Measures linear association only. Nonlinear dependence may yield ρ near zero.
Bounds and Extremes
Bounds: ±1. Achieved only by perfect linear functions.
Relation to Independence
Independence ⇒ zero correlation. The converse does not hold in general: uncorrelated variables may still be dependent (see Example 2).
Interpretation and Significance
Strength of Association
Magnitude of ρ indicates strength. Guidelines: 0–0.3 weak, 0.3–0.7 moderate, >0.7 strong.
Direction
Sign of ρ indicates positive or negative trend.
Statistical Significance
Hypothesis testing: null hypothesis ρ=0, significance evaluated via t-test or permutation.
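The standard t statistic for testing H0: ρ = 0 is t = r√(n−2)/√(1−r²), compared against a t distribution with n − 2 degrees of freedom; a minimal sketch with hypothetical numbers:

```python
import math

def t_statistic(r, n):
    """t statistic for H0: rho = 0; compare to a t distribution with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Hypothetical: sample correlation 0.5 from 27 paired observations
print(t_statistic(0.5, 27))  # ~2.89, above the two-sided 5% cutoff for 25 df (~2.06)
```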
Effect of Outliers
Outliers may inflate or deflate ρ significantly.
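A minimal sketch (hypothetical data) of how a single extreme point can dominate the sums in the correlation formula:

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation from raw data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Weakly related data...
x = [1, 2, 3, 4, 5]
y = [3, 1, 4, 1, 5]
# ...plus one extreme point that dominates every sum
x_out = x + [100]
y_out = y + [100]
print(pearson(x, y))          # modest correlation (~0.35)
print(pearson(x_out, y_out))  # inflated to nearly 1
```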
Contextual Considerations
Practical significance depends on domain context and sample size.
Calculation Methods
Analytical Calculation
Using known distributions: compute E[X], E[Y], E[XY], variances.
Sample Estimators
Sample means and variances replace population parameters for data sets.
Matrix Form
Correlation matrix: symmetric matrix of pairwise correlations for multivariate data.
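For multivariate data the full matrix is one call in NumPy (np.corrcoef treats each row as a variable by default); a sketch with three hypothetical variables:

```python
import numpy as np

# Three variables observed over five samples (rows = variables)
data = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],   # X
    [2.0, 4.0, 6.0, 8.0, 10.0],  # Y = 2X  -> correlation +1 with X
    [5.0, 4.0, 3.0, 2.0, 1.0],   # Z = -X  -> correlation -1 with X
])
R = np.corrcoef(data)
print(R)  # 3x3, symmetric, with unit diagonal
```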
Computational Algorithms
Efficient incremental algorithms for streaming data.
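One such incremental scheme is a Welford-style one-pass update of the means and co-moments, sketched below; it processes each pair once and never stores the stream.

```python
class StreamingCorrelation:
    """One-pass Pearson correlation via Welford-style co-moment updates."""
    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0  # running sum of squared deviations of x
        self.m2_y = 0.0  # running sum of squared deviations of y
        self.cxy = 0.0   # running co-moment sum

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x            # deviation from the OLD mean
        self.mean_x += dx / self.n
        dy = y - self.mean_y
        self.mean_y += dy / self.n
        # mixing old-mean and new-mean deviations keeps the update exact
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.cxy += dx * (y - self.mean_y)

    def correlation(self):
        denom = (self.m2_x * self.m2_y) ** 0.5
        return self.cxy / denom if denom > 0 else float("nan")

sc = StreamingCorrelation()
for x, y in [(1, 2), (2, 4), (3, 6), (4, 8)]:
    sc.update(x, y)
print(sc.correlation())  # perfectly linear stream -> 1.0
```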
Software Implementations
Widely available in statistical packages (R, Python, MATLAB).
Types of Correlation Coefficients
Pearson Correlation
Measures linear relationship between continuous variables.
Spearman Rank Correlation
Nonparametric measure using rank-transformed data, sensitive to monotonic relationships.
Kendall’s Tau
Based on concordant and discordant pairs; robust to outliers.
Point-Biserial Correlation
Correlation between one continuous and one binary variable.
Polychoric and Tetrachoric Correlations
Estimate correlation for ordinal or dichotomous variables assuming latent normality.
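In practice these coefficients are available in scipy.stats (pearsonr, spearmanr, kendalltau); the sketch below contrasts them on a monotonic but nonlinear relationship, where the rank-based measures reach 1 while Pearson stays below it.

```python
from scipy import stats

# Monotonic but nonlinear relationship: y = x**3
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r_p, _ = stats.pearsonr(x, y)
r_s, _ = stats.spearmanr(x, y)
tau, _ = stats.kendalltau(x, y)
print(r_p)  # strong but below 1: the relationship is not linear
print(r_s)  # 1: the ranks agree perfectly
print(tau)  # 1: every pair is concordant
```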
Limitations and Pitfalls
Nonlinearity
Correlation may miss nonlinear dependence.
Confounding Variables
Spurious correlations caused by lurking variables.
Range Restriction
Limited data range can underestimate correlation.
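A small sketch with synthetic data: the same linear-plus-noise relationship shows a much weaker sample correlation when only a narrow slice of the x range is observed.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation from raw data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Linear trend plus deterministic alternating "noise"
x = list(range(20))
y = [xi + (4 if xi % 2 == 0 else -4) for xi in x]

r_full = pearson(x, y)
# Keep only the middle slice of the x range
xr, yr = zip(*[(xi, yi) for xi, yi in zip(x, y) if 8 <= xi <= 12])
r_restricted = pearson(xr, yr)
print(r_full)        # ~0.81 over the full range
print(r_restricted)  # ~0.34 over the restricted range
```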
Outliers and Influential Points
Outliers distort correlation estimates.
Causality Misinterpretation
High correlation does not imply causation.
Applications in Probability and Statistics
Exploratory Data Analysis
Identify associations and variable relationships.
Regression Analysis
Basis for linear regression modeling and diagnostics.
Portfolio Theory
Correlations determine diversification benefits in finance.
Signal Processing
Correlation functions detect similarity and patterns.
Machine Learning
Feature selection and redundancy assessment.
Worked Examples
Example 1: Simple Dataset
Given X = {1,2,3,4}, Y = {2,4,6,8}, compute Pearson correlation.
Solution: E[X] = 2.5, E[Y] = 5; Cov(X,Y) = E[XY] - E[X]E[Y]
E[XY] = (1*2 + 2*4 + 3*6 + 4*8)/4 = (2 + 8 + 18 + 32)/4 = 60/4 = 15
Cov(X,Y) = 15 - (2.5)(5) = 15 - 12.5 = 2.5
Var(X) = E[X²] - (E[X])² = (1²+2²+3²+4²)/4 - (2.5)² = (1+4+9+16)/4 - 6.25 = 7.5 - 6.25 = 1.25
Var(Y) = E[Y²] - (E[Y])² = (2²+4²+6²+8²)/4 - 25 = (4+16+36+64)/4 - 25 = 30 - 25 = 5
σ_X = √1.25 ≈ 1.118, σ_Y = √5 ≈ 2.236
ρ = Cov(X,Y) / (σ_X σ_Y) = 2.5 / (1.118 * 2.236) ≈ 1.0
Example 2: Interpretation of Zero Correlation
Variables X uniform on [-1,1], Y = X². Compute ρ(X,Y).
Solution: By symmetry, E[X] = 0 and E[XY] = E[X³] = 0, so Cov(X,Y) = 0 and ρ(X,Y) = 0, even though Y is completely determined by X.
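Both worked examples can be checked numerically; this sketch recomputes Example 1 exactly and approximates Example 2 by discretizing the uniform distribution on [-1, 1].

```python
def mean(v):
    return sum(v) / len(v)

def covariance(xs, ys):
    # Cov(X,Y) = E[XY] - E[X]E[Y]
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def pearson(xs, ys):
    vx = mean([x * x for x in xs]) - mean(xs) ** 2
    vy = mean([y * y for y in ys]) - mean(ys) ** 2
    return covariance(xs, ys) / (vx * vy) ** 0.5

# Example 1: Y = 2X, perfectly linear
rho1 = pearson([1, 2, 3, 4], [2, 4, 6, 8])

# Example 2: Y = X**2 on a grid symmetric about 0 (discretized uniform X)
xs = [i / 100 for i in range(-100, 101)]
rho2 = pearson(xs, [x * x for x in xs])

print(rho1)  # 1.0
print(rho2)  # ~0.0 despite Y being a function of X
```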
Summary Tables
Correlation Coefficient Ranges and Interpretation
| Correlation \|ρ\| | Strength | Description |
|---|---|---|
| 0.0 – 0.3 | Weak | Minimal linear association |
| 0.3 – 0.7 | Moderate | Noticeable linear relationship |
| 0.7 – 1.0 | Strong | High linear dependence |
Common Correlation Coefficients and Use Cases
| Coefficient | Data Type | Sensitivity | Notes |
|---|---|---|---|
| Pearson | Continuous | Linear | Assumes normality |
| Spearman | Ordinal/Continuous | Monotonic | Nonparametric |
| Kendall’s Tau | Ordinal/Continuous | Monotonic | Robust to outliers |
Key Formulas
Covariance
Cov(X,Y) = E[(X - μ_X)(Y - μ_Y)] = E[XY] - μ_X μ_Y
Pearson Correlation Coefficient
ρ = Cov(X,Y) / (σ_X σ_Y) = [E[XY] - μ_X μ_Y] / (σ_X σ_Y)
Sample Correlation Coefficient
r = Σ (x_i - x̄)(y_i - ȳ) / [√Σ (x_i - x̄)² * √Σ (y_i - ȳ)²]
Spearman Rank Correlation
ρ_s = 1 - (6 Σ d_i²) / [n(n² - 1)], where d_i = difference in ranks
Kendall’s Tau
τ = (Number of concordant pairs - Number of discordant pairs) / [n(n-1)/2]