Definition and Overview

Concept

Logistic regression: supervised learning algorithm for classification. Predicts probability of categorical dependent variable. Outputs mapped to [0,1] via sigmoid function. Typically used for binary classification problems.

Purpose

Estimate class membership probabilities. Model relationship between independent variables and binary outcome. Enables decision-making based on predicted likelihoods.

History

Origin: statistician David Cox, 1958. Extension of linear regression to classification. Widely adopted in statistics, epidemiology, and machine learning.

"The logistic model is a fundamental tool for binary response modeling." -- David R. Cox

Mathematical Foundations

Model Equation

Logistic function: maps linear combination of inputs to probability.

p(x) = 1 / (1 + e^(-(β₀ + β₁x₁ + ... + βₙxₙ)))

Sigmoid Function

Definition: S(t) = 1 / (1 + e^(-t)). Properties: smooth, monotonic, range (0,1). Converts logits to probabilities.
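A minimal numpy sketch of the sigmoid, assuming vector inputs (function name is illustrative):

```python
import numpy as np

def sigmoid(t):
    # S(t) = 1 / (1 + e^{-t}): smooth, monotonic, range (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

vals = sigmoid(np.array([-2.0, 0.0, 2.0]))
print(vals)  # S(0) = 0.5; values stay strictly inside (0, 1)
```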

Log-Odds and Link Function

Odds: p/(1-p). Log-odds (logit): log(p/(1-p)) = linear predictor β₀ + Σβᵢxᵢ. Link function connects linear model to probability space.
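The logit is the inverse of the sigmoid, which a short stdlib check confirms (a sketch; names are illustrative):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    # Log-odds: log(p / (1 - p)), the inverse of the sigmoid
    return math.log(p / (1.0 - p))

p = 0.8
print(logit(p))           # positive log-odds: class 1 more likely than not
print(sigmoid(logit(p)))  # recovers p
```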

Model Training and Estimation

Maximum Likelihood Estimation

Parameters β estimated by maximizing likelihood of observed labels given inputs. Likelihood function: product of predicted probabilities for actual classes.

Cost Function

Negative log-likelihood minimized; equivalently, the log-likelihood below is maximized. Equivalent to cross-entropy loss in classification context.

L(β) = Σ [yᵢ log(p(xᵢ)) + (1 - yᵢ) log(1 - p(xᵢ))]
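The loss can be sketched in numpy as follows; the eps clipping is an added numerical guard, not part of the formula:

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    # Mean negative log-likelihood over samples; clip to avoid log(0)
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1.0, 0.0, 1.0, 1.0])
p = np.array([0.9, 0.1, 0.8, 0.7])
print(cross_entropy(y, p))  # lower is better; 0 only for perfect predictions
```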

Optimization Algorithms

Common methods: Gradient Descent, Newton-Raphson, Iteratively Reweighted Least Squares (IRLS). Converge to parameter estimates maximizing likelihood.
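A minimal batch gradient-descent sketch on toy data (labels in {0, 1}, intercept handled by a column of ones; hyperparameters are illustrative):

```python
import numpy as np

def fit_logistic_gd(X, y, lr=0.1, n_iter=5000):
    # Batch gradient descent on the mean negative log-likelihood
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        grad = Xb.T @ (p - y) / len(y)  # gradient of mean cross-entropy
        beta -= lr * grad
    return beta

# Toy 1-D data: class 1 exactly when x > 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(float)
beta = fit_logistic_gd(X, y)
print(beta)  # slope should be clearly positive
```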

Feature Engineering and Selection

Feature Scaling

Standardization or normalization improves convergence speed. Gradient-based optimization and regularization penalties are sensitive to input scale.
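A standardization sketch in numpy (z-scoring each feature; assumes nonzero variance):

```python
import numpy as np

def standardize(X):
    # Zero-mean, unit-variance scaling per feature (z-score)
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
Xs = standardize(X)
print(Xs.mean(axis=0))  # ~[0, 0]
print(Xs.std(axis=0))   # [1, 1]
```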

Interaction Terms

Polynomial or interaction features model nonlinear relationships. Enhance model expressiveness.
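A sketch that appends squares and pairwise products as extra features (the column order is an assumption of this sketch):

```python
import numpy as np

def add_interactions(X):
    # Append x_i * x_j for i <= j: squares and pairwise interaction terms
    n, d = X.shape
    cols = [X]
    for i in range(d):
        for j in range(i, d):
            cols.append((X[:, i] * X[:, j]).reshape(-1, 1))
    return np.hstack(cols)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(add_interactions(X))  # columns: x1, x2, x1^2, x1*x2, x2^2
```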

Feature Selection Techniques

Filter methods: correlation, mutual information. Wrapper methods: recursive feature elimination. Embedded: L1 regularization (Lasso).

Decision Boundary and Interpretation

Thresholding

Default threshold: 0.5. If p(x) ≥ 0.5 → class 1; else class 0. Threshold adjustable for precision-recall trade-off.
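Thresholding as a one-liner; raising the threshold trades recall for precision (values illustrative):

```python
import numpy as np

def predict(p, threshold=0.5):
    # Class 1 when the predicted probability meets the threshold
    return (p >= threshold).astype(int)

p = np.array([0.2, 0.5, 0.7, 0.9])
print(predict(p))       # default 0.5 threshold -> [0 1 1 1]
print(predict(p, 0.8))  # stricter threshold -> [0 0 0 1]
```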

Geometric Interpretation

Decision boundary: hyperplane defined by β₀ + Σβᵢxᵢ = 0. Separates feature space into predicted class regions.

Coefficient Interpretation

β coefficients represent the change in log-odds per unit increase in a feature; exp(β) is the corresponding odds ratio. Positive β increases class 1 probability; negative β decreases it.
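Exponentiating a coefficient gives the odds ratio per unit increase; the β value below is a hypothetical example:

```python
import math

beta = 0.693  # hypothetical coefficient, roughly log(2)
odds_ratio = math.exp(beta)
print(odds_ratio)  # ~2.0: each unit increase in the feature doubles the odds
```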

Multiclass Logistic Regression

One-vs-Rest (OvR)

Train separate binary classifiers for each class vs. rest. Predict class with highest probability.

Softmax Regression

Generalizes logistic regression to multiple classes. Uses softmax function to model class probabilities.

p(y = k | x) = exp(β_kᵀ x) / Σ_j exp(β_jᵀ x)
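A softmax sketch; subtracting the max logit before exponentiating is a standard numerical guard that leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    # Shift by max(z) for numerical stability; the output is unchanged
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # logits beta_k^T x for each class
probs = softmax(scores)
print(probs)        # one probability per class
print(probs.sum())  # 1.0
```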

Applications

Multiclass classification in text, image, and speech recognition.

Regularization Techniques

L1 Regularization (Lasso)

Encourages sparsity in coefficients. Useful for feature selection.

L2 Regularization (Ridge)

Penalizes large coefficients. Prevents overfitting, improves generalization.

Elastic Net

Combination of L1 and L2 penalties. Balances sparsity and coefficient shrinkage.
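The elastic-net objective can be sketched as cross-entropy plus a blended penalty; alpha and l1_ratio follow the common convention (l1_ratio = 1 gives Lasso, 0 gives Ridge):

```python
import numpy as np

def penalized_loss(beta, X, y, alpha=1.0, l1_ratio=0.5):
    # Cross-entropy plus elastic-net penalty:
    #   alpha * (l1_ratio * ||beta||_1 + (1 - l1_ratio) * 0.5 * ||beta||_2^2)
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    penalty = alpha * (l1_ratio * np.abs(beta).sum()
                       + (1 - l1_ratio) * 0.5 * (beta ** 2).sum())
    return nll + penalty

X = np.array([[1.0, 2.0], [2.0, 1.0]])
y = np.array([1.0, 0.0])
b = np.array([0.5, -0.5])
print(penalized_loss(b, X, y, alpha=1.0))  # larger than the unpenalized loss
```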

Evaluation Metrics and Performance

Accuracy

Proportion of correct predictions. Suitable for balanced datasets; misleading under class imbalance.

Precision, Recall, F1-Score

Precision: true positives / predicted positives. Recall: true positives / actual positives. F1: harmonic mean of precision and recall.
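The three metrics from confusion-matrix counts, as a stdlib sketch:

```python
def precision_recall_f1(y_true, y_pred):
    # Count true positives, false positives, false negatives
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = [1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(precision_recall_f1(y_true, y_pred))  # all three equal 2/3 here
```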

ROC Curve and AUC

Receiver Operating Characteristic curve plots true positive rate vs. false positive rate across thresholds. Area Under the Curve (AUC) summarizes ranking quality: 0.5 is random, 1.0 is perfect.

Metric     Definition                                        Interpretation
Accuracy   (TP + TN) / Total                                 Overall correctness
Precision  TP / (TP + FP)                                    Positive predictive value
Recall     TP / (TP + FN)                                    Sensitivity
F1-Score   2 * (Precision * Recall) / (Precision + Recall)   Balance of precision and recall
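AUC equals the probability that a random positive scores above a random negative, which gives a direct (if O(n²)) sketch:

```python
import itertools

def auc(y_true, scores):
    # Rank interpretation of AUC; ties count as half a win
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    pairs = list(itertools.product(pos, neg))
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```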

Advantages and Limitations

Advantages

Simple, interpretable coefficients. Probabilistic output. Efficient training. Works well for linearly separable data.

Limitations

Assumes linear decision boundary. Sensitive to outliers and multicollinearity. Poor performance on complex nonlinear data without feature engineering.

Mitigation Strategies

Use kernel methods, feature transformations, or switch to nonlinear models like SVM or neural networks for complex problems.

Applications in Machine Learning

Medical Diagnosis

Predict disease presence/absence from clinical variables. Widely used due to interpretability.

Credit Scoring

Assess loan default risk based on financial history and demographics.

Marketing

Customer churn prediction, campaign response modeling.

Implementation Considerations

Data Preprocessing

Handle missing values, encode categorical variables, scale numerical features.

Model Initialization

Initialize coefficients to zero or small random values. The loss is convex, so the optimum is unique; initialization mainly affects convergence speed.

Hyperparameter Tuning

Regularization strength, learning rate, max iterations. Use cross-validation to select optimal parameters.
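A tuning sketch using scikit-learn's GridSearchCV, assuming scikit-learn is installed (C is the inverse regularization strength; the grid values and synthetic data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # inverse regularization strength
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)  # C selected by 5-fold cross-validation
print(grid.best_score_)   # mean CV accuracy of the best setting
```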

Software Tools and Libraries

Python Libraries

scikit-learn: LogisticRegression class with regularization and multiclass support. Statsmodels: detailed statistical summaries.
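A minimal scikit-learn usage sketch, assuming the library is installed (synthetic data; parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=42)
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))           # training accuracy
print(clf.predict_proba(X[:2]))  # per-class probabilities for two samples
```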

R Packages

glm() function in base package. caret for model tuning.

Other Tools

MATLAB, SAS, SPSS support logistic regression with graphical interfaces and advanced options.

Tool           Language/Platform   Features
scikit-learn   Python              Regularization, multiclass, pipelines
Statsmodels    Python              Statistical tests, detailed summaries
glm()          R                   Generalized linear models, formula interface

References

  • Hosmer, D.W., Lemeshow, S., Sturdivant, R.X. Applied Logistic Regression. Wiley, 2013, pp. 1-416.
  • Cox, D.R. The Regression Analysis of Binary Sequences. Journal of the Royal Statistical Society. Series B (Methodological), Vol. 20, 1958, pp. 215-242.
  • Ng, A.Y. Feature selection, L1 vs. L2 regularization, and rotational invariance. ICML, 2004, pp. 78-85.
  • Friedman, J., Hastie, T., Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, Vol. 33, 2010, pp. 1-22.
  • James, G., Witten, D., Hastie, T., Tibshirani, R. An Introduction to Statistical Learning. Springer, 2013, pp. 1-426.