Machine Learning [Complete Guide: Regression, Classification, Clustering and More]

Bias-variance tradeoff

The bias-variance tradeoff is a fundamental tension in machine learning: flexible models tend to have low bias but high variance, while simple models tend to have high bias but low variance.
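The tradeoff is usually stated through the decomposition of expected squared prediction error. The form below is a sketch under the standard assumptions (not stated in this page) that the data follow y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and that the expectation is taken over training sets and noise:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making a model more flexible typically lowers the first term and raises the second; the third term cannot be reduced by any model.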

Simple linear regression

Simple linear regression fits a line to two continuous variables using least squares, quantifies the relationship with R², and tests whether the slope is significantly different from zero.
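As a minimal sketch of the least-squares fit and R² described above, the following uses only the Python standard library. The function name `fit_line` and the toy data are made up for illustration, and the significance test for the slope is not included.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - residual sum of squares / total sum of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Toy data, invented for this example
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
intercept, slope, r2 = fit_line(x, y)
print(intercept, slope, r2)
```

For these five points the fitted line has slope 0.6 and intercept 2.2, and it explains 60% of the variance in y.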

Multiple linear regression

Multiple linear regression models a response variable as a linear combination of several predictors, estimated by OLS. Learn about adjusted R², multicollinearity, VIF and the F-test.

Linear regression diagnostics

Regression diagnostics check whether the LINE assumptions (Linearity, Independence, Normality and Equal variance of the errors) hold, identify outliers and influential observations, and detect multicollinearity before trusting model results.

Nonlinear regression

Nonlinear regression fits models where parameters appear nonlinearly, requiring iterative algorithms. Learn the key models, fitting methods, and how to choose initial values.

Logistic regression

Logistic regression models the probability of a binary outcome using the sigmoid function, estimated by maximum likelihood. Learn odds ratios, model evaluation and multiclass extensions.
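To make the link between the sigmoid and odds ratios concrete, here is a small sketch. The coefficients `b0` and `b1` are hypothetical values chosen for illustration, not estimates from any data; in practice they would come from maximum likelihood fitting.

```python
import math


def sigmoid(z):
    """Map a real-valued linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(intercept, coef, x):
    """P(y = 1 | x) for a one-predictor logistic model."""
    return sigmoid(intercept + coef * x)


# Hypothetical coefficients, for illustration only
b0, b1 = -1.0, 0.5

p_at_2 = predict_proba(b0, b1, 2.0)  # linear predictor is 0 here
p_at_3 = predict_proba(b0, b1, 3.0)

# exp(coefficient) is the multiplicative change in the odds
# for a one-unit increase in x
odds_ratio = math.exp(b1)
print(p_at_2, p_at_3, odds_ratio)
```

The odds p / (1 - p) at x = 3 divided by the odds at x = 2 equal exp(b1), whatever the starting value of x.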

Splines

Splines fit smooth nonlinear curves by joining piecewise polynomials at knots. Learn regression splines, natural cubic splines, smoothing splines and how to select the smoothing parameter.

Generalized additive model (GAM)

GAMs replace linear predictors with smooth functions, combining the flexibility of nonparametric regression with the interpretability of additive models.

Analysis of covariance (ANCOVA)

ANCOVA adjusts group mean comparisons for a continuous covariate, which can increase statistical power and reduce confounding by that covariate. Learn the key assumption of homogeneous slopes and how to compute adjusted means.

Regularization

Regularization adds a penalty to the loss function to prevent overfitting. Ridge shrinks all coefficients toward zero; Lasso can shrink them to exactly zero, performing variable selection.

Ridge regression

Ridge regression adds an L2 penalty to OLS, shrinking all coefficients toward zero without eliminating any. It stabilizes estimates under multicollinearity and has a closed-form solution.
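The closed-form solution is easiest to see in a simplified case. The sketch below assumes a single centered predictor and no intercept, so the general matrix solution (XᵀX + λI)⁻¹Xᵀy reduces to a scalar ratio. The function name and data are illustrative.

```python
def ridge_coef_1d(x, y, lam):
    """Ridge coefficient for one centered predictor without an intercept.

    Minimizes sum((y - b*x)^2) + lam * b^2, whose solution is
    b = sum(x*y) / (sum(x^2) + lam).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# Centered toy data with an exact slope of 2
x = [-2, -1, 0, 1, 2]
y = [-4, -2, 0, 2, 4]

ols_coef = ridge_coef_1d(x, y, lam=0.0)     # no penalty: ordinary least squares
ridge_coef = ridge_coef_1d(x, y, lam=10.0)  # penalized: shrunk toward zero
print(ols_coef, ridge_coef)
```

With λ = 0 the estimate is the OLS slope of 2.0; with λ = 10 it shrinks to 1.0, smaller but not zero.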

Lasso regression

Lasso uses an L1 penalty that can shrink some coefficients to exactly zero, performing automatic variable selection. Learn the soft-thresholding operator, coordinate descent, and when to use ElasticNet instead.
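The soft-thresholding operator mentioned above is short enough to write out. This is a sketch of the operator alone, not a full coordinate descent solver; the argument names are illustrative.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0).

    Values within lam of zero are set to exactly zero; larger values
    are moved toward zero by lam.
    """
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


print(soft_threshold(3.0, 1.0))   # shrunk toward zero
print(soft_threshold(-3.0, 1.0))  # shrunk toward zero from below
print(soft_threshold(0.5, 1.0))   # inside the threshold: exactly zero
```

In coordinate descent, each coefficient update applies this operator to a partial-residual correlation, which is how small coefficients end up exactly at zero.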

ElasticNet regression

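ElasticNet combines L1 and L2 penalties, inheriting Lasso's variable selection and Ridge's stability with correlated predictors. A mixing parameter controls the blend; it is called alpha in R's glmnet and l1_ratio in scikit-learn, where alpha instead sets the overall penalty strength.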

K-nearest neighbors (KNN)

KNN classifies new observations by majority vote among the k nearest training points. Learn how k controls complexity, which distance metric to use, and why KNN struggles in high dimensions.
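A majority-vote classifier can be sketched in a few lines. This version assumes Euclidean distance (one of several possible metrics) and uses made-up points; `knn_predict` is an illustrative name.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Two well-separated toy groups
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

near_origin = knn_predict(points, labels, (0.5, 0.5), k=3)
near_far_group = knn_predict(points, labels, (5.0, 5.5), k=3)
print(near_origin, near_far_group)
```

Small k gives a flexible, high-variance boundary; large k smooths it. An odd k avoids ties in two-class problems.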

Naive Bayes

Naive Bayes classifies by applying Bayes' theorem with a conditional independence assumption. Fast, interpretable, and surprisingly effective for text classification and spam filtering.

Discriminant analysis

LDA and QDA classify observations by modeling class-conditional Gaussian distributions. LDA assumes equal covariances and gives linear boundaries; QDA allows different covariances and gives quadratic ones.

Support vector machines (SVM)

SVM finds the hyperplane that maximizes the margin between classes. The kernel trick maps data to higher dimensions implicitly, enabling nonlinear boundaries without explicit feature computation.

Neural networks

Neural networks learn hierarchical representations through layers of weighted connections. Backpropagation computes gradients via the chain rule; activation functions introduce the nonlinearity that makes deep networks powerful.

Decision trees

Decision trees partition the feature space into rectangular regions by recursively splitting on the most informative feature. They are interpretable, require no feature scaling, and are the building block of random forests and gradient boosting.
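One step of the recursive splitting can be sketched as a search for the threshold that minimizes impurity. The example assumes Gini impurity as the criterion (entropy is a common alternative) and a single numeric feature; the function names are illustrative.

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Find the threshold on one feature that minimizes weighted Gini impurity.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (threshold, weighted_impurity).
    """
    n = len(values)
    best = (None, float("inf"))
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best


values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(values, labels)
print(threshold, impurity)
```

A full tree repeats this search over every feature, then recurses into the left and right subsets until a stopping rule is met.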

Random forest

Random forest builds many deep trees on bootstrap samples, each considering a random subset of features at each split. Averaging the decorrelated trees reduces variance substantially, usually with little increase in bias.

Gradient boosting and XGBoost

Gradient boosting fits trees sequentially to the negative gradient of the loss (the plain residuals under squared error), combining many weak learners into a strong predictor. XGBoost adds second-order optimization, regularization and speed.

K-means clustering

K-means partitions data into K clusters by iteratively assigning points to the nearest centroid and updating centroids. Learn K-means++, elbow method, silhouette analysis and when K-means fails.
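The assign-then-update loop can be sketched as follows. This is plain Lloyd's iteration with starting centroids supplied by the caller; K-means++ initialization and a convergence check are left out, and the data are invented for illustration.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update.

    Returns (centroids, clusters), where clusters[i] holds the points
    assigned to centroid i in the final assignment step.
    """
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(0, 0), (0, 2), (2, 0), (2, 2), (10, 10), (10, 12), (12, 10), (12, 12)]
final_centroids, final_clusters = kmeans(points, centroids=[(0, 0), (12, 12)])
print(final_centroids)
```

On these two well-separated squares the centroids settle at the centers of the squares after the first iteration. Results on real data depend on the starting centroids, which is the problem K-means++ addresses.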

Hierarchical clustering

Hierarchical clustering builds a tree of nested clusters without requiring K in advance. The dendrogram shows the full merge history; cutting it at any height gives a flat clustering.

DBSCAN

DBSCAN finds clusters of arbitrary shape by grouping points in dense regions. It detects outliers automatically and requires no K upfront. Epsilon and minPts control the density threshold.

Principal component analysis (PCA)

PCA projects data onto the directions of maximum variance, reducing dimensionality while retaining as much information as possible. Learn eigendecomposition, SVD, biplots and scree plots.
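For two variables the eigendecomposition of the covariance matrix has a closed form, which makes a compact illustration. The sketch below is restricted to that two-dimensional case and assumes the two variables have nonzero covariance; real implementations use a general eigensolver or SVD.

```python
import math


def pca_2d(xs, ys):
    """PCA for two variables via the closed-form eigenvalues of a 2x2 matrix.

    Returns ((lam1, lam2), first_component), where lam1 >= lam2 are the
    variances along the principal axes. Assumes nonzero covariance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample covariance matrix [[a, b], [b, c]]
    a = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    c = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix
    mid = (a + c) / 2
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + half_gap, mid - half_gap
    # Eigenvector for lam1 is proportional to (b, lam1 - a)
    vx, vy = b, lam1 - a
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)


# Toy data lying exactly on the line y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
(lam1, lam2), pc1 = pca_2d(xs, ys)
explained_ratio = lam1 / (lam1 + lam2)
print(lam1, lam2, pc1, explained_ratio)
```

Because the points fall on one line, the first component points along that line and carries all of the variance. The ratio of each eigenvalue to their sum is what a scree plot displays.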

Correspondence analysis

Correspondence analysis maps rows and columns of a contingency table into a shared low-dimensional space using chi-square distances. It is the analogue of PCA for categorical data.

t-SNE and UMAP

t-SNE and UMAP reduce high-dimensional data to 2D for visualization by preserving local neighborhood structure. They reveal clusters invisible to PCA but require careful interpretation.

Cross-validation

Cross-validation estimates how well a model generalizes to new data. Learn k-fold, LOOCV, stratified CV, nested CV for unbiased hyperparameter selection, and time series CV.
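The basic k-fold scheme amounts to partitioning the sample indices. The sketch below uses contiguous folds without shuffling or stratification, which is a simplification; `k_fold_indices` is an illustrative name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for each of k contiguous folds.

    Fold sizes differ by at most one. No shuffling is done, so data
    should be in random order beforehand (except for time series).
    """
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(test_idx)
```

Every observation lands in exactly one test fold, so the model is always scored on data it was not fitted to. LOOCV is the special case k = n_samples.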

ROC curve and AUC

The ROC curve plots sensitivity against 1 - specificity across all classification thresholds. AUC measures overall discriminative ability: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.
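The probabilistic reading of AUC can be computed directly by comparing every positive-negative pair. This sketch counts ties as one half, a common convention; it is quadratic in the sample size and meant only as an illustration, with invented scores.

```python
def auc_pairwise(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Labels are 1 for positive and 0 for negative; tied scores count 0.5.
    """
    positives = [s for s, lab in zip(scores, labels) if lab == 1]
    negatives = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_pairwise(scores, labels)
print(auc)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, giving an AUC of 8/9. The one error is the positive scored 0.4 falling below the negative scored 0.7.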

Shapley values and SHAP

SHAP uses Shapley values from game theory to explain individual predictions of any model. Each feature gets a contribution that fairly accounts for all possible feature combinations.
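The game-theoretic definition behind SHAP can be shown with an exact computation on a tiny example. The sketch below averages each player's marginal contribution over all orderings; it is the textbook Shapley value, not the SHAP library's algorithm, and the value function `toy_value` is invented for illustration.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings.

    `value` maps a frozenset of players to a number. The cost grows
    factorially, so this is only practical for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical payoff: a adds 10, b adds 20, together they add 6 more."""
    total = 0.0
    if "a" in coalition:
        total += 10
    if "b" in coalition:
        total += 20
    if "a" in coalition and "b" in coalition:
        total += 6
    return total


phi = shapley_values(["a", "b", "c"], toy_value)
print(phi)
```

The interaction bonus of 6 is split equally between a and b, the irrelevant player c receives zero, and the values sum to the payoff of the full coalition. In SHAP the players are features and the payoff is the model's prediction.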
