MACHINE LEARNING
Machine learning is a field of artificial intelligence focused on developing algorithms that let systems learn from data and improve their performance without being explicitly programmed for each task.
Introduction
Regression
Simple linear regression
Simple linear regression fits a line to two continuous variables using least squares, quantifies the relationship with R², and tests whether the slope is significantly different from zero.
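As a rough illustration of the least-squares fit and R² mentioned above, the following sketch uses only the Python standard library. The function name fit_line and the data are made up for this example, and the significance test for the slope is left out.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - residual sum of squares / total sum of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Made-up data lying exactly on y = 1 + 2x
intercept, slope, r2 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

On data that sit exactly on a line, the fit recovers that line and R² is 1. With noisy data R² falls below 1.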
Multiple linear regression
Multiple linear regression models a response variable as a linear combination of several predictors, estimated by OLS. Learn about adjusted R², multicollinearity, VIF and the F-test.
Linear regression diagnostics
Regression diagnostics check whether the LINE assumptions (linearity, independence, normality and equal variance of the errors) hold, identify outliers and influential observations, and detect multicollinearity before trusting model results.
Nonlinear regression
Nonlinear regression fits models where parameters appear nonlinearly, requiring iterative algorithms. Learn the key models, fitting methods, and how to choose initial values.
Logistic regression
Logistic regression models the probability of a binary outcome using the sigmoid function, estimated by maximum likelihood. Learn odds ratios, model evaluation and multiclass extensions.
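The link between the sigmoid function and odds ratios can be sketched as follows. The intercept and coefficient are hypothetical values chosen for illustration, not estimates from any data set, and maximum likelihood fitting is not shown.

```python
import math


def sigmoid(z):
    """Map a real-valued linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, intercept, coef):
    """P(y = 1 | x) for a one-predictor logistic model."""
    return sigmoid(intercept + coef * x)


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


# Hypothetical coefficients, for illustration only
intercept, coef = -1.0, 0.8
p_at_2 = predict_probability(2.0, intercept, coef)
p_at_3 = predict_probability(3.0, intercept, coef)

# A one-unit increase in x multiplies the odds by exp(coef)
odds_ratio = odds(p_at_3) / odds(p_at_2)
```

The ratio of odds between x = 3 and x = 2 matches exp(coef) up to rounding, which is why exponentiated coefficients are read as odds ratios.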
Splines
Splines fit smooth nonlinear curves by joining piecewise polynomials at knots. Learn regression splines, natural cubic splines, smoothing splines and how to select the smoothing parameter.
Generalized additive model (GAM)
GAMs replace linear predictors with smooth functions, combining the flexibility of nonparametric regression with the interpretability of additive models.
Analysis of covariance (ANCOVA)
ANCOVA adjusts group mean comparisons for a continuous covariate, increasing statistical power and reducing confounding by that covariate. Learn the key assumption of homogeneous slopes and how to compute adjusted means.
Regularization
Regularization
Regularization adds a penalty to the loss function to prevent overfitting. Ridge shrinks all coefficients toward zero; Lasso can shrink them to exactly zero, performing variable selection.
Ridge regression
Ridge regression adds an L2 penalty to OLS, shrinking all coefficients toward zero without eliminating any. It stabilizes estimates under multicollinearity and has a closed-form solution.
Lasso regression
Lasso uses an L1 penalty that can shrink some coefficients to exactly zero, performing automatic variable selection. Learn the soft-thresholding operator, coordinate descent, and when to use ElasticNet instead.
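The soft-thresholding operator named above is short enough to write out. This is a minimal sketch. The name soft_threshold and the argument names z and lam are choices made for this example.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Large values are shrunk by lam, small values are set to zero
shrunk = [soft_threshold(z, 1.0) for z in (3.0, -3.0, 0.5)]
```

Coordinate descent for Lasso applies this operator to one coefficient at a time, which is how small coefficients end up exactly at zero rather than merely close to it.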
ElasticNet regression
ElasticNet combines L1 and L2 penalties, inheriting Lasso's variable selection and Ridge's stability with correlated predictors. The mixing parameter alpha, ranging from 0 (pure Ridge) to 1 (pure Lasso), controls the blend.
Classification
K-nearest neighbors (KNN)
KNN classifies new observations by majority vote among the k nearest training points. Learn how k controls complexity, which distance metric to use, and why KNN struggles in high dimensions.
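A majority vote among the k nearest points can be sketched in a few lines. The function name knn_predict, the Euclidean metric and the toy data are assumptions made for this example. Other distance metrics would slot in where math.dist is called.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
near_origin = knn_predict(points, labels, (0.5, 0.5), k=3)
near_far_group = knn_predict(points, labels, (5.5, 5.5), k=3)
```

A small k follows the training data closely, while a large k smooths the decision boundary.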
Naive Bayes
Naive Bayes classifies by applying Bayes' theorem with a conditional independence assumption. Fast, interpretable, and surprisingly effective for text classification and spam filtering.
Discriminant analysis
LDA and QDA classify observations by modeling class-conditional Gaussian distributions. LDA assumes equal covariances and gives linear boundaries; QDA allows different covariances and gives quadratic ones.
Support vector machines (SVM)
SVM finds the hyperplane that maximizes the margin between classes. The kernel trick maps data to higher dimensions implicitly, enabling nonlinear boundaries without explicit feature computation.
Neural networks
Neural networks learn hierarchical representations through layers of weighted connections. Backpropagation computes gradients via the chain rule; activation functions introduce the nonlinearity that makes deep networks powerful.
Tree-based methods
Decision trees
Decision trees partition the feature space into rectangular regions by recursively splitting on the most informative feature. They are interpretable, require no feature scaling, and are the building block of random forests and gradient boosting.
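One way to make "most informative" concrete is to score candidate thresholds by weighted Gini impurity. The sketch below handles a single numeric feature. Gini is only one possible criterion (entropy is another), and the names gini and best_split are made up for this example.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    unique = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy feature whose two classes separate cleanly between 3 and 10
threshold, impurity = best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
```

A full tree repeats this search over every feature and then recurses into the two resulting regions.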
Random forest
Random forest builds many deep trees on bootstrap samples, each considering a random subset of features at each split. Averaging decorrelated trees reduces variance substantially with little increase in bias.
Gradient boosting and XGBoost
Gradient boosting fits trees sequentially to the residuals (more generally, the negative gradient of the loss) of the current model, combining weak learners into a strong predictor. XGBoost adds second-order optimization, regularization and engineering optimizations for speed.
Clustering
K-means clustering
K-means partitions data into K clusters by iteratively assigning points to the nearest centroid and updating centroids. Learn K-means++, elbow method, silhouette analysis and when K-means fails.
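The assign-then-update loop can be sketched as below. This is a simplified version that takes the initial centroids from the caller and runs a fixed number of iterations. K-means++ initialization and a convergence check are left out, and the function name kmeans is an assumption of this example.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Run n_iter rounds of K-means; returns (centroids, clusters)."""
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda j: math.dist(p, centroids[j]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Toy data: two tight groups, with one starting centroid placed in each
data = [(0, 0), (0, 2), (10, 10), (10, 12)]
final_centroids, final_clusters = kmeans(data, [(0, 0), (10, 10)])
```

Results depend on the starting centroids, which is the motivation for K-means++ and for running several restarts.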
Hierarchical clustering
Hierarchical clustering builds a tree of nested clusters without requiring K in advance. The dendrogram shows the full merge history; cutting it at any height gives a flat clustering.
DBSCAN
DBSCAN finds clusters of arbitrary shape by grouping points in dense regions. It detects outliers automatically and requires no K upfront. Epsilon and minPts control the density threshold.
Dimensionality reduction
Principal component analysis (PCA)
PCA projects data onto the directions of maximum variance, reducing dimensionality while retaining as much information as possible. Learn eigendecomposition, SVD, biplots and scree plots.
Correspondence analysis
Correspondence analysis maps rows and columns of a contingency table into a shared low-dimensional space using chi-square distances. It is the analogue of PCA for categorical data.
t-SNE and UMAP
t-SNE and UMAP reduce high-dimensional data to 2D for visualization by preserving local neighborhood structure. They reveal clusters invisible to PCA but require careful interpretation.
Model evaluation
Cross-validation
Cross-validation estimates how well a model generalizes to new data. Learn k-fold, LOOCV, stratified CV, nested CV for unbiased hyperparameter selection, and time series CV.
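The index bookkeeping behind plain k-fold can be sketched as follows. This version does not shuffle or stratify, and the generator name k_fold_indices is made up for this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Five observations split into two folds
folds = list(k_fold_indices(5, 2))
```

Each observation lands in exactly one test fold, so every data point is used once for evaluation and k - 1 times for training. LOOCV is the special case where k equals the number of samples.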
ROC curve and AUC
The ROC curve plots sensitivity against (1 - specificity) across all classification thresholds. AUC measures overall discriminative ability: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative.
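The pairwise reading of AUC can be computed directly by comparing every positive with every negative. This is a quadratic-time sketch meant for small examples. The function name auc, the 0/1 label coding and the scores are assumptions of this example, and ties are counted as half a win.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up scores: one positive (0.3) is ranked below one negative (0.8)
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Three of the four positive-negative pairs are ordered correctly here, giving an AUC of 0.75. A perfect ranking gives 1.0 and random scores give about 0.5.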
Shapley values and SHAP
SHAP uses Shapley values from game theory to explain individual predictions of any model. Each feature gets a contribution that fairly accounts for all possible feature combinations.
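For a handful of features, Shapley values can be computed exactly by averaging each feature's marginal contribution over all orderings. The sketch below is not the SHAP library's algorithm, which relies on faster approximations. The function shapley_values and the toy payoff function are made up for this example.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values; value maps a frozenset of players to a payoff."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            # Marginal contribution of p when joining the current coalition
            before = value(frozenset(coalition))
            coalition.add(p)
            totals[p] += value(frozenset(coalition)) - before
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical payoff: a adds 10, b adds 20, plus 6 when both are present."""
    payoff = 0.0
    if "a" in coalition:
        payoff += 10.0
    if "b" in coalition:
        payoff += 20.0
    if {"a", "b"} <= coalition:
        payoff += 6.0
    return payoff


contributions = shapley_values(["a", "b"], toy_value)
```

The interaction bonus of 6 is split evenly, so a receives 13 and b receives 23, and the contributions sum to the payoff of the full coalition. The cost grows factorially with the number of players, which is why practical tools approximate.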