Cross-validation

Cross-validation estimates how well a model will perform on new, unseen data by repeatedly partitioning the available data into training and validation sets. It is the standard tool for model evaluation, hyperparameter selection, and comparing competing models. Using it incorrectly produces optimistic estimates that fail to generalize.

Why a single train/test split is not enough

Splitting the data once into training and test sets is simple but noisy: the estimated performance depends heavily on which observations end up in each set. With \(n=200\) and an 80/20 split, the test set has only 40 observations: the estimated error has high variance across different random splits.

Cross-validation addresses this by using multiple splits and averaging the results. The estimate is more stable, and we get a measure of its uncertainty (standard deviation across folds).

K-fold cross-validation

Partition the data randomly into \(k\) folds of approximately equal size. For each fold \(i = 1, \ldots, k\):

  1. Train the model on the \(k-1\) other folds.
  2. Evaluate on fold \(i\), recording the metric (e.g., MSE, accuracy).

The CV estimate is the mean across all \(k\) folds:

\[\widehat{\text{CV}}_k = \frac{1}{k}\sum_{i=1}^k \text{error}_i\]

\[\widehat{\text{SE}} = \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^k(\text{error}_i - \widehat{\text{CV}}_k)^2}\]

The standard error \(\widehat{\text{SE}}\) quantifies the uncertainty of the CV estimate. The one-standard-error rule: when comparing models, choose the simplest model whose CV error is within one SE of the minimum. This prefers parsimony without sacrificing much accuracy.

Diagram showing 5-fold cross-validation with each fold serving as the validation set once

Choosing k: the bias-variance tradeoff

The choice of \(k\) controls a bias-variance tradeoff in the CV estimate itself:

  • Large \(k\) (e.g., LOOCV): training sets are close to size \(n\) (low bias). But the \(k\) validation sets are almost identical (the models trained on them are highly correlated), so the variance of the CV estimate is high.
  • Small \(k\) (e.g., \(k=2\)): training sets have only \(n/2\) observations (high bias: the model is weaker than it would be on the full data). Low variance.
  • \(k=5\) or \(k=10\): empirically shown to give good bias-variance balance. The standard choice.

LOOCV (\(k=n\)) has one special property: for linear models, there is a shortcut formula that computes LOOCV without refitting \(n\) times:

\[\text{LOOCV} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2\]

where \(h_{ii}\) is the leverage of observation \(i\). This makes LOOCV practical for linear models but it still has high variance as an estimator.

Box plots comparing CV error estimates for different values of k showing increasing variance with larger k

Variants of cross-validation

Stratified k-fold

For classification with imbalanced classes, ensure each fold has the same class proportions as the full dataset. Without stratification, a fold may contain very few minority class examples, making the error estimate noisy and unrepresentative. Always use stratified CV for classification.

Repeated k-fold

Repeat \(k\)-fold CV \(r\) times with different random partitions. Average over all \(r \times k\) folds. Gives more stable estimates at the cost of \(r\) times more computation. Useful when \(n\) is small.

Time series CV (walk-forward validation)

Standard k-fold randomly shuffles the data, creating training folds that contain future observations. For time series this is data leakage: the model “sees the future” during training.

Time series CV uses expanding or sliding windows: train on observations \(1, \ldots, t\) and evaluate on \(t+1, \ldots, t+h\) for increasing values of \(t\).

Diagram showing time series cross-validation with expanding training windows and fixed forecast horizons

Nested cross-validation: avoiding data leakage

A common mistake: use CV to select hyperparameters, then report that same CV error as the model’s performance estimate. This is optimistic because the test folds influenced the hyperparameter selection.

Nested CV uses two loops:

  • Outer loop: \(k_\text{outer}\) folds for unbiased performance estimation.
  • Inner loop: for each outer training set, run \(k_\text{inner}\)-fold CV to select hyperparameters.

Diagram showing nested cross-validation with outer folds for evaluation and inner folds for hyperparameter selection

The outer test folds (red) are never seen during hyperparameter selection. The inner validation folds (yellow) are used to select the best hyperparameters for each outer fold’s training set. The outer CV error is an unbiased estimate of the final model’s performance.

⚠️ Using CV error for model selection and reporting is double-dipping

If you use CV to select among 50 models and report the CV error of the winner, the reported error is optimistic: you searched over 50 models and picked the luckiest one. The more models you compare, the larger the optimism.

Correct procedure: use nested CV, or reserve a final holdout test set that is never used for any model selection decision (no hyperparameter tuning, no feature selection, no model comparison). The holdout test set is used exactly once, at the very end, to report the final performance.

💡 Cross-validation in R

library(caret)

# 10-fold CV
ctrl <- trainControl(method="cv", number=10)
fit  <- train(y ~ ., data=df, method="glm", trControl=ctrl)
fit$results   # CV metrics per fold
fit$resample  # individual fold results

# Stratified k-fold for classification
ctrl_strat <- trainControl(method="cv", number=10,
                            classProbs=TRUE, summaryFunction=twoClassSummary)

# Repeated 10-fold
ctrl_rep <- trainControl(method="repeatedcv", number=10, repeats=5)

# Nested CV
library(mlr3)
library(mlr3tuning)
task    <- as_task_classif(df, target="y")
learner <- lrn("classif.ranger", predict_type="prob")
resampling_outer <- rsmp("cv", folds=5)
resampling_inner <- rsmp("cv", folds=3)
measure  <- msr("classif.auc")
tuner    <- tnr("grid_search")
at <- AutoTuner$new(learner, resampling_inner, measure, tuner=tuner,
                    terminator=trm("evals", n_evals=20))
rr <- resample(task, at, resampling_outer)
rr$aggregate(measure)   # unbiased performance estimate