ROC curve and AUC

A classifier produces a score for each observation. Converting scores to class labels requires choosing a threshold. The ROC curve shows how sensitivity and specificity trade off as the threshold varies across its entire range. AUC summarizes this curve in a single number: the probability that the model ranks a random positive case higher than a random negative case.

From scores to predictions: the confusion matrix

A binary classifier assigns each observation a score \(\hat{p} \in [0,1]\) and predicts positive if \(\hat{p} \geq t\) for some threshold \(t\). The resulting predictions are summarized in the confusion matrix:

                     Predicted positive   Predicted negative
Actually positive    TP                   FN
Actually negative    FP                   TN

Key metrics derived from the confusion matrix:

\[\text{Sensitivity (Recall, TPR)} = \frac{TP}{TP+FN} \quad \text{(of all positives, how many did we catch?)}\]

\[\text{Specificity} = \frac{TN}{TN+FP}, \quad \text{FPR} = 1 - \text{Specificity} = \frac{FP}{FP+TN}\]

\[\text{Precision (PPV)} = \frac{TP}{TP+FP} \quad \text{(of all predicted positives, how many are truly positive?)}\]

\[\text{Accuracy} = \frac{TP+TN}{n}, \quad n = TP+FP+TN+FN\]

The threshold \(t\) controls the tradeoff: lowering \(t\) catches more positives (higher TPR) but also misclassifies more negatives as positive (higher FPR).
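As a rough illustration of the definitions above, the following base-R sketch computes the confusion-matrix metrics at two thresholds. The data are a made-up toy example, and `confusion_metrics()` is an illustrative helper written for this page, not a function from any package.

```r
# Hypothetical toy data: true labels (1 = positive) and model scores.
y_true <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
scores <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05)

# Illustrative helper (not a library function): metrics at threshold t.
confusion_metrics <- function(y_true, scores, t) {
  pred <- as.integer(scores >= t)   # predict positive if score >= t
  tp <- sum(pred == 1 & y_true == 1)
  fn <- sum(pred == 0 & y_true == 1)
  fp <- sum(pred == 1 & y_true == 0)
  tn <- sum(pred == 0 & y_true == 0)
  c(threshold   = t,
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    fpr         = fp / (fp + tn),
    precision   = if (tp + fp > 0) tp / (tp + fp) else NA_real_,
    accuracy    = (tp + tn) / length(y_true))
}

m_high <- confusion_metrics(y_true, scores, t = 0.5)
m_low  <- confusion_metrics(y_true, scores, t = 0.25)
print(round(rbind(m_high, m_low), 3))
```

On this toy data, lowering the threshold from 0.5 to 0.25 raises sensitivity from 0.75 to 1 while the FPR rises from 1/6 to 1/2.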

The ROC curve

The Receiver Operating Characteristic (ROC) curve plots TPR (sensitivity) on the y-axis against FPR (1-specificity) on the x-axis, for every possible threshold \(t \in [0,1]\).

ROC curve for three classifiers with AUC values annotated and the diagonal random classifier line

The diagonal line represents a random classifier (AUC = 0.5): it achieves the same TPR as FPR at every threshold, gaining nothing from the scores. A perfect classifier would reach the top-left corner (TPR=1, FPR=0) and have AUC=1. The further the curve bulges toward the top-left, the better the classifier.
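To make the construction concrete, here is a minimal base-R sketch that sweeps the threshold over every distinct score and records the resulting (FPR, TPR) pairs. The toy data and the helper name `roc_points()` are assumptions made for illustration; in practice a package such as pROC does this for you.

```r
# Hypothetical toy data: true labels (1 = positive) and model scores.
y_true <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
scores <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05)

# Illustrative helper (not a library function): one ROC point per threshold.
roc_points <- function(y_true, scores) {
  # Inf means "predict nothing positive", which gives the point (0, 0).
  thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
  n_pos <- sum(y_true == 1)
  n_neg <- sum(y_true == 0)
  tpr <- sapply(thresholds, function(t) sum(scores >= t & y_true == 1) / n_pos)
  fpr <- sapply(thresholds, function(t) sum(scores >= t & y_true == 0) / n_neg)
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

roc_df <- roc_points(y_true, scores)
print(roc_df)
```

The curve starts at (0, 0) and ends at (1, 1). It can be drawn with `plot(roc_df$fpr, roc_df$tpr, type = "l")` followed by `abline(0, 1, lty = 2)` for the random-classifier diagonal.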

AUC: area under the ROC curve

The AUC (Area Under the Curve) is the integral of the ROC curve:

\[\text{AUC} = \int_0^1 \text{TPR}\big(\text{FPR}^{-1}(u)\big)\, du \quad \text{(where } u \text{ is the false positive rate, not the threshold } t\text{)}\]

It has a clean probabilistic interpretation (Wilcoxon-Mann-Whitney statistic):

\[\text{AUC} = P(\hat{p}_+ > \hat{p}_-)\]

In words: AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (tied scores count as one half). AUC = 0.5: random ranking. AUC = 1: perfect ranking. AUC = 0: perfectly reversed ranking.

This interpretation makes AUC threshold-independent: it measures the quality of the score ranking, not the quality of any specific threshold decision.
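The equivalence between the area and the ranking probability can be checked numerically. The sketch below, on made-up toy data, computes AUC three ways in base R: by comparing all positive-negative pairs, by the rank-sum (Mann-Whitney) formula, and by trapezoidal integration of the empirical ROC curve. Variable names are illustrative.

```r
# Hypothetical toy data: true labels (1 = positive) and model scores.
y_true <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
scores <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05)

pos <- scores[y_true == 1]
neg <- scores[y_true == 0]
n_pos <- length(pos)
n_neg <- length(neg)

# 1) Pairwise definition: share of (positive, negative) pairs ranked
#    correctly, with ties counted as one half.
auc_pairs <- mean(outer(pos, neg, FUN = function(p, q) (p > q) + 0.5 * (p == q)))

# 2) Rank-sum (Mann-Whitney U) formula; rank() averages tied ranks.
r <- rank(scores)
auc_rank <- (sum(r[y_true == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# 3) Trapezoidal area under the empirical ROC curve.
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(pos >= t))
fpr <- sapply(thresholds, function(t) mean(neg >= t))
auc_trap <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

print(c(pairs = auc_pairs, rank = auc_rank, trapezoid = auc_trap))
```

All three agree on this data (20 of the 24 pairs are ranked correctly), and no threshold is chosen anywhere in the calculation.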

Choosing an operating threshold

AUC evaluates overall ranking quality, but real decisions require a threshold. Common ways to choose \(t\):

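  • Equal cost: maximize accuracy. A common prevalence-independent heuristic is the threshold closest to the top-left corner of the ROC curve (minimum distance to \((0,1)\)); the two criteria are not equivalent and can select different thresholds.
  • Cost-sensitive: if a false negative is \(c\) times more costly than a false positive and the prevalence is \(\pi\), expected cost is minimized at the point where the slope of the ROC curve equals \(\frac{1-\pi}{c\,\pi}\).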
  • Youden’s J index: maximize \(J = \text{TPR} - \text{FPR} = \text{Sensitivity} + \text{Specificity} - 1\).
  • Domain constraint: fix a maximum acceptable FPR (e.g., in screening, accept at most 10% false positives) and find the threshold that maximizes TPR under that constraint.
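The four rules above can be sketched in a few lines of base R. The toy data, the cost ratio `c_fn = 5`, and the FPR cap `max_fpr = 0.2` are all assumptions chosen for illustration. The cost-sensitive rule is implemented here by directly minimizing expected cost over the candidate thresholds rather than by locating a slope on the curve.

```r
# Hypothetical toy data: true labels (1 = positive) and model scores.
y_true <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
scores <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05)

pos <- scores[y_true == 1]
neg <- scores[y_true == 0]
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(pos >= t))
fpr <- sapply(thresholds, function(t) mean(neg >= t))

# Youden's J: maximize TPR - FPR.
t_youden <- thresholds[which.max(tpr - fpr)]

# Closest point to the top-left corner (0, 1).
t_corner <- thresholds[which.min(sqrt(fpr^2 + (1 - tpr)^2))]

# Cost-sensitive: a false negative costs c_fn times a false positive
# (c_fn = 5 is an assumed value). Expected cost per observation:
c_fn <- 5
prev <- mean(y_true == 1)
cost <- c_fn * prev * (1 - tpr) + (1 - prev) * fpr
t_cost <- thresholds[which.min(cost)]

# Domain constraint: best TPR subject to FPR <= max_fpr (assumed cap).
max_fpr <- 0.2
ok <- fpr <= max_fpr
t_constrained <- thresholds[ok][which.max(tpr[ok])]

print(c(youden = t_youden, corner = t_corner,
        cost = t_cost, constrained = t_constrained))
```

On this data, Youden's J, the corner rule, and the FPR cap all pick 0.6, while the cost-sensitive rule moves the threshold down to 0.3 because missed positives are penalized more heavily.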

ROC curve with three operating points marked showing different threshold choices and their tradeoffs

Precision-Recall curve

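For highly imbalanced datasets (e.g., fraud detection where 0.1% of cases are positive), ROC and AUC can be misleading. FPR is computed relative to the very large number of negatives, so even a small FPR can correspond to far more false positives than true positives: the ROC curve and AUC look strong while most positive predictions are wrong.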

The Precision-Recall (PR) curve plots precision vs recall across thresholds. It focuses on the positive class and is more informative when:

  • The dataset is highly imbalanced.
  • The costs of false negatives and false positives are very different.
  • The rare class is the class of interest.

The average precision (AP) summarizes the PR curve, analogous to AUC for the ROC curve.

Precision-recall curve for a good and a moderate classifier showing how performance differs on imbalanced data

The baseline (dashed) is the precision of a classifier that predicts positive at random: it equals the prevalence at every recall level. A useful classifier must stay well above this line.
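For intuition, the PR curve and average precision can be computed by hand in base R. This is a minimal sketch on made-up toy data that assumes no tied scores; average precision is taken as the sum of precision times the increase in recall at each rank, which is one common definition.

```r
# Hypothetical toy data: true labels (1 = positive) and model scores.
y_true <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
scores <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05)

# Sort observations from highest to lowest score (assumes no ties).
ord      <- order(scores, decreasing = TRUE)
y_sorted <- y_true[ord]

# Precision and recall when the top-k observations are predicted positive.
tp_cum    <- cumsum(y_sorted == 1)
k         <- seq_along(y_sorted)
precision <- tp_cum / k
recall    <- tp_cum / sum(y_true == 1)

# Average precision: precision weighted by the step in recall at each rank.
avg_precision <- sum(precision * diff(c(0, recall)))

# Baseline precision of a random classifier: the prevalence.
baseline <- mean(y_true == 1)

print(data.frame(k = k, recall = recall, precision = round(precision, 3)))
print(c(avg_precision = avg_precision, baseline = baseline))
```

Here the average precision is about 0.83 against a baseline of 0.4. On a dataset with 0.1% prevalence the baseline would be 0.001.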

⚠️ AUC is not the right metric when classes are severely imbalanced

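Consider a fraud detection system where 0.1% of transactions are fraudulent. A classifier that always predicts “not fraud” achieves accuracy = 99.9%, yet its ROC-AUC is exactly 0.5 (all scores are tied) and its average precision equals the prevalence, 0.001. More subtly, a model with a high ROC-AUC can still have a low average precision, because a small FPR applied to a very large number of negatives produces many false positives.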
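A short calculation shows how the same ROC operating point translates into very different precision depending on prevalence. The operating point (TPR = 0.9, FPR = 0.01) is an assumed example, and `precision_from_rates()` is an illustrative helper, not a library function.

```r
# Illustrative helper: precision implied by TPR, FPR and prevalence.
# precision = TPR * prev / (TPR * prev + FPR * (1 - prev))
precision_from_rates <- function(tpr, fpr, prevalence) {
  tp <- tpr * prevalence          # expected share of true positives
  fp <- fpr * (1 - prevalence)    # expected share of false positives
  tp / (tp + fp)
}

# Same assumed operating point, two different prevalences.
prec_balanced <- precision_from_rates(tpr = 0.9, fpr = 0.01, prevalence = 0.5)
prec_fraud    <- precision_from_rates(tpr = 0.9, fpr = 0.01, prevalence = 0.001)

print(round(c(balanced = prec_balanced, fraud = prec_fraud), 4))
```

With balanced classes this operating point gives precision of about 0.99. At 0.1% prevalence it gives about 0.08: per 100,000 transactions, 90 true positives are buried among 999 false positives, even though the point sits near the top-left corner of the ROC plot.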

Use precision-recall AUC (or average precision) instead of ROC-AUC when:

  • The positive class is rare (prevalence below roughly 5-10%).
  • You care primarily about detecting the rare class.
  • False positives have very different costs from false negatives.

Also: when comparing models with AUC, check whether the difference is statistically significant. The DeLong test compares two ROC curves and provides a p-value for the difference in AUC.

💡 ROC and AUC in R

library(pROC)

# ROC curve and AUC
roc_obj <- roc(y_true, scores, levels=c(0,1), direction="<")
auc(roc_obj)                         # AUC value
ci.auc(roc_obj)                      # 95% CI for AUC
plot(roc_obj, col="#2563EB")         # ROC plot

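# Compare two ROC curves (two models scored on the same observations)
# Fix levels and direction so both curves use the same convention
roc1 <- roc(y_true, scores1, levels=c(0,1), direction="<")
roc2 <- roc(y_true, scores2, levels=c(0,1), direction="<")
roc.test(roc1, roc2, method="delong") # DeLong test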

# Optimal threshold by Youden's J
coords(roc_obj, "best", best.method="youden")

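# Precision-Recall
library(PRROC)
# In PRROC, scores.class0 holds the positive-class scores
# and scores.class1 the negative-class scores
pr_obj <- pr.curve(scores.class0=scores[y_true==1],
                   scores.class1=scores[y_true==0],
                   curve=TRUE)
pr_obj$auc.integral   # area under the PR curve (close to, but not identical to, average precision)
plot(pr_obj)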

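# Full evaluation with caret (both arguments must be factors with the same levels)
library(caret)
confusionMatrix(factor(pred_labels, levels=c(0,1)),
                factor(true_labels, levels=c(0,1)),
                positive="1")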