Simple linear regression

Simple linear regression models the linear relationship between a response variable \(y\) and a single predictor \(x\) by fitting a straight line through the data. It is the foundation of regression analysis: every multiple regression, logistic regression, and regularized model builds on these concepts.

The model

\[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)\]

  • \(\beta_0\): intercept. The expected value of \(y\) when \(x = 0\).
  • \(\beta_1\): slope. The expected change in \(y\) for a one-unit increase in \(x\).
  • \(\varepsilon_i\): error term. Captures everything that affects \(y\) beyond \(x\): measurement error, omitted variables, inherent randomness.

The model rests on four assumptions, summarized by the acronym LINE: Linearity, Independence of errors, Normality of errors, and Equal variance (homoscedasticity). These are checked in the diagnostics post.
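To make the model concrete, here is a minimal sketch that simulates data from it in R. The parameter values (`beta0`, `beta1`, `sigma`), the sample size, and the range of `x` are hypothetical, chosen only for illustration.

```r
# Simulate n observations from y_i = beta0 + beta1 * x_i + eps_i.
# All numeric values below are hypothetical, chosen only for illustration.
set.seed(1)
n     <- 50
beta0 <- 10   # intercept (assumed)
beta1 <- 2    # slope (assumed)
sigma <- 3    # error standard deviation (assumed)

x   <- runif(n, min = 0, max = 40)      # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)   # independent N(0, sigma^2) errors
y   <- beta0 + beta1 * x + eps          # response generated by the model

sim <- data.frame(x = x, y = y)
head(sim)
```

Because the errors are drawn independently from one normal distribution with constant variance, and the mean of `y` is linear in `x`, this simulated data set satisfies all four LINE assumptions by construction.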

OLS estimation

Ordinary Least Squares (OLS) finds \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the sum of squared residuals:

\[\text{SSE} = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\]

Taking derivatives and setting them to zero gives the closed-form estimators:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = r_{xy} \cdot \frac{S_y}{S_x}\]

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

where \(r_{xy}\) is the Pearson correlation between \(x\) and \(y\), and \(S_y\) and \(S_x\) are the sample standard deviations of \(y\) and \(x\). The regression line always passes through the point of means \((\bar{x}, \bar{y})\).

The connection with correlation: because \(S_y / S_x > 0\), \(\hat{\beta}_1 = 0\) if and only if \(r_{xy} = 0\), and the slope always has the same sign as the correlation. Testing \(H_0: \beta_1 = 0\) is equivalent to testing \(H_0: \rho = 0\).
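The closed-form estimators can be checked numerically against `lm()`. The sketch below uses simulated data; the variable names `spend` and `sales` and the generating values are assumptions for illustration, not the data set from the example.

```r
# Hypothetical data: simulated, not the article's advertising data.
set.seed(42)
spend <- runif(50, min = 0, max = 40)
sales <- 10 + 1.8 * spend + rnorm(50, sd = 5)

# Closed-form OLS estimators
b1 <- sum((spend - mean(spend)) * (sales - mean(sales))) /
  sum((spend - mean(spend))^2)
b0 <- mean(sales) - b1 * mean(spend)

# The same slope via the correlation identity r_xy * S_y / S_x
b1_cor <- cor(spend, sales) * sd(sales) / sd(spend)

# Reference fit
fit <- lm(sales ~ spend)

rbind(by_hand = c(b0, b1), lm = unname(coef(fit)))
```

The two rows should agree up to floating-point error, and evaluating the fitted line at `mean(spend)` returns `mean(sales)`.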

Example: advertising spend and sales

A company tracks weekly TV advertising spend (thousands of €) and weekly sales (thousands of units) over 50 weeks.

Figure: scatter plot of advertising spend vs sales with fitted regression line and confidence band.

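In the fitted model, the slope estimate is about 1.79: each additional €1,000 of weekly TV advertising is associated with approximately 1.79 thousand extra units sold. The intercept (9.76) is the predicted baseline of about 9.76 thousand units per week with zero advertising, which is meaningful only if spend near zero lies within the observed range. \(R^2 = 0.909\) means that 90.9% of the variation in sales is explained by advertising spend.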

Goodness of fit: R²

\(R^2\) measures the proportion of variance in \(y\) explained by the model:

\[R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\]

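\(R^2 \in [0,1]\) for a least-squares fit that includes an intercept. An \(R^2\) of 0.70 means 70% of the variance in \(y\) is explained by \(x\); the remaining 30% is left unexplained by the model, whether it comes from noise or from omitted predictors. In simple linear regression, \(R^2 = r_{xy}^2\).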
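As a quick check of the definition, the following sketch computes \(R^2\) from SSE and SST and compares it with the squared correlation and with the value reported by `summary()`. The data are simulated under assumed parameter values.

```r
# Hypothetical data: simulated, not the article's advertising data.
set.seed(42)
spend <- runif(50, min = 0, max = 40)
sales <- 10 + 1.8 * spend + rnorm(50, sd = 5)
fit   <- lm(sales ~ spend)

sse <- sum(residuals(fit)^2)          # unexplained variation
sst <- sum((sales - mean(sales))^2)   # total variation
r2  <- 1 - sse / sst

c(by_hand     = r2,
  squared_cor = cor(spend, sales)^2,
  from_lm     = summary(fit)$r.squared)
```

All three numbers should coincide up to floating-point error.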

Figure: two panels showing SST (total variation) and SSE (unexplained variation) for the same regression, illustrating \(R^2\).

Inference on the slope

Under the model assumptions, \(\hat{\beta}_1\) is normally distributed:

\[\hat{\beta}_1 \sim N\!\left(\beta_1,\; \frac{\sigma^2}{\sum(x_i-\bar{x})^2}\right)\]

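Since \(\sigma^2\) is unknown, we estimate it with \(\hat{\sigma}^2 = \text{SSE}/(n-2)\), which gives the estimated standard error \(\widehat{\text{SE}}(\hat{\beta}_1) = \hat{\sigma} / \sqrt{\sum(x_i-\bar{x})^2}\), and use the \(t\)-distribution: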

\[t = \frac{\hat{\beta}_1 - 0}{\widehat{\text{SE}}(\hat{\beta}_1)} \sim t(n-2) \quad \text{under } H_0: \beta_1 = 0\]

A \((1-\alpha)\) confidence interval for \(\beta_1\):

\[\hat{\beta}_1 \pm t_{\alpha/2, n-2} \cdot \widehat{\text{SE}}(\hat{\beta}_1)\]

A significant \(t\)-test (conventionally \(p < 0.05\)) means \(x\) is a statistically significant predictor of \(y\): a slope at least as far from zero as the one observed would be unlikely to arise by chance if \(\beta_1 = 0\).
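The inference formulas can be reproduced step by step in R. This is a minimal sketch on simulated data (the generating values are assumptions); it rebuilds the standard error, \(t\)-statistic, two-sided p-value, and 95% confidence interval for the slope, then compares them with `summary()` and `confint()`.

```r
# Hypothetical data: simulated, not the article's advertising data.
set.seed(42)
df  <- data.frame(spend = runif(50, min = 0, max = 40))
df$sales <- 10 + 1.8 * df$spend + rnorm(50, sd = 5)
fit <- lm(sales ~ spend, data = df)

n         <- nrow(df)
sxx       <- sum((df$spend - mean(df$spend))^2)
sigma_hat <- sqrt(sum(residuals(fit)^2) / (n - 2))   # sqrt(SSE / (n - 2))
b1        <- coef(fit)[["spend"]]

se_b1  <- sigma_hat / sqrt(sxx)                      # standard error of the slope
t_stat <- (b1 - 0) / se_b1                           # test statistic under H0: beta1 = 0
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

t_crit <- qt(0.975, df = n - 2)                      # t_{alpha/2, n-2} for alpha = 0.05
ci     <- b1 + c(-1, 1) * t_crit * se_b1             # 95% CI for beta1

summary(fit)$coefficients["spend", ]                 # compare: SE, t, p-value
confint(fit)["spend", ]                              # compare: CI
```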

⚠️ Statistical significance does not imply practical importance

A very large sample can produce a statistically significant slope that is practically negligible. A slope of \(\hat{\beta}_1 = 0.001\) units per €1,000 of advertising may be significant at \(p < 0.001\) with \(n = 10{,}000\) observations and still be commercially irrelevant.

Always report the effect size (the slope itself and its confidence interval) alongside the p-value. The confidence interval communicates both the direction and the plausible magnitude of the effect.
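The point can be illustrated with a small simulation. The sketch below generates \(n = 10{,}000\) observations with a true slope of 0.001; the intercept, the noise level, and the range of `spend` are hypothetical choices made so that the effect is tiny but detectable.

```r
# Hypothetical simulation: a tiny true slope with a very large sample.
set.seed(7)
n     <- 10000
spend <- runif(n, min = 0, max = 100)
sales <- 5 + 0.001 * spend + rnorm(n, sd = 0.3)

fit_big <- lm(sales ~ spend)

summary(fit_big)$coefficients["spend", ]   # estimate, SE, t value, p-value
confint(fit_big)["spend", ]                # plausible range for the slope
summary(fit_big)$r.squared                 # share of variance explained
```

Under these assumptions the p-value for the slope is very small, yet the confidence interval sits close to 0.001 and \(R^2\) is near zero, which is what a reader needs in order to judge practical relevance.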

Prediction

For a new observation at \(x_\text{new}\), the model produces two types of intervals:

Confidence interval for the mean response: uncertainty about the average \(y\) at \(x_\text{new}\) across the population.

Prediction interval for a new observation: wider, because it adds the individual error \(\varepsilon\) on top of the uncertainty about the mean.

\[\hat{y}_\text{new} \pm t_{\alpha/2, n-2} \cdot \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_\text{new}-\bar{x})^2}{\sum(x_i-\bar{x})^2}}\]

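Dropping the leading 1 under the square root gives the confidence interval for the mean response. Both intervals widen as \(x_\text{new}\) moves away from \(\bar{x}\), so predictions far from the centre of the data, and extrapolations beyond its range in particular, are increasingly unreliable.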
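The interval formula can be verified against `predict()`. The sketch below uses simulated data with assumed generating values and a hypothetical new point `x_new = 30`; it computes the half-widths of both intervals by hand.

```r
# Hypothetical data: simulated, not the article's advertising data.
set.seed(42)
df  <- data.frame(spend = runif(50, min = 0, max = 40))
df$sales <- 10 + 1.8 * df$spend + rnorm(50, sd = 5)
fit <- lm(sales ~ spend, data = df)

x_new     <- 30                                       # hypothetical new spend level
n         <- nrow(df)
sxx       <- sum((df$spend - mean(df$spend))^2)
sigma_hat <- sqrt(sum(residuals(fit)^2) / (n - 2))
t_crit    <- qt(0.975, df = n - 2)

y_hat <- coef(fit)[["(Intercept)"]] + coef(fit)[["spend"]] * x_new
h     <- 1 / n + (x_new - mean(df$spend))^2 / sxx     # term shared by both intervals

ci_half <- t_crit * sigma_hat * sqrt(h)               # CI for the mean response
pi_half <- t_crit * sigma_hat * sqrt(1 + h)           # PI for a new observation

new_df  <- data.frame(spend = x_new)
pred_ci <- predict(fit, newdata = new_df, interval = "confidence")
pred_pi <- predict(fit, newdata = new_df, interval = "prediction")
rbind(pred_ci, pred_pi)
```

The prediction interval is always the wider of the two, because of the extra 1 under the square root.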

💡 Simple linear regression in R

fit <- lm(sales ~ spend, data = df)
summary(fit)         # coefficients, t-tests, R²
confint(fit)         # 95% CIs for beta0 and beta1
predict(fit, newdata = data.frame(spend = 30),
        interval = "confidence")   # CI for mean response
predict(fit, newdata = data.frame(spend = 30),
        interval = "prediction")   # PI for new observation