MASTER STATISTICS

GLOSSARY OF STATISTICAL TERMS

An extensive glossary of statistical terms and comparisons of statistical concepts

A

AIC vs BIC

AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are criteria used for model selection. AIC minimizes information loss while BIC penalizes model complexity more heavily, favoring simpler models when sample size is large.
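
As a rough illustration, both criteria can be computed from a model's maximized log-likelihood using the standard formulas AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L). The function names and the two example fits below are hypothetical and the log-likelihood values are made up for the sketch.

```python
import math

def aic(log_likelihood: float, k: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical fits on n = 200 observations (illustrative numbers only).
n = 200
simple_fit = {"loglik": -310.0, "k": 3}
complex_fit = {"loglik": -306.0, "k": 6}

aic_simple = aic(simple_fit["loglik"], simple_fit["k"])
aic_complex = aic(complex_fit["loglik"], complex_fit["k"])
bic_simple = bic(simple_fit["loglik"], simple_fit["k"], n)
bic_complex = bic(complex_fit["loglik"], complex_fit["k"], n)

print("AIC:", aic_simple, aic_complex)
print("BIC:", round(bic_simple, 2), round(bic_complex, 2))
```

With these made-up numbers, AIC is lower for the larger model while BIC is lower for the smaller one, which shows the heavier complexity penalty of BIC at this sample size.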

A

Alternative hypothesis

The alternative hypothesis (H₁) states that there is an effect, difference, or relationship in the population. It is what the researcher is trying to find evidence for, and it is supported when the null hypothesis is rejected.

A

ANOVA

Analysis of Variance (ANOVA) is a statistical test used to compare the means of three or more groups simultaneously. It tests whether at least one group mean differs significantly from the others by partitioning total variance into between-group and within-group components.

A

ARIMA

ARIMA (AutoRegressive Integrated Moving Average) combines autoregression, differencing, and moving average components to model stationary and non-stationary time series. The parameters (p, d, q) represent the AR order, differencing degree, and MA order. It is the standard benchmark model for univariate time series forecasting.

A

Autocorrelation

Autocorrelation measures the correlation of a time series with a lagged version of itself. Positive autocorrelation means consecutive values tend to be similar; negative autocorrelation means they tend to alternate. It is the first diagnostic check for any time series model.
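
A minimal sketch of the lag-k sample autocorrelation, assuming the common estimator that divides the lagged cross-products by the total sum of squares. The function name and the two toy series are illustrative.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag (lag >= 1)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom

alternating = [1, -1] * 10   # values flip sign at every step
trend = list(range(20))      # consecutive values are similar

r_alternating = autocorrelation(alternating, lag=1)
r_trend = autocorrelation(trend, lag=1)
print(r_alternating, r_trend)
```

The alternating series gives a strongly negative lag-1 value and the trending series a strongly positive one, matching the interpretation in the definition.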

B

Bayes' theorem

Bayes' theorem updates the probability of a hypothesis given new evidence. The posterior probability is proportional to the prior probability multiplied by the likelihood. It is the foundation of Bayesian inference and underlies applications such as spam filters, medical diagnosis, and machine learning classifiers.
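
To make the update concrete, here is a small sketch for a diagnostic test. The function name and the prevalence, sensitivity, and false positive rate are hypothetical values chosen for illustration.

```python
def posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """P(condition | positive test) via Bayes' theorem."""
    # Total probability of a positive test (the evidence).
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Assumed values: 1% prevalence, 95% sensitivity, 5% false positive rate.
p_condition = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(p_condition, 3))
```

Under these assumptions the posterior is only about 16%, because the low prior dominates even with a fairly accurate test.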

B

Bayesian vs frequentist inference

Frequentist inference treats parameters as fixed and unknown, making statements about the probability of data given the parameter. Bayesian inference treats parameters as random variables with prior distributions, updating them with data to obtain a posterior distribution.

B

Bernoulli distribution

The Bernoulli distribution models a single trial with two outcomes: success (1) with probability p and failure (0) with probability 1-p. Its mean is p and variance is p(1-p). It is the simplest discrete distribution and the building block for the binomial distribution.

B

Beta distribution

The beta distribution is a continuous distribution on [0,1] parameterized by shape parameters α and β. It is widely used as a prior for probabilities in Bayesian inference, for modeling proportions, and as the distribution of order statistics from a uniform distribution.

B

Bias

Bias is the systematic error of an estimator: the difference between its expected value and the true parameter value. A biased estimator consistently over- or underestimates the truth. Bias can be reduced by using unbiased estimators or correcting for known systematic errors.

B

Bias-variance tradeoff

The bias-variance tradeoff describes how prediction error decomposes into bias (systematic error from wrong assumptions), variance (sensitivity to training data fluctuations), and irreducible noise. Reducing bias tends to increase variance and vice versa. Regularization and ensemble methods manage this tradeoff.

B

Binomial distribution

The binomial distribution models the number of successes in n independent Bernoulli trials, each with probability p. Its mean is np and variance is np(1-p). It converges to the normal distribution for large n and to the Poisson distribution when n is large and p is small.

B

Bootstrap

Bootstrap is a resampling method that estimates the sampling distribution of a statistic by repeatedly drawing samples with replacement from the observed data. It is used to compute standard errors and confidence intervals without assuming a parametric distribution.
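
A minimal sketch of a nonparametric bootstrap standard error, using only the standard library. The function name, the number of resamples, the seed, and the sample values are assumptions made for the example.

```python
import random
import statistics

def bootstrap_se(data, stat=statistics.mean, n_resamples=2000, seed=0):
    """Standard error of `stat`, estimated from resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = [stat(rng.choices(data, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(estimates)

sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 4.4]
se_boot = bootstrap_se(sample)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5  # textbook s / sqrt(n)
print(round(se_boot, 3), round(se_formula, 3))
```

For the mean, the bootstrap value should land close to the textbook formula; the practical benefit is that the same function works for statistics such as the median, where no simple formula exists.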

B

Boxplot

A boxplot displays the five-number summary of a dataset: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The box spans the interquartile range (IQR = Q3 - Q1), and points beyond 1.5 × IQR from the box are plotted as outliers.

C

Central limit theorem

The central limit theorem states that the sampling distribution of the sample mean of independent observations approaches a normal distribution as sample size increases, regardless of the shape of the population distribution, provided the population variance is finite. This justifies the use of normal-based inference for large samples and makes it one of the most important theorems in applied statistics.

C

Chi-square test

The chi-square test assesses whether observed frequencies in a contingency table differ significantly from expected frequencies under independence. It is used to test associations between categorical variables and to test goodness of fit of a theoretical distribution.

C

Cluster sampling

Cluster sampling divides the population into groups (clusters), randomly selects some clusters, and surveys all members within those clusters. It is more practical than simple random sampling when the population is geographically dispersed, though it typically yields less precise estimates.

C

Coefficient of variation

The coefficient of variation (CV) is the ratio of the standard deviation to the mean, expressed as a percentage. It measures relative variability and allows comparison of dispersion across datasets with different units or scales.

C

Conditional probability

The conditional probability P(A|B) is the probability of event A given that B has occurred, defined as P(A∩B)/P(B). It updates probabilities when partial information is available. Conditional probabilities are the foundation of Bayes' theorem and probability trees.

C

Confidence interval

A confidence interval is a range of values computed from sample data that, under repeated sampling, would contain the true population parameter a specified percentage of the time. A 95% CI does not mean there is a 95% probability that the parameter lies in this specific interval.

C

Confidence interval vs prediction interval

A confidence interval quantifies uncertainty about the mean response at a given predictor value. A prediction interval quantifies uncertainty about a single new observation at that value. Because it also includes the variability of individual observations around the mean, a prediction interval is always wider than the corresponding confidence interval at the same confidence level.

C

Confusion matrix

A confusion matrix is a table that summarizes the performance of a classification model. Rows represent actual classes and columns represent predicted classes. It shows true positives, false positives, true negatives, and false negatives, from which accuracy, precision, recall, and F1 score are derived.
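
The derived metrics can be sketched directly from the four cell counts. The function name and the counts below are hypothetical.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts for a binary classifier evaluated on 100 cases.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 3) for name, value in metrics.items()})
```

This sketch assumes every denominator is non-zero; a production version would need to handle a class that is never predicted or never present.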

C

Correlation

Correlation measures the strength and direction of the linear relationship between two continuous variables. Pearson's r ranges from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear relationship. Correlation does not imply causation.

C

Covariance

Covariance measures the joint variability of two random variables. A positive covariance means both variables tend to increase together; negative means one tends to decrease as the other increases. Correlation is the standardized version of covariance, bounded between -1 and 1.

C

Cramér's V

Cramér's V measures the strength of association between two categorical variables, derived from the chi-square statistic. It ranges from 0 (no association) to 1 (perfect association) and is comparable across tables of different sizes.

C

Cross-validation

Cross-validation estimates model generalization performance by repeatedly splitting data into training and validation sets. In k-fold CV, data is divided into k folds; the model is trained on k-1 and evaluated on the remaining fold, repeated k times. It gives a more stable estimate than a single train-test split.
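
The k-fold splitting step can be sketched as an index generator. The function name is hypothetical, and the sketch assumes the rows have already been shuffled, since it uses contiguous folds.

```python
def k_fold_indices(n: int, k: int):
    """Yield (train, validation) index lists for k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, validation
        start = stop

folds = list(k_fold_indices(n=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each observation appears in exactly one validation fold, so averaging the k validation scores uses every observation once for evaluation.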

C

Cumulative distribution function

The cumulative distribution function (CDF) F(x) gives the probability that a random variable X takes a value less than or equal to x. It is always non-decreasing and ranges from 0 to 1. For discrete variables it is a step function; for continuous variables it is a continuous function, though it can be flat over intervals where the density is zero. The CDF fully characterizes a probability distribution.

D

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are densely connected within a neighborhood radius ε, classifying sparse points as noise. Unlike K-means, it discovers clusters of arbitrary shape and does not require specifying the number of clusters in advance.

D

Decision tree

A decision tree partitions the feature space into rectangular regions by recursively splitting on the feature and threshold that best separates the classes. It is highly interpretable but has high variance: small changes in the training data can produce a completely different tree.

D

Degrees of freedom

Degrees of freedom are the number of independent values in a calculation that are free to vary. When estimating k parameters from n observations, the residual degrees of freedom are n - k. They determine the shape of t, chi-square, and F distributions used in hypothesis tests.

D

Descriptive vs inferential statistics

Descriptive statistics summarize and describe the observed data using measures like mean, variance, and graphs. Inferential statistics use sample data to make conclusions about a larger population, quantifying uncertainty through confidence intervals and hypothesis tests.

D

Discrete vs continuous variables

A discrete variable takes countable distinct values (number of defects, number of children). A continuous variable can take any value in an interval (height, temperature, time). The distinction determines which probability distributions and statistical methods are appropriate.

E

Effect size

Effect size quantifies the magnitude of an effect independently of sample size. Cohen's d measures standardized mean differences; Pearson's r measures correlation strength; eta-squared measures variance explained in ANOVA. Unlike p-values, effect sizes reflect practical significance.

E

ElasticNet

ElasticNet combines L1 (Lasso) and L2 (Ridge) penalties in a convex mixture controlled by a mixing parameter. It performs variable selection like Lasso while retaining the grouping effect of Ridge, making it preferred when predictors are correlated and the model is sparse.

E

Entropy

Entropy (Shannon entropy) measures the average uncertainty or information content of a probability distribution. High entropy means outcomes are nearly equally likely; low entropy means one outcome dominates. It is used in decision trees as a splitting criterion and in information theory.
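
A short sketch of the entropy formula H = -Σ p log(p), measured in bits by default. The function name and the two example distributions are illustrative.

```python
import math

def shannon_entropy(probs, base=2):
    """Shannon entropy of a discrete distribution; zero-probability terms contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

h_fair = shannon_entropy([0.5, 0.5])     # two equally likely outcomes
h_biased = shannon_entropy([0.9, 0.1])   # one outcome dominates
print(round(h_fair, 3), round(h_biased, 3))
```

The fair coin reaches the maximum of 1 bit for two outcomes, while the biased coin has lower entropy because its result is easier to predict.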

E

Estimator

An estimator is a function of sample data used to estimate an unknown population parameter. A good estimator is unbiased (expected value equals the true parameter), consistent (converges to the true value as n grows), and efficient (has minimum variance among unbiased estimators).

E

Exponential distribution

The exponential distribution models the time between events in a Poisson process, with rate parameter λ. Its mean is 1/λ and it is memoryless: the probability of an event in the next instant does not depend on how long you have already waited. It is widely used in reliability and survival analysis.

F

F-distribution

The F-distribution is a continuous probability distribution that arises as the ratio of two independent chi-square random variables, each divided by its degrees of freedom. It is used in ANOVA F-tests, tests of equality of variances, and overall significance tests in regression.

F

F-test

The F-test compares two nested models or tests equality of two variances. In regression, the F-test assesses whether at least one predictor is significant. In ANOVA, it tests whether group means differ. The test statistic follows an F-distribution under the null hypothesis.

F

False positive vs false negative

A false positive (Type I error) occurs when a true null hypothesis is rejected. A false negative (Type II error) occurs when a false null hypothesis is not rejected. In classification, a false positive predicts the positive class incorrectly; a false negative misses a true positive case.

F

Feature importance

Feature importance measures how much each predictor contributes to a model's predictions. Impurity-based importance sums the impurity reduction across all splits on a variable. Permutation importance measures how much the error increases when a feature's values are randomly shuffled.

G

Gamma distribution

The gamma distribution generalizes the exponential distribution to model the waiting time until the k-th event in a Poisson process, with shape k and rate λ. It includes the chi-squared distribution (k=ν/2, λ=1/2) and exponential distribution (k=1) as special cases.

G

Geometric distribution

The geometric distribution models the number of trials needed to get the first success, with success probability p per trial. Its mean is 1/p and it is the only discrete memoryless distribution. It is used in quality control, queuing theory, and modeling first-success phenomena.

G

Gradient descent

Gradient descent minimizes a differentiable function by iteratively moving in the direction of the negative gradient (steepest descent). The learning rate controls step size. It is the foundational algorithm for training neural networks and fitting logistic regression.
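
A minimal sketch of the update rule x ← x - η·∇f(x) on a one-dimensional quadratic. The function name, learning rate, step count, and objective are assumptions chosen for the example.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * grad(x)
    return x

# Example objective: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

With this learning rate the distance to the minimizer x = 3 shrinks by a factor of 0.8 per step; a learning rate above 1.0 would make the iterates diverge for this particular function.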

H

Hierarchical clustering

Hierarchical clustering builds a tree of nested clusters (dendrogram) either by merging small clusters bottom-up (agglomerative) or splitting large ones top-down (divisive). The linkage criterion (single, complete, average, Ward) determines how inter-cluster distance is measured. The number of clusters is chosen after inspecting the dendrogram.

H

Histogram

A histogram displays the distribution of a continuous variable by dividing its range into bins and showing the frequency or density of observations in each bin. Unlike a bar chart, the bins are contiguous. The choice of bin width strongly affects the visual appearance.

H

Hypergeometric distribution

The hypergeometric distribution models the number of successes when drawing n items without replacement from a population of N items containing K successes. Unlike the binomial, successive draws are dependent. It is used in quality control, genetics, and Fisher's exact test.

H

Hypothesis testing

Hypothesis testing is a statistical procedure for deciding between two competing hypotheses about a population parameter. It involves specifying H₀ and H₁, computing a test statistic, and comparing it to a critical value or computing a p-value to make a decision at a chosen significance level α.

I

Independent events

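Two events A and B are independent if the occurrence of one does not affect the probability of the other: P(A∩B) = P(A)·P(B). When P(B) > 0, this is equivalent to P(A|B) = P(A). Independence is distinct from mutual exclusivity: two mutually exclusive events with positive probabilities are never independent, because the occurrence of one rules out the other. The concept is fundamental to defining independent random variables.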

I

Interquartile range

The interquartile range (IQR) is the difference between Q3 (75th percentile) and Q1 (25th percentile). It measures the spread of the middle 50% of the data and is robust to outliers. IQR is used to define outlier thresholds in boxplots: points beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR.
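
The outlier rule can be sketched with the standard library. Quartile conventions differ between software packages; this example assumes the "inclusive" method of `statistics.quantiles`, and the function name and data are hypothetical.

```python
import statistics

def iqr_outliers(data):
    """Return points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and the fences."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return outliers, (lower, upper)

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers, fences = iqr_outliers(values)
print(outliers, fences)
```

Here only the value 100 falls outside the fences, and the fences themselves are unaffected by how extreme that value is, which is what robustness means in this context.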

J

Jackknife

The jackknife is a resampling method that estimates bias and variance by repeatedly leaving out one observation at a time. For n observations, it produces n jackknife samples of size n-1. It is computationally cheaper than bootstrap and particularly useful for bias correction.
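
A minimal sketch of the leave-one-out procedure for a standard error, using the usual jackknife variance formula ((n-1)/n)·Σ(θ₍ᵢ₎ - θ̄)². The function name and sample values are illustrative.

```python
import statistics

def jackknife_se(data, stat=statistics.mean):
    """Jackknife standard error of `stat` from n leave-one-out estimates."""
    n = len(data)
    leave_one_out = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    center = statistics.mean(leave_one_out)
    variance = (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)
    return variance ** 0.5

sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 4.4]
se_jack = jackknife_se(sample)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5
print(round(se_jack, 6), round(se_formula, 6))
```

For the sample mean the jackknife standard error reproduces s/√n exactly (up to floating-point rounding), which makes the mean a convenient sanity check before applying the function to other statistics.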

K

K-means clustering

K-means partitions n observations into K clusters by iteratively assigning each point to its nearest centroid and recomputing centroids as cluster means. It minimizes within-cluster sum of squares. K must be specified in advance; K-means++ improves initialization to avoid poor local minima.

K

K-nearest neighbors

KNN classifies a new observation by majority vote among its k nearest training points under a chosen distance metric. It is a non-parametric, instance-based learner with no training phase. Performance degrades in high dimensions due to the curse of dimensionality.

K

Kolmogorov-Smirnov test

The Kolmogorov-Smirnov (KS) test compares a sample distribution against a reference distribution (one-sample) or compares two sample distributions (two-sample). The test statistic is the maximum absolute difference between the empirical CDF and the reference CDF (one-sample) or between the two empirical CDFs (two-sample). It is a general goodness-of-fit test valid for continuous distributions.

K

Kruskal-Wallis test

The Kruskal-Wallis test is a non-parametric alternative to one-way ANOVA that tests whether k independent groups come from the same distribution. It uses ranks rather than raw values and does not assume normality. Significant results can be followed by Dunn's post-hoc tests.

K

Kurtosis

Kurtosis measures the heaviness of the tails of a distribution relative to the normal distribution. Excess kurtosis = kurtosis - 3. Positive excess kurtosis (leptokurtic) indicates heavy tails and more extreme values; negative (platykurtic) indicates light tails.

L

Lasso regression

Lasso (L1 regularization) adds a penalty proportional to the sum of absolute coefficient values to the OLS loss. Unlike Ridge, Lasso can shrink coefficients to exactly zero, performing automatic variable selection. It is preferred when the true model is sparse.

L

Law of large numbers

The law of large numbers states that as sample size increases, the sample mean converges to the population mean. The weak law gives convergence in probability; the strong law gives almost sure convergence. It provides the theoretical justification for using sample statistics to estimate population parameters.

L

Likelihood function

The likelihood function L(θ; x) gives the probability of observing the data x as a function of the parameter θ. Maximum likelihood estimation (MLE) finds the parameter value that maximizes L. Unlike probability, likelihood is not normalized and cannot be interpreted as a probability over θ.

L

Linear regression

Linear regression models the expected value of a continuous response as a linear combination of predictors, usually estimated by ordinary least squares (OLS). Classical inference assumes linearity, independence of errors, constant error variance (homoscedasticity), and normally distributed errors. It is the simplest and most interpretable regression model.

L

Logistic regression

Logistic regression models the probability of a binary outcome using the logistic (sigmoid) function applied to a linear combination of predictors. It is estimated by maximum likelihood and produces coefficients interpretable as log-odds ratios. It is the standard baseline for binary classification.

M

Maximum likelihood estimation

Maximum likelihood estimation (MLE) finds the parameter values that maximize the probability (or density) of the observed sample. Under regularity conditions, maximum likelihood estimators are consistent and asymptotically efficient. In a linear model with normal errors, the MLE of the coefficients coincides with OLS; logistic regression for binary data is also fitted by MLE.

M

Mean

The arithmetic mean is the sum of all values divided by the number of observations. It is the most common measure of central tendency but is sensitive to outliers, which can pull it far from the center of the distribution. For skewed data, the median is often more representative.

M

Mean squared error

Mean squared error (MSE) is the average of squared differences between predicted and observed values. For an estimator of a parameter, MSE = Bias² + Variance; for predictions of new observations, an irreducible noise term is added. This makes it a natural measure of the bias-variance tradeoff. Root MSE (RMSE) is in the same units as the response and easier to interpret.

M

Median

The median is the middle value of an ordered dataset, dividing the distribution into two equal halves. It is robust to outliers and a better measure of central tendency than the mean for skewed distributions. For an even number of observations, it is the average of the two middle values.

M

Mode

The mode is the most frequently occurring value in a dataset. A distribution can be unimodal, bimodal, or multimodal. It is the only measure of central tendency applicable to nominal data and is used to describe the most common category.

M

Monte Carlo simulation

Monte Carlo simulation uses repeated random sampling to estimate numerical quantities that are difficult to compute analytically. It is used to approximate integrals, estimate probabilities, and propagate uncertainty in complex models. The estimation error shrinks at rate O(1/√n) in the number of samples n, so halving the error requires four times as many samples.
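
A classic small example is estimating π from the fraction of random points in the unit square that fall inside the quarter circle. The function name, sample size, and seed are arbitrary choices for this sketch.

```python
import random

def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of pi from points drawn uniformly in the unit square."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    # The quarter circle has area pi / 4, so scale the hit fraction by 4.
    return 4 * inside / n_samples

pi_hat = estimate_pi(100_000)
print(pi_hat)
```

With 100,000 points the standard error of this estimator is roughly 0.005, so results typically agree with π to about two decimal places.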

M

Multicollinearity

Multicollinearity occurs when two or more predictors in a regression model are highly correlated. It inflates the variance of coefficient estimates, making them unstable and hard to interpret. The variance inflation factor (VIF) quantifies it: as a common rule of thumb, VIF > 10 is taken to indicate problematic collinearity.

M

Multiple linear regression

Multiple linear regression extends simple linear regression to k predictors. Each coefficient measures the effect of one predictor while holding all others constant. OLS minimizes the sum of squared residuals and requires the design matrix to have full column rank.

N

Naive Bayes

Naive Bayes classifies by applying Bayes' theorem with the assumption that all features are conditionally independent given the class. Despite this rarely being true, it performs well in practice, especially for text classification. It is fast, interpretable, and works well with small training sets.

N

Negative binomial distribution

The negative binomial distribution models the number of trials needed to achieve r successes, with probability p per trial. It generalizes the geometric distribution (r=1) and is also used to model overdispersed count data where variance exceeds the mean, as an alternative to Poisson.

N

Neural networks

A neural network is a computational model composed of layers of interconnected nodes (neurons) that apply weighted linear transformations followed by nonlinear activation functions. Deep networks with many hidden layers learn hierarchical feature representations and achieve state-of-the-art performance on image, text, and sequential data.

N

Normal distribution

The normal distribution is a symmetric, bell-shaped continuous distribution fully described by its mean μ and standard deviation σ. About 68% of values lie within 1σ of the mean, 95% within 2σ. It arises naturally through the central limit theorem and is central to classical statistical inference.

N

Null hypothesis

The null hypothesis (H₀) is the default assumption that there is no effect, no difference, or no relationship in the population. Hypothesis testing attempts to find sufficient evidence to reject it in favor of the alternative. Failing to reject H₀ does not prove it is true.

O

Outlier

An outlier is an observation that lies unusually far from the rest of the data. It can result from measurement errors, data entry mistakes, or genuinely extreme values. Outliers distort means, standard deviations, and regression estimates; the median and IQR are robust alternatives.

O

Overfitting

Overfitting occurs when a model learns the noise in the training data rather than the underlying pattern, leading to low training error but high test error. It is prevented through regularization, cross-validation, more training data, and simpler models.

P

p-value

The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one computed from the sample, assuming H₀ is true. A small p-value (typically < 0.05) is taken as evidence against H₀. The p-value does not measure the probability that H₀ is true.

P

Parametric vs non-parametric tests

Parametric tests assume the data follow a specific distribution (usually normal) and estimate distribution parameters. Non-parametric tests make weaker distributional assumptions and often use ranks instead of raw values. Non-parametric tests are more robust but generally less powerful when parametric assumptions hold.

P

Pearson vs Spearman correlation

Pearson correlation measures the linear relationship between two continuous variables; it is sensitive to outliers. Spearman correlation measures the monotonic relationship using ranks; it is robust to outliers and applies to ordinal data. Use Spearman when normality cannot be assumed.
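
The difference shows up clearly on a relationship that is monotonic but not linear. The helper names below are hypothetical, and the rank function assumes no tied values for simplicity.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks starting at 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]   # monotonic but nonlinear in x
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Spearman's coefficient equals 1 because the ordering is preserved perfectly, while Pearson's coefficient stays below 1 because the points do not lie on a straight line.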

P

Percentile

The k-th percentile is the value below which k% of the observations fall. The 25th, 50th, and 75th percentiles are Q1, Q2 (median), and Q3. Percentiles describe the relative standing of a value within a distribution and are used in growth charts, test scores, and income distributions.

P

Poisson distribution

The Poisson distribution models the number of events occurring in a fixed interval of time or space when events happen at a constant rate λ independently of each other. Its mean and variance are both equal to λ. It approximates the binomial when n is large and p is small.
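
The binomial approximation can be checked numerically by comparing the two probability mass functions with λ = np. The function names and the values n = 1000, p = 0.003 are illustrative choices.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

n, p = 1000, 0.003        # large n, small p
lam = n * p               # matching Poisson rate
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
              for k in range(11))
print(max_gap)
```

For these values the two distributions differ by well under 0.001 at every count from 0 to 10.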

P

Population vs sample

A population is the complete set of individuals sharing a common characteristic of interest. A sample is a subset selected from the population for study. Because studying entire populations is usually impractical, inference is made from samples using statistical methods.

P

Power of a test

The power of a hypothesis test is the probability of correctly rejecting the null hypothesis when it is false (1 - β). Power increases with larger sample sizes, larger effect sizes, and higher significance levels. A power of 0.80 (80%) is conventionally considered adequate.

P

Principal component analysis

PCA finds a lower-dimensional representation of data by projecting onto the directions of maximum variance (principal components). Components are orthogonal and ordered by variance explained. It is used for dimensionality reduction, visualization, and noise removal before modeling.

P

Probability density function

The probability density function (PDF) f(x) of a continuous random variable gives the relative likelihood of the variable taking a given value. The probability of the variable falling in an interval [a, b] is the integral of f(x) from a to b. The total area under the PDF equals 1.

P

Probability mass function

The probability mass function (PMF) of a discrete random variable gives the probability of each possible value. P(X = x) ≥ 0 for all x, and the sum over all possible values equals 1. It is the discrete counterpart of the probability density function.

R

R-squared

R² (coefficient of determination) measures the proportion of variance in the response explained by the model, ranging from 0 to 1 for OLS models with an intercept. Adding predictors never decreases R²; adjusted R² penalizes for model complexity. In simple linear regression, R² equals the square of the Pearson correlation between the predictor and the response.

R

Random forest

Random forest builds many deep decision trees on bootstrap samples, using a random subset of features at each split to decorrelate trees. Predictions are averaged (regression) or majority-voted (classification). It reduces variance substantially compared to a single tree.

R

Random variable

A random variable is a function that assigns a numerical value to each outcome of a random experiment. Discrete random variables take countable values; continuous random variables take values in an interval. Their behavior is described by probability distributions.

R

Regularization

Regularization adds a penalty to the loss function to reduce model complexity and prevent overfitting. L1 regularization (Lasso) encourages sparsity; L2 (Ridge) shrinks all coefficients toward zero. Regularization introduces bias but reduces variance, improving generalization performance.

R

Residual

A residual is the difference between an observed value and its fitted value from a model: eᵢ = yᵢ - ŷᵢ. For standard OLS inference to be valid, residuals should be approximately normally distributed with constant variance (homoscedasticity) and show no autocorrelation. Residual analysis is the primary regression diagnostic tool.

R

Ridge regression

Ridge regression (L2 regularization) adds a penalty proportional to the sum of squared coefficients to the OLS loss. It has a closed-form solution, shrinks all coefficients toward zero without eliminating any, and stabilizes estimates under multicollinearity. It is preferred when all predictors are expected to contribute.
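
In the simplest case of one predictor and no intercept, the closed-form solution reduces to β = Σxy / (Σx² + λ). The sketch below assumes that simplified setting; the function name and data are hypothetical.

```python
def ridge_slope(x, y, penalty: float) -> float:
    """Ridge estimate for y ≈ beta * x (single predictor, no intercept)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + penalty)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]       # exactly y = 2x

beta_ols = ridge_slope(x, y, penalty=0.0)      # no penalty: ordinary least squares
beta_ridge = ridge_slope(x, y, penalty=14.0)   # penalty equal to sum of x^2
print(beta_ols, beta_ridge)
```

The penalty only enlarges the denominator, so the coefficient moves toward zero as the penalty grows but never reaches exactly zero for a finite penalty.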

R

ROC curve and AUC

The ROC curve plots sensitivity (TPR) against 1-specificity (FPR) across all classification thresholds. AUC (area under the curve) summarizes it: AUC = 0.5 is random; AUC = 1 is perfect. AUC equals the probability that the model ranks a random positive case above a random negative case.
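
The ranking interpretation gives a direct, if quadratic-time, way to compute AUC: count how often a positive case scores above a negative one. The function name and scores below are made up for the sketch, and ties are counted as half a win.

```python
def auc_by_ranking(scores_pos, scores_neg) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model scores for three positive and three negative cases.
auc = auc_by_ranking([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(round(auc, 3))
```

Eight of the nine pairs are ordered correctly here; the only miss is the positive case scored 0.4 against the negative case scored 0.7.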

S

Sample size

Sample size determination calculates the minimum number of observations needed to detect a given effect size with specified power and significance level. Larger samples reduce variance and increase power but cost more. Underpowered studies are a major source of irreproducible research.

S

Sampling distribution

The sampling distribution of a statistic is the probability distribution of that statistic computed over all possible samples of a given size from the population. The standard deviation of the sampling distribution is the standard error. The central limit theorem describes the sampling distribution of the mean.

S

Shapiro-Wilk test

The Shapiro-Wilk test assesses whether a sample comes from a normal distribution. Its statistic W is the ratio of a squared weighted sum of the order statistics (an estimate of the variance that is appropriate under normality) to the usual sum of squared deviations from the mean; values well below 1 indicate non-normality. It is considered one of the most powerful normality tests for small to medium samples (n < 50) and is widely used as a preliminary check before applying parametric tests.

S

Shapley values

Shapley values from cooperative game theory fairly attribute a model prediction to individual features. They satisfy four axioms: efficiency (values sum to the prediction minus baseline), symmetry, dummy, and additivity. SHAP makes them tractable for machine learning models including tree ensembles.

S

Simple random sampling

Simple random sampling selects n observations from a population of N such that every possible sample of size n has an equal probability of selection. It is the reference method against which other sampling designs are evaluated. In practice it requires a complete sampling frame.

S

Skewness

Skewness measures the asymmetry of a distribution around its mean. Positive skewness means a long right tail (typically mean > median); negative skewness means a long left tail (typically mean < median). Symmetric distributions like the normal have skewness zero.

S

Standard deviation

The standard deviation is the square root of the variance and measures the typical distance of observations from the mean (formally, the root-mean-square deviation). It is in the same units as the data, unlike the variance. The sample standard deviation uses n-1 in the denominator (Bessel's correction), which makes the sample variance an unbiased estimator of the population variance; the sample standard deviation itself remains slightly biased.

S

Standard error

The standard error (SE) is the standard deviation of the sampling distribution of a statistic. For the sample mean, SE = σ/√n, estimated by s/√n when the population standard deviation σ is unknown. It measures how precisely the sample statistic estimates the population parameter. Confidence intervals and t-tests use the standard error.

S

Stationarity

A time series is (weakly) stationary if its mean, variance, and autocovariance structure do not change over time. Many classical time series models (ARMA, VAR) assume stationarity, and ARIMA achieves it by differencing the series first. Non-stationary series are transformed through differencing, log transformation, or detrending. The ADF test formally tests the null hypothesis of a unit root (non-stationarity).
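Differencing is the simplest of these transformations. The sketch below applies a first difference to a made-up series with a linear trend; the helper name `difference` is hypothetical, and the example shows only that the trend in the mean is removed, not a formal stationarity test.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

trend = [3 + 2 * t for t in range(6)]   # 3, 5, 7, 9, 11, 13: mean rises over time
diffed = difference(trend)              # constant increments: the trend is gone
```

A seasonal difference works the same way with `lag` set to the seasonal period.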

S

Stratified sampling

Stratified sampling divides the population into homogeneous subgroups (strata) and draws a simple random sample from each. It ensures representation of all subgroups and typically yields more precise estimates than simple random sampling when strata are internally homogeneous.
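A minimal sketch with proportional allocation, assuming each stratum is available as a list of units: the same sampling fraction is applied within every stratum. The strata names, sizes, and the helper `stratified_sample` are hypothetical.

```python
import random

def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the given fraction within each stratum."""
    return {
        name: rng.sample(units, k=max(1, round(fraction * len(units))))
        for name, units in strata.items()
    }

rng = random.Random(7)
strata = {
    "urban": list(range(0, 60)),     # 60 hypothetical units
    "rural": list(range(60, 100)),   # 40 hypothetical units
}
sample = stratified_sample(strata, fraction=0.1, rng=rng)
```

Other allocation rules, such as sampling more heavily from strata with higher variability, would change only how `k` is chosen.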

S

Support vector machine

A support vector machine (SVM) finds the hyperplane that maximizes the margin between two classes. Only the support vectors (points on the margin boundary or, in the soft-margin case, inside it) determine its position. The kernel trick allows nonlinear boundaries by implicitly mapping data to a higher-dimensional space.

S

Systematic sampling

Systematic sampling selects every k-th element from an ordered population after a random start. It is simpler to implement than simple random sampling and often produces similar precision. It can be problematic if the population has a periodic pattern with period equal to the sampling interval.
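In code, the procedure amounts to a random start followed by a fixed stride. The sketch assumes the ordered population fits in a list; the helper name and the population of 100 units are made up.

```python
import random

def systematic_sample(population, k, rng):
    """Pick a random start in [0, k) and then take every k-th element."""
    start = rng.randrange(k)
    return population[start::k]

rng = random.Random(3)
population = list(range(100))   # hypothetical ordered population
sample = systematic_sample(population, k=10, rng=rng)
```

If the population values cycled with period 10, every selected unit would sit at the same point of the cycle, which is the periodicity problem described above.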

T

t-distribution

The t-distribution is a symmetric, bell-shaped distribution with heavier tails than the normal distribution. It arises when estimating the mean of a normally distributed population with unknown variance. As degrees of freedom increase, it converges to the standard normal distribution.

T

t-test

The t-test assesses whether a sample mean differs significantly from a hypothesized value (one-sample), whether two independent group means differ (two-sample), or whether paired differences have a mean of zero (paired). It assumes approximately normal data, and the classical two-sample version also assumes equal variances; Welch's t-test drops the equal-variance assumption.
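As a sketch of the two-sample case, the function below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The data are made up, and the p-value is left out because it needs the t-distribution CDF, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    vx = variance(x) / len(x)   # squared standard error of mean(x)
    vy = variance(y) / len(y)   # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]    # larger mean and larger variance
t_stat, df = welch_t(group_a, group_b)
```

With unequal variances the degrees of freedom fall below the pooled value n₁ + n₂ - 2, which makes the test more conservative.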

T

Time series

A time series is a sequence of observations recorded at successive equally spaced time points. Key properties include trend (long-run direction), seasonality (periodic patterns), and stationarity (constant mean and variance over time). ARIMA models are the classical framework for time series forecasting.

T

Training set vs test set

The training set is used to fit the model; the test set is used to estimate its generalization performance. The test set must not influence any modeling decisions. A validation set is a third split used for hyperparameter tuning, keeping the test set for final evaluation only.
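A three-way split can be sketched as one shuffle followed by slicing, so that the three sets are disjoint by construction. The helper name, the 60/20/20 proportions, and the seed are illustrative choices rather than recommendations.

```python
import random

def three_way_split(items, val_frac, test_frac, rng):
    """Shuffle a copy of items, then cut it into train, validation, and test sets."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = round(test_frac * len(items))
    n_val = round(val_frac * len(items))
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

rng = random.Random(0)
rows = list(range(100))   # hypothetical row indices of a data set
train, val, test = three_way_split(rows, val_frac=0.2, test_frac=0.2, rng=rng)
```

A random shuffle like this suits independent observations. Time-ordered or grouped data usually need a split that respects that structure.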

T

Type I vs Type II error

A Type I error (false positive, with probability α) rejects a true null hypothesis. A Type II error (false negative, with probability β) fails to reject a false null hypothesis. Power = 1 - β. For a fixed sample size and effect size, reducing α (requiring stronger evidence) increases β. The tradeoff between them depends on the relative costs of each mistake.
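The tradeoff can be made explicit for a simple case. The sketch below assumes a one-sided z-test of H₀: μ = 0 against a specific alternative μ = effect > 0 with known σ, and computes β for two choices of α. The function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(alpha, effect, sigma, n):
    """beta for a one-sided z-test of H0: mu = 0 when the true mean is `effect`."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)   # reject H0 when the z statistic exceeds this
    # Under the alternative the z statistic is centered at effect * sqrt(n) / sigma.
    return z.cdf(critical - effect * sqrt(n) / sigma)

beta_05 = type_ii_error(alpha=0.05, effect=0.5, sigma=1.0, n=25)
beta_01 = type_ii_error(alpha=0.01, effect=0.5, sigma=1.0, n=25)
power_05 = 1 - beta_05
```

With everything else held fixed, the stricter α = 0.01 gives the larger β. Increasing n is the way to lower both error rates at once.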

U

Underfitting

Underfitting occurs when a model is too simple to capture the underlying structure of the data, resulting in high training error and high test error. It is caused by excessive regularization, too few parameters, or using the wrong model class. Bias is the dominant source of error.

V

Variance

Variance is the average squared deviation of observations from their mean. Sample variance divides by n-1 to correct for bias. High variance means data points are spread widely around the mean. In machine learning, variance refers to the sensitivity of a model to fluctuations in the training data.

W

Weibull distribution

The Weibull distribution is a flexible continuous distribution used extensively in reliability engineering and survival analysis. Its shape parameter β determines the hazard rate: β < 1 means decreasing hazard (infant mortality), β = 1 reduces to the exponential (constant hazard), and β > 1 means increasing hazard (aging).
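The three regimes follow directly from the hazard function h(t) = (β/η)(t/η)^(β-1), where η denotes the scale parameter (a symbol not used in the entry above and assumed here). A short sketch evaluating it at two time points:

```python
def weibull_hazard(t, shape, scale=1.0):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Compare the hazard at t = 1 and t = 2 for three shape values.
decreasing = (weibull_hazard(1.0, shape=0.5), weibull_hazard(2.0, shape=0.5))
constant = (weibull_hazard(1.0, shape=1.0), weibull_hazard(2.0, shape=1.0))
increasing = (weibull_hazard(1.0, shape=2.0), weibull_hazard(2.0, shape=2.0))
```

With shape 1 the time-dependent factor equals 1 for every t, leaving the constant rate 1/η of the exponential distribution.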

W

Wilcoxon test

The Wilcoxon signed-rank test is a non-parametric alternative to the paired t-test, testing whether the median of paired differences is zero. The Wilcoxon rank-sum test (Mann-Whitney U) is a non-parametric alternative to the two-sample t-test. Both tests use ranks instead of raw values.

X

XGBoost

XGBoost is an optimized gradient boosting algorithm that builds trees sequentially, each correcting the errors of the previous ensemble. It uses second-order Taylor expansion of the loss, explicit L1/L2 regularization on leaf weights, and column subsampling. It is among the most widely used methods for tabular data and has been a frequent winner in tabular data competitions.

Z

z-score

A z-score (standard score) measures how many standard deviations an observation is from the mean: z = (x - μ) / σ. It standardizes variables to a common scale, allowing comparison across different units. Z-scores follow a standard normal distribution when the data are normally distributed.
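The unit-free property is easy to check: the same heights expressed in centimeters and in inches give identical z-scores. The sketch treats the made-up data as the whole population, so it uses `pstdev` for σ.

```python
from statistics import mean, pstdev

def z_scores(xs):
    """Standardize values using the mean and population standard deviation of xs."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

heights_cm = [160, 170, 180]
heights_in = [h / 2.54 for h in heights_cm]   # same people, different unit

z_cm = z_scores(heights_cm)
z_in = z_scores(heights_in)
```

Both lists agree up to floating-point rounding, with mean 0 and standard deviation 1.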

Z

z-test

The z-test is used to test hypotheses about a population mean when the population variance is known or the sample size is large (n > 30). The test statistic follows a standard normal distribution under H₀. In practice, the t-test is preferred because the population variance is rarely known.
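A minimal sketch of a two-sided one-sample z-test, assuming the population standard deviation is known. The function name and the example numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n):
    """Two-sided one-sample z-test with known population standard deviation."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))   # test statistic under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # area in both tails
    return z, p_value

# Hypothetical study: n = 100, sample mean 52, H0: mu = 50, known sigma = 10.
z_stat, p_value = z_test(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)
```

Here the sample mean lies two standard errors above the hypothesized value, so the two-sided p-value falls just below 0.05.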