GLOSSARY OF STATISTICAL TERMS
An extensive glossary of statistical terms and comparisons of statistical concepts
AIC vs BIC
AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are criteria used for model selection. AIC minimizes information loss while BIC penalizes model complexity more heavily, favoring simpler models when sample size is large.
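As a rough illustration, the usual definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L) can be computed directly. The log-likelihoods and parameter counts below are made-up numbers chosen to show the two criteria disagreeing; they are not results from any real model.

```python
import math

def aic(log_likelihood: float, k: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical fits on n = 200 observations (illustrative numbers only).
n = 200
simple_loglik, simple_k = -310.0, 3
complex_loglik, complex_k = -306.5, 6

aic_simple = aic(simple_loglik, simple_k)
aic_complex = aic(complex_loglik, complex_k)
bic_simple = bic(simple_loglik, simple_k, n)
bic_complex = bic(complex_loglik, complex_k, n)

# Lower is better for both criteria.
print("AIC prefers:", "complex" if aic_complex < aic_simple else "simple")
print("BIC prefers:", "complex" if bic_complex < bic_simple else "simple")
```

With these assumed numbers, the heavier ln(n) penalty makes BIC favor the smaller model while AIC favors the larger one.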
Alternative hypothesis
The alternative hypothesis (H₁) states that there is an effect or relationship in the population. It is what the researcher is trying to find evidence for, and is accepted when the null hypothesis is rejected.
ANOVA
Analysis of Variance (ANOVA) is a statistical test used to compare the means of three or more groups simultaneously. It tests whether at least one group mean differs significantly from the others by partitioning total variance into between-group and within-group components.
ARIMA
ARIMA (AutoRegressive Integrated Moving Average) combines autoregression, differencing, and moving average components to model stationary and non-stationary time series. The parameters (p, d, q) represent the AR order, differencing degree, and MA order. It is the standard benchmark model for univariate time series forecasting.
Autocorrelation
Autocorrelation measures the correlation of a time series with a lagged version of itself. Positive autocorrelation means consecutive values tend to be similar; negative autocorrelation means they tend to alternate. Inspecting the autocorrelation function is a standard first diagnostic when modeling a time series.
Bayes' theorem
Bayes' theorem updates the probability of a hypothesis given new evidence. The posterior probability is proportional to the prior probability multiplied by the likelihood. It is the foundation of Bayesian inference and underlies applications such as spam filters, medical diagnosis, and machine learning classifiers.
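A minimal sketch of the update for a diagnostic test follows. The prevalence, sensitivity, and false positive rate are assumed values for illustration only.

```python
def posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """P(disease | positive test) via Bayes' theorem."""
    # Total probability of a positive test (the evidence term).
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Assumed values: 1% prevalence, 95% sensitivity, 5% false positive rate.
p_disease_given_positive = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(p_disease_given_positive, 3))
```

Even with an accurate test, the posterior stays well below 50% here because the prior (prevalence) is small.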
Bayesian vs frequentist inference
Frequentist inference treats parameters as fixed and unknown, making statements about the probability of data given the parameter. Bayesian inference treats parameters as random variables with prior distributions, updating them with data to obtain a posterior distribution.
Bernoulli distribution
The Bernoulli distribution models a single trial with two outcomes: success (1) with probability p and failure (0) with probability 1-p. Its mean is p and variance is p(1-p). It is the simplest discrete distribution and the building block for the binomial distribution.
Beta distribution
The beta distribution is a continuous distribution on [0,1] parameterized by shape parameters α and β. It is widely used as a prior for probabilities in Bayesian inference, for modeling proportions, and as the distribution of order statistics from a uniform distribution.
Bias
Bias is the systematic error of an estimator: the difference between its expected value and the true parameter value. A biased estimator consistently over- or underestimates the truth. Bias can be reduced by using unbiased estimators or correcting for known systematic errors.
Bias-variance tradeoff
The bias-variance tradeoff describes how prediction error decomposes into bias (systematic error from wrong assumptions), variance (sensitivity to training data fluctuations), and irreducible noise. Reducing bias tends to increase variance and vice versa. Regularization and ensemble methods manage this tradeoff.
Binomial distribution
The binomial distribution models the number of successes in n independent Bernoulli trials, each with probability p. Its mean is np and variance is np(1-p). It converges to the normal distribution for large n and to the Poisson distribution when n is large and p is small.
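The mean np and variance np(1-p) can be checked numerically from the probability mass function. This sketch uses illustrative values n = 10 and p = 0.3.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # illustrative values
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed directly from the PMF.
mean = sum(k * q for k, q in enumerate(pmf))
variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
print(mean, n * p)
print(variance, n * p * (1 - p))
```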
Bootstrap
Bootstrap is a resampling method that estimates the sampling distribution of a statistic by repeatedly drawing samples with replacement from the observed data. It is used to compute standard errors and confidence intervals without assuming a parametric distribution.
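One common variant, the percentile bootstrap interval, can be sketched as below. The function name `bootstrap_ci`, the sample values, and the number of resamples are assumptions for illustration; other bootstrap intervals (for example BCa) exist and may behave better.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on samples drawn with replacement.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    alpha = (1 - level) / 2
    lower = estimates[int(alpha * n_resamples)]
    upper = estimates[int((1 - alpha) * n_resamples) - 1]
    return lower, upper

sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.0, 4.8]  # made-up data
low, high = bootstrap_ci(sample)
print(low, high)
```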
Boxplot
A boxplot displays the five-number summary of a dataset: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The box spans the interquartile range (IQR = Q3 - Q1), and points beyond 1.5 × IQR from the box are plotted as outliers.
Central limit theorem
The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the shape of the population distribution, provided the population variance is finite. This justifies the use of normal-based inference for large samples and makes it one of the most important theorems in applied statistics.
Chi-square test
The chi-square test assesses whether observed frequencies in a contingency table differ significantly from expected frequencies under independence. It is used to test associations between categorical variables and to test goodness of fit of a theoretical distribution.
Cluster sampling
Cluster sampling divides the population into groups (clusters), randomly selects some clusters, and surveys all members within those clusters. It is more practical than simple random sampling when the population is geographically dispersed, though it typically yields less precise estimates.
Coefficient of variation
The coefficient of variation (CV) is the ratio of the standard deviation to the mean, expressed as a percentage. It measures relative variability and allows comparison of dispersion across datasets with different units or scales.
Conditional probability
The conditional probability P(A|B) is the probability of event A given that B has occurred, defined as P(A∩B)/P(B). It updates probabilities when partial information is available. Conditional probabilities are the foundation of Bayes' theorem and probability trees.
Confidence interval
A confidence interval is a range of values computed from sample data that, under repeated sampling, would contain the true population parameter a specified percentage of the time. A 95% CI does not mean there is a 95% probability that the parameter lies in this specific interval.
Confidence interval vs prediction interval
A confidence interval quantifies uncertainty about the mean response at a given predictor value. A prediction interval also includes the variability of individual observations around the mean, so at the same confidence level and predictor value it is always wider than the confidence interval.
Confusion matrix
A confusion matrix is a table that summarizes the performance of a classification model. Rows represent actual classes and columns represent predicted classes. It shows true positives, false positives, true negatives, and false negatives, from which accuracy, precision, recall, and F1 score are derived.
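The derived metrics follow directly from the four cell counts. The counts in this sketch are invented for illustration.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # of predicted positives, how many are correct
    recall = tp / (tp + fn)      # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for a binary classifier.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```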
Correlation
Correlation measures the strength and direction of the linear relationship between two continuous variables. Pearson's r ranges from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear relationship. Correlation does not imply causation.
Covariance
Covariance measures the joint variability of two random variables. A positive covariance means both variables tend to increase together; negative means one tends to decrease as the other increases. Correlation is the standardized version of covariance, bounded between -1 and 1.
Cramér's V
Cramér's V measures the strength of association between two categorical variables, derived from the chi-square statistic. It ranges from 0 (no association) to 1 (perfect association) and is comparable across tables of different sizes.
Cross-validation
Cross-validation estimates model generalization performance by repeatedly splitting data into training and validation sets. In k-fold CV, data is divided into k folds; the model is trained on k-1 and evaluated on the remaining fold, repeated k times. It gives a more stable estimate than a single train-test split.
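The k-fold splitting step can be sketched as an index generator. This version uses contiguous, unshuffled folds for simplicity; in practice the data are usually shuffled first, and the helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n: int, k: int):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop

folds = list(k_fold_indices(n=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each observation appears in exactly one validation fold, and the model would be refit on `train_idx` once per fold.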
Cumulative distribution function
The cumulative distribution function (CDF) F(x) gives the probability that a random variable X takes a value less than or equal to x. It is non-decreasing and runs from 0 to 1. For discrete variables it is a step function; for continuous variables it is continuous. The CDF fully characterizes a probability distribution.
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are densely connected within a neighborhood radius ε, classifying sparse points as noise. Unlike K-means, it discovers clusters of arbitrary shape and does not require specifying the number of clusters in advance.
Decision tree
A decision tree partitions the feature space into rectangular regions by recursively splitting on the feature and threshold that best separates the classes. It is highly interpretable but has high variance: small changes in the training data can produce a completely different tree.
Degrees of freedom
Degrees of freedom are the number of independent values in a calculation that are free to vary. When estimating k parameters from n observations, the residual degrees of freedom are n - k. They determine the shape of t, chi-square, and F distributions used in hypothesis tests.
Descriptive vs inferential statistics
Descriptive statistics summarize and describe the observed data using measures like mean, variance, and graphs. Inferential statistics use sample data to make conclusions about a larger population, quantifying uncertainty through confidence intervals and hypothesis tests.
Discrete vs continuous variables
A discrete variable takes countable distinct values (number of defects, number of children). A continuous variable can take any value in an interval (height, temperature, time). The distinction determines which probability distributions and statistical methods are appropriate.
Effect size
Effect size quantifies the magnitude of an effect independently of sample size. Cohen's d measures standardized mean differences; Pearson's r measures correlation strength; eta-squared measures variance explained in ANOVA. Unlike p-values, effect sizes reflect practical significance.
ElasticNet
ElasticNet combines L1 (Lasso) and L2 (Ridge) penalties in a convex mixture controlled by a mixing parameter. It performs variable selection like Lasso while retaining the grouping effect of Ridge, making it preferred when predictors are correlated and the model is sparse.
Entropy
Entropy (Shannon entropy) measures the average uncertainty or information content of a probability distribution. High entropy means outcomes are nearly equally likely; low entropy means one outcome dominates. It is used in decision trees as a splitting criterion and in information theory.
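For a discrete distribution, Shannon entropy in bits is H = -Σ p log₂(p). A small sketch, with example distributions chosen for illustration:

```python
import math

def shannon_entropy(probs) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])      # outcomes equally likely
biased_coin = shannon_entropy([0.9, 0.1])    # one outcome dominates
certain = shannon_entropy([1.0])             # no uncertainty at all
print(fair_coin, biased_coin, certain)
```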
Estimator
An estimator is a function of sample data used to estimate an unknown population parameter. A good estimator is unbiased (expected value equals the true parameter), consistent (converges to the true value as n grows), and efficient (has minimum variance among unbiased estimators).
Exponential distribution
The exponential distribution models the time between events in a Poisson process, with rate parameter λ. Its mean is 1/λ and it is memoryless: the probability of an event in the next instant does not depend on how long you have already waited. It is widely used in reliability and survival analysis.
F-distribution
The F-distribution is a continuous probability distribution that arises as the ratio of two independent chi-square random variables, each divided by its degrees of freedom. It is used in ANOVA F-tests, tests of equality of variances, and overall significance tests in regression.
F-test
The F-test compares two nested models or tests equality of two variances. In regression, the F-test assesses whether at least one predictor is significant. In ANOVA, it tests whether group means differ. The test statistic follows an F-distribution under the null hypothesis.
False positive vs false negative
A false positive (Type I error) occurs when a true null hypothesis is rejected. A false negative (Type II error) occurs when a false null hypothesis is not rejected. In classification, a false positive predicts the positive class incorrectly; a false negative misses a true positive case.
Feature importance
Feature importance measures how much each predictor contributes to a model's predictions. Impurity-based importance sums the impurity reduction across all splits on a variable. Permutation importance measures how much the error increases when a feature's values are randomly shuffled.
Gamma distribution
The gamma distribution generalizes the exponential distribution to model the waiting time until the k-th event in a Poisson process, with shape k and rate λ. It includes the chi-squared distribution (k=ν/2, λ=1/2) and exponential distribution (k=1) as special cases.
Geometric distribution
The geometric distribution models the number of trials needed to get the first success, with success probability p per trial. Its mean is 1/p and it is the only discrete memoryless distribution. It is used in quality control, queuing theory, and modeling first-success phenomena.
Gradient descent
Gradient descent minimizes a differentiable function by iteratively moving in the direction of the negative gradient (steepest descent). The learning rate controls step size. It is the foundational algorithm for training neural networks and fitting logistic regression.
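The update rule x ← x - η·∇f(x) can be sketched on a one-dimensional quadratic. The objective f(x) = (x - 3)², the learning rate, and the step count are illustrative choices.

```python
def gradient_descent(grad, x0: float, learning_rate: float = 0.1, n_steps: int = 100) -> float:
    """Plain gradient descent with a fixed learning rate."""
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * grad(x)  # step against the gradient
    return x

# f(x) = (x - 3)^2 has gradient 2(x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

With this objective, a learning rate above 1.0 would make the iterates diverge, which is why the step size matters.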
Hierarchical clustering
Hierarchical clustering builds a tree of nested clusters (dendrogram) either by merging small clusters bottom-up (agglomerative) or splitting large ones top-down (divisive). The linkage criterion (single, complete, average, Ward) determines how inter-cluster distance is measured. The number of clusters is chosen after inspecting the dendrogram.
Histogram
A histogram displays the distribution of a continuous variable by dividing its range into bins and showing the frequency or density of observations in each bin. Unlike a bar chart, the bins are contiguous. The choice of bin width strongly affects the visual appearance.
Hypergeometric distribution
The hypergeometric distribution models the number of successes when drawing n items without replacement from a population of N items containing K successes. Unlike the binomial, successive draws are dependent. It is used in quality control, genetics, and Fisher's exact test.
Hypothesis testing
Hypothesis testing is a statistical procedure for deciding between two competing hypotheses about a population parameter. It involves specifying H₀ and H₁, computing a test statistic, and comparing it to a critical value or computing a p-value to make a decision at a chosen significance level α.
Independent events
Two events A and B are independent if the occurrence of one does not affect the probability of the other: P(A∩B) = P(A)·P(B). When P(B) > 0, independence is equivalent to P(A|B) = P(A). It is not the same as mutual exclusivity: mutually exclusive events with nonzero probabilities are dependent. Independence of events is fundamental to defining independent random variables.
Interquartile range
The interquartile range (IQR) is the difference between Q3 (75th percentile) and Q1 (25th percentile). It measures the spread of the middle 50% of the data and is robust to outliers. IQR is used to define outlier thresholds in boxplots: points beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR.
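The outlier fences can be computed with the standard library. Quartile conventions differ between software packages; this sketch assumes the "inclusive" method of `statistics.quantiles`, and the data are invented.

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # illustrative values

# Quartiles under the "inclusive" convention (other conventions give other values).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, outliers)
```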
Jackknife
The jackknife is a resampling method that estimates bias and variance by repeatedly leaving out one observation at a time. For n observations, it produces n jackknife samples of size n-1. It is computationally cheaper than bootstrap and particularly useful for bias correction.
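A sketch of the leave-one-out standard error, using the standard jackknife formula SE = sqrt((n-1)/n · Σ(θᵢ - θ̄)²). For the sample mean this reproduces the classical s/√n, which makes a convenient sanity check. The data are made up.

```python
import math
import statistics

def jackknife_se(data, stat=statistics.mean) -> float:
    """Jackknife standard error of a statistic."""
    n = len(data)
    # Recompute the statistic with each observation left out in turn.
    leave_one_out = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = sum(leave_one_out) / n
    return math.sqrt((n - 1) / n * sum((t - loo_mean) ** 2 for t in leave_one_out))

sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]  # illustrative values
se_jackknife = jackknife_se(sample)
se_classical = statistics.stdev(sample) / math.sqrt(len(sample))
print(se_jackknife, se_classical)
```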
K-means clustering
K-means partitions n observations into K clusters by iteratively assigning each point to its nearest centroid and recomputing centroids as cluster means. It minimizes within-cluster sum of squares. K must be specified in advance; K-means++ improves initialization to avoid poor local minima.
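The assign-then-recompute loop (Lloyd's algorithm) can be sketched in plain Python. This version uses simple random initialization rather than K-means++, runs a fixed number of iterations, and is applied to invented, well-separated points.

```python
import random

def kmeans(points, k: int, n_iter: int = 20, seed: int = 0):
    """Basic Lloyd's algorithm with random initialization (not K-means++)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        # Update step: centroid becomes the mean of its cluster (kept if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```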
K-nearest neighbors
KNN classifies a new observation by majority vote among its k nearest training points under a chosen distance metric. It is a non-parametric, instance-based learner with no training phase. Performance degrades in high dimensions due to the curse of dimensionality.
Kolmogorov-Smirnov test
The Kolmogorov-Smirnov (KS) test compares a sample distribution against a reference distribution (one-sample) or compares two sample distributions (two-sample). The test statistic is the maximum absolute difference between the empirical CDF and the reference CDF (one-sample) or between the two empirical CDFs (two-sample). It is a general goodness-of-fit test valid for continuous distributions.
Kruskal-Wallis test
The Kruskal-Wallis test is a non-parametric alternative to one-way ANOVA that tests whether k independent groups come from the same distribution. It uses ranks rather than raw values and does not assume normality. Significant results can be followed by Dunn's post-hoc tests.
Kurtosis
Kurtosis measures the heaviness of the tails of a distribution relative to the normal distribution. Excess kurtosis = kurtosis - 3. Positive excess kurtosis (leptokurtic) indicates heavy tails and more extreme values; negative (platykurtic) indicates light tails.
Lasso regression
Lasso (L1 regularization) adds a penalty proportional to the sum of absolute coefficient values to the OLS loss. Unlike Ridge, Lasso can shrink coefficients to exactly zero, performing automatic variable selection. It is preferred when the true model is sparse.
Law of large numbers
The law of large numbers states that as sample size increases, the sample mean converges to the population mean. The weak law gives convergence in probability; the strong law gives almost sure convergence. It provides the theoretical justification for using sample statistics to estimate population parameters.
Likelihood function
The likelihood function L(θ; x) gives the probability of observing the data x as a function of the parameter θ. Maximum likelihood estimation (MLE) finds the parameter value that maximizes L. Unlike probability, likelihood is not normalized and cannot be interpreted as a probability over θ.
Linear regression
Linear regression models the expected value of a continuous response as a linear combination of predictors, estimated by OLS. It assumes linearity, independence of errors, homoscedasticity, and, for exact inference, normally distributed errors. It is the simplest and most interpretable regression model.
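For a single predictor, the OLS estimates have the closed form slope = Sxy / Sxx and intercept = ȳ - slope·x̄. A minimal sketch on invented data:

```python
def fit_simple_ols(x, y):
    """Closed-form OLS fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

x = [1.0, 2.0, 3.0, 4.0]   # illustrative predictor values
y = [2.1, 3.9, 6.2, 7.8]   # illustrative responses
intercept, slope = fit_simple_ols(x, y)
print(intercept, slope)
```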
Logistic regression
Logistic regression models the probability of a binary outcome using the logistic (sigmoid) function applied to a linear combination of predictors. It is estimated by maximum likelihood and produces coefficients interpretable as log-odds ratios. It is the standard baseline for binary classification.
Maximum likelihood estimation
Maximum likelihood estimation (MLE) finds the parameter values that maximize the probability of observing the sample data. Under standard regularity conditions, MLE estimators are consistent and asymptotically efficient. For normal data, MLE coincides with OLS for the mean; for a binary response modeled with a logit link, it yields the logistic regression estimates.
Mean
The arithmetic mean is the sum of all values divided by the number of observations. It is the most common measure of central tendency but is sensitive to outliers, which can pull it far from the center of the distribution. For skewed data, the median is often more representative.
Mean squared error
Mean squared error (MSE) is the average of squared differences between predicted and observed values. For an estimator, MSE = Bias² + Variance; for predictions, the expected squared error also includes an irreducible noise term. This makes it a natural measure of the bias-variance tradeoff. Root MSE (RMSE) is in the same units as the response and easier to interpret.
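A short sketch of the empirical MSE and RMSE on invented predictions:

```python
import math

def mse(y_true, y_pred) -> float:
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0]   # illustrative observations
y_pred = [2.5, 5.0, 8.0]   # illustrative predictions
error = mse(y_true, y_pred)
rmse = math.sqrt(error)    # back in the units of the response
print(error, rmse)
```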
Median
The median is the middle value of an ordered dataset, dividing the distribution into two equal halves. It is robust to outliers and a better measure of central tendency than the mean for skewed distributions. For an even number of observations, it is the average of the two middle values.
Mode
The mode is the most frequently occurring value in a dataset. A distribution can be unimodal, bimodal, or multimodal. It is the only measure of central tendency applicable to nominal data and is used to describe the most common category.
Monte Carlo simulation
Monte Carlo simulation uses repeated random sampling to estimate numerical quantities that are difficult to compute analytically. It is used to approximate integrals, estimate probabilities, and propagate uncertainty in complex models. The estimation error shrinks as O(1/√n) with the number of samples n.
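A classic illustration is estimating π from the fraction of uniform random points in the unit square that fall inside the quarter circle. The sample size and seed below are arbitrary choices.

```python
import math
import random

def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of pi from points in the unit square."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    # The quarter circle has area pi/4, so scale the hit fraction by 4.
    return 4 * inside / n_samples

pi_hat = estimate_pi(100_000)
print(pi_hat, abs(pi_hat - math.pi))
```

Because the error shrinks like 1/√n, gaining one more correct decimal digit takes roughly 100 times as many samples.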
Multicollinearity
Multicollinearity occurs when two or more predictors in a regression model are highly correlated. It inflates the variance of coefficient estimates, making them unstable and hard to interpret. VIF (variance inflation factor) quantifies it: a common rule of thumb treats VIF > 10 as a sign of problematic collinearity.
Multiple linear regression
Multiple linear regression extends simple linear regression to k predictors. Each coefficient measures the effect of one predictor while holding all others constant. OLS minimizes the sum of squared residuals and requires the design matrix to have full column rank.
Naive Bayes
Naive Bayes classifies by applying Bayes' theorem with the assumption that all features are conditionally independent given the class. Despite this rarely being true, it performs well in practice, especially for text classification. It is fast, interpretable, and works well with small training sets.
Negative binomial distribution
The negative binomial distribution models the number of trials needed to achieve r successes, with probability p per trial. It generalizes the geometric distribution (r=1) and is also used to model overdispersed count data where variance exceeds the mean, as an alternative to Poisson.
Neural networks
A neural network is a computational model composed of layers of interconnected nodes (neurons) that apply weighted linear transformations followed by nonlinear activation functions. Deep networks with many hidden layers learn hierarchical feature representations and achieve state-of-the-art performance on image, text, and sequential data.
Normal distribution
The normal distribution is a symmetric, bell-shaped continuous distribution fully described by its mean μ and standard deviation σ. About 68% of values lie within 1σ of the mean, 95% within 2σ. It arises naturally through the central limit theorem and is central to classical statistical inference.
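The coverage figures can be checked with the standard library's `statistics.NormalDist`; the result is the same for any μ and σ, so the standard normal is used here.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability mass within 1, 2 and 3 standard deviations of the mean.
within_1_sigma = z.cdf(1) - z.cdf(-1)
within_2_sigma = z.cdf(2) - z.cdf(-2)
within_3_sigma = z.cdf(3) - z.cdf(-3)
print(round(within_1_sigma, 4), round(within_2_sigma, 4), round(within_3_sigma, 4))
```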
Null hypothesis
The null hypothesis (H₀) is the default assumption that there is no effect, no difference, or no relationship in the population. Hypothesis testing attempts to find sufficient evidence to reject it in favor of the alternative. Failing to reject H₀ does not prove it is true.
Outlier
An outlier is an observation that lies unusually far from the rest of the data. It can result from measurement errors, data entry mistakes, or genuinely extreme values. Outliers distort means, standard deviations, and regression estimates; the median and IQR are robust alternatives.
Overfitting
Overfitting occurs when a model learns the noise in the training data rather than the underlying pattern, leading to low training error but high test error. It is prevented through regularization, cross-validation, more training data, and simpler models.
p-value
The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one computed from the sample, assuming H₀ is true. A small p-value (typically < 0.05) is taken as evidence against H₀. The p-value does not measure the probability that H₀ is true.
Parametric vs non-parametric tests
Parametric tests assume the data follow a specific distribution (usually normal) and estimate distribution parameters. Non-parametric tests make weaker distributional assumptions and often use ranks instead of raw values. Non-parametric tests are more robust but generally less powerful when parametric assumptions hold.
Pearson vs Spearman correlation
Pearson correlation measures the linear relationship between two continuous variables; it is sensitive to outliers. Spearman correlation measures the monotonic relationship using ranks; it is robust to outliers and applies to ordinal data. Use Spearman when normality cannot be assumed.
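The difference shows up on a relationship that is monotonic but not linear. In this sketch Spearman is computed as the Pearson correlation of the ranks; the `ranks` helper assumes there are no ties (a simplification), and the data are invented.

```python
def pearson(x, y) -> float:
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks 1..n of the values; assumes no ties (simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y) -> float:
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]   # monotonic but nonlinear in x
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(r_pearson, r_spearman)
```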
Percentile
The k-th percentile is the value below which k% of the observations fall. The 25th, 50th, and 75th percentiles are Q1, Q2 (median), and Q3. Percentiles describe the relative standing of a value within a distribution and are used in growth charts, test scores, and income distributions.
Poisson distribution
The Poisson distribution models the number of events occurring in a fixed interval of time or space when events happen at a constant rate λ independently of each other. Its mean and variance are both equal to λ. It approximates the binomial when n is large and p is small.
Population vs sample
A population is the complete set of individuals sharing a common characteristic of interest. A sample is a subset selected from the population for study. Because studying entire populations is usually impractical, inference is made from samples using statistical methods.
Power of a test
The power of a hypothesis test is the probability of correctly rejecting the null hypothesis when it is false (1 - β). Power increases with larger sample sizes, larger effect sizes, and a higher significance level α (a less strict rejection threshold). A power of 0.80 (80%) is conventionally considered adequate.
Principal component analysis
PCA finds a lower-dimensional representation of data by projecting onto the directions of maximum variance (principal components). Components are orthogonal and ordered by variance explained. It is used for dimensionality reduction, visualization, and noise removal before modeling.
Probability density function
The probability density function (PDF) f(x) of a continuous random variable gives the relative likelihood of the variable taking a given value. The probability of the variable falling in an interval [a, b] is the integral of f(x) from a to b. The total area under the PDF equals 1.
Probability mass function
The probability mass function (PMF) of a discrete random variable gives the probability of each possible value. P(X = x) ≥ 0 for all x, and the sum over all possible values equals 1. It is the discrete counterpart of the probability density function.
R-squared
R² (coefficient of determination) measures the proportion of variance in the response explained by the model, ranging from 0 to 1. Adding predictors never decreases R²; adjusted R² penalizes for model complexity. In simple linear regression, R² equals the square of the Pearson correlation.
Random forest
Random forest builds many deep decision trees on bootstrap samples, using a random subset of features at each split to decorrelate trees. Predictions are averaged (regression) or majority-voted (classification). It reduces variance substantially compared to a single tree.
Random variable
A random variable is a function that assigns a numerical value to each outcome of a random experiment. Discrete random variables take countable values; continuous random variables take values in an interval. Their behavior is described by probability distributions.
Regularization
Regularization adds a penalty to the loss function to reduce model complexity and prevent overfitting. L1 regularization (Lasso) encourages sparsity; L2 (Ridge) shrinks all coefficients toward zero. Regularization introduces bias but reduces variance, improving generalization performance.
Residual
A residual is the difference between an observed value and its fitted value from a model: eᵢ = yᵢ - ŷᵢ. Residuals should be approximately normally distributed with constant variance (homoscedasticity) and zero autocorrelation for standard OLS inference to be valid. Residual analysis is the primary regression diagnostic tool.
Ridge regression
Ridge regression (L2 regularization) adds a penalty proportional to the sum of squared coefficients to the OLS loss. It has a closed-form solution, shrinks all coefficients toward zero without eliminating any, and stabilizes estimates under multicollinearity. Preferred when all predictors are expected to contribute.
ROC curve and AUC
The ROC curve plots sensitivity (TPR) against 1-specificity (FPR) across all classification thresholds. AUC (area under the curve) summarizes it: AUC = 0.5 is random; AUC = 1 is perfect. AUC equals the probability that the model ranks a random positive case above a random negative case.
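The ranking interpretation of AUC gives a direct, if quadratic-time, way to compute it: compare every positive score with every negative score, counting ties as half. The scores below are invented.

```python
def auc_pairwise(scores_pos, scores_neg) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(scores_pos) * len(scores_neg))

pos_scores = [0.9, 0.8, 0.55]  # model scores for actual positives (illustrative)
neg_scores = [0.7, 0.4, 0.3]   # model scores for actual negatives (illustrative)
auc = auc_pairwise(pos_scores, neg_scores)
print(auc)
```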
Sample size
Sample size determination calculates the minimum number of observations needed to detect a given effect size with specified power and significance level. Larger samples reduce variance and increase power but cost more. Underpowered studies are a major source of irreproducible research.
Sampling distribution
The sampling distribution of a statistic is the probability distribution of that statistic computed over all possible samples of a given size from the population. The standard deviation of the sampling distribution is the standard error. The central limit theorem describes the sampling distribution of the mean.
Shapiro-Wilk test
The Shapiro-Wilk test assesses whether a sample comes from a normal distribution. Its statistic W is the ratio of a squared linear combination of the ordered observations (proportional to the best linear unbiased estimate of scale under normality) to the sum of squared deviations from the sample mean. It is considered the most powerful normality test for small to medium samples (n < 50) and is widely used as a preliminary check before applying parametric tests.
Shapley values
Shapley values from cooperative game theory fairly attribute a model prediction to individual features. They satisfy four axioms: efficiency (values sum to the prediction minus baseline), symmetry, dummy, and additivity. SHAP makes them tractable for machine learning models including tree ensembles.
Simple random sampling
Simple random sampling selects n observations from a population of N such that every possible sample of size n has an equal probability of selection. It is the reference method against which other sampling designs are evaluated. In practice it requires a complete sampling frame.
Skewness
Skewness measures the asymmetry of a distribution around its mean. Positive skewness means a long right tail (mean > median); negative skewness means a long left tail (mean < median). Symmetric distributions like the normal have skewness zero.
Standard deviation
The standard deviation is the square root of the variance and measures the typical distance of observations from the mean. It is in the same units as the data, unlike the variance. The sample standard deviation uses n-1 in the denominator (Bessel's correction), which makes the sample variance an unbiased estimator of the population variance; its square root is still slightly biased for the population standard deviation.
Standard error
The standard error (SE) is the standard deviation of the sampling distribution of a statistic. For the sample mean, SE = σ/√n. It measures how precisely the sample statistic estimates the population parameter. Confidence intervals and t-tests use the standard error.
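In practice σ is unknown, so the standard error of the mean is usually estimated by plugging in the sample standard deviation: SE ≈ s/√n. A sketch on invented data:

```python
import math
import statistics

sample = [10, 12, 9, 11, 13]  # illustrative measurements

s = statistics.stdev(sample)              # sample SD (n - 1 denominator)
standard_error = s / math.sqrt(len(sample))
print(statistics.mean(sample), s, standard_error)
```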
Stationarity
A time series is stationary if its mean, variance, and autocovariance structure do not change over time. Most time series models (ARIMA, VAR) require stationarity. Non-stationary series are transformed through differencing, log transformation, or detrending. The ADF test formally tests for a unit root (non-stationarity).
Stratified sampling
Stratified sampling divides the population into homogeneous subgroups (strata) and draws a simple random sample from each. It ensures representation of all subgroups and typically yields more precise estimates than simple random sampling when strata are internally homogeneous.
Support vector machine
SVM finds the hyperplane that maximizes the margin between two classes. Only the support vectors (points on the margin boundary) determine its position. The kernel trick allows nonlinear boundaries by implicitly mapping data to a higher-dimensional space.
Systematic sampling
Systematic sampling selects every k-th element from an ordered population after a random start. It is simpler to implement than simple random sampling and often produces similar precision. It can be problematic if the population has a periodic pattern with period equal to the sampling interval.
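A minimal sketch of systematic sampling: pick a random start among the first k positions, then take every k-th unit. The population of 100 ordered units and the helper name are illustrative assumptions.

```python
import random

def systematic_sample(population, k, seed=None):
    """Select every k-th element after a random start in positions 0..k-1."""
    rng = random.Random(seed)
    start = rng.randrange(k)
    return population[start::k]

units = list(range(100))                 # hypothetical ordered population
chosen = systematic_sample(units, k=10, seed=1)
```

If the ordered population repeated with period 10, every selected unit would fall at the same phase of the cycle, which is the periodicity problem noted above.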
t-distribution
The t-distribution is a symmetric, bell-shaped distribution with heavier tails than the normal distribution. It arises when estimating the mean of a normally distributed population with unknown variance. As degrees of freedom increase, it converges to the standard normal distribution.
t-test
The t-test assesses whether a sample mean differs significantly from a hypothesized value (one-sample), whether two independent group means differ (two-sample), or whether paired differences have a mean of zero (paired). It assumes approximately normal data. The classic two-sample version also assumes equal variances; Welch's t-test does not.
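The Welch statistic and its approximate degrees of freedom (the Welch-Satterthwaite formula) can be computed by hand, as sketched below with made-up groups. The p-value is omitted because Python's standard library has no t-distribution CDF; a statistics package would normally supply it.

```python
import math
import statistics

def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = statistics.variance(a), statistics.variance(b)   # n - 1 denominators
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb                                   # squared SE of the mean difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

group_a = [1, 2, 3, 4, 5]        # illustrative data
group_b = [2, 4, 6, 8, 10]       # illustrative data with larger variance
t_stat, df = welch_t(group_a, group_b)
```

With unequal variances the degrees of freedom fall below na + nb - 2, which widens the reference distribution accordingly.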
Time series
A time series is a sequence of observations recorded at successive equally spaced time points. Key properties include trend (long-run direction), seasonality (periodic patterns), and stationarity (constant mean and variance over time). ARIMA models are the classical framework for time series forecasting.
Training set vs test set
The training set is used to fit the model; the test set is used to estimate its generalization performance. The test set must not influence any modeling decisions. A validation set is a third split used for hyperparameter tuning, keeping the test set for final evaluation only.
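A three-way split can be sketched as follows. The 60/20/20 proportions and the helper name are assumptions for illustration, and a random shuffle is only appropriate when rows are independent (not for time series, for example).

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=None):
    """Shuffle once, then cut into disjoint train, validation, and test parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100), seed=7)
```

Hyperparameters would be tuned against val only, with test touched once at the end.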
Type I vs Type II error
A Type I error (false positive, α) rejects a true null hypothesis. A Type II error (false negative, β) fails to reject a false null hypothesis. Power = 1 - β. For a fixed sample size and effect size, reducing α (requiring stronger evidence) increases β. The tradeoff between them depends on the relative costs of each mistake.
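The tradeoff between α and β can be made concrete for a one-sided z-test with known σ. The sketch below assumes that setting; the effect size, σ, and n are illustrative numbers, not values from this glossary.

```python
from statistics import NormalDist

def one_sided_z_power(effect, sigma, n, alpha):
    """Power (1 - beta) of a one-sided z-test for a mean shift of size `effect`."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)            # rejection threshold under H0
    shift = effect * n ** 0.5 / sigma        # mean of the test statistic under H1
    return 1 - z.cdf(z_crit - shift)

power_05 = one_sided_z_power(effect=0.5, sigma=1.0, n=25, alpha=0.05)
power_01 = one_sided_z_power(effect=0.5, sigma=1.0, n=25, alpha=0.01)
```

With everything else held fixed, power_01 is lower than power_05: tightening α raises β.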
Underfitting
Underfitting occurs when a model is too simple to capture the underlying structure of the data, resulting in high training error and high test error. It is caused by excessive regularization, too few parameters, or using the wrong model class. Bias is the dominant source of error.
Variance
Variance is the average squared deviation of observations from their mean. Sample variance divides by n-1 to correct for bias. High variance means data points are spread widely around the mean. In machine learning, variance refers to the sensitivity of a model to fluctuations in the training data.
Weibull distribution
The Weibull distribution is a flexible continuous distribution used extensively in reliability engineering and survival analysis. Its shape parameter β determines the hazard rate: β < 1 means decreasing hazard (infant mortality), β = 1 reduces to the exponential (constant hazard), and β > 1 means increasing hazard (aging).
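The three hazard regimes follow from the two-parameter Weibull hazard function h(t) = (β/η)(t/η)^(β-1). The scale parameter η is an assumption here, since the entry above names only the shape β; the sketch sets η = 1 by default.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

decreasing = [weibull_hazard(t, shape=0.5) for t in (1.0, 2.0)]   # beta < 1
constant = [weibull_hazard(t, shape=1.0) for t in (1.0, 2.0)]     # beta = 1
increasing = [weibull_hazard(t, shape=2.0) for t in (1.0, 2.0)]   # beta > 1
```

Evaluating at two time points is enough to see the hazard fall, stay flat, or rise as β moves through 1.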
Wilcoxon test
The Wilcoxon signed-rank test is a non-parametric alternative to the paired t-test, testing whether the median of paired differences is zero. The Wilcoxon rank-sum test (Mann-Whitney U) is a non-parametric alternative to the two-sample t-test. Both tests use ranks instead of raw values.
XGBoost
XGBoost is an optimized gradient boosting algorithm that builds trees sequentially, each correcting the errors of the previous ensemble. It uses second-order Taylor expansion of the loss, explicit L1/L2 regularization on leaf weights, and column subsampling. It has been a frequent top performer in tabular data competitions.
z-score
A z-score (standard score) measures how many standard deviations an observation is from the mean: z = (x - μ) / σ. It standardizes variables to a common scale, allowing comparison across different units. Z-scores follow a standard normal distribution when the data are normally distributed.
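The formula z = (x - μ) / σ translates directly into code. The sketch treats the data as the whole population (so it uses the population standard deviation); the values are illustrative.

```python
import statistics

def z_scores(xs):
    """Standardize each value: subtract the mean, divide by the population SD."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

data = [2, 4, 4, 4, 5, 5, 7, 9]          # illustrative data: mean 5, SD 2
zs = z_scores(data)
```

The value 9 lies two standard deviations above the mean, so its z-score is 2.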
z-test
The z-test is used to test hypotheses about a population mean when the population variance is known or the sample size is large (n > 30 is a common rule of thumb). The test statistic follows a standard normal distribution under H₀. In practice, the t-test is usually preferred because the population variance is rarely known.
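A two-sided one-sample z-test can be sketched with the standard library's NormalDist. The sample mean, hypothesized mean, σ, and n below are invented numbers, and σ is assumed known.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n):
    """Return the z statistic and two-sided p-value for H0: mean == mu0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

z_stat, p_value = z_test(sample_mean=103, mu0=100, sigma=15, n=36)
```

Here the standard error is 15/6 = 2.5, so z_stat is 1.2 and the p-value is well above 0.05.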