Assumptions of Regression Model
The assumptions of a regression model determine whether your OLS estimates are valid. This guide covers all five core assumptions — linearity, independence, homoscedasticity, normality of residuals, and no multicollinearity — with diagnostic tests, visual checks, and practical fixes for students and researchers.
What & Why
Assumptions of Regression Model — Why They Matter
The assumptions of a regression model are the conditions that must hold for Ordinary Least Squares (OLS) estimates to be statistically valid, reliable, and interpretable. Every regression equation you build — whether you’re predicting exam scores from study hours or forecasting quarterly revenue from advertising spend — rests on a set of mathematical conditions. Violate them, and your coefficients become biased, your standard errors mislead you, and your p-values lie. Understand them, and you become the kind of analyst whose findings actually hold up.
Most students encounter regression in introductory statistics or econometrics classes and walk away knowing the formula. Far fewer understand what makes that formula trustworthy. The assumptions of a regression model aren’t bureaucratic checkboxes — they’re the logical scaffolding on which the whole method stands. When professors or journal reviewers ask “did you check your assumptions?”, they’re asking whether your results are credible or compromised. For a deeper foundation in how regression works as a tool, regression analysis as predictive modeling is worth reading first.
5
Core assumptions of the classical linear regression model; four are Gauss-Markov conditions, and normality is added for exact small-sample inference
BLUE
What OLS becomes when all assumptions hold — Best Linear Unbiased Estimator, proven by the Gauss-Markov theorem
VIF>10
Variance Inflation Factor threshold most widely used to flag problematic multicollinearity in regression predictors
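Four of the five assumptions of a regression model stem from the Gauss-Markov theorem, named after mathematicians Carl Friedrich Gauss and Andrey Markov. This theorem proves that when the model is linear in its parameters, the errors have zero conditional mean, constant variance, and no correlation with one another, and no predictor is an exact linear combination of the others, OLS produces the Best Linear Unbiased Estimator (BLUE) of the regression coefficients. Normality of residuals is not needed for that result; it is the extra assumption that makes t-tests and F-tests exact in small samples. The theorem is foundational to every regression-based analysis in economics, psychology, public health, business, and the social sciences. Gauss-Markov theorem proofs and implications appear across introductory econometrics curricula worldwide.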
This guide covers all five assumptions of the regression model in depth: what each one means mathematically, why it matters in practice, how to detect when it’s violated, and what to do about it. Students at universities across the United States and the United Kingdom — in statistics, econometrics, data science, psychology, and public policy programs — will find every relevant concept here. This is also the foundation you need before studying multiple linear regression or advanced extensions like logistic regression.
The core principle: Most regression assumptions aren’t about the raw data; they’re about the error terms, which you never observe directly and instead estimate with the residuals. Linearity, independence, homoscedasticity, and normality are each a statement about the behavior of the errors that the model cannot explain, while the multicollinearity condition concerns the predictors themselves. Understanding this reframes every diagnostic test you will ever run.
What Is a Regression Model?
A regression model is a statistical framework that estimates the relationship between one or more predictor variables (also called independent variables or features) and a continuous outcome variable (the dependent variable). The standard simple linear regression model takes the form: Y = β₀ + β₁X + ε, where Y is the outcome, X is the predictor, β₀ and β₁ are regression coefficients estimated from data, and ε (epsilon) represents the error term — the portion of Y that X cannot explain. In multiple linear regression, the model extends to: Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε, with k predictors.
The error term ε is where all five assumptions of the regression model live. When we say “regression assumptions,” we mean assumptions about the distribution, behavior, and relationships of these residuals. The OLS method minimizes the sum of squared residuals to estimate coefficients — but the properties of those estimates depend entirely on whether the errors behave as assumed.
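The OLS step described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the simulated data, the seed, and the true coefficients (2.0 and 0.5) are made up for the example and do not come from this guide.

```python
import numpy as np

# Simulated data (illustrative assumption): y = 2.0 + 0.5 * x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# OLS: choose the coefficients that minimize the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta_hat
residuals = y - fitted

print("intercept, slope:", beta_hat)
print("sum of squared residuals:", float(residuals @ residuals))
```

The `residuals` array is the observable stand-in for ε, and it is what every diagnostic later in this guide works with.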
The Five Core Assumptions
The Five Assumptions of Regression Model — Overview
The assumptions of a regression model form a hierarchy. Some are absolutely essential; others matter more in large samples or specific data contexts. But every competent analyst needs to know all five, test all five, and know what to do when any of them fails. Here is the complete set.
1
Linearity
The relationship between each predictor and the outcome is linear. The model form Y = Xβ + ε is correctly specified.
2
Independence
Observations are independent of each other. Residuals show no autocorrelation or serial correlation across observations.
3
Homoscedasticity
The variance of the residuals is constant across all levels of the predictor variables. No fan-shaped spreading of errors.
4
Normality of Residuals
The error terms follow an approximately normal distribution. Required for valid hypothesis testing on coefficients.
5
No Multicollinearity
Predictor variables are not highly correlated with each other. High correlation inflates coefficient variance and makes estimates unstable.
Some textbooks and courses present these as four assumptions (collapsing independence and no-autocorrelation) or as seven (splitting linearity into two). The five-assumption framework above is the most widely used in U.S. econometrics and statistics courses, including at institutions like MIT, Harvard, Stanford, University of Chicago, and in the UK at London School of Economics and Oxford. For a broader look at statistical modeling methods that build on these assumptions, see model selection with AIC and BIC.
Each assumption carries consequences when violated. Violating linearity produces systematically biased predictions. Violating independence understates standard errors. Violating homoscedasticity produces inefficient estimates. Violating normality invalidates small-sample hypothesis tests. Violating no-multicollinearity inflates the variance of coefficient estimates, making predictors appear statistically insignificant even when they have real effects. The residual analysis guide goes deeper on diagnosing all of these visually and statistically.
Assumption 1
Linearity — The Foundational Assumption of Regression
The linearity assumption of a regression model states that the relationship between the predictor variables and the outcome variable is linear. Formally, this means the conditional mean of Y given X is a linear function of X: E(Y|X) = β₀ + β₁X. This is not saying that the data must fall perfectly on a straight line. It says that the average value of Y changes linearly with X. Regression residuals capture the individual deviations around that linear trend.
Linearity is the foundational assumption because OLS is, by construction, a method for fitting linear relationships. If the true relationship between your variables is quadratic, exponential, or follows a different curve, fitting a straight line through the data produces systematically biased predictions — your model overestimates in some ranges and underestimates in others, in a patterned, non-random way. This pattern shows up clearly in diagnostic plots.
How to Test the Linearity Assumption
Residuals vs. Fitted Values Plot
The primary diagnostic for linearity is the residuals vs. fitted values plot. After running your regression, plot the residuals on the y-axis against the fitted (predicted) values on the x-axis. If the linearity assumption holds, you should see a random scatter of residuals around zero — no systematic curve, no U-shape, no pattern. Any consistent curve or wave in this plot signals a nonlinear relationship that your linear model is not capturing.
What you want to see: A flat, horizontal band of randomly scattered points centered at zero in the residuals vs. fitted plot. No trends. No curves. No patterns.
What signals a problem: A U-shape or inverted U-shape in the residuals — meaning the model over-predicts at both extremes and under-predicts in the middle, or vice versa. This is the clearest visual signature of a violated linearity assumption.
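The U-shape can also be checked numerically. The sketch below assumes NumPy and uses a deliberately noise-free quadratic relationship (an illustrative choice, not real data): a straight line is fitted anyway, and the residuals are averaged within the lower, middle, and upper third of the fitted values.

```python
import numpy as np

# Deterministic example (illustrative): the true relationship is quadratic
x = np.linspace(-3, 3, 61)
y = 1.0 + x + x**2

# Fit a straight line anyway
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
residuals = y - fitted

# Average residual in the lower, middle, and upper third of fitted values
order = np.argsort(fitted)
low, mid, high = np.array_split(residuals[order], 3)
print("mean residual by third:", low.mean(), mid.mean(), high.mean())
```

Positive averages at both ends with a negative average in the middle are the numerical counterpart of the U-shape; under linearity all three averages would sit near zero.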
Partial Regression Plots
In multiple regression, partial regression plots (also called added-variable plots) allow you to assess linearity for each predictor individually, after accounting for the effects of all other predictors. These are available in R (via the car package), Python (via statsmodels), and SPSS. If a partial regression plot shows a nonlinear pattern for one predictor, that predictor’s relationship with the outcome needs transformation.
Component-Plus-Residual Plots
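Component-plus-residual plots (also called partial residual plots) are a related diagnostic that can detect nonlinearity even when the overall residuals vs. fitted plot appears acceptable. CERES plots generalize them and remain reliable when the predictors are nonlinearly related to one another. Both are particularly useful in complex multiple regression models with many predictors.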
What Causes Linearity Violations?
The most common cause of violated linearity in the regression model is model misspecification. You’ve assumed a straight-line relationship where the true relationship curves. Other causes include failing to include an important quadratic or interaction term, and omitting a relevant variable that creates an apparent nonlinear pattern. This connects to the broader problem of correlation vs. causation — patterns in data don’t automatically reveal the correct functional form.
How to Fix a Linearity Violation
1
Transform the Variables
A log transformation of the outcome variable (Y → ln Y) is the most common fix when the residuals show a right-skewed fan shape. Transforming the predictor (X → ln X, or X → √X) works when the predictor-outcome relationship curves but the residuals’ variance is stable. Simple linear regression guides often cover log-linear and log-log model forms.
2
Add Polynomial Terms
If the relationship between X and Y is quadratic, add X² to the model. If it’s cubic, add X³. This is the foundation of polynomial regression. You’re still running a linear regression (linear in the parameters), but you’re modeling a curved relationship between X and Y by including powered versions of X as additional predictors.
3
Use a Nonlinear Model
If transformations don’t resolve the issue and the relationship is fundamentally nonlinear, consider switching to a generalized linear model (GLM), a spline regression, or a machine learning model better suited to nonlinear data. Generalized Linear Models (GLMs) extend regression to handle non-normal outcome distributions and nonlinear link functions.
4
Check for Omitted Variables
Sometimes apparent nonlinearity reflects an omitted variable. If a variable that belongs in the model is missing, the residuals can exhibit systematic patterns that mimic nonlinearity. Re-specify the model by adding theoretically relevant variables before concluding that the relationship itself is nonlinear.
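Of the fixes above, adding a polynomial term is the easiest to demonstrate. The sketch below assumes NumPy and reuses a noise-free quadratic relationship purely for illustration; `fit_ssr` is a hypothetical helper name introduced here.

```python
import numpy as np

x = np.linspace(-3, 3, 61)
y = 1.0 + x + x**2  # illustrative curved relationship, no noise

def fit_ssr(X, y):
    """Return (sum of squared residuals, coefficients) from an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid), beta

X_linear = np.column_stack([np.ones_like(x), x])
X_quad = np.column_stack([np.ones_like(x), x, x**2])  # add the squared term

ssr_linear, _ = fit_ssr(X_linear, y)
ssr_quad, beta_quad = fit_ssr(X_quad, y)

print("SSR, straight line:", ssr_linear)
print("SSR, with x squared:", ssr_quad)
print("quadratic-model coefficients:", beta_quad)
```

The model with the squared term is still linear in its parameters, so OLS applies unchanged; only the design matrix gained a column.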
Student Tip: Check Scatterplots Before Running Regression
Before fitting any regression model, plot each predictor against the outcome in a scatterplot. If you see a curve, you already know linearity may be violated and you can address it before running the model. Checking assumptions after fitting the model is valid — but checking potential assumption violations before fitting saves time and prevents embarrassing discoveries in your assignment write-up.
Assumption 2
Independence of Observations — Detecting Autocorrelation
The independence assumption of a regression model requires that the observations in your dataset are independent of each other. Formally, this means that the error terms are uncorrelated: Cov(εᵢ, εⱼ) = 0 for all i ≠ j. Each observation’s deviation from the regression line should have no systematic relationship with any other observation’s deviation. When this assumption fails, we have autocorrelation (also called serial correlation).
Independence violations are most common in time-series data (where today’s error predicts tomorrow’s), panel data (where multiple observations from the same individual are correlated), spatial data (where geographically close observations tend to be similar), and clustered data (where observations within schools, hospitals, or firms share common characteristics). For students studying economics, public health, sociology, or any field that tracks outcomes over time, this is an assumption you will confront constantly.
What Is Autocorrelation?
Autocorrelation occurs when a residual at one time point or observation is correlated with a residual at another. Positive autocorrelation — the most common type — means that large positive residuals tend to follow large positive residuals, and large negative residuals follow negative ones. This is extremely common in economic time series (GDP, inflation, stock prices) where values in consecutive periods are naturally related. Negative autocorrelation means residuals alternate in sign, which is rarer but occurs in data with suppressed oscillations.
The consequence of autocorrelation is not biased coefficient estimates — OLS coefficients remain unbiased even with autocorrelation. The problem is that standard errors are underestimated. When residuals are positively autocorrelated, each new observation contains less independent information than it appears to. OLS doesn’t account for this, so it treats the data as if it had more independent information than it does, producing standard errors that are too small. This makes t-statistics too large and p-values too small — you reject null hypotheses you shouldn’t, committing Type I errors. See Type I and Type II errors for why this matters.
How to Test for Autocorrelation
The Durbin-Watson Test
The Durbin-Watson (DW) test is the most widely used formal test for first-order autocorrelation in regression residuals. The DW statistic ranges from 0 to 4. A value near 2 indicates no autocorrelation. Values below 2 suggest positive autocorrelation; values above 2 suggest negative autocorrelation. A common rule of thumb: DW values between 1.5 and 2.5 are acceptable; outside that range, further investigation is warranted. The DW test is available in SPSS, Stata, R, and Python’s statsmodels.
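The statistic itself is simple: the sum of squared successive differences of the residuals divided by the sum of squared residuals. A minimal sketch, assuming NumPy and two made-up residual sequences chosen so the arithmetic is easy to follow:

```python
import numpy as np

def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Illustrative residual sequences (invented for this sketch)
runs = np.array([1, 1, 1, 1, -1, -1, -1, -1])         # long runs of one sign
alternating = np.array([1, -1, 1, -1, 1, -1, 1, -1])  # sign flips every step

print(durbin_watson(runs))         # 0.5, well below 2
print(durbin_watson(alternating))  # 3.5, well above 2
```

Long runs of same-signed residuals make successive differences small and push the statistic toward 0, while constant sign flips push it toward 4.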
Residuals vs. Time Plot
If your data has a natural ordering (time, space, or another sequence), plot the residuals in order. If you see a wave-like pattern — runs of positive residuals followed by runs of negative residuals, or vice versa — positive autocorrelation is likely present. A random scatter around zero, with no tendency for runs, suggests independence.
Breusch-Godfrey Test
For higher-order autocorrelation — where residuals at time t are correlated with residuals at t-2, t-3, or further — the Breusch-Godfrey LM test is more powerful than the Durbin-Watson test. It’s also valid when the model includes lagged dependent variables, where Durbin-Watson is not.
How to Fix Autocorrelation
When your regression model shows autocorrelation, several remedies exist. First, check whether the autocorrelation is actually caused by a model specification error — an omitted variable, a missing lagged term, or a structural break in the data. Fix the specification first before applying statistical corrections.
If autocorrelation persists after re-specification, use Newey-West heteroscedasticity and autocorrelation consistent (HAC) standard errors, which adjust standard errors without changing the coefficient estimates. In time-series contexts, adding an AR(1) error term or using Generalized Least Squares (GLS) with a Cochrane-Orcutt or Prais-Winsten transformation directly models the autocorrelation structure. For panel data, clustered standard errors account for within-group correlation. The concept of time series analysis with ARIMA is particularly relevant for longitudinal regression contexts.
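To make the Newey-West idea concrete, here is a hedged sketch that builds the HAC covariance by hand with NumPy, using Bartlett kernel weights. The function name `newey_west_se`, the simulated AR(1) series, and the lag choice of 8 are all assumptions made for illustration; in practice you would rely on your statistics package's implementation.

```python
import numpy as np

def newey_west_se(X, residuals, max_lag):
    """HAC standard errors with Bartlett weights (no small-sample correction)."""
    Xe = X * residuals[:, None]          # rows are x_t * e_t
    S = Xe.T @ Xe                        # lag-0 term
    for lag in range(1, max_lag + 1):
        w = 1.0 - lag / (max_lag + 1.0)  # Bartlett kernel weight
        G = Xe[lag:].T @ Xe[:-lag]       # cross products at this lag
        S += w * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.sqrt(np.diag(XtX_inv @ S @ XtX_inv))

# Simulated series (illustrative): AR(1) predictor and AR(1) errors, rho = 0.8
rng = np.random.default_rng(1)
n, rho = 500, 0.8
x = np.zeros(n)
u = np.zeros(n)
for t in range(1, n):
    x[t] = rho * x[t - 1] + rng.normal()
    u[t] = rho * u[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + u

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Conventional OLS standard errors for comparison
sigma2 = resid @ resid / (n - X.shape[1])
ols_se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
hac_se = newey_west_se(X, resid, max_lag=8)

print("coefficients:", beta_hat)
print("OLS SE:", ols_se)
print("HAC SE:", hac_se)
```

The coefficient estimates are identical in both cases; only the standard errors change, which is exactly the point of the correction.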
⚠️ Cross-sectional vs. time-series context: In purely cross-sectional data — a random sample of individuals surveyed at a single point in time — independence usually holds by design, assuming proper random sampling. The independence assumption becomes critical and technically demanding when you work with time-series, longitudinal, panel, or spatially structured data. Always consider your data structure before dismissing this assumption.
Assumption 3
Homoscedasticity — Constant Variance of Residuals
The homoscedasticity assumption of the regression model requires that the variance of the error terms is constant across all values of the predictor variables. Formally: Var(εᵢ) = σ² for all i. The word itself comes from Greek — “homo” meaning same, “skedasis” meaning dispersion. When this assumption fails, the variance of residuals changes systematically with the predictors or with the fitted values. That condition is called heteroscedasticity.
Heteroscedasticity does not bias OLS coefficient estimates: they remain unbiased and consistent. But it makes OLS inefficient, because the estimates no longer have the minimum variance among all linear unbiased estimators. More importantly, heteroscedasticity invalidates the conventional standard errors. These assume constant error variance; when that’s false, they can be too small or too large, depending on how the variance relates to the predictors. This produces unreliable hypothesis tests. Understanding hypothesis testing is essential context for appreciating why this matters.
What Causes Heteroscedasticity?
Heteroscedasticity is common in economics, finance, and any cross-sectional study where the units of observation vary widely in size or scale. A classic example: when regressing household expenditure on household income, the residuals tend to be larger for high-income households (which have more flexibility in their spending choices) and smaller for low-income households (whose spending closely tracks income out of necessity). Another common example: in finance, stock return volatility (the spread of residuals around the mean) tends to cluster, with high-volatility periods following high-volatility periods, a phenomenon modeled by ARCH (Autoregressive Conditional Heteroscedasticity).
Heteroscedasticity can also arise from a skewed distribution of a predictor variable, from measurement error that varies across observations, or from a misspecified functional form — which is why linearity checks come before homoscedasticity checks.
How to Detect Heteroscedasticity
Residuals vs. Fitted Values Plot
The same plot used to check linearity also reveals heteroscedasticity. Look for a fan shape — residuals that spread out as fitted values increase (or decrease). A narrow band at low fitted values that widens at high fitted values is the classic signature of positive heteroscedasticity. An hourglass shape indicates heteroscedasticity that increases then decreases. Any systematic change in the spread of residuals across fitted values signals a violation.
Scale-Location Plot (Spread-Location Plot)
The Scale-Location plot plots the square root of the absolute standardized residuals against fitted values. A horizontal flat line with evenly spread points indicates homoscedasticity. An upward trend in this plot signals that residual variance grows with the fitted values — positive heteroscedasticity.
Breusch-Pagan Test
The Breusch-Pagan test formally tests the null hypothesis of homoscedasticity. It regresses the squared residuals on the predictor variables and tests whether any predictor can explain variation in the squared residuals. A significant result (typically p < 0.05) rejects homoscedasticity and confirms heteroscedasticity. Available in R (lmtest package), Python (statsmodels), Stata, and SPSS.
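The auxiliary regression can be written out directly. The sketch below assumes NumPy and implements the studentized (Koenker) form of the test, n times the R² of the auxiliary regression, for a single predictor so that the chi-square(1) p-value can be computed with the standard library. The function name and the simulated data are illustrative assumptions.

```python
import math
import numpy as np

def breusch_pagan_one_predictor(x, y):
    """Studentized (Koenker) Breusch-Pagan test for a one-predictor model."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ beta) ** 2

    # Auxiliary regression: squared residuals on the predictor
    gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)
    ss_res = np.sum((e2 - X @ gamma) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    r2 = max(1.0 - ss_res / ss_tot, 0.0)

    lm = n * r2
    # Upper-tail probability of chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(1, 10, size=n)
y_const = 1.0 + 2.0 * x + rng.normal(0, 3, size=n)  # constant error variance
y_fan = 1.0 + 2.0 * x + x * rng.normal(size=n)      # error SD grows with x

lm_const, p_const = breusch_pagan_one_predictor(x, y_const)
lm_fan, p_fan = breusch_pagan_one_predictor(x, y_fan)
print("constant variance: LM, p =", lm_const, p_const)
print("fan-shaped errors: LM, p =", lm_fan, p_fan)
```

With more than one predictor the degrees of freedom equal the number of predictors in the auxiliary regression, and the p-value needs a general chi-square distribution function rather than the one-degree-of-freedom shortcut used here.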
White Test
The White test is a more general version of Breusch-Pagan. It also includes squared predictors and cross-products (interaction terms) in the auxiliary regression, making it powerful against a wider range of heteroscedasticity patterns. The White test is particularly useful when you have no strong prior about which predictor is driving the changing variance.
How to Fix Heteroscedasticity
The most widely used remedy for heteroscedasticity in modern regression practice is robust standard errors (also called heteroscedasticity-consistent standard errors, or HC standard errors). These adjust the standard errors to account for the non-constant variance without altering the coefficient estimates. In R, the sandwich package provides robust standard errors. In Stata, adding the robust option to the regress command does this in a single command. In Python, statsmodels exposes them through covariance types: HC0 is White’s original correction, and HC3 is a variant that is more reliable in small samples.
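For readers who want to see what these corrections compute, here is a hand-rolled sketch of the "sandwich" formula, assuming NumPy. The helper name `robust_se` and the simulated data are illustrative; HC0 weights each observation by its squared residual, and HC3 additionally divides by (1 − leverage)².

```python
import numpy as np

def robust_se(X, y, kind="HC3"):
    """OLS coefficients with heteroscedasticity-consistent standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    XtX_inv = np.linalg.inv(X.T @ X)
    if kind == "HC0":
        w = e ** 2
    elif kind == "HC3":
        h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # leverages h_ii
        w = (e / (1.0 - h)) ** 2
    else:
        raise ValueError("kind must be 'HC0' or 'HC3'")
    meat = X.T @ (X * w[:, None])
    cov = XtX_inv @ meat @ XtX_inv                   # bread * meat * bread
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(1, 10, size=n)
y = 1.0 + 2.0 * x + x * rng.normal(size=n)  # error SD grows with x
X = np.column_stack([np.ones(n), x])

beta, se_hc0 = robust_se(X, y, kind="HC0")
_, se_hc3 = robust_se(X, y, kind="HC3")
print("coefficients:", beta)
print("HC0 SE:", se_hc0)
print("HC3 SE:", se_hc3)
```

HC3 standard errors are never smaller than HC0, because the leverage adjustment can only increase each observation's weight.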
Weighted Least Squares (WLS) is a more direct fix. It assigns greater weight to observations with smaller residual variance (more reliable observations) and less weight to those with larger variance. WLS produces efficient estimates when the pattern of heteroscedasticity is known or can be estimated. Log transformation of the outcome variable — when appropriate — often stabilizes variance, converting multiplicative heteroscedasticity into approximate homoscedasticity. For regression contexts involving regularization, ridge and LASSO regression offer alternative modeling approaches that add other forms of robustness.
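WLS amounts to running OLS after rescaling every row by the square root of its weight. A minimal sketch, assuming NumPy and assuming the variance pattern is known (error standard deviation proportional to x, a choice made only for this example):

```python
import numpy as np

def wls(X, y, weights):
    """Weighted least squares via OLS on rows rescaled by sqrt(weight)."""
    sw = np.sqrt(np.asarray(weights, dtype=float))
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(1, 10, size=n)
y = 1.0 + 2.0 * x + x * rng.normal(size=n)  # error SD proportional to x
X = np.column_stack([np.ones(n), x])

beta_ols = wls(X, y, np.ones(n))    # equal weights reproduce OLS
beta_wls = wls(X, y, 1.0 / x ** 2)  # weight = 1 / error variance

print("OLS:", beta_ols)
print("WLS:", beta_wls)
```

When the weights are only estimated rather than known, the efficiency gain is no longer guaranteed, which is one reason robust standard errors are often the first choice.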
✓ Homoscedastic Residuals
- Residuals scatter randomly around zero at all levels of fitted values
- No fan shape or systematic widening of residual spread
- Scale-Location plot shows a flat horizontal trend line
- Breusch-Pagan test p-value above 0.05 — fail to reject homoscedasticity
- OLS standard errors are valid and efficient
✗ Heteroscedastic Residuals
- Residuals form a fan shape — wider spread at high or low fitted values
- Scale-Location plot shows an upward or downward trend
- Breusch-Pagan test p-value below 0.05 — reject homoscedasticity
- OLS standard errors are unreliable — too small or too large
- Hypothesis tests on coefficients cannot be trusted without correction
An important related topic is residual analysis, which ties all these visual diagnostics together into a systematic workflow for evaluating your regression model’s overall validity.
Assumption 4
Normality of Residuals — What It Is and Why It Matters
The normality assumption of the regression model states that the error terms εᵢ are normally distributed with mean zero and constant variance σ²: εᵢ ~ N(0, σ²). Normality of residuals is the assumption that underpins the validity of t-tests on individual coefficients and the F-test on the overall model in small samples.
An important nuance: in large samples, normality matters less because of the Central Limit Theorem. The Central Limit Theorem guarantees that the sampling distribution of OLS coefficient estimates approaches normality as sample size grows, even if the residuals themselves are not perfectly normal. A commonly cited rule of thumb: with n > 30 observations per predictor, mild non-normality has negligible impact on inference. With small samples — say, n < 30 — non-normality can meaningfully distort p-values and confidence intervals.
What Causes Non-Normality of Residuals?
Non-normal residuals most commonly arise from: a skewed outcome variable (common in income data, insurance claims, and count data where values are bounded at zero and right-skewed); the presence of outliers that create heavy tails in the residual distribution; a misspecified model (correcting the functional form often normalizes the residuals as well); or a binary or count outcome variable for which a Gaussian (normal) error distribution is inappropriate from the outset. Understanding normal distributions, kurtosis, and skewness gives the foundational background for reading residual distributions.
How to Test Normality of Residuals
Q-Q Plot (Quantile-Quantile Plot)
The Normal Q-Q plot is the most visually intuitive diagnostic for normality. It plots the quantiles of your standardized residuals against the quantiles of a theoretical normal distribution. If residuals are normally distributed, the points fall approximately along a straight 45-degree diagonal line. Departures from this line reveal the nature of the non-normality. An S-shaped pattern, with points below the line at the left end and above it at the right end, indicates heavy tails. A bowed curve, with both ends bending to the same side of the line, indicates skewness. Isolated points far from the line are potential outliers.
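The coordinates of a Q-Q plot are easy to compute by hand. The sketch below assumes NumPy plus the standard library's `statistics.NormalDist`; the plotting positions (i − 0.5)/n are one common convention among several, and both samples are constructed deterministically for illustration.

```python
from statistics import NormalDist
import numpy as np

def qq_points(residuals):
    """Return (theoretical normal quantiles, sorted standardized residuals)."""
    e = np.sort(np.asarray(residuals, dtype=float))
    z = (e - e.mean()) / e.std(ddof=1)
    n = len(z)
    probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions
    theo = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return theo, z

n = 100
probs = (np.arange(1, n + 1) - 0.5) / n
normal_like = np.array([NormalDist(mu=0, sigma=2).inv_cdf(float(p)) for p in probs])
right_skewed = -np.log(1.0 - probs)  # exponential quantiles

for name, sample in [("normal-like", normal_like), ("right-skewed", right_skewed)]:
    theo, z = qq_points(sample)
    r = np.corrcoef(theo, z)[0, 1]
    print(f"{name}: Q-Q correlation = {r:.3f}")
```

Plotting `z` against `theo` gives the Q-Q plot itself; the correlation between the two is a one-number summary of how closely the points follow a straight line.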
Shapiro-Wilk Test
The Shapiro-Wilk test is the most powerful formal test for normality in small to medium samples (n < 2000). It tests the null hypothesis that the residuals are normally distributed. A significant result (p < 0.05) indicates non-normality. However, in very large samples, the Shapiro-Wilk test will reject normality for even trivially small deviations from normality that have no practical consequence. Always pair the formal test with visual inspection via the Q-Q plot.
Kolmogorov-Smirnov and Anderson-Darling Tests
The Kolmogorov-Smirnov (KS) test and the Anderson-Darling test are alternatives to Shapiro-Wilk. Anderson-Darling is more sensitive to deviations in the tails of the distribution, making it useful when you’re concerned about extreme values or heavy-tailed residuals. The KS test is more conservative and better suited to larger samples. All three are available in R, Python, SPSS, and Stata.
Histogram of Residuals
A simple histogram of residuals with a normal distribution overlay offers an intuitive visual check. Look for a roughly bell-shaped distribution centered at zero. Significant skew (most values piled to one side) or obvious bimodality (two humps) signals non-normality worth investigating. The histogram complements the Q-Q plot — use both together for a complete picture. Understanding probability distributions helps you interpret what these plots are telling you.
How to Fix Non-Normal Residuals
The most common and effective fix is variable transformation. Log-transforming a right-skewed outcome variable (like income, hospital costs, or reaction times) typically produces more normally distributed residuals. Square root transformations work well for count data. Box-Cox transformations offer a data-driven way to find the optimal transformation parameter.
If non-normality is driven by outliers, investigate those observations individually. Are they data entry errors? Genuinely extreme cases? Influential observations? Use Cook’s distance and leverage statistics to identify and assess them. Removing outliers arbitrarily is methodologically problematic; the right approach is to understand why they’re outliers and report them transparently.
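Leverage and Cook’s distance can be computed from the hat matrix. This is a minimal sketch, assuming NumPy and a tiny made-up dataset in which the last observation sits far from the others in both x and y:

```python
import numpy as np

# Illustrative data: ten points near the line y = 2x, plus one unusual point
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 25], dtype=float)
y = 2.0 * x + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.0])
y[-1] = 10.0  # far below the value the other points would predict

X = np.column_stack([np.ones_like(x), x])
n, k = X.shape
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Cook's distance combines residual size and leverage
s2 = resid @ resid / (n - k)
cooks_d = (resid ** 2 / (k * s2)) * leverage / (1.0 - leverage) ** 2

print("leverage:", np.round(leverage, 3))
print("Cook's distance:", np.round(cooks_d, 3))
print("most influential observation index:", int(np.argmax(cooks_d)))
```

A large Cook’s distance flags an observation for investigation; it is not, by itself, a justification for deleting it.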
For binary outcomes (yes/no, pass/fail, disease/no disease), the normal error assumption is wrong by construction — use logistic regression instead. For count outcomes (number of events, hospital visits, crimes), use Poisson or negative binomial regression. These generalized linear models replace the normality assumption with appropriate distributions for the outcome type. Poisson distribution and its applications in regression are covered in detail elsewhere on this site.
Large Sample? Normality Matters Less
With a large sample (n > 200), the Central Limit Theorem makes your regression inference approximately valid even with non-normal residuals. In large samples, focus your energy on linearity, homoscedasticity, and multicollinearity — these don’t self-correct with larger n. Normality is the one assumption that becomes less critical as your dataset grows. With small samples, however, it’s the assumption most likely to invalidate your t-tests and confidence intervals, so check it carefully. For a full treatment of how sampling variability works, see sampling distributions.
Assumption 5
No Multicollinearity — Why Correlated Predictors Break Regression
The no multicollinearity assumption requires that the predictor variables in a regression model are not highly correlated with each other. In simple linear regression with only one predictor, multicollinearity cannot exist. It becomes relevant in multiple regression, where two or more predictors may be measuring related or overlapping constructs. When predictor variables are highly correlated, the OLS algorithm cannot distinguish their independent effects on the outcome — and coefficient estimates become extremely unstable, with inflated variance and dramatically wide confidence intervals.
Perfect multicollinearity — where one predictor is an exact linear combination of others — makes matrix inversion mathematically impossible and the regression cannot be estimated at all. Most real-world multicollinearity is imperfect (high but not perfect correlation), and OLS can technically produce estimates — but those estimates are unreliable, sensitive to small changes in the data, and often misleading. The multiple linear regression guide covers the mechanics of how predictor correlation affects the coefficient estimation process in detail.
What Causes Multicollinearity?
Multicollinearity arises naturally when predictors measure related constructs: height and weight in a health study; income and education in a social science model; advertising spending on TV and digital when a firm always allocates budget in a fixed ratio; age and years of experience in a labor economics regression. It’s also created artificially by including dummy variables for all categories of a categorical variable without dropping one (the dummy variable trap), or by including both a variable and a transformation of that variable without considering their correlation.
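The dummy variable trap is a case of perfect multicollinearity that shows up as a rank-deficient design matrix. A small sketch, assuming NumPy and a made-up three-level categorical predictor:

```python
import numpy as np

# Illustrative categorical predictor with three levels
groups = np.array(["A", "B", "C", "A", "B", "C", "A", "B"])
levels = np.unique(groups)

# One 0/1 column per level
dummies = (groups[:, None] == levels[None, :]).astype(float)
intercept = np.ones((len(groups), 1))

X_trap = np.hstack([intercept, dummies])       # intercept + all three dummies
X_ok = np.hstack([intercept, dummies[:, 1:]])  # drop the first level as baseline

print(X_trap.shape[1], np.linalg.matrix_rank(X_trap))  # 4 columns, rank 3
print(X_ok.shape[1], np.linalg.matrix_rank(X_ok))      # 3 columns, rank 3
```

Because the three dummy columns always sum to the intercept column, the first design matrix has fewer independent columns than coefficients to estimate; dropping one level restores full column rank.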
How to Detect Multicollinearity
Variance Inflation Factor (VIF)
The Variance Inflation Factor (VIF) is the standard diagnostic for multicollinearity. For each predictor, VIF measures how much its variance is inflated due to correlation with other predictors. A VIF of 1 indicates no inflation — that predictor is uncorrelated with others. A VIF of 5 means the variance is inflated fivefold. The commonly used thresholds are: VIF > 5 warrants concern; VIF > 10 indicates severe multicollinearity. The reciprocal of VIF is called the Tolerance statistic — a tolerance below 0.1 (VIF > 10) signals a serious problem. VIF is available in SPSS (as part of the Coefficients table), Stata (vif command), R (car::vif()), and Python (statsmodels.stats.outliers_influence.variance_inflation_factor).
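VIF for a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. The sketch below computes it directly, assuming NumPy; the `vif` helper and the simulated predictors are illustrative and are not the statsmodels function named above.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)             # unrelated to the others
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
print("VIF for x1, x2, x3:", np.round(vifs, 2))
```

The two near-duplicate predictors get very large VIFs while the unrelated one stays close to 1, matching the thresholds described above.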
Correlation Matrix
A simple pairwise correlation matrix of all predictors identifies pairs of variables with high correlation (|r| > 0.8 or 0.9). While a correlation matrix doesn’t fully capture multicollinearity involving three or more variables simultaneously, it’s a useful first screen. If no pair of predictors has a correlation above 0.8, severe multicollinearity is unlikely, though not impossible. For understanding how pairwise associations are measured and interpreted, the guide to correlation vs. causation provides essential background.
Condition Index and Condition Number
The condition index examines the eigenvalues of the predictor matrix to detect complex multicollinearity patterns involving multiple predictors simultaneously. A condition index above 30 indicates moderate to severe multicollinearity. This diagnostic is available in SPSS’s regression output under “Collinearity Diagnostics” and in R’s perturb and car packages.
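One common way to compute condition indices by hand is sketched below. It assumes the usual convention of scaling each column of the design matrix (including the intercept) to unit length before taking singular values; the data and variable names are synthetic and purely illustrative.

```python
import numpy as np

# Synthetic data (illustrative only): x3 is nearly x1 + x2, a three-way
# dependency that no single pairwise correlation fully reveals.
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.05, size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])

# Scale every column to unit length, then take the singular values.
X_scaled = X / np.linalg.norm(X, axis=0)
singular_values = np.linalg.svd(X_scaled, compute_uv=False)

# Condition index k = largest singular value / k-th singular value.
condition_indices = singular_values.max() / singular_values

print(np.round(condition_indices, 1))
```

The largest value in the printed array is the condition number; here it should exceed the threshold of 30 mentioned above.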
How to Fix Multicollinearity
1. Remove Redundant Predictors
If two predictors measure essentially the same construct, remove one. This is the simplest fix and often the most defensible from a theoretical standpoint. Decide which predictor is more theoretically relevant, better measured, or more directly interpretable, and retain that one. Use model selection criteria like AIC or BIC to compare models with and without the redundant predictor. Model selection with AIC and BIC provides a framework for these decisions.
2. Ridge Regression
Ridge regression adds a penalty term (L2 regularization) to the OLS objective function that shrinks coefficient estimates toward zero, reducing their variance even in the presence of multicollinearity. It accepts a small amount of bias in exchange for a large reduction in variance when predictors are correlated, a bias-variance tradeoff that often produces more stable and reliable estimates than OLS under multicollinearity. The full mechanics are covered in ridge and LASSO regularization.
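The shrinkage effect can be sketched with the closed-form ridge solution, (XᵀX + λI)⁻¹Xᵀy, applied to standardized predictors. The helper name ridge_coefficients, the penalty value λ = 10, and the simulated data are all assumptions chosen for illustration, not a recommended setting.

```python
import numpy as np

# Synthetic data (illustrative only): two nearly collinear predictors,
# each with a true coefficient of 1.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# Standardize predictors and center the outcome so no intercept is penalized.
X = np.column_stack([x1, x2])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()


def ridge_coefficients(Xs, yc, lam):
    """Closed-form ridge solution; lam = 0 reproduces OLS (hypothetical helper)."""
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)


beta_ols = ridge_coefficients(Xs, yc, lam=0.0)
beta_ridge = ridge_coefficients(Xs, yc, lam=10.0)

print("OLS coefficients:  ", np.round(beta_ols, 3))
print("Ridge coefficients:", np.round(beta_ridge, 3))
```

With collinear predictors the two OLS coefficients can land far apart from each other, while the ridge coefficients are pulled toward a similar, more stable pair. In practice the penalty is usually chosen by cross-validation rather than fixed in advance.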
3. Principal Component Analysis (PCA)
Principal Component Analysis transforms correlated predictors into a smaller set of uncorrelated components and uses those components as predictors in a regression model (called Principal Component Regression). The resulting components are orthogonal — by construction they have zero correlation — eliminating multicollinearity. The tradeoff is interpretability: principal components are linear combinations of the original variables and often lack intuitive meaning. See Principal Component Analysis for a full treatment.
4. Collect More Data or Rethink the Research Design
Multicollinearity is often a sample problem as much as a theoretical one. With a larger, more varied sample, predictors that appeared highly correlated may exhibit more independent variation. Redesigning a study to ensure predictors vary more independently — through experimental manipulation or more diverse sampling — addresses multicollinearity at its source rather than statistically compensating for it.
Multicollinearity is a data problem, not a model problem: It does not mean your model is wrong. It means the data you’ve collected doesn’t contain enough independent variation in the predictors to estimate their separate effects reliably. The fix usually involves what information you put into the model, not just how you estimate it.
The Theoretical Foundation
The Gauss-Markov Theorem and the BLUE Estimator
The Gauss-Markov theorem is the mathematical proof that ties all the assumptions of a regression model together into a coherent theoretical framework. It states that when specific conditions hold, the OLS estimator of β is the Best Linear Unbiased Estimator (BLUE). Every word in that acronym matters.
- Best — minimum variance among all linear unbiased estimators. No other unbiased linear estimation method produces estimates with smaller variance than OLS when the assumptions hold.
- Linear — the estimator is a linear function of the outcome variable Y.
- Unbiased — E(β̂) = β. On average, the estimates equal the true population values.
- Estimator — the OLS formula applied to a sample to estimate the true (unknown) population regression coefficients.
The Gauss-Markov conditions required for BLUE are: (1) the model is linear in parameters; (2) the errors have a zero conditional mean, E(ε|X) = 0; (3) errors are homoscedastic; (4) errors are uncorrelated. The theorem also presupposes that no predictor is an exact linear combination of the others, since otherwise the OLS estimate cannot be computed at all. Notably, the Gauss-Markov theorem does not require normality of errors. Normality is an additional condition needed for valid finite-sample inference (t-tests and F-tests), but not for the BLUE property itself. The Gauss-Markov theorem is a cornerstone of classical econometrics curriculum.
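The point that unbiasedness does not depend on normality can be checked with a small Monte Carlo sketch. The true coefficients, the sample size, and the choice of uniform errors (mean zero, constant variance, clearly non-normal) are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 50, 5000
beta_true = np.array([2.0, 0.5])      # assumed true intercept and slope

# Fixed design matrix, reused in every replication.
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])

estimates = np.empty((reps, 2))
for r in range(reps):
    # Errors: mean zero, homoscedastic, uncorrelated, but uniform rather than normal.
    eps = rng.uniform(-3, 3, size=n)
    y = X @ beta_true + eps
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

mean_estimate = estimates.mean(axis=0)
print("True coefficients:   ", beta_true)
print("Average OLS estimate:", np.round(mean_estimate, 3))
```

Across replications the average estimate should sit very close to the true values, consistent with E(β̂) = β holding without any normality assumption.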
What the Zero Conditional Mean Assumption Means
The zero conditional mean assumption — E(ε|X) = 0 — is one of the most important and subtle conditions in regression theory. It requires that the error term has zero expected value for every possible value of the predictor X. In practical terms, it means that the predictor variables are exogenous: they are not correlated with the error term. When this assumption fails — a situation called endogeneity — OLS estimates are biased and inconsistent, meaning the bias does not go away as sample size increases.
Endogeneity arises from three main sources: omitted variable bias (a variable that affects Y is excluded from the model and is correlated with an included predictor); simultaneity (X and Y are jointly determined, each causing the other — common in supply and demand models); and measurement error in the predictor variables. Instrumental Variables (IV) estimation and Two-Stage Least Squares (2SLS) are the standard fixes for endogeneity, though they require valid instruments — a demanding condition in practice. The relationship between causal structure and regression assumptions connects to causal inference and randomized controlled trials.
What the Scientific Method Has to Do With Regression Assumptions
Many students wonder why regression assumptions feel so abstract. The connection to practice is direct: when you report a regression coefficient, you’re making an empirical claim about the relationship between variables in the real world. Whether that claim is credible depends on whether the assumptions hold closely enough for the inference you are drawing. This is no different from the requirement that any scientific result be obtained through a valid methodology. The scientific method in empirical research demands that the tools used are appropriate for the data and questions at hand, and regression assumptions are how you determine that OLS is an appropriate tool for your context. Research conducted with seriously violated assumptions can produce invalid findings, regardless of the sophistication of the analysis.
Quick Reference
Complete Reference Table: All Regression Assumptions, Tests, and Fixes
The table below consolidates every assumption of the regression model — what it means, how to detect violations, what consequences follow from violations, and how to fix them. This is a reference you will use repeatedly in statistics courses and in research.
| Assumption | Formal Statement | Detection Methods | Consequence if Violated | Common Fixes |
|---|---|---|---|---|
| 1. Linearity | E(Y \| X) = Xβ: relationship between predictors and outcome is linear | Residuals vs. fitted plot; partial regression plots; scatterplots | Biased, systematically wrong predictions; model misspecification | Transform variables (log, square root); add polynomial terms; use GLM |
| 2. Independence | Cov(εᵢ, εⱼ) = 0 for i ≠ j — error terms are uncorrelated | Durbin-Watson test; residuals vs. time plot; Breusch-Godfrey test | Standard errors underestimated; t-stats inflated; incorrect p-values | Newey-West HAC standard errors; GLS; clustered standard errors |
| 3. Homoscedasticity | Var(εᵢ) = σ² — constant variance across all observations | Residuals vs. fitted (fan shape); Breusch-Pagan test; White test | Inefficient estimates; invalid standard errors; unreliable hypothesis tests | Robust (HC) standard errors; WLS; log transform of outcome |
| 4. Normality of Residuals | εᵢ ~ N(0, σ²) — errors are normally distributed | Q-Q plot; Shapiro-Wilk test; histogram of residuals; Anderson-Darling test | Invalid t-tests and F-tests in small samples; distorted confidence intervals | Transform outcome; remove/investigate outliers; use non-parametric tests |
| 5. No Multicollinearity | Predictors not highly correlated with each other | VIF (>10 = severe); correlation matrix; condition index | Inflated coefficient variance; unstable estimates; wide confidence intervals | Remove redundant predictors; ridge regression; PCA; collect more data |
For anyone working with software: SPSS reports the Durbin-Watson statistic, VIF, tolerance, and collinearity diagnostics in its regression output once the corresponding options are selected. R’s plot(model) function generates the four standard diagnostic plots in sequence. In Python’s statsmodels, the OLSResults summary() output includes the Durbin-Watson statistic, the Omnibus and Jarque-Bera normality tests, and the condition number; get_influence() returns leverage, Cook’s distance, and DFFITS; and heteroscedasticity tests such as Breusch-Pagan are provided in statsmodels.stats.diagnostic. Understanding which test does what, and when to use it, is the practical skill this guide builds. Connecting these diagnostics to broader statistical practice, the guide on descriptive vs. inferential statistics provides useful context.
Systematic Workflow
How to Check Regression Assumptions Step by Step
Knowing the assumptions of a regression model is necessary but not sufficient. You also need a systematic workflow for checking them in practice. The order matters: some assumption checks build on others, and fixing an earlier violation can resolve what appeared to be a later violation. Here is the workflow used by experienced applied statisticians and econometricians.
1. Begin With Exploratory Data Analysis (EDA)
Before running any regression, examine your data. Plot each predictor against the outcome (scatterplots). Check the distributions of all variables (histograms, box plots). Compute a correlation matrix of your predictors. Look for obvious outliers, unusual distributions, and strong predictor correlations. EDA reveals potential assumption violations before the model is estimated, allowing you to address them in your model specification rather than as post-hoc corrections. Good datasets and proper EDA go hand in hand before fitting any model.
2. Check Linearity First
Plot residuals vs. fitted values after running your initial model. If linearity is violated, transform variables or add polynomial terms before moving on. Trying to diagnose homoscedasticity or normality in a misspecified model produces misleading results — the other assumption checks are only valid given a correctly specified model.
3. Check Multicollinearity
Compute VIF for all predictors. If any VIF exceeds 10, investigate which predictors are correlated. Consider removing redundant variables or using ridge regression before interpreting coefficients. Multicollinearity doesn’t violate normality or homoscedasticity, but inflated coefficient variance can make subsequent inference difficult to interpret meaningfully.
4. Check Independence
If your data has a temporal, spatial, or hierarchical structure, run the Durbin-Watson test (for time-series) or consider whether clustered standard errors are needed. For pure cross-sectional random samples, independence usually holds by design and formal testing is less critical.
5. Test Homoscedasticity
Inspect the scale-location plot and run the Breusch-Pagan test. If you detect heteroscedasticity, switch to robust standard errors for inference. If heteroscedasticity is severe, consider WLS or transform the outcome variable.
6. Assess Normality of Residuals
Generate the Q-Q plot and run Shapiro-Wilk if your sample is small (n < 200). Investigate any serious departures — check for outliers (Cook's distance), consider variable transformations, and reconsider whether a linear regression is the appropriate model for your outcome type.
7. Investigate Influential Observations
After the main assumption checks, examine influence statistics: Cook’s distance (overall influence of each observation on all coefficients), leverage (how far each observation’s predictor values are from the mean of predictors), and DFFITS (standardized change in each fitted value when the observation is deleted). High-leverage, high-influence points deserve individual investigation — they may represent data errors, unusual cases, or genuine edge cases that your model doesn’t generalize to well.
8. Document Everything
In an academic paper or report, the assumption-checking process must be reported. State which tests you ran, what the results were, what violations you detected (if any), and what you did about them. Reviewers and professors expect this — it’s what distinguishes a rigorous analysis from a regression run without validation. For students writing research papers or statistics assignments, the research paper writing guide covers how to structure the Methods and Results sections in which these checks are reported.
Extensions & Related Methods
Extensions of the Regression Model and Their Own Assumptions
Standard linear regression — with its five core assumptions — is the foundation. But modern statistics offers a rich range of regression extensions, each designed to handle situations where one or more of the classical assumptions cannot be satisfied. Understanding which tool to use when is as important as understanding the tools themselves.
Logistic Regression — For Binary Outcomes
When the outcome variable is binary (0 or 1), linear regression violates the linearity, homoscedasticity, and normality assumptions by construction: fitted values can fall outside the 0 to 1 range, and the error variance p(1 − p) changes with the predictors. Logistic regression replaces the normal error distribution with a Bernoulli distribution and uses a logit link function to model the log-odds of the outcome. Logistic regression has its own assumptions: independence of observations, absence of multicollinearity, and linearity in the log-odds, but not homoscedasticity or normality of errors. A recent study in the Journal of the American Statistical Association showed logistic regression remains robust to moderate sample size conditions when the outcome is genuinely binary and predictors are well-specified. For deeper reading, Journal of the American Statistical Association publishes foundational applied statistics research.
Ridge and LASSO Regression — For Multicollinearity and High-Dimensional Data
Ridge and LASSO regression add penalty terms to the OLS objective function that shrink coefficients toward zero. Ridge addresses multicollinearity by reducing coefficient variance. LASSO performs variable selection by shrinking some coefficients to exactly zero, effectively removing predictors from the model. Both are essential tools in contexts with many predictors relative to observations — a setting where OLS assumptions about invertible predictor matrices can break down.
Polynomial Regression — For Nonlinear Relationships
Polynomial regression adds powered versions of predictors (X², X³) to capture curved relationships while remaining linear in parameters. It accommodates both monotonic curvature (increasing or decreasing at a changing rate) and non-monotonic shapes such as a U-shaped or inverted-U relationship. The homoscedasticity and normality assumptions still apply; multicollinearity often becomes a concern because X, X², and X³ are highly correlated.
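The correlation between a predictor and its square is easy to see numerically. The sketch below also shows mean-centering, a commonly used mitigation that the text above does not cover; it is included as an assumption-labeled illustration on synthetic data.

```python
import numpy as np

# Synthetic predictor (illustrative only), strictly positive and far from zero.
rng = np.random.default_rng(12)
x = rng.uniform(10, 20, size=500)

# Raw predictor vs. its square: almost perfectly correlated.
corr_raw = np.corrcoef(x, x**2)[0, 1]

# Centering before squaring removes most of that correlation
# when the distribution of x is roughly symmetric.
xc = x - x.mean()
corr_centered = np.corrcoef(xc, xc**2)[0, 1]

print("corr(x, x^2) raw:     ", round(corr_raw, 4))
print("corr(x, x^2) centered:", round(corr_centered, 4))
```

Centering changes the interpretation of the lower-order coefficients (they now describe the effect at the mean of x) but leaves the fitted curve unchanged.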
Multiple Linear Regression — Extending to Many Predictors
Linearity, independence, homoscedasticity, and normality of residuals carry over unchanged from simple to multiple linear regression. The fifth assumption, no multicollinearity, only becomes relevant once the model has two or more predictors. Multiple regression also introduces the risk of overfitting, that is, fitting the training sample noise rather than the true population relationship. Cross-validation techniques, covered in cross-validation and bootstrapping, are the standard way to assess whether a multiple regression model generalizes beyond the sample used to estimate it. For broader context on the difference between parametric data approaches like regression and nonparametric alternatives, the distinction between qualitative and quantitative data is also worth understanding.
Generalized Linear Models — For Non-Normal Outcomes
Generalized Linear Models (GLMs) extend linear regression to outcome variables from the exponential family of distributions — including Poisson (for counts), Gamma (for positive-valued continuous outcomes), and Bernoulli/Binomial (for binary outcomes). GLMs replace the normality assumption with a distribution-appropriate error family and replace the identity link function with a link function appropriate for the outcome type. The assumptions of independence and absence of multicollinearity still apply; homoscedasticity is replaced by a mean-variance relationship specified by the chosen distribution.
Time Series Regression — When Independence Fails
When observations are temporally ordered and the independence assumption fails, time series models like ARIMA integrate autoregressive and moving average components directly into the model structure, rather than treating autocorrelation as a nuisance to correct. ARIMA and exponential smoothing methods are the standard toolkit. Regression models with ARIMA error structures (ARIMAX) allow including external predictors while explicitly modeling temporal dependence in the errors — the best of both approaches.
Know When to Stop Debugging Assumptions and Change the Model
Students sometimes spend hours trying to force a linear model to satisfy assumptions it will never satisfy because the data simply isn’t suitable for OLS. If you’ve tried transformations, removed clear outliers, and checked for specification errors — and the assumptions still fail — the most honest and productive response is to consider whether a different model class (GLM, mixed model, time-series model) is simply a better fit for the data generating process. Regression assumptions are ultimately questions about whether the model matches reality — and sometimes the answer is that this particular model doesn’t.
Field Applications
Regression Assumptions Across Academic Disciplines
The assumptions of a regression model play out differently depending on the field and the type of data. Understanding how assumption violations manifest in your specific discipline helps you know which ones to watch for most carefully in your own work.
Economics and Econometrics
In economics, endogeneity (violation of the zero conditional mean assumption) is the dominant concern. Wages affect education choices even as education affects wages, and this simultaneity makes OLS estimates of the returns to education biased. Instrumental Variables (IV) and Two-Stage Least Squares are the workhorse fixes. Heteroscedasticity is routine in cross-sectional economic data, so robust standard errors are standard practice. Omitted variable bias from unobserved confounders is so pervasive that applied economists use randomization and quasi-experimental designs to establish causal identification rather than relying on assumptions alone.
Psychology and Behavioral Sciences
In psychology, normality of residuals and homoscedasticity are the most frequently checked assumptions, because psychological studies often have small samples (where normality matters most for inference) and often compare groups of different sizes (where heteroscedasticity can arise). The American Psychological Association (APA) recommends reporting effect sizes and confidence intervals alongside p-values — all of which depend on valid assumption checking. Violations of independence arise in studies with nested data (students within schools, patients within hospitals), requiring multilevel models.
Public Health and Epidemiology
Public health regression models frequently deal with count outcomes (number of disease cases, hospitalizations) that violate normality, and binary outcomes (infected vs. not infected) that violate linearity and normality simultaneously. Poisson regression and logistic regression replace OLS in these contexts. Clustered data — patients within hospitals, children within schools, residents within neighborhoods — routinely violates independence, requiring mixed effects models or generalized estimating equations (GEE). The independence assumption is arguably the most consequential one in public health regression, where ignoring clustering understates standard errors and overstates statistical precision.
Business and Finance
Financial return data is notorious for violating homoscedasticity — volatility clustering means that large price movements tend to be followed by large movements (in either direction), violating constant-variance assumptions. ARCH/GARCH models are designed specifically to model time-varying conditional variance in financial time series. Multicollinearity is a persistent issue in marketing mix models, where advertising channels (TV, digital, radio) tend to move together in response to budget decisions. Ridge regression is widely used in marketing analytics to handle correlated predictors in media mix models.
Social Sciences and Sociology
In sociology and social science, data collected from surveys often involves hierarchical or clustered structures — individuals within households, within neighborhoods, within cities — that violate independence assumptions. Multilevel modeling (also called hierarchical linear modeling) explicitly models the nested structure of the data. Omitted variable bias from unobserved individual characteristics (like innate ability in education research) is a perennial concern. The design and analysis of quantitative social research, including these challenges, is a topic well-covered in methodological journals like Sociological Methods & Research. For broader foundational guidance on statistics as applied in the social sciences, see statistics assignment help.
| Discipline | Most Critical Assumption(s) | Common Violations | Typical Remedies |
|---|---|---|---|
| Economics | Zero conditional mean (endogeneity); independence | Omitted variable bias; simultaneity; measurement error | Instrumental Variables; natural experiments; panel fixed effects |
| Psychology | Normality; homoscedasticity; independence | Small samples amplify non-normality; nested data | Robust standard errors; multilevel models; bootstrapping |
| Public Health | Linearity; normality; independence | Binary/count outcomes; clustering within facilities | Logistic/Poisson regression; GEE; mixed effects models |
| Finance | Homoscedasticity; independence | Volatility clustering; autocorrelated returns | GARCH models; HAC standard errors; ARIMA-regression hybrids |
| Sociology | Independence; zero conditional mean | Clustered survey data; unobserved confounders | Multilevel models; clustered standard errors; propensity score methods |
Academic Success Tips
Common Mistakes Students Make With Regression Assumptions
Students in statistics and econometrics courses across the U.S. and UK consistently make the same errors when working with regression assumptions. Knowing what these are — and how to avoid them — is the difference between an assignment that demonstrates statistical competence and one that reveals a superficial understanding of the model. For broader study skills that support academic performance in quantitative courses, the guide on online resources for homework help covers how to use academic resources effectively.
Mistake 1: Skipping Assumption Checks Entirely
The most common mistake is running a regression and reporting the results without ever checking whether the assumptions hold. Many students treat assumption testing as optional extra work. It isn’t. Regression results without assumption validation are methodologically incomplete. Peer reviewers and examiners know this — and will mark it accordingly.
Mistake 2: Confusing the Distribution of X With the Distribution of Residuals
The normality assumption of the regression model applies to the residuals — not to the predictor variables, not to the outcome variable. It’s the errors εᵢ that must be normally distributed, not X or Y themselves. Students frequently check whether their predictor is normally distributed and report that as evidence of satisfied assumptions. It’s not. Check normality of the residuals after fitting the model.
Mistake 3: Using Significance Tests as the Sole Check for Normality
In large samples (n > 500), the Shapiro-Wilk test often rejects normality even for tiny, practically irrelevant deviations. In small samples, the test has low power and may fail to detect moderate non-normality. Always pair formal tests with visual inspection via the Q-Q plot. The Q-Q plot tells you both whether normality is violated and how severely. See the discussion of statistical power for why formal test results must be interpreted in context of sample size.
Mistake 4: Reporting VIF Without Interpreting It
Many assignment submissions include a VIF table but make no comment on it — not identifying which predictors have problematic VIF values, not explaining what was done about it. If VIF values appear in your output, interpret them explicitly. State which predictors (if any) have VIF above your chosen threshold, what that implies for coefficient reliability, and what — if anything — you did about it.
Mistake 5: Treating All Violations as Equally Serious
Not all assumption violations have equal consequences. A mild departure from normality in a large sample may have negligible practical impact. A severe violation of independence in a time-series regression can completely invalidate all inference. Linearity violations produce biased predictions. Multicollinearity makes individual coefficient estimates unstable. Part of statistical maturity is learning to calibrate the seriousness of each violation in context — not just checking boxes and moving on.
⚠️ A note on assumptions in machine learning contexts: Students who come to regression from machine learning sometimes assume that assumptions don’t matter in predictive contexts — only prediction accuracy does. This is partly true for pure prediction: if your goal is only to minimize prediction error on held-out data, violated regression assumptions matter less. But when your goal is inference — understanding coefficient meaning, testing hypotheses, constructing valid confidence intervals — assumptions are non-negotiable. Most academic assignments are inference-focused. Know which goal you’re serving. For understanding cross-validation as a tool for evaluating predictive models, see cross-validation and bootstrapping.
Frequently Asked Questions
Frequently Asked Questions About Assumptions of Regression Model
What are the 5 assumptions of linear regression?
The five core assumptions of a linear regression model are: (1) Linearity, meaning the relationship between each predictor and the outcome is linear; (2) Independence, meaning observations and their error terms are uncorrelated with each other; (3) Homoscedasticity, meaning the variance of the residuals is constant across all values of the predictors; (4) Normality of residuals, meaning the error terms follow an approximately normal distribution; and (5) No multicollinearity, meaning predictor variables are not highly correlated with each other. Linearity, independence, and homoscedasticity, together with zero conditional mean errors, are the Gauss-Markov conditions under which OLS is the Best Linear Unbiased Estimator (BLUE). Normality is an additional requirement for exact small-sample inference, and severe multicollinearity reduces the precision of individual coefficients rather than removing the BLUE property.
What happens if regression assumptions are violated?
The consequences depend on which assumption is violated. Violating linearity produces biased, systematically wrong predictions. Violating independence (autocorrelation) typically leads to underestimated standard errors, making t-statistics too large and p-values too small, so you incorrectly reject null hypotheses. Violating homoscedasticity (heteroscedasticity) produces inefficient estimates and invalid standard errors. Violating normality of residuals invalidates t-tests and F-tests in small samples. Violating the no-multicollinearity assumption inflates coefficient variance, making estimates unstable and individual predictors appear statistically insignificant even when they have real effects. Some violations only affect efficiency; others (like endogeneity) produce biased and inconsistent estimates that don’t improve with larger samples.
How do you test the assumptions of a regression model?
Each assumption has specific diagnostic tests: Linearity — residuals vs. fitted values plot (look for random scatter, no pattern). Independence — Durbin-Watson test (value near 2 = no autocorrelation); residuals vs. time/order plot. Homoscedasticity — scale-location plot; Breusch-Pagan test; White test. Normality — Normal Q-Q plot (points along diagonal); Shapiro-Wilk test; histogram of residuals. No multicollinearity — Variance Inflation Factor (VIF > 10 signals a problem); pairwise correlation matrix; condition index. All these diagnostics are available in standard statistical software including SPSS, R, Python (statsmodels), and Stata.
What is homoscedasticity in regression?
Homoscedasticity is the assumption that the variance of the residuals (error terms) is constant across all levels of the predictor variables. It means that the spread of residuals around the regression line is approximately the same regardless of the value of X. The opposite condition — heteroscedasticity — occurs when variance is non-constant: for example, when residuals spread out as X increases (a fan shape in the residuals vs. fitted plot). Homoscedasticity is required for OLS standard errors to be valid. When heteroscedasticity is present, standard errors are wrong and hypothesis tests on coefficients are unreliable. Fixes include robust (heteroscedasticity-consistent) standard errors, Weighted Least Squares, or log-transforming the outcome variable.
What is the Gauss-Markov theorem?
The Gauss-Markov theorem states that under four conditions (linearity of the model in parameters, zero conditional mean of errors (E(ε|X) = 0), homoscedasticity, and uncorrelated errors) the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE). “Best” means minimum variance among all unbiased linear estimators. Importantly, the Gauss-Markov theorem does not require normality of errors; normality is needed for valid finite-sample inference (t-tests, F-tests, confidence intervals) but not for the BLUE property itself. The theorem was proved by Carl Friedrich Gauss in the early 19th century and formalized by Andrey Markov, and is the theoretical cornerstone of classical regression analysis.
What is multicollinearity and how do you fix it?
Multicollinearity occurs in multiple regression when two or more predictor variables are highly correlated with each other. High correlation makes it difficult for OLS to estimate the independent contribution of each predictor — the algorithm cannot distinguish their separate effects. The result is inflated coefficient variance: wide confidence intervals, unstable coefficient estimates that change dramatically with small dataset changes, and predictors appearing statistically insignificant even when they have real effects on the outcome. Multicollinearity is detected using the Variance Inflation Factor (VIF > 10 = severe) and correlation matrices. Common fixes include: removing redundant predictors; using ridge regression (which shrinks coefficients and reduces variance in the presence of multicollinearity); applying Principal Component Analysis (PCA) to create orthogonal predictors; or collecting a larger, more varied sample.
Does normality of residuals matter in large samples?
Normality of residuals matters less in large samples due to the Central Limit Theorem, which implies that, under mild conditions, the sampling distribution of OLS coefficient estimates approaches normality as n grows, even if the residuals themselves are not perfectly normal. A common practical guideline: with n > 30 observations per predictor (and many statisticians use n > 100 as a more conservative threshold), mild non-normality has negligible impact on inference. In very large samples (n > 500), formal normality tests like Shapiro-Wilk often reject normality for trivially small, practically irrelevant deviations. With small samples, however, normality is critical because t-tests and confidence intervals rely on it directly. Always check the Q-Q plot regardless of sample size; focus correction efforts on non-normality in small samples.
What is the difference between the linearity assumption and the other regression assumptions?
The linearity assumption is about the functional form of the relationship between predictors and the outcome — it’s a statement about the model’s structure, not just the error terms. The other four assumptions — independence, homoscedasticity, normality, and no multicollinearity — are primarily about the behavior of the error terms and the relationships between predictors. Linearity is foundational in a different way: if the true relationship is nonlinear, no amount of correcting standard errors or removing multicollinearity will make a linear regression model produce valid predictions or correct coefficient estimates. It must be addressed through model specification — transformation or a different model class — not through inference corrections. This is why linearity is always checked first in systematic assumption validation workflows.
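A small sketch can make the point concrete. It assumes a deliberately extreme, hypothetical case in which the true relationship is purely quadratic, and fits a straight line to it with a hand-written least-squares helper (`fit_line` is an illustrative name, not a library function). The residuals come out positive at both ends and negative in the middle, the U-shaped pattern that a residuals-versus-fitted plot would reveal.

```python
def fit_line(x, y):
    """Simple least-squares line: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: the true relationship is y = x^2, not a line.
x = [float(v) for v in range(-5, 6)]
y = [v ** 2 for v in x]

slope, intercept = fit_line(x, y)
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print("residuals:", [round(r, 1) for r in residuals])
```

The fitted slope is zero even though y depends strongly on x, and no adjustment to the standard errors would repair that. Adding an x² term, or transforming the variables, addresses the problem at the level of model specification.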
What are the assumptions of logistic regression?
Logistic regression has its own set of assumptions that differ from linear regression. The key assumptions are: (1) the outcome variable is binary (or ordinal/multinomial for extensions); (2) observations are independent — no clustering or autocorrelation; (3) absence of multicollinearity among predictors; (4) linearity of predictors in the log-odds — the log-odds of the outcome is a linear function of the predictors; and (5) a large sample size — logistic regression relies on maximum likelihood estimation, which requires sufficient observations per outcome category for reliable estimates (a rule of thumb is at least 10 events per predictor). Notably, logistic regression does not assume linearity of predictors with the outcome itself (it’s a nonlinear model overall), and it does not assume homoscedasticity or normality of residuals.
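Assumption (4) is easier to see with numbers. The sketch below uses hypothetical coefficients (`b0`, `b1`) for a single predictor and compares equal steps in x on two scales: the log-odds change by the same amount each time, while the predicted probabilities do not.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Map a probability back to log-odds (the logit)."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, for illustration only.
b0, b1 = -1.5, 0.8
xs = [0.0, 1.0, 2.0, 3.0]

probs = [sigmoid(b0 + b1 * x) for x in xs]
logits = [log_odds(p) for p in probs]

# Change between consecutive x values on each scale.
prob_steps = [probs[i + 1] - probs[i] for i in range(len(xs) - 1)]
logit_steps = [logits[i + 1] - logits[i] for i in range(len(xs) - 1)]

print("probability steps:", [round(s, 3) for s in prob_steps])
print("log-odds steps:   ", [round(s, 3) for s in logit_steps])
```

Each one-unit increase in x adds exactly `b1` to the log-odds, which is what "linear in the log-odds" means; the probability steps vary because the sigmoid curve is nonlinear.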
What is autocorrelation in regression and why does it matter?
Autocorrelation (serial correlation) in regression occurs when the error terms are correlated across observations, typically across time points in time-series data. Positive autocorrelation means that positive residuals tend to be followed by positive residuals and negative residuals by negative ones. As long as the predictors are exogenous, autocorrelation does not bias OLS coefficient estimates, but it invalidates the usual standard errors. OLS treats autocorrelated observations as if they contained fully independent information; in reality, each observation contains less unique information when residuals are correlated. With positive autocorrelation, this causes OLS to underestimate standard errors, inflate t-statistics, and produce p-values that are too small, leading to incorrect rejection of null hypotheses. The Durbin-Watson test is the standard diagnostic for first-order autocorrelation: the statistic ranges from 0 to 4, values near 2 indicate no autocorrelation, values well below 2 indicate positive autocorrelation, and values well above 2 indicate negative autocorrelation. Fixes include Newey-West HAC standard errors, Generalized Least Squares, and ARIMA-based approaches for time-series data.
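The Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below applies that formula to two hypothetical residual series, one that drifts slowly (positive autocorrelation) and one that flips sign at every step (negative autocorrelation). The function name and the data are illustrative assumptions, not output from a real model.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of residuals in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series, for illustration only.
drifting = [1.0, 0.9, 0.8, 0.6, 0.3, -0.1, -0.4, -0.7, -0.9, -1.0]
alternating = [1.0, -1.0] * 5

dw_drifting = durbin_watson(drifting)
dw_alternating = durbin_watson(alternating)

print(f"drifting residuals:    DW = {dw_drifting:.2f}")
print(f"alternating residuals: DW = {dw_alternating:.2f}")
```

The drifting series produces a value close to 0 and the alternating series a value close to 4, the two ends of the scale; residuals from a well-behaved model would fall near 2.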
