Model Selection: Understanding AIC and BIC in Statistical Modeling
Statistics & Data Science Guide
Every statistical model involves a trade-off. Add more parameters and your model fits the data better — but it may be capturing noise rather than signal. That tension is what AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) were designed to resolve. These two information criteria are the most widely used tools for model selection in statistics, machine learning, and econometrics — and understanding how they work is essential for anyone building, comparing, or evaluating statistical models.
This guide covers everything: the mathematical formulas for AIC and BIC, what each one measures, how they differ in their penalty for complexity, and when to use each based on your research goals. You’ll understand why Hirotugu Akaike’s 1974 criterion takes a frequentist, prediction-focused view of model quality while Gideon Schwarz’s 1978 BIC takes a Bayesian, consistency-focused perspective. You’ll also explore the corrected variant AICc, alternatives like the Hannan-Quinn criterion (HQC), and how these tools apply across regression, time series (ARIMA), structural equation models, and machine learning pipelines.
The theoretical foundations draw on Kullback-Leibler divergence, maximum likelihood estimation, and Bayesian model evidence. Practical examples show how AIC and BIC are implemented in R and Python, how to interpret differences in criterion values, and what to do when the two criteria select different models. Scholarly sources from PMC (NCBI), University of Washington, and MathWorks are referenced throughout.
Whether you are a college student working through a statistics assignment, a graduate researcher comparing regression models, or a working data scientist building an ARIMA forecasting pipeline, this guide gives you the complete conceptual and practical toolkit for AIC and BIC model selection — with precision, not fluff.
What It Is & Why It Matters
Model Selection Using AIC and BIC — The Problem Every Statistician Faces
AIC and BIC model selection begins with a problem that every statistician, data scientist, and researcher faces: you have built several plausible models for your data, and you need a principled, objective way to choose between them. You cannot simply pick the model with the best fit — more complex models always fit training data better, but that improvement often reflects noise rather than genuine signal. This is the overfitting problem, and AIC and BIC are the two most widely used tools for solving it. Understanding the assumptions of regression models is the essential foundation before you can meaningfully compare them using information criteria.
Think of it this way. You are a detective with two theories about a crime. One theory explains every clue perfectly — but it requires seventeen specific assumptions. The other leaves one clue slightly unexplained, but it uses only three assumptions and is internally consistent. Which theory is more credible? This is exactly the question AIC and BIC answer for statistical models: which model provides the best balance between explaining the data and remaining parsimonious? Regression analysis is the most common arena where this question arises, but the same logic applies to any model fitted by maximum likelihood.
- 1974 — Hirotugu Akaike publishes the AIC at the Institute of Statistical Mathematics, Tokyo
- 1978 — Gideon Schwarz publishes the BIC, a Bayesian approximation to the Bayes factor, at the Hebrew University of Jerusalem
- 47,307 — latent variable modeling studies using AIC/BIC retrieved from PsycINFO, illustrating the scale of real-world usage
What Is Model Selection?
Model selection is the process of choosing one model from a set of candidate models for a given dataset. It sounds straightforward — pick the one that fits best. But “best” is ambiguous. A model with ten predictors will almost always have a higher R² than one with three, yet the ten-predictor model may generalize poorly to new data. The challenge is distinguishing genuine signal from noise. AIC and BIC model selection formalizes this trade-off by combining a measure of model fit (the log-likelihood) with a penalty for complexity (the number of parameters). The model that minimizes the combined score wins.
Model selection is not the same as model validation. Validation — using a held-out test set or cross-validation — assesses how well a chosen model generalizes. Model selection is the upstream step: choosing which model to validate. Cross-validation and bootstrapping are alternative approaches to model selection that don’t require parametric assumptions about the likelihood — each has trade-offs relative to AIC and BIC that are worth understanding for any serious statistical analysis.
The Overfitting Problem — Why Raw Fit Is Not Enough
Overfitting occurs when a model captures noise in the training data as if it were genuine structure. The result: the model fits the observed data beautifully but performs poorly on new, unseen data. Adding parameters always improves fit on training data — a model with n parameters can perfectly fit n data points — but those extra parameters may be capturing sampling variation rather than the true underlying relationship. Regularization in machine learning (Ridge, Lasso) is one approach to penalizing complexity; AIC and BIC are the classical statistical approach to the same problem.
Both AIC and BIC address overfitting by explicitly penalizing the number of free parameters in the model. Historically, various information criteria have been proposed that attempt to correct for the bias of maximum likelihood by adding a penalty term to compensate for the overfitting of more complex models. The difference between AIC and BIC lies in how aggressive that penalty is — and understanding this difference is what allows you to use each criterion intelligently rather than mechanically.
The core insight of information criteria: a model is not just what it fits — it is also what it fails to explain. AIC and BIC simultaneously reward models that explain the data well and penalize models that require many parameters to do so. The best model under these criteria is the one that conveys the most information about the data using the fewest assumptions. This is parsimony — and it is the operating principle behind every model selection criterion.
The Akaike Information Criterion
What Is AIC? The Akaike Information Criterion Explained
The Akaike Information Criterion (AIC) is a measure developed by Japanese statistician Hirotugu Akaike at the Institute of Statistical Mathematics in Tokyo in 1974. It estimates the relative quality of statistical models by measuring how much information each model loses relative to the true data-generating process. AIC is founded on information theory. When a statistical model is used to represent the process that generated the data, the representation will almost never be exact; so some information will be lost by using the model to represent the process. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model. Lower AIC values indicate better models. Statistics assignment help for topics involving AIC most frequently involves both deriving the formula and interpreting results from software outputs — both of which this section covers.
The AIC Formula
The formula for AIC is elegant in its simplicity. It combines just two components: a measure of model fit (the log-likelihood) and a penalty for complexity (the number of parameters).
AIC = 2k − 2 ln(L̂)
where k = number of estimated parameters; L̂ = maximized value of the likelihood function
The term −2 ln(L̂) is the negative log-likelihood — a measure of how poorly the model fits the data. Smaller values mean better fit. The term 2k is the penalty for complexity — it increases with every additional parameter. The model that minimizes the total AIC score achieves the best balance between fit and parsimony. In logistic regression, AIC is computed from the model’s log-likelihood and is reported automatically by most statistical software.
For linear regression with normally distributed errors, the AIC simplifies to a form involving the residual sum of squares:
AIC = n · ln(RSS/n) + 2k
where n = sample size; RSS = residual sum of squares; k = number of parameters including the intercept and error variance
This form makes the trade-off explicit: as RSS decreases (better fit), AIC goes down. But as k increases (more parameters), AIC goes up. The minimum AIC model is the one where the marginal gain in fit from adding a parameter no longer outweighs the penalty cost of 2 units per parameter. Simple linear regression gives a clear starting point: with only one predictor, AIC is easily calculated and compared to alternative specifications.
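To make the arithmetic concrete, here is a minimal Python sketch — simulated data, not code from any cited source — comparing the RSS shortcut above to statsmodels' likelihood-based AIC:

```python
# A minimal sketch (simulated data) of the two equivalent AIC calculations.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(x)).fit()
print("statsmodels AIC:", round(res.aic, 2))  # 2k - 2 ln(L̂), with k = 2 mean-model coefficients

# RSS shortcut with k = 3 (intercept, slope, error variance, per the text).
# This form drops additive constants from the Gaussian log-likelihood, and
# statsmodels omits the error variance from k; both differences are identical
# for every candidate model fit to the same dataset, so model *rankings*
# are unaffected.
rss = float(np.sum(res.resid ** 2))
print("RSS-based AIC:", round(n * np.log(rss / n) + 2 * 3, 2))
```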
The Theoretical Foundation: Kullback-Leibler Divergence
AIC is not an ad hoc statistic. It has a rigorous theoretical grounding in information theory, specifically in Kullback-Leibler (KL) divergence — a measure of how much one probability distribution differs from another. The AIC compares models from the perspective of information entropy, as measured by Kullback-Leibler divergence. The idea is that the true data-generating process has some unknown distribution f, and any statistical model g is an approximation of f. KL divergence measures the information lost when g is used instead of f. AIC provides an unbiased estimator of the expected KL divergence — the model with the lowest AIC loses the least information relative to the truth.
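A toy numerical illustration of KL divergence — the distributions below are hypothetical, chosen only to show that a closer approximation loses less information:

```python
# KL(f || g) = sum_i f_i * ln(f_i / g_i): the expected information lost
# when g is used to approximate the "true" distribution f.
import numpy as np

f = np.array([0.5, 0.3, 0.2])       # hypothetical true distribution
g1 = np.array([0.45, 0.35, 0.20])   # close approximating model
g2 = np.array([1/3, 1/3, 1/3])      # cruder approximating model

def kl(f, g):
    return float(np.sum(f * np.log(f / g)))

print(kl(f, g1))  # ~0.006 nats: little information lost
print(kl(f, g2))  # ~0.069 nats: more information lost
```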
This grounding has an important implication: AIC is not designed to find the true model. It is designed to find the best approximating model among a set of candidates. If the “true model” is not in the candidate set, then the most that we can hope to do is select the model that best approximates the true model. AIC is appropriate for finding the best approximating model. This is precisely why AIC is favored for predictive modeling: when you accept that no model is true and your goal is minimizing prediction error, AIC is theoretically optimal. Hypothesis testing frameworks operate under a different logic — they assume a null hypothesis world — whereas AIC operates under the more pragmatic assumption that all models are approximations.
AICc: Corrected AIC for Small Samples
The standard AIC formula can overfit in small samples — it underpenalizes complexity when n is small relative to k. AICc (corrected AIC) addresses this by adding a second-order bias-correction term. Because the correction term approaches 0 with increasing sample size, AICc approaches AIC asymptotically. A widely cited rule of thumb — echoed in MathWorks' documentation — is to use AICc whenever n/k is less than 40. The formula is:
AICc = AIC + (2k² + 2k) / (n − k − 1)
Use AICc whenever n/k < 40. As n → ∞, AICc → AIC, so there is no cost to using AICc routinely.
The practical rule: always use AICc when your sample size is less than 40 times the number of parameters. For larger samples, AIC and AICc produce virtually identical rankings. Confidence intervals and sample size considerations are directly connected — small samples create uncertainty in parameter estimates that carries through to the log-likelihood and therefore to AIC values.
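A small helper capturing the correction and the rule of thumb (a sketch, not code from any cited source):

```python
# AICc = AIC + (2k^2 + 2k) / (n - k - 1), plus the n/k < 40 rule of thumb.
import math

def aicc(aic: float, n: int, k: int) -> float:
    """Second-order corrected AIC for small samples."""
    return aic + (2 * k**2 + 2 * k) / (n - k - 1)

def needs_correction(n: int, k: int) -> bool:
    """Burnham & Anderson's rule of thumb: use AICc whenever n/k < 40."""
    return n / k < 40

# Illustrative values: AIC = 150.0 from a model with k = 5 and n = 60.
print(needs_correction(60, 5))  # True (60/5 = 12 < 40)
print(aicc(150.0, 60, 5))       # 150 + 60/54 ≈ 151.11
```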
How to Interpret AIC Differences Between Models
AIC values are meaningless in isolation. You compare them relative to each other across candidate models fit to the same dataset. The model with the lowest AIC is preferred. But how big a difference is meaningful? The standard guidance from Burnham and Anderson (2002) — the definitive applied reference for AIC-based model selection — uses ΔAIC (delta AIC), the difference between each model’s AIC and the minimum AIC in the set. Models with ΔAIC < 2 have substantial support. Models with ΔAIC between 4 and 7 have considerably less support. Models with ΔAIC > 10 have essentially no empirical support relative to the best model. Understanding significance levels helps contextualize these thresholds — just as p-value cutoffs are conventions, so are ΔAIC cutoffs, and context always matters.
AIC in Practice: What Software Gives You
In R, AIC(model) computes the AIC for any model fitted with maximum likelihood. The step() function uses AIC by default for stepwise variable selection. In Python, statsmodels model objects expose .aic and .bic attributes after fitting. scikit-learn does not provide AIC/BIC directly (it uses cross-validation), but they can be derived from the log-likelihood. For ARIMA time series models, auto_arima in the pmdarima package selects the optimal order using AIC or BIC, depending on the user’s preference. Time series analysis with ARIMA is one of the most common applications where AIC model selection is used systematically to choose between competing specifications.
The Bayesian Information Criterion
What Is BIC? The Bayesian Information Criterion Explained
The Bayesian Information Criterion (BIC), also known as the Schwarz Information Criterion (SIC), was developed by Israeli-American statistician Gideon Schwarz at the Hebrew University of Jerusalem and published in the Annals of Statistics in 1978. Its derivation is entirely different from AIC’s: while AIC emerges from information theory and KL divergence, BIC was derived as a large-sample approximation to the Bayes factor — a concept rooted in Bayesian probability and model evidence. Despite their different theoretical origins, the two criteria share a structurally similar formula — with one critical difference in the penalty term. Bayesian methods like Markov chain Monte Carlo operate in the same probabilistic framework as BIC, so understanding one deepens understanding of the other.
The BIC Formula
BIC is calculated using the same components as AIC — log-likelihood and number of parameters — but the penalty term uses the natural logarithm of the sample size rather than the constant 2.
BIC = k · ln(n) − 2 ln(L̂)
where k = number of estimated parameters; n = number of observations; L̂ = maximized likelihood
The critical difference: AIC’s penalty per parameter is 2 (constant regardless of sample size). BIC’s penalty per parameter is ln(n), which grows as the dataset grows. Both BIC and AIC attempt to resolve the overfitting problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC for sample sizes greater than 7. For n = 10, ln(10) ≈ 2.30 — slightly heavier than AIC. For n = 1000, ln(1000) ≈ 6.91 — dramatically heavier. This is why BIC consistently selects simpler models than AIC, especially as sample size grows. Sampling distributions and sample size effects are directly relevant here — BIC’s behavior is fundamentally sample-size dependent.
For linear regression with normal errors, BIC takes the form:
BIC = n · ln(RSS/n) + k · ln(n)
Same logic as AIC but replaces the constant 2 with ln(n) — heavier penalization for more parameters
The Bayesian Derivation: Why BIC Is “Consistent”
BIC’s Bayesian derivation gives it a property AIC lacks: consistency. BIC is consistent, meaning it selects the true model with probability approaching 1 as sample size increases, while AIC is not consistent and may select an overly complex model even with large sample sizes. This property follows directly from BIC’s Bayesian interpretation: it approximates the log of the Bayes factor — the ratio of the marginal likelihoods of two models — which is the Bayesian standard for model comparison. As n grows, the data swamp the prior, and BIC converges to the posterior probability of the true model.
But — and this is crucial — BIC’s consistency guarantee assumes the true model is in the candidate set. Unlike AIC, which seeks the best predictive model, BIC attempts to select the true model if it exists in the model set. In most real-world data analysis, this assumption is questionable. George Box’s famous observation — “all models are wrong, but some are useful” — is the anti-BIC worldview. If you believe no model you are testing is literally true, BIC’s consistency advantage is largely theoretical. Decision theory provides the broader framework for thinking about what “best model” even means under uncertainty — a question AIC and BIC answer differently.
Interpreting BIC Differences: The Kass and Raftery Scale
Robert Kass and Adrian Raftery at Carnegie Mellon University developed an interpretive scale for BIC differences (ΔBIC) analogous to Burnham and Anderson’s ΔAIC scale. A ΔBIC between 0 and 2 is weak evidence for the lower-BIC model. Between 2 and 6 is positive evidence. Between 6 and 10 is strong evidence. Above 10 is very strong evidence. These thresholds are widely cited in the biomedical and social science literature. Research published in PMC on information criteria sensitivity and specificity demonstrates that different criteria can support genuinely different models — making the interpretation of these thresholds context-dependent rather than absolute. Type I and Type II error thinking is relevant here: BIC’s stricter penalty reduces the risk of selecting spuriously complex models (false positives for extra parameters) at the cost of potentially missing genuinely useful predictors.
Comparing the Two Criteria
AIC vs BIC: Key Differences, When Each Criterion Wins, and How to Choose
Students and researchers who encounter AIC vs BIC for the first time often treat the choice as arbitrary — “just use whichever your professor uses.” That’s a missed opportunity. The choice between AIC and BIC reflects a genuine philosophical choice about what you want from model selection: prediction accuracy or parsimony. Getting this right is the difference between a publishable analysis and one that reviewers question. Mastering academic writing in statistics-heavy fields requires knowing not just how to run AIC and BIC, but how to justify which one you used and why.
The Penalty Difference — Illustrated
The most concrete way to understand AIC vs BIC is to see how their penalties diverge across sample sizes. For a model with k = 3 parameters (where the third parameter is the one under debate):
AIC Penalty for k Parameters
- n = 10: penalty = 2k = 6 (for k=3)
- n = 100: penalty = 2k = 6
- n = 1,000: penalty = 2k = 6
- n = 10,000: penalty = 2k = 6
- Penalty does not grow with n — AIC tolerates complexity equally at all sample sizes
BIC Penalty for k Parameters
- n = 10: penalty = k·ln(10) ≈ 6.9
- n = 100: penalty = k·ln(100) ≈ 13.8
- n = 1,000: penalty = k·ln(1000) ≈ 20.7
- n = 10,000: penalty = k·ln(10000) ≈ 27.6
- Penalty grows with n — BIC increasingly disfavors extra parameters as data accumulates
This illustration reveals BIC’s core behavior: the more data you have, the harder BIC is on additional parameters. With 10,000 observations, BIC imposes nearly 5 times the penalty per parameter that AIC does. In a large clinical trial or a national survey dataset, AIC might accept a 4-predictor model over a 3-predictor one even when the 4th predictor adds marginal explanatory value. BIC would almost certainly reject it. Statistical power analysis intersects with this: large samples have the power to detect tiny effects, and BIC’s growing penalty prevents you from selecting models that include those trivially small but statistically significant effects.
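A short sketch reproducing the penalty numbers above:

```python
# AIC's penalty is a constant 2k; BIC's is k*ln(n) and grows with n (k = 3 here).
import math

k = 3
for n in (10, 100, 1_000, 10_000):
    print(f"n={n:>6}: AIC penalty = {2 * k}, BIC penalty = {k * math.log(n):.1f}")
```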
AIC Is Better for Prediction; BIC Is Better for Explanation
This is the core practical rule. In general, if the goal is prediction, AIC and leave-one-out cross-validation are preferred. AIC minimizes expected Kullback-Leibler divergence — the expected information loss when making predictions on new data — and it is asymptotically equivalent to leave-one-out cross-validation under certain conditions. Use AIC when you are building a forecasting model, a classification algorithm, a regression for future prediction, or any model whose primary purpose is predictive accuracy on new data. Polynomial regression model selection is a classic AIC use case: how many polynomial terms should you include in a predictive regression?
BIC is better for explanation and inference. When you want to identify which variables truly matter, understand the underlying structure of the data, or select a model for interpretation rather than prediction, BIC’s stricter penalty guards against spurious complexity. Research published in PMC on model selection and psychological theory demonstrates that in psychology and the social sciences — where latent variable models and factor analyses are common — BIC is heavily relied upon precisely because researchers are testing theories about underlying structure, not just trying to predict outcomes. Factor analysis — choosing how many latent factors to retain — is one of the most common BIC applications in social science research.
When AIC and BIC Disagree
They will disagree. Often. Information criteria based on penalized likelihood, such as AIC and BIC, are widely used for model selection in health and biological research. However, different criteria sometimes support different models, leading to discussions about which is the most trustworthy. When AIC selects model A and BIC selects model B, neither is definitively right — they are answering different questions. AIC says: “Model A makes better predictions.” BIC says: “Model B is more likely to be the true generating model.” The question is which objective you care more about. The distinction between qualitative and quantitative research goals parallels this: predictive goals align with AIC, explanatory/confirmatory goals align with BIC.
When criteria disagree, the best response is not to flip a coin — it is to examine the substantive implications. How different are the two models substantively? Is the extra predictor in the AIC-preferred model theoretically meaningful? Would including it change the interpretation of the other predictors? These questions require judgment, not just algorithms. Transparent results reporting in academic papers often means reporting both AIC and BIC and explaining how you resolved any disagreement — reviewers and editors increasingly expect this. The misuse of statistics through selective reporting — choosing the criterion that supports your preferred model without disclosing that criteria disagree — is a real concern in the published literature.
| Feature | AIC | BIC |
|---|---|---|
| Developed by | Hirotugu Akaike, 1974 (Institute of Statistical Mathematics, Tokyo) | Gideon Schwarz, 1978 (Hebrew University of Jerusalem) |
| Theoretical framework | Information theory — Kullback-Leibler divergence; frequentist | Bayesian inference — approximation to Bayes factor; Bayesian |
| Formula | 2k − 2ln(L̂) | k·ln(n) − 2ln(L̂) |
| Penalty per parameter | 2 (constant; does not grow with n) | ln(n) (grows with sample size) |
| Consistency | Not consistent — may select overly complex models even as n → ∞ | Consistent — converges to true model as n → ∞ (if true model in set) |
| Model complexity preference | More forgiving — accepts moderately complex models | Stricter — increasingly favors parsimonious models with large n |
| Best used for | Prediction, forecasting, exploratory modeling | Explanation, inference, confirmatory modeling |
| Small sample behavior | Can overfit; use AICc correction when n/k < 40 | Generally more conservative; also sensitive to small n |
| Common applications | Regression, ARIMA, machine learning, ecological modeling | Structural equation models, factor analysis, cluster number selection |
| Software implementation (R) | AIC(model); step(model) | BIC(model); stepAIC(model, k = log(n)) |
Where AIC and BIC Are Used
AIC and BIC in Practice: Applications Across Statistical Modeling Contexts
AIC and BIC are not just theoretical tools — they appear routinely in published research, data science workflows, and academic assignment answers across many modeling contexts. Understanding how they work in each context will sharpen your ability to apply them correctly and interpret others’ analyses critically. Choosing the right statistical test in assignments and research projects requires knowing not just which test to use but which model selection criterion is most defensible for the problem at hand.
AIC and BIC in Linear and Logistic Regression
In regression — whether linear, logistic, or generalized linear — AIC and BIC are the standard criteria for variable selection and model comparison. The typical workflow: specify a set of candidate models (including different combinations of predictors, interaction terms, or polynomial terms), fit each by maximum likelihood, compute AIC and/or BIC for each, and select the model that minimizes the criterion. Logistic regression is a natural arena for AIC: because logistic models can always include more interaction terms, AIC helps identify when additional complexity stops being worthwhile. In the context of polynomial regression, AIC typically selects a higher-degree polynomial than AICc or BIC in small samples — which reinforces why choosing the right variant of the criterion matters.
A critical constraint applies: AIC and BIC can only compare models fit to exactly the same dataset. Dropping a single observation changes n, which changes the log-likelihood scale, which makes AIC/BIC values incomparable. This constraint is frequently violated in practice — and it is one of the most common methodological errors in published regression analyses. Regression model assumptions must be checked before fitting models for comparison — comparing models that violate assumptions may produce misleading AIC rankings.
AIC and BIC in ARIMA and Time Series Modeling
ARIMA model selection — choosing the autoregressive (p), integration (d), and moving average (q) orders — is one of the most systematic applications of AIC and BIC. Time series analysis requires careful model selection to balance fit and complexity. The Akaike Information Criterion and Bayesian Information Criterion are key tools for model selection. These methods compare models based on their fit and complexity, helping analysts choose the most appropriate model for their time series data. In practice, analysts fit a grid of candidate ARIMA(p,d,q) models and identify the one with the lowest AIC or BIC. The auto.arima() function in R’s forecast package and auto_arima() in Python’s pmdarima automate this search. ARIMA and exponential smoothing assignments commonly require students to demonstrate this selection process explicitly, reporting the AIC and BIC for competing models before justifying the chosen specification.
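Here is a hand-rolled sketch of that grid search on simulated data, mirroring what auto.arima() and auto_arima() automate — the series and the (p, q) search range are illustrative assumptions:

```python
# AIC-based order selection on a simulated AR(1) series.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.empty(300)
y[0] = rng.standard_normal()
for t in range(1, 300):                 # AR(1) with coefficient 0.6
    y[t] = 0.6 * y[t - 1] + rng.standard_normal()

d = 0  # assume stationarity already established (e.g., via ADF/KPSS tests)
aics = {}
for p in range(4):
    for q in range(4):                  # some orders may emit convergence warnings
        aics[(p, q)] = ARIMA(y, order=(p, d, q)).fit().aic

best = min(aics, key=aics.get)
print("best (p, q) by AIC:", best)
```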
BIC often selects simpler ARIMA models than AIC — lower p and q orders — which can be advantageous for avoiding spurious autocorrelation patterns that appear in training data but don’t persist in new data. When forecasting is the primary goal, AIC’s tendency to accept slightly more complex models may actually improve out-of-sample performance.
AIC and BIC in Structural Equation Modeling and Factor Analysis
In structural equation modeling (SEM) and factor analysis — the dominant frameworks for latent variable modeling in psychology, sociology, and education research — AIC and BIC are used to select the number of latent factors, the factor structure, and the measurement model. There appears to be a heavy data-driven reliance on fit criteria like the Akaike Information Criterion and the Bayesian Information Criterion in latent variable models, perhaps for a lack of applicability of more traditional fit measures. In practice, researchers compare a series of models (one-factor, two-factor, three-factor solutions) and select the one with the lowest BIC — because in SEM, the goal is typically theoretical explanation rather than pure prediction. Factor analysis assignments that involve model selection should explicitly report and justify the information criterion used.
A well-known challenge in latent class analysis and mixture modeling: AIC and BIC often suggest different numbers of classes. In one published example, researchers using AIC or the sample-size-adjusted BIC (ABIC) would have chosen at least a five-class model, while BIC — and CAIC, which agreed with it — supported a simpler solution; the authors had to choose one or the other, and neither choice was wrong. The resolution requires invoking substantive theory — which number of classes is interpretable and theoretically meaningful? — alongside statistical criteria. This is a mature insight: information criteria guide but do not replace expert judgment. MANOVA and multivariate methods similarly require theoretical judgment about which model complexity is substantively meaningful, not just statistically optimal.
AIC and BIC in Machine Learning Model Selection
In machine learning, AIC and BIC are less commonly used than cross-validation — primarily because many ML models are not fitted by maximum likelihood, making the likelihood-based penalty inapplicable. However, for probabilistic ML models (Gaussian mixture models, hidden Markov models, Bayesian networks, linear models fitted via MLE), AIC and BIC are fully applicable and computationally cheaper than cross-validation. A benefit of probabilistic model selection methods is that a test dataset is not required, meaning that all of the data can be used to fit the model, and the final model that will be used for prediction in the domain can be scored directly. This matters in data-scarce settings: when you have a small dataset, holding out a test set for cross-validation may leave too few observations for stable model fitting. Ridge and Lasso regularization effectively solve a similar problem from a different angle — penalizing coefficient magnitude rather than parameter count.
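For a concrete case, scikit-learn's GaussianMixture exposes .aic(X) and .bic(X) methods; the following sketch (with simulated two-cluster data) selects the number of mixture components by BIC:

```python
# Choosing the number of Gaussian mixture components with BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1.0, size=(150, 1)),
               rng.normal(3, 0.5, size=(150, 1))])  # two true clusters

for n_components in range(1, 5):
    gm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    print(n_components, "components: BIC =", round(gm.bic(X), 1),
          " AIC =", round(gm.aic(X), 1))
# BIC should bottom out at 2 components for this two-cluster sample.
```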
The MDL Connection: The Minimum Description Length (MDL) principle — treating model selection as a compression problem — is closely related to BIC. Under MDL, the best model is the one that produces the shortest description of the data (model plus data given model). MDL and BIC are asymptotically equivalent, sharing the same fundamental preference for parsimony. This connection, developed at CWI (Centrum Wiskunde & Informatica) in the Netherlands and advanced by researchers including Peter Grünwald, reinforces BIC’s theoretical coherence from an entirely different angle. Decision theory provides the unified framework linking MDL, BIC, and Bayesian model averaging.
Beyond AIC and BIC
Beyond AIC and BIC: Alternative Model Selection Criteria and When to Use Them
AIC and BIC are dominant, but they are not the only information criteria available. Understanding the alternatives — when they were developed, what problems they address, and when they outperform AIC and BIC — is the mark of a sophisticated statistical analyst. This knowledge also helps when you encounter these criteria in published papers and need to critically evaluate methodological choices. Writing an exemplary literature review in statistics requires engaging critically with methods sections, including the model selection strategy authors used and whether it was appropriate for their goals.
AICc — When to Use the Small-Sample Correction
As discussed above, AICc is the go-to correction for small samples. Burnham and Anderson (2002) recommend using AICc as the default in ecological research because ecological datasets are frequently small relative to the number of candidate predictors. The practical rule (n/k < 40) is widely adopted in biology, ecology, and psychology. In large-sample economics and epidemiology, standard AIC is typically used without correction. Understanding sampling distribution behavior in small samples illuminates why this correction matters — the log-likelihood estimates themselves become noisier with fewer observations, which propagates into AIC’s estimates of model quality.
Hannan-Quinn Criterion (HQC) — The Middle Ground
The Hannan-Quinn Information Criterion (HQC), proposed by E.J. Hannan and B.G. Quinn in 1979, applies a penalty between AIC and BIC. Its formula uses 2k·ln(ln(n)) as the penalty term — growing more slowly with n than BIC but more quickly than AIC’s constant penalty. The Hannan-Quinn criterion offers a middle ground between AIC and BIC by applying a lighter penalty than BIC but a heavier one than AIC. HQC is consistent (like BIC) but less prone to underfitting in medium-sized samples. It is particularly popular in time series and econometrics, where it is sometimes preferred over BIC for autoregressive order selection. The Bank of England, Federal Reserve economists, and econometric researchers at the London School of Economics (LSE) frequently encounter HQC in macroeconomic modeling contexts. ARIMA and time series analysis assignments in econometrics courses may ask you to compare models using HQC alongside AIC and BIC.
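A minimal sketch of HQC from the quantities already defined (log-likelihood, k, n):

```python
# HQC = -2 ln(L̂) + 2k ln(ln(n)): a penalty between AIC's 2k and BIC's k ln(n)
# for realistic sample sizes.
import math

def hqc(loglik: float, k: int, n: int) -> float:
    return -2 * loglik + 2 * k * math.log(math.log(n))

# Per-parameter penalties at n = 1,000:
n = 1_000
print("AIC:", 2,
      " HQC:", round(2 * math.log(math.log(n)), 2),   # ~3.87
      " BIC:", round(math.log(n), 2))                  # ~6.91
```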
WAIC and DIC — Bayesian Alternatives
For fully Bayesian models — where parameters are assigned prior distributions and posterior distributions are estimated by Markov chain Monte Carlo — neither AIC nor BIC is ideal. Two Bayesian alternatives fill this role. The Deviance Information Criterion (DIC), developed by David Spiegelhalter at Cambridge University, extends BIC to hierarchical Bayesian models where parameter counting is ambiguous. The Widely Applicable Information Criterion (WAIC), developed by Sumio Watanabe at Tokyo Institute of Technology, is theoretically superior to both DIC and AIC for singular models (those where the usual regularity conditions fail). WAIC, in particular, is asymptotically equivalent to leave-one-out cross-validation and applies even in complex or singular models. For students working with Bayesian models in courses using Stan, JAGS, or PyMC, WAIC is the modern standard for model comparison. Markov chain Monte Carlo methods are the computational backbone of Bayesian model estimation — WAIC and DIC operate on the posterior samples these methods produce.
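For intuition, WAIC can be computed directly from the S × n matrix of pointwise posterior log-likelihoods (S draws, n observations) that MCMC software produces. In practice you would use ArviZ's az.waic(); the numpy sketch below just shows the formula, with a simulated stand-in matrix:

```python
# WAIC from an S x n pointwise log-likelihood matrix (simulated stand-in).
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)
log_lik = rng.normal(-1.0, 0.1, size=(4000, 100))  # stand-in posterior draws

S = log_lik.shape[0]
lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))  # log pointwise predictive density
p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))       # effective number of parameters
waic = -2 * (lppd - p_waic)                            # deviance scale, comparable to AIC/BIC
print(round(waic, 1))
```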
Cross-Validation — When AIC/BIC Are Not Applicable
Cross-validation (CV) is the model selection approach most commonly used in machine learning and in cases where maximum likelihood estimation is not the fitting framework. Leave-one-out cross-validation (LOO-CV) and k-fold CV directly estimate the model’s predictive performance on new data by repeatedly fitting on subsets and evaluating on the held-out portion. Under certain regularity conditions, LOO-CV is asymptotically equivalent to AIC. The advantage of CV: it makes no parametric assumptions about the likelihood. The disadvantage: it is computationally expensive and requires held-out data that reduces the sample available for fitting. Cross-validation and bootstrapping are the essential complement to AIC/BIC — knowing when to use each is part of a complete model selection toolkit. For large samples, AIC and BIC are computationally cheaper and theoretically equivalent to CV; for small samples and non-parametric settings, CV is often preferable.
⚠️ Which Criterion Should You Use? A Decision Guide
Use AICc when your sample is small (n/k < 40) and your goal is prediction. Use AIC when your sample is large and your goal is prediction or forecasting. Use BIC when your goal is identifying the true model structure, especially for large samples in confirmatory research. Use HQC in time series and econometrics when you want a penalty between AIC and BIC. Use WAIC or DIC for fully Bayesian hierarchical models. Use cross-validation when your model is not fitted by maximum likelihood, your sample is too small for holdout, or you distrust parametric assumptions. When in doubt, compute both AIC and BIC — if they agree, proceed with confidence. If they disagree, report both and justify your choice based on your research question’s primary objective. Understanding whether your research goal is predictive or explanatory is the first step in making this choice correctly.
Key Entities & Figures
Key Entities, Researchers, and Institutions in AIC and BIC Model Selection
Understanding who developed these tools, where they worked, and what makes their contributions unique separates a surface-level statistical model selection assignment from one that demonstrates genuine scholarly depth. The following are the most significant figures and institutions in this field.
Hirotugu Akaike — The Information Criterion Pioneer
Hirotugu Akaike (1927–2009) was a Japanese statistician at the Institute of Statistical Mathematics in Tokyo, Japan. What makes Akaike uniquely significant is not merely that he created AIC — it is that he fundamentally reframed how statisticians think about model selection. Before AIC, model comparison was dominated by significance testing (likelihood ratio tests, F-tests) — a null hypothesis testing framework that Akaike argued was the wrong tool for model selection. Akaike’s insight was that model selection is an information problem, not a hypothesis testing problem: the question is not “is this model significantly better?” but “how much information does each model lose?” His 1974 paper “A new look at the statistical model identification” in IEEE Transactions on Automatic Control introduced AIC and is one of the most cited papers in all of statistics. Akaike received the Kyoto Prize in 2006 — often described as Japan’s Nobel Prize — for his contributions to statistical methodology.
Gideon Schwarz — The Bayesian Counterpoint
Gideon Schwarz (1933–2007) was an Israeli-American statistician and professor at the Hebrew University of Jerusalem. What makes Schwarz uniquely significant is that his 1978 BIC paper — barely three pages long — derived an entirely different model selection criterion from Bayesian principles that challenged AIC’s dominance. Schwarz showed that BIC approximates the log Bayes factor — the gold-standard Bayesian measure of model evidence — making BIC the Bayesian answer to Akaike’s frequentist criterion. Despite the brevity of the original paper, BIC has become arguably as widely used as AIC across the social, biological, and medical sciences. The Wikipedia entry on BIC provides accessible background; the original paper in the Annals of Statistics (1978) is the scholarly source.
Kenneth Burnham and David Anderson — The Applied Champions
Kenneth Burnham and David Anderson at the U.S. Geological Survey and Colorado State University are the figures most responsible for the widespread adoption of AIC-based model selection in ecology, wildlife biology, and the environmental sciences. Their book Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2002, 2nd ed.) is the definitive applied reference for AIC usage — readable, comprehensive, and explicitly critical of null hypothesis significance testing as a model selection tool. The Burnham-Anderson framework introduced the ΔAIC scale, the concept of evidence ratios, and the practice of model averaging (combining predictions across multiple models weighted by their AIC scores). Their approach is now standard in top ecology and conservation biology journals including Ecology, Conservation Biology, and the Journal of Wildlife Management.
The Journal of the Royal Statistical Society (JRSS)
The Journal of the Royal Statistical Society — published by the Royal Statistical Society (RSS) in London, UK — is one of the primary venues where advances in model selection theory are published and debated. Series B (Statistical Methodology) of JRSS has published foundational work on the theoretical properties of AIC, BIC, and related criteria. For students and researchers writing statistics assignments requiring peer-reviewed citations, JRSS Series B is one of the highest-quality sources available for statistical methodology literature. Literature review writing for statistics-heavy research papers should draw on JRSS, Biometrika, Annals of Statistics, and the Journal of the American Statistical Association (JASA) for methodological foundations.
PMC (PubMed Central) / NCBI — The Health Sciences Authority
The National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) in Bethesda, Maryland hosts PubMed Central (PMC) — the open-access repository of biomedical and life science research. PMC publishes extensive methodological literature on AIC and BIC applications in health research. The paper by Vrieze (2012) on “Model selection and psychological theory” and the paper by Tein, Coxe, and Cham on information criteria sensitivity and specificity are both freely accessible on PMC and are among the most useful references for understanding when AIC and BIC make different choices in behavioral and health science contexts. For students citing AIC/BIC in psychology, education, or public health assignments, PMC is an essential scholarly source.
| Entity | Type / Location | Key Contribution | Scholarly Resource |
|---|---|---|---|
| Hirotugu Akaike | Statistician / Tokyo, Japan | AIC (1974); information-theoretic model selection framework; Kyoto Prize 2006 | IEEE Trans. Automatic Control, 1974; APA/PsycNET archives |
| Gideon Schwarz / Hebrew University of Jerusalem | Statistician / Jerusalem, Israel | BIC (1978); Bayesian model evidence approximation; Bayes factor connection | Annals of Statistics, 1978 |
| Burnham & Anderson / Colorado State University | Ecologists / Colorado, USA | Applied AIC framework; ΔAIC scale; model averaging; multimodel inference | Model Selection and Multimodel Inference, 2002 |
| Royal Statistical Society (RSS) | Professional Organization / London, UK | Publishes JRSS — leading venue for model selection theory advances | academic.oup.com/jrsssb |
| NCBI / NIH (PMC) | Government Research / Bethesda, USA | Open-access AIC/BIC applied research in psychology, health, biology | pmc.ncbi.nlm.nih.gov — Vrieze (2012); Tein et al. |
| Institute of Statistical Mathematics | Research Institute / Tokyo, Japan | Akaike’s institutional home; continuing research in information-theoretic statistics | ism.ac.jp |
Practical Application
How to Use AIC and BIC in Statistical Assignments: Step-by-Step
Knowing the theory is one thing. Executing it correctly in a statistics assignment or research paper is another. The following walkthrough covers the complete applied workflow for AIC and BIC model selection — from defining the candidate model set to writing up the results clearly and defensibly. Mastering academic writing for statistics assignments means integrating methodology justification, numerical results, and interpretation into a coherent analysis — not just reporting numbers. Conducting thorough statistical research before writing ensures your analysis rests on appropriate methods.
Step 1: Define the Candidate Model Set
Before computing a single AIC or BIC value, you need a principled set of candidate models. This means models you have a theoretical reason to consider — not every possible combination of variables. The AIC statistic is designed for preplanned comparisons between models, as opposed to comparisons of many models during automated searches. Random automated stepwise searches over large variable sets produce data dredging — the information criteria values become inflated, and the selected model is likely overfit to the training data. The candidate set should be small, theoretically motivated, and exhaustive of the plausible alternatives for your research question. P-hacking and data dredging are the pathological cases of undisciplined model searching — AIC/BIC do not prevent this if the candidate set is constructed opportunistically.
Step 2: Fit All Models Using Maximum Likelihood
All models in the candidate set must be fitted by maximum likelihood estimation (MLE), because AIC and BIC are defined in terms of the maximized log-likelihood — not Bayesian MCMC (unless WAIC is being used), not method of moments. In most software this happens automatically: for linear regression with normal errors, OLS coefficient estimates coincide with the MLE, so standard regression output is compatible. Logistic regression is always fitted by MLE, making it naturally compatible with AIC/BIC comparisons. Survival analysis using the Cox model uses partial likelihood — a modified version of MLE — and AIC is available for Cox models in most software, though the parameter count requires careful specification.
Step 3: Compute AIC, BIC (and AICc if n/k < 40)
In R: AIC(model1, model2, model3) returns a table of AIC values for all listed models. BIC(model1, model2, model3) does the same for BIC. For AICc in R, use the AICcmodavg package. In Python with statsmodels: after fitting model = smf.ols('y ~ x1 + x2', data=df).fit(), access model.aic and model.bic. For ARIMA in Python: auto_arima(series, information_criterion='aic') or 'bic'. Report the AIC/BIC values for all candidate models in a table — not just the winning model. Journals and assignment rubrics increasingly require reporting the full comparison, not just the selected model. Creating professional tables and charts for presenting AIC/BIC comparisons in your assignment makes your analysis visually clear and easier to evaluate.
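A worked sketch of this step in Python — the candidate formulas and simulated data are illustrative assumptions:

```python
# Fit a small candidate set by MLE and tabulate AIC and BIC for every model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 120
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2 + 1.5 * df["x1"] + rng.normal(size=n)  # x2 is a noise predictor

candidates = {
    "y ~ x1":      smf.ols("y ~ x1", data=df).fit(),
    "y ~ x2":      smf.ols("y ~ x2", data=df).fit(),
    "y ~ x1 + x2": smf.ols("y ~ x1 + x2", data=df).fit(),
}
table = pd.DataFrame({name: {"AIC": m.aic, "BIC": m.bic}
                      for name, m in candidates.items()}).T
print(table.round(2))  # report the full comparison table, not just the winner
```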
Step 4: Compute ΔAIC and ΔBIC — Don’t Just Report Raw Values
Subtract the minimum AIC in your candidate set from each model’s AIC to get ΔAIC. Do the same for BIC. This converts the arbitrary scale of information criteria into a meaningful scale of relative evidence. Apply Burnham and Anderson’s ΔAIC thresholds or Kass and Raftery’s ΔBIC thresholds to characterize the strength of evidence for each model. Transparent results reporting requires presenting these deltas alongside the raw values so readers can evaluate the evidence themselves. Some journals now require reporting Akaike weights — wᵢ = exp(−ΔAICᵢ/2) / Σⱼ exp(−ΔAICⱼ/2) — which convert ΔAIC values into probabilities of each model being the best in the set, as sketched below.
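A short sketch of the ΔAIC and Akaike-weight calculation (the AIC values are illustrative):

```python
# Convert raw AIC values into ΔAIC and Akaike weights.
import numpy as np

aic = np.array([210.3, 212.1, 219.8])   # candidate models' AICs (illustrative)
delta = aic - aic.min()                 # ΔAIC
weights = np.exp(-0.5 * delta)
weights /= weights.sum()                # w_i = exp(-Δ_i/2) / Σ_j exp(-Δ_j/2)
for d, w in zip(delta, weights):
    print(f"ΔAIC = {d:4.1f}, weight = {w:.3f}")
# ΔAIC < 2 -> substantial support; ΔAIC > 10 -> essentially none (Burnham & Anderson)
```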
Step 5: Interpret the Results in Context
Numbers don’t interpret themselves. After identifying the model with the lowest AIC or BIC, you need to do three things: (1) verify the selected model makes theoretical sense — a model that wins on information criteria but contradicts established theory warrants skepticism; (2) check the model’s assumptions and diagnostics — a low AIC doesn’t rescue a model with severe heteroscedasticity or non-normality of residuals; (3) report whether AIC and BIC agree — if they select different models, explain which criterion you are prioritizing and why, given your research objective. Choosing the right statistical approach involves exactly this kind of contextual reasoning — the tool must fit the question. Building a persuasive, logically coherent statistical argument means justifying your methodological choices, not just reporting outcomes.
Common AIC/BIC Mistakes in Statistics Assignments
The five most common errors in student assignments involving AIC and BIC: (1) Comparing models fit to different datasets — this invalidates the comparison entirely. (2) Using AIC when n/k < 40 without applying the AICc correction — introduces small-sample bias. (3) Reporting only the winning model’s AIC without comparing it to alternatives — provides no evidence of selection quality. (4) Treating a slightly lower AIC as definitive evidence for one model over another — ΔAIC < 2 means models are roughly equivalent. (5) Conflating a lower AIC with a “better” model in an absolute sense — AIC is a relative criterion that only ranks models within the candidate set, not against some absolute standard of adequacy. Common student mistakes in statistical writing often come down to the same root cause: reporting results without fully engaging with what they mean.
Key Terms & LSI Concepts
Essential Terms, LSI Keywords, and Conceptual Map for AIC and BIC
Scoring well on AIC and BIC model selection assignments — particularly at graduate level — requires demonstrating command of the field’s vocabulary and conceptual landscape. The following terms and NLP themes are the ones most likely to appear in assignment rubrics, exam questions, and the peer-reviewed literature you need to cite.
Core Statistical Terms
Maximum likelihood estimation (MLE) — the parameter estimation framework in which parameters are chosen to maximize the probability of observing the data; the foundation of both AIC and BIC. Log-likelihood (ln L̂) — the natural log of the likelihood function evaluated at the MLE; the goodness-of-fit component of both criteria. Penalized likelihood — the general class of model selection criteria that balance fit against complexity by subtracting a penalty from the log-likelihood. Parsimony — the principle of preferring simpler models when they explain the data equally well; Occam’s Razor in statistical form. Goodness of fit — how well a model explains the observed data; measured here by the log-likelihood. Model complexity — the number of free parameters in the model; the quantity penalized by both AIC and BIC.
Overfitting — capturing noise in the training data as if it were signal; the problem both AIC and BIC are designed to prevent. Underfitting — using a model too simple to capture the genuine structure in the data; BIC is more prone to this than AIC in large samples. Kullback-Leibler divergence — the information-theoretic measure of how much a model distribution differs from the truth; the quantity AIC estimates. Bayes factor — the Bayesian ratio of model evidences; what BIC approximates asymptotically. Likelihood ratio test (LRT) — hypothesis test comparing nested models using the difference in log-likelihoods; related to AIC/BIC but requires nested models and a null hypothesis framework. Nested vs non-nested models — AIC and BIC can compare both; LRT can only compare nested models. Chi-square goodness-of-fit testing is related to LRT and represents the hypothesis-testing alternative to information-criterion-based comparison.
NLP/Advanced Concepts for Graduate-Level Analysis
Asymptotic consistency — a criterion’s property of selecting the true model with probability 1 as n → ∞; BIC has this, AIC does not. Model averaging / multimodel inference — combining predictions from multiple candidate models weighted by their AIC scores; reduces model selection uncertainty. Akaike weights — probabilities assigned to each candidate model based on its ΔAIC; sum to 1 across all models. Evidence ratios — ratios of Akaike weights between pairs of models; quantify the relative evidence for one model over another. Variable importance — the sum of Akaike weights across all models containing a given predictor; measures how consistently a predictor appears in well-supported models. Selection bias — the inflation of apparent model fit that occurs when the same data are used for both model selection and inference. Regularization — penalized regression methods (Ridge, Lasso, Elastic Net) that impose continuous parameter shrinkage; a related but distinct approach to the same problem AIC/BIC address via discrete model comparison.
For academic assignments requiring a critical comparative analysis, the debate between information-criterion-based selection and null hypothesis significance testing is a rich vein of content. Paul Meehl at the University of Minnesota argued that conventional significance testing was inadequate for model selection in psychology — a view echoed by Akaike and Burnham/Anderson. Andrew Gelman at Columbia University has argued for posterior predictive checking and WAIC over AIC/BIC in Bayesian contexts. These ongoing methodological debates are part of what makes model selection a live and contested area of statistical practice, not a settled textbook algorithm. Writing a strong argumentative essay in statistics engages precisely this kind of contested methodological territory — taking a defensible position with evidence rather than simply describing tools.
The bias-variance tradeoff is the machine learning framing of the same AIC/BIC concept: models that fit training data well (low bias) often generalize poorly (high variance) because they overfit. AIC and BIC operationalize an optimal point on this tradeoff from the likelihood perspective. Ridge and Lasso regularization address the same tradeoff from a different direction — by penalizing coefficient magnitude rather than parameter count. Understanding both approaches and their relationship is the mark of a student who has genuinely internalized the problem rather than memorized a formula. Principal component analysis (PCA) offers yet another lens — dimensionality reduction as a form of model simplification — that connects to AIC/BIC thinking about parsimony and complexity.
Frequently Asked Questions
Frequently Asked Questions: AIC and BIC in Statistical Modeling
What is AIC in statistical modeling?
AIC (Akaike Information Criterion), developed by Hirotugu Akaike at the Institute of Statistical Mathematics in Tokyo in 1974, estimates the relative quality of statistical models by measuring information loss. It is calculated as AIC = 2k − 2ln(L̂), where k is the number of parameters and L̂ is the maximized likelihood. Lower AIC means a better model. AIC balances goodness of fit against model complexity — it penalizes extra parameters with a constant cost of 2 per parameter, making it suitable for predictive modeling. It is grounded in information theory, specifically Kullback-Leibler divergence, and is the most widely used model selection criterion in ecology, time series, and machine learning applications.
What is BIC and how does it differ from AIC?
BIC (Bayesian Information Criterion), developed by Gideon Schwarz at the Hebrew University of Jerusalem in 1978, is calculated as BIC = k·ln(n) − 2ln(L̂). The key difference from AIC is the penalty: BIC uses k·ln(n) instead of AIC’s 2k. Because ln(n) grows with sample size, BIC applies heavier penalties to complex models as data accumulates — strongly favoring parsimonious models in large datasets. BIC is theoretically consistent (it selects the true model with probability 1 as n → ∞, assuming the true model is in the candidate set), while AIC is not. AIC is preferred for prediction; BIC for explanation and inference. For any sample size above 7 observations, BIC penalizes additional parameters more harshly than AIC.
When should I use AIC over BIC, or vice versa?
Use AIC when your primary goal is prediction — AIC minimizes expected information loss on new data and is asymptotically equivalent to leave-one-out cross-validation under certain conditions. Use BIC when your goal is explanation, inference, or identifying the true model structure — BIC’s consistency property makes it appropriate for confirmatory research where parsimony matters. For small samples (n/k < 40), use AICc instead of standard AIC. In time series and econometrics, the Hannan-Quinn criterion (HQC) is sometimes preferred as a middle ground. When AIC and BIC disagree, report both and justify your choice based on the primary research objective: prediction (AIC) or inference (BIC).
What does a lower AIC or BIC value mean?
Lower AIC or BIC means a better-performing model relative to competitors in the same candidate set, fitted to the same dataset. The absolute value is not interpretable — only differences between models (ΔAIC or ΔBIC) are meaningful. A ΔAIC < 2 suggests two models are roughly equivalent in predictive quality. A ΔAIC between 4 and 7 suggests the higher-AIC model has considerably less support. A ΔAIC > 10 indicates virtually no support for the higher model. For BIC, a ΔBIC > 10 is considered very strong evidence against the higher-BIC model (Kass and Raftery scale). Never compare AIC/BIC values across models fit to different datasets — the comparison is meaningless.
How do I calculate AIC in R or Python?
In R: fit your model with a maximum likelihood function (e.g., lm(), glm(), arima()), then call AIC(model) or BIC(model). To compare multiple models simultaneously: AIC(model1, model2, model3) returns a table. For AICc: install and load the AICcmodavg package, then use AICc(model). For stepwise selection using AIC: use step(full_model). In Python using statsmodels: fit a model with .fit(), then access the model.aic and model.bic attributes. For ARIMA in Python: use pmdarima's auto_arima(series, information_criterion='aic') or 'bic'. For any model, you can also compute AIC manually from the log-likelihood: AIC = 2*k − 2*loglik, where loglik = model.llf in statsmodels.
Can AIC and BIC select different models from the same data?
Yes — frequently, especially with large sample sizes. Because BIC’s penalty grows with n while AIC’s does not, BIC increasingly disfavors complex models as data accumulates. With large datasets, AIC may select a 5-predictor model while BIC selects a 3-predictor one. This is not a contradiction — they are answering different questions. AIC is asking “which model makes the best predictions?” BIC is asking “which model is most likely to be the true data-generating structure?” When they disagree, researchers should report both values and explicitly justify their final selection based on the research question’s primary objective. Sensitivity analysis (reporting how conclusions change if you use the other criterion) is good practice.
What is AICc and when should I use it?
AICc (corrected AIC) adds a second-order bias correction to standard AIC: AICc = AIC + (2k² + 2k)/(n − k − 1). It is recommended whenever the sample size n divided by the number of parameters k is less than 40 (n/k < 40). In small samples, standard AIC tends to select overly complex models because its penalty underestimates the overfitting risk. AICc corrects this. As n grows, the correction term approaches zero and AICc converges to AIC. Burnham and Anderson recommend using AICc as the default in ecological research. For most regression applications with sample sizes of several hundred observations and fewer than 20 parameters, AIC and AICc will produce identical model rankings.
How is AIC used in ARIMA time series model selection?
In ARIMA modeling, AIC is the standard criterion for selecting the autoregressive order (p), the moving average order (q), and sometimes the integration order (d) — collectively written ARIMA(p,d,q). The workflow is: determine d using unit root tests (ADF, KPSS); then fit a grid of ARIMA(p,d,q) models with that d fixed, over a range of p and q values; compute AIC or BIC for each; and select the model with the lowest criterion value. The auto.arima() function in R and auto_arima() in Python automate this search. AIC is typically preferred over BIC for ARIMA selection when the goal is forecasting accuracy. BIC-selected ARIMA models tend to have lower p and q values (simpler dynamics), which can be advantageous for interpretability but may miss important autocorrelation structure.
What are the limitations of AIC and BIC?
Key limitations: (1) Both only compare models on the same dataset with the same response variable — values are incomparable across datasets. (2) Neither provides an absolute measure of model adequacy — the best model in a bad candidate set is still a bad model. (3) AIC is inconsistent — it may select unnecessarily complex models in large samples. (4) BIC assumes the true model exists in the candidate set — an assumption rarely satisfied in real-world complex data. (5) Both require maximum likelihood estimation — not applicable to all model types (e.g., some distance-based or nonparametric methods). (6) In high-dimensional settings (many predictors, few observations), BIC cannot handle feature selection well. (7) Data dredging — testing many models and picking the best AIC — inflates the apparent quality of the selected model.
Is BIC always better than AIC for large datasets?
Not necessarily. BIC’s heavier penalty in large samples can lead to underfitting — selecting models that are too simple and miss genuine structure. Research published in PMC on AIC/BIC sensitivity shows that in mixture models and latent class analyses, BIC often selects fewer classes than are truly present in large, complex datasets, while AIC selects the correct number. The optimal choice depends on your loss function: if you care more about not including spurious predictors (false positives), BIC’s stricter penalty is beneficial. If you care more about not missing genuine predictors (false negatives), AIC’s more permissive penalty is better. For large datasets where even tiny effects are detectable, BIC’s parsimony can be a meaningful advantage or a meaningful limitation — depending on whether those tiny effects are scientifically important.
