Power Analysis and Effect Size Cohen’s d
Power analysis and Cohen’s d are the two pillars every researcher must understand before collecting a single data point. This guide explains what statistical power means, how Cohen’s d quantifies the magnitude of an effect, how to calculate your required sample size, and how these concepts connect to hypothesis testing, Type I and Type II errors, and reproducible research. Whether you are writing a dissertation methods section, designing an experiment, or preparing a statistics assignment, this is the complete resource.
Foundations
Power Analysis and Effect Size Cohen’s d: Why Every Researcher Needs Both
Power analysis and Cohen’s d are two of the most important yet consistently misunderstood concepts in quantitative research. Every study that tests a hypothesis needs to answer one question before data collection begins: how large does my sample need to be to detect the effect I expect? That question cannot be answered without understanding statistical power. And statistical power cannot be calculated without an estimate of effect size — most commonly expressed as Cohen’s d when comparing two group means.
Here is why this matters so much. Underpowered studies are not just inefficient — they are misleading. A study with 40% power has a 60% chance of failing to detect a real effect. When researchers publish null results from underpowered studies, those results get treated as evidence that no effect exists. That is a scientific error with real consequences in medicine, education, psychology, and public policy. Hypothesis testing only works as intended when the study is designed with adequate power from the start.
Power analysis for the behavioral sciences was popularized by Jacob Cohen, an American statistician and professor at New York University. His book Statistical Power Analysis for the Behavioral Sciences (first edition 1969, second edition 1988) remains the foundational text in the field. Cohen did not just explain power: in a 1962 survey he showed that most published studies in psychology were dramatically underpowered, a finding that has since been replicated across fields. The effect size metric that bears his name, Cohen’s d, gave researchers a standardized, unit-free way to quantify how large a difference between two groups actually is.
0.80
Conventional minimum for statistical power — meaning an 80% chance of detecting a true effect
0.05
Standard alpha level — the threshold for statistical significance, set to limit Type I errors
d = 0.5
Cohen’s benchmark for a “medium” effect size, detectable with 80% power at α = 0.05 using 64 participants per group
This guide covers every component of power analysis and Cohen’s d that students and researchers need. You will learn what these metrics mean, how to calculate them, how to use tools like G*Power and R, how to report them in a dissertation or journal article, and how they connect to Type I and Type II errors, sample size, and the replication crisis in science. For students in psychology, education, public health, business, or any field requiring quantitative research, this is the complete resource.
The core insight: Statistical significance tells you whether an effect probably exists. Effect size tells you whether it is worth caring about. Power tells you whether your study was capable of finding it in the first place. You need all three to interpret research honestly.
Core Concept
What Is Statistical Power? Definition, Formula, and Meaning
The Definition of Statistical Power
Statistical power is the probability that a statistical test will correctly reject a false null hypothesis. In plain terms: it is the probability of detecting a real effect when one actually exists. Power is expressed as a value between 0 and 1. A power of 0.80 means the study has an 80% chance of detecting the effect it was designed to find. Type I and Type II errors are the two failure modes in hypothesis testing, and power is directly tied to both.
To understand power clearly, you need to understand its four interconnected components. These four quantities are linked: once any three are fixed, the fourth is determined. Cohen (1992) made this relationship the foundation of modern study design.
- Alpha (α): The significance level — the threshold for rejecting the null hypothesis. Typically set at 0.05.
- Effect size (d): The magnitude of the difference between groups, standardized in units of standard deviation.
- Sample size (n): The number of participants per group. Larger samples produce more power.
- Power (1 − β): The probability of correctly detecting the effect. The complement of Type II error.
Power, Alpha, and Beta: The Three-Way Trade-off
Statistical power equals 1 minus beta (β). Beta is the probability of a Type II error — failing to reject a false null hypothesis, i.e., missing a real effect. If power is 0.80, then β = 0.20, meaning there is a 20% chance of a false negative. Alpha (α) is the probability of a Type I error — incorrectly rejecting a true null hypothesis, i.e., finding an effect that does not exist.
These three parameters constrain each other. If you keep your sample size fixed and lower alpha from 0.05 to 0.01 (making it harder to claim significance), you increase the threshold for rejection — and reduce power. If you want to maintain power at 0.80 while using a stricter alpha, you must increase your sample size. This three-way trade-off is the engine of power analysis. Understanding statistical power is essential before any study design decision.
Power = 1 − β
Where β = probability of Type II error (false negative).
Power of 0.80 means β = 0.20 — a 20% chance of missing a real effect.
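As a rough guide to how alpha, effect size, and sample size combine into power, the power of a two-tailed comparison of two group means with n participants per group can be approximated with the normal distribution. This is a large-sample sketch, not the exact noncentral t calculation that G*Power performs. The symbols Φ (the standard normal cumulative distribution function) and z₁₋α/₂ (its two-tailed critical value, 1.96 for α = 0.05) are notation introduced here for illustration.

```latex
\[
1 - \beta \;\approx\; \Phi\!\left( d\,\sqrt{\tfrac{n}{2}} \;-\; z_{1-\alpha/2} \right)
\]
```

With d = 0.5, n = 64, and α = 0.05, the argument is 0.5 × √32 − 1.96 ≈ 0.87, and Φ(0.87) ≈ 0.81, close to the conventional 0.80 target.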
Why Is 0.80 the Conventional Power Threshold?
Jacob Cohen proposed 0.80 as the conventional minimum for adequate statistical power in the behavioral sciences. His reasoning was pragmatic: if alpha is set at 0.05 (a 1-in-20 chance of a false positive), then the ratio of Type II to Type I errors in a study with power = 0.80 is 4:1. Cohen considered this ratio reasonable for most research. He acknowledged it was a convention, not a law of nature, and explicitly stated that some research contexts — such as medical trials where missing a real treatment effect has serious consequences — should use higher power targets such as 0.90 or 0.95.
Today, many methodologists argue that 0.80 is too low. A 2013 review in Nature Reviews Neuroscience by Button et al. argued that underpowered studies produce inflated effect sizes and poor replicability. Fields including psychology, social science, and clinical medicine have faced reproducibility crises partly because the standard of 80% power was treated as a ceiling rather than a floor. Many contemporary researchers now aim for 90% or 95% power, particularly for pre-registered studies.
What Factors Increase Statistical Power?
Power increases when any of the following occur — all else equal:
- The sample size increases. More data reduces sampling error and makes it easier to detect smaller effects.
- The effect size is larger. A larger real-world difference between groups is easier to detect.
- The alpha level is relaxed (e.g., from 0.01 to 0.05). A wider rejection region increases power at the cost of more Type I errors.
- A one-tailed test replaces a two-tailed test when the direction of the effect is known in advance. A one-tailed test places the entire rejection region in the predicted direction.
- Measurement precision improves. Lower measurement error reduces the denominator in the effect size calculation, effectively increasing d.
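The first four factors in the list above can be checked numerically. The sketch below uses a normal approximation to the two-sample t-test, so its values run slightly higher than the exact figures G*Power reports for small samples. The function name `approx_power` and the specific inputs are illustrative choices, not part of any library.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def approx_power(d, n_per_group, alpha=0.05, two_tailed=True):
    """Approximate power for comparing two independent group means.

    Normal approximation only; an exact calculation would use the
    noncentral t distribution.
    """
    tail_area = alpha / 2 if two_tailed else alpha
    z_crit = _STD_NORMAL.inv_cdf(1 - tail_area)
    shift = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    power = 1 - _STD_NORMAL.cdf(z_crit - shift)
    if two_tailed:
        # Rejections in the opposite tail; negligible unless d is near zero.
        power += _STD_NORMAL.cdf(-z_crit - shift)
    return power


print(f"baseline (d=0.5, n=64, alpha=.05): {approx_power(0.5, 64):.3f}")
print(f"larger sample (n=100):             {approx_power(0.5, 100):.3f}")
print(f"larger effect (d=0.8):             {approx_power(0.8, 64):.3f}")
print(f"stricter alpha (.01):              {approx_power(0.5, 64, alpha=0.01):.3f}")
print(f"one-tailed test:                   {approx_power(0.5, 64, two_tailed=False):.3f}")
```

Each line changes one input relative to the baseline, which makes the direction of every trade-off visible at a glance.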
Power in Practice: What Underpowered Studies Really Mean
A study with 40% power that produces a null result cannot legitimately be cited as evidence that no effect exists. It only shows that if the effect is real, this study had a 60% chance of missing it. Interpreting a null result as evidence of absence requires adequate power. When reviewing studies for a literature review or dissertation, always check reported power or calculate it from reported sample sizes and effect sizes. Writing a literature review without evaluating study power is a common but costly oversight.
Effect Size
What Is Cohen’s d? Definition, Formula, and Interpretation
The Definition of Cohen’s d
Cohen’s d is a standardized effect size measure that expresses the difference between two group means in units of pooled standard deviation. It was introduced by Jacob Cohen in his 1969 book Statistical Power Analysis for the Behavioral Sciences and formalized in the 1988 second edition. Cohen’s d is the most widely used effect size measure in behavioral, social, health, and educational research when comparing two groups on a continuous outcome measure.
Effect size matters because statistical significance and practical significance are not the same thing. A study with 10,000 participants (5,000 per group) can reach statistical significance (p < 0.05) with Cohen’s d = 0.04, a difference so tiny it has no real-world meaning. Conversely, a study with 20 participants might find d = 0.9 that fails to reach significance only because the sample is too small. Descriptive versus inferential statistics plays directly into this distinction: effect size describes how large a difference is, while a significance test only asks whether it is distinguishable from zero.
The Cohen’s d Formula
Cohen’s d is calculated by dividing the difference between two group means by a measure of variability — typically the pooled standard deviation.
d = (M₁ − M₂) / SDpooled
Where M₁ = mean of group 1, M₂ = mean of group 2,
and SDpooled = the pooled standard deviation of both groups.
Pooled SD = √[ (SD₁² + SD₂²) / 2 ]
The result is a dimensionless number. A d of 1.0 means the two group means are separated by one full standard deviation. A d of 0.5 means they are separated by half a standard deviation. The sign of d indicates direction: a positive d means group 1 scored higher; a negative d means group 2 scored higher. For most reporting purposes, the absolute value of d is used. This standardized metric makes Cohen’s d comparable across studies even when outcome variables are measured in completely different units.
Cohen’s d Benchmarks: Small, Medium, and Large
Jacob Cohen proposed three benchmark thresholds as practical conventions for interpreting the magnitude of d values. These have become the universal reference point in research methods courses and journal publications alike. Statistics assignment help questions about Cohen’s d almost always ask students to interpret values against these benchmarks.
0.2
Small Effect
The groups overlap considerably. The difference is real but subtle — often not perceptible without measurement.
0.5
Medium Effect
A visible, meaningful difference. Cohen (1992) described a medium effect as one “likely to be visible to the naked eye of a careful observer.”
0.8
Large Effect
A substantial difference between groups — typically obvious without statistical testing and of clear practical significance.
These benchmarks are conventions, not laws. Cohen himself cautioned that they should be used only “when no better basis for setting effect size is available.” In practice, what constitutes a meaningful effect depends heavily on the field. In pharmacology, an effect size of d = 0.2 might represent a clinically meaningful difference in patient survival. In social media engagement research, d = 0.8 might be mundane. Cohen’s 1992 paper in Psychological Bulletin makes this context-dependence explicit.
Cohen’s d vs Other Effect Sizes
Cohen’s d is specific to comparisons between two group means. Other effect size measures serve different purposes. r (Pearson’s correlation) measures the strength of a linear relationship between two continuous variables. Eta-squared (η²) or partial eta-squared measure the proportion of variance explained in ANOVA designs. Odds ratios are used in logistic regression and categorical outcomes. Hedges’ g is a bias-corrected version of Cohen’s d that is preferred when samples are small. For t-test comparisons, Cohen’s d remains the standard effect size metric.
Worked Example: Calculating Cohen’s d by Hand
Suppose a researcher tests whether a new study technique improves exam scores. Group A (technique) has a mean of 78 (SD = 10). Group B (control) has a mean of 72 (SD = 12).
Step 1: Calculate pooled SD = √[(10² + 12²) / 2] = √[(100 + 144) / 2] = √[244 / 2] = √122 ≈ 11.05
Step 2: Cohen’s d = (78 − 72) / 11.05 = 6 / 11.05 ≈ 0.54
Interpretation: d = 0.54 falls just above Cohen’s medium effect threshold (0.5). The technique group scored about half a standard deviation higher than the control group, a difference large enough to be of practical interest. Note that d describes the size of the difference, not whether it is statistically significant; significance also depends on the sample size. Calculating standard deviation correctly is the critical first step here.
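The hand calculation can be reproduced in a few lines of Python. This is a minimal sketch working from summary statistics. The first helper implements the equal-group-size pooled SD used in this guide. The second is the degrees-of-freedom-weighted version commonly used when group sizes differ; it is included as an assumption about what a reader with unequal groups would need. The function names are illustrative.

```python
from math import sqrt


def pooled_sd_equal_n(sd1, sd2):
    """Pooled SD as defined in this guide (assumes equal group sizes)."""
    return sqrt((sd1 ** 2 + sd2 ** 2) / 2)


def pooled_sd_weighted(sd1, n1, sd2, n2):
    """Pooled SD weighted by degrees of freedom, for unequal group sizes."""
    return sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(mean1, mean2, sd_pooled):
    """Standardized difference between two group means."""
    return (mean1 - mean2) / sd_pooled


# Group A: M = 78, SD = 10; Group B: M = 72, SD = 12
sd_pooled = pooled_sd_equal_n(10, 12)
d = cohens_d(78, 72, sd_pooled)
print(f"pooled SD = {sd_pooled:.2f}, d = {d:.2f}")  # pooled SD = 11.05, d = 0.54
```

When the two groups are the same size, both pooled SD helpers return the same value, so the choice only matters for unbalanced designs.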
Study Design
What Is Power Analysis and When Do You Need It?
Power Analysis: The Core Definition
Power analysis is a statistical procedure used to determine the sample size required to detect an effect of a given size with a specified level of confidence. It is performed before data collection as part of study design — this is called an a priori power analysis. Power analysis can also be performed after a study to evaluate what the study was capable of detecting given its actual sample size — this is called a post hoc power analysis, though its use is controversial and often misleading.
Power analysis uses the four components introduced earlier — alpha, power, effect size, and sample size — in a specific direction. You specify three and solve for the fourth. In most study design contexts, you specify your desired alpha (0.05), your desired power (0.80 or higher), and your expected effect size (from prior literature or a pilot study), and the power analysis calculates the sample size you need. Sampling distributions underpin this calculation mathematically.
A Priori vs Post Hoc Power Analysis
The distinction between these two types matters enormously in research practice.
An a priori power analysis is performed before the study begins. Its purpose is to determine how many participants you need to recruit in order to have a reasonable chance of detecting your hypothesized effect. Funding bodies, ethics committees, and dissertation committees in the United States (particularly at research universities like Harvard, Stanford, UCLA, and University of Michigan) and in the United Kingdom (at institutions including Oxford, UCL, and Edinburgh) routinely require a priori power analysis as part of study justification.
A post hoc power analysis is performed after data collection, using the observed effect size and actual sample size to calculate what power the study had. The problem is circular: if the study found a significant result, power is irrelevant; if it found a null result, a post hoc power calculation is misleading because it uses the observed (and potentially unreliable) effect size. Most methodologists, including Hoenig and Heisey (2001), advise against routine use of post hoc power analysis for null result interpretation.
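The circularity is easy to see numerically. The sketch below treats the test as a two-tailed z-test, a simplifying assumption, and computes the "observed power" implied by nothing more than the p-value. The function name `observed_power` is illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def observed_power(p_value, alpha=0.05):
    """Post hoc power implied by a two-tailed p-value (z-test approximation).

    The observed effect is plugged in as if it were the true effect, which
    is the circular step that makes this quantity uninformative.
    """
    z_obs = _STD_NORMAL.inv_cdf(1 - p_value / 2)  # |z| that produced this p
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper = 1 - _STD_NORMAL.cdf(z_crit - z_obs)
    lower = _STD_NORMAL.cdf(-z_crit - z_obs)
    return upper + lower


for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.2f}")
```

Because the function takes only the p-value as input, observed power cannot tell you anything the p-value has not already said.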
The Four Scenarios in Power Analysis
Power analysis is flexible. Depending on what you know and what you need, you can use it in four ways:
- Solve for sample size: Given alpha, power, and effect size, find n. This is the most common use in study planning.
- Solve for power: Given alpha, n, and effect size, find power. Used to evaluate the adequacy of a planned or completed study.
- Solve for effect size: Given alpha, power, and n, find the minimum detectable effect. Used when sample size is fixed (e.g., by budget) to understand study limitations.
- Solve for alpha: Given power, n, and effect size, find the required alpha. This is the least common use; G*Power calls it a criterion analysis.
⚠️ Common mistake: Using a post hoc power analysis after a null result and citing it as evidence the study was adequately powered. Post hoc power computed from the observed effect size is a one-to-one function of the p-value. A result with p just above .05 has observed power of roughly 50%, and larger p-values give lower values still, so a non-significant result always yields low observed power. This is a tautology, not new information. Use sensitivity analysis or confidence intervals instead.
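A sensitivity analysis answers the more useful question: given the sample actually available, what is the smallest effect the study could reliably detect? The sketch below solves for that minimum detectable d under a normal approximation for a two-tailed, two-group design. The function name and the sample sizes shown are illustrative.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def minimum_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Smallest d a two-group study can detect (normal approximation)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_power = _STD_NORMAL.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n_per_group)


for n in (20, 64, 200):
    d_min = minimum_detectable_d(n)
    print(f"n = {n:>3} per group -> minimum detectable d = {d_min:.2f}")
```

Unlike post hoc power, this calculation does not use the observed effect at all, so it can be reported honestly alongside a null result.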
Power Analysis for Different Statistical Tests
Different statistical tests require different power calculations because they have different test statistics and sampling distributions. Power analysis logic applies broadly, but the specific formulas and inputs vary. Chi-square tests use a different effect size measure (Cohen’s w or Cramér’s V). ANOVA uses f as the effect size. The two-sample t-test uses Cohen’s d. The following are the most common tests and their associated effect size metrics:
- Independent samples t-test: Cohen’s d
- Paired samples t-test: Cohen’s d (dz for within-subjects designs)
- One-sample t-test: Cohen’s d
- One-way ANOVA: Cohen’s f
- Chi-square goodness-of-fit or independence: Cohen’s w
- Pearson correlation: r directly as effect size
- Multiple regression: Cohen’s f²
- MANOVA: Cohen’s f²
Step-by-Step Process
How to Conduct a Power Analysis: Step-by-Step
Running a power analysis correctly requires making deliberate, defensible choices at each step. The following process applies to the most common scenario in student and graduate research: a two-sample independent t-test comparing means between two groups. The same logic extends to other designs with appropriate modifications.
1
Define Your Research Question and Statistical Test
Before you can calculate anything, you need to know what test you are running. Are you comparing two independent group means? Use an independent samples t-test and Cohen’s d. Are you examining a before-and-after design within a single group? Use a paired t-test. Are you comparing three or more groups? Use ANOVA and Cohen’s f. The choice of test determines which power formula and which effect size metric applies. Research methods in psychology typically requires students to justify test selection before any analysis.
2
Estimate the Expected Effect Size (Cohen’s d)
This is the step that demands the most intellectual honesty. You have three sources for your expected effect size. First, prior published literature in your area: look for meta-analyses or well-powered studies that report Cohen’s d or give enough information to calculate it. Second, a pilot study you ran to estimate the effect. Third, Cohen’s conventional benchmarks as a last resort. The worst approach is using an unrealistically large effect size to justify a small, cheap sample. Academic research tools like PsycINFO, PubMed, and Web of Science are essential for finding relevant prior effect sizes.
3
Set Your Alpha Level
The conventional alpha is 0.05 for two-tailed tests in most social, behavioral, and educational research. If your study is exploratory or theory-generating, 0.10 is sometimes acceptable. If you are working in a medical or clinical context where a false positive carries serious consequences, 0.01 or even 0.001 may be appropriate. State your alpha and justify it. In a two-tailed test, you are allowing for the effect to be in either direction. In a one-tailed test, you are predicting the direction in advance — and must be prepared to defend that prediction.
4
Set Your Desired Power Level
Conventional minimum is 0.80. Many universities and funding bodies now expect 0.90 for dissertation studies. Pre-registered studies increasingly report 0.95. If you are conducting a replication study of a previously published finding, you should aim for higher power than the original study used, because the original study is likely to have overestimated its effect size due to publication bias. The Center for Open Science has detailed guidelines on power requirements for pre-registered and registered report submissions.
5
Run the Power Analysis in G*Power or R
G*Power is the most widely used free software for power analysis, developed by Franz Faul and colleagues at the University of Kiel, Germany. It handles all major statistical tests and allows you to specify inputs and solve for the unknown parameter (most often sample size). In R, the pwr package (Champely, 2020) provides equivalent functionality with reproducible code that can be included in your methods appendix. Both tools are free. For most student and dissertation-level research, G*Power is sufficient and widely accepted by thesis committees at institutions including Columbia University, King’s College London, and University of Toronto.
6
Report and Justify Your Sample Size
In your methods section, report your power analysis inputs and output. State the effect size you assumed, the alpha level you used, the power you targeted, the test you used, and the sample size produced. If you used G*Power, you can include the parameters exactly. A typical dissertation methods statement reads: “A priori power analysis using G*Power 3.1 indicated that a minimum sample of 128 participants (64 per group) was required to detect a medium effect (d = 0.50) with 80% power at a two-tailed alpha of .05 for an independent samples t-test.” Research paper writing standards require this level of specificity.
Power Analysis in G*Power: Key Parameters
G*Power structures every analysis around five fields. Understanding these fields removes the confusion most students experience the first time they open the software.
- Test family: Choose the family of tests (t-test, F-test, χ² test, z-test, etc.).
- Statistical test: Choose the specific test within the family, e.g., “Means: Difference between two independent groups” for an independent t-test.
- Type of power analysis: Choose “A priori” to solve for sample size.
- Input parameters: Enter alpha (0.05), power (0.80), and effect size (Cohen’s d, e.g., 0.5 for medium).
- Output: G*Power produces the total sample size, the sample size per group, the actual power achieved, and the critical test statistic value. For an independent t-test with d = 0.5, α = 0.05, power = 0.80, G*Power returns n = 64 per group (128 total).
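For readers who want to sanity-check a G*Power result without the software, the sketch below approximates the per-group sample size for a two-tailed independent samples t-test. It uses the normal-approximation formula plus a small-sample adjustment term. This is an assumption-laden shortcut, not G*Power's exact noncentral t algorithm, and it can differ from G*Power by one participant in borderline cases. The function name `n_per_group` is illustrative.

```python
from math import ceil
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-tailed independent samples t-test."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    n_normal = 2 * (z_alpha + z_power) ** 2 / d ** 2
    # Adjustment that moves the normal-theory answer toward the t-based one;
    # an approximation, not G*Power's exact procedure.
    return ceil(n_normal + z_alpha ** 2 / 4)


n = n_per_group(0.5)
print(f"d = 0.5, alpha = .05, power = .80 -> {n} per group, {2 * n} total")
```

For the medium-effect case this reproduces the 64 per group (128 total) quoted above. For a reproducible methods appendix, the exact calculation in G*Power or the R pwr package remains the safer citation.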
Reference Tables
Sample Size Requirements for Different Effect Sizes and Power Levels
The following two tables summarize the sample sizes required for an independent samples t-test under different combinations of Cohen’s d, power level, and alpha. These are the most frequently referenced values in research methods courses and are derived from standard power analysis formulas. Confidence intervals and power analysis share the same mathematical infrastructure — both are based on the sampling distribution of the test statistic.
| Cohen’s d | Effect Size Category | n per group (power = 0.80, α = 0.05) | n per group (power = 0.90, α = 0.05) | n per group (power = 0.95, α = 0.05) |
|---|---|---|---|---|
| 0.20 | Small | 394 | 527 | 651 |
| 0.30 | Between small and medium | 176 | 235 | 290 |
| 0.50 | Medium | 64 | 86 | 105 |
| 0.80 | Large | 26 | 34 | 42 |
| 1.00 | Very large | 17 | 23 | 27 |
| 1.20 | Very large | 12 | 16 | 20 |
This table reveals something important that many students do not fully appreciate: small effects require enormous samples. Detecting a small effect (d = 0.2) at 80% power requires nearly 400 participants per group — nearly 800 in total for a two-group study. This is why undergraduate research projects rarely detect small effects, why meta-analyses are necessary to reliably estimate small but real effects, and why the replication crisis in psychology has been particularly severe in areas where most published effects were small and most studies were underpowered.
| Scenario | Cohen’s d | Alpha | Power Target | n per Group | Total N |
|---|---|---|---|---|---|
| Dissertation pilot study | 0.50 (medium) | 0.05 | 0.80 | 64 | 128 |
| Clinical RCT | 0.35 | 0.05 | 0.90 | 173 | 346 |
| Pre-registered replication | 0.30 | 0.05 | 0.95 | 290 | 580 |
| Education intervention study | 0.50 | 0.05 | 0.90 | 86 | 172 |
| Medical screening test | 0.20 (small) | 0.01 | 0.90 | 746 | 1492 |
Academic Reporting
How to Report Cohen’s d and Power Analysis in APA Format
Reporting Cohen’s d in Results Sections
APA 7th edition style requires that effect sizes accompany significance test results. Cohen’s d should be reported alongside the t-statistic, degrees of freedom, and p-value when comparing two group means. The format varies slightly by test but the convention is clear and consistent. T-test reporting in APA style requires all of these elements together.
Example — Independent t-test result with Cohen’s d in APA 7:
“The intervention group (M = 78.1, SD = 10.2) scored significantly higher than the control group (M = 72.1, SD = 11.8) on the post-test measure, t(126) = 3.08, p = .003, d = 0.54, 95% CI [0.19, 0.88]. This medium effect size indicates a practically meaningful difference between conditions.”
Several elements are non-negotiable in APA reporting. You must include the test statistic (t), degrees of freedom in parentheses, the exact p-value (not just “p < .05”), the effect size (d), and the 95% confidence interval for d. The confidence interval for d is increasingly required by major journals, including those published by the American Psychological Association and the British Psychological Society.
Reporting Power Analysis in the Methods Section
Power analysis belongs in the Participants subsection of the Method section, immediately after describing the recruitment strategy. Report all inputs and the resulting sample size. Use the past tense if reporting a completed study and future tense if reporting a proposal.
Example — A priori power analysis in APA 7 methods section:
“Sample size was determined a priori using G*Power 3.1.9.7 (Faul et al., 2009). Based on a medium effect size (d = 0.50) as reported in comparable intervention studies (Smith & Jones, 2023), a two-tailed alpha of .05, and a target power of .80, a minimum of 64 participants per group (128 total) was required for an independent samples t-test. We recruited 143 participants to allow for an anticipated 10% dropout rate.”
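The dropout allowance is simple arithmetic, but it is easy to get backwards: the required sample must be divided by the expected retention rate, not multiplied by one plus the dropout rate. With 128 completers needed and 10% attrition, 128 / 0.90 ≈ 142.2, which rounds up to 143. A minimal sketch with an illustrative function name:

```python
from fractions import Fraction
from math import ceil


def recruitment_target(required_n, dropout_rate):
    """Participants to recruit so that required_n remain after dropout."""
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    # Fraction avoids floating-point surprises when the division is exact.
    retention = 1 - Fraction(str(dropout_rate))
    return ceil(required_n / retention)


print(recruitment_target(128, 0.10))  # 143
```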
Confidence Intervals for Cohen’s d
Cohen’s d, like any sample statistic, is an estimate subject to sampling error. Providing a 95% confidence interval for d communicates the precision of that estimate. A d of 0.54 with a 95% CI of [0.19, 0.88] tells readers that the population effect could plausibly be anywhere from small to large. The interval excludes zero, which is consistent with a real effect, but it also shows considerable uncertainty about the exact magnitude. Narrow confidence intervals require large samples. This is another reason that adequately powered studies produce more trustworthy effect size estimates. Confidence intervals are not optional reporting: they are essential for readers to evaluate the precision and replicability of your findings.
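One common way to obtain such an interval is the large-sample standard error of d, sketched below. This is an approximation: software that uses the noncentral t distribution can return limits that differ in the second decimal place. The group sizes of 64 are assumed for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def d_confidence_interval(d, n1, n2, level=0.95):
    """Large-sample confidence interval for Cohen's d."""
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return d - z * se, d + z * se


low, high = d_confidence_interval(0.54, 64, 64)
print(f"d = 0.54, n = 64 per group: 95% CI [{low:.2f}, {high:.2f}]")

low_small, high_small = d_confidence_interval(0.54, 20, 20)
print(f"d = 0.54, n = 20 per group: 95% CI [{low_small:.2f}, {high_small:.2f}]")
```

The second call shows how quickly the interval widens when the same d comes from a smaller study.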
Common Reporting Mistakes
Students and even published researchers make the following errors repeatedly in reporting effect sizes and power:
- Reporting only p-values and omitting effect sizes — this is no longer acceptable in major journals.
- Confusing statistical significance with practical significance — a significant p-value at d = 0.06 is not meaningful.
- Performing a post hoc power analysis and citing it as evidence of adequate power when the study was null.
- Using the wrong pooled SD formula — using only one group’s SD instead of the pooled value.
- Failing to report confidence intervals for Cohen’s d — now expected by APA 7th edition and most top-tier journals.
- Using Cohen’s conventional benchmarks without citing them — always cite Cohen (1988) when applying the small/medium/large thresholds.
Applications
Power Analysis and Cohen’s d in Real Research: Education, Psychology, and Health
Education Research
Power analysis and Cohen’s d are fundamental in educational research at institutions like the Institute of Education Sciences (IES) in Washington, D.C. and the What Works Clearinghouse. When researchers evaluate whether a new teaching method, curriculum, or intervention improves student outcomes, they need to design studies with enough power to detect meaningful differences in learning. Education effect sizes are often small — d = 0.2 to 0.4 — because many interventions compete against already-decent baseline instruction. This means education studies typically require large samples and careful power planning. IES What Works Clearinghouse standards explicitly require power analysis documentation for studies seeking high-evidence ratings.
Clinical Psychology and Psychotherapy Research
In clinical psychology, Cohen’s d is the standard way to compare treatment and control group outcomes in randomized controlled trials. A meta-analysis of cognitive behavioral therapy (CBT) for depression, published in JAMA Psychiatry, reported a mean Cohen’s d of approximately 0.73, a medium-to-large effect that has consistently replicated across studies. Knowing this, a researcher planning a new CBT trial can use d = 0.73 as the expected effect size in G*Power to determine the required sample. The American Psychological Association’s Division 12 maintains a list of empirically supported treatments where effect sizes are central to the evidence classification.
Public Health and Medicine
Medical research often uses power analysis in clinical trial registration. The U.S. Food and Drug Administration (FDA) and National Institutes of Health (NIH) both require power analysis as part of clinical trial protocols. The NIH’s Rigor and Reproducibility initiative, launched following the replication crisis, specifically cites inadequate power as a leading cause of irreproducible biomedical results. For continuous outcomes in medical trials, Cohen’s d (or its equivalent) is used alongside clinical thresholds to determine the minimum clinically important difference — the smallest effect that would change clinical practice.
The Replication Crisis and the Role of Power
The replication crisis, the widespread failure of published findings to replicate, has its roots in statistical underpowering combined with publication bias. When journals publish only significant results, the published literature becomes biased toward overestimates of effect size. A small study with low power to detect a true effect of d = 0.5 reaches significance mainly when its sample happens to produce an observed d larger than 0.5. This “winner’s curse” means that published effect sizes from underpowered studies are systematically inflated.
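A small simulation makes the winner's curse concrete. The sketch below runs many hypothetical two-group studies with a fixed true effect, then compares the average observed d across all studies with the average among only those that reached significance. The true effect (0.3), the group size (20), and the number of studies are arbitrary illustrative choices. The critical t value of 2.024 is the standard two-tailed .05 table value for 38 degrees of freedom.

```python
import random
from math import sqrt
from statistics import mean, stdev

TRUE_D = 0.3        # hypothetical true effect
N_PER_GROUP = 20    # hypothetical, deliberately small sample
T_CRIT = 2.024      # two-tailed .05 critical t for df = 38 (table value)
N_STUDIES = 2000

rng = random.Random(42)  # fixed seed so the run is repeatable


def simulate_observed_d():
    """Run one two-group study and return its observed Cohen's d."""
    treatment = [rng.gauss(TRUE_D, 1.0) for _ in range(N_PER_GROUP)]
    control = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
    sd_pooled = sqrt((stdev(treatment) ** 2 + stdev(control) ** 2) / 2)
    return (mean(treatment) - mean(control)) / sd_pooled


all_d = [simulate_observed_d() for _ in range(N_STUDIES)]
# With equal groups, t = d * sqrt(n / 2); keep only the "publishable" results.
significant_d = [d for d in all_d if abs(d) * sqrt(N_PER_GROUP / 2) > T_CRIT]

print(f"true d:                      {TRUE_D}")
print(f"mean d, all studies:         {mean(all_d):.2f}")
print(f"share significant (power):   {len(significant_d) / N_STUDIES:.2f}")
print(f"mean d, significant studies: {mean(significant_d):.2f}")
```

With 20 per group, a study can only reach significance if its observed d exceeds about 0.64, so every significant positive result overstates a true effect of 0.3 by construction.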
The 2015 Reproducibility Project: Psychology, led by Brian Nosek at the University of Virginia, attempted to replicate 100 published psychology studies and found that only about 40% successfully replicated. A major predictor of replication failure was low statistical power in the original study. This finding prompted widespread reform in research methods education, pre-registration requirements, and effect size reporting standards. Factor analysis and data reduction methods face the same underpowering problem in exploratory contexts.
✓ Well-Powered Research Practice
- Effect size estimated from prior meta-analyses, not inflated assumptions
- A priori power analysis with power ≥ 0.80 documented in methods
- Study pre-registered before data collection
- Cohen’s d reported with 95% CI in results
- Sample size accounts for expected dropout
- Sensitivity analysis reported alongside primary power analysis
✗ Common Underpowered Practice
- Effect size assumed to be “large” to justify small sample
- No power analysis reported; sample size selected by convenience
- Study not pre-registered; analysis changed after seeing data
- Only p-values reported; no effect size included
- Post hoc power analysis cited to defend null result
- Single study treated as definitive evidence
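The dropout item in the checklist above reduces to one line of arithmetic: divide the required sample by the expected retention rate and round up. A minimal sketch, assuming a hypothetical 15% dropout rate and the 64-per-group requirement for a medium effect:

```python
from math import ceil


def recruit_target(n_required, dropout_rate):
    """Participants to recruit so that n_required remain after expected dropout."""
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    return ceil(n_required / (1 - dropout_rate))


# Assumed values: 64 completers needed per group, 15% expected dropout.
print(recruit_target(64, 0.15))  # 76
```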
Advanced Concepts
Variants of Cohen’s d: Hedges’ g, Glass’s Δ, and dz
When Standard Cohen’s d Is Not Enough
The standard Cohen’s d formula assumes equal sample sizes and approximately equal population standard deviations between groups. When these assumptions are violated, corrected variants provide more accurate effect size estimates. Understanding these variants is important for graduate-level research and for accurate meta-analysis. Regression analysis and other advanced statistical models have their own analogous effect size metrics, but for group mean comparisons, the d family is the standard.
Hedges’ g: The Corrected Cohen’s d
Hedges’ g is a bias-corrected version of Cohen’s d proposed by Larry Hedges at the University of Chicago. The standard Cohen’s d slightly overestimates the population effect size, particularly with small samples. Hedges’ g applies a correction factor (J) that reduces this bias. For samples of n ≥ 20 per group, the difference between d and g is negligible. For smaller samples, g is preferred. In meta-analyses, Hedges’ g is almost always the effect size of choice because it corrects for small-sample bias in every included study before the studies are weighted by their precision.
g = d × J
Where J = correction factor = 1 − (3 / (4df − 1))
df = degrees of freedom = (n₁ + n₂ − 2)
For large samples, J ≈ 1 and g ≈ d
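The correction is simple enough to apply by hand. The following sketch implements the formula above exactly as written; the function name and the example inputs are illustrative assumptions.

```python
def hedges_g(d, n1, n2):
    """Apply the small-sample correction J = 1 - 3 / (4*df - 1) to Cohen's d."""
    df = n1 + n2 - 2
    j = 1 - 3 / (4 * df - 1)
    return d * j


# Same d, two hypothetical sample sizes: the correction shrinks as n grows.
print(round(hedges_g(0.5, 10, 10), 3))  # 0.479
print(round(hedges_g(0.5, 50, 50), 3))  # 0.496
```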
Glass’s Δ: When Variances Differ Substantially
Glass’s Δ (Delta), proposed by Gene Glass at the University of Colorado, uses only the control group’s standard deviation in the denominator rather than the pooled SD. This is appropriate when the variances of the two groups differ substantially — for example, in treatment outcome studies where the treatment may affect not only the mean but also the variability of scores (some people respond very well, others not at all). Using Glass’s Δ preserves the control group as the natural baseline reference.
Δ = (M₁ − M₂) / SDcontrol
Uses only the control group’s SD rather than the pooled SD.
Preferred when treatment changes score variability as well as the mean.
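To see how the two denominators diverge, the sketch below computes both the pooled-SD Cohen’s d and Glass’s Δ on the same data. The scores are invented for illustration, with the treatment group deliberately given a much wider spread than the control group.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Mean difference divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * stdev(treatment) ** 2
                  + (n2 - 1) * stdev(control) ** 2) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


def glass_delta(treatment, control):
    """Mean difference divided by the control group's standard deviation only."""
    return (mean(treatment) - mean(control)) / stdev(control)


# Made-up scores: treatment responses vary far more than control responses.
treatment = [55, 60, 72, 85, 90, 98]
control = [58, 60, 62, 64, 66, 68]
print(round(cohens_d(treatment, control), 2), round(glass_delta(treatment, control), 2))
```

Because the pooled SD is pulled upward by the treatment group’s variability, d comes out much smaller than Δ here. That gap is the reason to choose the denominator deliberately rather than by default.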
dz: Cohen’s d for Within-Subjects (Paired) Designs
When the same participants are measured twice (before and after an intervention, or in two conditions), you use a paired t-test, not an independent t-test. The appropriate effect size is dz, where z refers to the difference scores (D = post − pre). dz is calculated as the mean of the difference scores divided by the standard deviation of the difference scores. Because within-subjects designs control for individual differences, dz is typically larger than the equivalent between-subjects d for the same data; this holds whenever the two measurements are strongly correlated (r > 0.5). Lakens (2013), in a widely cited methods paper in Frontiers in Psychology, provides a comprehensive treatment of d variants and their calculation. Understanding the paired t-test is essential before interpreting dz.
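As a minimal sketch of the calculation just described, the function below takes paired pre and post scores and returns the mean difference divided by the standard deviation of the differences. The function name and the four example pairs are hypothetical.

```python
from statistics import mean, stdev


def cohens_dz(pre, post):
    """Mean of the difference scores divided by their standard deviation."""
    if len(pre) != len(post):
        raise ValueError("pre and post must contain one score per participant")
    diffs = [after - before for before, after in zip(pre, post)]
    return mean(diffs) / stdev(diffs)


# Invented scores for four participants measured twice.
pre = [10, 12, 14, 16]
post = [12, 13, 17, 18]
print(round(cohens_dz(pre, post), 2))  # 2.45
```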
Tools
Tools for Power Analysis: G*Power, R, and SPSS
G*Power 3.1: The Gold Standard for Students
G*Power is free, widely used, and covers virtually all commonly used statistical tests. It was developed by Franz Faul (University of Kiel), Edgar Erdfelder (University of Mannheim), and Albert-Georg Lang and Axel Buchner (Heinrich Heine University Düsseldorf) in Germany. The software handles t-tests, F-tests (ANOVA, regression), chi-square tests, z-tests, and more. For each test, G*Power offers five types of analysis (a priori, post hoc, sensitivity, criterion, and compromise) and produces detailed output including power curves that visualize how power changes with sample size.
G*Power is one of the most frequently cited power analysis tools in published methods sections, which makes it a safe choice for dissertation work. It can be downloaded from the University of Düsseldorf.
R: The pwr Package for Reproducible Power Analysis
The pwr package in R, written by Stéphane Champely and documented comprehensively on CRAN, provides clean, reproducible power analysis with code you can include in your supplementary materials or appendix. In each call below, leaving out the sample size argument tells pwr to solve for it. The key functions are:
Independent t-test (two groups): pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, type = "two.sample")
One-sample t-test: pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, type = "one.sample")
Paired t-test: pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, type = "paired")
One-way ANOVA: pwr.anova.test(k = 3, f = 0.25, sig.level = 0.05, power = 0.80)
Correlation: pwr.r.test(r = 0.3, sig.level = 0.05, power = 0.80)
Chi-square: pwr.chisq.test(w = 0.3, sig.level = 0.05, power = 0.80, df = 2)
SPSS and Stata
IBM SPSS Statistics has included built-in power analysis procedures since version 27; before that, power analysis required the separate, paid SamplePower product. SPSS is still used less often for this purpose in academic contexts than G*Power or R, which are free and more flexible. Stata has built-in power commands (power twomeans, power oneproportion, etc.) that are well-documented and used in health economics and epidemiology research. SAS PROC POWER is the standard in pharmaceutical and clinical trial contexts. For most student work, G*Power or R’s pwr package is both sufficient and expected. Running tests in SPSS and connecting those results back to your a priori power analysis completes a methodologically sound research workflow.
Clarifications
Common Misconceptions About Power Analysis and Cohen’s d
Misconception 1: A Large p-value Means No Effect
This is the most consequential statistical misconception in research. A non-significant p-value does not mean there is no effect — it means the study failed to produce sufficient evidence to reject the null hypothesis at the chosen alpha level. In an underpowered study, a real effect will frequently produce a non-significant result. The correct interpretation is always: “We did not find sufficient evidence for an effect in this sample, at this sample size, at this alpha level.” Whether the effect exists is a separate question that requires considering power. Type I and Type II error rates are the formal framework for this distinction.
Misconception 2: Statistical Significance Means Practical Significance
With a large enough sample, virtually any difference between two groups will be statistically significant — even trivially small ones. A study comparing two educational apps on 50,000 students might find a statistically significant difference in exam scores of 0.3 points out of 100, with d = 0.03. The difference is real but meaningless. Effect size — especially Cohen’s d — is the metric that answers the “so what” question. Statistical significance only tells you whether the effect is probably not zero. Effect size tells you whether it matters.
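The arithmetic behind this example can be checked directly. The sketch below uses a normal approximation to the two-sample test and assumes the 50,000 students are split evenly into two groups of 25,000 (the split is an assumption for illustration).

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Approximate two-tailed p-value for an observed d with equal group sizes."""
    z = abs(d) * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(z))


# d = 0.03 is trivial in size, yet clearly significant with 25,000 per group.
print(two_sided_p(0.03, 25_000))
# The same d with 100 per group is nowhere near significance.
print(two_sided_p(0.03, 100))
```

The effect size is identical in both calls. Only the sample size changes, which is why the p-value alone cannot answer the “so what” question.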
Misconception 3: Power of 80% Is Always Enough
Cohen’s 0.80 threshold was proposed as a reasonable convention for most behavioral research. It is not a universal standard. For clinical trials where missing a treatment effect could harm patients, 90% or 95% power is expected. For pre-registered studies aiming for credible replication, 90% power is increasingly the norm. For exploratory studies generating preliminary effect estimates, 80% may be acceptable. The appropriate power level depends on the stakes of the research, the cost of false negatives, and the field’s conventions.
Misconception 4: Post Hoc Power Analysis Is Informative After a Null Result
Post hoc power calculated using the observed effect size from a null study is mathematically tautological. For a given test, observed power is a direct function of the p-value: a non-significant result always corresponds to an observed effect that is small relative to the sample’s precision, which always produces low post hoc power (below roughly 50% whenever p > .05). This tells you nothing you did not already know from the p-value. Post hoc power analysis only adds information when the assumed effect size differs from the observed effect size, which is only meaningful if you had a clear a priori expectation. Use sensitivity analysis (asking: what is the smallest effect my study could detect at 80% power?) as a more informative alternative.
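The circularity is easy to demonstrate. Under a normal approximation (an assumption; a t-based version behaves the same way), “observed power” can be computed from the p-value alone, with no other information about the study:

```python
from statistics import NormalDist

_norm = NormalDist()


def observed_power(p_value, alpha=0.05):
    """Post hoc power implied by a two-tailed p-value (normal approximation)."""
    z_obs = _norm.inv_cdf(1 - p_value / 2)   # test statistic implied by p
    z_crit = _norm.inv_cdf(1 - alpha / 2)    # significance threshold
    # Probability of landing beyond either critical value if the true
    # effect equalled the observed one.
    return _norm.cdf(z_obs - z_crit) + _norm.cdf(-z_obs - z_crit)


for p in (0.05, 0.20, 0.50):
    print(p, round(observed_power(p), 2))
```

A result sitting exactly at p = .05 yields observed power of about 0.50, and any larger p-value yields less. The calculation restates the p-value in different units.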
Misconception 5: Cohen’s Benchmarks Are Universal Standards
Cohen was explicit that his 0.2/0.5/0.8 benchmarks were fallback conventions for when no better information was available. In cardiovascular medicine, an intervention producing d = 0.15 in mortality reduction might be enormously valuable. In social psychology, d = 0.8 from a lab study might not generalize to any real-world behavior. Always contextualize Cohen’s d against the specific research literature, the measurement scale used, and the practical significance threshold of the field. Qualitative vs quantitative research frames this context-dependence differently — quantitative effect sizes need qualitative interpretation to be meaningful.
⚠️ The publication bias problem: Many published Cohen’s d values are inflated. When journals only publish significant results, the effect sizes in print are biased upward. Using published effect sizes to plan a replication study without correcting for this bias will result in an underpowered study. Adjust your expected effect size downward by at least 25–50% when basing it on published values from small studies. Use meta-analytic estimates whenever available.
Advanced Methods
Sensitivity Analysis: What Effect Can Your Study Actually Detect?
What Is a Sensitivity Analysis?
A sensitivity analysis in the power context asks: given the sample size I actually have (or plan to have), the alpha I am using, and the power I want, what is the smallest effect size I can reliably detect? This is the most practically useful form of power analysis when sample size is constrained by budget, time, or population availability.
A sensitivity analysis does not tell you whether your effect is real. It tells you the resolution of your study — the minimum signal your statistical microscope can pick up. If your sensitivity analysis shows that your sample can only detect effects of d ≥ 0.60 at 80% power, and the literature suggests the true effect is around d = 0.30, you should acknowledge this limitation explicitly in your methods section. Model selection in statistics faces analogous resolution trade-offs in terms of what complexity of model a dataset can support.
Running a Sensitivity Analysis in G*Power
In G*Power, select “Sensitivity” under Type of Power Analysis. Enter your fixed sample size, the alpha level, and the desired power. G*Power returns the minimum detectable effect size (Cohen’s d). This is the number you report when your sample size is constrained: “Given our sample of n = 50 per group, our study was powered to detect effects of d ≥ 0.57 at 80% power.”
Example sensitivity statement (for dissertation or journal article):
“Due to recruitment constraints, our final sample consisted of 40 participants per group (N = 80). A sensitivity analysis using G*Power 3.1 indicated that this sample provided 80% power to detect effects of d ≥ 0.63 at a two-tailed alpha of .05. We acknowledge that smaller effects, which are plausible given the literature (d = 0.30–0.50), may not have been detectable in this sample.”
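For a quick check without G*Power, the minimum detectable effect can be approximated in a few lines. This sketch uses the normal approximation for a two-group comparison with equal group sizes, so its answers run slightly below the t-based values that G*Power reports; treat it as a sanity check rather than a replacement.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Smallest d detectable at the given power (two-tailed, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n_per_group)


print(round(min_detectable_d(50), 2))  # 0.56
print(round(min_detectable_d(40), 2))  # 0.63
```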
Sensitivity Analysis vs Post Hoc Power: Why Sensitivity Wins
Sensitivity analysis is far more informative than post hoc power because it does not depend on the observed effect size: it takes the sample size, alpha, and target power as fixed and solves for the smallest detectable effect. It answers the question: “What could this study have found, in principle?” rather than the circular question “How powerful was this study, given what it found?” Many methodologists, including those writing for Annual Review of Psychology, now recommend sensitivity analysis as the preferred post hoc characterization of study limitations.
Frequently Asked Questions About Power Analysis and Cohen’s d
What is Cohen’s d in simple terms?
Cohen’s d is a number that tells you how different two groups are from each other, measured in units of standard deviation. If group A has a mean exam score of 78 and group B has a mean of 72, and the typical spread of scores is about 10 points (the standard deviation), then d = (78 − 72) / 10 = 0.6, a medium-to-large effect. A Cohen’s d of 0 means the group means are identical. A d of 1.0 means the groups differ by exactly one standard deviation. The larger the absolute value of d, the larger the standardized difference between groups; whether that difference matters in practice depends on the research context.
What is a good Cohen’s d value?
Jacob Cohen proposed three conventions: d = 0.2 is small, d = 0.5 is medium, and d = 0.8 is large. A “good” Cohen’s d depends entirely on context. In pharmaceutical trials, d = 0.2 might represent a clinically meaningful reduction in side effects. In social psychology lab studies, d = 0.8 is common but may not translate to real-world behavior. The honest answer is that the value of d should always be interpreted relative to what is meaningful in your specific research field — not just against Cohen’s universal benchmarks.
What is statistical power in simple terms?
Statistical power is the probability that your study will detect a real effect when one truly exists. Think of it as the sensitivity of your study. A study with 80% power has an 80% chance of producing a statistically significant result if the effect you are looking for is real. A study with only 40% power has a 40% chance — meaning it will miss the real effect 60% of the time. Low power wastes resources, produces unreliable results, and contributes to the replication crisis in science.
How do I calculate sample size for my study?
To calculate sample size, you need four inputs: the expected effect size (Cohen’s d from prior literature), the alpha level (usually 0.05), the desired power (usually 0.80 or 0.90), and the type of test you are running. With these inputs, use G*Power (free software) or R’s pwr package. For an independent t-test expecting a medium effect (d = 0.5), at alpha = 0.05 and 80% power, G*Power returns 64 participants per group (128 total). For dissertation proposals, include the software used, all input parameters, and the output sample size in your methods section.
What is the difference between Cohen’s d and Hedges’ g?
Both Cohen’s d and Hedges’ g measure the standardized difference between two group means. The difference is that Hedges’ g applies a correction factor for small-sample bias: Cohen’s d slightly overestimates the true population effect size when samples are small (under 20 per group). For large samples, d and g are nearly identical. In meta-analyses, Hedges’ g is almost always preferred because the included studies vary in size, and the correction removes the small-sample bias from each one before the estimates are combined. For dissertation-level comparisons with samples of 30+ per group, the choice between d and g makes little practical difference.
Can I do a power analysis after collecting data?
Yes, but with major caveats. A post hoc power analysis — calculated using the observed effect size from your completed study — is generally not informative and often misleading. If the study found a null result, a post hoc power analysis using the observed d will automatically show low power — but this tells you nothing new. Instead, use a sensitivity analysis: report the smallest effect your sample could have detected at 80% power. This is a legitimate and informative characterization of your study’s limitations without the circularity problem of post hoc power.
How does effect size relate to sample size?
The larger the effect, the smaller the sample needed to reach a given level of power. If the true effect is large (d = 0.8), you need a relatively small sample to detect it with 80% power: about 26 per group. If the effect is small (d = 0.2), you need approximately 394 per group. Required sample size grows with 1/d², which is why studies of subtle effects (common in education, social psychology, and public health) require large samples and why many published studies that used small samples are likely underpowered and unreliable.
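These per-group figures can be approximated without specialist software. The sketch below uses the normal-approximation formula for a two-tailed independent t-test plus a commonly used small-sample correction term (z²/4); both are assumptions standing in for the exact noncentral t calculation that G*Power and pwr perform, so results may differ by a participant or so for other inputs.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-tailed independent t-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) / d) ** 2   # normal-approximation sample size
    n += z_alpha ** 2 / 4                    # small-sample correction term
    return ceil(n)


for d in (0.8, 0.5, 0.2):
    print(d, n_per_group(d))  # 26, 64, 394
```

Halving the effect size roughly quadruples the required sample, which follows from the d² in the denominator.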
What does it mean if Cohen’s d is negative?
A negative Cohen’s d simply means that group 2 scored higher than group 1 — it indicates direction, not quality. If you calculate d = (M₁ − M₂) / SDpooled and group 2 has the higher mean, d will be negative. For most reporting purposes, you report the absolute value of d and describe the direction of the effect in words. Some research fields consistently report directional d values (particularly meta-analyses where direction matters for interpretation), but magnitude comparisons always use the absolute value.
How do I report power analysis in APA format?
In your Methods section, under Participants, report: (1) the software used (e.g., G*Power 3.1); (2) the type of analysis (a priori); (3) the statistical test; (4) the expected effect size and its source (prior literature, meta-analysis, or Cohen’s conventions); (5) the alpha level; (6) the target power; (7) the required sample size per group and total. Example: “An a priori power analysis in G*Power 3.1 indicated that n = 64 per group was required to detect a medium effect (d = 0.50) at α = .05 with 80% power for an independent samples t-test (Faul et al., 2009).”
What is the relationship between alpha, beta, and power?
Alpha (α) is the probability of a Type I error — a false positive. Beta (β) is the probability of a Type II error — a false negative, i.e., missing a real effect. Power equals 1 − β. These three quantities are interconnected: tightening alpha (from 0.05 to 0.01) reduces false positives but also reduces power (increases β) unless you increase sample size. Increasing sample size increases power (reduces β) without changing alpha. This three-way relationship is the mathematical engine of every power analysis.
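The three-way trade-off can be explored numerically. The following sketch approximates power for a two-group comparison with a normal approximation (an assumption; exact t-based power is marginally lower) and shows what happens when alpha is tightened with and without a larger sample. The sample sizes are illustrative.

```python
from math import sqrt
from statistics import NormalDist

_norm = NormalDist()


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed, two-group test (normal approximation)."""
    ncp = abs(d) * sqrt(n_per_group / 2)     # expected size of the test statistic
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    return _norm.cdf(ncp - z_crit) + _norm.cdf(-ncp - z_crit)


print(round(approx_power(0.5, 64, alpha=0.05), 2))  # about 0.81
print(round(approx_power(0.5, 64, alpha=0.01), 2))  # about 0.60: stricter alpha, less power
print(round(approx_power(0.5, 96, alpha=0.01), 2))  # about 0.81: larger n restores power
```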
