How to Calculate Intraclass Correlation Coefficient (ICC)
The Intraclass Correlation Coefficient (ICC) is the gold-standard metric for measuring agreement between raters, measurement occasions, or instruments — yet most students encounter it only as a number to report, not a concept to understand. This guide changes that. Whether you are completing a reliability study for psychology, clinical research, education, or data science, this is where you start.
We cover all six ICC forms defined by Shrout and Fleiss (1979) — the foundational framework every reliability researcher cites — alongside the full ANOVA-based calculation procedure, step-by-step formulas, and the authoritative interpretation benchmarks from Koo and Li (2016). You will understand not just how each ICC type is calculated, but why the choice between them matters enormously for your conclusions.
The guide draws on peer-reviewed sources from Psychological Bulletin, Journal of Chiropractic Medicine, Psychological Methods, and researchers at leading institutions in the United States and UK. Key entities — Shrout, Fleiss, Koo, Li, McGraw, Wong, and the R packages psych and irr — are placed in their proper scientific context so your assignments demonstrate genuine methodological command.
By the end, you will know exactly how to select the right ICC form for your study design, calculate it by hand and in software, construct 95% confidence intervals, and interpret the result in a way that satisfies peer reviewers and dissertation committees.
What Is ICC & Why It Matters
Intraclass Correlation Coefficient (ICC) answers one of the most important questions in empirical research: when two or more raters, instruments, or measurement occasions produce values for the same subject, how much can you trust those measurements? ICC is not a simple correlation. It is a reliability statistic built from variance components — and choosing the wrong form, or calculating it without understanding what it measures, is one of the fastest ways to undermine a study’s credibility. Understanding correlation in statistical relationships gives you the conceptual foundation, but ICC takes that logic considerably further by accounting for whether raters are interchangeable rather than merely co-varying.
The ICC has been the standard reliability metric in clinical, psychological, educational, and social research for decades. Its defining feature is that it partitions total score variance into two components: variance that reflects genuine differences between subjects, and variance that reflects measurement error (disagreement between raters, instruments, or occasions). The proportion of total variance explained by between-subject differences is the ICC. A high ICC means the measurements successfully discriminate between subjects — meaning the instrument is reliable. A low ICC means the error variance dominates — meaning you cannot trust that the measurement reflects anything real about the subject. Understanding variance and data distribution is the technical prerequisite for grasping why ICC is computed this way and what the resulting number means.
- 6: forms of ICC defined by Shrout & Fleiss (1979) in the landmark Psychological Bulletin paper
- 0.90: Koo & Li (2016) threshold above which ICC indicates excellent reliability for clinical measurement
- 10: total ICC forms identified by McGraw & Wong (1996) in Psychological Methods
What Exactly Is the Intraclass Correlation Coefficient?
The intraclass correlation coefficient is a reliability index that quantifies the degree of agreement or consistency among multiple measurements of the same subjects, where those measurements come from raters of the same “class” (meaning there is no logical way to distinguish them, unlike distinguishing a test from a retest, or variable X from variable Y). The word “intraclass” distinguishes it from an interclass correlation such as the Pearson product-moment correlation, which measures linear association between two logically distinct variables rather than agreement. Hypothesis testing frameworks underpin ICC significance tests: the null hypothesis is that ICC equals zero (no reliability), tested via an F-statistic derived from the ANOVA table.
ICC operates by modeling total score variance using ANOVA. Scores vary because subjects genuinely differ from each other (this is the signal we want), and because raters or measurement occasions produce inconsistent readings (this is the error we want to minimize). The ICC is, in its most basic form, the ratio of between-subject variance to total variance. This simple idea has remarkable consequences: it means an ICC computed in a very homogeneous subject sample will be artificially low not because the instrument is unreliable, but because there is little true between-subject variance to partition. This is a critical, frequently misunderstood property that affects how reliability studies should be designed and interpreted. Regression model assumptions and ANOVA assumptions both apply to ICC calculation — normality of errors, independence of observations, and homogeneity of variance each affect the validity of the resulting ICC estimate.
Why ICC Replaced Pearson Correlation for Reliability
Before ICC became standard, researchers frequently misused Pearson’s r to assess inter-rater reliability. The problem is that Pearson r is insensitive to systematic rater differences. If one radiologist consistently rates tumor size 20% higher than a colleague, their readings will have a Pearson r of 1.0 — perfect linear association — yet they fundamentally disagree. Agreement-form ICC (which we cover in detail below) correctly penalizes this systematic offset. The correlation-causation distinction is related conceptually: Pearson r measures the direction and strength of linear association, not interchangeability or agreement. ICC measures agreement, making it far more appropriate whenever you need to establish that different raters or instruments can be used interchangeably in practice.
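A small numerical sketch can make the contrast concrete. The readings below are hypothetical (invented for illustration), with the second rater always 20% higher than the first. The mean squares follow the standard two-way ANOVA decomposition, and the agreement expression is the ICC(2,1) absolute-agreement formula given in the formulas section of this guide.

```r
# Hypothetical tumor-size readings (mm); rater_b always reads 20% higher than rater_a
rater_a <- c(10, 20, 30, 40, 50)
rater_b <- 1.2 * rater_a
ratings <- cbind(rater_a, rater_b)

n <- nrow(ratings)            # subjects
k <- ncol(ratings)            # raters
grand_mean <- mean(ratings)

# Two-way ANOVA sums of squares computed from row and column means
ss_subjects <- k * sum((rowMeans(ratings) - grand_mean)^2)
ss_raters   <- n * sum((colMeans(ratings) - grand_mean)^2)
ss_total    <- sum((ratings - grand_mean)^2)
ss_error    <- ss_total - ss_subjects - ss_raters

msb <- ss_subjects / (n - 1)
msr <- ss_raters / (k - 1)
mse <- ss_error / ((n - 1) * (k - 1))

pearson_r     <- cor(rater_a, rater_b)
icc_agreement <- (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)

print(round(c(pearson_r = pearson_r, icc_agreement = icc_agreement), 3))
```

Under these assumed numbers Pearson's r is 1 while the agreement ICC is about 0.93, because the between-rater mean square enters the denominator of the agreement formula.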
The landmark paper establishing the ICC framework was Shrout and Fleiss (1979), published in Psychological Bulletin, which defined the six canonical forms of ICC and provided formulas based on one-way and two-way ANOVA. This framework has been extended but not superseded. The most widely cited interpretation guide is Koo and Li (2016) in the Journal of Chiropractic Medicine, which provides the benchmarks (poor, moderate, good, excellent) now used across much of the reliability literature. These two papers are the primary citations for any ICC-related assignment or research report, and assignments on reliability analysis frequently require citing them correctly and applying their frameworks precisely.
The core conceptual point about ICC: The ICC measures how much of the total variance in scores is attributable to true differences between subjects rather than to measurement error. A high ICC (close to 1.0) means your instrument reliably distinguishes subjects from each other. A low ICC (close to 0) means measurement error dominates — the scores are mostly noise. ICC is not about whether scores are high or low in absolute terms; it is about whether scores consistently rank subjects in the same order (consistency ICC) or produce the same absolute values (agreement ICC).
The Six Forms of ICC
The Six ICC Forms: Shrout and Fleiss (1979) Explained
The most consequential decision in any ICC analysis is choosing the correct form. Shrout and Fleiss (1979) defined six forms of the intraclass correlation coefficient organized by two dimensions: the statistical model (one-way random, two-way random, or two-way mixed) and the unit of measurement (single rater vs. mean of k raters). Choosing incorrectly — say, using a two-way formula when a one-way design was used — produces a biased and uninterpretable result. Understanding your measurement design is the first step, because the ICC form follows logically from the study design, not from the data.
An additional dimension added by McGraw and Wong (1996) in Psychological Methods, who extended the taxonomy to ten forms, is the choice between agreement and consistency within each two-way model. This is functionally the most important distinction for researchers: agreement ICC asks “do the raters produce the same absolute values?”, while consistency ICC asks “do the raters rank subjects the same way?” This matters enormously for clinical interpretation. Confidence intervals for ICC estimates are just as important as the point estimate, and choosing the wrong ICC form produces a CI that is wrong not only in its center but also in its width.
Model 1: One-Way Random Effects (ICC(1,1) and ICC(1,k))
The one-way random effects model — producing ICC(1,1) for individual rater measurements and ICC(1,k) for the average across k raters — applies when each subject is rated by a different set of raters randomly drawn from a larger population, and when those rater identities are not tracked. There is no way to account for rater-specific effects because different subjects were rated by different people. The only source of systematic variance in the model is between-subject variance; everything else collapses into within-subject error. This is the most conservative ICC form — it produces the lowest ICC estimates — because it cannot separate rater-to-rater variability from other sources of within-subject error. Sampling distribution theory underlies why ICC(1,1) is appropriate when raters are randomly selected from a population rather than specifically designated.
ICC(1,1) is rarely appropriate in practice. Most reliability studies use the same raters for all subjects. The one-way model is used when raters are genuinely interchangeable and their identity is not meaningful — for example, when random community members each evaluated a subset of product samples without overlap. The practical implication: if you ran ICC(1,1) when you should have run ICC(2,1) or ICC(3,1), your estimate is more pessimistic than the true reliability — you are penalizing the statistic for rater variability that your design does not actually contain.
Model 2: Two-Way Random Effects (ICC(2,1) and ICC(2,k))
The two-way random effects model produces ICC(2,1) (individual rater) and ICC(2,k) (mean of k raters). It applies when the same set of raters rates all subjects AND both subjects and raters are considered random samples from larger populations. This is the most commonly appropriate model for inter-rater reliability studies in psychology, education, and clinical research, because typically the same raters (e.g., two trained clinicians) rate all participants, and the intent is to generalize the conclusions to other raters from the same population. ICC(2,1) can be calculated in both agreement and consistency forms. The agreement form penalizes systematic mean differences between raters; the consistency form does not. The underlying two-way ANOVA partitions variance into several sources simultaneously: between subjects, between raters, and the residual (rater-by-subject interaction plus error).
When to use agreement vs. consistency in ICC(2): use agreement (ICC(2,1) absolute agreement) when the absolute values produced by each rater matter — for example, when different clinicians will use the instrument interchangeably and their absolute scores will drive clinical decisions. Use consistency (ICC(2,1) consistency) when only the rank ordering matters — for example, when a research team uses ratings for relative comparisons and the same raters will always be used, so systematic offsets are not a practical concern. In clinical measurement, agreement ICC is almost always the appropriate choice.
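The practical difference between the two definitions can be sketched numerically. The block below reuses the 6 x 3 ratings from the software section of this guide and adds a constant 2-point offset to one rater. The helper `icc_two_way` is a hypothetical name introduced only for this illustration, and the offset size is an arbitrary assumption.

```r
# Ratings reused from the software section of this guide (6 subjects x 3 raters)
ratings <- cbind(
  Rater1 = c(6, 7, 4, 8, 5, 3),
  Rater2 = c(6, 8, 4, 7, 5, 4),
  Rater3 = c(7, 7, 5, 8, 6, 3)
)

# Hypothetical helper: single-rater two-way ICCs from a complete ratings matrix
icc_two_way <- function(x) {
  n <- nrow(x)
  k <- ncol(x)
  gm <- mean(x)
  msb <- k * sum((rowMeans(x) - gm)^2) / (n - 1)
  msr <- n * sum((colMeans(x) - gm)^2) / (k - 1)
  sse <- sum((x - gm)^2) - (n - 1) * msb - (k - 1) * msr
  mse <- sse / ((n - 1) * (k - 1))
  c(agreement   = (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n),
    consistency = (msb - mse) / (msb + (k - 1) * mse))
}

# Assumed systematic bias: Rater3 now scores every subject 2 points higher
shifted <- ratings
shifted[, "Rater3"] <- shifted[, "Rater3"] + 2

original_icc <- icc_two_way(ratings)
shifted_icc  <- icc_two_way(shifted)
print(round(rbind(original = original_icc, shifted = shifted_icc), 3))
```

The constant offset leaves the consistency value unchanged and lowers the agreement value, which is the behavior the selection rule above relies on.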
Model 3: Two-Way Mixed Effects (ICC(3,1) and ICC(3,k))
The two-way mixed effects model produces ICC(3,1) and ICC(3,k). Like Model 2, it requires the same raters to rate all subjects. But unlike Model 2, Model 3 treats those specific raters as fixed effects — the only raters of interest — rather than as a random sample from a larger population of raters. The key distinction: you are not trying to generalize beyond these specific raters. This model is appropriate when the same two or three specific instruments (not randomly selected raters, but specific measurement devices) are used throughout a study and you do not intend to generalize to other instruments. Residual analysis in statistical modeling helps diagnose whether the assumption of no rater-by-subject interaction — which is required for ICC(3) to be well-defined — holds in practice.
ICC(3,1) gives the same consistency estimate as ICC(2,1) consistency — this is not a coincidence but a mathematical consequence of the model structure when the rater-by-subject interaction term is the error term. The difference only emerges in the agreement form: ICC(3,1) agreement is not typically reported because fixed raters are already accounted for, and the question of agreement with other (non-observed) raters is not meaningful in this design. In practice, many researchers use ICC(3,1) when they should use ICC(2,1), leading to inflated estimates. The correct rule: if you plan to generalize your reliability conclusions to other raters, use ICC(2). If your reliability conclusion is limited to these specific raters or instruments, use ICC(3).
| ICC Form | Model | Rater Design | Agreement vs. Consistency | Typical Use Case |
|---|---|---|---|---|
| ICC(1,1) | One-way random | Different raters per subject (random) | Absolute agreement only | Community judges, each rating a different subset |
| ICC(1,k) | One-way random | Different raters per subject (random) | Absolute agreement only | Same as above but reporting mean of k raters |
| ICC(2,1) | Two-way random | Same raters for all subjects (random) | Agreement or Consistency | Inter-rater reliability; generalization to new raters intended |
| ICC(2,k) | Two-way random | Same raters for all subjects (random) | Agreement or Consistency | Same; reporting mean across k raters |
| ICC(3,1) | Two-way mixed | Same specific raters for all subjects (fixed) | Consistency primarily | Intrarater reliability; no generalization beyond these raters |
| ICC(3,k) | Two-way mixed | Same specific raters for all subjects (fixed) | Consistency primarily | Same; reporting mean across k fixed raters |
Formulas & Step-by-Step Calculation
ICC Formulas: ANOVA-Based Calculation from First Principles
Calculating the intraclass correlation coefficient by hand requires running a one-way or two-way ANOVA and extracting the Mean Square components. This is not just a mechanical exercise — understanding why the formula uses these particular components reveals what the ICC actually measures and why it behaves the way it does in small samples. Expected values and variance concepts are the mathematical foundation: each Mean Square in the ANOVA table estimates a specific combination of variance components, and the ICC formulas are derived by solving those equations algebraically. Standard deviation calculation by hand uses the same underlying variance logic — ICC simply extends that to a two-level partition.
Step 1: Organize Your Data
Your data matrix should have n rows (subjects or targets) and k columns (raters or measurement occasions). Every cell contains the rating assigned to subject i by rater j. For ICC to be meaningful, the same construct must be measured by all raters using the same scale. Missing data requires careful handling — most ICC functions require complete cases, though some (like the psych package in R) handle missing data by pairwise deletion. Missing data imputation techniques are relevant when data completeness is a concern — imputing before computing ICC can either help or hurt reliability estimates depending on the imputation method and the pattern of missingness.
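If ratings arrive in long format (one row per subject-rater pair), they need reshaping into the n x k layout first. The following is a minimal sketch in base R, using made-up scores and hypothetical column names (`subject`, `rater`, `score`), with listwise deletion of incomplete subjects as one simple way to handle missing cells.

```r
# Hypothetical long-format data: 4 subjects x 3 raters, one missing rating
long <- data.frame(
  subject = rep(1:4, times = 3),
  rater   = rep(c("A", "B", "C"), each = 4),
  score   = c(6, 7, 4, 8,   6, 8, NA, 7,   7, 7, 5, 8)
)

# Reshape to one row per subject and one column per rater
wide <- reshape(long, idvar = "subject", timevar = "rater", direction = "wide")
ratings <- as.matrix(wide[, -1])   # drop the subject id column

# Keep only subjects rated by every rater (listwise deletion)
complete <- ratings[complete.cases(ratings), , drop = FALSE]
print(dim(complete))
```

Listwise deletion is only one option; whether it is acceptable depends on how much data is lost and why it is missing.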
Step 2: Run the ANOVA and Extract Mean Squares
For a one-way ANOVA (used for ICC(1) forms): partition total sum of squares into SSB (between subjects) and SSW (within subjects). For a two-way ANOVA (used for ICC(2) and ICC(3) forms): partition total SS into SSB (between subjects), SSR (between raters), and SSE (error/interaction). The Mean Squares are:
MSB = SSB / (n − 1)
MSW = SSW / (n(k − 1))
MSR = SSR / (k − 1)
MSE = SSE / ((n − 1)(k − 1))
Where n = number of subjects, k = number of raters, SSB = sum of squares between subjects, SSW = sum of squares within subjects, SSR = sum of squares between raters, SSE = sum of squares error (interaction).
These Mean Squares estimate specific combinations of variance components. In the two-way model, MSB estimates σ²ₑ + kσ²ₛ, where σ²ₛ is the between-subject variance and σ²ₑ is the error variance. MSE estimates σ²ₑ. MSR estimates σ²ₑ + nσ²ᵣ, where σ²ᵣ is the between-rater variance. The ICC formulas are derived by solving these equations for the variance components and then forming a ratio: σ²ₛ / (σ²ₛ + σ²ₑ) for consistency, or σ²ₛ / (σ²ₛ + σ²ᵣ + σ²ₑ) for absolute agreement, where rater variance also counts as error. Simple linear regression uses the same mean square decomposition logic: the F-test in regression is also a ratio of mean squares, just partitioned differently. This conceptual link helps students who already understand regression grasp ANOVA-based ICC derivation more quickly.
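The decomposition in Step 2 can be sketched directly in base R. The block below computes the four Mean Squares for the 6 x 3 ratings used in the software section of this guide, then solves the expected-mean-square equations for the variance components. Variable names are illustrative, and in small samples the solved rater component can come out negative.

```r
# Ratings reused from the software section of this guide (6 subjects x 3 raters)
ratings <- cbind(
  Rater1 = c(6, 7, 4, 8, 5, 3),
  Rater2 = c(6, 8, 4, 7, 5, 4),
  Rater3 = c(7, 7, 5, 8, 6, 3)
)
n <- nrow(ratings)
k <- ncol(ratings)
grand_mean <- mean(ratings)

ssb <- k * sum((rowMeans(ratings) - grand_mean)^2)  # between subjects
ssr <- n * sum((colMeans(ratings) - grand_mean)^2)  # between raters
sst <- sum((ratings - grand_mean)^2)                # total
ssw <- sst - ssb                                    # within subjects (one-way)
sse <- sst - ssb - ssr                              # residual (two-way)

ms <- c(MSB = ssb / (n - 1),
        MSW = ssw / (n * (k - 1)),
        MSR = ssr / (k - 1),
        MSE = sse / ((n - 1) * (k - 1)))

# Solve MSB = var_e + k * var_s and MSR = var_e + n * var_r for the components
var_components <- c(subjects = (ms[["MSB"]] - ms[["MSE"]]) / k,
                    raters   = (ms[["MSR"]] - ms[["MSE"]]) / n,
                    error    = ms[["MSE"]])

print(round(ms, 3))
print(round(var_components, 3))
```

The ratio of the subject component to the sum of all three components reproduces the ICC(2,1) absolute-agreement value for the same data.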
Step 3: Apply the ICC Formula for Your Model
Here are the four most commonly used ICC formulas, expressed in terms of ANOVA Mean Squares:
ICC(1,1) = (MSB − MSW) / (MSB + (k−1)·MSW)
ICC(1,1): One-way random, single rater, absolute agreement. MSB and MSW from one-way ANOVA.
ICC(2,1) Agreement = (MSB − MSE) / (MSB + (k−1)·MSE + k·(MSR − MSE)/n)
ICC(2,1) Absolute Agreement: Two-way random, single rater. Penalizes systematic rater differences via MSR. Most stringent and most commonly appropriate for clinical research.
ICC(2,1) Consistency = (MSB − MSE) / (MSB + (k−1)·MSE)
ICC(2,1) Consistency: Two-way random, single rater. Does not penalize systematic rater mean differences. Numerically equals ICC(3,1) consistency.
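ICC(average of k) = k·ICC(single) / (1 + (k−1)·ICC(single))
Spearman-Brown formula for the average-measures forms such as ICC(1,k) and ICC(2,k): here ICC(single) is the single-measure ICC from the same model (for example ICC(1,1) or ICC(2,1)), not specifically the one-way form. The average-measures ICC can be derived from it using this formula, analogous to the Spearman-Brown prophecy formula in classical test theory.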
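The formulas above translate almost line for line into R functions. The sketch below uses hypothetical mean squares (n = 10, k = 3) and illustrative function names. It also checks the Spearman-Brown step-up against the direct average-measures agreement expression (MSB − MSE) / (MSB + (MSR − MSE)/n), which is the Shrout and Fleiss average-measures form and is not stated elsewhere in this guide.

```r
# Single-measure ICCs written directly from the mean-square formulas
icc_1_1 <- function(msb, msw, k) (msb - msw) / (msb + (k - 1) * msw)
icc_2_1_agreement <- function(msb, msr, mse, n, k) {
  (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)
}
icc_2_1_consistency <- function(msb, mse, k) (msb - mse) / (msb + (k - 1) * mse)

# Spearman-Brown step-up from a single-measure ICC to the mean of k raters
spearman_brown <- function(icc_single, k) {
  k * icc_single / (1 + (k - 1) * icc_single)
}

# Hypothetical mean squares for n = 10 subjects and k = 3 raters
n <- 10
k <- 3
msb <- 10
msr <- 2
mse <- 1
msw <- 1.1   # one-way within-subject mean square implied by the same sums of squares

one_way     <- icc_1_1(msb, msw, k)
single      <- icc_2_1_agreement(msb, msr, mse, n, k)
consistency <- icc_2_1_consistency(msb, mse, k)
average     <- spearman_brown(single, k)

# Direct average-measures agreement formula, for comparison with the step-up
average_direct <- (msb - mse) / (msb + (msr - mse) / n)

print(round(c(one_way = one_way, single = single, consistency = consistency,
              average = average, average_direct = average_direct), 3))
```

The last two values coincide, which shows that the step-up and the direct formula are the same quantity written two ways.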
Step 4: Construct 95% Confidence Intervals
A point estimate ICC without a confidence interval is incomplete and increasingly unacceptable in peer-reviewed journals. A 2025 arXiv review of ICC reporting in neuroimaging journals found that exclusive reliance on point estimates produces unreliable and sometimes misleading conclusions. Confidence intervals for ICC are derived from the F-distribution. For ICC(1,1), the 95% CI is:
F₀ = MSB / MSW
ICC_lower = (F₀/F_upper − 1) / (F₀/F_upper + k − 1)
ICC_upper = (F₀/F_lower − 1) / (F₀/F_lower + k − 1)
Where F_upper is the critical value of the F-distribution with upper-tail area α/2 (the 0.975 quantile for a 95% CI) and F_lower is the critical value with upper-tail area 1−α/2 (the 0.025 quantile), both with degrees of freedom df1 = n−1 and df2 = n(k−1). For two-way ICC forms, the CI formula is more complex and best computed using software (R, SPSS, or MATLAB). Always report the 95% CI alongside the ICC point estimate.
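The interval can be sketched in base R with `qf()`, which returns quantiles of the F-distribution. The helper name `icc_1_1_ci` is hypothetical. The mean squares passed in (MSB = 157/18, MSW = 1/3) are the one-way values for the 6 x 3 ratings in the software section of this guide, worked out separately rather than taken from the text.

```r
# Hypothetical helper: ICC(1,1) with an F-based confidence interval
icc_1_1_ci <- function(msb, msw, n, k, conf_level = 0.95) {
  alpha <- 1 - conf_level
  f0  <- msb / msw
  df1 <- n - 1
  df2 <- n * (k - 1)
  f_upper <- qf(1 - alpha / 2, df1, df2)  # critical value with upper-tail area alpha/2
  f_lower <- qf(alpha / 2, df1, df2)      # critical value with upper-tail area 1 - alpha/2
  c(icc   = (f0 - 1) / (f0 + k - 1),
    lower = (f0 / f_upper - 1) / (f0 / f_upper + k - 1),
    upper = (f0 / f_lower - 1) / (f0 / f_lower + k - 1))
}

# One-way mean squares for 6 subjects and 3 raters
result <- icc_1_1_ci(msb = 157 / 18, msw = 1 / 3, n = 6, k = 3)
print(round(result, 3))
```

Note that `(f0 - 1) / (f0 + k - 1)` is the ICC(1,1) formula rewritten in terms of the F ratio, so the point estimate and both bounds share the same functional form.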
The width of the CI reveals how much information your sample provides. A narrow CI (e.g., 0.78–0.86) gives confident conclusions. A wide CI (e.g., 0.40–0.91) should prompt caution — the true ICC could be anywhere from moderate to excellent, and the sample was too small to distinguish. Confidence intervals as a foundation for decision-making applies directly here: clinical reliability decisions should be based on the lower bound of the CI, not the point estimate alone.
Step 5: A Worked Numerical Example
Consider a study where n = 6 subjects are each rated by k = 3 raters on a pain scale (0–10). The two-way ANOVA yields the following Mean Squares: MSB = 14.28, MSR = 0.48, MSE = 0.72. We want ICC(2,1) absolute agreement.
ICC(2,1) = (MSB − MSE) / (MSB + (k−1)·MSE + k·(MSR − MSE)/n)
= (14.28 − 0.72) / (14.28 + (3−1)·0.72 + 3·(0.48 − 0.72)/6)
= 13.56 / (14.28 + 1.44 + (−0.12))
= 13.56 / 15.60
= 0.869
ICC(2,1) = 0.869, indicating good reliability per Koo & Li (2016). The 95% CI should be computed using software (e.g., the irr package in R), which would typically yield approximately [0.63, 0.96] for this small sample — illustrating why the lower bound must be reported.
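The arithmetic above can be checked in a few lines of R. This sketch plugs in the Mean Squares quoted in the worked example and, as an extension not computed in the text, also evaluates the consistency form and the Spearman-Brown average-measures value for the same numbers.

```r
# Mean squares and design from the worked example
n <- 6
k <- 3
msb <- 14.28
msr <- 0.48
mse <- 0.72

agreement_single   <- (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)
consistency_single <- (msb - mse) / (msb + (k - 1) * mse)
agreement_average  <- k * agreement_single / (1 + (k - 1) * agreement_single)

print(round(c(agreement_single   = agreement_single,
              consistency_single = consistency_single,
              agreement_average  = agreement_average), 3))
```

Because MSR is smaller than MSE in these particular numbers, the rater term in the denominator is slightly negative, so the agreement value (0.869) comes out marginally above the consistency value (about 0.863). With a genuine rater bias the ordering would reverse.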
Why Sample Size Critically Affects ICC Confidence Intervals
With only 6 subjects (as in the example above), even an ICC point estimate of 0.87 comes with an extremely wide confidence interval — potentially from poor to excellent reliability. This is not an artifact of bad data; it is the natural statistical uncertainty from a small sample. Reliability studies need substantially larger samples than many researchers realize. Walter et al. (1998) showed that achieving a 95% CI width of ±0.10 around an expected ICC of 0.70 with two raters requires approximately 30–35 subjects. For three raters, fewer subjects are needed (approximately 20–25). These figures should guide your study design from the outset — not be discovered after data collection. Power analysis principles apply analogously: underpowered reliability studies are as problematic as underpowered hypothesis tests, producing estimates so uncertain they cannot support clinical or practical decisions.
Interpreting ICC Values
Interpreting the ICC: Koo and Li (2016) Benchmarks and What They Actually Mean
Once you have calculated the intraclass correlation coefficient, interpretation requires more than looking up which benchmark bin it falls into. The numerical value means different things depending on the ICC form you used, the homogeneity of your subject sample, and whether you are reporting a point estimate or a confidence interval. This section gives you the full interpretive framework. Type I and Type II error concepts from hypothesis testing apply here too: interpreting an ICC point estimate without its confidence interval risks a false-positive conclusion, analogous to a Type I error. You might conclude an instrument is reliable when the CI extends well below the acceptable threshold.
The Koo and Li (2016) Benchmarks
Terry K. Koo and Mae Y. Li (2016), writing in the Journal of Chiropractic Medicine, provided the most widely cited interpretation framework for ICC values. Their paper was a direct response to inconsistency in the literature — different studies applied different thresholds, making cross-study comparison difficult. The Koo and Li guidelines apply to both the ICC point estimate and, critically, the lower bound of the 95% confidence interval:
Koo & Li (2016) ICC Interpretation Benchmarks (lower bound of 95% CI):
- ICC < 0.50: Poor reliability
- ICC 0.50–0.75: Moderate reliability
- ICC 0.75–0.90: Good reliability
- ICC > 0.90: Excellent reliability
These thresholds apply specifically to the lower bound of the 95% CI, not just the point estimate. A point estimate of 0.85 (good reliability) with a lower CI bound of 0.48 (poor reliability) should be classified as poor-to-good — the data are insufficient to conclude good reliability.
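The benchmark lookup is simple enough to sketch as a function. The name `classify_icc` is hypothetical, and the handling of values that fall exactly on a boundary (0.50 and 0.75 go to the higher category, 0.90 stays in "good") is an assumption, since the published bins share their endpoints.

```r
# Hypothetical helper: map an ICC value to a Koo & Li (2016) category
classify_icc <- function(x) {
  if (x < 0.50) {
    "poor"
  } else if (x < 0.75) {
    "moderate"
  } else if (x <= 0.90) {
    "good"
  } else {
    "excellent"
  }
}

# Example from the text: point estimate 0.85 with a lower 95% CI bound of 0.48
point_category <- classify_icc(0.85)
lower_category <- classify_icc(0.48)
print(c(point_estimate = point_category, lower_bound = lower_category))
```

Classifying both the point estimate and the lower bound makes the gap between them visible, which is the reason the text describes this example as poor-to-good.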
Two important caveats apply to these benchmarks. First, they were developed in the context of clinical measurement and physical rehabilitation. Different fields apply different standards — in some psychometric research, ICC values above 0.70 are acceptable, while in precision clinical measurement, 0.90 is a minimum. Always contextualize your ICC within the norms of your specific field. Second, these benchmarks were designed for ICC(2,1) or ICC(3,1) with single measurements. If you are reporting the average-measures ICC (ICC(k) forms), the values will be higher, and the benchmarks apply differently. Be explicit about which form you are reporting and why. Factor analysis in psychometric research often uses different reliability standards (Cronbach’s alpha thresholds) — understanding how ICC benchmarks relate to and differ from those standards matters for interdisciplinary work.
The Subject Homogeneity Problem
One of the most underappreciated properties of ICC is its sensitivity to the variance of the subject sample. ICC = between-subject variance / (between-subject variance + error variance). If your subjects are very similar to each other (a homogeneous sample), between-subject variance is small, and ICC will be low even if the raters' scores differ from each other by only small amounts. Conversely, a highly heterogeneous sample inflates ICC, producing falsely optimistic reliability estimates when the instrument is used in a more typical, less extreme population. This means ICC values are not generalizable across samples with different degrees of subject heterogeneity. A pain scale that shows excellent reliability (ICC = 0.92) in a sample of patients ranging from no pain to severe chronic pain may show only moderate reliability (ICC = 0.65) in a sample of mild-pain patients where the true between-subject variance is much smaller. Sampling distributions are the theoretical context for this phenomenon: the sample ICC is a random variable whose distribution depends on both the true ICC and the ratio of between-subject to within-subject variance in the population sampled.
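The effect follows directly from the variance ratio. The sketch below holds the error variance fixed and changes only the between-subject variance. All variance values are hypothetical and chosen purely to illustrate the direction of the effect.

```r
# ICC as a ratio of variance components (population form)
icc_from_variances <- function(var_subjects, var_error) {
  var_subjects / (var_subjects + var_error)
}

var_error <- 1  # assumed identical measurement error in both samples

icc_heterogeneous <- icc_from_variances(var_subjects = 9.0, var_error = var_error)
icc_homogeneous   <- icc_from_variances(var_subjects = 1.5, var_error = var_error)

print(c(heterogeneous = icc_heterogeneous, homogeneous = icc_homogeneous))
```

The instrument's error is the same in both cases; only the spread of the subjects changes, yet the ICC moves from 0.9 to 0.6.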
Agreement vs. Consistency: Which ICC to Report?
When using two-way ICC models, you must choose between agreement and consistency, and this choice should be made before looking at the data — it is a design decision, not a fishing expedition for the higher value. The guideline is practical and straightforward:
Report Agreement ICC When:
- Different raters will be used interchangeably in practice
- Absolute score values matter for clinical or practical decisions
- You are establishing that a measurement tool can replace another without systematic adjustment
- Any systematic rater bias would affect interpretation or action
Report Consistency ICC When:
- The same fixed rater(s) will always be used in practice
- Only the relative ordering of subjects matters, not absolute scores
- A systematic offset between raters can be calibrated out
- You are computing intra-rater reliability for the same person across time
In clinical and health research, agreement ICC is almost always the appropriate choice because instruments are typically used by different clinicians in practice, and those clinicians need to produce the same absolute values for clinical decisions to be consistent. Reporting consistency ICC in a clinical reliability study and failing to disclose that choice is a significant methodological error — one that peer reviewers in clinical journals routinely catch. Causal inference frameworks in clinical research require measurement reliability to be established under conditions that mirror actual clinical use — including the potential for rater differences — which is precisely what agreement ICC assesses.
When Is ICC “Good Enough”?
The practical threshold for acceptable reliability depends on the consequences of measurement error in your specific application. In screening applications, where individuals are classified as at-risk or not-at-risk, an ICC of 0.75 (good reliability) may be sufficient because screening decisions are preliminary and followed by more definitive assessment. In precision clinical measurement, where a score directly drives treatment dosage, medication titration, or surgical decisions, nothing below 0.90 (excellent reliability) is acceptable. In research applications, where individual scores contribute to group averages, moderate reliability (0.50–0.75) may be tolerable if it is acknowledged as a limitation. Statistical power is linked to reliability: low ICC in measurements used as predictors in regression reduces statistical power, because measurement error attenuates the observed relationship between the predictor and outcome. This is the attenuation bias problem — a reason to invest in reliable measurement even when group-level analyses are the goal.
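The attenuation effect can be illustrated with the classical correction-for-attenuation relationship, in which the observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The sketch below treats the ICC as the predictor's reliability and assumes a perfectly reliable outcome. The true correlation of 0.50 is a hypothetical value.

```r
# Classical attenuation: observed r = true r * sqrt(reliability_x * reliability_y)
attenuated_r <- function(r_true, rel_x, rel_y = 1) {
  r_true * sqrt(rel_x * rel_y)
}

icc_values <- c(0.50, 0.75, 0.90)  # predictor reliability at three benchmark levels
observed   <- attenuated_r(r_true = 0.50, rel_x = icc_values)

print(round(setNames(observed, paste0("ICC_", icc_values)), 3))
```

Lower predictor reliability shrinks the correlation that can be observed, so a larger sample is needed to detect the same underlying relationship.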
⚠️ Reporting ICC Point Estimates Without Confidence Intervals Is No Longer Acceptable: Major clinical journals including Physical Therapy, Journal of Orthopaedic & Sports Physical Therapy, and BMC Medical Research Methodology now require 95% CIs for all ICC reports. A recent systematic review of neuroimaging studies found widespread failure to report CIs alongside ICC, leading to overconfident reliability claims. Koo and Li (2016) explicitly state that reliability categorization should be based on the lower bound of the 95% CI, not the point estimate. If your sample is small (n < 30), your CI will be wide — report it honestly and discuss its implications for how reliability conclusions should be interpreted.
Decision Framework
How to Select the Right ICC Form: A Decision Framework
Choosing between ICC forms is one of the most frequent sources of error in reliability research. Students and even published researchers sometimes choose an ICC form because it produces a higher value, rather than because it matches their study design. That is a validity problem, not just a technical one — it means the ICC does not measure what it claims to measure. The scientific method demands that analysis choices be determined by research design, not by desired results. The following decision tree walks you through the correct selection process, following Koo and Li (2016).
Decision Question 1: Are the Same Raters Used for All Subjects?
- No (different raters per subject): Use ICC(1). Only the one-way random effects model is appropriate when subjects are rated by different raters and those rater identities are not tracked.
- Yes (same raters for all subjects): Proceed to Question 2.
Decision Question 2: Are the Raters of Interest Only These Specific Raters, or Do You Want to Generalize to Other Raters?
- These specific raters only (fixed effects): Use ICC(3). The two-way mixed model applies. Appropriate for intrarater reliability studies or when the specific instruments are fixed by the study design and no generalization is intended.
- Generalize to other raters (random effects): Use ICC(2). The two-way random model applies. Most inter-rater reliability studies should use ICC(2).
Then proceed to Question 3.
Decision Question 3: Does Systematic Rater Bias Matter in Practice?
- Yes (absolute values matter, different raters must produce same scores): Use agreement ICC, that is, ICC(2,1) absolute agreement or ICC(3,1) absolute agreement.
- No (only rank order matters, systematic offset can be calibrated out): Use consistency ICC, that is, ICC(2,1) consistency or ICC(3,1) consistency.
Decision Question 4: Are You Evaluating Individual Rater Measurements or the Mean of k Raters?
- Individual rater: Use the (1) form, e.g., ICC(2,1).
- Mean of k raters: Use the (k) form, e.g., ICC(2,k). The (k) form is appropriate when the final measurement used in practice will always be the average of k raters.
The ICC will be higher for (k) forms (predicted by the Spearman-Brown formula), reflecting that averaging reduces random error.
For the vast majority of inter-rater reliability studies in clinical and psychological research, ICC(2,1) absolute agreement is the recommended default. Among the two-way forms it is the most demanding and the most widely generalizable choice, and it reflects the actual clinical scenario where different raters will produce measurements that drive decisions independently. Logistic regression in clinical prediction modeling frequently uses ICC-validated predictors; the reliability of the predictor measurement directly affects model performance and generalizability.
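The four questions can be condensed into a small lookup function. This is only a sketch of the decision logic described above; the function name `select_icc_form` and its argument names are hypothetical, and the returned labels are plain descriptions rather than arguments for any particular package.

```r
# Hypothetical helper encoding the four decision questions
select_icc_form <- function(same_raters, generalize, absolute_values, average_of_k) {
  unit <- if (average_of_k) "k" else "1"
  if (!same_raters) {
    # Question 1: different raters per subject, so only the one-way model applies
    return(sprintf("ICC(1,%s), one-way random effects", unit))
  }
  model      <- if (generalize) "2" else "3"                                   # Question 2
  effects    <- if (generalize) "two-way random effects" else "two-way mixed effects"
  definition <- if (absolute_values) "absolute agreement" else "consistency"   # Question 3
  sprintf("ICC(%s,%s), %s, %s", model, unit, effects, definition)              # Question 4 sets unit
}

default_choice <- select_icc_form(same_raters = TRUE, generalize = TRUE,
                                  absolute_values = TRUE, average_of_k = FALSE)
print(default_choice)
```

Writing the answers down as explicit arguments before running any analysis is one way to keep the choice tied to the design rather than to the resulting value.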
Software Implementation
Calculating ICC in R, SPSS, and Excel: Complete Code Walkthroughs
Understanding the ICC formula is essential. But in practice, researchers and students compute it using statistical software. The three most common platforms in university settings are R, SPSS, and Excel. This section provides complete, annotated code walkthroughs for each, with notes on what each output element means and how to report it correctly. Excel statistical calculations for simpler statistics like mean and mode form the foundation — ICC in Excel requires additional steps or add-ins. For most reliability analyses, R is the most powerful and flexible option. Data science assignments that require ICC analysis nearly always specify R or SPSS as the expected platform.
Calculating ICC in R: The irr and psych Packages
R has two primary packages for ICC: the irr package (which provides the icc() function with explicit model and type arguments) and the psych package (which provides the ICC() function returning all six Shrout-Fleiss forms simultaneously). Both are free and available on CRAN. The psych package is maintained by William Revelle at Northwestern University and is one of the most widely used psychometrics packages in R.
# ── EXAMPLE: ICC using the irr package ──
install.packages("irr") # run once to install
library(irr)
# Data: 6 subjects, 3 raters (columns), pain scale 0–10
ratings <- data.frame(
  Rater1 = c(6, 7, 4, 8, 5, 3),
  Rater2 = c(6, 8, 4, 7, 5, 4),
  Rater3 = c(7, 7, 5, 8, 6, 3)
)
# ICC(2,1) absolute agreement: two-way random, single rater, agreement
icc(ratings, model = "twoway", type = "agreement", unit = "single")
# ICC(2,1) consistency
icc(ratings, model = "twoway", type = "consistency", unit = "single")
# ICC(1,1): one-way random, single rater
icc(ratings, model = "oneway", type = "agreement", unit = "single")
The output from icc() returns the ICC estimate, the F-statistic, degrees of freedom, p-value (testing H₀: ICC = 0), and the 95% confidence interval. Report all of these. The p-value tells you whether ICC is significantly different from zero — but a significant p-value with a small ICC simply means your sample is large enough to detect a weak non-zero effect. The ICC value and CI are what determine practical reliability. T-test concepts apply here — just as a statistically significant t-test doesn’t tell you whether the effect is meaningful, a significant ICC F-test doesn’t guarantee practical reliability.
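To see where the point estimates come from, the same 6 × 3 example can be worked through by hand. The following Python sketch (standard library only; the helper name `anova_mean_squares` is an illustrative choice, not an irr function) computes the ANOVA mean squares and plugs them into the single-rater formulas. It covers point estimates only, not the F-test or confidence interval.

```python
# Same example data as the R walkthrough: 6 subjects (rows) x 3 raters (columns)
ratings = [
    [6, 6, 7],
    [7, 8, 7],
    [4, 4, 5],
    [8, 7, 8],
    [5, 5, 6],
    [3, 4, 3],
]


def anova_mean_squares(rows):
    """Two-way (subjects x raters) ANOVA without replication, plus one-way MSW."""
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_subjects - ss_raters

    return {
        "n": n,
        "k": k,
        "MSB": ss_subjects / (n - 1),                     # between subjects
        "MSR": ss_raters / (k - 1),                       # between raters
        "MSE": ss_error / ((n - 1) * (k - 1)),            # two-way residual
        "MSW": (ss_total - ss_subjects) / (n * (k - 1)),  # one-way within subjects
    }


ms = anova_mean_squares(ratings)
n, k = ms["n"], ms["k"]
MSB, MSR, MSE, MSW = ms["MSB"], ms["MSR"], ms["MSE"], ms["MSW"]

icc_2_1_agreement = (MSB - MSE) / (MSB + (k - 1) * MSE + k * (MSR - MSE) / n)
icc_2_1_consistency = (MSB - MSE) / (MSB + (k - 1) * MSE)
icc_1_1 = (MSB - MSW) / (MSB + (k - 1) * MSW)

print(round(icc_2_1_agreement, 3),
      round(icc_2_1_consistency, 3),
      round(icc_1_1, 3))  # 0.894 0.897 0.893
```

Because these are the standard ANOVA estimators, the three values should agree with the corresponding icc() estimates up to rounding. The raters in this example differ very little in their means, which is why the agreement and consistency forms are nearly identical.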
# ── EXAMPLE: ICC using the psych package — returns all 6 forms ──
install.packages("psych")
library(psych)
# Returns ICC1, ICC2, ICC3, ICC1k, ICC2k, ICC3k simultaneously
result <- ICC(ratings)
print(result)
# Access specific estimates
result$results # data frame with all 6 ICC forms, CI, F, p
When using the psych package output, identify your required ICC form from the results data frame. The columns are: type (ICC1, ICC2, ICC3, ICC1k, ICC2k, ICC3k), ICC (point estimate), F, df1, df2, p, lower.bound, upper.bound. Report type, ICC, lower.bound, and upper.bound at minimum. Running statistical tests in SPSS follows similar output-reading logic — identifying which row and column contains your statistic of interest in a structured output table.
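For orientation, the six Shrout-Fleiss forms can also be written directly in terms of the ANOVA mean squares. The Python sketch below is a hand calculation under stated assumptions, not the psych implementation: the function name `shrout_fleiss_forms` is hypothetical, and the mean squares passed in are the values obtained by hand for the 6 × 3 example data (MSB = 157/18, MSR = 7/18, MSE = 29/90, MSW = 1/3).

```python
def shrout_fleiss_forms(msb, msr, mse, msw, n, k):
    """Point estimates for the six Shrout-Fleiss forms from ANOVA mean squares.

    msb: between-subjects MS      msr: between-raters MS
    mse: two-way residual MS      msw: one-way within-subjects MS
    n:   number of subjects       k:   number of raters
    """
    return {
        "ICC1": (msb - msw) / (msb + (k - 1) * msw),
        "ICC2": (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n),
        "ICC3": (msb - mse) / (msb + (k - 1) * mse),
        "ICC1k": (msb - msw) / msb,
        "ICC2k": (msb - mse) / (msb + (msr - mse) / n),
        "ICC3k": (msb - mse) / msb,
    }


# Mean squares worked out by hand for the 6 x 3 example data (assumed inputs)
forms = shrout_fleiss_forms(msb=157 / 18, msr=7 / 18, mse=29 / 90,
                            msw=1 / 3, n=6, k=3)
for name, value in forms.items():
    print(f"{name}: {value:.3f}")
```

In this layout ICC2 is the absolute-agreement form and ICC3 the consistency form, and each k-rater value is larger than its single-rater counterpart, as the Spearman-Brown relation implies.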
Calculating ICC in SPSS
IBM SPSS computes ICC through the Reliability Analysis procedure. Navigate to Analyze → Scale → Reliability Analysis. Move your rater columns into the Items box. Click Statistics and check the “Intraclass Correlation Coefficient” checkbox. In the ICC dialog: for Model, choose “Two-Way Mixed” for ICC(3) or “Two-Way Random” for ICC(2); for Type, choose “Absolute Agreement” or “Consistency”; for Confidence Interval, enter 95%. Click OK. SPSS returns the ICC value, 95% CI, F-test, and degrees of freedom in the output viewer.
One important SPSS-specific note: SPSS labels the two-way random model “Two-Way Random” and the two-way mixed model “Two-Way Mixed,” consistent with the Shrout-Fleiss taxonomy. The labels “Absolute Agreement” and “Consistency” correspond directly to the McGraw-Wong (1996) extension. The one-way random model (ICC(1)) is offered in the same Model list as “One-Way Random”; the agreement-versus-consistency choice does not apply to it, because a one-way design cannot separate rater effects from error. Excel assignment help for statistical procedures covers the limitations of spreadsheet-based reliability analysis: for ICC, Excel lacks built-in functionality and requires either manual calculation from ANOVA outputs or an add-in like the Real Statistics Resource Pack developed by Charles Zaiontz.
Calculating ICC in Excel (Manual ANOVA Approach)
Excel does not have a built-in ICC function, but you can calculate it by running a two-factor ANOVA (without replication) via the Data Analysis ToolPak and manually applying the ICC formula. Navigate to Data → Data Analysis → Anova: Two-Factor Without Replication. Input your data matrix (subjects as rows, raters as columns). The output ANOVA table gives you SSB (Row SS), SSR (Column SS), and SSE (Error SS), from which you can compute MSB, MSR, and MSE, and then apply the ICC(2,1) formula directly. For ICC(1,1), run Anova: Single Factor instead and extract MSB and MSW.
# ── ICC formula in Excel from two-way ANOVA output ──
# Assuming ANOVA output gives: MSB = B2, MSR = B3, MSE = B4
# n = number of subjects (B6), k = number of raters (B7)
# ICC(2,1) Absolute Agreement formula in Excel cell:
=(B2-B4)/(B2+(B7-1)*B4+B7*(B3-B4)/B6)
# ICC(2,1) Consistency formula in Excel cell:
=(B2-B4)/(B2+(B7-1)*B4)
This manual approach is fully transparent and is an excellent way to demonstrate your understanding of the ICC formula in an assignment. For confidence intervals in Excel, you would need to use the F.INV function to find the critical F-values and then apply the CI formula manually. Performing one-way ANOVA in Excel provides the procedural foundation — ICC builds directly on that ANOVA output. Top websites for statistical datasets are useful for finding practice datasets for ICC calculation assignments when your instructor does not provide one.
Researchers, Tools & Institutions
Key Figures, Institutions, and Tools in ICC Research
Academic assignments on reliability analysis earn higher marks when they demonstrate command of the field’s intellectual history. The following entities are the ones that shaped ICC methodology and continue to define the standards for its application in research today.
Patrick E. Shrout and Joseph L. Fleiss — Columbia University and New York University
Patrick E. Shrout is a professor of psychology at New York University who, with Joseph L. Fleiss (then at Columbia University), published “Intraclass Correlations: Uses in Assessing Rater Reliability” in Psychological Bulletin in 1979. This paper defined the six canonical ICC forms, provided ANOVA-based formulas, and established confidence interval procedures. It remains one of the most cited papers in clinical measurement methodology — with over 30,000 citations as of 2026. What makes this paper uniquely significant is that it brought ANOVA-based variance component thinking into mainstream reliability analysis at a time when Pearson r was still widely (and inappropriately) used for inter-rater reliability. The Shrout-Fleiss (1979) framework is the mandatory citation for any paper or assignment reporting an ICC. Writing an exemplary literature review for a reliability study requires situating the ICC choice within the Shrout-Fleiss taxonomy explicitly.
Terry K. Koo and Mae Y. Li — New York Chiropractic College
Terry K. Koo and Mae Y. Li, researchers at New York Chiropractic College (in Seneca Falls, New York), published “A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research” in the Journal of Chiropractic Medicine in 2016. This paper addressed a growing problem: decades of ICC use had produced inconsistent interpretation standards, and researchers applied different thresholds without justification. Koo and Li synthesized the literature and provided practical decision rules for model selection and benchmarks for interpretation, the poor/moderate/good/excellent framework that has since become the standard. The paper also recommended basing the reliability classification on the 95% confidence interval rather than on the point estimate alone, a recommendation that significantly raised the bar for reliability claims in small studies. This is the second mandatory citation for any ICC analysis. Mastering academic research writing includes knowing which papers anchor the methodological claims in your field; in reliability research, Shrout & Fleiss (1979) and Koo & Li (2016) are those papers.
Kenneth O. McGraw and S.P. Wong — University of Mississippi
Kenneth O. McGraw and S.P. Wong, then at the University of Mississippi, extended the Shrout-Fleiss framework to ten ICC forms in their 1996 Psychological Methods paper “Forming Inferences About Some Intraclass Correlation Coefficients.” Their critical contribution was the formal introduction of the agreement-versus-consistency distinction within two-way models, a conceptual clarification that Shrout and Fleiss had alluded to but not formally operationalized. McGraw and Wong’s paper is the citation for the agreement/consistency distinction in any ICC report. It also provides more general F-based confidence interval formulas applicable to a wider range of ICC forms. Measurement design decisions (whether to treat raters as random or fixed, and whether to assess agreement or consistency) are the practical legacy of McGraw and Wong’s framework.
The irr and psych Packages in R
The irr package for R, developed by Matthias Gamer and colleagues, provides the icc() function implementing the Shrout-Fleiss ICC forms through every combination of model (oneway/twoway), type (agreement/consistency), and unit (single/average). The psych package, developed by William Revelle at Northwestern University, provides the ICC() function returning all six Shrout-Fleiss forms simultaneously along with confidence intervals, making it ideal for exploratory reliability analysis or when you are unsure which form is most appropriate and want to compare them. Both packages are freely available on CRAN and are widely used for ICC analysis in academic research in the United States and internationally. R-based ICC analysis for health or behavioral science assignments often requires both statistical knowledge and programming fluency.
IBM SPSS Statistics
IBM SPSS Statistics is the dominant statistical platform in many clinical, health science, and social science programs in the United States and UK. Its Reliability Analysis procedure implements the one-way random, two-way random, and two-way mixed models, with agreement and consistency options for the two-way models, through a graphical user interface with no coding required. This makes it the most accessible ICC platform for students without programming backgrounds. The SPSS reliability output also provides Cronbach’s alpha, which quantifies internal consistency reliability (a related but distinct construct from inter-rater reliability). Understanding when to report alpha versus ICC is a common assignment challenge: alpha applies to multi-item scales measuring the same construct; ICC applies to repeated measurements or multiple raters of the same subject. Running statistical tests in SPSS provides the procedural familiarity that transfers directly to ICC computation in the Reliability Analysis module.
Real-World Applications
Where ICC Is Used: Applications Across Research Fields
The intraclass correlation coefficient appears in reliability studies across virtually every empirical discipline. Understanding its domain-specific applications — and how the choice of ICC form varies by context — is what transforms rote formula knowledge into genuine research competence. Psychology research assignments in U.S. universities routinely require ICC analysis for measurement validation, questionnaire development, and observational coding studies.
Clinical and Health Research: Measurement Reliability
In clinical research, ICC validates the reliability of physical measurements — range of motion assessments, blood pressure readings, pain scale ratings, radiological measurements, biomarker assays — before those measurements are used as outcomes in trials or diagnostic criteria. The standard practice is to run a dedicated reliability substudy: a sample of patients (ideally n ≥ 30, from the same clinical population as the main study) is assessed by two or more raters using the measurement tool, and ICC(2,1) absolute agreement is computed with its 95% CI. Survival analysis and other outcome analyses in clinical trials depend on reliable measurement of their endpoints — low ICC in the outcome measure attenuates the detectable treatment effect and inflates required sample size. The GRAPPA (Group for Research and Assessment of Psoriasis and Psoriatic Arthritis) and OARSI (Osteoarthritis Research Society International) both publish reliability requirements for outcome measures, typically requiring ICC ≥ 0.85 with CI lower bound ≥ 0.70 for clinical trial endpoints.
Psychology and Education: Inter-Rater Reliability for Observational Coding
In developmental psychology, educational research, and organizational behavior, researchers frequently use trained observers to code complex behaviors (e.g., classroom interaction quality, attachment behavior, leadership style) from video recordings. ICC — specifically ICC(2,1) or ICC(3,1) depending on whether the coders are fixed or treated as a random sample — is the appropriate reliability index when the coded variable is continuous or ordinal. Central limit theorem principles apply to the distribution of ICC estimates across large numbers of coded sessions — with enough coded segments, the ICC estimate becomes stable and its sampling uncertainty decreases. Cohen’s kappa is more appropriate for nominal coding (e.g., classifying a behavior as present/absent), while ICC handles ordered or continuous codes. Chi-square tests are the hypothesis-testing analog for nominal categorical agreement — the categorical counterpart to the F-test used in ICC. In educational assessment, ICC is used to validate essay scoring rubrics, portfolio assessments, and performance-based evaluations where human judges assign numerical scores.
Multilevel Modeling: The ICC as Proportion of Variance Explained
In multilevel modeling (hierarchical linear modeling), ICC has a second, conceptually related meaning: it quantifies the proportion of total variance in the outcome that is attributable to the higher-level grouping variable (school, clinic, country), rather than to individual-level differences. This is sometimes called the variance partition coefficient (VPC). An ICC of 0.15 in a school-effects model means 15% of the variance in student test scores is between schools — a meaningful amount that justifies multilevel modeling rather than standard OLS regression. Multiple linear regression ignores this clustering, producing underestimated standard errors and inflated Type I error rates when the ICC is non-negligible. Generalized linear models for clustered binary outcomes (e.g., educational attainment, treatment success in multi-site clinical trials) require awareness of the design ICC to correctly compute required sample sizes and interpret fixed effects. The performance package in R provides the icc() function specifically for extracting multilevel ICC from mixed effects models fitted with lme4 or brms.
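A short numerical sketch may make the variance-partition reading concrete. The Python below uses hypothetical variance components chosen to reproduce the ICC of 0.15 from the school example, and adds the standard design-effect formula for equal cluster sizes, 1 + (m − 1) × ICC, to show how clustering shrinks the effective sample size. The cluster size of 25 and the total of 1,000 students are assumptions for illustration only.

```python
def variance_partition_coefficient(var_between: float, var_within: float) -> float:
    """Share of total variance that lies between clusters (the multilevel ICC)."""
    return var_between / (var_between + var_within)


def design_effect(icc: float, cluster_size: int) -> float:
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


# Hypothetical variance components: 15 units between schools, 85 within schools
icc = variance_partition_coefficient(var_between=15.0, var_within=85.0)

# Assumed design: 1,000 students sampled as 40 schools of 25 students each
deff = design_effect(icc, cluster_size=25)
effective_n = 1000 / deff

print(f"ICC = {icc:.2f}, design effect = {deff:.1f}, "
      f"effective sample size = {effective_n:.0f}")
```

With these assumed numbers, 1,000 clustered students carry roughly the information of a little over 200 independent ones, which is why ignoring a seemingly modest ICC understates standard errors.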
Test-Retest Reliability in Questionnaire Development
When developing a psychological questionnaire or clinical scale, test-retest reliability assesses whether the instrument produces consistent scores when administered to the same subjects at two different time points (typically 1–4 weeks apart, with no expectation of genuine change). ICC is the preferred statistic for continuous or ordinal scales; Pearson r or Spearman rho are sometimes misused here, again with the same problem of insensitivity to systematic change over time. A two-way ICC with the two measurement occasions treated as “raters” is the usual approach when the test administrator is the same both times and you want to detect temporal instability in the instrument. Paired t-tests are related: you can compute the ICC for test-retest reliability and a paired t-test for systematic mean change between occasions on the same data. A significant paired t-test alongside a high consistency ICC indicates that mean scores changed systematically (perhaps due to learning or practice effects) even though relative rankings were stable; the absolute-agreement ICC will be pulled down by the same shift. Bayesian inference provides an alternative framework for ICC estimation: posterior credible intervals for ICC are available through the performance package in R with brms models, and some researchers find them easier to interpret than frequentist CIs in small samples.
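The interplay between the paired t-test and the two ICC definitions is easier to see with numbers. The Python sketch below uses made-up scores for eight subjects whose second administration runs about three points higher than the first; the data and the helper name `two_way_icc` are hypothetical. It computes the single-measure agreement and consistency ICCs from the two-way ANOVA mean squares, plus the paired t statistic.

```python
import math
from statistics import mean, stdev

# Hypothetical test-retest scores: occasion 2 is shifted upward by about 3 points
time1 = [10, 12, 15, 18, 20, 22, 25, 27]
time2 = [13, 14, 18, 20, 23, 26, 27, 30]


def two_way_icc(rows):
    """Return (absolute agreement, consistency) single-measure ICCs."""
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)

    msb = ss_rows / (n - 1)
    msr = ss_cols / (k - 1)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    agreement = (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)
    consistency = (msb - mse) / (msb + (k - 1) * mse)
    return agreement, consistency


rows = list(zip(time1, time2))
agreement, consistency = two_way_icc(rows)

# Paired t statistic: mean difference divided by its standard error
diffs = [after - before for before, after in rows]
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

print(f"agreement ICC = {agreement:.3f}, consistency ICC = {consistency:.3f}, "
      f"paired t = {t_stat:.1f} (df = {len(diffs) - 1})")
```

In this constructed example the t statistic is far beyond the two-sided 5% critical value for 7 degrees of freedom (about 2.36), the consistency ICC stays close to 1, and the agreement ICC is visibly lower. That gap is the footprint of the systematic shift between occasions.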
Writing About ICC
How to Write About ICC in Assignments, Dissertations, and Research Papers
Writing about the intraclass correlation coefficient in a university assignment or research paper is where conceptual understanding and methodological precision converge. The difference between a student who reports “ICC = 0.82, indicating good reliability” and one who reports “ICC(2,1) absolute agreement = 0.82 (95% CI: 0.67–0.91), indicating good reliability per Koo and Li (2016), based on a two-way random effects model with 34 subjects rated by two trained clinicians” is the difference between surface compliance and genuine methodological literacy. Writing a precise thesis statement for a reliability paper might read: “This study demonstrates that the [Instrument Name] achieves good inter-rater reliability (ICC(2,1) ≥ 0.75, lower bound of 95% CI ≥ 0.60) when administered by trained clinicians in primary care settings.”
The Complete ICC Results Statement
Every ICC result reported in an assignment or paper should contain six elements: the ICC form (e.g., ICC(2,1) absolute agreement), the point estimate, the confidence interval (lower and upper bounds), the sample size (n subjects, k raters), the reliability classification per Koo & Li (2016) or another explicitly stated benchmark, and the model justification. An example complete statement:
“Inter-rater reliability was assessed using the intraclass correlation coefficient with a two-way random effects model and absolute agreement definition [ICC(2,1)], consistent with Shrout and Fleiss (1979) and Koo and Li (2016). Two trained raters independently scored [instrument] for all 40 participants. ICC(2,1) was 0.87 (95% CI: 0.78–0.93), indicating good to excellent reliability. Based on the lower bound of the 95% CI (0.78), reliability was classified as good.”
Notice what this statement accomplishes: it cites the ICC form with its authors; it explains why this form was chosen (two-way random effects, absolute agreement); it gives the complete numerical result including CI; it interprets using both the point estimate and the lower CI bound. This level of completeness is what peer reviewers in clinical, psychological, and educational journals expect — and what statistics examiners at universities in the United States and UK reward. Effective proofreading of statistics assignments should specifically check that every numerical result has a method label, a CI (where applicable), and an interpretation statement. Argumentative writing skills apply to methods justification — you must make the case for your ICC model choice, not simply assert it.
Justifying Your ICC Model Choice in Writing
In a methods section, the ICC model justification follows a predictable but essential structure: describe the study design (same/different raters, number of raters, whether raters are fixed or random), state the ICC form selected as a consequence, cite Shrout and Fleiss (1979) and Koo and Li (2016), and state whether agreement or consistency was the criterion. Do not present the ICC selection as an arbitrary choice — every element should be logically derivable from your research design. Academic writing for research papers demands exactly this: claims are grounded in evidence or logical derivation, not asserted. Research techniques for academic essays include finding and citing the appropriate methodological literature — for ICC, that means the Shrout-Fleiss paper and the Koo-Li guideline, supplemented by any field-specific reliability standards from your discipline’s governing bodies.
⚠️ Common ICC Writing Errors to Avoid
The most frequent marks-losing errors in ICC assignments: (1) reporting the ICC without specifying which of the six forms was used, (2) reporting only the point estimate without the 95% confidence interval, (3) failing to justify the choice of agreement versus consistency, (4) not citing Shrout and Fleiss (1979) or Koo and Li (2016) when interpreting ICC, (5) interpreting ICC in isolation without discussing sample size limitations and CI width, (6) using Pearson r or Spearman rho for reliability when ICC is required, and (7) confusing inter-rater reliability (agreement between different raters) with internal consistency reliability (Cronbach’s alpha for multi-item scales measuring the same construct). Addressing all seven demonstrates exactly the methodological precision that distinguishes excellent from adequate work. Common student writing mistakes in methods-heavy assignments often reduce to missing specificity — the antidote is explicit, complete reporting of every methodological parameter.
Key Terms & Concepts
Essential Vocabulary for ICC: Key Terms You Must Know
Mastering the intraclass correlation coefficient at assignment and research level requires precise vocabulary. The following terms are those that appear on rubrics, in professor feedback, and in peer-reviewed reliability literature. Understanding them — not just as definitions but in terms of their relationships and implications — is what separates surface-level familiarity from genuine command of ICC methodology. Descriptive versus inferential statistics is the broadest conceptual context: ICC is inferential — it estimates a population reliability parameter from a sample — and its CI quantifies the inferential uncertainty.
Core Reliability and ICC Vocabulary
Reliability — the consistency or reproducibility of measurements; the degree to which a measurement procedure produces the same result when applied under the same conditions. Inter-rater reliability — the degree of agreement among different raters assessing the same subjects or stimuli. Intrarater reliability — the consistency of ratings made by the same rater across different occasions. Test-retest reliability — the stability of measurements across time in the absence of genuine change. Agreement — whether multiple measurements produce the same absolute values; assessed by absolute-agreement ICC. Consistency — whether multiple measurements rank subjects in the same order; assessed by consistency ICC; does not penalize systematic offsets. Probability distributions underlie the F-distribution-based CI formulas for ICC — students who understand F-distributions will find the CI derivation intuitive rather than arbitrary.
Between-subject variance (σ²ₛ) — the component of total variance reflecting genuine differences between subjects; the “signal” in ICC. Within-subject variance (error variance, σ²ₑ) — the component reflecting measurement error, rater disagreement, or temporal instability; the “noise.” Rater variance (σ²ᵣ) — the component of variance attributable to systematic differences between raters; present only in two-way models. Subject-by-rater interaction — variance from inconsistent patterns of rater behavior across subjects (e.g., Rater 1 gives higher scores for severe cases but Rater 2 gives higher scores for mild cases); captured by the error term MSE in two-way models. Variance partition coefficient (VPC) — alternative name for ICC used in multilevel modeling contexts; equals the proportion of total variance explained by the clustering (group-level) variable. Probability distributions for ANOVA components underlie the statistical theory of ICC — the between-subjects MS follows a scaled chi-squared distribution under the null hypothesis, producing the F-ratio used for significance testing and CI construction.
Advanced and Related Concepts
ANOVA Mean Squares (MS) — the variance estimates obtained from dividing sums of squares by their degrees of freedom in the ANOVA table; the raw inputs to all ICC formulas. Spearman-Brown prophecy formula — the formula predicting how reliability changes when the number of raters changes; it links the average-measures form to the single-measure form: ICC(k) = k×ICC(1) / (1 + (k−1)×ICC(1)). Attenuation bias — the downward bias in estimated correlations or regression coefficients caused by measurement error; for a correlation, the observed value equals the true value multiplied by the square root of the product of the reliabilities (ICCs) of the two measures. Generalizability theory — a framework extending ICC to multiple sources of measurement error simultaneously, developed by Lee Cronbach and colleagues; can be seen as a generalization of the Shrout-Fleiss ICC framework. Intracluster correlation — the ICC in cluster-randomized trial design, where it is used to calculate the design effect and required sample size; it corresponds to the one-way random effects form, ICC(1,1), of the outcome variable, with clusters in the role of subjects and cluster members in the role of raters. Causal inference in RCTs depends on correctly accounting for clustering, which requires accurate ICC estimation from pilot data or the published literature. Beta distribution concepts are relevant because ICC estimates are bounded between -1 and 1 (practically 0 and 1), making them behave more like correlation coefficients than unbounded statistics; some CI approaches approximate their sampling distributions using Fisher’s Z transformation. Probability density function concepts apply to the theoretical sampling distribution of the ICC estimator, which is needed to derive the F-distribution-based CI approach used in practice. Residual analysis in the context of ICC means examining the ICC ANOVA residuals for violations of normality and homoscedasticity, using the same diagnostic tools as in regression. Heteroscedasticity in ICC context refers to rater-specific variance that differs across rating levels (e.g., raters agree well on extreme cases but disagree on borderline ones), a violation that biases ICC estimates and should be checked with residual plots.
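Two of these formulas are simple enough to check with a few lines of arithmetic. The Python sketch below implements the Spearman-Brown relation and the classical attenuation relation for a correlation (observed r = true r × √(reliability of X × reliability of Y)). The function names and the input values are illustrative assumptions.

```python
import math


def spearman_brown(icc_single: float, k: int) -> float:
    """Reliability of the mean of k raters, given the single-rater ICC."""
    return k * icc_single / (1 + (k - 1) * icc_single)


def attenuated_correlation(true_r: float, reliability_x: float,
                           reliability_y: float) -> float:
    """Expected observed correlation when both measures contain error."""
    return true_r * math.sqrt(reliability_x * reliability_y)


# Assumed inputs: a single-rater ICC of 0.70 averaged over 3 raters
icc_mean_of_3 = spearman_brown(0.70, k=3)

# Assumed inputs: true correlation 0.50, both measures with reliability 0.80
observed_r = attenuated_correlation(0.50, 0.80, 0.80)

print(f"ICC for mean of 3 raters: {icc_mean_of_3:.3f}")  # 0.875
print(f"Observed correlation:     {observed_r:.2f}")     # 0.40
```

The first result shows why averaging raters is an effective way to buy reliability, and the second shows how measurement error in both variables can hide a fifth of a true correlation.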
Frequently Asked Questions
Frequently Asked Questions: Intraclass Correlation Coefficient
What is the intraclass correlation coefficient and what does it measure?
The intraclass correlation coefficient (ICC) is a reliability statistic that measures the degree of agreement or consistency among multiple measurements of the same subjects. It partitions total score variance into two components: variance attributable to genuine differences between subjects (the signal), and variance attributable to measurement error, rater disagreement, or temporal inconsistency (the noise). The ICC is the ratio of between-subject variance to total variance. As a population quantity it ranges from 0 (no reliability, error variance dominates) to 1 (perfect reliability, all variance reflects true subject differences); sample estimates can fall slightly below 0 when the measurements within a subject vary more than the subject means do, and such values are read as no reliability. ICC is the gold standard for inter-rater reliability, test-retest reliability, and intrarater reliability in clinical, psychological, and educational research, replacing the previously misused Pearson r for this purpose.
What is the difference between ICC(2,1) and ICC(3,1)?
Both ICC(2,1) and ICC(3,1) require the same raters to evaluate all subjects, but they differ in how those raters are treated statistically. ICC(2,1) uses a two-way random effects model — it treats raters as a random sample from a larger population of possible raters, and the reliability conclusion is meant to generalize to other raters from that population. ICC(3,1) uses a two-way mixed effects model — it treats those specific raters as fixed effects, meaning the reliability conclusion applies only to those specific raters or instruments, with no generalization intended. In practice: if you want to show that your instrument is reliable when used by any trained clinician, use ICC(2,1). If you are only establishing reliability for the specific raters in your study, use ICC(3,1). The consistency-form ICC(2,1) and ICC(3,1) are numerically equal; they only differ for the agreement form.
What is a good ICC value and how do I interpret it?
According to Koo and Li (2016) in the Journal of Chiropractic Medicine — the most widely cited interpretation guideline — ICC values below 0.50 indicate poor reliability, 0.50–0.75 indicate moderate reliability, 0.75–0.90 indicate good reliability, and values above 0.90 indicate excellent reliability. Critically, Koo and Li recommend applying these benchmarks to the lower bound of the 95% confidence interval, not just the point estimate. A point estimate of 0.82 (good reliability) with a lower CI bound of 0.48 (poor reliability) should be classified as poor-to-good — the sample was too small to conclude good reliability with confidence. The appropriate reliability threshold also depends on the application: clinical measurement for individual diagnosis typically requires ICC ≥ 0.90, while research-level measurement may accept ICC ≥ 0.70.
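The benchmark logic can be written down as a small helper. The Python sketch below encodes the Koo and Li cut-offs and classifies a result three ways: by point estimate, by the lower CI bound, and as the range spanned by the interval. The function names are hypothetical, and the treatment of values that fall exactly on 0.50, 0.75, or 0.90 is an assumption, since the published bands share their endpoints. The numbers come from the example results statement earlier in this guide: ICC = 0.87 (95% CI: 0.78–0.93).

```python
def koo_li_label(icc: float) -> str:
    """Map an ICC value to the Koo and Li (2016) reliability band."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"


def classify_icc(point: float, lower: float, upper: float) -> dict:
    """Classify by point estimate, by lower CI bound, and by the CI range."""
    low_label, high_label = koo_li_label(lower), koo_li_label(upper)
    ci_range = low_label if low_label == high_label else f"{low_label} to {high_label}"
    return {
        "point_estimate": koo_li_label(point),
        "conservative": low_label,  # based on the lower bound of the 95% CI
        "ci_range": ci_range,
    }


summary = classify_icc(point=0.87, lower=0.78, upper=0.93)
print(summary)
# {'point_estimate': 'good', 'conservative': 'good', 'ci_range': 'good to excellent'}
```

Running the same helper on a point estimate of 0.82 with a lower bound of 0.48 returns "poor" as the conservative label, which is the situation the paragraph above warns about.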
How do I calculate ICC in R?
In R, use the icc() function from the irr package or the ICC() function from the psych package. First, organize your data with subjects as rows and raters as columns. For the irr package: install.packages("irr"); library(irr); icc(data, model="twoway", type="agreement", unit="single") returns ICC(2,1) absolute agreement with its 95% CI. Change model to "oneway" for ICC(1), type to "consistency" for consistency ICC, and unit to "average" for the k-rater average form. For the psych package: library(psych); ICC(data) returns all six Shrout-Fleiss forms simultaneously, with ICC estimates, F-statistics, degrees of freedom, p-values, and confidence intervals in a single output table. Both packages are freely available on CRAN and provide all information needed for a complete reliability report.
How do I calculate ICC in SPSS?
In SPSS, navigate to Analyze → Scale → Reliability Analysis. Move your rater variables into the Items box. Click Statistics and check “Intraclass correlation coefficient.” In the ICC dialog, select Model: choose “Two-Way Random” for ICC(2) (when raters are a random sample and you want to generalize) or “Two-Way Mixed” for ICC(3) (when specific raters are fixed and no generalization is intended). For Type, select “Absolute Agreement” for agreement ICC or “Consistency” for consistency ICC. Set confidence interval level to 95%. Click Continue, then OK. The output table reports the single-measures ICC (equivalent to the individual-rater form), the average-measures ICC (equivalent to the k-rater average form), the F-statistic, degrees of freedom, and 95% CI. For the one-way random model, ICC(1), choose “One-Way Random” from the same Model list; the agreement-versus-consistency choice does not apply to that model.
What is the difference between ICC and Cronbach’s alpha?
ICC and Cronbach's alpha both quantify reliability, but they address different types of reliability for different data structures. Cronbach's alpha measures internal consistency: the extent to which multiple items on a scale all measure the same underlying construct. It answers: "Do all items on this questionnaire hang together?" Alpha is computed from the item variances and covariances (or from the inter-item correlation matrix in its standardized form) and tends to increase with the number of items. ICC measures agreement or consistency among raters or measurement occasions for the same variable. It answers: "Do different raters produce consistent measurements?" ICC is based on ANOVA variance components. The key distinction: alpha applies when you have multiple different items that are supposed to measure the same construct (e.g., a 10-item anxiety scale); ICC applies when you have multiple measurements of the same variable from different raters or occasions (e.g., three physiotherapists each measuring the same patient's range of motion). Using alpha when ICC is required, or vice versa, is a methodological error.
How many subjects do I need for a reliable ICC estimate?
Sample size for ICC reliability studies is best determined by the desired confidence interval width, not just statistical power. Under the large-sample approximation of Bonett (2002), achieving a 95% CI of ±0.10 around an expected ICC of 0.70 with two raters takes approximately 100 subjects, and tightening the CI to ±0.05 takes approximately 400. Adding raters reduces the required number of subjects: with three raters, the same ±0.10 interval requires roughly 65 to 70 subjects. Walter et al. (1998) give the corresponding power-based calculation. Formal sample size calculation for ICC studies should use dedicated software such as PASS (NCSS) or the R package iccpower. The key message: most published reliability studies with fewer than 30 subjects have CIs so wide that reliability classifications are highly uncertain, a limitation that should be explicitly acknowledged in the write-up.
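For a quick planning estimate before turning to dedicated software, one widely used large-sample approximation is Bonett's (2002) formula, n ≈ 8z²(1 − ρ)²(1 + (k − 1)ρ)² / (k(k − 1)w²) + 1, where ρ is the planning ICC, k the number of raters, and w the full width of the interval (so ±0.10 means w = 0.20). The sketch below is an assumption-laden illustration with a hypothetical function name; its results are approximate and can differ from figures obtained by other methods.

```python
from math import ceil
from statistics import NormalDist


def subjects_for_ci_width(icc, raters, full_width, conf_level=0.95):
    """Approximate number of subjects for a target CI width (Bonett, 2002).

    icc        : planning value of the ICC (an assumption the user supplies)
    raters     : number of raters per subject (k >= 2)
    full_width : total CI width, upper bound minus lower bound
    """
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    numerator = 8 * z ** 2 * (1 - icc) ** 2 * (1 + (raters - 1) * icc) ** 2
    denominator = raters * (raters - 1) * full_width ** 2
    return ceil(numerator / denominator + 1)


n_two_raters = subjects_for_ci_width(0.70, raters=2, full_width=0.20)
n_three_raters = subjects_for_ci_width(0.70, raters=3, full_width=0.20)
n_tight = subjects_for_ci_width(0.70, raters=2, full_width=0.10)
print(n_two_raters, n_three_raters, n_tight)  # 101 68 401
```

Because the width enters as a square, halving the interval roughly quadruples the required number of subjects.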
What does it mean when ICC is very low even though raters seem to agree?
A low ICC despite apparent rater agreement is almost always caused by a homogeneous subject sample. Remember: ICC = between-subject variance / total variance. If all your subjects have similar true scores (low between-subject variance), the ICC will be low even if rater error is small, because the denominator (total variance) is approximately equal to the error variance alone. This is the subject-homogeneity problem. For example, if you assess grip strength reliability in a sample of professional arm wrestlers who all have near-maximum grip, individual rater error will dominate a very small true variance, producing a low ICC — even though the raters are highly consistent in absolute terms. The solution is either to use a more heterogeneous subject sample representative of the population where the instrument will be used, or to report a different reliability index (like the Standard Error of Measurement, SEM) that does not depend on sample variance. Always describe your subject sample’s characteristics when reporting ICC so readers can judge its generalizability.
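The subject-homogeneity effect is easy to see with a little arithmetic. The variance values below are hypothetical, chosen only to keep the numbers round: rater error is identical in both samples, and only the spread of true scores changes.

```python
from math import sqrt


def icc_from_variances(between_subject_var, error_var):
    """ICC as between-subject variance over total variance."""
    return between_subject_var / (between_subject_var + error_var)


error_var = 4.0  # identical rater error in both samples (hypothetical)

icc_heterogeneous = icc_from_variances(36.0, error_var)  # wide range of true scores
icc_homogeneous = icc_from_variances(1.0, error_var)     # nearly identical subjects
sem = sqrt(error_var)  # error in scale units, the same for both samples

print(icc_heterogeneous)  # 0.9
print(icc_homogeneous)    # 0.2
print(sem)                # 2.0
```

The raters are equally precise in both cases, yet the ICC drops from 0.90 to 0.20 purely because the subjects became more alike, while the error expressed in scale units does not change.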
Can I use ICC for binary or categorical ratings?
ICC is theoretically applicable to any rating scale, including binary (0/1) ratings, but it is generally not recommended for nominal categorical data. For nominal categories (including binary ratings), Cohen’s kappa (for two raters) or Fleiss’s kappa (for more than two raters) is the preferred reliability index because it correctly handles the categorical structure and corrects for chance agreement. ICC applied to binary data produces mathematically valid results but loses the interpretive clarity of the standard benchmarks, which were developed for continuous and ordinal data. For ordinal ratings with multiple categories (e.g., a 5-point Likert scale), ICC is appropriate — the Spearman correlation between raters is sometimes used as an alternative for ordinal data, though ICC is generally preferred because it accounts for the specific variance structure of reliability study designs. The safest rule: use ICC for continuous or ordinal scales, Cohen’s/Fleiss’s kappa for nominal categories.
What is the Standard Error of Measurement (SEM) and how does it relate to ICC?
The Standard Error of Measurement (SEM) is a reliability index derived from ICC that expresses measurement error in the original units of the measurement scale, rather than as a dimensionless proportion. SEM = SD_total × √(1 − ICC), where SD_total is the standard deviation of scores across all subjects and measurement occasions. While ICC is a relative reliability index (it depends on between-subject variance and can be inflated by subject heterogeneity), SEM is an absolute index — it answers “how much error, in scale units, would I expect if I measured this subject again?” A SEM of 2.5 mmHg for a blood pressure instrument means that the true value likely falls within ±2.5 mmHg of the measured value about 68% of the time (one SEM) or within ±5.0 mmHg about 95% of the time (two SEM). For clinical decision-making, SEM is often more interpretable and actionable than ICC because it describes error in clinically meaningful units — particularly relevant when ICC is affected by subject homogeneity.
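As a worked illustration of the SEM formula, the sketch below uses hypothetical inputs (a total SD of 10 mmHg and an ICC of 0.9375) chosen so that the result matches the 2.5 mmHg figure discussed above. The function name is likewise an assumption made for this example.

```python
from math import sqrt


def standard_error_of_measurement(sd_total, icc):
    """SEM = SD_total * sqrt(1 - ICC), in the units of the original scale."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1 for this formula")
    return sd_total * sqrt(1.0 - icc)


sem = standard_error_of_measurement(sd_total=10.0, icc=0.9375)
band_68 = (-1 * sem, 1 * sem)  # about 68% of measurement errors
band_95 = (-2 * sem, 2 * sem)  # about 95% of measurement errors

print(sem)      # 2.5
print(band_68)  # (-2.5, 2.5)
print(band_95)  # (-5.0, 5.0)
```

Both bands are in mmHg, which is what makes SEM directly usable when judging whether a change in an individual patient exceeds measurement noise.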
