How to Calculate Intraclass Correlation Coefficient (ICC)
Statistics & Reliability Analysis
All six ICC forms, ANOVA-based formulas, R & SPSS walkthroughs, and Koo & Li (2016) interpretation benchmarks — everything you need for assignments, dissertations, and research reports.
What Is ICC & Why It Matters
Intraclass Correlation Coefficient (ICC) answers one of the most important questions in empirical research: when two or more raters, instruments, or measurement occasions produce values for the same subject, how much can you trust those measurements? ICC is not a simple correlation. It is a reliability statistic built from variance components — and choosing the wrong form, or calculating it without understanding what it measures, is one of the fastest ways to undermine a study’s credibility.
The ICC has been the standard reliability metric in clinical, psychological, educational, and social research for decades. Its defining feature is that it partitions total score variance into two components: variance that reflects genuine differences between subjects, and variance that reflects measurement error (disagreement between raters, instruments, or occasions). The proportion of total variance explained by between-subject differences is the ICC. A high ICC means the measurements successfully discriminate between subjects — meaning the instrument is reliable. A low ICC means the error variance dominates — meaning you cannot trust that the measurement reflects anything real about the subject.
- 6: forms of ICC defined by Shrout & Fleiss (1979) in the landmark Psychological Bulletin paper
- 0.90: Koo & Li (2016) threshold above which ICC indicates excellent reliability for clinical measurement
- 10: total ICC forms identified by McGraw & Wong (1996) in Psychological Methods
What Exactly Is the Intraclass Correlation Coefficient?
The intraclass correlation coefficient is a reliability index that quantifies the degree of agreement or consistency among multiple measurements of the same subjects, where those measurements come from raters of the same “class” (meaning there is no logical way to distinguish them, unlike distinguishing variable X from variable Y). The word “intraclass” distinguishes it from the Pearson product-moment correlation, an interclass correlation that measures the linear association between two logically distinct variables.
ICC operates by modeling total score variance using ANOVA. Scores vary because subjects genuinely differ from each other (this is the signal we want), and because raters or measurement occasions produce inconsistent readings (this is the error we want to minimize). The ICC is, in its most basic form, the ratio of between-subject variance to total variance. This simple idea has remarkable consequences: it means an ICC computed in a very homogeneous subject sample will be artificially low not because the instrument is unreliable, but because there is little true between-subject variance to partition.
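In its simplest one-way form, the ratio described above can be written as follows. The symbols are the ones used in the vocabulary section of this guide: σ²ₛ for between-subject variance and σ²ₑ for error variance. This is a sketch of the basic definition only, not of the two-way forms.

```latex
\mathrm{ICC} \;=\; \frac{\sigma^2_{s}}{\sigma^2_{s} + \sigma^2_{e}}
```

With the error variance held fixed, shrinking σ²ₛ pushes the ratio toward 0, which is why a homogeneous sample lowers the ICC even when the instrument itself is unchanged.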
Why ICC Replaced Pearson Correlation for Reliability
Before ICC became standard, researchers frequently misused Pearson’s r to assess inter-rater reliability. The problem is that Pearson r is insensitive to systematic rater differences. If one radiologist consistently rates tumor size 20% higher than a colleague, their readings will have a Pearson r of 1.0 — perfect linear association — yet they fundamentally disagree. Agreement-form ICC correctly penalizes this systematic offset, making it far more appropriate whenever you need to establish that different raters or instruments can be used interchangeably.
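The radiologist scenario can be checked numerically. The sketch below uses made-up tumor sizes, and the helper name `icc_agreement` is hypothetical; it applies the ICC(2,1) absolute-agreement formula given in the formulas section of this guide, using base R only.

```r
# Hypothetical tumor sizes (mm) from radiologist A; B reads 20% higher.
rater_a <- c(10, 15, 20, 25, 30, 35)
rater_b <- 1.2 * rater_a
ratings <- cbind(rater_a, rater_b)

# Pearson r only sees the linear association between the two raters.
pearson_r <- cor(rater_a, rater_b)

# ICC(2,1) absolute agreement from two-way ANOVA mean squares.
icc_agreement <- function(x) {
  n <- nrow(x)
  k <- ncol(x)
  grand <- mean(x)
  ssb <- k * sum((rowMeans(x) - grand)^2)   # between subjects
  ssr <- n * sum((colMeans(x) - grand)^2)   # between raters
  sse <- sum((x - grand)^2) - ssb - ssr     # residual
  msb <- ssb / (n - 1)
  msr <- ssr / (k - 1)
  mse <- sse / ((n - 1) * (k - 1))
  (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)
}

icc_a1 <- icc_agreement(ratings)
print(c(pearson = pearson_r, icc_agreement = icc_a1))
```

For these illustrative numbers Pearson r is 1 while the agreement ICC comes out near 0.90, because the systematic offset enters the denominator through the rater mean square.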
The landmark paper establishing the ICC framework was Shrout and Fleiss (1979), published in Psychological Bulletin, which defined the six canonical forms of ICC and provided formulas based on one-way and two-way ANOVA. The most widely cited interpretation guide is Koo and Li (2016) in the Journal of Chiropractic Medicine, which provides the benchmarks (poor, moderate, good, excellent) used in virtually every reliability paper published today.
The core conceptual point about ICC: The ICC measures how much of the total variance in scores is attributable to true differences between subjects rather than to measurement error. A high ICC (close to 1.0) means your instrument reliably distinguishes subjects from each other. A low ICC (close to 0) means measurement error dominates — the scores are mostly noise.
The Six Forms of ICC
The Six ICC Forms: Shrout and Fleiss (1979) Explained
The most consequential decision in any ICC analysis is choosing the correct form. Shrout and Fleiss (1979) defined six forms of the intraclass correlation coefficient organized by two dimensions: the statistical model (one-way random, two-way random, or two-way mixed) and the unit of measurement (single rater vs. mean of k raters). Choosing incorrectly — say, using a two-way formula when a one-way design was used — produces a biased and uninterpretable result.
An additional dimension added by McGraw and Wong (1996) in Psychological Methods — who extended the taxonomy to ten forms — is the choice between agreement and consistency within each two-way model. This is functionally the most important distinction for researchers: agreement ICC asks “do the raters produce the same absolute values?”, while consistency ICC asks “do the raters rank subjects the same way?”
Model 1: One-Way Random Effects (ICC(1,1) and ICC(1,k))
The one-way random effects model applies when each subject is rated by a different set of raters randomly drawn from a larger population, and when those rater identities are not tracked. There is no way to account for rater-specific effects because different subjects were rated by different people. This is the most conservative ICC form — it produces the lowest ICC estimates — because it cannot separate rater-to-rater variability from other sources of within-subject error.
Model 2: Two-Way Random Effects (ICC(2,1) and ICC(2,k))
The two-way random effects model applies when the same set of raters rates all subjects AND both subjects and raters are considered random samples from larger populations. This is the most commonly appropriate model for inter-rater reliability studies in psychology, education, and clinical research. ICC(2,1) can be calculated in both agreement and consistency forms: the agreement form penalizes systematic mean differences between raters; the consistency form does not.
Model 3: Two-Way Mixed Effects (ICC(3,1) and ICC(3,k))
The two-way mixed effects model requires the same raters to rate all subjects but treats those specific raters as fixed effects — the only raters of interest — rather than as a random sample. This model is appropriate when the same two or three specific instruments are used throughout a study and you do not intend to generalize to other instruments. The correct rule: if you plan to generalize your reliability conclusions to other raters, use ICC(2). If limited to these specific raters, use ICC(3).
| ICC Form | Model | Rater Design | Agreement vs. Consistency | Typical Use Case |
|---|---|---|---|---|
| ICC(1,1) | One-way random | Different raters per subject (random) | Absolute agreement only | Community judges, each rating a different subset |
| ICC(1,k) | One-way random | Different raters per subject (random) | Absolute agreement only | Same as above but reporting mean of k raters |
| ICC(2,1) | Two-way random | Same raters for all subjects (random) | Agreement or Consistency | Inter-rater reliability; generalization to new raters intended |
| ICC(2,k) | Two-way random | Same raters for all subjects (random) | Agreement or Consistency | Same; reporting mean across k raters |
| ICC(3,1) | Two-way mixed | Same specific raters for all subjects (fixed) | Consistency primarily | Intrarater reliability; no generalization beyond these raters |
| ICC(3,k) | Two-way mixed | Same specific raters for all subjects (fixed) | Consistency primarily | Same; reporting mean across k fixed raters |
Formulas & Step-by-Step Calculation
ICC Formulas: ANOVA-Based Calculation from First Principles
Calculating the intraclass correlation coefficient by hand requires running a one-way or two-way ANOVA and extracting the Mean Square components. Understanding why the formula uses these particular components reveals what the ICC actually measures and why it behaves the way it does in small samples.
Step 1: Organize Your Data
Your data matrix should have n rows (subjects or targets) and k columns (raters or measurement occasions). Every cell contains the rating assigned to subject i by rater j. For ICC to be meaningful, the same construct must be measured by all raters using the same scale.
Step 2: Run the ANOVA and Extract Mean Squares
MSB = SSB / (n − 1)
MSW = SSW / (n(k − 1))
MSR = SSR / (k − 1)
MSE = SSE / ((n − 1)(k − 1))
Where n = number of subjects, k = number of raters, SSB = sum of squares between subjects, SSW = sum of squares within subjects, SSR = sum of squares between raters, SSE = sum of squares error (interaction).
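As a rough illustration of Step 2, the mean squares can be computed directly from the ratings matrix in base R. The helper name `icc_mean_squares` is hypothetical, and the ratings are the same six-subject, three-rater values used in the R section of this guide.

```r
# Example ratings: 6 subjects (rows) x 3 raters (columns).
ratings <- cbind(
  Rater1 = c(6, 7, 4, 8, 5, 3),
  Rater2 = c(6, 8, 4, 7, 5, 4),
  Rater3 = c(7, 7, 5, 8, 6, 3)
)

# Hypothetical helper: ANOVA mean squares for an n x k ratings matrix.
icc_mean_squares <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)      # number of subjects
  k <- ncol(x)      # number of raters
  grand <- mean(x)

  sst <- sum((x - grand)^2)                 # total sum of squares
  ssb <- k * sum((rowMeans(x) - grand)^2)   # between subjects
  ssr <- n * sum((colMeans(x) - grand)^2)   # between raters
  ssw <- sst - ssb                          # within subjects (one-way model)
  sse <- ssw - ssr                          # residual (two-way model)

  c(MSB = ssb / (n - 1),
    MSW = ssw / (n * (k - 1)),
    MSR = ssr / (k - 1),
    MSE = sse / ((n - 1) * (k - 1)))
}

ms <- icc_mean_squares(ratings)
print(round(ms, 4))
```

The within-subject sum of squares splits into a rater part and a residual part, which is why the two-way models can isolate systematic rater differences and the one-way model cannot.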
Step 3: Apply the ICC Formula for Your Model
ICC(1,1) = (MSB − MSW) / (MSB + (k−1)·MSW)
ICC(1,1): One-way random, single rater, absolute agreement.
ICC(2,1) Agreement = (MSB − MSE) / (MSB + (k−1)·MSE + k·(MSR − MSE)/n)
ICC(2,1) Absolute Agreement: Two-way random, single rater. Penalizes systematic rater differences via MSR. Most stringent and most commonly appropriate for clinical research.
ICC(2,1) Consistency = (MSB − MSE) / (MSB + (k−1)·MSE)
ICC(2,1) Consistency: Does not penalize systematic rater mean differences. Numerically equals ICC(3,1) consistency.
ICC(k) = ICC(1) × k / (1 + ICC(1)·(k−1))
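Spearman-Brown formula for ICC(1,k) and ICC(2,k): the average-measures ICC is derived from the single-measure ICC of the same model, analogous to the Spearman-Brown prophecy formula in classical test theory. In the formula above, ICC(1) denotes the single-measure value and ICC(k) the corresponding value for the mean of k raters.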
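To see that the step-up formula and the ANOVA expressions agree, the short derivation below substitutes the ICC(2,1) absolute-agreement formula into the Spearman-Brown expression. It uses only the symbols already defined in Step 2 and is offered as a consistency check rather than a formal proof.

```latex
\begin{align*}
\mathrm{ICC}(2,k)
  &= \frac{k\,\mathrm{ICC}(2,1)}{1 + (k-1)\,\mathrm{ICC}(2,1)} \\
  &= \frac{k\,(MSB - MSE)}
          {\bigl[MSB + (k-1)\,MSE + \tfrac{k}{n}(MSR - MSE)\bigr] + (k-1)(MSB - MSE)} \\
  &= \frac{k\,(MSB - MSE)}{k\,MSB + \tfrac{k}{n}(MSR - MSE)} \\
  &= \frac{MSB - MSE}{MSB + (MSR - MSE)/n}
\end{align*}
```

The last line is the average-measures absolute-agreement form, so computing it from mean squares or from the single-measure value gives the same number.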
Step 4: A Worked Numerical Example
Consider a study where n = 6 subjects are each rated by k = 3 raters on a pain scale (0–10). The two-way ANOVA yields: MSB = 14.28, MSR = 0.48, MSE = 0.72. We compute ICC(2,1) absolute agreement:
ICC(2,1) = (14.28 − 0.72) / (14.28 + (3−1)·0.72 + 3·(0.48 − 0.72)/6)
= 13.56 / (14.28 + 1.44 − 0.12) = 13.56 / 15.60 = 0.869
ICC(2,1) = 0.869, indicating good reliability per Koo & Li (2016). The 95% CI computed in R would typically yield approximately [0.63, 0.96] for this small sample — illustrating why the lower bound must be reported.
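The arithmetic of the worked example can be reproduced in a few lines of base R. The mean squares are the ones stated in the example; the consistency and average-measures values are added here only for comparison and are not part of the original example.

```r
# Mean squares and design from the worked example.
msb <- 14.28
msr <- 0.48
mse <- 0.72
n <- 6   # subjects
k <- 3   # raters

# ICC(2,1) absolute agreement and consistency
icc21_agreement   <- (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)
icc21_consistency <- (msb - mse) / (msb + (k - 1) * mse)

# Spearman-Brown step-up to the mean of k raters
icc2k_agreement <- k * icc21_agreement / (1 + (k - 1) * icc21_agreement)

print(round(c(agreement_single = icc21_agreement,
              consistency_single = icc21_consistency,
              agreement_average = icc2k_agreement), 3))
```

Because MSR is smaller than MSE in this example, the rater term is slightly negative, so the agreement value (about 0.869) ends up marginally above the consistency value (about 0.863). With real data showing a systematic rater offset, the ordering is usually the other way round.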
Why Sample Size Critically Affects ICC Confidence Intervals
With only 6 subjects, even an ICC point estimate of 0.87 comes with a very wide confidence interval, here spanning moderate to excellent reliability. Walter et al. (1998) showed that achieving a 95% CI width of ±0.10 around an expected ICC of 0.70 with two raters requires approximately 30–35 subjects. For three raters, approximately 20–25 subjects suffice. These figures should guide your study design from the outset rather than being discovered after data collection.
Interpreting ICC Values
Interpreting the ICC: Koo and Li (2016) Benchmarks
Once you have calculated the ICC, interpretation requires more than looking up a benchmark bin. The numerical value means different things depending on the ICC form used, the homogeneity of your subject sample, and whether you are reporting a point estimate or a confidence interval.
The Koo and Li (2016) Benchmarks
Koo & Li (2016) ICC Interpretation Benchmarks (lower bound of 95% CI):
- ICC < 0.50: Poor reliability
- ICC 0.50–0.75: Moderate reliability
- ICC 0.75–0.90: Good reliability
- ICC > 0.90: Excellent reliability
These thresholds should be applied to the 95% CI, with particular attention to its lower bound, not just to the point estimate. A point estimate of 0.85 (good reliability) with a lower CI bound of 0.48 (poor reliability) should be reported as poor-to-good rather than simply good.
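The benchmark lookup can be written as a small helper. This is a minimal sketch: the function name `classify_icc` is hypothetical, the handling of values that fall exactly on a boundary is an assumption (the published bins overlap at 0.50 and 0.75), and the upper CI bound of 0.95 in the usage lines is invented for illustration.

```r
# Hypothetical helper: map ICC values onto the Koo & Li (2016) labels.
# Assumption: 0.50 and 0.75 go to the higher category, and exactly 0.90
# stays "good" because "excellent" is defined as above 0.90.
classify_icc <- function(x) {
  ifelse(x < 0.50, "poor",
         ifelse(x < 0.75, "moderate",
                ifelse(x <= 0.90, "good", "excellent")))
}

# Point estimate and lower bound from the text; upper bound is illustrative.
ci <- c(lower = 0.48, estimate = 0.85, upper = 0.95)
labels <- classify_icc(ci)
print(labels)
```

Classifying the bounds as well as the estimate makes it obvious when an interval spans several categories.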
Agreement vs. Consistency: Which ICC to Report?
Report Agreement ICC When:
- Different raters will be used interchangeably in practice
- Absolute score values matter for clinical or practical decisions
- Establishing that one measurement tool can replace another
- Systematic rater bias would affect interpretation or action
Report Consistency ICC When:
- The same fixed rater(s) will always be used in practice
- Only the relative ordering of subjects matters
- A systematic offset between raters can be calibrated out
- Computing intra-rater reliability across time
⚠️ Reporting ICC Point Estimates Without Confidence Intervals Is No Longer Acceptable: Major clinical journals now require 95% CIs for all ICC reports. Koo and Li (2016) explicitly state that reliability categorization should be based on the lower bound of the 95% CI, not the point estimate. If your sample is small (n < 30), your CI will be wide — report it honestly and discuss its implications.
Decision Framework
How to Select the Right ICC Form: A Decision Framework
Choosing between ICC forms is one of the most frequent sources of error in reliability research. The following decision framework walks you through the correct selection process, following Koo and Li (2016).
Decision Question 1: Are the Same Raters Used for All Subjects?
No → Use ICC(1). Yes → Proceed to Question 2.
Decision Question 2: Generalize to Other Raters, or Only These Specific Raters?
These specific raters only (fixed): Use ICC(3). Generalize to other raters (random): Use ICC(2) — most inter-rater reliability studies.
Decision Question 3: Does Systematic Rater Bias Matter in Practice?
Yes (absolute values drive decisions) → Agreement ICC. No (only rank order matters) → Consistency ICC.
Decision Question 4: Individual Rater or Mean of k Raters?
Individual rater: Use (1) form — e.g., ICC(2,1). Mean of k raters: Use (k) form — e.g., ICC(2,k).
For the vast majority of inter-rater reliability studies in clinical and psychological research, ICC(2,1) absolute agreement is the recommended default. Among the two-way forms it is the most demanding and the most widely generalizable choice, reflecting the actual clinical scenario where different raters produce measurements that drive decisions independently.
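The four decision questions can be encoded as a small function. This is only a sketch of the framework as described here; the function name `choose_icc_form` and its argument names are hypothetical, and the defaults mirror the ICC(2,1) absolute-agreement recommendation.

```r
# Hypothetical helper that encodes the four decision questions.
choose_icc_form <- function(same_raters, generalize = TRUE,
                            bias_matters = TRUE, use_mean = FALSE) {
  # Q4: single rater ("1") or mean of k raters ("k")
  unit <- if (use_mean) "k" else "1"

  # Q1: different raters per subject -> one-way random, agreement only
  if (!same_raters) {
    return(sprintf("ICC(1,%s), one-way random, absolute agreement", unit))
  }

  # Q2: generalize to other raters (random) or only these raters (fixed)
  model <- if (generalize) "2" else "3"
  model_name <- if (generalize) "two-way random" else "two-way mixed"

  # Q3: does systematic rater bias matter?
  type <- if (bias_matters) "absolute agreement" else "consistency"

  sprintf("ICC(%s,%s), %s, %s", model, unit, model_name, type)
}

print(choose_icc_form(same_raters = TRUE))
print(choose_icc_form(same_raters = TRUE, generalize = FALSE,
                      bias_matters = FALSE))
```

Writing the choice down this way also gives you the model justification needed for the results statement.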
Software Implementation
Calculating ICC in R, SPSS, and Excel
Calculating ICC in R: The irr and psych Packages
# ICC using the irr package
install.packages("irr")  # run once
library(irr)
# Rows are subjects, columns are raters
ratings <- data.frame(
  Rater1 = c(6, 7, 4, 8, 5, 3),
  Rater2 = c(6, 8, 4, 7, 5, 4),
  Rater3 = c(7, 7, 5, 8, 6, 3)
)
# ICC(2,1) absolute agreement
icc(ratings, model = "twoway", type = "agreement", unit = "single")
# ICC(2,1) consistency
icc(ratings, model = "twoway", type = "consistency", unit = "single")
# ICC(1,1), one-way random
icc(ratings, model = "oneway", type = "agreement", unit = "single")
# psych package: returns all 6 ICC forms simultaneously
install.packages("psych")  # run once
library(psych)
result <- ICC(ratings)
result$results # ICC, CI, F, p for all 6 forms
Calculating ICC in SPSS
Navigate to Analyze → Scale → Reliability Analysis. Move the rater columns into the Items box. Click Statistics and check “Intraclass correlation coefficient.” Select the Model (One-Way Random for ICC(1), Two-Way Random for ICC(2), Two-Way Mixed for ICC(3)) and the Type (Absolute Agreement or Consistency), and keep the confidence interval at 95%. Click Continue, then OK. SPSS returns single-measures and average-measures ICC values with the 95% CI, F-test, and degrees of freedom.
Calculating ICC in Excel (Manual ANOVA Approach)
Run Data → Data Analysis → Anova: Two-Factor Without Replication. The output table gives SSB (Row SS), SSR (Column SS), and SSE (Error SS). Compute MSB, MSR, MSE, then apply the formula directly:
# ICC(2,1) Absolute Agreement — Excel cell formula:
=(B2-B4)/(B2+(B7-1)*B4+B7*(B3-B4)/B6)
# ICC(2,1) Consistency — Excel cell formula:
=(B2-B4)/(B2+(B7-1)*B4)
# B2=MSB, B3=MSR, B4=MSE, B6=n subjects, B7=k raters
Researchers, Tools & Institutions
Key Figures, Institutions, and Tools in ICC Research
Patrick E. Shrout and Joseph L. Fleiss — Columbia & New York University
Patrick E. Shrout (NYU) and Joseph L. Fleiss (Columbia) published “Intraclass Correlations: Uses in Assessing Rater Reliability” in Psychological Bulletin in 1979. This paper defined the six canonical ICC forms, provided ANOVA-based formulas, and established confidence interval procedures. It remains one of the most cited papers in clinical measurement — with over 30,000 citations as of 2026. The Shrout-Fleiss (1979) framework is the mandatory citation for any paper or assignment reporting an ICC.
Terry K. Koo and Mae Y. Li, New York Chiropractic College
Koo and Li published “A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research” in the Journal of Chiropractic Medicine in 2016. Their paper synthesized the literature, provided practical decision rules for model selection, and established the poor/moderate/good/excellent benchmarks. They explicitly recommended reporting the 95% CI lower bound as the reliability classification criterion — the second mandatory citation for any ICC analysis.
Kenneth O. McGraw and S.P. Wong, University of Mississippi
McGraw and Wong extended the Shrout-Fleiss framework to ten ICC forms in their 1996 Psychological Methods paper, formally introducing the agreement-versus-consistency distinction within two-way models. This is the citation for the agreement/consistency distinction in any ICC report.
The irr and psych Packages in R
The irr package provides icc() with explicit model/type/unit arguments. The psych package, by William Revelle at Northwestern University, provides ICC() returning all six Shrout-Fleiss forms simultaneously with confidence intervals. Both are freely available on CRAN and are the de facto standard for ICC analysis in academic research.
IBM SPSS Statistics
The dominant statistical platform in many clinical and social science programs. Its Reliability Analysis procedure implements the one-way random, two-way random, and two-way mixed ICC models, with agreement and consistency options for the two-way models, via a point-and-click interface. This makes it the most accessible ICC platform for students without programming backgrounds.
Real-World Applications
Where ICC Is Used: Applications Across Research Fields
Clinical and Health Research: Measurement Reliability
In clinical research, ICC validates the reliability of physical measurements — range of motion, blood pressure, pain scale ratings, radiological measurements — before those measurements are used as outcomes in trials or diagnostic criteria. The standard is a reliability substudy of n ≥ 30 patients assessed by two or more raters, with ICC(2,1) absolute agreement reported alongside its 95% CI. Organizations like GRAPPA and OARSI typically require ICC ≥ 0.85 with CI lower bound ≥ 0.70 for clinical trial endpoints.
Psychology and Education: Observational Coding
In developmental psychology and educational research, trained observers code complex behaviors from video recordings. ICC — specifically ICC(2,1) or ICC(3,1) — is the appropriate reliability index for continuous or ordinal coding. Cohen’s kappa is more appropriate for nominal coding (present/absent), while ICC handles ordered or continuous codes. In educational assessment, ICC validates essay scoring rubrics and performance-based evaluations where human judges assign numerical scores.
Multilevel Modeling: Variance Partition Coefficient
In multilevel modeling, ICC quantifies the proportion of total variance attributable to the higher-level grouping variable (school, clinic, country). An ICC of 0.15 in a school-effects model means 15% of student test score variance is between schools — a meaningful amount justifying multilevel modeling rather than standard OLS regression. The performance package in R provides icc() specifically for extracting multilevel ICC from lme4 or brms models.
Test-Retest Reliability in Questionnaire Development
When developing psychological questionnaires, test-retest reliability assesses whether the instrument produces consistent scores when administered to the same subjects 1–4 weeks apart. ICC(2,1) with the two measurement occasions treated as “raters” is the correct approach. A significant paired t-test alongside a high ICC indicates mean scores changed systematically even though relative rankings were stable — a signal of practice or learning effects.
Writing About ICC
How to Write About ICC in Assignments and Research Papers
The Complete ICC Results Statement
Every ICC result should contain six elements: the ICC form, the point estimate, the confidence interval, the sample size, the reliability classification per Koo & Li (2016), and the model justification. Example:
“Inter-rater reliability was assessed using the intraclass correlation coefficient with a two-way random effects model and absolute agreement definition [ICC(2,1)], consistent with Shrout and Fleiss (1979) and Koo and Li (2016). Two trained raters independently scored all 40 participants. ICC(2,1) was 0.87 (95% CI: 0.78–0.93), indicating good to excellent reliability. Based on the lower bound of the 95% CI (0.78), reliability was classified as good.”
⚠️ Common ICC Writing Errors to Avoid
(1) Reporting ICC without specifying which of the six forms was used. (2) Reporting only the point estimate without the 95% confidence interval. (3) Failing to justify the choice of agreement versus consistency. (4) Not citing Shrout & Fleiss (1979) or Koo & Li (2016). (5) Interpreting ICC without discussing CI width and sample size limitations. (6) Using Pearson r when ICC is required. (7) Confusing inter-rater reliability (ICC) with internal consistency reliability (Cronbach’s alpha). Addressing all seven demonstrates the methodological precision that distinguishes excellent from adequate work.
Key Terms & Concepts
Essential Vocabulary for ICC
- Reliability: the consistency or reproducibility of measurements.
- Inter-rater reliability: degree of agreement among different raters assessing the same subjects.
- Intrarater reliability: consistency of ratings by the same rater across occasions.
- Test-retest reliability: stability of measurements across time without genuine change.
- Agreement: whether measurements produce the same absolute values (absolute-agreement ICC).
- Consistency: whether measurements rank subjects in the same order (consistency ICC).
- Between-subject variance (σ²ₛ): variance reflecting genuine differences between subjects; the ICC “signal.”
- Within-subject variance (σ²ₑ): variance reflecting measurement error; the ICC “noise.”
- Rater variance (σ²ᵣ): systematic differences between raters; present only in two-way models.
- Variance partition coefficient (VPC): alternative name for ICC in multilevel modeling; proportion of total variance explained by the clustering variable.
- Spearman-Brown formula: predicts how reliability changes with number of raters, ICC(k) = k×ICC(1) / (1 + (k−1)×ICC(1)).
- Attenuation bias: downward bias in correlations caused by measurement error in predictors.
- Standard Error of Measurement (SEM): reliability expressed in original scale units, SEM = SD_total × √(1 − ICC).
- Generalizability theory: a framework extending ICC to multiple simultaneous error sources, developed by Lee Cronbach and colleagues.
- Intracluster correlation: ICC in cluster-randomized trial design, used to calculate design effect and required sample size.
Frequently Asked Questions
Frequently Asked Questions: Intraclass Correlation Coefficient
What is the intraclass correlation coefficient and what does it measure?
The intraclass correlation coefficient (ICC) is a reliability statistic that measures the degree of agreement or consistency among multiple measurements of the same subjects. It partitions total score variance into between-subject variance (the signal) and within-subject error variance (the noise). ICC = between-subject variance / total variance. Values typically fall between 0 (no reliability) and 1 (perfect reliability), although sample estimates can be negative. It is the gold standard for inter-rater reliability, test-retest reliability, and intrarater reliability across clinical, psychological, and educational research, replacing the previously misused Pearson r for this purpose.
What is the difference between ICC(2,1) and ICC(3,1)?
Both require the same raters to evaluate all subjects, but differ in how raters are treated statistically. ICC(2,1) uses a two-way random effects model — raters are treated as a random sample from a larger population, and the reliability conclusion generalizes to other raters. ICC(3,1) uses a two-way mixed effects model — those specific raters are fixed effects, and the reliability conclusion applies only to those specific raters. If you want to generalize reliability to other clinicians, use ICC(2,1). If reliability is limited to these specific raters, use ICC(3,1). The consistency-form ICC(2,1) and ICC(3,1) are numerically equal; they only differ in the agreement form.
What is a good ICC value and how do I interpret it?
According to Koo and Li (2016): ICC below 0.50 = poor reliability; 0.50–0.75 = moderate; 0.75–0.90 = good; above 0.90 = excellent. Critically, these benchmarks apply to the lower bound of the 95% confidence interval, not just the point estimate. A point estimate of 0.82 with a lower CI bound of 0.48 should be classified as poor-to-good. The appropriate threshold also depends on the application: clinical measurement for individual diagnosis typically requires ICC ≥ 0.90, while research-level measurement may accept ICC ≥ 0.70.
How do I calculate ICC in R?
Use the icc() function from the irr package or ICC() from the psych package. For irr: library(irr); icc(data, model="twoway", type="agreement", unit="single") returns ICC(2,1) absolute agreement with 95% CI. Change model to "oneway" for ICC(1), type to "consistency" for consistency ICC, and unit to "average" for the k-rater average form. For psych: library(psych); ICC(data) returns all six Shrout-Fleiss forms simultaneously with ICC estimates, F-statistics, degrees of freedom, p-values, and confidence intervals in one output table. Both packages are freely available on CRAN.
What is the difference between ICC and Cronbach’s alpha?
Cronbach’s alpha measures internal consistency — whether multiple items on a scale all measure the same construct. ICC measures agreement or consistency among raters or measurement occasions for the same variable. Alpha applies when you have multiple different items measuring the same construct (e.g., a 10-item anxiety scale). ICC applies when you have multiple measurements of the same variable from different raters or occasions (e.g., three physiotherapists measuring the same patient’s range of motion). Using alpha when ICC is required — or vice versa — is a methodological error.
How many subjects do I need for a reliable ICC estimate?
Sample size for ICC studies is determined by the desired confidence interval width. To achieve a 95% CI width of ±0.10 around an expected ICC of 0.70 with two raters, approximately 30–35 subjects are needed (Walter et al., 1998). For a tighter CI of ±0.05, approximately 100 subjects are required. Adding more raters reduces the required number of subjects: with three raters, approximately 20–25 subjects suffice. Use PASS or an R package such as ICC.Sample.Size for formal sample size calculation. Most published reliability studies with fewer than 30 subjects have CIs so wide that reliability classifications are highly uncertain.
What does it mean when ICC is very low even though raters seem to agree?
This is the subject-homogeneity problem. ICC = between-subject variance / total variance. If all subjects have similar true scores (low between-subject variance), the ICC will be low even if rater error is small, because error variance dominates a very small true variance. For example, testing grip strength reliability in a sample of professional arm wrestlers who all have near-maximum grip will produce low ICC — not because raters disagree, but because there is little true between-subject variance to partition. The solution: use a more heterogeneous subject sample representative of where the instrument will be used, or supplement ICC with the Standard Error of Measurement (SEM), which does not depend on sample variance.
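A small numerical illustration of the homogeneity problem, using invented variance components rather than real data: the measurement error is identical in both populations, yet the ICC differs sharply while the SEM does not.

```r
# Same measurement error, two hypothetical populations.
error_var <- 1.0                         # within-subject (error) variance
between_var <- c(heterogeneous = 9.0,    # wide spread of true scores
                 homogeneous   = 0.25)   # e.g. a sample of elite athletes

# ICC as between-subject variance over total variance
icc <- between_var / (between_var + error_var)

# SEM = SD_total * sqrt(1 - ICC), computed separately for each population
sd_total <- sqrt(between_var + error_var)
sem <- sd_total * sqrt(1 - icc)

print(round(icc, 2))
print(round(sem, 2))
```

The ICC falls from 0.90 to 0.20 purely because the subjects are more alike, while the SEM stays at 1 in both cases.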
Can ICC be negative, and what does that mean?
Yes. ICC can be negative when the between-subject variance is estimated to be less than zero — a statistical artifact caused by small samples, a very homogeneous subject population, or extreme rater disagreement. A negative ICC indicates that within-subject variability (measurement error) is larger than between-subject variability, making measurements essentially meaningless for discriminating between subjects. In practice, a negative ICC is truncated to zero and interpreted as zero reliability. It is a signal to investigate rater training, measurement procedures, or whether the subject population has sufficient variance on the construct being measured.
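A negative estimate is easy to produce with a toy data set. The ratings below are invented so that all three subjects have the same mean while the two raters disagree strongly; the one-way ICC(1,1) formula from this guide is then applied in base R.

```r
# Hypothetical ratings: 3 subjects x 2 raters, identical subject means
# but large disagreement between the raters.
ratings <- rbind(c(1, 5),
                 c(5, 1),
                 c(3, 3))
n <- nrow(ratings)
k <- ncol(ratings)

subject_means <- rowMeans(ratings)

# Between-subject and within-subject mean squares (one-way model)
msb <- k * sum((subject_means - mean(ratings))^2) / (n - 1)
msw <- sum((ratings - subject_means)^2) / (n * (k - 1))

icc11 <- (msb - msw) / (msb + (k - 1) * msw)   # ICC(1,1)
icc11_reported <- max(0, icc11)                # truncate at zero for reporting

print(c(raw = icc11, reported = icc11_reported))
```

Here the between-subject mean square is 0, so the raw estimate is −1 and the reported value after truncation is 0.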
What is the Standard Error of Measurement (SEM) and how does it relate to ICC?
SEM = SD_total × √(1 − ICC), where SD_total is the standard deviation of scores across all subjects and occasions. While ICC is a relative reliability index affected by sample heterogeneity, SEM is an absolute index expressing measurement error in original scale units. A SEM of 2.5 mmHg for a blood pressure instrument means a patient's true value likely falls within ±2.5 mmHg of the observed reading about 68% of the time. For clinical decision-making, SEM is often more interpretable than ICC because it describes error in clinically meaningful units, which is particularly valuable when ICC is suppressed by a homogeneous subject sample.
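The SEM formula is simple enough to apply directly. In this sketch the standard deviation, the ICC, and the observed reading are all illustrative values chosen to give a SEM close to the 2.5 mmHg mentioned above.

```r
# Hypothetical inputs: SD of all scores and an ICC estimate.
sd_total <- 7.0   # mmHg, illustrative
icc <- 0.87       # illustrative

sem <- sd_total * sqrt(1 - icc)

# Approximate 68% band (plus or minus 1 SEM) around one observed reading.
observed <- 128
band_68 <- c(lower = observed - sem, upper = observed + sem)

print(round(sem, 2))
print(round(band_68, 1))
```

With these inputs the SEM is about 2.52 mmHg, so the band around an observed reading of 128 runs from roughly 125.5 to 130.5 mmHg.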
