How to Calculate Intraclass Correlation Coefficient (ICC)

Posted by

Byron Otieno

On May 23, 2026

All six ICC forms, ANOVA-based formulas, R & SPSS walkthroughs, and Koo & Li (2016) interpretation benchmarks — everything you need for assignments, dissertations, and research reports.

What Is ICC & Why It Matters

Intraclass Correlation Coefficient (ICC) answers one of the most important questions in empirical research: when two or more raters, instruments, or measurement occasions produce values for the same subject, how much can you trust those measurements? ICC is not a simple correlation. It is a reliability statistic built from variance components — and choosing the wrong form, or calculating it without understanding what it measures, is one of the fastest ways to undermine a study’s credibility.

The ICC has been the standard reliability metric in clinical, psychological, educational, and social research for decades. Its defining feature is that it partitions total score variance into two components: variance that reflects genuine differences between subjects, and variance that reflects measurement error (disagreement between raters, instruments, or occasions). The proportion of total variance explained by between-subject differences is the ICC. A high ICC means the measurements successfully discriminate between subjects — meaning the instrument is reliable. A low ICC means the error variance dominates — meaning you cannot trust that the measurement reflects anything real about the subject.

6

Forms of ICC defined by Shrout & Fleiss (1979) in the landmark Psychological Bulletin paper

0.90

Koo & Li (2016) threshold above which ICC indicates excellent reliability for clinical measurement

10

Total ICC forms identified by McGraw & Wong (1996) in Psychological Methods

What Exactly Is the Intraclass Correlation Coefficient?

The intraclass correlation coefficient is a reliability index that quantifies the degree of agreement or consistency among multiple measurements of the same subjects, where those measurements belong to the same “class”: they share a metric and a variance, so there is no logical way to tell them apart as you would variable X from variable Y. The word “intraclass” distinguishes it from interclass correlations such as the Pearson product-moment correlation, which measures linear association between two logically distinct variables.

ICC operates by modeling total score variance using ANOVA. Scores vary because subjects genuinely differ from each other (this is the signal we want), and because raters or measurement occasions produce inconsistent readings (this is the error we want to minimize). The ICC is, in its most basic form, the ratio of between-subject variance to total variance. This simple idea has remarkable consequences: it means an ICC computed in a very homogeneous subject sample will be artificially low not because the instrument is unreliable, but because there is little true between-subject variance to partition.

Why ICC Replaced Pearson Correlation for Reliability

Before ICC became standard, researchers frequently misused Pearson’s r to assess inter-rater reliability. The problem is that Pearson r is insensitive to systematic rater differences. If one radiologist consistently rates tumor size 20% higher than a colleague, their readings will have a Pearson r of 1.0 — perfect linear association — yet they fundamentally disagree. Agreement-form ICC correctly penalizes this systematic offset, making it far more appropriate whenever you need to establish that different raters or instruments can be used interchangeably.
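
The radiologist scenario can be checked numerically. The sketch below uses made-up tumor sizes (the values and variable names are illustrative assumptions, not data from any study) and base R only; it computes Pearson's r and the two-way absolute-agreement ICC from the ANOVA mean squares.

```r
# Hypothetical tumor sizes (mm): rater B reads exactly 20% higher than rater A.
rater_a <- c(10, 20, 30, 40, 50)
rater_b <- 1.2 * rater_a
x <- cbind(rater_a, rater_b)

n <- nrow(x)      # number of subjects
k <- ncol(x)      # number of raters
grand <- mean(x)  # grand mean of all ratings

# Two-way ANOVA sums of squares (one rating per subject-rater cell)
ss_subjects <- k * sum((rowMeans(x) - grand)^2)
ss_raters   <- n * sum((colMeans(x) - grand)^2)
ss_error    <- sum((x - grand)^2) - ss_subjects - ss_raters

msb <- ss_subjects / (n - 1)
msr <- ss_raters / (k - 1)
mse <- ss_error / ((n - 1) * (k - 1))

pearson_r <- cor(rater_a, rater_b)
icc_agreement <- (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)

print(round(c(pearson_r = pearson_r, icc_agreement = icc_agreement), 3))
```

With these numbers Pearson's r is 1 while the agreement ICC comes out at about 0.93, and the gap widens as the systematic offset grows relative to the spread between subjects.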

The landmark paper establishing the ICC framework was Shrout and Fleiss (1979), published in Psychological Bulletin, which defined the six canonical forms of ICC and provided formulas based on one-way and two-way ANOVA. The most widely cited interpretation guide is Koo and Li (2016) in the Journal of Chiropractic Medicine, which provides the benchmarks (poor, moderate, good, excellent) used in virtually every reliability paper published today.

The core conceptual point about ICC: The ICC measures how much of the total variance in scores is attributable to true differences between subjects rather than to measurement error. A high ICC (close to 1.0) means your instrument reliably distinguishes subjects from each other. A low ICC (close to 0) means measurement error dominates — the scores are mostly noise.

The Six Forms of ICC

The Six ICC Forms: Shrout and Fleiss (1979) Explained

The most consequential decision in any ICC analysis is choosing the correct form. Shrout and Fleiss (1979) defined six forms of the intraclass correlation coefficient organized by two dimensions: the statistical model (one-way random, two-way random, or two-way mixed) and the unit of measurement (single rater vs. mean of k raters). Choosing incorrectly — say, using a two-way formula when a one-way design was used — produces a biased and uninterpretable result.

An additional dimension added by McGraw and Wong (1996) in Psychological Methods, who extended the taxonomy to ten forms, is the choice between agreement and consistency within each two-way model. This is functionally the most important distinction for researchers: agreement ICC asks “do the raters produce the same absolute values?”, while consistency ICC asks “do the raters’ scores move together, apart from a constant offset between raters?”

Model 1: One-Way Random Effects (ICC(1,1) and ICC(1,k))

The one-way random effects model applies when each subject is rated by a different set of raters randomly drawn from a larger population, and when those rater identities are not tracked. There is no way to account for rater-specific effects because different subjects were rated by different people. This is the most conservative ICC form — it produces the lowest ICC estimates — because it cannot separate rater-to-rater variability from other sources of within-subject error.

Model 2: Two-Way Random Effects (ICC(2,1) and ICC(2,k))

The two-way random effects model applies when the same set of raters rates all subjects AND both subjects and raters are considered random samples from larger populations. This is the most commonly appropriate model for inter-rater reliability studies in psychology, education, and clinical research. ICC(2,1) can be calculated in both agreement and consistency forms: the agreement form penalizes systematic mean differences between raters; the consistency form does not.

Model 3: Two-Way Mixed Effects (ICC(3,1) and ICC(3,k))

The two-way mixed effects model requires the same raters to rate all subjects but treats those specific raters as fixed effects — the only raters of interest — rather than as a random sample. This model is appropriate when the same two or three specific instruments are used throughout a study and you do not intend to generalize to other instruments. The correct rule: if you plan to generalize your reliability conclusions to other raters, use ICC(2). If limited to these specific raters, use ICC(3).

ICC Form	Model	Rater Design	Agreement vs. Consistency	Typical Use Case
ICC(1,1)	One-way random	Different raters per subject (random)	Absolute agreement only	Community judges, each rating a different subset
ICC(1,k)	One-way random	Different raters per subject (random)	Absolute agreement only	Same as above but reporting mean of k raters
ICC(2,1)	Two-way random	Same raters for all subjects (random)	Agreement or Consistency	Inter-rater reliability; generalization to new raters intended
ICC(2,k)	Two-way random	Same raters for all subjects (random)	Agreement or Consistency	Same; reporting mean across k raters
ICC(3,1)	Two-way mixed	Same specific raters for all subjects (fixed)	Consistency primarily	Intrarater reliability; no generalization beyond these raters
ICC(3,k)	Two-way mixed	Same specific raters for all subjects (fixed)	Consistency primarily	Same; reporting mean across k fixed raters

Formulas & Step-by-Step Calculation

ICC Formulas: ANOVA-Based Calculation from First Principles

Calculating the intraclass correlation coefficient by hand requires running a one-way or two-way ANOVA and extracting the Mean Square components. Understanding why the formula uses these particular components reveals what the ICC actually measures and why it behaves the way it does in small samples.

Step 1: Organize Your Data

Your data matrix should have n rows (subjects or targets) and k columns (raters or measurement occasions). Every cell contains the rating assigned to subject i by rater j. For ICC to be meaningful, the same construct must be measured by all raters using the same scale.

Step 2: Run the ANOVA and Extract Mean Squares

MSB = SSB / (n − 1)
MSW = SSW / (n(k − 1))
MSR = SSR / (k − 1)
MSE = SSE / ((n − 1)(k − 1))

Where n = number of subjects, k = number of raters, SSB = sum of squares between subjects, SSW = sum of squares within subjects, SSR = sum of squares between raters, SSE = sum of squares error (interaction).
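
As a rough illustration of Step 2, the mean squares can be computed directly in base R without an ANOVA routine. The helper name mean_squares is an assumption made for this sketch; the ratings are the six-subject, three-rater data used in the R walkthrough later in this guide.

```r
# Rows = subjects, columns = raters
ratings <- cbind(
  Rater1 = c(6, 7, 4, 8, 5, 3),
  Rater2 = c(6, 8, 4, 7, 5, 4),
  Rater3 = c(7, 7, 5, 8, 6, 3)
)

# Hypothetical helper: returns MSB, MSW, MSR, MSE for a complete n x k matrix
mean_squares <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- ncol(x)
  grand <- mean(x)
  ssb <- k * sum((rowMeans(x) - grand)^2)  # between subjects
  ssr <- n * sum((colMeans(x) - grand)^2)  # between raters
  sst <- sum((x - grand)^2)                # total
  ssw <- sst - ssb                         # within subjects
  sse <- ssw - ssr                         # residual (error)
  c(MSB = ssb / (n - 1),
    MSW = ssw / (n * (k - 1)),
    MSR = ssr / (k - 1),
    MSE = sse / ((n - 1) * (k - 1)))
}

ms <- mean_squares(ratings)
print(round(ms, 3))
```

For these ratings the sketch gives MSB ≈ 8.722, MSW ≈ 0.333, MSR ≈ 0.389, and MSE ≈ 0.322. It assumes a complete matrix with no missing cells.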

Step 3: Apply the ICC Formula for Your Model

ICC(1,1) = (MSB − MSW) / (MSB + (k−1)·MSW)

ICC(1,1): One-way random, single rater, absolute agreement.

ICC(2,1) Agreement = (MSB − MSE) / (MSB + (k−1)·MSE + k·(MSR − MSE)/n)

ICC(2,1) Absolute Agreement: Two-way random, single rater. Penalizes systematic rater differences via MSR. Most stringent and most commonly appropriate for clinical research.

ICC(2,1) Consistency = (MSB − MSE) / (MSB + (k−1)·MSE)

ICC(2,1) Consistency: Does not penalize systematic rater mean differences. Numerically equals ICC(3,1) consistency.

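ICC(average of k) = k·ICC(single) / (1 + (k−1)·ICC(single))

Spearman-Brown formula for the average-measures forms ICC(1,k), ICC(2,k), and ICC(3,k): the average-measures ICC is derived from the single-measure ICC of the same model, analogous to the Spearman-Brown prophecy formula in classical test theory. Here ICC(single) means the single-measure coefficient, i.e. ICC(1,1), ICC(2,1), or ICC(3,1); it does not refer to the one-way model.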
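
The Step 3 formulas translate almost line for line into small R functions. This is a minimal sketch: the function names are assumptions, and the mean squares passed in are made-up illustrative values for n = 6 subjects and k = 3 raters, chosen so the arithmetic is easy to follow.

```r
# Single-measure ICC formulas, written as functions of the ANOVA mean squares
icc_oneway_single <- function(msb, msw, k) {
  (msb - msw) / (msb + (k - 1) * msw)
}
icc_twoway_agreement <- function(msb, msr, mse, n, k) {
  (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)
}
icc_twoway_consistency <- function(msb, mse, k) {
  (msb - mse) / (msb + (k - 1) * mse)
}
# Spearman-Brown step-up from a single-measure ICC to the mean of k raters
spearman_brown <- function(icc_single, k) {
  k * icc_single / (1 + (k - 1) * icc_single)
}

# Illustrative (made-up) mean squares
n <- 6; k <- 3
msb <- 10; msw <- 1; msr <- 2; mse <- 0.8

icc11     <- icc_oneway_single(msb, msw, k)            # 0.75
icc21_agr <- icc_twoway_agreement(msb, msr, mse, n, k) # about 0.754
icc21_con <- icc_twoway_consistency(msb, mse, k)       # about 0.793
icc1k     <- spearman_brown(icc11, k)                  # 0.90

print(round(c(icc11, icc21_agr, icc21_con, icc1k), 3))
```

In this example the consistency form is higher than the agreement form because MSR exceeds MSE, that is, the raters differ systematically in their means.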

Step 4: A Worked Numerical Example

Consider a study where n = 6 subjects are each rated by k = 3 raters on a pain scale (0–10). The two-way ANOVA yields: MSB = 14.28, MSR = 0.48, MSE = 0.72. We compute ICC(2,1) absolute agreement:

ICC(2,1) = (14.28 − 0.72) / (14.28 + (3−1)·0.72 + 3·(0.48 − 0.72)/6)

= 13.56 / (14.28 + 1.44 − 0.12) = 13.56 / 15.60 = 0.869

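ICC(2,1) = 0.869, a point estimate in the good range per Koo & Li (2016). With only six subjects the 95% CI is wide: the Shrout and Fleiss (1979) F-based interval for these mean squares runs from roughly 0.59 to 0.98, so the lower bound falls in the moderate range. This is why the interval, and not only the point estimate, must be reported.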
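
The worked example, including its confidence interval, can be reproduced from the three mean squares alone. The sketch below applies the F-based interval for ICC(2,1) given by Shrout and Fleiss (1979), using base R's qf(); variable names are assumptions, and the interval is an approximation that relies on the usual ANOVA normality assumptions.

```r
# Mean squares from the worked example: n = 6 subjects, k = 3 raters
n <- 6; k <- 3
msb <- 14.28; msr <- 0.48; mse <- 0.72
alpha <- 0.05

# Point estimate: ICC(2,1), absolute agreement
icc <- (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)

# Approximate degrees of freedom (Satterthwaite-type) for the interval
f_j <- msr / mse
num <- (k - 1) * (n - 1) *
  (k * icc * f_j + n * (1 + (k - 1) * icc) - k * icc)^2
den <- (n - 1) * k^2 * icc^2 * f_j^2 +
  (n * (1 + (k - 1) * icc) - k * icc)^2
v <- num / den

f_lo <- qf(1 - alpha / 2, n - 1, v)  # F quantile used for the lower bound
f_up <- qf(1 - alpha / 2, v, n - 1)  # F quantile used for the upper bound

lower <- n * (msb - f_lo * mse) /
  (f_lo * (k * msr + (k * n - k - n) * mse) + n * msb)
upper <- n * (f_up * msb - mse) /
  (k * msr + (k * n - k - n) * mse + n * f_up * msb)

print(round(c(icc = icc, lower = lower, upper = upper), 3))
```

The interval is wide for a sample this small, with a lower bound well below the point estimate, which is the practical reason for classifying reliability from the interval.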

Why Sample Size Critically Affects ICC Confidence Intervals

With only 6 subjects, even an ICC point estimate of 0.87 comes with a wide confidence interval that can span more than one reliability category. Precision improves slowly with sample size. Under the usual large-sample approximation for the variance of a one-way ICC, a 95% CI of about ±0.10 around an expected ICC of 0.70 needs on the order of 100 subjects with two raters and roughly 65–70 with three; with 30 subjects and two raters the interval is closer to ±0.19. Walter et al. (1998) tabulate sample sizes for reliability studies from a hypothesis-testing perspective. Whichever method you use, settle the sample size at the design stage, not after data collection.
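
For planning, a quick precision-based calculation can be sketched in a few lines. The function below uses the large-sample approximation published by Bonett (2002) for the one-way ICC, which is an assumption brought in for this example and not a method described in this guide; it can give noticeably larger numbers than informal rules of thumb. The function name and arguments are hypothetical.

```r
# Approximate subjects needed so that a confidence interval for the ICC
# has the requested half-width (large-sample approximation, one-way ICC)
n_for_icc_width <- function(icc, k, half_width, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)  # normal quantile for the confidence level
  w <- 2 * half_width             # full interval width
  ceiling(8 * z^2 * (1 - icc)^2 * (1 + (k - 1) * icc)^2 /
            (k * (k - 1) * w^2) + 1)
}

n_two   <- n_for_icc_width(icc = 0.70, k = 2, half_width = 0.10)
n_three <- n_for_icc_width(icc = 0.70, k = 3, half_width = 0.10)

print(c(two_raters = n_two, three_raters = n_three))
```

Under this approximation the planning values are 101 subjects for two raters and 68 for three. Treat them as a starting point and confirm with dedicated sample size software before committing to a design.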

Interpreting ICC Values

Interpreting the ICC: Koo and Li (2016) Benchmarks

Once you have calculated the ICC, interpretation requires more than looking up a benchmark bin. The numerical value means different things depending on the ICC form used, the homogeneity of your subject sample, and whether you are reporting a point estimate or a confidence interval.

The Koo and Li (2016) Benchmarks

Koo & Li (2016) ICC Interpretation Benchmarks (lower bound of 95% CI):

ICC < 0.50: Poor reliability
ICC 0.50–0.75: Moderate reliability
ICC 0.75–0.90: Good reliability
ICC > 0.90: Excellent reliability

These thresholds apply specifically to the lower bound of the 95% CI, not just the point estimate. A point estimate of 0.85 (good reliability) with a lower CI bound of 0.48 (poor reliability) should be classified as poor-to-good.
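
The benchmark lookup is easy to automate. In this sketch the function name classify_icc is an assumption, and values exactly on a boundary (0.50, 0.75, 0.90) are assigned to the higher category, which is a choice made here because the published bins do not say how to treat ties.

```r
# Map ICC values (ideally the lower bound of the 95% CI) to the
# Koo & Li (2016) categories. Boundary values go to the higher category.
classify_icc <- function(value) {
  labels <- c("poor", "moderate", "good", "excellent")
  labels[findInterval(value, c(0.50, 0.75, 0.90)) + 1]
}

point_estimate <- 0.85
lower_bound    <- 0.48

by_point <- classify_icc(point_estimate)  # "good"
by_lower <- classify_icc(lower_bound)     # "poor"

print(c(point_estimate = by_point, lower_bound = by_lower))
```

Applied to the example in the text, the point estimate lands in the good bin while the lower bound lands in the poor bin, which is why the two should be reported together.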

Agreement vs. Consistency: Which ICC to Report?

Report Agreement ICC When:

Different raters will be used interchangeably in practice
Absolute score values matter for clinical or practical decisions
Establishing that one measurement tool can replace another
Systematic rater bias would affect interpretation or action

Report Consistency ICC When:

The same fixed rater(s) will always be used in practice
Only the relative ordering of subjects matters
A systematic offset between raters can be calibrated out
Computing intra-rater reliability across time

⚠️ Reporting ICC Point Estimates Without Confidence Intervals Is No Longer Acceptable: Many clinical journals now expect 95% CIs for all ICC reports. Koo and Li (2016) base the reliability categorization on the 95% CI of the estimate, not on the point estimate alone; classifying by the lower bound is the conservative reading used in this guide. If your sample is small (n < 30), your CI will be wide. Report it honestly and discuss its implications.

Decision Framework

How to Select the Right ICC Form: A Decision Framework

Choosing between ICC forms is one of the most frequent sources of error in reliability research. The following decision framework walks you through the correct selection process, following Koo and Li (2016).

Decision Question 1: Are the Same Raters Used for All Subjects?

No → Use ICC(1). Yes → Proceed to Question 2.

Decision Question 2: Generalize to Other Raters, or Only These Specific Raters?

These specific raters only (fixed): Use ICC(3). Generalize to other raters (random): Use ICC(2) — most inter-rater reliability studies.

Decision Question 3: Does Systematic Rater Bias Matter in Practice?

Yes (absolute values drive decisions) → Agreement ICC. No (only rank order matters) → Consistency ICC.

Decision Question 4: Individual Rater or Mean of k Raters?

Individual rater: Use (1) form — e.g., ICC(2,1). Mean of k raters: Use (k) form — e.g., ICC(2,k).

For the vast majority of inter-rater reliability studies in clinical and psychological research, ICC(2,1) absolute agreement is the recommended default. Among the two-way forms it is the most demanding and the most widely generalizable choice, because it treats raters as a sample from a larger population and penalizes systematic rater differences. That matches the actual clinical scenario, where different raters produce measurements that drive decisions independently.
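
The four decision questions can be written down as a small function, which is a handy way to document the choice in an analysis script. The function name and argument names are hypothetical; the logic simply mirrors the questions above and does not replace the judgment behind each answer.

```r
# Each argument is the TRUE/FALSE answer to one decision question
choose_icc_form <- function(same_raters, generalize, bias_matters, use_mean) {
  unit <- if (use_mean) "k" else "1"
  if (!same_raters) {
    # Question 1: different raters per subject means the one-way model
    return(sprintf("ICC(1,%s), one-way random, absolute agreement", unit))
  }
  model   <- if (generalize) "2" else "3"                            # Question 2
  effects <- if (generalize) "two-way random" else "two-way mixed"
  defn    <- if (bias_matters) "absolute agreement" else "consistency"  # Question 3
  sprintf("ICC(%s,%s), %s, %s", model, unit, effects, defn)
}

default_choice <- choose_icc_form(same_raters = TRUE, generalize = TRUE,
                                  bias_matters = TRUE, use_mean = FALSE)
print(default_choice)
```

For the common inter-rater case the sketch returns "ICC(2,1), two-way random, absolute agreement".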

Software Implementation

Calculating ICC in R, SPSS, and Excel

Calculating ICC in R: The irr and psych Packages

        # ── ICC using the irr package ──
        install.packages("irr")   # one-time install
        library(irr)

        # Rows = subjects, columns = raters
        ratings <- data.frame(
          Rater1 = c(6, 7, 4, 8, 5, 3),
          Rater2 = c(6, 8, 4, 7, 5, 4),
          Rater3 = c(7, 7, 5, 8, 6, 3)
        )

        # ICC(2,1) absolute agreement
        icc(ratings, model = "twoway", type = "agreement", unit = "single")

        # ICC(2,1) consistency
        icc(ratings, model = "twoway", type = "consistency", unit = "single")

        # ICC(1,1): one-way random
        icc(ratings, model = "oneway", type = "agreement", unit = "single")

        # ── psych package: returns all 6 ICC forms simultaneously ──
        install.packages("psych") # one-time install
        library(psych)
        result <- ICC(ratings)
        result$results  # ICC, CI, F, p for all 6 forms

Calculating ICC in SPSS

Navigate to Analyze → Scale → Reliability Analysis. Move rater columns into the Items box. Click Statistics → check “Intraclass correlation coefficient.” Select Model (One-Way Random for ICC(1), Two-Way Random for ICC(2), Two-Way Mixed for ICC(3)), Type (Absolute Agreement or Consistency; available for the two-way models), and the confidence level (95% by default). Click Continue, then OK. SPSS returns single-measures and average-measures ICC values, each with its 95% CI and an F-test with degrees of freedom.

Calculating ICC in Excel (Manual ANOVA Approach)

Run Data → Data Analysis → Anova: Two-Factor Without Replication. The output table gives SSB (Row SS), SSR (Column SS), and SSE (Error SS). Compute MSB, MSR, MSE, then apply the formula directly:

        Cell layout assumed: B2 = MSB, B3 = MSR, B4 = MSE, B6 = n subjects, B7 = k raters

        ICC(2,1) Absolute Agreement, Excel cell formula:
        =(B2-B4)/(B2+(B7-1)*B4+B7*(B3-B4)/B6)

        ICC(2,1) Consistency, Excel cell formula:
        =(B2-B4)/(B2+(B7-1)*B4)

Researchers, Tools & Institutions

Key Figures, Institutions, and Tools in ICC Research

Patrick E. Shrout and Joseph L. Fleiss — Columbia & New York University

Patrick E. Shrout (NYU) and Joseph L. Fleiss (Columbia) published “Intraclass Correlations: Uses in Assessing Rater Reliability” in Psychological Bulletin in 1979. This paper defined the six canonical ICC forms, provided ANOVA-based formulas, and established confidence interval procedures. It remains one of the most cited papers in clinical measurement, with over 30,000 citations as of 2026. The Shrout-Fleiss (1979) framework is the standard citation for any paper or assignment reporting an ICC.

Terry K. Koo and Mae Y. Li — New York Chiropractic College

Koo and Li published “A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research” in the Journal of Chiropractic Medicine in 2016. Their paper synthesized the literature, provided practical decision rules for model selection, and set out the poor/moderate/good/excellent benchmarks. They recommended interpreting reliability from the 95% CI of the ICC estimate and not from the point estimate alone, which makes it the second standard citation for any ICC analysis.

Kenneth O. McGraw and S.P. Wong — University of Mississippi

McGraw and Wong extended the Shrout-Fleiss framework to ten ICC forms in their 1996 Psychological Methods paper, formally introducing the agreement-versus-consistency distinction within two-way models. This is the citation for the agreement/consistency distinction in any ICC report.

The irr and psych Packages in R

The irr package provides icc() with explicit model/type/unit arguments. The psych package, by William Revelle at Northwestern University, provides ICC() returning all six Shrout-Fleiss forms simultaneously with confidence intervals. Both are freely available on CRAN and are the de facto standard for ICC analysis in academic research.

IBM SPSS Statistics

The dominant statistical platform in many clinical and social science programs. Its Reliability Analysis procedure implements the one-way random, two-way random, and two-way mixed models, with agreement and consistency options for the two-way models, via a point-and-click interface. That makes it the most accessible ICC platform for students without programming backgrounds.

Real-World Applications

Where ICC Is Used: Applications Across Research Fields

Clinical and Health Research: Measurement Reliability

In clinical research, ICC validates the reliability of physical measurements (range of motion, blood pressure, pain scale ratings, radiological measurements) before those measurements are used as outcomes in trials or diagnostic criteria. A common design is a reliability substudy of n ≥ 30 patients assessed by two or more raters, with ICC(2,1) absolute agreement reported alongside its 95% CI. Outcome-measure groups such as GRAPPA and OARSI treat demonstrated reliability as a prerequisite for clinical trial endpoints; check the current guidance of the relevant body for any numeric threshold instead of assuming a fixed cutoff.

Psychology and Education: Observational Coding

In developmental psychology and educational research, trained observers code complex behaviors from video recordings. ICC — specifically ICC(2,1) or ICC(3,1) — is the appropriate reliability index for continuous or ordinal coding. Cohen’s kappa is more appropriate for nominal coding (present/absent), while ICC handles ordered or continuous codes. In educational assessment, ICC validates essay scoring rubrics and performance-based evaluations where human judges assign numerical scores.

Multilevel Modeling: Variance Partition Coefficient

In multilevel modeling, ICC quantifies the proportion of total variance attributable to the higher-level grouping variable (school, clinic, country). An ICC of 0.15 in a school-effects model means 15% of student test score variance is between schools — a meaningful amount justifying multilevel modeling rather than standard OLS regression. The performance package in R provides icc() specifically for extracting multilevel ICC from lme4 or brms models.
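
The variance partition idea can be illustrated without fitting a mixed model. The sketch below uses a tiny made-up balanced data set (four schools, three students each; all values are hypothetical) and the one-way ANOVA estimator in base R. A real analysis would normally fit a multilevel model instead.

```r
# Hypothetical test scores: 4 schools, 3 students each (balanced design)
scores <- c(52, 55, 58,  61, 64, 60,  70, 68, 72,  57, 59, 55)
school <- factor(rep(c("A", "B", "C", "D"), each = 3))
m <- 3  # students per school

by_school <- split(scores, school)
msb <- m * var(sapply(by_school, mean))  # between-school mean square
msw <- mean(sapply(by_school, var))      # pooled within-school mean square

# ANOVA estimates of the variance components (between truncated at zero)
var_between <- max((msb - msw) / m, 0)
var_within  <- msw

vpc <- var_between / (var_between + var_within)
print(round(vpc, 3))

# The 15% example from the text, expressed with variance components
vpc_text_example <- 0.15 / (0.15 + 0.85)
```

In this toy data set the schools differ far more than students within a school, so the estimate is much higher than the 0.15 used as the example in the text.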

Test-Retest Reliability in Questionnaire Development

When developing psychological questionnaires, test-retest reliability assesses whether the instrument produces consistent scores when administered to the same subjects 1–4 weeks apart. The two measurement occasions are treated as the “raters”, and an absolute-agreement ICC is the usual choice; Koo and Li (2016) recommend the two-way mixed-effects model for test-retest designs because the occasions are not a random sample. A significant paired t-test alongside a high consistency ICC indicates that mean scores changed systematically even though relative rankings were stable, a signal of practice or learning effects.

Writing About ICC

How to Write About ICC in Assignments and Research Papers

The Complete ICC Results Statement

Every ICC result should contain six elements: the ICC form, the point estimate, the confidence interval, the sample size, the reliability classification per Koo & Li (2016), and the model justification. Example:

“Inter-rater reliability was assessed using the intraclass correlation coefficient with a two-way random effects model and absolute agreement definition [ICC(2,1)], consistent with Shrout and Fleiss (1979) and Koo and Li (2016). Two trained raters independently scored all 40 participants. ICC(2,1) was 0.87 (95% CI: 0.78–0.93), indicating good to excellent reliability. Based on the lower bound of the 95% CI (0.78), reliability was classified as good.”

⚠️ Common ICC Writing Errors to Avoid

(1) Reporting ICC without specifying which of the six forms was used. (2) Reporting only the point estimate without the 95% confidence interval. (3) Failing to justify the choice of agreement versus consistency. (4) Not citing Shrout & Fleiss (1979) or Koo & Li (2016). (5) Interpreting ICC without discussing CI width and sample size limitations. (6) Using Pearson r when ICC is required. (7) Confusing inter-rater reliability (ICC) with internal consistency reliability (Cronbach’s alpha). Addressing all seven demonstrates the methodological precision that distinguishes excellent from adequate work.

Key Terms & Concepts

Essential Vocabulary for ICC

Reliability: the consistency or reproducibility of measurements.
Inter-rater reliability: degree of agreement among different raters assessing the same subjects.
Intrarater reliability: consistency of ratings by the same rater across occasions.
Test-retest reliability: stability of measurements across time without genuine change.
Agreement: whether measurements produce the same absolute values (absolute-agreement ICC).
Consistency: whether measurements rank subjects in the same order (consistency ICC).
Between-subject variance (σ²ₛ): variance reflecting genuine differences between subjects; the ICC “signal.”
Within-subject variance (σ²ₑ): variance reflecting measurement error; the ICC “noise.”
Rater variance (σ²ᵣ): systematic differences between raters; present only in two-way models.
Variance partition coefficient (VPC): alternative name for ICC in multilevel modeling; proportion of total variance explained by the clustering variable.
Spearman-Brown formula: predicts how reliability changes with number of raters: ICC(average of k) = k×ICC(single) / (1 + (k−1)×ICC(single)).
Attenuation bias: downward bias in correlations caused by measurement error in predictors.
Standard Error of Measurement (SEM): reliability expressed in original scale units: SEM = SD_total × √(1 − ICC).
Generalizability theory: a framework extending ICC to multiple simultaneous error sources, developed by Lee Cronbach and colleagues.
Intracluster correlation: ICC in cluster-randomized trial design, used to calculate design effect and required sample size.

Frequently Asked Questions

Frequently Asked Questions: Intraclass Correlation Coefficient

What is the intraclass correlation coefficient and what does it measure?

The intraclass correlation coefficient (ICC) is a reliability statistic that measures the degree of agreement or consistency among multiple measurements of the same subjects. It partitions total score variance into between-subject variance (the signal) and within-subject error variance (the noise). ICC = between-subject variance / total variance. Values usually fall between 0 (no reliability) and 1 (perfect reliability), although sample estimates can be negative. It is the standard index for inter-rater reliability, test-retest reliability, and intrarater reliability across clinical, psychological, and educational research, replacing the previously misused Pearson r for this purpose.

What is the difference between ICC(2,1) and ICC(3,1)?

Both require the same raters to evaluate all subjects, but differ in how raters are treated statistically. ICC(2,1) uses a two-way random effects model: raters are treated as a random sample from a larger population, and the reliability conclusion generalizes to other raters. ICC(3,1) uses a two-way mixed effects model: those specific raters are fixed effects, and the reliability conclusion applies only to them. If you want to generalize reliability to other clinicians, use ICC(2,1). If reliability is limited to these specific raters, use ICC(3,1). For a given definition (agreement or consistency), the two models share the same computational formula, so the point estimates coincide and only the interpretation differs. The numerical gap people usually see arises because Shrout-Fleiss ICC(2,1) is an absolute-agreement coefficient while Shrout-Fleiss ICC(3,1) is a consistency coefficient.

What is a good ICC value and how do I interpret it?

According to Koo and Li (2016): ICC below 0.50 = poor reliability; 0.50–0.75 = moderate; 0.75–0.90 = good; above 0.90 = excellent. Critically, these benchmarks apply to the lower bound of the 95% confidence interval, not just the point estimate. A point estimate of 0.82 with a lower CI bound of 0.48 should be classified as poor-to-good. The appropriate threshold also depends on the application: clinical measurement for individual diagnosis typically requires ICC ≥ 0.90, while research-level measurement may accept ICC ≥ 0.70.

How do I calculate ICC in R?

Use the icc() function from the irr package or ICC() from the psych package. For irr: library(irr); icc(data, model="twoway", type="agreement", unit="single") returns ICC(2,1) absolute agreement with 95% CI. Change model to "oneway" for ICC(1), type to "consistency" for consistency ICC, and unit to "average" for the k-rater average form. For psych: library(psych); ICC(data) returns all six Shrout-Fleiss forms simultaneously with ICC estimates, F-statistics, degrees of freedom, p-values, and confidence intervals in one output table. Both packages are freely available on CRAN.

What is the difference between ICC and Cronbach’s alpha?

Cronbach’s alpha measures internal consistency — whether multiple items on a scale all measure the same construct. ICC measures agreement or consistency among raters or measurement occasions for the same variable. Alpha applies when you have multiple different items measuring the same construct (e.g., a 10-item anxiety scale). ICC applies when you have multiple measurements of the same variable from different raters or occasions (e.g., three physiotherapists measuring the same patient’s range of motion). Using alpha when ICC is required — or vice versa — is a methodological error.

How many subjects do I need for a reliable ICC estimate?

Sample size for ICC studies is determined by the desired confidence interval width. Under the usual large-sample approximation, a 95% CI of about ±0.10 around an expected ICC of 0.70 needs on the order of 100 subjects with two raters, and a tighter CI of ±0.05 needs roughly four times as many. Adding raters reduces the required number of subjects: with three raters, roughly 65–70 subjects give ±0.10. Walter et al. (1998) tabulate sample sizes from a hypothesis-testing perspective. Use PASS or an R package such as ICC.Sample.Size for formal sample size calculation. Most published reliability studies with fewer than 30 subjects have CIs so wide that reliability classifications are highly uncertain.

What does it mean when ICC is very low even though raters seem to agree?

This is the subject-homogeneity problem. ICC = between-subject variance / total variance. If all subjects have similar true scores (low between-subject variance), the ICC will be low even if rater error is small, because even a small error variance is large relative to a very small true variance. For example, testing grip strength reliability in a sample of professional arm wrestlers who all have near-maximum grip will produce a low ICC. The raters do not disagree; there is simply little true between-subject variance to partition. The solution: use a more heterogeneous subject sample representative of where the instrument will be used, or supplement ICC with the Standard Error of Measurement (SEM), which is far less sensitive to how heterogeneous the sample is.
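
A short simulation makes the homogeneity effect concrete. In this sketch the measurement error is identical in both scenarios and only the spread of true scores changes; the function names, sample size, and standard deviations are illustrative assumptions, and the one-way ICC(1,1) is used for simplicity.

```r
set.seed(1)  # fixed seed so the run is reproducible

# One-way ICC(1,1) from a complete subjects-by-raters matrix
icc_oneway <- function(x) {
  k <- ncol(x)
  msb <- k * var(rowMeans(x))    # between-subject mean square
  msw <- mean(apply(x, 1, var))  # within-subject mean square
  (msb - msw) / (msb + (k - 1) * msw)
}

# Simulate n subjects rated by k raters with the same error SD
simulate_icc <- function(true_sd, error_sd = 1, n = 200, k = 2) {
  true_score <- rnorm(n, mean = 50, sd = true_sd)
  errors <- matrix(rnorm(n * k, sd = error_sd), nrow = n)
  icc_oneway(true_score + errors)  # true score is added to every rater column
}

icc_heterogeneous <- simulate_icc(true_sd = 5)    # wide range of true scores
icc_homogeneous   <- simulate_icc(true_sd = 0.5)  # nearly identical subjects

print(round(c(heterogeneous = icc_heterogeneous,
              homogeneous = icc_homogeneous), 2))
```

The rater error is the same in both runs, yet the homogeneous sample yields a much lower ICC, which is the pattern described above.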

Can ICC be negative, and what does that mean?

Yes. The ICC estimate is negative when the between-subject variance is estimated to be less than zero, a statistical artifact caused by small samples, a very homogeneous subject population, or extreme rater disagreement. A negative ICC indicates that within-subject variability (measurement error) is larger than between-subject variability, making measurements essentially meaningless for discriminating between subjects. In practice, a negative ICC is usually reported as zero and interpreted as zero reliability. It is a signal to investigate rater training, measurement procedures, or whether the subject population has sufficient variance on the construct being measured.
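
A tiny made-up data set shows how a negative estimate arises. Here every subject has the same mean score, so all of the variation is within subjects; the numbers are hypothetical and the one-way ICC(1,1) formula is used.

```r
# Three subjects, two raters: the raters disagree strongly, and the
# subject means are all equal to 3
x <- rbind(c(1, 5),
           c(5, 1),
           c(3, 3))
k <- ncol(x)

msb <- k * var(rowMeans(x))    # 0, because all subject means are equal
msw <- mean(apply(x, 1, var))  # (8 + 8 + 0) / 3

icc_raw      <- (msb - msw) / (msb + (k - 1) * msw)  # -1
icc_reported <- max(icc_raw, 0)                      # truncated to 0

print(c(raw = icc_raw, reported = icc_reported))
```

The raw estimate is −1, the lowest value possible with two raters, and the truncated value reported in practice is 0.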

What is the Standard Error of Measurement (SEM) and how does it relate to ICC?

SEM = SD_total × √(1 − ICC), where SD_total is the standard deviation of scores across all subjects and occasions. While ICC is a relative reliability index affected by sample heterogeneity, SEM is an absolute index expressing measurement error in original scale units. A SEM of 2.5 mmHg for a blood pressure instrument means that an observed reading falls within ±2.5 mmHg of the true value about 68% of the time, assuming normally distributed errors. For clinical decision-making, SEM is often more interpretable than ICC because it describes error in clinically meaningful units, which is particularly valuable when ICC is suppressed by a homogeneous subject sample.
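
The SEM formula is a one-liner in R. The inputs below (an SD of 10 mmHg, an ICC of 0.9375, and an observed reading of 120 mmHg) are made-up values chosen so that the result matches the 2.5 mmHg example; the function name is an assumption.

```r
# Standard Error of Measurement from the total SD and the ICC
sem_from_icc <- function(sd_total, icc) {
  sd_total * sqrt(1 - icc)
}

sem <- sem_from_icc(sd_total = 10, icc = 0.9375)  # 10 * sqrt(0.0625) = 2.5

# Approximate 68% band around a hypothetical observed reading
observed <- 120
band_68 <- observed + c(-1, 1) * sem              # 117.5 to 122.5

print(sem)
print(band_68)
```

Because the SEM is expressed in the units of the scale, the band can be read directly against a clinical threshold.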
