How to Perform Descriptive Statistics in R
Descriptive statistics in R is the foundation of every data analysis workflow — and one of the most frequently tested skills in college-level statistics, data science, and research methods courses. R gives you a powerful, flexible toolkit to summarize, describe, and visualize data far beyond what spreadsheets can handle, and mastering it puts you ahead in academia and industry alike.
This guide walks you through everything you need to perform descriptive statistics in R: from installing packages and loading data, to computing mean, median, mode, variance, standard deviation, IQR, skewness, and kurtosis, all the way to grouped summaries and professional-quality visualizations with ggplot2. Every major step includes working R code you can run immediately.
You’ll find two reference tables, step-by-step walkthroughs, annotated code blocks, and practical tips for avoiding the most common R errors that trip up students. We cover base R functions, the psych package, the dplyr package, and the e1071 package — the four pillars of descriptive analysis in R that every serious analyst should know.
Whether you’re working on a university statistics assignment, a data science project, or your first academic research paper, this guide gives you the code, the concepts, and the context to produce accurate, publication-ready descriptive statistics in R.
Introduction
How to Perform Descriptive Statistics in R: Where Every Analysis Begins
Descriptive statistics in R is the first thing you do with any dataset — before models, before hypothesis tests, before predictions. You describe what you have. What’s the center of the data? How spread out is it? Does it lean left or right? Are there outliers lurking in the tails? R handles all of this with remarkable elegance, and the functions involved are genuinely not hard to learn once you understand what each one measures.
R is the language of choice for statistical computing across universities, research institutions, and data-driven industries in the United States and United Kingdom. The Comprehensive R Archive Network (CRAN), maintained by the R Foundation for Statistical Computing in Vienna, hosts over 20,000 packages — and dozens of them are dedicated to descriptive analysis. You don’t need all of them. You need the right few. This guide will teach you exactly which ones, and how to use them. Statistics assignment help often starts with this exact skill set — building a clean, complete descriptive summary of your data.
- 6: statistics returned by summary() for every numeric variable, the fastest overview in R
- 13: statistics generated by describe() from the psych package in a single line
- 4: core R packages every statistics student must know (base R, psych, dplyr, ggplot2)
What Is Descriptive Statistics?
Descriptive statistics describes the basic features of a dataset using quantitative summaries. It does not make predictions or test hypotheses — that is inferential statistics. Descriptive statistics answers the question: what does this data look like? It condenses potentially thousands of observations into a handful of interpretable numbers and charts. The three pillars are central tendency (where is the middle of the data?), variability (how spread out is it?), and distribution shape (is it symmetric, skewed, or heavy-tailed?). Understanding the difference between descriptive and inferential statistics is essential before running any analysis in R.
In R, descriptive statistics is typically the first section of any analysis script. You run it immediately after loading and cleaning your data. It tells you if your data makes sense — if the ranges are plausible, if there are suspicious spikes, if important variables are heavily skewed in ways that might violate the assumptions of your planned statistical tests. For college students working on research papers or lab reports, understanding your descriptive statistics is not optional: it is a required section of nearly every empirical methods assignment. Finding the right dataset for your statistical project is the natural precursor to running these analyses.
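To make the three pillars concrete, here is a minimal base R sketch using the built-in mtcars dataset (introduced formally later in this guide) — one measure from each pillar:

```r
# The three pillars of descriptive statistics on a single variable
data(mtcars)
x <- mtcars$mpg

# 1. Central tendency: where is the middle?
mean(x)     # 20.09
median(x)   # 19.2

# 2. Variability: how spread out is it?
sd(x)       # 6.03
IQR(x)      # 7.375

# 3. Shape: a mean above the median hints at right skew
mean(x) > median(x)   # TRUE
```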
Why R for Descriptive Statistics?
Excel computes means. SPSS generates tables. Python with pandas can do most of what R does. So why use R specifically? Because R was built by statisticians, for statisticians. Its syntax is tightly aligned with statistical thinking. The psych package, developed by William Revelle at Northwestern University, produces a descriptive table in a single line that would take fifteen Excel formulas to replicate. The ggplot2 package, created by Hadley Wickham (now Chief Scientist at Posit PBC, formerly RStudio), produces publication-quality visualization with a grammar of graphics that makes statistical communication genuinely beautiful. R is also the language most commonly used in peer-reviewed statistics and social science journals — learning it now pays dividends throughout your academic career.
Quick Orientation: If you’re brand new to R, the two most important resources are the official documentation at r-project.org and the free online textbook R for Data Science by Hadley Wickham and Garrett Grolemund. For descriptive statistics specifically, the psych package vignette on CRAN is the most thorough technical reference available.
Setting Up R
Installing R, RStudio, and the Packages You Need
Before performing descriptive statistics in R, your environment needs to be ready. This section covers exactly what to install and why. You need two downloads: R itself (the language engine) and RStudio (the interface that makes working with R human-friendly). Both are free and open source.
1
Download and Install R
Go to CRAN and download the latest version of R for your operating system (Windows, macOS, or Linux). At the time of writing, R 4.4.x is the current stable release; check CRAN for the newest version. Install with all defaults.
2
Download and Install RStudio Desktop
Go to Posit’s website and download RStudio Desktop (free version). RStudio gives you a script editor, a console, an environment pane showing your objects, and a plot viewer — all in one window.
3
Install Required Packages
Open RStudio and run the following in the Console. You only need to do this once per machine. The install.packages() function downloads packages from CRAN.
# Install all packages needed for this guide
install.packages(c(
  "psych",    # comprehensive describe() function
  "dplyr",    # grouped summaries with group_by() + summarise()
  "ggplot2",  # visualization
  "e1071",    # skewness() and kurtosis()
  "skimr",    # clean skim() summary output
  "Hmisc"     # additional describe() variant
))
4
Load Packages at the Start of Every Script
Once installed, you load packages with library() at the top of each new R script. Installing is permanent (once); loading is per-session (every time you open R).
# Load packages at the top of your analysis script
library(psych)
library(dplyr)
library(ggplot2)
library(e1071)
library(skimr)
Loading Your Dataset
R comes with built-in datasets that are perfect for practicing descriptive statistics. The mtcars dataset (Motor Trend Car Road Tests, 1974) and the iris dataset (Edgar Anderson’s iris measurements) are the most commonly used. For your own data, the most common import functions are shown below.
# Use a built-in dataset (no import needed)
data(mtcars)
head(mtcars)   # view first 6 rows
str(mtcars)    # structure: variable types, dimensions
dim(mtcars)    # rows x columns: [1] 32 11

# Import a CSV file from your computer
my_data <- read.csv("path/to/your/file.csv", header = TRUE)

# Import an Excel file (requires readxl package)
library(readxl)
my_data <- read_excel("path/to/your/file.xlsx", sheet = 1)
Tip: Always Inspect Your Data Before Running Descriptive Statistics
Run str(data) to check variable types. Run head(data) and tail(data) to spot obvious data entry errors. Run colSums(is.na(data)) to count missing values per column. Skipping this step and going straight to summary() is one of the most common mistakes beginners make — your statistics will be misleading if the wrong variables are coded as numeric or if unexpected NAs exist. Understanding qualitative vs quantitative data is essential here: descriptive statistics functions only make sense on numeric (quantitative) variables.
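The inspection steps named above can be run as one short base R block (shown here on the built-in mtcars data for illustration):

```r
# Standard pre-analysis inspection
data(mtcars)
str(mtcars)              # variable types and dimensions
head(mtcars)             # first 6 rows
tail(mtcars)             # last 6 rows
colSums(is.na(mtcars))   # missing values per column (all 0 in mtcars)
```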
The summary() Function
The summary() Function: Your First Stop in Descriptive Statistics in R
The fastest way to perform descriptive statistics in R on an entire dataset is a single command: summary(). This base R function requires no packages, no setup beyond loading your data, and returns a structured overview of every variable in your data frame in under a second. For any statistics assignment, this is always the first thing you run.
# summary() on the entire mtcars dataset
summary(mtcars)
mpg cyl disp
Min. :10.40 Min. :4.000 Min. : 71.1
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8
Median :19.20 Median :6.000 Median :196.3
Mean :20.09 Mean :6.188 Mean :230.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0
Max. :33.90 Max. :8.000 Max. :472.0
For each numeric variable, summary() returns six statistics: minimum, first quartile (Q1), median, mean, third quartile (Q3), and maximum. When the mean and median differ substantially, this signals skewness — a key thing to flag in your descriptive analysis write-up. Notice that when mean > median (as in the disp variable above), the distribution is positively skewed. This is the kind of immediate insight summary() gives you.
Running summary() on a Single Variable
You don’t always need the full dataset summary. For a single variable — say, miles per gallon in the mtcars dataset — extract it using the $ operator.
# Summary for a single variable
summary(mtcars$mpg)

# Output:
#  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
# 10.40  15.43  19.20 20.09  22.80  33.90
This is clean and immediately interpretable. The median fuel efficiency is 19.2 mpg while the mean is 20.09 — a modest difference suggesting mild positive skewness. The interquartile range (IQR = 22.80 − 15.43 = 7.37) tells you the spread of the middle 50% of cars in the dataset. When writing up descriptive statistics for an assignment, these are the exact numbers you would report in a results table. For more on how variance and standard deviation complement this picture, see expected values and variance in statistics.
What summary() Does Not Tell You
There is no standard deviation, no skewness, no kurtosis, and no count of non-missing observations in base summary() output. For a college assignment or research paper that requires a complete descriptive statistics table, you need to supplement summary() with additional functions — or use the psych package described in the next section. The official R introduction manual on CRAN provides a detailed reference for the summary function and related base R tools.
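As a base R sketch, here is one way to supplement summary() with the pieces it omits (skewness and kurtosis functions come later in this guide, so this block sticks to what base R provides):

```r
# Fill in what summary() leaves out, using base R only
data(mtcars)
x <- mtcars$mpg

summary(x)       # min, Q1, median, mean, Q3, max
sd(x)            # standard deviation: 6.03
sum(!is.na(x))   # count of non-missing observations: 32
```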
Central Tendency
Measures of Central Tendency in R: Mean, Median, and Mode
Central tendency describes the center of a distribution. It answers: where is the “typical” value in this dataset? In R, you compute the three measures of central tendency — mean, median, and mode — with simple functions. Each tells you something different, and knowing when to use which one is as important as knowing how to compute it.
Mean in R: mean()
The arithmetic mean is the sum of all values divided by the count of values. It is the most widely used measure of central tendency but is sensitive to outliers. In R, mean() computes it directly. The critical argument to remember is na.rm = TRUE, which tells R to ignore missing values (NA) rather than returning NA for the whole result.
# Mean of a single variable
mean(mtcars$mpg)
# [1] 20.09062

# Mean ignoring missing values
mean(mtcars$mpg, na.rm = TRUE)

# Mean for all numeric columns in a data frame
sapply(mtcars, mean)

# Trimmed mean: removes top/bottom 10% before computing
mean(mtcars$mpg, trim = 0.10)
The trimmed mean is worth knowing — it is a compromise between the regular mean and the median, and it is more robust to outliers without completely ignoring magnitude the way the median does. In research papers analyzing income distributions or test scores, a 10% or 20% trimmed mean often better represents the “typical” case than either the raw mean or the median alone.
Median in R: median()
The median is the middle value when data is ordered from smallest to largest. For an even number of observations, R averages the two middle values. The median is the preferred measure of central tendency when your data is skewed or contains outliers — house prices, income data, and reaction times are classic examples where the median is more representative than the mean.
# Median of a single variable
median(mtcars$mpg)
# [1] 19.2

# Median with missing value handling
median(mtcars$mpg, na.rm = TRUE)

# Median for all columns
sapply(mtcars, median)
Notice that the mean of mpg (20.09) is slightly higher than the median (19.2). This difference tells you the distribution is mildly right-skewed — a handful of high-mpg cars (notably the Toyota Corolla at 33.9 mpg and Fiat 128 at 32.4 mpg) are pulling the mean upward. This kind of interpretation is exactly what your professor is looking for in a descriptive statistics write-up.
Mode in R: No Built-in Function — Write Your Own
R’s built-in mode() function returns the storage type of an object (e.g., “numeric”), not the statistical mode. This is a notorious source of confusion for beginners. To find the most frequently occurring value — the statistical mode — you need either a custom function or the DescTools package.
# Custom function for statistical mode
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Apply to cylinder variable (categorical-style numeric)
get_mode(mtcars$cyl)
# [1] 8  (most cars in the dataset have 8 cylinders)

# Using DescTools package (handles ties more explicitly)
install.packages("DescTools")
library(DescTools)
Mode(mtcars$cyl)
# [1] 8
# attr(,"freq")
# [1] 14  (14 out of 32 cars have 8 cylinders)
Common Mistake: Never write mode(mtcars$mpg) expecting the statistical mode. R will return "numeric" — the data type, not the most frequent value. This trips up almost every R beginner. Use the custom get_mode() function above or install DescTools for a proper Mode() function. Avoiding common analytical errors like this is the difference between a solid result and a misleading one.
When to Use Mean vs. Median vs. Mode
Use Mean When…
- Data is roughly normally distributed (symmetric)
- No significant outliers are present
- You need to use the result in further calculations (e.g., variance)
- Examples: height, weight, test scores in large samples
Use Median When…
- Data is skewed (income, house prices, reaction times)
- Outliers are present and you don’t want them to dominate
- Your variable is ordinal rather than truly continuous
- Examples: salary distributions, property values, clinical measurements
The mode is most useful for categorical or discrete data — the most common response category in a survey, the most common number of children per household, or the most common diagnostic code in a clinical dataset. For continuous data, mode is rarely reported in academic descriptive statistics tables. Computing mean, median, and mode in Excel follows similar logic, so if you are transitioning to R from Excel, the conceptual framework is the same — the R syntax is just more powerful and reproducible.
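For categorical-style data, a plain frequency table often communicates more than the mode alone. A minimal base R sketch using the cylinder variable from mtcars:

```r
# Frequency table: counts per category, then the most frequent one
data(mtcars)
freq <- table(mtcars$cyl)
freq
#  4  6  8
# 11  7 14

names(freq)[which.max(freq)]   # the mode as a label: "8"
```

This is usually how the mode is reported for survey or count data: alongside its frequency, not as a bare number.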
Measures of Variability
Measures of Variability in R: Variance, Standard Deviation, Range, and IQR
Knowing where the center of your data is tells you only half the story. Variability tells you how spread out the values are around that center. Two datasets can have identical means and medians but wildly different spreads. In descriptive statistics in R, you report variability using variance, standard deviation, range, and interquartile range — each capturing a different aspect of data spread. Understanding the relationship between expected values and variance deepens your theoretical grasp of why these measures matter.
Variance: var()
Variance measures the average squared deviation from the mean. R’s var() function computes the sample variance (divides by n−1, applying Bessel’s correction), not the population variance (divides by n). Unless you are working with a complete population, always use var() — not a manual n-denominator formula.
# Sample variance
var(mtcars$mpg)
# [1] 36.3241

# Variance for all columns
sapply(mtcars, var)

# Variance with NA handling
var(mtcars$mpg, na.rm = TRUE)
Variance is expressed in squared units — for mpg, that’s squared miles-per-gallon, which is not directly interpretable. That is why standard deviation is usually reported instead. Variance is primarily used as a building block in downstream analyses: ANOVA, regression, and factor analysis all rely on variance decomposition.
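To see Bessel's correction in action, you can reproduce var() by hand; this short base R check confirms the n - 1 denominator:

```r
# Verify that var() divides by n - 1, not n
data(mtcars)
x <- mtcars$mpg
n <- length(x)

manual_sample_var <- sum((x - mean(x))^2) / (n - 1)
all.equal(var(x), manual_sample_var)   # TRUE

# The population formula (divide by n) gives a smaller number
sum((x - mean(x))^2) / n
```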
Standard Deviation: sd()
The standard deviation is the square root of variance — it rescales variance back to the original units of the variable, making it interpretable. A standard deviation of 6.03 mpg (the SD of mpg in mtcars) means that, on average, cars in this dataset deviate from the mean fuel efficiency by about 6 miles per gallon. This is the single most commonly reported measure of variability in academic research papers.
# Standard deviation
sd(mtcars$mpg)
# [1] 6.026948

# Coefficient of Variation (CV): SD as % of mean
# Useful for comparing variability across variables on different scales
cv <- (sd(mtcars$mpg) / mean(mtcars$mpg)) * 100
round(cv, 2)
# [1] 30  (mpg varies by about 30% of its mean)
The Coefficient of Variation (CV) is worth including in your descriptive tables when you need to compare variability across variables measured on different scales. A CV of about 30% for mpg tells you there is moderate relative variability in fuel efficiency: the typical deviation is roughly a third of the mean. For variables like income or reaction time where the scale varies enormously, CV is far more informative than raw standard deviation. The foundational treatment of variance in Lehmann’s statistical estimation theory underpins why sample variance uses n−1 — worth reading if you are in a formal statistics course.
Range: range() and diff(range())
The range is the simplest spread measure: the distance from minimum to maximum. R’s range() returns both the minimum and maximum as a vector; wrapping it in diff() gives you the single range value.
# range() returns c(min, max)
range(mtcars$mpg)
# [1] 10.4 33.9

# diff(range()) returns the single range value
diff(range(mtcars$mpg))
# [1] 23.5

# min() and max() separately
min(mtcars$mpg)   # [1] 10.4
max(mtcars$mpg)   # [1] 33.9
Interquartile Range: IQR()
The IQR (Interquartile Range) measures the spread of the middle 50% of data (Q3 − Q1). It is robust to outliers in a way that range and standard deviation are not. When your data contains extreme values — or when you’re reporting on skewed distributions — the IQR is the preferred measure of spread. It is also the basis for identifying outliers in box plots: any observation more than 1.5 × IQR above Q3 or below Q1 is flagged as a potential outlier.
# IQR
IQR(mtcars$mpg)
# [1] 7.375

# Quartiles using quantile()
quantile(mtcars$mpg)
#    0%   25%   50%   75%  100%
# 10.40 15.43 19.20 22.80 33.90

# Custom percentiles (e.g., 10th, 25th, 75th, 90th)
quantile(mtcars$mpg, probs = c(0.10, 0.25, 0.75, 0.90))
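The 1.5 × IQR outlier rule mentioned above translates directly into code. A minimal base R sketch for mpg:

```r
# 1.5 x IQR fences, as used by box plots to flag outliers
data(mtcars)
x <- mtcars$mpg

q <- quantile(x, c(0.25, 0.75))
fence_low  <- q[[1]] - 1.5 * IQR(x)   # 15.425 - 11.0625
fence_high <- q[[2]] + 1.5 * IQR(x)   # 22.800 + 11.0625

x[x < fence_low | x > fence_high]
# [1] 33.9  (the Toyota Corolla sits just above the upper fence)
```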
Rule of Thumb for Reporting Variability: In most academic assignments, you report standard deviation alongside the mean, and IQR alongside the median. These pairs make sense together: mean + SD assumes a roughly symmetric distribution; median + IQR is robust to skew. Mixing them (e.g., reporting mean + IQR) creates a table that is internally inconsistent and will be flagged by your professor. For more on the theoretical properties that underpin these choices, the classical statistical literature on robust estimation remains the authoritative reference.
The psych Package
The psych Package: describe() and describeBy() for Full Descriptive Tables
If you need a comprehensive descriptive statistics table in a single command — the kind of table that belongs in an academic paper’s methods section — the psych package’s describe() function is the most efficient tool available for descriptive statistics in R. Developed by William Revelle at Northwestern University, psych is the most-downloaded statistics-specific package on CRAN and is widely used across psychology, education, and social sciences research in the United States and United Kingdom.
library(psych)

# Full descriptive table for entire dataset
describe(mtcars)
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.4 33.90 23.50 0.61 -0.37 1.07
cyl 2 32 6.19 1.79 6.00 6.23 2.97 4.0 8.00 4.00 -0.17 -1.76 0.32
disp 3 32 230.72 123.94 196.30 222.52 140.48 71.1 472.00 400.90 0.38 -1.21 21.91
hp 4 32 146.69 68.56 123.00 141.19 77.10 52.0 335.00 283.00 0.73 -0.14 12.12
…
In a single line, describe() returns: n (sample size), mean, sd (standard deviation), median, trimmed (trimmed mean), mad (median absolute deviation), min, max, range, skew, kurtosis, and se (standard error). This is a complete descriptive statistics table. Copy it, format it, and you have the core of a publishable results section. For students producing research papers or lab reports, mastering academic research paper writing includes knowing how to present exactly this kind of table clearly and correctly.
Selecting Specific Variables with describe()
# describe() on a subset of variables
describe(mtcars[, c("mpg", "hp", "wt")])

# Or using dplyr select() for cleaner code
library(dplyr)
mtcars %>%
  select(mpg, hp, wt) %>%
  describe()
Grouped Descriptive Statistics: describeBy()
When your research question involves comparing groups — e.g., how do cars with 4, 6, and 8 cylinders differ in fuel efficiency? — describeBy() generates the full descriptive table separately for each level of a grouping variable. This is one of the most requested features in a statistics assignment and the function that makes psych worth installing for any grouped analysis.
# Descriptive statistics split by number of cylinders
describeBy(mtcars$mpg, group = mtcars$cyl)

# Full dataset split by group
describeBy(mtcars[, c("mpg", "hp", "wt")], group = mtcars$cyl)
Descriptive statistics by group
group: 4
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 11 26.66 4.51 26.0 26.36 6.67 21.4 33.9 12.5 0.26 -1.65 1.36
group: 6
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 7 19.74 1.45 19.7 19.74 1.93 17.8 21.4 3.6 -0.35 -1.46 0.55
group: 8
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 14 15.10 2.56 15.2 15.10 2.52 10.4 19.2 8.8 0.03 -0.79 0.68
The output tells a clear story: as cylinders increase from 4 to 8, mean fuel efficiency drops from 26.7 to 15.1 mpg. The standard deviation is highest for 4-cylinder cars (4.51), suggesting more variability within that group. This kind of grouped descriptive analysis is the precursor to an independent samples t-test or one-way ANOVA — confirming you understand your data before applying inferential tests. See t-test definitions and applications for the natural next step after describing grouped data.
Distribution Shape
Skewness and Kurtosis in R: Measuring Distribution Shape
A complete descriptive analysis in R always addresses distribution shape — not just where the data centers and how spread out it is, but how that spread is structured. Two distributions can share the same mean and standard deviation but have radically different shapes. Skewness captures asymmetry; kurtosis captures tail heaviness. Both are essential for evaluating normality before applying parametric tests like t-tests, ANOVA, or Pearson correlation. Understanding normal distribution, kurtosis, and skewness applications is foundational to this step.
Skewness: e1071 and psych
Skewness quantifies the degree to which a distribution leans left (negative skew) or right (positive skew). A perfectly normal distribution has skewness = 0. As a rule of thumb in the academic literature: skewness between −0.5 and +0.5 is approximately symmetric; between −1 and −0.5 or +0.5 and +1 is moderately skewed; beyond ±1 is highly skewed and warrants attention.
library(e1071)

# Skewness of mpg variable
skewness(mtcars$mpg)
# [1] 0.6106550  (mild positive/right skew)

# skew is also included in psych's describe() output
# For all variables at once:
sapply(mtcars, skewness)

# Interpretation helper
interpret_skew <- function(sk) {
  if (abs(sk) < 0.5) "Approximately symmetric"
  else if (abs(sk) < 1) "Moderately skewed"
  else "Highly skewed — consider transformation"
}
interpret_skew(skewness(mtcars$mpg))
# [1] "Moderately skewed"
Kurtosis: Measuring Tail Heaviness
Kurtosis measures the heaviness of distribution tails relative to a normal distribution. The e1071 package returns excess kurtosis (also called Fisher’s kurtosis), where the normal distribution has excess kurtosis = 0. A value greater than 0 (leptokurtic) means heavier-than-normal tails — more extreme values than expected. A value less than 0 (platykurtic) means lighter tails. The mtcars mpg variable has excess kurtosis of −0.37, indicating slightly lighter tails than normal.
# Kurtosis (excess kurtosis, normal = 0)
kurtosis(mtcars$mpg)
# [1] -0.3718876  (slightly platykurtic)

# All variables
sapply(mtcars, kurtosis)

# Note: psych's describe() also returns skew and kurtosis
# psych uses the same excess kurtosis (Fisher) convention as e1071
Testing for Normality with shapiro.test()
Once you have skewness and kurtosis, the natural next step is a formal normality test. The Shapiro-Wilk test (shapiro.test()) is the most powerful test for normality for small to moderate samples (n < 5,000) and is built into base R — no packages required.
# Shapiro-Wilk normality test
shapiro.test(mtcars$mpg)

# Output:
#   Shapiro-Wilk normality test
# data:  mtcars$mpg
# W = 0.94778, p-value = 0.1229

# p > 0.05: fail to reject normality — mpg is approximately normal

# Test all numeric columns
sapply(mtcars, function(x) shapiro.test(x)$p.value)
A p-value above 0.05 means you cannot reject the null hypothesis of normality — your variable is approximately normally distributed, which supports the use of parametric tests. A p-value below 0.05 means the distribution departs significantly from normality. In that case, consider log-transformation (log(x)), square-root transformation (sqrt(x)), or switching to non-parametric tests.
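As a quick illustration of the transformation advice, take a right-skewed variable such as horsepower (hp) in mtcars and compare the Shapiro-Wilk p-value before and after a log transform:

```r
# Shapiro-Wilk before vs after a log transformation
data(mtcars)
x <- mtcars$hp   # horsepower: right-skewed (skew = 0.73 per describe())

shapiro.test(x)$p.value        # lower p-value: hp departs from normality
shapiro.test(log(x))$p.value   # higher p-value: log(hp) is closer to normal
```

The same comparison works with sqrt(x); report whichever transformation brings the distribution closest to normal, and say so explicitly in your write-up.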
Understanding p-values and significance levels is essential for interpreting the Shapiro-Wilk test correctly — do not simply report the test result without explaining what the p-value means in the context of your analysis decision.
Grouped Statistics with dplyr
Grouped Descriptive Statistics in R Using dplyr
One of the most common analytical tasks in college and research settings is comparing descriptive statistics across groups — by treatment vs. control, by gender, by school year, or by geographic region. The dplyr package from the tidyverse (developed by Posit PBC and the R community) provides the most elegant and readable approach to grouped descriptive statistics in R. Its pipe operator (%>%) makes code read almost like English, making it easier to understand, debug, and explain in a methods section.
library(dplyr)

# Grouped descriptive statistics: mpg by cylinder count
mtcars %>%
  group_by(cyl) %>%
  summarise(
    n = n(),
    mean_mpg = round(mean(mpg), 2),
    sd_mpg = round(sd(mpg), 2),
    median_mpg = median(mpg),
    iqr_mpg = IQR(mpg),
    min_mpg = min(mpg),
    max_mpg = max(mpg)
  )
# A tibble: 3 × 8
cyl n mean_mpg sd_mpg median_mpg iqr_mpg min_mpg max_mpg
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 11 26.7 4.51 26 7.6 21.4 33.9
2 6 7 19.7 1.45 19.7 2.35 17.8 21.4
3 8 14 15.1 2.56 15.2 1.85 10.4 19.2
This output is already close to a publication-ready table. For a college assignment, you would copy these numbers into your results section and use them to support an argument about how engine size relates to fuel efficiency. Notice how variability differs across groups: 6-cylinder cars are remarkably consistent in fuel efficiency (SD = 1.45), while 4-cylinder cars vary far more widely (SD = 4.51), possibly because that group includes everything from small economy cars to sports models.
Adding Multiple Variables to a Group Summary
# Summarise multiple variables by group using across()
mtcars %>%
  group_by(cyl) %>%
  summarise(
    across(c(mpg, hp, wt),
           list(mean = mean, sd = sd),
           .names = "{.col}_{.fn}")
  )
The across() helper function is one of the most useful modern dplyr additions — it applies a list of functions (here, mean and sd) to a list of columns simultaneously, generating neatly named output columns. This is the most efficient way to produce a multi-variable grouped descriptive table for a research paper or lab report. Social statistics coursework frequently requires exactly this kind of grouped summary, and dplyr’s pipe syntax makes the code both readable and reproducible.
Filtering and Summarising Simultaneously
# Descriptive stats for automatic transmission cars only
mtcars %>%
  filter(am == 0) %>%   # am = 0 is automatic
  group_by(cyl) %>%
  summarise(
    n = n(),
    mean_mpg = mean(mpg),
    sd_mpg = sd(mpg)
  )
Additional Packages
skimr and Hmisc: Alternative Approaches to Descriptive Statistics in R
Beyond summary() and psych’s describe(), two additional packages are worth knowing for descriptive statistics in R: skimr and Hmisc. Both produce rich summary output with a single command, and both handle missing data, factor variables, and character variables in ways that summary() alone does not.
skimr: Clean, Readable Summaries
The skimr package (developed by Elin Waring and colleagues at the rOpenSci community) produces a beautifully organized summary using skim(). It separates numeric and character variable summaries, includes a small histogram for each numeric variable, and clearly reports the count of missing values. For quick data exploration, skimr is arguably the most readable single-command overview available in R.
library(skimr)

# Full skim summary
skim(mtcars)

# skim() integrated with dplyr group_by
mtcars %>%
  group_by(cyl) %>%
  skim(mpg, hp)
Hmisc: The describe() Variant with Extended Detail
The Hmisc package (developed by Frank Harrell at Vanderbilt University) provides its own describe() function that returns — for each variable — the number of observations, missing values, unique values, five lowest and five highest values, and frequency counts for categorical variables. It is particularly strong for variables with many categories or for clinical data where the extreme observed values matter as much as the central tendency.
library(Hmisc)

# Note: Hmisc's describe() may mask psych's describe()
# If using both, call explicitly:
Hmisc::describe(mtcars$mpg)

# Output includes: n, missing, unique values, mean, quantiles,
# plus the 5 lowest and 5 highest observed values
Package Masking Warning: Both psych and Hmisc export a function named describe(). If you load both packages, whichever was loaded last will mask the other. To use a specific version, always call it explicitly: psych::describe(data) or Hmisc::describe(data). This is a common source of “unexpected output” bugs in R scripts that use multiple packages. Avoiding misuse of statistical tools includes being deliberate about which functions and packages you are calling.
Handling Missing Data
Handling Missing Values in Descriptive Statistics in R
Real-world datasets have missing values. It’s just how it is. When you perform descriptive statistics in R without handling NAs, most functions return NA for the entire result — a frustrating experience if you don’t know why it is happening and how to fix it. This section shows you the standard approaches.
```r
# Create a vector with missing values
x <- c(12, 15, NA, 18, 22, NA, 9)

# Without na.rm — returns NA
mean(x)
# [1] NA

# With na.rm = TRUE — ignores NAs
mean(x, na.rm = TRUE)
# [1] 15.2
sd(x, na.rm = TRUE)
# [1] 5.069517
median(x, na.rm = TRUE)
# [1] 15

# Count missing values per column in a data frame
colSums(is.na(mtcars))

# Remove all rows with any NA (complete case analysis)
clean_data <- na.omit(mtcars)

# Proportion of missing values per column
colMeans(is.na(mtcars)) * 100  # expressed as percentage
```
Multiple Imputation for Serious Missing Data Problems
For datasets where missing data is substantial (more than 5% of values in any variable), simply removing rows with na.omit() can introduce bias and reduce your effective sample size significantly. The mice package (Multivariate Imputation by Chained Equations) implements multiple imputation — the gold standard method for handling missing data in academic research. While a full treatment is beyond this guide’s scope, understanding that the option exists is important for any serious statistical project. The mice package documentation published in the Journal of Statistical Software by van Buuren and Groothuis-Oudshoorn is the authoritative reference.
```r
# Quick preview of mice for multiple imputation
install.packages("mice")
library(mice)

# Generate 5 imputed datasets
imputed <- mice(my_data_with_NAs, m = 5, method = "pmm")

# Extract one complete dataset
complete_data <- complete(imputed, 1)
```
Visualization with ggplot2
Visualizing Descriptive Statistics in R with ggplot2
Numbers tell you what; charts tell you why it matters. Visualizing your descriptive statistics in R with ggplot2 is not optional in a professional or academic context — it is the standard. The ggplot2 package uses a grammar of graphics framework where you build plots layer by layer: data, aesthetics (what variables map to x and y), geometry (what type of plot), and optional themes and labels. Once you understand this logic, every type of plot follows the same pattern.
Histogram: Visualizing Distribution Shape
```r
library(ggplot2)

# Histogram of mpg with custom bins, color, and theme
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 3, fill = "#2563EB", color = "white", alpha = 0.85) +
  geom_vline(aes(xintercept = mean(mpg)), color = "#AA4646",
             linewidth = 1.2, linetype = "dashed") +
  labs(title = "Distribution of Fuel Efficiency (MPG)",
       subtitle = "Dashed line = mean (20.09 mpg) | mtcars dataset",
       x = "Miles Per Gallon", y = "Count") +
  theme_minimal(base_size = 13)
```
Adding a vertical line at the mean (geom_vline()) on a histogram is standard practice in academic reports — it allows readers to immediately see where the center falls relative to the distribution shape. If the mean line sits noticeably off-center, you have visual evidence of skewness to discuss. Creating professional charts and graphs for assignments is a skill in itself, and ggplot2 is the tool that makes that skill achievable for any R user.
Box Plot: Five-Number Summary Visualized
```r
# Box plot: mpg by cylinder count (grouped comparison)
ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot(alpha = 0.75, outlier.color = "#AA4646", outlier.size = 2.5) +
  scale_fill_manual(values = c("#93c5fd", "#2563EB", "#1a3480")) +
  labs(title = "Fuel Efficiency by Number of Cylinders",
       x = "Cylinders", y = "Miles Per Gallon", fill = "Cylinders") +
  theme_minimal(base_size = 13)
```
The box plot visualizes the five-number summary (min, Q1, median, Q3, max) as a box with whiskers, and flags outliers as individual points. This is the single most information-dense chart for displaying descriptive statistics visually — a professor reviewing a statistics assignment can immediately see the median, the IQR, the range, and any outliers for each group simultaneously.
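The five-number summary behind a box plot can also be computed directly at the console. The sketch below uses base R only; the 1.5 × IQR fences shown are the standard rule ggplot2 uses to flag outlier points:

```r
# Five-number summary (min, lower hinge, median, upper hinge, max)
fivenum(mtcars$mpg)

# Quartile-based version: 0%, 25%, 50%, 75%, 100%
quantile(mtcars$mpg)

# Outlier fences used by geom_boxplot(): Q1 - 1.5*IQR and Q3 + 1.5*IQR
q <- quantile(mtcars$mpg, c(0.25, 0.75))
fences <- c(lower = q[[1]] - 1.5 * IQR(mtcars$mpg),
            upper = q[[2]] + 1.5 * IQR(mtcars$mpg))
fences
```

Any observation beyond the fences is the kind of point the box plot draws individually, so this code is a quick numeric cross-check of what the chart displays.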
Density Plot: Visualizing Distribution Shape Smoothly
```r
# Overlapping density plots by group
ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_density(alpha = 0.45) +
  scale_fill_manual(values = c("#93c5fd", "#2563EB", "#1a3480")) +
  labs(title = "Density of MPG by Cylinder Count",
       x = "Miles Per Gallon", y = "Density", fill = "Cylinders") +
  theme_minimal(base_size = 13)
```
A Complete Descriptive Statistics Visualization Panel
```r
# Combine multiple plots using the patchwork package
install.packages("patchwork")
library(patchwork)

p1 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(fill = "#2563EB", bins = 12) +
  theme_minimal() +
  labs(title = "Histogram")

p2 <- ggplot(mtcars, aes(x = "", y = mpg)) +
  geom_boxplot(fill = "#93c5fd") +
  theme_minimal() +
  labs(title = "Box Plot")

p3 <- ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "#AA4646", alpha = 0.6) +
  theme_minimal() +
  labs(title = "Density")

(p1 | p2 | p3) + plot_annotation(title = "Descriptive Statistics Panel: MPG")
```
Reference Tables
Quick Reference: Essential Functions for Descriptive Statistics in R
The following two tables summarize all the key functions and packages covered in this guide for performing descriptive statistics in R. Use these as a cheat sheet when writing your analysis scripts or preparing your assignment submissions.
| Statistic | R Function | Package | Key Argument | Notes |
|---|---|---|---|---|
| Mean | `mean(x)` | base R | `na.rm = TRUE` | Sensitive to outliers; use trimmed mean for robustness |
| Median | `median(x)` | base R | `na.rm = TRUE` | Preferred for skewed distributions |
| Mode | custom / `Mode(x)` | DescTools | — | base R `mode()` returns storage type, not stat mode |
| Variance | `var(x)` | base R | `na.rm = TRUE` | Sample variance (n−1); squared units |
| Standard Deviation | `sd(x)` | base R | `na.rm = TRUE` | Same units as variable; pair with mean in reports |
| Range | `diff(range(x))` | base R | `na.rm = TRUE` | Affected by outliers; use IQR for robustness |
| IQR | `IQR(x)` | base R | `na.rm = TRUE` | Robust to outliers; pair with median in reports |
| Quantiles | `quantile(x)` | base R | `probs = c(...)` | Default returns 0%, 25%, 50%, 75%, 100% |
| Skewness | `skewness(x)` | e1071 / psych | — | 0 = symmetric; >0 right skew; <0 left skew |
| Kurtosis | `kurtosis(x)` | e1071 / psych | — | Excess kurtosis; 0 = normal; >0 heavy tails |
| Full summary | `summary(data)` | base R | — | Min, Q1, median, mean, Q3, max for all numeric variables |
| Complete descriptive table | `describe(data)` | psych | — | 13 statistics per variable including skew, kurtosis, SE |
| Grouped descriptive | `describeBy(data, group)` | psych | `group =` | Full psych table split by grouping variable |
| Grouped summary | `group_by() %>% summarise()` | dplyr | `across()` | Most flexible approach for custom multi-variable tables |
| Normality test | `shapiro.test(x)` | base R | — | p > 0.05: consistent with normality; requires 3 ≤ n ≤ 5,000 |
| Package | Developed By | Key Functions | Best For | Install Command |
|---|---|---|---|---|
| base R | R Foundation / CRAN | `mean()`, `sd()`, `var()`, `summary()`, `quantile()`, `shapiro.test()` | Core descriptive stats; no install required | Pre-installed |
| psych | William Revelle, Northwestern University | `describe()`, `describeBy()`, `pairs.panels()` | Complete academic descriptive tables; grouped stats | `install.packages("psych")` |
| dplyr | Hadley Wickham, Posit PBC | `group_by()`, `summarise()`, `filter()`, `select()`, `across()` | Flexible grouped summaries; data manipulation | `install.packages("dplyr")` |
| ggplot2 | Hadley Wickham, Posit PBC | `geom_histogram()`, `geom_boxplot()`, `geom_density()` | Publication-quality visualization of distributions | `install.packages("ggplot2")` |
| e1071 | TU Wien / CRAN | `skewness()`, `kurtosis()` | Distribution shape measures for normality assessment | `install.packages("e1071")` |
| skimr | rOpenSci community | `skim()` | Quick, clean exploratory data overview with mini-histograms | `install.packages("skimr")` |
| Hmisc | Frank Harrell, Vanderbilt University | `describe()` | Clinical data; extreme value reporting; many categories | `install.packages("Hmisc")` |
| DescTools | Andri Signorell, CRAN | `Mode()`, `Desc()`, `MeanCI()` | Statistical mode; confidence intervals for descriptives | `install.packages("DescTools")` |
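The first table mentions the trimmed mean as a robust alternative to the ordinary mean. Base R supports it through the trim argument of mean(); in the sketch below, the artificial outlier and the 10% trim level are illustrative choices:

```r
# Append one artificial extreme value to mtcars' mpg column
x <- c(mtcars$mpg, 95)

mean(x)               # pulled upward by the outlier
mean(x, trim = 0.1)   # drops the top and bottom 10% of values before averaging
median(x)             # essentially unaffected by a single extreme value
```

Comparing the three results side by side is a quick way to demonstrate, in an assignment, why robust statistics matter when outliers are present.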
Reporting Your Results
How to Report Descriptive Statistics in R for Academic Assignments
Computing the numbers is only half the task. The other half is presenting them clearly in a way your professor can read, understand, and evaluate. For descriptive statistics in R assignments, reporting conventions follow the norms of your discipline — APA style for psychology and social science, AMA for health sciences, and field-specific conventions for econometrics and education research. Here are the key rules that apply across most academic contexts.
Reporting Central Tendency and Variability
Always report central tendency and variability together. The conventional format in APA style is: M = [value], SD = [value] for approximately normal variables, and Mdn = [value], IQR = [value] for skewed variables. Never report just the mean without its standard deviation, or just the median without the IQR — a center without a spread measure is incomplete and will be marked down in any rigorous assignment. Reporting statistical results transparently is a professional and ethical obligation in academic work, not just a formatting convention.
Example write-up: “Fuel efficiency ranged from 10.4 to 33.9 mpg (M = 20.09, SD = 6.03, Mdn = 19.2, IQR = 7.38). The distribution was moderately positively skewed (skewness = 0.61), with a Shapiro-Wilk test indicating approximate normality, W(32) = 0.95, p = .12.”
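A write-up like this can be generated directly from the data with sprintf(), which avoids transcription errors between the console and the report:

```r
mpg <- mtcars$mpg

# Build the APA-style fragment from live values, never by hand
apa_line <- sprintf("M = %.2f, SD = %.2f, Mdn = %.2f, IQR = %.2f",
                    mean(mpg), sd(mpg), median(mpg), IQR(mpg))
apa_line
```

If the underlying data change, re-running the script regenerates the reported values automatically.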
Presenting a Descriptive Statistics Table
For multi-variable analyses, present a table rather than reporting each variable in prose. A typical descriptive statistics table for a college assignment should include, at minimum: variable name, n, mean, SD, and range. For skewed variables, replace mean + SD with median + IQR. For papers requiring APA format, use professional tables and figures formatting with no vertical lines, minimal horizontal lines, and notes below explaining abbreviations.
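One minimal way to assemble such a table is with base R alone; in this sketch the variable selection and two-decimal rounding are illustrative choices, not fixed conventions:

```r
# Illustrative subset of numeric variables from mtcars
vars <- mtcars[, c("mpg", "hp", "wt")]

# One row per variable: n, mean, SD, and range endpoints
desc_table <- data.frame(
  n    = sapply(vars, function(x) sum(!is.na(x))),
  mean = round(sapply(vars, mean, na.rm = TRUE), 2),
  sd   = round(sapply(vars, sd,   na.rm = TRUE), 2),
  min  = sapply(vars, min, na.rm = TRUE),
  max  = sapply(vars, max, na.rm = TRUE)
)
desc_table
```

The resulting data frame can be exported with write.csv() or rendered as a formatted table in an R Markdown report.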
Mentioning the R Package and Version
In a formal academic paper, you must cite the software and packages you used. The standard citation format for R is: “All analyses were conducted in R (version 4.4.x; R Core Team, 2025). Descriptive statistics were computed using the psych package (Revelle, 2024) and visualizations were produced using ggplot2 (Wickham, 2016).” Use citation("psych") in R to get the exact citation format for any package. Skipping this step is a minor but genuine academic integrity issue in methods sections of research papers. Writing a thorough literature review includes citing all methodological tools and software correctly.
Key Tip: Choose the Right Statistic for Your Distribution Shape
Before reporting any descriptive statistics, run skewness() and shapiro.test(). If skewness is between −0.5 and +0.5 and p > 0.05 on Shapiro-Wilk, report mean + SD. If the variable is skewed or fails the normality test, report median + IQR. This decision tree keeps your descriptive statistics internally consistent and shows your professor that you understand what the numbers mean — not just how to compute them. Choosing the right statistical test for your data follows the same logic: distribution shape determines method, not the other way around.
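That decision tree is easy to encode as a small helper. The function below is a hypothetical convenience wrapper, not part of any package; to stay base-R-only it computes skewness with the common moment formula g1 = m3 / m2^(3/2) rather than calling e1071:

```r
# Hypothetical helper: choose the reporting format from distribution shape
report_center <- function(x) {
  x  <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  sk <- m3 / m2^1.5                     # moment-based skewness (g1)
  p  <- shapiro.test(x)$p.value
  if (abs(sk) <= 0.5 && p > 0.05) {
    sprintf("M = %.2f, SD = %.2f", mean(x), sd(x))        # ~normal: mean + SD
  } else {
    sprintf("Mdn = %.2f, IQR = %.2f", median(x), IQR(x))  # skewed: median + IQR
  }
}

report_center(mtcars$mpg)   # mpg is right-skewed, so the median branch fires
```

Wrapping the rule in a function keeps the choice of statistic consistent across every variable in a report.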
Frequently Asked Questions
Frequently Asked Questions: Descriptive Statistics in R
What is descriptive statistics in R?
Descriptive statistics in R refers to using R programming functions and packages to summarize, organize, and describe the main features of a dataset. It includes computing measures of central tendency (mean, median, mode), measures of variability (variance, standard deviation, IQR), and measures of distribution shape (skewness, kurtosis). R provides built-in functions like mean(), median(), sd(), and summary() alongside specialized packages like psych and DescTools for comprehensive descriptive analysis. It is always the first step in any data analysis workflow, before hypothesis testing or modeling.
How do I calculate mean, median, and mode in R?
In R: mean is computed with mean(x, na.rm = TRUE), median with median(x, na.rm = TRUE). There is no built-in statistical mode function — use a custom function: get_mode <- function(x) { ux <- unique(x); ux[which.max(tabulate(match(x, ux)))] }, or install the DescTools package and use Mode(x). Never use R’s built-in mode() function expecting the statistical mode — it returns the storage type (“numeric”), not the most frequent value. For all three simultaneously across a data frame, psych’s describe() function is the most efficient approach.
What does the summary() function return in R?
The summary() function in R returns a six-number summary for each numeric variable: minimum, first quartile (Q1), median, mean, third quartile (Q3), and maximum. For factor variables, it returns frequency counts per level. For character variables, it returns length, class, and mode. It is the fastest way to get a broad overview of your dataset in one command. Its limitation is that it does not return standard deviation, skewness, kurtosis, or sample size — for those, use psych’s describe() or compute them individually with sd(), skewness(), etc.
What is the difference between var() and sd() in R?
Both var() and sd() measure spread, but at different scales. var() returns the sample variance — the average of squared deviations from the mean — expressed in squared units of the original variable (e.g., squared mpg). sd() is the square root of variance, returning a value in the same units as your variable. Standard deviation is almost always reported in academic papers because it is directly interpretable: an SD of 6 mpg means values typically deviate from the mean by 6 mpg. Variance is used in formulas (ANOVA, regression), but rarely reported alone.
How do I compute grouped descriptive statistics in R?
Two main approaches: (1) dplyr: data %>% group_by(group_var) %>% summarise(mean_x = mean(variable), sd_x = sd(variable), n = n()) — this is the most flexible and readable approach. (2) psych: describeBy(data, group = data$group_var) — this generates the full 13-statistic psych table split by group in one line. For quick multi-variable grouped summaries, dplyr’s across() helper lets you apply a list of functions to multiple columns simultaneously. Both methods handle missing values when na.rm = TRUE is included.
How do I handle NA (missing values) in R descriptive statistics?
The key argument is na.rm = TRUE in most base R functions: mean(x, na.rm = TRUE), sd(x, na.rm = TRUE), median(x, na.rm = TRUE). Without it, functions return NA if any missing value exists in the input. Check for missing values first with colSums(is.na(data)). For complete case analysis, use na.omit(data) to remove all rows containing any NA. For substantial missing data (>5%), use multiple imputation via the mice package rather than simple deletion, which can introduce bias and reduce statistical power.
What packages are best for descriptive statistics in R?
The core four are: psych (for describe() and describeBy() — the most comprehensive descriptive table available), dplyr (for grouped summaries with group_by() + summarise()), ggplot2 (for visualizations), and e1071 (for skewness() and kurtosis()). Additionally: skimr for a clean, readable exploratory summary; Hmisc for clinical/extreme value detail; and DescTools for statistical mode and confidence intervals on descriptive statistics. Base R’s summary() and shapiro.test() require no installation. For most college assignments, psych + dplyr + ggplot2 covers everything needed.
What is skewness and kurtosis in R and why do they matter?
Skewness measures distribution asymmetry: 0 = symmetric; positive = right tail (mean > median); negative = left tail (mean < median). Kurtosis measures tail heaviness: excess kurtosis of 0 matches a normal distribution; above 0 (leptokurtic) means heavier tails with more extreme values; below 0 (platykurtic) means lighter tails. They matter because most parametric tests (t-test, ANOVA, Pearson correlation) assume approximate normality. High skewness or kurtosis flags potential violations that require transformation or non-parametric alternatives. Use skewness() and kurtosis() from e1071 or psych, then confirm with shapiro.test().
How do I visualize descriptive statistics in R with ggplot2?
ggplot2 is the standard for descriptive visualization in R. Use geom_histogram() for distribution shape, geom_boxplot() for five-number summary with outliers, geom_density() for smooth distribution curves, and geom_bar() for categorical frequency distributions. Add geom_vline(aes(xintercept = mean(x))) to overlay mean lines on histograms. For grouped comparisons, use fill = factor(group_variable) inside aes() to create colored groups. The patchwork package arranges multiple ggplot2 plots in a grid for a complete descriptive panel — ideal for assignment figures.
What is the difference between descriptive and inferential statistics in R?
Descriptive statistics summarizes and describes the observed dataset — it tells you the shape, center, and spread of data you have collected. It makes no claims beyond the sample. Inferential statistics uses the sample to make inferences about a larger population — it uses t.test(), chisq.test(), aov(), lm(), and other functions to test hypotheses and estimate population parameters. Descriptive statistics always comes first: you must fully understand your data’s distribution before applying any inferential procedure. Running a t-test on data you have not first described is a methodological error in academic work.
How do I perform descriptive statistics on a data frame with multiple variables in R?
The three most efficient approaches: (1) summary(data) applies the six-number summary to every column simultaneously; (2) psych::describe(data) generates 13 statistics per column in a structured table; (3) sapply(data, function_name) applies any function (mean, sd, skewness) to all columns at once. For a subset of columns: psych::describe(data[, c("var1", "var2")]) or data %>% select(var1, var2) %>% describe(). For named output: sapply(data, function(x) round(c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE)), 2)) produces a clean named matrix.
Should I cite R and its packages in my academic assignment?
Yes — in any formal academic paper or lab report, you must cite your statistical software and packages. For R itself: R Core Team (current year). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. For any package, run citation("package_name") in R to get the exact formatted reference. Example: citation("psych") returns the Revelle (2024) reference for the psych package. Failing to cite your analytical tools is treated the same way as failing to cite any other methodological source — as an incomplete methods section.
