Descriptive Statistics and Data Visualization Using R
Study Guide - Smart Notes
Introduction to R Markdown and Data Analysis
R Markdown is a tool for creating dynamic documents with embedded R code, allowing for reproducible statistical analysis and reporting. This guide introduces key concepts in descriptive statistics and data visualization using R, focusing on summarizing and exploring data sets.
Descriptive Statistics
Measures of Central Tendency
Measures of central tendency describe the center or typical value of a data set.
Mean (Arithmetic Average): The sum of all values divided by the number of observations. Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Example: For the data set {1, 2, 4, 4, 5}, $\bar{x} = (1 + 2 + 4 + 4 + 5)/5 = 16/5 = 3.2$. Note: The mean is sensitive to extreme values (outliers), so it is not robust.
Median: The middle value when data are ordered. If the number of observations is even, the median is the average of the two middle values. Example: For {1, 2, 4, 4, 5}, the median is 4. For {1, 2, 2, 4}, the median is $(2 + 2)/2 = 2$. Note: The median is robust to outliers.
Mode: The most frequently occurring value in a data set. Example: For {1, 2, 2, 4, 5}, the mode is 2.
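The three measures can be checked in R with a minimal sketch like the one below. It reuses the example data sets from this section. Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type), so `stat_mode()` is a hypothetical helper written for illustration only.

```r
# Data set from the mean and median examples
x <- c(1, 2, 4, 4, 5)

mean_x   <- mean(x)    # sum(x) / length(x)
median_x <- median(x)  # middle value of the sorted data

# Hypothetical helper (not part of base R): returns the most frequent value(s)
stat_mode <- function(v) {
  counts <- table(v)                                # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])  # value(s) with the highest count
}

# Data set from the mode example
mode_x <- stat_mode(c(1, 2, 2, 4, 5))

print(c(mean = mean_x, median = median_x, mode = mode_x))
```

If several values tie for the highest frequency, `stat_mode()` returns all of them, which is one reasonable convention among several.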
Measures of Variability (Dispersion)
These measures describe the spread or variability of the data.
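Range: The difference between the maximum and minimum values. Formula: $\text{Range} = x_{\max} - x_{\min}$. Note: The range is not robust to outliers.
Variance: The average of the squared differences from the mean; the sample variance divides by $n - 1$ rather than $n$. Formula: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the mean. Note: Variance is not robust to outliers.
Standard Deviation (SD): The square root of the variance. Formula: $s = \sqrt{s^2}$, where $s^2$ is the variance. Note: SD is not robust to outliers.
Interquartile Range (IQR): The difference between the third quartile (Q3) and the first quartile (Q1). Formula: $\text{IQR} = Q_3 - Q_1$. Note: IQR is robust to outliers.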
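As a rough illustration, the sketch below computes each measure of spread for the data set {1, 2, 4, 4, 5} used earlier. It assumes the sample versions of variance and SD (dividing by n - 1), which is what `var()` and `sd()` compute, and R's default quantile method for `IQR()`.

```r
x <- c(1, 2, 4, 4, 5)

range_x <- max(x) - min(x)  # note: range(x) itself returns c(min, max)
var_x   <- var(x)           # sample variance, divides by n - 1
sd_x    <- sd(x)            # square root of var(x)
iqr_x   <- IQR(x)           # Q3 - Q1 with the default quantile method (type 7)

# Check var() against the definition
n <- length(x)
var_manual <- sum((x - mean(x))^2) / (n - 1)

print(c(range = range_x, variance = var_x, sd = sd_x, IQR = iqr_x))
```

Other software may use a different quartile definition, so IQR values can differ slightly between tools for small data sets.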
Five-Number Summary
The five-number summary provides a quick overview of the distribution of a data set:
Minimum
First Quartile (Q1)
Median (Q2)
Third Quartile (Q3)
Maximum
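A short sketch of how the five values might be obtained in R, again using the data set {1, 2, 4, 4, 5}. `fivenum()` returns Tukey's hinges while `quantile()` and `summary()` use R's default quantile method; the two agree for this data set but can differ slightly for others.

```r
x <- c(1, 2, 4, 4, 5)

five <- fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
q    <- quantile(x)  # 0%, 25%, 50%, 75%, 100% (default type 7)

print(five)
print(q)
print(summary(x))    # the same quantiles plus the mean
```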
Data Visualization
Barplots
Barplots are used to display the frequency of categorical variables.
Example: A barplot showing the number of individuals with and without asthma.
| | No | Yes |
|---|---|---|
| Female | 2496 | 351 |
| Male | 2515 | 184 |
Calculation: $351 / (2496 + 351) \approx 12.3\%$ of females and $184 / (2515 + 184) \approx 6.8\%$ of males have asthma.
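One way to reproduce this calculation and draw the barplot is sketched below. The counts come from the table above; the dimension names `Sex` and `Asthma` and the axis labels are assumptions added for readability.

```r
# Counts from the asthma table (rows: sex, columns: asthma status)
asthma <- as.table(matrix(
  c(2496, 351,
    2515, 184),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("Female", "Male"), Asthma = c("No", "Yes"))
))

# Row percentages: share of each sex with and without asthma
row_pct <- round(100 * prop.table(asthma, margin = 1), 1)
print(row_pct)

# Grouped barplot: one group per sex, one bar per asthma status
barplot(t(asthma), beside = TRUE, legend.text = TRUE,
        xlab = "Sex", ylab = "Number of individuals")
```

Using `margin = 1` makes each row sum to 100%, which is what a within-sex comparison needs.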
Histograms
Histograms are used to display the distribution of a numeric variable by grouping data into bins.
Example: A histogram of the variable "Time until Death" shows the frequency of different time intervals.
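The original "Time until Death" data are not included in these notes, so the sketch below uses simulated right-skewed values as a hypothetical stand-in. The variable name, sample size, distribution, and axis label are all assumptions.

```r
set.seed(1)

# Hypothetical stand-in for "Time until Death": 200 simulated right-skewed values
time_until_death <- rexp(200, rate = 1 / 24)

# breaks = 10 is only a suggestion; R picks "pretty" bin boundaries near that number
h <- hist(time_until_death, breaks = 10,
          main = "Histogram of Time until Death (simulated)",
          xlab = "Time until death")

print(h$breaks)  # bin boundaries
print(h$counts)  # number of observations in each bin
```

Changing `breaks` changes the bin width, which can noticeably alter the apparent shape of the distribution.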
Boxplots
Boxplots (box-and-whisker plots) graphically display the five-number summary and identify outliers.
Outliers: Any value smaller than $Q_1 - 1.5 \times \text{IQR}$ or greater than $Q_3 + 1.5 \times \text{IQR}$ is considered an outlier and is indicated on the boxplot as a circle or asterisk.
Example: A boxplot of "Time until Death" visualizes the spread and potential outliers in the data.
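The 1.5 × IQR rule can be sketched in a few lines. The values below are hypothetical (the "Time until Death" data are not part of these notes), with 40 placed far from the rest so that it is flagged.

```r
# Hypothetical values; 40 is deliberately extreme
x <- c(2, 3, 4, 5, 5, 6, 7, 8, 9, 40)

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)

# The boxplot marks the same point beyond the whiskers
boxplot(x, horizontal = TRUE, main = "Boxplot with one outlier (hypothetical data)")
```

`boxplot()` bases its box on the hinges from `fivenum()` rather than on `quantile()`, so its fences can differ slightly from the ones computed here, although both flag the same point in this example.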
Working with R for Data Analysis
Importing and Summarizing Data
Use read_excel() (from the readxl package) or read.csv() (base R) to import data sets.
Use summary() to obtain the five-number summary plus the mean of a numeric variable.
Use table() to create frequency tables for categorical variables.
Use barplot(), hist(), and boxplot() for graphical displays.
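The workflow above might look like the following sketch. To stay self-contained it writes a tiny hypothetical data set to a temporary CSV file first; the column names `sex`, `asthma`, and `bmi` and all values are made up for illustration.

```r
# Create a small hypothetical CSV file so the example runs anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "sex,asthma,bmi",
  "Female,No,22.5",
  "Female,Yes,31.0",
  "Male,No,27.3",
  "Male,No,19.8",
  "Female,No,24.1",
  "Male,Yes,29.6"
), csv_path)

# Import (for Excel files, readxl::read_excel() plays the same role)
dat <- read.csv(csv_path, stringsAsFactors = TRUE)

# Numeric summary: min, Q1, median, mean, Q3, max
print(summary(dat$bmi))

# Frequency table for two categorical variables
tab <- table(dat$sex, dat$asthma)
print(tab)

# Graphical displays
barplot(table(dat$asthma), main = "Asthma (hypothetical data)")
hist(dat$bmi, main = "BMI (hypothetical data)", xlab = "BMI")

unlink(csv_path)  # remove the temporary file
```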
R Packages for Data Analysis
tidyverse: A collection of R packages for data manipulation and visualization, including dplyr, ggplot2, readr, tibble, and tidyr.
ggplot2: Used for advanced data visualization, such as grouped barplots and customized graphics.
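As an illustration of a grouped barplot in ggplot2, the sketch below reuses the asthma counts from the barplot section. It assumes the ggplot2 package is installed; the data frame layout, column names, and labels are choices made for this example rather than anything prescribed by the source material.

```r
library(ggplot2)

# Asthma counts in "long" format: one row per sex/asthma combination
asthma_df <- data.frame(
  sex    = c("Female", "Female", "Male", "Male"),
  asthma = c("No", "Yes", "No", "Yes"),
  count  = c(2496, 351, 2515, 184)
)

# geom_col() plots the counts as given; position = "dodge" places bars side by side
p <- ggplot(asthma_df, aes(x = sex, y = count, fill = asthma)) +
  geom_col(position = "dodge") +
  labs(x = "Sex", y = "Number of individuals", fill = "Asthma")

print(p)
```

ggplot2 expects one row per bar, which is why the counts are stored in long format instead of as a 2 × 2 table.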
Examples and Applications
Asthma Prevalence: Calculating the percentage of females and males with asthma using frequency tables.
BMI Categories: Visualizing the distribution of BMI categories (Normal weight, Obese, Overweight, Underweight) using a colored barplot.
Time until Death: Using boxplots and histograms to explore the distribution and identify outliers.
Summary Table: Comparison of Measures of Central Tendency and Variability
| Measure | Definition | Robust to Outliers? | Best Use |
|---|---|---|---|
| Mean | Arithmetic average | No | Symmetric distributions without outliers |
| Median | Middle value | Yes | Skewed distributions or with outliers |
| Mode | Most frequent value | Yes | Categorical data |
| Range | Max - Min | No | Quick estimate of spread |
| Standard Deviation | Square root of variance | No | Symmetric distributions |
| IQR | Q3 - Q1 | Yes | Skewed distributions or with outliers |
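The robustness column can be made concrete with a small experiment: add one extreme value to the data set {1, 2, 4, 4, 5} and compare the statistics before and after. The extreme value 100 is a hypothetical choice for illustration.

```r
x     <- c(1, 2, 4, 4, 5)
x_out <- c(x, 100)  # same data plus one hypothetical extreme value

# Small helper that collects the four statistics for a vector
describe <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), IQR = IQR(v))
}

compare <- rbind(original = describe(x), with_outlier = describe(x_out))
print(round(compare, 2))
```

In this sketch the mean and SD shift substantially, the median stays the same, and the IQR changes only a little, which matches the "Robust to Outliers?" column.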
Additional info:
When choosing a measure of central tendency or variability, consider the shape of the distribution and the presence of outliers.
Data visualization is essential for understanding the distribution and identifying patterns or anomalies in the data.
R provides powerful tools for both statistical analysis and graphical representation of data.