Foundations of Statistics: Data Collection, Organization, and Summarization
Introduction to the Practice of Statistics
Definitions and Key Concepts
Statistics is the science of collecting, organizing, analyzing, and interpreting data to make informed decisions. Understanding the foundational terminology is essential for further study in statistics.
Data: Information from observations, counts, measurements, or responses.
Population: The entire group being studied.
Sample: A subset of the population selected for analysis.
Parameter: A numerical description of a population characteristic.
Statistic: A numerical description of a sample characteristic.
Qualitative Variable: A variable that categorizes or describes an element of a population.
Quantitative Variable: A variable that quantifies an element of a population.
Discrete Variable: A quantitative variable with a countable number of values.
Continuous Variable: A quantitative variable that can take any value within an interval, so its possible values are infinite and not countable.
Example: Surveying 100 students about their favorite color (qualitative) or their height (quantitative).
Distinguishing Data Sets
Population Data Set: Includes all outcomes, responses, measurements, or counts of interest.
Sample Data Set: Includes only part of the population.
Example: Measuring the average height of all students in a school (population) vs. a selected class (sample).
Parameters vs. Statistics
Parameter: Describes a characteristic of a population.
Statistic: Describes a characteristic of a sample.
Example: The mean age of all U.S. senators (parameter) vs. the mean age of a sample of 10 senators (statistic).
Qualitative vs. Quantitative Data
Qualitative: Descriptive, non-numeric (e.g., colors, labels).
Quantitative: Numeric, measurable (e.g., height, weight).
Discrete vs. Continuous Variables
Discrete: Countable values (e.g., number of students).
Continuous: Infinite possible values within a range (e.g., height, time).
Observational Studies vs. Designed Experiments
Definitions
Observational Study: Observes outcomes without influencing them.
Experiment: Applies a treatment and observes its effect.
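Confounding: Occurs when the effects of two or more explanatory variables on the response are mixed together and cannot be separated.
Lurking Variable: A variable not accounted for in the study that affects the response variable and may create a misleading association.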
Example: Studying the effect of a new drug (experiment) vs. observing health outcomes in a population (observational study).
Sampling Methods
Simple Random Sample
Every possible sample of size n has an equal chance of being selected, so each member of the population is equally likely to be chosen.
Use random number tables or generators to select samples.
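A simple random sample can be drawn with a random number generator. The sketch below is a minimal illustration using Python's standard `random` module; the student ID list and the `simple_random_sample` helper are hypothetical names introduced only for this example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every subset of size n is equally likely."""
    rng = random.Random(seed)  # the seed only makes the example reproducible
    return rng.sample(population, n)

# Hypothetical sampling frame: student IDs 1 through 500
student_ids = list(range(1, 501))
sample = simple_random_sample(student_ids, 10, seed=42)
print(sample)
```

Sampling is done without replacement, so no student can appear twice in the sample.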
Other Sampling Methods
Stratified Sampling: Divides the population into strata and samples from each.
Systematic Sampling: Selects every k-th individual from a list.
Cluster Sampling: Divides the population into clusters, then randomly selects clusters.
Convenience Sampling: Uses readily available subjects (not recommended for valid inference).
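The difference between stratified and cluster sampling is easiest to see side by side. The sketch below assumes a hypothetical population of students grouped by class year; `stratified_sample` and `cluster_sample` are illustrative helpers, not functions from any library.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample of n_per_stratum from EVERY stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep ALL members of those clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical population: 40 students in each class year
groups = {
    "freshman": [f"F{i}" for i in range(1, 41)],
    "sophomore": [f"S{i}" for i in range(1, 41)],
    "junior": [f"J{i}" for i in range(1, 41)],
    "senior": [f"R{i}" for i in range(1, 41)],
}

strat = stratified_sample(groups, 5, seed=1)  # 5 students from each year
clus = cluster_sample(groups, 2, seed=1)      # every student from 2 years
print(strat)
print(list(clus))
```

Stratified sampling represents every group; cluster sampling trades that coverage for the convenience of surveying only a few complete groups.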
How to Do a Systematic Sample
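Know the population size $N$ and the desired sample size $n$, and compute the interval $k = N/n$, rounded down to a whole number.
Pick a random starting point between 1 and $k$.
Select that individual and every k-th individual after it until $n$ individuals are chosen.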
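The steps above can be sketched in code. This version assumes the interval is taken as k = N // n (rounded down) and that the random start falls among the first k individuals; the function name and the numbered population are hypothetical.

```python
import random

def systematic_sample(population, n, seed=None):
    """Pick a random start among the first k members, then every k-th member."""
    N = len(population)
    k = N // n  # sampling interval, rounded down
    if k < 1:
        raise ValueError("sample size cannot exceed population size")
    rng = random.Random(seed)
    start = rng.randrange(k)  # 0-based index of the random starting point
    return [population[start + i * k] for i in range(n)]

# Hypothetical list of 100 individuals numbered 1..100
people = list(range(1, 101))
chosen = systematic_sample(people, 10, seed=0)
print(chosen)
```

With N = 100 and n = 10 the interval is k = 10, so consecutive selections are always 10 positions apart.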
Bias in Sampling
Types of Bias
Sampling Bias: Sample does not represent the population.
Nonresponse Bias: Selected individuals do not respond.
Response Bias: Survey answers do not reflect true feelings.
Data-Entry Error: Mistakes in recording data.
Funding Bias: Results influenced by the source of funding.
Design of Experiments
Characteristics of an Experiment
Controlled study with treatments applied to experimental units.
May include control groups, blinding, and randomization.
Steps in Conducting an Experiment
Identify the problem.
Determine factors affecting the response variable.
Determine the number of experimental units.
Assign treatments to units (randomized or block design).
Conduct the experiment and collect data.
Randomized and Block Designs
Randomized Design: Units are randomly assigned to treatments.
Block Design: Units are grouped by similarity, then randomized within blocks.
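The two designs differ only in where the randomization happens. The sketch below is one possible way to assign treatments, assuming equal-sized treatment groups; the plot names, block labels, and helper functions are hypothetical.

```python
import random

def completely_randomized(units, treatments, seed=None):
    """Shuffle all units, then deal the treatments out in rotation."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)]
            for i, unit in enumerate(shuffled)}

def randomized_block(blocks, treatments, seed=None):
    """Shuffle and assign separately inside each block of similar units."""
    rng = random.Random(seed)
    assignment = {}
    for units in blocks.values():
        shuffled = list(units)
        rng.shuffle(shuffled)
        for i, unit in enumerate(shuffled):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical experiment: 8 garden plots, fertilizers A and B
plots = [f"plot{i}" for i in range(1, 9)]
design_a = completely_randomized(plots, ["A", "B"], seed=3)

# Same plots grouped by a known source of variation (sun exposure)
blocks = {"sunny": plots[:4], "shady": plots[4:]}
design_b = randomized_block(blocks, ["A", "B"], seed=3)
print(design_a)
print(design_b)
```

In the block design each block receives both treatments equally often, so differences between sunny and shady plots cannot be mistaken for a treatment effect.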
Organizing Qualitative Data
Frequency Distributions
Lists each category and the number of occurrences (frequency).
Relative frequency is the proportion of observations in each category.
Formula: $\text{relative frequency} = \dfrac{\text{frequency}}{\text{sum of all frequencies}}$
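| Color | Frequency | Relative Frequency |
|---|---|---|
| Brown | 12 | 0.2667 |
| Yellow | 10 | 0.2222 |
| Red | 9 | 0.2000 |
| Orange | 6 | 0.1333 |
| Blue | 3 | 0.0667 |

Note: The listed relative frequencies correspond to a total of 45 observations (for example, 12/45 ≈ 0.2667), while the five rows shown account for only 40, so the table appears to be missing a category with frequency 5.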
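A frequency table like this can be built directly from raw observations. The sketch below uses `collections.Counter` on a small hypothetical data set (not the data behind the table above).

```python
from collections import Counter

def frequency_table(observations):
    """Return (category, frequency, relative frequency) rows, most frequent first."""
    counts = Counter(observations)
    total = sum(counts.values())
    return [(category, freq, freq / total)
            for category, freq in counts.most_common()]

# Hypothetical responses to "What is your favorite color?"
responses = ["red", "blue", "red", "green", "red", "blue"]
rows = frequency_table(responses)
for category, freq, rel in rows:
    print(f"{category:<6} {freq:>3} {rel:.4f}")
```

The relative frequencies always sum to 1, which is a quick check that no category was left out.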
Graphs for Qualitative Data
Bar Graph: Rectangles represent frequencies for each category.
Pareto Chart: Bars in decreasing order of frequency.
Pie Chart: Circle divided into sectors proportional to category frequencies.
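The numbers behind a Pareto chart and a pie chart are simple to compute. The sketch below, using hypothetical commuting counts, orders categories for a Pareto chart and converts frequencies to central angles for a pie chart.

```python
def pareto_order(frequencies):
    """Return (category, frequency) pairs sorted from most to least frequent."""
    return sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

def pie_angles(frequencies):
    """Return each category's sector angle in degrees (proportional to frequency)."""
    total = sum(frequencies.values())
    return {category: 360 * freq / total
            for category, freq in frequencies.items()}

# Hypothetical counts of how 20 students commute
counts = {"bus": 5, "car": 12, "bike": 3}
ordered = pareto_order(counts)
angles = pie_angles(counts)
print(ordered)
print(angles)
```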
Organizing Quantitative Data: Popular Displays
Frequency Distributions and Histograms
Frequency Distribution: Table showing classes or intervals and their frequencies.
Histogram: A rectangle is drawn for each class; its height represents the class frequency or relative frequency.
Dot Plot: Dots above a number line for each data point.
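Grouping quantitative data into classes is the step that turns raw values into a histogram. The sketch below assumes equal-width classes of the form [lower limit, upper limit) and uses hypothetical exam scores; it prints a rough text histogram.

```python
def class_frequencies(data, lower, width, n_classes):
    """Count values in classes [lower + i*width, lower + (i+1)*width)."""
    counts = [0] * n_classes
    for x in data:
        index = int((x - lower) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{x} falls outside the classes")
        counts[index] += 1
    return [(lower + i * width, lower + (i + 1) * width, count)
            for i, count in enumerate(counts)]

# Hypothetical exam scores
scores = [62, 67, 71, 74, 75, 78, 81, 83, 88, 95]
table = class_frequencies(scores, lower=60, width=10, n_classes=4)
for low, high, count in table:
    print(f"{low}-{high - 1}: {'#' * count}")
```

Each `#` stands for one observation, so the longest row marks the class with the highest frequency.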
Shapes of Distributions
Uniform: Evenly spread frequencies.
Bell-shaped (Symmetric): Highest frequency in the middle, with frequencies tapering off about equally on both sides.
Skewed Right: Tail on the right.
Skewed Left: Tail on the left.
Additional Displays of Quantitative Data
Stem-and-Leaf Plot
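Splits each value into a stem (the leading digits) and a leaf (the final digit), showing the shape of the distribution while preserving the original data values.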
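A stem-and-leaf plot can be generated by splitting each value into its tens (stem) and units (leaf). The sketch below assumes non-negative whole-number data and uses hypothetical scores; stems with no data are simply not listed.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens and above) to its sorted leaves (units digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical exam scores
plot = stem_and_leaf([62, 67, 71, 74, 75, 88])
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Every original value can be read back from the plot, for example stem 7 with leaf 4 is the score 74.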
Frequency Polygon
Uses points connected by lines to represent class frequencies.
Ogive Graph
Represents cumulative frequency or cumulative relative frequency.
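The values plotted on an ogive are running totals of the class frequencies. A minimal sketch, using hypothetical class frequencies:

```python
from itertools import accumulate

def cumulative_frequencies(frequencies):
    """Return cumulative frequencies and cumulative relative frequencies."""
    cumulative = list(accumulate(frequencies))
    total = cumulative[-1]
    return cumulative, [c / total for c in cumulative]

# Hypothetical frequencies for four consecutive classes
cum, cum_rel = cumulative_frequencies([2, 4, 3, 1])
print(cum)
print(cum_rel)
```

The last cumulative frequency equals the number of observations, and the last cumulative relative frequency is always 1.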
Time Series Graph
Plots data measured at successive points in time.
Graphical Misrepresentations of Data
Guidelines for Good Graphics
Label axes and provide units.
Avoid distortion and misleading scales.
Use appropriate graph types for the data.
Avoid three-dimensional effects that distort perception.
Measures of Central Tendency
Mean
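Population Mean: $\mu = \dfrac{\sum x_i}{N}$, where $N$ is the population size.
Sample Mean: $\bar{x} = \dfrac{\sum x_i}{n}$, where $n$ is the sample size.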
Median
The middle value when data are ordered (the mean of the two middle values when the number of observations is even).
Mode
The value that occurs most frequently.
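The three measures can be computed from their definitions in a few lines. The sketch below uses a small hypothetical data set; `modes` returns a list because a data set can have more than one mode.

```python
from collections import Counter

def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def median(values):
    """Middle ordered value, or the mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """All values that occur with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

data = [2, 3, 3, 5, 7, 10]  # hypothetical observations
print(mean(data), median(data), modes(data))
```

Python's standard `statistics` module offers `mean`, `median`, and `multimode` for the same purpose.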
Measures of Dispersion
Definitions
Range: Max value minus min value.
Variance: Average of squared deviations from the mean (the sample version divides by $n - 1$ instead of $n$).
Standard Deviation: Square root of the variance.
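Population Variance: $\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$
Sample Variance: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard Deviation: $\sigma = \sqrt{\sigma^2}$ (population) or $s = \sqrt{s^2}$ (sample)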
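The population and sample versions differ only in the divisor: N for a population, n - 1 for a sample. The sketch below computes both on the same hypothetical data so the difference is visible.

```python
import math

def population_variance(values):
    """Mean squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, dividing by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5
pop_var = population_variance(data)
samp_var = sample_variance(data)
print(pop_var, math.sqrt(pop_var))    # variance and standard deviation (population)
print(samp_var, math.sqrt(samp_var))  # variance and standard deviation (sample)
```

The squared deviations sum to 32, so the population variance is 32/8 = 4 and the sample variance is 32/7, which is slightly larger. The `statistics` module provides `pvariance`, `variance`, `pstdev`, and `stdev` for the same calculations.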
Empirical Rule
For bell-shaped distributions:
~68% of data within 1 standard deviation of the mean
~95% within 2 standard deviations
~99.7% within 3 standard deviations
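Applying the rule is a matter of adding and subtracting multiples of the standard deviation. The sketch below uses a hypothetical bell-shaped distribution with mean 100 and standard deviation 15.

```python
def empirical_rule_intervals(mean, sd):
    """Intervals expected to hold about 68%, 95%, and 99.7% of bell-shaped data."""
    return {
        "68%": (mean - sd, mean + sd),
        "95%": (mean - 2 * sd, mean + 2 * sd),
        "99.7%": (mean - 3 * sd, mean + 3 * sd),
    }

intervals = empirical_rule_intervals(100, 15)
for share, (low, high) in intervals.items():
    print(f"about {share} of values lie between {low} and {high}")
```

These percentages are approximations that apply only when the distribution is roughly bell-shaped.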
Measures of Position
Z-scores
Represents the number of standard deviations a value is from the mean.
Formula: $z = \dfrac{x - \mu}{\sigma}$ for a population, or $z = \dfrac{x - \bar{x}}{s}$ for a sample.
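A z-score is the value's distance from the mean divided by the standard deviation. A short sketch, using a hypothetical mean of 100 and standard deviation of 15:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

print(z_score(130, 100, 15))  # 2.0: two standard deviations above the mean
print(z_score(85, 100, 15))   # -1.0: one standard deviation below the mean
```

Because z-scores are unitless, they allow values from different distributions to be compared on the same scale.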
Percentiles and Quartiles
Percentile: Value below which a given percentage of observations fall.
Quartiles: Divide data into four equal parts (Q1, Q2, Q3).
Interquartile Range (IQR)
Range of the middle 50% of the data: $\mathrm{IQR} = Q_3 - Q_1$
Outliers
Values outside the fences, below $Q_1 - 1.5 \times \mathrm{IQR}$ or above $Q_3 + 1.5 \times \mathrm{IQR}$, are considered outliers.
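Quartiles, the IQR, and outlier checks can be sketched as follows. Two assumptions are made here: quartiles are found as the medians of the lower and upper halves of the ordered data (textbooks and software use several slightly different conventions), and outliers are flagged with the common 1.5 × IQR fence rule. The data set is hypothetical.

```python
import statistics

def quartiles(data):
    """Q1, median, Q3 using the median-of-halves convention."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the overall median is excluded from both halves
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

def find_outliers(data, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

data = [1, 3, 5, 7, 9, 11, 13, 40]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3, q3 - q1)
print(find_outliers(data))
```

Here Q1 = 4 and Q3 = 12, so IQR = 8 and the fences are -8 and 24; only the value 40 lies outside them.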
The Five-Number Summary and Boxplots
Consists of minimum, Q1, median, Q3, and maximum.
Boxplots visually display the five-number summary and identify outliers.
Example: Drawing a boxplot for exam scores using the five-number summary: 60, 68, 77, 89, 98.
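Before drawing the boxplot for this example, it helps to compute the IQR and the outlier fences from the five-number summary. The sketch below assumes the common 1.5 × IQR fence rule; `boxplot_fences` is a hypothetical helper.

```python
def boxplot_fences(summary, k=1.5):
    """Compute the IQR and fences from (min, Q1, median, Q3, max)."""
    minimum, q1, _median, q3, maximum = summary
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return {
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "low_outliers_possible": minimum < lower_fence,
        "high_outliers_possible": maximum > upper_fence,
    }

exam_summary = (60, 68, 77, 89, 98)  # five-number summary from the example
result = boxplot_fences(exam_summary)
print(result)
```

For the exam scores, IQR = 89 - 68 = 21, so the fences are 36.5 and 120.5. Both the minimum (60) and the maximum (98) fall inside the fences, so the whiskers extend to 60 and 98 and no points are marked as outliers.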