Comprehensive Study Notes for Statistics & Probability: Principles, Probability, Random Variables, Distributions, and Bivariate Analysis

Statistics, Data & Statistical Thinking

Introduction to Statistics

Statistics is the science of collecting, analyzing, interpreting, and presenting data. It provides tools and methods to understand the real world through data.

Statistics (discipline): Reasoning, tools, and methods for analyzing data.
Statistics (plural): Results of calculations made with data.
Data: Any collection of numbers, characters, images, or other items that provide information about something.

Statistical Methods

Descriptive Statistics: Collecting, presenting, and characterizing data.
Inferential Statistics: Making decisions or predictions about a population based on sample data.

Types of Data

Quantitative Data: Measured by numbers, often with units (e.g., weight, salary).
Qualitative (Categorical) Data: Describes groups or categories (e.g., hair color, campus).
Ordinal Variables: Categorical variables with a meaningful order but no natural units (e.g., survey ranks).
Identifier Variables: Unique identifiers for cases (e.g., student ID).

Sampling Techniques

Simple Random Sampling: Every sample of size n has an equal chance of selection.
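Systematic Sampling: Selecting every kth unit from a list, starting from a randomly chosen unit.
Cluster Sampling: Dividing the population into clusters, randomly selecting clusters, and sampling all units within them.
Stratified Sampling: Dividing the population into subgroups (strata) and drawing a random sample from each stratum.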
Convenience Sampling: Choosing samples that are easy to select.
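
The sampling schemes above can be sketched in a few lines of Python. This is a minimal illustration, not a survey-grade implementation: the population of unit IDs 1 to 100, the two strata, and the function names are all assumptions made up for the example.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Draw n units so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, k, seed=0):
    """Pick a random start among the first k units, then take every kth unit."""
    rng = random.Random(seed)
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, seed=0):
    """Draw a simple random sample of fixed size from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum)
            for name, units in strata.items()}

population = list(range(1, 101))  # hypothetical unit IDs 1..100
srs = simple_random_sample(population, 10)
sys_sample = systematic_sample(population, 10)
strata = {"low": population[:50], "high": population[50:]}  # hypothetical strata
strat = stratified_sample(strata, 5)
```

The seed is fixed only so the sketch is reproducible; a real study would not reuse the same seed for every draw.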

Describing Data with Tables and Graphs

Describing Qualitative Data

Frequency Table: Lists categories and counts.
Pie Chart: Shows breakdown of total quantity into categories.
Bar Graph: One bar per category of a qualitative variable; bar height shows frequency or percentage.
Pareto Diagram: Bar graph with categories ordered from highest to lowest frequency.

Describing Quantitative Data

Stem-and-Leaf Display: Splits data into stems and leaves for visualization.
Histogram: Partitions data into bins; bars show frequency or relative frequency.

Numerical Measures of Central Tendency

Mean (x̄): Arithmetic average; sensitive to outliers.
Median: Middle value; resistant to outliers.
Mode: Most frequent value; can be used for quantitative or qualitative data.

Numerical Measures of Variability

Range: Difference between largest and smallest values.
Variance (s²): Sum of squared deviations from the mean divided by n - 1; roughly the average squared deviation.
Standard Deviation (s): Square root of variance; measures spread.
Coefficient of Variation: Ratio of standard deviation to mean.
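
As a rough illustration of the measures of center and variability above, the following sketch uses Python's standard `statistics` module on a small made-up sample. The data values are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

mean = statistics.mean(data)          # arithmetic average
median = statistics.median(data)      # middle value of the sorted data
mode = statistics.mode(data)          # most frequent value
data_range = max(data) - min(data)    # largest minus smallest
variance = statistics.variance(data)  # sample variance, divides by n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance
cv = std_dev / mean                   # coefficient of variation

# Replace the largest value with an outlier: the mean shifts, the median does not.
with_outlier = data[:-1] + [110]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)
```

For this sample the mean is 6, the median is 5, the mode is 4, the range is 9, and the sample variance is 10. After the outlier is introduced the median is still 5 while the mean rises above 20, which is what "sensitive" and "resistant" refer to.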

Shape of a Distribution

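Symmetric: Mean ≈ Median.
Right-Skewed: Typically Mean > Median (the long right tail pulls the mean up).
Left-Skewed: Typically Mean < Median (the long left tail pulls the mean down).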

Five-Number Summary & Boxplot

Minimum, Q1, Median, Q3, Maximum.
Boxplots display these five values and help reveal outliers.

Measures of Relative Standing

z-Score: Number of standard deviations a value is from the mean.
Percentiles: Value below which a given percentage of data falls.
Quartiles: Q1 (25th), Q2 (Median, 50th), Q3 (75th).
Interquartile Range (IQR): Q3 - Q1; spread of middle 50%.
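
The measures of relative standing above can be computed as in the sketch below. The sample is hypothetical, and the z-score is taken as (value - mean) / standard deviation. Quartile conventions differ between textbooks and software, so the values returned by `statistics.quantiles` (default "exclusive" method) may not match a hand calculation that uses a different rule.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

mean = statistics.mean(data)
s = statistics.stdev(data)

# z-score: how many standard deviations each value lies from the mean
z_scores = [(x - mean) / s for x in data]

# Quartiles split the sorted data into four parts (method-dependent)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # spread of the middle 50%

five_number_summary = (min(data), q1, q2, q3, max(data))
```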

Probability

Basic Principles

Probability quantifies uncertainty and is foundational to statistical inference.

Random Phenomenon: Situation with uncertain outcomes.
Trial: Single observation of a random phenomenon.
Outcome: Result of a trial.
Sample Space: Set of all possible outcomes.
Event: A set of outcomes (a subset of the sample space).

Probability Rules

Rule 1: 0 ≤ P(A) ≤ 1
Rule 2: P(Sample Space) = 1
Rule 3 (Complement): P(A^c) = 1 - P(A), where A^c is the event that A does not occur
Rule 4 (Addition): For disjoint events, P(A or B) = P(A) + P(B)
Rule 5 (Multiplication): For independent events, P(A and B) = P(A) × P(B)

General Addition Rule

For any two events A and B: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
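
The general addition rule, P(A or B) = P(A) + P(B) - P(A and B), can be checked by direct counting when all outcomes are equally likely. The sketch below assumes one roll of a fair die; the events A and B are chosen only for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die
A = {2, 4, 6}                      # event: the roll is even
B = {4, 5, 6}                      # event: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

p_union_direct = prob(A | B)                    # count outcomes in A or B
p_union_rule = prob(A) + prob(B) - prob(A & B)  # general addition rule
p_complement = 1 - prob(A)                      # complement rule
```

Both routes give 2/3. Subtracting P(A and B) corrects for the outcomes 4 and 6, which would otherwise be counted twice.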

Conditional Probability

Probability of B given that A has occurred: $P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$, provided $P(A) > 0$.

Independence

Events A and B are independent if $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$.
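
A small sketch of conditional probability and an independence check, again assuming one roll of a fair die. Conditional probability is computed as P(A and B) / P(A), and independence is tested with the multiplication rule P(A and B) = P(A) × P(B). The events are hypothetical examples.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die
A = {2, 4, 6}                      # the roll is even
B = {4, 5, 6}                      # the roll is greater than 3
C = {1, 2}                         # the roll is at most 2

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

def cond_prob(b, a):
    """P(B | A) = P(A and B) / P(A)."""
    return prob(a & b) / prob(a)

def independent(a, b):
    """Check the multiplication rule P(A and B) = P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

p_b_given_a = cond_prob(B, A)  # 2/3, which differs from P(B) = 1/2
p_c_given_a = cond_prob(C, A)  # 1/3, which equals P(C)
```

Here A and B are dependent because knowing the roll is even changes the probability of B, while A and C are independent.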

Tree Diagrams

Tree diagrams visually represent sequences of events and their probabilities.

Probability tree diagram for binge drinking and accidents

Bayes' Rule

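Used to reverse the direction of conditioning, obtaining $P(A \mid B)$ from $P(B \mid A)$: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)}$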
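
A worked sketch of Bayes' Rule, written as P(A | B) = P(B | A) P(A) / [P(B | A) P(A) + P(B | not A) P(not A)]. All numbers below are invented for illustration: a condition with 1% prevalence, a test that flags 95% of true cases, and a 5% false-positive rate.

```python
from fractions import Fraction

# Hypothetical inputs, not taken from any real study
p_a = Fraction(1, 100)              # P(A): prior probability of the condition
p_b_given_a = Fraction(95, 100)     # P(B | A): positive test given the condition
p_b_given_not_a = Fraction(5, 100)  # P(B | not A): false-positive rate

def bayes(prior, likelihood, false_positive_rate):
    """Return P(A | B) from P(A), P(B | A), and P(B | not A)."""
    numerator = likelihood * prior
    denominator = numerator + false_positive_rate * (1 - prior)
    return numerator / denominator

p_a_given_b = bayes(p_a, p_b_given_a, p_b_given_not_a)  # 19/118, about 0.161
```

Even with an accurate test, the reversed probability is only about 16% under these assumptions, because the condition is rare to begin with.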

Describing Data Numerically

Contingency Tables

Contingency tables organize counts for combinations of two categorical variables.

Rows: Variable 1
Columns: Variable 2
Marginal Distribution: Totals for each variable
Conditional Distribution: Distribution of one variable given a category of the other
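
Marginal and conditional distributions can be read off a table of counts as in the sketch below. The 2 × 2 table, its labels, and its counts are hypothetical placeholders, not the data from the notes.

```python
# Hypothetical counts: rows are categories of variable 1, columns of variable 2
table = {
    "Row A": {"Col 1": 30, "Col 2": 20},
    "Row B": {"Col 1": 10, "Col 2": 40},
}

grand_total = sum(sum(cols.values()) for cols in table.values())
col_names = list(next(iter(table.values())))

# Marginal distributions: row and column totals as proportions of the grand total
row_marginal = {r: sum(cols.values()) / grand_total for r, cols in table.items()}
col_marginal = {c: sum(table[r][c] for r in table) / grand_total for c in col_names}

# Conditional distribution of the column variable within each row
conditional_given_row = {
    r: {c: count / sum(cols.values()) for c, count in cols.items()}
    for r, cols in table.items()
}
```

In this made-up table the conditional distributions differ sharply between rows (60% versus 20% in "Col 1"), which suggests the two variables are associated.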

Contingency table: Goals by Sex

Discrete Random Variables & Probability Distributions

Random Variables

Random Variable: A variable whose numerical value is determined by the outcome of a random experiment.
Discrete Random Variable: Takes countable values.
Continuous Random Variable: Takes any value within a range.

Probability Model

Collection of all possible values and their probabilities.
Requirements: $0 \le P(X = x) \le 1$ for all x, and $\sum_x P(X = x) = 1$

Expected Value and Standard Deviation

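Expected Value: $E(X) = \mu = \sum_x x \, P(X = x)$
Variance: $\mathrm{Var}(X) = \sigma^2 = \sum_x (x - \mu)^2 \, P(X = x)$
Standard Deviation: $\sigma = \sqrt{\mathrm{Var}(X)}$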
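
For a discrete random variable, the expected value is the probability-weighted sum of its values, the variance is the probability-weighted sum of squared deviations from that expected value, and the standard deviation is the square root of the variance. A minimal sketch with a made-up probability model:

```python
import math

# Hypothetical probability model: value -> probability
model = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

# A valid model has probabilities between 0 and 1 that sum to 1
is_valid = (all(0 <= p <= 1 for p in model.values())
            and math.isclose(sum(model.values()), 1.0))

expected_value = sum(x * p for x, p in model.items())
variance = sum((x - expected_value) ** 2 * p for x, p in model.items())
std_dev = math.sqrt(variance)
```

For this model the expected value is 1.7, the variance is 0.81, and the standard deviation is 0.9, up to floating-point rounding.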

Bernoulli Trials

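Two possible outcomes on each trial: success (probability p) and failure (probability q = 1 - p)
The probability of success p is the same on every trial
Trials are independent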

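Combinations Formula

Number of ways to choose x items from n when order does not matter: $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$

Pascal's Triangle for combinations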
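
The number of combinations, n! / (x! (n - x)!), can be computed from factorials or with the built-in `math.comb` (Python 3.8+). Row n of Pascal's Triangle lists these counts for x = 0, 1, ..., n. The function names below are illustrative.

```python
import math

def n_choose_x(n, x):
    """Combinations from the factorial formula n! / (x! (n - x)!)."""
    return math.factorial(n) // (math.factorial(x) * math.factorial(n - x))

def pascal_row(n):
    """Row n of Pascal's Triangle: the counts C(n, 0), C(n, 1), ..., C(n, n)."""
    return [math.comb(n, x) for x in range(n + 1)]

ways = n_choose_x(5, 2)  # 10 ways to choose 2 items from 5
row_4 = pascal_row(4)    # [1, 4, 6, 4, 1]
```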

Binomial Distribution

Number of successes in n Bernoulli trials
Expected value: $E(X) = np$
Standard deviation: $\sigma = \sqrt{npq}$

Binomial probability formula: $P(X = x) = \binom{n}{x} p^x q^{n-x}$, for x = 0, 1, ..., n
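
A minimal sketch of the binomial model: the probability of exactly x successes in n independent Bernoulli trials is C(n, x) p^x (1 - p)^(n - x), with mean np and standard deviation sqrt(np(1 - p)). The values n = 10 and p = 0.3 are arbitrary choices for illustration.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x): probability of exactly x successes in n Bernoulli trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3  # hypothetical number of trials and success probability

mean = n * p
std_dev = math.sqrt(n * p * (1 - p))

# Full distribution over x = 0, 1, ..., n
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean_from_pmf = sum(x * prob for x, prob in enumerate(pmf))
```

Summing the probabilities gives 1, and the probability-weighted average of x reproduces np, which is a useful sanity check on the formula.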

Poisson Distribution

Number of rare events in an interval
Mean: $E(X) = \lambda$
Standard deviation: $\sigma = \sqrt{\lambda}$
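
A short sketch of the Poisson model, in which the probability of x events is e^(-λ) λ^x / x!, the mean is λ, and the standard deviation is sqrt(λ). The rate λ = 2 events per interval is an arbitrary assumption for the example.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 2.0  # hypothetical average number of events per interval

p_zero = poisson_pmf(0, lam)  # probability of no events
p_at_most_two = sum(poisson_pmf(x, lam) for x in range(3))
std_dev = math.sqrt(lam)

# The probabilities over x = 0, 1, 2, ... sum to 1 (truncated here at 49)
total = sum(poisson_pmf(x, lam) for x in range(50))
```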

Bivariate Analysis – Correlation & Regression

Scatterplots and Association

Scatterplots visualize the relationship between two quantitative variables.

Explanatory Variable (X): Predictor
Response Variable (Y): Predicted

Scatterplot: Cost of Women's Clothes vs Food Costs

Describing Scatterplots

Direction: Positive, negative, or none
Shape: Linear, curved, clusters
Strength: Strong, moderate, weak
Unusual Features: Outliers, clusters

Scatterplot: Negative association
Scatterplot: Positive association

Correlation

Correlation coefficient (r): Measures the strength and direction of linear association; ranges from -1 to 1
Conditions: Quantitative variables, linear relationship, no outliers
Correlation does not imply causation
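
One common way to compute r is to average the products of paired z-scores, dividing by n - 1. The sketch below follows that definition; the five data points are hypothetical.

```python
import statistics

def correlation(xs, ys):
    """Correlation r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical response values

r = correlation(xs, ys)                       # about 0.775
r_perfect = correlation(xs, [2 * x + 1 for x in xs])  # points exactly on a line
```

Points that fall exactly on an upward-sloping line give r = 1, while the scattered sample gives a moderately strong positive value.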

Linear Regression

Regression line: $\hat{y} = b_0 + b_1 x$
$b_1 = r \, \frac{s_y}{s_x}$ (slope)
$b_0 = \bar{y} - b_1 \bar{x}$ (intercept)
Interpolation: Predicting within the range of observed x-values
Extrapolation: Predicting outside the range of observed x-values (use caution)

Regression line: Manatees & Boats

Least Squares Method & Residuals

Residual: Difference between observed and predicted value ($e = y - \hat{y}$)
Line of best fit minimizes the sum of squared residuals
$R^2$ (coefficient of determination): Proportion of variation in y explained by the model
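
A minimal least squares sketch on hypothetical data. The slope is computed as the sum of cross-deviations divided by the sum of squared x-deviations, which is algebraically the same as r × (s_y / s_x); the intercept makes the line pass through the point of means. The coefficient names b0 and b1 are a notational assumption.

```python
def fit_line(xs, ys):
    """Return (intercept b0, slope b1) of the least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical response values

b0, b1 = fit_line(xs, ys)
predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # observed - predicted

# R squared: 1 minus (unexplained variation / total variation)
y_bar = sum(ys) / len(ys)
sse = sum(e ** 2 for e in residuals)
sst = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - sse / sst
```

For these points the fitted line is approximately ŷ = 2.2 + 0.6x and R² is about 0.6, so the line accounts for roughly 60% of the variation in y. The residuals sum to zero, as they always do for a least squares line with an intercept.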

r squared value: Manatees & Boats

Residual Analysis

Residuals should be randomly scattered around zero, with no pattern
Standard error of regression ($s_e$): Measures the typical deviation of observed values from the regression line
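
For simple linear regression, the standard error of regression is usually computed as the square root of the sum of squared residuals divided by n - 2. The sketch below assumes that formula and reuses a small hypothetical data set whose least squares line is ŷ = 2.2 + 0.6x.

```python
import math

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical response values
b0, b1 = 2.2, 0.6     # least squares intercept and slope for these points

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)

# n - 2 because two coefficients (b0 and b1) were estimated from the data
s_e = math.sqrt(sse / (len(xs) - 2))
```

Here s_e is about 0.89, meaning the observed y-values typically fall a little under one unit from the fitted line.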

Histogram of residuals

Bivariate Analysis – Contingency Tables

Contingency Tables

Organize counts for two categorical variables
Marginal and conditional distributions
Row and column percentages
Visualized with segmented bar charts and pie charts

Independence in Contingency Tables

Variables are independent if the conditional distributions of one variable are the same (in sample data, nearly the same) across all categories of the other

Chi-Square Tests & Cramer's V

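Chi-squared statistic: Measures association in contingency tables by comparing observed counts with the counts expected under independence, $\chi^2 = \sum \frac{(\text{Obs} - \text{Exp})^2}{\text{Exp}}$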
Cramer's V: Scaled measure of association, ranges from 0 (independent) to 1 (perfect association)
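
A sketch of both statistics for a table of counts. It assumes the usual definitions: the expected count for a cell is (row total × column total) / n, chi-squared sums (observed - expected)² / expected over all cells, and Cramer's V is sqrt(chi-squared / (n × min(rows - 1, columns - 1))). The counts are hypothetical.

```python
import math

def chi_square(table):
    """Chi-squared statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count under independence
            stat += (observed - expected) ** 2 / expected
    return stat

def cramers_v(table):
    """Cramer's V: chi-squared rescaled to lie between 0 and 1."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square(table) / (n * k))

observed = [[30, 20], [10, 40]]       # hypothetical counts with association
no_association = [[20, 30], [20, 30]]  # rows share the same distribution

chi2 = chi_square(observed)  # about 16.67
v = cramers_v(observed)      # about 0.41
```

When every row has the same conditional distribution, observed and expected counts coincide and both statistics are 0.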

Simpson's Paradox

A trend observed within each group disappears or reverses when the groups are combined, because of a lurking variable
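
The reversal can be seen with a small set of invented counts. In the sketch below, option A has the higher success rate inside each group, yet option B has the higher rate once the groups are pooled, because A was used mostly in the group where success is rare. All numbers and labels are hypothetical.

```python
from fractions import Fraction

# Hypothetical (successes, trials) counts, invented for illustration only
counts = {
    "Group 1": {"A": (9, 10), "B": (80, 100)},
    "Group 2": {"A": (30, 100), "B": (2, 10)},
}

# Success rate of each option within each group
within = {
    group: {option: Fraction(*counts[group][option]) for option in ("A", "B")}
    for group in counts
}

def pooled_rate(option):
    """Success rate of an option after combining all groups."""
    successes = sum(counts[g][option][0] for g in counts)
    trials = sum(counts[g][option][1] for g in counts)
    return Fraction(successes, trials)

overall = {option: pooled_rate(option) for option in ("A", "B")}
```

Within the groups A wins 90% to 80% and 30% to 20%, but pooled it loses 39/110 to 82/110. Group membership is the lurking variable.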

Summary Table: Types of Data and Analysis

Type of Data	Graphical Methods	Numerical Methods
Qualitative (Categorical)	Pie Chart, Bar Graph, Pareto Diagram	Frequency, Mode
Quantitative	Histogram, Stem-and-Leaf, Boxplot	Mean, Median, Mode, Range, Variance, SD, IQR, z-score

Key Formulas

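Mean: $\bar{x} = \frac{\sum x_i}{n}$
Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard Deviation: $s = \sqrt{s^2}$
z-score: $z = \frac{x - \bar{x}}{s}$
Binomial Probability: $P(X = x) = \binom{n}{x} p^x q^{n-x}$
Poisson Probability: $P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}$
Conditional Probability: $P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$
Regression Line: $\hat{y} = b_0 + b_1 x$
Correlation: $r = \frac{\sum z_x z_y}{n - 1}$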