Comprehensive Study Notes for Statistics & Probability: Principles, Probability, Random Variables, Distributions, and Bivariate Analysis

Statistics, Data & Statistical Thinking

Introduction to Statistics

Statistics is the science of collecting, analyzing, interpreting, and presenting data. It provides tools and methods to understand the real world through data.

  • Statistics (discipline): Reasoning, tools, and methods for analyzing data.

  • Statistics (plural): Results of calculations made with data.

  • Data: Any collection of numbers, characters, images, or other items that provide information about something.

Statistical Methods

  • Descriptive Statistics: Collecting, presenting, and characterizing data.

  • Inferential Statistics: Making decisions or predictions about a population based on sample data.

Types of Data

  • Quantitative Data: Measured by numbers, often with units (e.g., weight, salary).

  • Qualitative (Categorical) Data: Describes groups or categories (e.g., hair color, campus).

  • Ordinal Variables: Categorical variables with a meaningful order but no natural units (e.g., survey ranks).

  • Identifier Variables: Unique identifiers for cases (e.g., student ID).

Sampling Techniques

  • Simple Random Sampling: Every sample of size n has an equal chance of selection.

  • Systematic Sampling: Selecting every kth unit from a list, usually after a random starting point.

  • Cluster Sampling: Sampling all units within randomly selected clusters.

  • Stratified Sampling: Sampling within identified subgroups (strata).

  • Convenience Sampling: Choosing samples that are easy to select.
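
The probability-based techniques above can be sketched in a few lines of Python. This is a minimal illustration, not a survey tool: the population of unit IDs 1 to 100, the two strata, the five clusters, and all function names are hypothetical, and the seed is fixed only so the draw is repeatable.

```python
import random

def simple_random_sample(units, n, rng):
    """Draw n units so that every subset of size n is equally likely."""
    return rng.sample(units, n)

def systematic_sample(units, k, rng):
    """Pick a random start among the first k units, then take every k-th unit."""
    start = rng.randrange(k)
    return units[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample of fixed size inside each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical unit IDs 1..100

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
strata = {"low": population[:50], "high": population[50:]}
stratified = stratified_sample(strata, 5, rng)
clusters = {c: population[c * 20:(c + 1) * 20] for c in range(5)}
clustered = cluster_sample(clusters, 2, rng)
```

Convenience sampling is left out on purpose: it involves no random mechanism, so there is nothing to simulate.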

Describing Data with Tables and Graphs

Describing Qualitative Data

  • Frequency Table: Lists categories and counts.

  • Pie Chart: Shows breakdown of total quantity into categories.

  • Bar Graph: Vertical bars for qualitative variables; height shows frequency or percentage.

  • Pareto Diagram: Bar graph with categories ordered by frequency.

Describing Quantitative Data

  • Stem-and-Leaf Display: Splits data into stems and leaves for visualization.

  • Histogram: Partitions data into bins; bars show frequency or relative frequency.

Numerical Measures of Central Tendency

  • Mean (x̄): Arithmetic average; sensitive to outliers.

  • Median: Middle value; resistant to outliers.

  • Mode: Most frequent value; can be used for quantitative or qualitative data.

Numerical Measures of Variability

  • Range: Difference between largest and smallest values.

  • Variance (s²): Sum of squared deviations from the mean divided by n − 1; roughly the average squared deviation.

  • Standard Deviation (s): Square root of variance; measures spread.

  • Coefficient of Variation: Ratio of standard deviation to mean.
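
As a rough illustration of the measures of center and variability above, the sketch below uses Python's standard `statistics` module on a small made-up sample. The data values are hypothetical; `statistics.variance` and `statistics.stdev` use the n − 1 divisor.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

mean = statistics.mean(data)          # 42 / 7 = 6
median = statistics.median(data)      # middle value: 5
mode = statistics.mode(data)          # most frequent value: 4
data_range = max(data) - min(data)    # 11 - 2 = 9
variance = statistics.variance(data)  # 60 / 6 = 10 (divides by n - 1)
std_dev = statistics.stdev(data)      # square root of the variance
cv = std_dev / mean                   # coefficient of variation

# Adding one extreme value moves the mean far more than the median.
with_outlier = data + [100]
mean_with_outlier = statistics.mean(with_outlier)      # 142 / 8 = 17.75
median_with_outlier = statistics.median(with_outlier)  # (5 + 7) / 2 = 6
```

The last two lines show why the mean is called sensitive to outliers and the median resistant.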

Shape of a Distribution

  • Symmetric: Mean ≈ Median.

  • Right-Skewed: Typically Mean > Median; the longer tail is on the right.

  • Left-Skewed: Typically Mean < Median; the longer tail is on the left.

Five-Number Summary & Boxplot

  • Minimum, Q1, Median, Q3, Maximum.

  • Boxplots visualize these values and detect outliers.

Measures of Relative Standing

  • z-Score: Number of standard deviations a value is from the mean.

  • Percentiles: Value below which a given percentage of data falls.

  • Quartiles: Q1 (25th), Q2 (Median, 50th), Q3 (75th).

  • Interquartile Range (IQR): Q3 - Q1; spread of middle 50%.
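
The five-number summary, IQR, and z-scores can be computed as in the sketch below. Two assumptions are worth flagging: the data are hypothetical, and the outlier check uses the common 1.5 × IQR fence rule, which the notes do not spell out. Quartile conventions also differ between textbooks and software; `statistics.quantiles` uses its default "exclusive" method here.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11, 30]  # hypothetical sample

# Quartiles split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)
five_number_summary = (min(data), q1, q2, q3, max(data))

# IQR and the 1.5 * IQR fences often used to flag boxplot outliers (assumed rule).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# z-score: how many standard deviations each value lies from the mean.
mean = statistics.mean(data)
s = statistics.stdev(data)
z_scores = [(x - mean) / s for x in data]
```

With these values the value 30 lies above the upper fence, so it is the only point flagged.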

Probability

Basic Principles

Probability quantifies uncertainty and is foundational to statistical inference.

  • Random Phenomenon: Situation with uncertain outcomes.

  • Trial: Single observation of a random phenomenon.

  • Outcome: Result of a trial.

  • Sample Space: Set of all possible outcomes.

  • Event: A set of one or more outcomes from the sample space.

Probability Rules

  • Rule 1: 0 ≤ P(A) ≤ 1

  • Rule 2: P(Sample Space) = 1

  • Rule 3 (Complement): P(Aᶜ) = 1 − P(A), where Aᶜ is the event that A does not occur

  • Rule 4 (Addition): For disjoint events, P(A or B) = P(A) + P(B)

  • Rule 5 (Multiplication): For independent events, P(A and B) = P(A) × P(B)

General Addition Rule

For any two events A and B: P(A or B) = P(A) + P(B) − P(A and B)

Conditional Probability

  • Probability of B given that A has occurred: P(B | A) = P(A and B) / P(A), provided P(A) > 0.

Independence

  • Events A and B are independent if P(B | A) = P(B) and P(A | B) = P(A); equivalently, P(A and B) = P(A) × P(B).
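
One way to check the complement, general addition, conditional probability, and independence rules is to enumerate a small sample space. The sketch below assumes a single roll of a fair die; the events A, B, and C and the helper `prob` are illustrative choices, not part of the original notes.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die

def prob(event):
    """Probability of an event (a set of outcomes) when outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}     # roll is even
B = {4, 5, 6}     # roll is greater than 3
C = {1, 2, 3, 4}  # roll is at most 4

# Complement rule: P(A^c) = 1 - P(A)
p_not_A = prob(sample_space - A)

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_A_or_B = prob(A) + prob(B) - prob(A & B)

# Conditional probability: P(B | A) = P(A and B) / P(A)
p_B_given_A = prob(A & B) / prob(A)

# Independence check: does P(A and X) equal P(A) * P(X)?
A_and_B_independent = prob(A & B) == prob(A) * prob(B)
A_and_C_independent = prob(A & C) == prob(A) * prob(C)

print(p_not_A, p_A_or_B, p_B_given_A)            # 1/2 2/3 2/3
print(A_and_B_independent, A_and_C_independent)  # False True
```

Here A and B are dependent because knowing the roll is even raises the chance it exceeds 3 from 1/2 to 2/3, while A and C are independent.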

Tree Diagrams

Tree diagrams visually represent sequences of events and their probabilities.

Probability tree diagram for binge drinking and accidents

Bayes' Rule

  • Used to reverse conditioning: P(A | B) = P(B | A) P(A) / [P(B | A) P(A) + P(B | Aᶜ) P(Aᶜ)]
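
A small worked example may help show how Bayes' rule follows the branches of a tree diagram. The three input probabilities below are hypothetical and are not taken from the binge drinking example; exact fractions are used so the arithmetic is easy to follow.

```python
from fractions import Fraction

# Hypothetical inputs, chosen only for illustration
p_A = Fraction(1, 5)                # P(A)
p_B_given_A = Fraction(1, 2)        # P(B | A)
p_B_given_not_A = Fraction(1, 10)   # P(B | A^c)

# Multiply along each branch of the tree to get joint probabilities.
p_A_and_B = p_A * p_B_given_A                  # 1/10
p_notA_and_B = (1 - p_A) * p_B_given_not_A     # 4/5 * 1/10 = 2/25

# Total probability of B: add the branches that end in B.
p_B = p_A_and_B + p_notA_and_B                 # 9/50

# Bayes' rule: reverse the conditioning to get P(A | B).
p_A_given_B = p_A_and_B / p_B
print(p_B, p_A_given_B)  # 9/50 5/9
```

Note that P(A | B) = 5/9 differs from P(B | A) = 1/2; the two conditional probabilities are not interchangeable.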

Describing Data Numerically

Contingency Tables

Contingency tables organize counts for combinations of two categorical variables.

  • Rows: Variable 1

  • Columns: Variable 2

  • Marginal Distribution: Totals for each variable

  • Conditional Distribution: Distribution of one variable given a category of the other

Contingency table: Goals by Sex
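
Marginal and conditional distributions can be read off a table of counts as sketched below. The counts and the labels "Group A", "Group B", "Yes", and "No" are hypothetical placeholders, not the values from the Goals by Sex table.

```python
# Hypothetical counts: rows are groups, columns are responses
table = {
    "Group A": {"Yes": 30, "No": 20},
    "Group B": {"Yes": 15, "No": 35},
}

row_totals = {row: sum(cells.values()) for row, cells in table.items()}
columns = list(next(iter(table.values())))
col_totals = {col: sum(table[row][col] for row in table) for col in columns}
grand_total = sum(row_totals.values())

# Marginal distribution of the column variable: column totals over the grand total.
marginal_cols = {col: col_totals[col] / grand_total for col in columns}

# Conditional distribution of the column variable within each row:
# each cell divided by its row total.
conditional = {
    row: {col: table[row][col] / row_totals[row] for col in columns}
    for row in table
}
```

In this made-up table 60% of Group A but only 30% of Group B answered "Yes", so the conditional distributions differ across rows.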

Discrete Random Variables & Probability Distributions

Random Variables

  • Random Variable: Numerical value from a random experiment.

  • Discrete Random Variable: Takes countable values.

  • Continuous Random Variable: Takes any value within a range.

Probability Model

  • Collection of all possible values and their probabilities.

  • Requirements: for all x, 0 ≤ P(X = x) ≤ 1, and the probabilities sum to 1: Σ P(X = x) = 1

Expected Value and Standard Deviation

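  • Expected Value: E(X) = μ = Σ x · P(X = x)

  • Variance: Var(X) = σ² = Σ (x − μ)² · P(X = x)

  • Standard Deviation: SD(X) = σ = √Var(X)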
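
The expected value and standard deviation of a discrete random variable are probability-weighted sums. The sketch below applies the usual definitions, E(X) = Σ x·P(x) and Var(X) = Σ (x − μ)²·P(x), to a hypothetical probability model.

```python
import math
from fractions import Fraction

# Hypothetical probability model: value -> probability
model = {
    0: Fraction(1, 10),
    1: Fraction(3, 10),
    2: Fraction(4, 10),
    3: Fraction(2, 10),
}

# A valid model has probabilities between 0 and 1 that sum to 1.
is_valid = all(0 <= p <= 1 for p in model.values()) and sum(model.values()) == 1

# Expected value: each value weighted by its probability (17/10 here).
expected_value = sum(x * p for x, p in model.items())

# Variance: squared deviations from the mean, weighted by probability (81/100 here).
variance = sum((x - expected_value) ** 2 * p for x, p in model.items())

# Standard deviation: square root of the variance (about 0.9 here).
std_dev = math.sqrt(variance)
```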

Bernoulli Trials

  • Two outcomes: success (p), failure (q = 1-p)

  • Trials are independent

Combinations Formula

  • Number of ways to choose k items from n when order does not matter: C(n, k) = n! / (k! (n − k)!)

Pascal's Triangle for combinations
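
The link between the combinations formula and Pascal's triangle can be checked directly: entry k of row n equals "n choose k". The function names below are illustrative.

```python
import math

def combinations(n, k):
    """n choose k from the factorial formula n! / (k! (n - k)!)."""
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

def pascal_row(n):
    """Row n of Pascal's triangle, built by adding adjacent entries of the previous row."""
    row = [1]
    for _ in range(n):
        row = [a + b for a, b in zip([0] + row, row + [0])]
    return row

print(combinations(5, 2))  # 10
print(pascal_row(5))       # [1, 5, 10, 10, 5, 1]
```

On Python 3.8 or later, `math.comb(n, k)` returns the same values as `combinations` above.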

Binomial Distribution

  • Number of successes in n Bernoulli trials

  • Expected value: E(X) = np

  • Standard deviation: SD(X) = √(npq)

Binomial probability formula: P(X = x) = C(n, x) p^x q^(n − x), for x = 0, 1, …, n
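
The binomial formula, mean np, and standard deviation √(npq) can be verified numerically. The values n = 10 and p = 0.3 below are hypothetical; `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for the number of successes in n independent Bernoulli trials."""
    q = 1 - p
    return math.comb(n, x) * p ** x * q ** (n - x)

n, p = 10, 0.3  # hypothetical number of trials and success probability

# Full probability model: P(X = 0), ..., P(X = n)
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed from the model itself
mean_from_pmf = sum(x * px for x, px in enumerate(pmf))
var_from_pmf = sum((x - mean_from_pmf) ** 2 * px for x, px in enumerate(pmf))

# Shortcut formulas for comparison
mean_formula = n * p                       # np
sd_formula = math.sqrt(n * p * (1 - p))    # sqrt(npq)
```

Up to floating-point rounding, `mean_from_pmf` matches `mean_formula` and the square root of `var_from_pmf` matches `sd_formula`.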

Poisson Distribution

  • Number of rare events in an interval

  • Mean: E(X) = λ

  • Standard deviation: SD(X) = √λ
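
A similar check works for the Poisson model, using the standard formula P(X = x) = e^(−λ) λ^x / x!. The rate λ = 2 is hypothetical, and the sum is truncated at 60 terms because the remaining probabilities are negligibly small for this rate.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson count with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 2.0  # hypothetical average number of events per interval

# Truncated probability model; terms beyond x = 59 are negligible for lam = 2
pmf = [poisson_pmf(x, lam) for x in range(60)]

mean_from_pmf = sum(x * px for x, px in enumerate(pmf))
var_from_pmf = sum((x - mean_from_pmf) ** 2 * px for x, px in enumerate(pmf))
sd_from_pmf = math.sqrt(var_from_pmf)
```

Both the mean and the variance come out at approximately λ, so the standard deviation is approximately √λ.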

Bivariate Analysis – Correlation & Regression

Scatterplots and Association

Scatterplots visualize the relationship between two quantitative variables.

  • Explanatory Variable (X): Predictor

  • Response Variable (Y): Predicted

Scatterplot: Cost of Women's Clothes vs Food Costs

Describing Scatterplots

  • Direction: Positive, negative, or none

  • Shape: Linear, curved, clusters

  • Strength: Strong, moderate, weak

  • Unusual Features: Outliers, clusters

Scatterplot: Negative association

Scatterplot: Positive association

Correlation

  • Correlation coefficient (r): Measures linear association, ranges from -1 to 1

  • Conditions: Quantitative variables, linear relationship, no outliers

  • Correlation does not imply causation
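
The correlation coefficient can be computed by standardizing both variables and averaging the products of the z-scores with an n − 1 divisor. The sketch below assumes that common textbook definition, r = Σ(z_x · z_y) / (n − 1), and uses hypothetical data.

```python
import statistics

def correlation(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    z_products = [
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    ]
    return sum(z_products) / (n - 1)

x = [1, 2, 3, 4, 5]  # hypothetical explanatory values
y = [2, 4, 5, 4, 5]  # hypothetical response values

r = correlation(x, y)                                # about 0.775
r_perfect = correlation(x, [2 * v + 1 for v in x])   # exact line, so r is 1 up to rounding
r_negative = correlation(x, [-v for v in x])         # decreasing line, so r is -1 up to rounding
```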

Linear Regression

  • Regression line: ŷ = b₀ + b₁x

  • b₁ = r · s_y / s_x (slope)

  • b₀ = ȳ − b₁x̄ (intercept)

  • Interpolation: Predicting within domain

  • Extrapolation: Predicting outside domain (use caution)

Regression line: Manatees & Boats
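
The regression line can be fitted from summary statistics using the usual relations b₁ = r · s_y / s_x and b₀ = ȳ − b₁x̄. The sketch below uses hypothetical data, not the manatee data, and the helper `predict` is an illustrative addition that flags predictions made outside the observed x range.

```python
import statistics

def fit_line(xs, ys):
    """Return (b0, b1) for y-hat = b0 + b1 * x using b1 = r * s_y / s_x."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = cross / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = mean_y - b1 * mean_x
    return b0, b1

def predict(x_new, b0, b1, x_min, x_max):
    """Return the predicted value and whether x_new is an extrapolation."""
    is_extrapolation = not (x_min <= x_new <= x_max)
    return b0 + b1 * x_new, is_extrapolation

x = [1, 2, 3, 4, 5]  # hypothetical explanatory values
y = [2, 4, 5, 4, 5]  # hypothetical response values

b0, b1 = fit_line(x, y)  # about 2.2 and 0.6
y_hat_inside, flag_inside = predict(3.5, b0, b1, min(x), max(x))    # interpolation
y_hat_outside, flag_outside = predict(10, b0, b1, min(x), max(x))   # extrapolation
```

The second prediction is still computable, but the flag is a reminder that the linear pattern has not been observed that far out.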

Least Squares Method & Residuals

  • Residual: Difference between observed and predicted value (e = y − ŷ)

  • Line of best fit minimizes sum of squared residuals

  • R² (coefficient of determination): Proportion of the variation in y explained by the model; in simple linear regression, R² = r²

r squared value: Manatees & Boats

Residual Analysis

  • Residuals should be randomly scattered, no pattern

  • Standard error of regression (sₑ): Measures the typical size of a residual, that is, the typical vertical distance of a point from the regression line

Histogram of residuals
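
Residuals, R², and the standard error of regression can all be derived from one least squares fit. The sketch below uses hypothetical data and assumes the common convention sₑ = √(SSE / (n − 2)) for simple linear regression.

```python
import math

x = [1, 2, 3, 4, 5]  # hypothetical explanatory values
y = [2, 4, 5, 4, 5]  # hypothetical response values
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares slope and intercept
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

# Residual = observed - predicted
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]

# R^2 = 1 - (unexplained variation / total variation)
sse = sum(e ** 2 for e in residuals)
sst = sum((b - mean_y) ** 2 for b in y)
r_squared = 1 - sse / sst

# Standard error of regression with n - 2 degrees of freedom (assumed convention)
s_e = math.sqrt(sse / (n - 2))
```

For a least squares line with an intercept the residuals sum to zero up to rounding, and here R² is about 0.6, meaning roughly 60% of the variation in y is accounted for by the line.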

Bivariate Analysis – Contingency Tables

Contingency Tables

  • Organize counts for two categorical variables

  • Marginal and conditional distributions

  • Row and column percentages

  • Visualized with segmented bar charts and pie charts

Independence in Contingency Tables

  • Variables are independent if conditional distributions are similar across categories

Chi-Square Tests & Cramer's V

  • Chi-squared statistic: Measures association in contingency tables by comparing observed counts with the counts expected if the variables were independent

  • Cramer's V: Scaled measure of association, ranges from 0 (independent) to 1 (perfect association)
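
The sketch below computes both measures for a table of counts. It assumes the standard definitions, which the notes do not write out: expected count = (row total × column total) / n, χ² = Σ (observed − expected)² / expected, and V = √(χ² / (n · min(r − 1, c − 1))). All three tables are hypothetical.

```python
import math

def chi_square_and_cramers_v(observed):
    """Return (chi-squared statistic, Cramer's V) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    v = math.sqrt(chi2 / (n * k))
    return chi2, v

chi2, v = chi_square_and_cramers_v([[30, 20], [15, 35]])               # moderate association
chi2_indep, v_indep = chi_square_and_cramers_v([[20, 30], [40, 60]])   # identical row percentages
chi2_perf, v_perf = chi_square_and_cramers_v([[50, 0], [0, 50]])       # perfect association
```

The second table has the same row percentages in both rows, so both measures are 0; the third puts every case on the diagonal, so V reaches 1.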

Simpson's Paradox

  • An association observed within each group can reverse or disappear when the groups are combined, because of a lurking variable
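
A small numeric example makes the paradox concrete. The success counts below are invented for illustration: option A has the higher success rate inside each group, yet the lower rate once the groups are pooled, because A was used mostly in the group where success is harder.

```python
from fractions import Fraction

# Hypothetical (successes, trials) for two options within two groups
results = {
    "Group 1": {"A": (8, 10), "B": (70, 100)},
    "Group 2": {"A": (40, 100), "B": (3, 10)},
}

# Success rate of each option within each group
within = {
    group: {option: Fraction(s, n) for option, (s, n) in arms.items()}
    for group, arms in results.items()
}

# Success rate of each option after pooling the groups
combined = {}
for option in ("A", "B"):
    successes = sum(results[group][option][0] for group in results)
    trials = sum(results[group][option][1] for group in results)
    combined[option] = Fraction(successes, trials)

a_wins_in_every_group = all(rates["A"] > rates["B"] for rates in within.values())
a_wins_overall = combined["A"] > combined["B"]
print(a_wins_in_every_group, a_wins_overall)  # True False
```

Group membership plays the role of the lurking variable: it is related both to which option was used and to the chance of success.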

Summary Table: Types of Data and Analysis

Type of Data | Graphical Methods | Numerical Methods

Qualitative (Categorical) | Pie Chart, Bar Graph, Pareto Diagram | Frequency, Mode

Quantitative | Histogram, Stem-and-Leaf, Boxplot | Mean, Median, Mode, Range, Variance, SD, IQR, z-score

Key Formulas

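  • Mean: x̄ = Σx / n

  • Variance: s² = Σ(x − x̄)² / (n − 1)

  • Standard Deviation: s = √[Σ(x − x̄)² / (n − 1)]

  • z-score: z = (x − x̄) / s

  • Binomial Probability: P(X = x) = C(n, x) p^x q^(n − x), where C(n, x) = n! / (x! (n − x)!)

  • Poisson Probability: P(X = x) = e^(−λ) λ^x / x!

  • Conditional Probability: P(B | A) = P(A and B) / P(A)

  • Regression Line: ŷ = b₀ + b₁x, with b₁ = r · s_y / s_x and b₀ = ȳ − b₁x̄

  • Correlation: r = Σ(z_x · z_y) / (n − 1)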
