STA2023 Review: Data Collection, Summarizing Data, and Probability
Study Guide - Smart Notes
Section 1: Data Collection
Key Concepts in Data Collection
Data collection is the foundational step in statistics, involving the gathering of information to answer research questions. Understanding the types of data, variables, and sampling methods is essential for valid statistical analysis.
Population: The entire group of individuals or items of interest in a study.
Sample: A subset of the population selected for analysis.
Individuals: The objects described by a set of data (e.g., people, animals, things).
Parameter: A numerical summary describing a characteristic of a population.
Statistic: A numerical summary describing a characteristic of a sample.
Example: If a nutritionist selects 80 participants to test a new diet, the 80 participants are the sample, and all people who could use the diet are the population.
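The distinction between a parameter and a statistic can be sketched in a few lines of Python. The population values below are generated purely for illustration (they are hypothetical, not real diet data), and the population size of 1,000 and the seed are assumptions; only the sample size of 80 comes from the example above.

```python
import random
from statistics import mean

# Hypothetical population: weight change (kg) for 1,000 people who could use the diet.
# These numbers are simulated for illustration only; they are not real measurements.
rng = random.Random(2023)
population = [round(rng.gauss(-2.0, 1.5), 1) for _ in range(1000)]

# Simple random sample of 80 participants, matching the nutritionist example.
sample = rng.sample(population, 80)

parameter = mean(population)  # numerical summary of the whole population
statistic = mean(sample)      # numerical summary of the sample only

print(f"Population mean (parameter): {parameter:.2f}")
print(f"Sample mean (statistic):     {statistic:.2f}")
```

The two printed values are usually close but not identical, which is why a statistic is treated as an estimate of the corresponding parameter.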
Types of Variables and Data
Qualitative (Categorical) Variables: Describe qualities or categories (e.g., type of car transmission).
Quantitative Variables: Represent numerical values (e.g., number of pets owned).
Discrete Variables: Countable values (e.g., number of pets).
Continuous Variables: Any value within a range (e.g., temperature).
Levels of Measurement:
Nominal: Categories with no order (e.g., type of transmission).
Ordinal: Categories with a logical order (e.g., ranking tennis players).
Interval: Ordered, equal intervals, no true zero (e.g., temperature in Celsius).
Ratio: Ordered, equal intervals, true zero (e.g., number of pets).
Example: Ranking the top 5 tennis players is ordinal; temperature in Celsius is interval; number of pets is ratio.
Bias in Data Collection
Sampling Bias: Occurs when the sample is not representative of the population.
Nonresponse Bias: Occurs when selected individuals do not respond and those who respond differ from those who do not.
Response Bias: When respondents give inaccurate answers (e.g., to please management).
Example: If a survey is only sent to employees at headquarters, sampling bias may occur.
Section 2: Organizing and Summarizing Data
Frequency Distributions and Relative Frequency
Organizing data into tables and charts helps summarize and visualize information.
Frequency: The number of times a value or category occurs.
Relative Frequency: The proportion of the total represented by each category.
Formula: $\text{relative frequency} = \dfrac{\text{frequency}}{\text{sum of all frequencies}}$
Degree Measure: Used in pie charts to represent categories as angles.
Formula: $\text{degree measure} = \text{relative frequency} \times 360^\circ$
Example Table:
| Activity | Frequency | Relative Frequency | Degree Measure |
|---|---|---|---|
| Sports | 60 | 0.30 | 108° |
| Movies | 50 | 0.25 | 90° |
| Shopping | 40 | 0.20 | 72° |
| Reading | 30 | 0.15 | 54° |
| Other | 20 | 0.10 | 36° |
Additional info: Degree measures calculated as relative frequency × 360°.
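As a quick check of the table, the relative frequencies and pie-chart angles can be recomputed from the raw frequencies. This is a minimal sketch; the dictionary and variable names are illustrative choices, not part of the original notes.

```python
# Frequencies copied from the example table.
frequencies = {"Sports": 60, "Movies": 50, "Shopping": 40, "Reading": 30, "Other": 20}
total = sum(frequencies.values())  # sum of all frequencies

summary = {}
for activity, freq in frequencies.items():
    rel_freq = freq / total
    # Same as rel_freq * 360, written this way to avoid floating-point rounding noise.
    degrees = freq * 360 / total
    summary[activity] = (rel_freq, degrees)
    print(f"{activity:<9}{freq:>4}{rel_freq:>7.2f}{degrees:>7.0f}°")

# Relative frequencies should sum to 1 and the angles to a full circle.
total_rel_freq = sum(rel for rel, _ in summary.values())
total_degrees = sum(deg for _, deg in summary.values())
```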
Class Intervals and Frequency Distributions
Class Interval: A range of values grouped together in a frequency distribution.
Class Limits: The smallest and largest values in each class.
Class Boundaries: The values that separate classes without gaps.
Example: For quiz scores grouped as 40–49, 50–59, etc., the lower class limit of the 3rd class (60–69) is 60, and the upper class limit is 69.
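Class boundaries for the quiz-score classes can be sketched as follows. The sketch assumes whole-number scores, so each boundary sits halfway between the upper limit of one class and the lower limit of the next; only the three classes named in the example are used.

```python
# The three classes named in the example; any later classes are not listed in the notes.
classes = [(40, 49), (50, 59), (60, 69)]

# Gap between one class's upper limit and the next class's lower limit
# (1 when scores are whole numbers).
gap = classes[1][0] - classes[0][1]

boundaries = []
for lower, upper in classes:
    lower_boundary = lower - gap / 2
    upper_boundary = upper + gap / 2
    boundaries.append((lower_boundary, upper_boundary))
    width = upper_boundary - lower_boundary
    print(f"limits {lower}-{upper}: boundaries {lower_boundary}-{upper_boundary}, width {width}")

third_lower_limit, third_upper_limit = classes[2]
```

Adjacent classes share a boundary (for example 49.5), which is what removes the gaps between the class limits.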
Shapes of Distributions
Bell-shaped (Normal): Symmetrical, most data near the center.
Uniform: All values equally likely.
Skewed Left: Tail on the left; most data on the right.
Skewed Right: Tail on the right; most data on the left.
Example: Salaries of employees where a few executives earn millions are skewed right.
Section 3: Numerically Summarizing Data
Population Parameters
Population parameters are numerical values that summarize data for an entire population.
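Population Mean ($\mu$): $\mu = \dfrac{\sum x_i}{N}$, where $N$ is the population size.
Population Variance ($\sigma^2$): $\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$
Population Standard Deviation ($\sigma$): $\sigma = \sqrt{\sigma^2} = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$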
Population Median: The middle value when data are ordered.
Population Mode: The value that occurs most frequently.
Example: For the commute times 18, 22, 25, 28, 30, 31, 33, 35, 36, 40, treated as a population, the mean is 298/10 = 29.8, the median is (30 + 31)/2 = 30.5, and there is no mode because no value repeats.
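The commute-time example can be checked with Python's standard `statistics` module. This is a sketch that treats the ten values as the entire population, so it uses the population versions of variance and standard deviation (dividing by N); the variable names are illustrative.

```python
from statistics import mean, median, multimode, pstdev, pvariance

commute_times = [18, 22, 25, 28, 30, 31, 33, 35, 36, 40]

mu = mean(commute_times)      # sum of values divided by N
med = median(commute_times)   # average of the two middle values, since N is even

# multimode returns every value tied for the highest count.
# If all values tie (each appears once), the data set has no mode.
tied = multimode(commute_times)
has_mode = len(tied) < len(set(commute_times))

sigma_sq = pvariance(commute_times)  # population variance, divides by N
sigma = pstdev(commute_times)        # population standard deviation

print(f"mean = {mu}, median = {med}, has a mode: {has_mode}")
print(f"variance = {sigma_sq:.2f}, standard deviation = {sigma:.2f}")
```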
Sample Statistics
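Sample Mean ($\bar{x}$): $\bar{x} = \dfrac{\sum x_i}{n}$, where $n$ is the sample size.
Sample Variance ($s^2$): $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$
Sample Standard Deviation ($s$): $s = \sqrt{s^2} = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$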
Example: For the sample 18, 22, 25, 28, 30, the sample mean is 123/5 = 24.6 and the sample variance is 91.2/4 = 22.8, where 91.2 is the sum of squared deviations from the mean.
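The sample calculation can be sketched with the `statistics` module, whose `variance` and `stdev` functions divide by n − 1. The comparison with `pvariance` is included only to show how much the divisor matters for a small data set.

```python
from statistics import mean, pvariance, stdev, variance

sample = [18, 22, 25, 28, 30]

x_bar = mean(sample)     # sum of values divided by n
s_sq = variance(sample)  # sample variance, divides by n - 1
s = stdev(sample)        # sample standard deviation

# For contrast: dividing by n instead of n - 1 gives a smaller value.
divide_by_n = pvariance(sample)

print(f"sample mean = {x_bar}")
print(f"sample variance = {s_sq:.2f}, sample standard deviation = {s:.2f}")
print(f"variance if divided by n instead = {divide_by_n:.2f}")
```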
Five-Number Summary and Boxplot
Five-Number Summary: Minimum, Q1 (first quartile), Median (Q2), Q3 (third quartile), Maximum.
Interquartile Range (IQR): $\text{IQR} = Q_3 - Q_1$
Example: If a data set has Q1 = 25 and Q3 = 38, then IQR = 38 − 25 = 13.
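A five-number summary can be sketched as below, reusing the commute times from the earlier example. The helper `five_number_summary` is a hypothetical function written for this illustration, and it assumes the "median of each half" convention for quartiles; textbooks and software use several slightly different quartile rules, so other tools may report different Q1 and Q3 values for the same data.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention.

    Assumes at least two values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]          # values below the median position
    upper_half = data[half + n % 2:]  # skips the middle value when n is odd
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])

commute_times = [18, 22, 25, 28, 30, 31, 33, 35, 36, 40]
minimum, q1, med, q3, maximum = five_number_summary(commute_times)
iqr = q3 - q1

print(f"min = {minimum}, Q1 = {q1}, median = {med}, Q3 = {q3}, max = {maximum}")
print(f"IQR = {iqr}")
```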
Empirical Rule (68-95-99.7 Rule)
For bell-shaped (normal) distributions:
About 68% of data within 1 standard deviation of the mean.
About 95% within 2 standard deviations.
About 99.7% within 3 standard deviations.
Example: If mean commute time is 30 minutes and standard deviation is 6 minutes, about 95% of times are between 18 and 42 minutes.
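The intervals in the Empirical Rule are simply the mean plus or minus k standard deviations. A minimal sketch using the mean and standard deviation from the example (variable names are illustrative):

```python
mean_time = 30  # minutes
sd_time = 6     # minutes

# Approximate share of data within k standard deviations, for bell-shaped data only.
empirical_rule = {1: "about 68%", 2: "about 95%", 3: "about 99.7%"}

intervals = {}
for k, share in empirical_rule.items():
    low = mean_time - k * sd_time
    high = mean_time + k * sd_time
    intervals[k] = (low, high)
    print(f"{share} of commute times fall between {low} and {high} minutes")
```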
Section 5: Probability
Basic Probability Concepts
Probability quantifies the likelihood of events occurring. When all outcomes are equally likely, it is calculated as the ratio of favorable outcomes to total possible outcomes.
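Probability of an Event: $P(E) = \dfrac{\text{number of outcomes in } E}{\text{total number of possible outcomes}}$
Complementary Events: $P(E^c) = 1 - P(E)$
Joint Probability (AND): For independent events, $P(A \text{ and } B) = P(A) \cdot P(B)$
Conditional Probability: $P(A \mid B) = \dfrac{P(A \text{ and } B)}{P(B)}$, provided $P(B) > 0$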
With Replacement: Each selection is independent; probabilities remain the same.
Without Replacement: Probabilities change after each selection.
Example Table: Education Levels of Residents
| Education Level | Males | Females | Total |
|---|---|---|---|
| High school only | 12 | 15 | 27 |
| Some college | 10 | 14 | 24 |
| Bachelor's degree | 8 | 11 | 19 |
| Graduate degree | 6 | 7 | 13 |
| Total | 36 | 47 | 83 |
Example Probability Calculations:
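Probability a resident is male: $P(\text{male}) = \dfrac{36}{83} \approx 0.434$
Probability a resident has a graduate degree: $P(\text{graduate}) = \dfrac{13}{83} \approx 0.157$
Probability a resident is female AND has some college: $P(\text{female and some college}) = \dfrac{14}{83} \approx 0.169$ (read directly from the table cell)
Probability at least one of two residents has a graduate degree (with replacement): $1 - \left(\dfrac{70}{83}\right)^2 = \dfrac{1989}{6889} \approx 0.289$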
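Additional info: With replacement, the selections are independent, so the same single-selection probability is multiplied for each draw; without replacement, the probabilities are still multiplied, but the counts and denominators are adjusted after each selection.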
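The table calculations can be sketched with exact fractions, which makes it easy to compare selecting two residents with and without replacement. The conditional probability and the without-replacement case are added here for comparison and are not among the worked examples in the notes; the variable names are illustrative.

```python
from fractions import Fraction

# Counts copied from the education-level table.
counts = {
    "High school only": {"Males": 12, "Females": 15},
    "Some college": {"Males": 10, "Females": 14},
    "Bachelor's degree": {"Males": 8, "Females": 11},
    "Graduate degree": {"Males": 6, "Females": 7},
}

total = sum(sum(row.values()) for row in counts.values())
males = sum(row["Males"] for row in counts.values())
females = sum(row["Females"] for row in counts.values())
grads = sum(counts["Graduate degree"].values())

p_male = Fraction(males, total)
p_grad = Fraction(grads, total)
p_female_and_some_college = Fraction(counts["Some college"]["Females"], total)

# Conditional probability: restrict the denominator to females only.
p_grad_given_female = Fraction(counts["Graduate degree"]["Females"], females)

# "At least one" is handled through the complement: 1 - P(neither has a graduate degree).
# With replacement: both selections use the same probability.
p_at_least_one_with = 1 - (1 - p_grad) ** 2

# Without replacement: one fewer non-graduate and one fewer resident on the second selection.
non_grads = total - grads
p_none_without = Fraction(non_grads, total) * Fraction(non_grads - 1, total - 1)
p_at_least_one_without = 1 - p_none_without

print(f"P(male) = {p_male} = {float(p_male):.3f}")
print(f"P(graduate) = {p_grad} = {float(p_grad):.3f}")
print(f"P(female and some college) = {p_female_and_some_college}")
print(f"P(graduate | female) = {p_grad_given_female}")
print(f"P(at least one graduate, with replacement) = {float(p_at_least_one_with):.4f}")
print(f"P(at least one graduate, without replacement) = {float(p_at_least_one_without):.4f}")
```

Using `Fraction` keeps every intermediate value exact, so rounding only happens when the results are printed.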