STA2023 Review: Data Collection, Summarizing Data, and Probability

Study Guide - Smart Notes

Section 1: Data Collection

Key Concepts in Data Collection

Data collection is the foundational step in statistics, involving the gathering of information to answer research questions. Understanding the types of data, variables, and sampling methods is essential for valid statistical analysis.

Population: The entire group of individuals or items of interest in a study.
Sample: A subset of the population selected for analysis.
Individuals: The objects described by a set of data (e.g., people, animals, things).
Parameter: A numerical summary describing a characteristic of a population.
Statistic: A numerical summary describing a characteristic of a sample.

Example: If a nutritionist selects 80 participants to test a new diet, the 80 participants are the sample, and all people who could use the diet are the population.
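The difference between a parameter and a statistic can be sketched in a few lines of Python. The population below is entirely hypothetical (1,000 made-up body weights generated from a fixed seed); only the sample size of 80 comes from the example above.

```python
import random
import statistics

# Hypothetical population: 1,000 made-up body weights in kg (not real data).
rng = random.Random(2023)  # fixed seed so the sketch is repeatable
population = [round(rng.gauss(80, 12), 1) for _ in range(1000)]

# Simple random sample of 80 individuals, drawn without replacement.
sample = rng.sample(population, 80)

parameter = statistics.mean(population)  # describes the whole population
statistic = statistics.mean(sample)      # describes only the sample

print(f"Population mean (parameter): {parameter:.2f}")
print(f"Sample mean (statistic):     {statistic:.2f}")
```

The two means are usually close but not equal, which is why a statistic is treated as an estimate of the corresponding parameter.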

Types of Variables and Data

Qualitative (Categorical) Variables: Describe qualities or categories (e.g., type of car transmission).
Quantitative Variables: Represent numerical values (e.g., number of pets owned).
Discrete Variables: Quantitative variables with countable values (e.g., number of pets).
Continuous Variables: Quantitative variables that can take any value within a range (e.g., temperature).
Levels of Measurement:
- Nominal: Categories with no order (e.g., type of transmission).
- Ordinal: Categories with a logical order (e.g., ranking tennis players).
- Interval: Ordered, equal intervals, no true zero (e.g., temperature in Celsius).
- Ratio: Ordered, equal intervals, true zero (e.g., number of pets).

Example: Ranking the top 5 tennis players is ordinal; temperature in Celsius is interval; number of pets is ratio.

Bias in Data Collection

Sampling Bias: Occurs when the sample is not representative of the population.
Nonresponse Bias: Occurs when selected individuals do not respond, so the people who do respond may differ from those who do not.
Response Bias: Occurs when respondents give inaccurate answers (e.g., to please management).

Example: If a survey is only sent to employees at headquarters, sampling bias may occur.

Section 2: Organizing and Summarizing Data

Frequency Distributions and Relative Frequency

Organizing data into tables and charts helps summarize and visualize information.

Frequency: The number of times a value or category occurs.
Relative Frequency: The proportion of the total represented by each category.
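- Formula: $\text{relative frequency} = \dfrac{\text{frequency}}{\text{sum of all frequencies}}$
Degree Measure: Used in pie charts to represent categories as angles.
- Formula: $\text{degree measure} = \text{relative frequency} \times 360^\circ$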

Example Table:

Activity	Frequency	Relative Frequency	Degree Measure
Sports	60	0.30	108°
Movies	50	0.25	90°
Shopping	40	0.20	72°
Reading	30	0.15	54°
Other	20	0.10	36°

Additional info: The frequencies total 200, so each relative frequency is frequency ÷ 200, and each degree measure is calculated as relative frequency × 360°.
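The table's last two columns can be reproduced with a short Python sketch. The frequencies are copied from the table; the variable names are illustrative only.

```python
# Frequencies copied from the activity table above.
frequencies = {"Sports": 60, "Movies": 50, "Shopping": 40, "Reading": 30, "Other": 20}

total = sum(frequencies.values())  # sum of all frequencies

summary = {}
for activity, count in frequencies.items():
    relative = count / total   # relative frequency = frequency / total
    degrees = relative * 360   # pie-chart angle in degrees
    summary[activity] = (relative, degrees)
    print(f"{activity:<10}{count:>5}{relative:>8.2f}{degrees:>8.0f}°")
```

As a check, the relative frequencies should add up to 1 and the degree measures to 360°.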

Class Intervals and Frequency Distributions

Class Interval: A range of values grouped together in a frequency distribution.
Class Limits: The smallest and largest values in each class.
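Class Boundaries: The values midway between the upper limit of one class and the lower limit of the next, so that classes meet without gaps (e.g., 49.5 separates the classes 40–49 and 50–59).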

Example: For quiz scores grouped as 40–49, 50–59, etc., the lower class limit of the 3rd class (60–69) is 60, and the upper class limit is 69.
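A minimal sketch of how class limits, class width, and class boundaries relate, using only the three quiz-score classes named in the example. It assumes the usual convention that each boundary lies halfway across the gap between adjacent classes.

```python
# Class limits from the quiz-score example (only the classes named in the text).
class_limits = [(40, 49), (50, 59), (60, 69)]

# Gap between one class's upper limit and the next class's lower limit.
gap = class_limits[1][0] - class_limits[0][1]

# Class width: difference between consecutive lower limits.
width = class_limits[1][0] - class_limits[0][0]

# Boundaries extend each limit by half the gap so adjacent classes touch.
boundaries = [(lower - gap / 2, upper + gap / 2) for lower, upper in class_limits]

for (lower, upper), (lb, ub) in zip(class_limits, boundaries):
    print(f"limits {lower}-{upper}: boundaries {lb}-{ub}")

third_lower, third_upper = class_limits[2]
print("3rd class limits:", third_lower, third_upper)
```

Note that the class width is 10, not 9: it is measured between consecutive lower limits (or between a class's two boundaries), not between the limits of a single class.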

Shapes of Distributions

Bell-shaped (Normal): Symmetrical, most data near the center.
Uniform: All values occur with roughly equal frequency, so the histogram is approximately flat.
Skewed Left: Tail on the left; most data on the right.
Skewed Right: Tail on the right; most data on the left.

Example: Salaries of employees where a few executives earn millions are skewed right.
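One common symptom of right skew is a mean that sits well above the median, because the few large values pull the mean toward the tail. The salaries below are hypothetical numbers chosen only to illustrate this; they are not from the course materials.

```python
import statistics

# Hypothetical annual salaries in dollars (made up for illustration):
# most employees earn five figures, two executives earn millions.
salaries = [42_000, 45_000, 48_000, 50_000, 52_000, 55_000, 58_000, 60_000,
            1_500_000, 3_000_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

print(f"mean   = {mean_salary:,.0f}")
print(f"median = {median_salary:,.0f}")
print("mean > median:", mean_salary > median_salary)
```

Here the median stays near a typical employee's salary while the mean is inflated by the two executives, which is why the median is often preferred for skewed data.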

Section 3: Numerically Summarizing Data

Population Parameters

Population parameters are numerical values that summarize data for an entire population.

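Population Mean ($\mu$): $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$, where $N$ is the population size.
Population Variance ($\sigma^2$): $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Population Standard Deviation ($\sigma$): $\sigma = \sqrt{\sigma^2} = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$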
Population Median: The middle value when data are ordered.
Population Mode: The value that occurs most frequently.

Example: For the commute times (in minutes) 18, 22, 25, 28, 30, 31, 33, 35, 36, 40, calculate the mean using the formula above and find the median and mode from their definitions.
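One way to check the commute-time exercise is with Python's standard `statistics` module. This sketch assumes the ten values are the entire population, so it uses the population versions of variance and standard deviation (dividing by N).

```python
import statistics

commute_times = [18, 22, 25, 28, 30, 31, 33, 35, 36, 40]  # minutes, from the example

mu = statistics.mean(commute_times)             # population mean
median = statistics.median(commute_times)       # average of the two middle values
sigma_sq = statistics.pvariance(commute_times)  # population variance (divides by N)
sigma = statistics.pstdev(commute_times)        # square root of the variance

# multimode returns every value tied for the highest count; when all values
# occur once it returns the whole list, which means there is no mode.
modes = statistics.multimode(commute_times)
has_mode = len(modes) < len(set(commute_times))

print(f"mean = {mu}, median = {median}")
print(f"variance = {sigma_sq:.2f}, standard deviation = {sigma:.2f}")
print("mode:", modes if has_mode else "none (every value occurs once)")
```

The mean is 298/10 = 29.8, the median is (30 + 31)/2 = 30.5, and there is no mode because no value repeats.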

Sample Statistics

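Sample Mean ($\bar{x}$): $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$, where $n$ is the sample size.
Sample Variance ($s^2$): $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Sample Standard Deviation ($s$): $s = \sqrt{s^2} = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$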

Example: For the sample 18, 22, 25, 28, 30, calculate the sample mean and variance.
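The sample calculation can be worked step by step as follows. This is a sketch that spells out each step by hand rather than calling a library routine, so the division by n − 1 is visible.

```python
import math

sample = [18, 22, 25, 28, 30]
n = len(sample)

x_bar = sum(sample) / n                                   # sample mean: 123 / 5
squared_deviations = [(x - x_bar) ** 2 for x in sample]   # (x_i - x_bar)^2 for each value
s_squared = sum(squared_deviations) / (n - 1)             # divide by n - 1, not n
s = math.sqrt(s_squared)                                  # sample standard deviation

print(f"sample mean     = {x_bar:.1f}")
print(f"sample variance = {s_squared:.1f}")
print(f"sample std dev  = {s:.3f}")
```

The squared deviations sum to 91.2, so the sample variance is 91.2/4 = 22.8. Dividing by 5 instead would give the population variance of the same five values, which is smaller.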

Five-Number Summary and Boxplot

Five-Number Summary: Minimum, Q1 (first quartile), Median (Q2), Q3 (third quartile), Maximum.
Interquartile Range (IQR): $\text{IQR} = Q_3 - Q_1$

Example: If Q1 = 25 and Q3 = 38 for a data set, then IQR = 38 − 25 = 13.
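A five-number summary can be sketched in Python as below, applied to the commute times from Section 3 (not the unspecified data set in the example above). It assumes the median-of-halves convention for quartiles: Q1 is the median of the lower half and Q3 the median of the upper half. Calculators and software may use other conventions and give slightly different quartiles; `five_number_summary` is a helper written for this illustration.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]          # values below the median position
    upper_half = data[half + n % 2:]  # skips the middle value when n is odd
    return (data[0], statistics.median(lower_half), statistics.median(data),
            statistics.median(upper_half), data[-1])

commute_times = [18, 22, 25, 28, 30, 31, 33, 35, 36, 40]
minimum, q1, median, q3, maximum = five_number_summary(commute_times)
iqr = q3 - q1

print("min, Q1, median, Q3, max:", minimum, q1, median, q3, maximum)
print("IQR:", iqr)
```

For these ten values the summary is 18, 25, 30.5, 35, 40, so the IQR is 35 − 25 = 10. These five numbers are exactly what a boxplot draws.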

Empirical Rule (68-95-99.7 Rule)

For bell-shaped (normal) distributions:
About 68% of data within 1 standard deviation of the mean.
About 95% within 2 standard deviations.
About 99.7% within 3 standard deviations.

Example: If mean commute time is 30 minutes and standard deviation is 6 minutes, about 95% of times are between 18 and 42 minutes.
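The three intervals of the Empirical Rule for this example can be listed with a short sketch. The mean and standard deviation come from the example; the rule applies only if commute times are roughly bell-shaped.

```python
mean = 30  # minutes
sd = 6     # minutes

# Approximate share of data within k standard deviations of the mean.
empirical_rule = {1: 0.68, 2: 0.95, 3: 0.997}

intervals = {}
for k, share in empirical_rule.items():
    intervals[k] = (mean - k * sd, mean + k * sd)
    low, high = intervals[k]
    print(f"about {share:.1%} of times fall between {low} and {high} minutes")
```

This gives 24 to 36 minutes for 68%, 18 to 42 minutes for 95%, and 12 to 48 minutes for 99.7%.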

Section 5: Probability

Basic Probability Concepts

Probability quantifies the likelihood of events occurring, on a scale from 0 (impossible) to 1 (certain). When all outcomes are equally likely, it is calculated as the ratio of favorable outcomes to total possible outcomes.

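Probability of an Event: $P(A) = \dfrac{\text{number of outcomes in } A}{\text{total number of possible outcomes}}$
Complementary Events: $P(A^c) = 1 - P(A)$
Joint Probability (AND): For independent events, $P(A \text{ and } B) = P(A) \cdot P(B)$
Conditional Probability: $P(A \mid B) = \dfrac{P(A \text{ and } B)}{P(B)}$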
With Replacement: Each selection is independent; probabilities remain the same.
Without Replacement: Probabilities change after each selection.

Example Table: Education Levels of Residents

Education Level	Males	Females	Total
High school only	12	15	27
Some college	10	14	24
Bachelor's degree	8	11	19
Graduate degree	6	7	13
Total	36	47	83

Example Probability Calculations:

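Probability a resident is male: $P(\text{male}) = \dfrac{36}{83} \approx 0.434$
Probability a resident has a graduate degree: $P(\text{graduate}) = \dfrac{13}{83} \approx 0.157$
Probability a resident is female AND has some college: $P(\text{female and some college}) = \dfrac{14}{83} \approx 0.169$
Probability at least one of two residents has a graduate degree (with replacement): $1 - P(\text{neither}) = 1 - \left(\dfrac{70}{83}\right)^2 = 1 - \dfrac{4900}{6889} \approx 0.289$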
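The single-resident probabilities listed above can be read straight off the table counts. The sketch below uses `fractions.Fraction` to keep the answers exact, and adds one conditional probability, P(graduate | female), as an extra illustration that is not in the original list.

```python
from fractions import Fraction

# Counts copied from the education table: (males, females) per level.
counts = {
    "High school only": (12, 15),
    "Some college": (10, 14),
    "Bachelor's degree": (8, 11),
    "Graduate degree": (6, 7),
}

total = sum(m + f for m, f in counts.values())    # all residents
total_males = sum(m for m, _ in counts.values())  # Males column total

p_male = Fraction(total_males, total)
p_graduate = Fraction(sum(counts["Graduate degree"]), total)
p_female_and_some_college = Fraction(counts["Some college"][1], total)

# Conditional probability: P(graduate | female) = P(graduate and female) / P(female)
p_female = 1 - p_male  # complement rule
p_graduate_and_female = Fraction(counts["Graduate degree"][1], total)
p_graduate_given_female = p_graduate_and_female / p_female

for label, p in [("P(male)", p_male),
                 ("P(graduate)", p_graduate),
                 ("P(female and some college)", p_female_and_some_college),
                 ("P(graduate | female)", p_graduate_given_female)]:
    print(f"{label} = {p} ≈ {float(p):.3f}")
```

The conditional probability simplifies to 7/47: restricting attention to the 47 females, 7 of them hold a graduate degree.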

Additional info: In both cases, multiply the probabilities of the successive selections. With replacement, each probability stays the same; without replacement, reduce the counts (numerator and denominator) after each selection.
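To see how much replacement matters, the "at least one graduate degree" question can be worked both ways. The with-replacement case is the one stated in the notes; the without-replacement case is added here only for comparison. Both use the complement rule: P(at least one) = 1 − P(none).

```python
from fractions import Fraction

total = 83                         # residents in the table
graduates = 13                     # residents with a graduate degree
non_graduates = total - graduates  # 70

# With replacement: the two draws are independent, so the probability of
# "not a graduate" is the same on both draws.
p_none_with = Fraction(non_graduates, total) ** 2
p_at_least_one_with = 1 - p_none_with

# Without replacement: after one non-graduate is drawn, both the number of
# non-graduates and the number of residents drop by one.
p_none_without = Fraction(non_graduates, total) * Fraction(non_graduates - 1, total - 1)
p_at_least_one_without = 1 - p_none_without

print(f"with replacement:    {p_at_least_one_with} ≈ {float(p_at_least_one_with):.3f}")
print(f"without replacement: {p_at_least_one_without} ≈ {float(p_at_least_one_without):.3f}")
```

The two answers are close (about 0.289 versus 0.290) because removing one resident out of 83 changes the proportions only slightly.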