MA 113: Statistics Midterm 1 Study Guide – Data Collection, Summarizing, and Relationships
Data Collection
Section 1.1: Introduction to Data
This section introduces the foundational concepts of statistics, focusing on types of data, variables, and levels of measurement.
Population: The entire group of individuals or items of interest in a study.
Sample: A subset of the population selected for analysis.
Individual: A single member of the population or sample.
Parameter: A numerical summary describing a population.
Statistic: A numerical summary describing a sample.
Variable: A characteristic or attribute that can assume different values among individuals.
Qualitative (Categorical) Variable: Describes qualities or categories (e.g., color, type).
Quantitative Variable: Describes numerical measurements (e.g., height, age).
Discrete Variable: Quantitative variable with countable values (e.g., number of clocks).
Continuous Variable: Quantitative variable with infinitely many possible values within a range (e.g., temperature).
Levels of Measurement:
Nominal: Categories with no order (e.g., gender).
Ordinal: Categories with a meaningful order but not equal intervals (e.g., rankings).
Interval: Ordered, equal intervals, but no true zero (e.g., temperature in °F).
Ratio: Like interval, but with a meaningful zero (e.g., weight, height).
Example: The number of books on a shelf (discrete, quantitative, ratio); temperature outside (continuous, quantitative, interval).
Section 1.2: Observational Studies vs Designed Experiments
This section distinguishes between methods of data collection and the implications for causation.
Observational Study: Observes individuals without influencing them.
Designed Experiment: Applies treatments to individuals and observes responses.
Explanatory Variable: Variable manipulated or categorized to observe its effect.
Response Variable: Outcome measured in the study.
Causation vs Association: Experiments can establish causation; observational studies can only suggest association (correlation).
Confounding: When effects of two variables cannot be distinguished.
Lurking Variable: A variable not included in the study that affects the response variable.
Example: A clinical trial (experiment) vs a survey of eating habits (observational study).
Section 1.3: Simple Random Sampling
Simple random sampling ensures every member of the population has an equal chance of selection.
Random Sampling: Selection based on chance.
Simple Random Sample: Every possible sample of a given size has the same chance of being chosen.
Example: Drawing names from a hat or using a random number generator.
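The random-number-generator approach can be sketched in a few lines of Python. The population of student names below is hypothetical, purely for illustration.

```python
import random

# Hypothetical population of 20 students (illustrative data).
population = [f"Student {i}" for i in range(1, 21)]

# random.sample draws without replacement, so every possible
# 5-member subset has the same chance of being chosen --
# exactly the definition of a simple random sample.
srs = random.sample(population, k=5)
print(srs)
```

Drawing names from a hat is the physical analogue of the same procedure.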
Section 1.4: Other Sampling Methods
Alternative sampling methods are used when simple random sampling is impractical.
Stratified Sample: Population divided into subgroups (strata), and random samples taken from each.
Systematic Sample: Every k-th individual is selected from a list after a random start.
Cluster Sample: Population divided into clusters, some clusters are randomly selected, and all individuals in chosen clusters are sampled.
Convenience Sample: Individuals are chosen based on ease of access.
Example: Surveying every 10th person entering a store (systematic sampling).
Formula: Systematic sampling steps: Number the population, choose a random starting point, then select every k-th member.
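The three steps above (number the population, pick a random start, take every k-th member) can be sketched as a short Python function; the population of 100 numbered shoppers is a made-up example.

```python
import random

def systematic_sample(population, k):
    """Number the population, choose a random start in [0, k),
    then select every k-th member from that start."""
    start = random.randrange(k)
    return population[start::k]

# Hypothetical list of 100 shoppers numbered 1-100.
shoppers = list(range(1, 101))
sample = systematic_sample(shoppers, k=10)  # every 10th shopper
```

Because 100 is a multiple of 10, this sample always has exactly 10 members, regardless of the random start.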
Section 1.5: Bias
Bias refers to systematic errors in data collection that can affect the validity of results.
Sampling Bias: Sample is not representative due to poor methodology.
Nonresponse Bias: Individuals selected do not respond.
Response Bias: Responses are inaccurate due to question wording or respondent behavior.
Undercoverage: Some groups are inadequately represented.
Sampling Error: Natural variability from using a sample to estimate a population parameter.
Non-sampling Error: Errors from poor data collection, processing, or measurement.
Example: Only surveying people at a gym about exercise habits (sampling bias).
Section 1.6: Experimental Design
Experimental design involves planning how to conduct an experiment to ensure valid and reliable results.
Experiment: Study where treatments are applied to subjects.
Factors: Explanatory variables manipulated in the experiment.
Treatment: Specific condition applied to subjects.
Experimental Unit/Subject: The individual receiving the treatment.
Control Group: Group receiving no treatment or a standard treatment for comparison.
Placebo: Inactive treatment used to control for psychological effects.
Blinding: Subjects (single-blind) or both subjects and experimenters (double-blind) do not know treatment assignments.
Example: Testing a new drug with a placebo group and double-blinding.
Organizing and Summarizing Data
Section 2.1: Qualitative Data
Qualitative data is summarized using tables and graphical displays.
Raw Data: Original, unprocessed data.
Frequency Distribution: Table showing counts for each category.
Relative Frequency: Proportion of observations in each category.
Bar Graph: Visual display of frequencies for categories.
Pareto Chart: Bar graph with categories ordered by frequency.
Pie Chart: Circular chart showing proportions.
Side-by-Side Bar Graphs: Compare frequencies across groups.
Formula: Relative frequency = frequency / total number of observations.
Example: Survey results on favorite ice cream flavors displayed in a bar graph.
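A frequency distribution and the relative frequencies from the formula above can be computed with `collections.Counter`; the flavor responses below are hypothetical survey data.

```python
from collections import Counter

# Hypothetical survey responses (favorite ice cream flavor).
responses = ["vanilla", "chocolate", "vanilla", "strawberry",
             "chocolate", "vanilla", "chocolate", "mint"]

freq = Counter(responses)              # frequency distribution (counts)
total = len(responses)                 # total number of observations
rel_freq = {flavor: count / total      # relative frequency = freq / total
            for flavor, count in freq.items()}

print(freq["vanilla"])       # 3
print(rel_freq["vanilla"])   # 3 / 8 = 0.375
```

The relative frequencies always sum to 1, which is a quick sanity check on any frequency table.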
Section 2.2: Quantitative Data
Quantitative data is organized into classes and displayed using histograms and frequency tables.
Class: Interval grouping of data values.
Histogram: Bar graph for quantitative data; bars touch to indicate continuous intervals.
Lower/Upper Class Limit: Smallest/largest value in a class.
Class Width: Difference between lower limits of consecutive classes.
Uniform Distribution: All values occur with similar frequency.
Bell-Shaped Distribution: Symmetrical, most data near the center.
Left/Right Skew: Tail extends to the left (negative) or right (positive).
Example: Heights of students grouped into intervals and displayed in a histogram.
Note: All classes in a histogram should have equal width.
Section 2.3: Other Displays of Quantitative Data
Additional methods for summarizing quantitative data include cumulative tables and time-series graphs.
Cumulative Frequency Distribution: Shows the total number of observations up to each class.
Cumulative Relative Frequency Distribution: Shows the proportion of observations up to each class.
Ogive: Line graph of cumulative frequencies.
Time-Series Data: Data collected over time.
Time-Series Graph: Line graph showing trends over time.
Formulas:
Cumulative frequency of class i = cumulative frequency of class (i-1) + frequency of class i
Cumulative relative frequency = cumulative frequency / total number of observations
Example: Monthly sales data plotted as a time-series graph.
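The two cumulative formulas above amount to a running total; a minimal sketch, using made-up frequencies for five classes:

```python
# Hypothetical frequencies for five classes.
frequencies = [4, 7, 10, 6, 3]
total = sum(frequencies)  # 30 observations in all

cumulative = []
running = 0
for f in frequencies:
    running += f              # cum. freq of class i = previous cum. freq + freq of class i
    cumulative.append(running)

# Cumulative relative frequency = cumulative frequency / total.
cum_rel = [c / total for c in cumulative]
print(cumulative)  # [4, 11, 21, 27, 30]
```

The last cumulative relative frequency is always 1, since every observation falls at or below the final class.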
Numerically Summarizing Data
Section 3.1: Measures of Central Tendency
Measures of central tendency describe the center of a data set.
Arithmetic Mean (Average): Sum of all values divided by the number of values.
Population mean: μ = (Σxᵢ) / N
Sample mean: x̄ = (Σxᵢ) / n
Median (M): Middle value when data is ordered.
If n is odd: the median is the single middle value, in position (n + 1)/2.
If n is even: the median is the average of the two middle values, in positions n/2 and (n/2) + 1.
Mode: Value(s) that occur most frequently. Data can be bimodal, multimodal, or have no mode.
Resistant Statistic: Not affected by extreme values (e.g., median).
Example: Test scores: 70, 80, 80, 90. Mean = 80, Median = 80, Mode = 80.
Note: Mean is sensitive to outliers; median is preferred for skewed data.
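Python's standard-library `statistics` module can verify the worked example above (test scores 70, 80, 80, 90):

```python
import statistics

scores = [70, 80, 80, 90]  # worked example from the notes

mean = statistics.mean(scores)      # (70 + 80 + 80 + 90) / 4 = 80
median = statistics.median(scores)  # n is even: average of 80 and 80 = 80
mode = statistics.mode(scores)      # most frequent value: 80
```

Replacing 90 with 900 would pull the mean up to 282.5 while leaving the median at 80, which is what "the median is resistant" means in practice.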
Section 3.2: Measures of Spread (Dispersion)
Measures of spread describe the variability in a data set.
Range (R): R = largest data value − smallest data value.
Population Standard Deviation (σ): σ = √( Σ(xᵢ − μ)² / N )
Sample Standard Deviation (s): s = √( Σ(xᵢ − x̄)² / (n − 1) )
Variance: Square of the standard deviation.
Degrees of Freedom: For sample variance, denominator is n-1.
Bias: Dividing by n − 1 (rather than n) makes the sample variance an unbiased estimator of the population variance.
Example: Data: 2, 4, 6. Mean = 4, s = √( ((2−4)² + (4−4)² + (6−4)²) / (3 − 1) ) = √4 = 2.
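The sample standard deviation formula can be traced step by step in Python, using the data 2, 4, 6 from the example above:

```python
import math

data = [2, 4, 6]
n = len(data)
mean = sum(data) / n                            # 4.0
squared_devs = [(x - mean) ** 2 for x in data]  # [4.0, 0.0, 4.0]
s = math.sqrt(sum(squared_devs) / (n - 1))      # sqrt(8 / 2) = 2.0
```

Swapping the `n - 1` for `n` would give the population standard deviation instead, which is smaller (√(8/3) ≈ 1.63 here).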
Section 3.4: Measures of Position and Outliers
Measures of position locate a value within a data set and help identify outliers.
z-score: Number of standard deviations a value is from the mean.
Population: z = (x − μ) / σ
Sample: z = (x − x̄) / s
Percentile (Pk): Value below which k% of data falls.
Quartiles: Q1 (25th percentile), Q3 (75th percentile), Median (50th percentile).
Interquartile Range (IQR): IQR = Q3 − Q1
Outliers: Values outside the fences.
Upper fence: Q3 + 1.5(IQR)
Lower fence: Q1 − 1.5(IQR)
Example: Data: 1, 2, 3, 4, 5, 6, 100. 100 is an outlier.
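The fence rule applied to the example data can be sketched in Python. Note that the quartiles here use the median-of-halves convention common in intro courses; other conventions can give slightly different Q1 and Q3.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 100]   # already sorted; 100 looks suspicious
q1 = statistics.median(data[:3])  # median of lower half {1, 2, 3} = 2
q3 = statistics.median(data[4:])  # median of upper half {5, 6, 100} = 6
iqr = q3 - q1                     # 6 - 2 = 4
upper_fence = q3 + 1.5 * iqr      # 6 + 6 = 12
lower_fence = q1 - 1.5 * iqr      # 2 - 6 = -4
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)  # [100]
```

Since 100 lies above the upper fence of 12, it is flagged as an outlier, matching the example.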
Section 3.5: 5 Number Summaries and Box Plots
The five-number summary and boxplots provide a concise summary and visual representation of data distribution.
Five-Number Summary: Minimum, Q1, Median, Q3, Maximum.
Boxplot: Graphical display of the five-number summary, showing spread and skewness.
Example: Data: 2, 4, 6, 8, 10. Five-number summary: 2, 3, 6, 9, 10 (Q1 = 3 is the median of the lower half {2, 4}; Q3 = 9 is the median of the upper half {8, 10}).
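A small helper can compute the five-number summary; as with the outlier example, this sketch uses the median-of-halves quartile convention, and other conventions may differ slightly.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves
    convention: for odd n the overall median is excluded from both halves."""
    s = sorted(data)
    n = len(s)
    mid = n // 2
    lower = s[:mid]                          # values below the median
    upper = s[mid + 1:] if n % 2 else s[mid:]  # values above the median
    return (s[0], statistics.median(lower), statistics.median(s),
            statistics.median(upper), s[-1])

summary = five_number_summary([2, 4, 6, 8, 10])
# min = 2, Q1 = 3, median = 6, Q3 = 9, max = 10
```

These five numbers are exactly what a boxplot draws: the box spans Q1 to Q3, the line inside marks the median, and the whiskers reach toward the extremes.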
Describing the Relation Between Two Variables
Section 4.1: Scatter Diagrams and Correlation
Scatter diagrams and correlation coefficients are used to analyze the relationship between two quantitative variables.
Scatter Plot/Diagram: Graph of paired (x, y) data points.
Linear vs Nonlinear Relations: Linear if points form a straight pattern; nonlinear otherwise.
Positive/Negative Association: Positive if y increases with x; negative if y decreases as x increases.
Linear Correlation Coefficient (r): Measures strength and direction of linear relationship.
Critical Value: Used to assess significance of r; if |r| exceeds the critical value, the correlation is significant.
Properties of r: −1 ≤ r ≤ 1; r = 1 or −1 indicates a perfect linear relation; r = 0 indicates no linear relation.
Correlation ≠ Causation: A significant r does not imply causation.
Example: Height and weight plotted on a scatter diagram; r = 0.85 indicates strong positive association.
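The correlation coefficient can be computed directly from the deviations-from-the-mean definition; the data below are made up, chosen to be perfectly linear so the result is easy to check by hand.

```python
import math

def correlation(xs, ys):
    """Pearson's linear correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Perfectly linear hypothetical data (y = 2x) gives r = 1.
r = correlation([1, 2, 3, 4], [2, 4, 6, 8])
```

Flipping the sign of every y-value would give r = −1: the strength of the linear relation is the same, only the direction changes.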
Section 4.2: Least Squares Regression
The least-squares regression line models the linear relationship between two variables and is used for prediction.
Least-Squares Regression Line: Line that minimizes the sum of squared residuals.
Equation: ŷ = b₁x + b₀
Slope: b₁ = r(sy / sx), where sy and sx are the sample standard deviations of y and x.
Intercept: b₀ = ȳ − b₁x̄
Predicted Value (ŷ): Value on the regression line for a given x.
Residual: Difference between observed and predicted value: residual = y − ŷ.
Interpretation of Slope: Change in y for a one-unit increase in x.
Extrapolation: Predicting for x-values outside the data range is unreliable.
Example: Given data on study hours (x) and test scores (y), the regression line can predict test scores for a given number of study hours.
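The slope and intercept formulas above can be implemented directly; the study-hours and test-score numbers below are hypothetical, invented only to show a prediction.

```python
import statistics

def least_squares(xs, ys):
    """Return (b1, b0) so that y-hat = b1*x + b0 minimizes the
    sum of squared residuals."""
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx          # slope
    b0 = ybar - b1 * xbar   # intercept: line passes through (xbar, ybar)
    return b1, b0

# Hypothetical data: study hours (x) vs test score (y).
hours = [1, 2, 3, 4, 5]
scores = [62, 66, 71, 74, 82]
b1, b0 = least_squares(hours, scores)
predicted = b1 * 3.5 + b0   # predicted score for 3.5 study hours
```

Predicting for, say, 40 study hours with this line would be extrapolation: x = 40 is far outside the observed range of 1 to 5 hours, so the prediction is unreliable.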
Appendix: Key Tables
| Level of Measurement | Type | Example |
|---|---|---|
| Nominal | Qualitative | Gender, Color |
| Ordinal | Qualitative/Quantitative | Rankings, Letter Grades |
| Interval | Quantitative | Temperature (°F) |
| Ratio | Quantitative | Height, Weight |
Additional info: This table summarizes the main levels of measurement, their types, and examples for quick reference.