MA 113: Midterm 1 Study Guide – Data Collection, Summarizing, and Exploring Data


Data Collection

Section 1.1: Introduction to Data

This section introduces the foundational vocabulary and concepts necessary for understanding how data is collected and classified in statistics.

Population: The entire group of individuals or items of interest in a study.
Sample: A subset of the population selected for analysis.
Statistic: A numerical summary of a sample.
Parameter: A numerical summary of a population.
Variable: A characteristic or attribute that can assume different values among individuals in a population.
Qualitative (Categorical) Variable: Describes qualities or categories (e.g., color, type).
Quantitative Variable: Describes quantities or amounts (e.g., height, age).
Discrete Variable: Quantitative variable with countable values (e.g., number of clocks).
Continuous Variable: Quantitative variable with infinitely many possible values within a range (e.g., temperature).
Levels of Measurement:
- Nominal: Categories with no inherent order (qualitative).
- Ordinal: Categories with a meaningful order, but the differences between categories are not meaningful (qualitative).
- Interval: Quantitative, differences have meaning, but zero is arbitrary (e.g., temperature in °F).
- Ratio: Quantitative, differences and ratios have meaning, and zero indicates absence (e.g., test scores).

Example: The number of books in a library (discrete, quantitative, ratio); the color of cars (qualitative, nominal).

Section 1.2: Observational Studies vs Designed Experiments

This section distinguishes between methods of data collection and the implications for causation and association.

Observational Study: Researchers observe outcomes without assigning treatments.
Designed Experiment: Researchers assign treatments to study their effects.
Explanatory Variable: Variable manipulated or categorized to observe its effect.
Response Variable: Outcome measured in the study.
Causation vs Association: Experiments can establish causation; observational studies can only suggest association.
Confounding Variable: Variable related to both explanatory and response variables, potentially distorting results.
Lurking Variable: Unmeasured variable influencing the relationship between studied variables.

Example: Studying the effect of a new drug (experiment) vs observing health outcomes in a population (observational study).

Section 1.3: Simple Random Sampling

Simple random sampling ensures every member of the population has an equal chance of being selected.

Random Sampling: Selection based on chance.
Simple Random Sample: Every possible sample of a given size has the same probability of selection.

Example: Drawing names from a hat to select a sample.
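
The names-in-a-hat idea can be sketched in Python. This is a minimal illustration, not a prescribed course method: the helper name `simple_random_sample`, the list of names, and the seed are all made up for the demo. `random.sample` draws without replacement, so every group of three names is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so that every size-n subset is equally likely."""
    rng = random.Random(seed)  # the seed only makes the demo repeatable
    return rng.sample(population, n)

# Hypothetical "names in a hat"
names = ["Ana", "Ben", "Cho", "Dev", "Eli", "Fay", "Gus", "Hal"]
chosen = simple_random_sample(names, 3, seed=1)
print(chosen)
```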

Section 1.4: Other Sampling Methods

Alternative sampling methods are used when simple random sampling is impractical.

Stratified Sample: Population divided into subgroups (strata), and random samples taken from each.
Systematic Sample: Every k-th individual is selected from a list after a random start.
Cluster Sample: Population divided into clusters, some clusters are randomly selected, and all individuals in chosen clusters are sampled.
Convenience Sample: Individuals are chosen based on ease of access.
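
Stratified and cluster samples are easy to mix up, so a short Python sketch may help. The function names and the toy groups below are hypothetical. The key contrast is that a stratified sample takes some members from every group, while a cluster sample takes every member from some groups.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample of n_per_stratum members from EVERY stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose n_clusters whole clusters and keep ALL of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical groups, used here both as strata and as clusters
groups = {
    "A": ["a1", "a2", "a3", "a4"],
    "B": ["b1", "b2", "b3", "b4"],
    "C": ["c1", "c2", "c3", "c4"],
}
strat = stratified_sample(groups, 2, seed=0)  # 2 members from each of A, B, C
clust = cluster_sample(groups, 2, seed=0)     # all members of 2 chosen groups
print(strat)
print(clust)
```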

Systematic Sampling Steps: (1) Number the population, (2) Choose a random starting point, (3) Select every k-th member.
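
The three steps above can be sketched in Python. This is an illustration under simple assumptions: the population is already a numbered list, the start is chosen at random among the first k members, and the helper name `systematic_sample` is made up.

```python
import random

def systematic_sample(population, k, seed=None):
    """Pick a random start among the first k members, then take every k-th one."""
    rng = random.Random(seed)
    start = rng.randrange(k)      # index 0 .. k-1
    return population[start::k]   # every k-th member from the start

# Hypothetical population numbered 1..20
roster = list(range(1, 21))
picked = systematic_sample(roster, k=5, seed=0)
print(picked)
```

With 20 members and k = 5, the sample always contains 4 members spaced 5 apart; only the starting point is random.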

Section 1.5: Bias

Bias refers to systematic errors in data collection that can affect the validity of results.

Sampling Bias: Sample is not representative due to flawed methodology.
Nonresponse Bias: Individuals selected for the sample do not respond, and their views may differ from those who do.
Response Bias: Responses are inaccurate due to question wording or respondent behavior.
Undercoverage: Some groups are inadequately represented.
Sampling Error: Natural variability from using a sample to estimate a population parameter.
Non-sampling Error: Errors from poor data collection, processing, or measurement.

Example: Surveying only daytime shoppers (sampling bias); poorly worded questions (response bias).

Section 1.6: Experimental Design

Experimental design involves planning how to assign treatments and measure outcomes to ensure valid conclusions.

Experiment: Study where treatments are assigned to observe effects.
Factors: Explanatory variables in an experiment.
Treatment: Specific condition applied to subjects.
Experimental Unit/Subject: The individual receiving the treatment.
Control Group: Group receiving no treatment or a standard treatment for comparison.
Placebo: Inactive treatment used to control for psychological effects.
Blinding: Subjects (single-blind) or both subjects and researchers (double-blind) do not know treatment assignments.

Example: Testing a new medication with a placebo group and double-blind design.

Organizing and Summarizing Data

Section 2.1: Qualitative Data

Qualitative data is summarized using tables and graphical displays to reveal patterns and distributions.

Raw Data: Unprocessed data as collected.
Frequency Distribution: Table showing counts for each category.
Relative Frequency: Proportion of observations in each category.
Bar Graph: Visual display of frequencies for categories.
Pareto Chart: Bar graph with categories ordered by frequency.
Pie Chart: Circular chart showing relative frequencies as sectors.

Formula:

Relative frequency = frequency / total number of observations

Example: Survey results on favorite ice cream flavors displayed in a bar graph and pie chart.
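
As a small illustration of the relative frequency formula, the Python sketch below tallies a made-up list of survey responses. The flavors and counts are hypothetical, not data from the course.

```python
from collections import Counter

# Hypothetical survey responses
flavors = ["vanilla", "chocolate", "vanilla", "mint",
           "chocolate", "vanilla", "mint", "vanilla"]

freq = Counter(flavors)        # frequency distribution
total = sum(freq.values())     # total number of observations
rel_freq = {flavor: count / total for flavor, count in freq.items()}

# Listing categories from most to least frequent gives the Pareto ordering
for flavor, count in freq.most_common():
    print(f"{flavor:10s} {count:2d} {rel_freq[flavor]:.2f}")
```

The relative frequencies always sum to 1, which is a quick check on the table.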

Section 2.2: Quantitative Data

Quantitative data is organized into classes and displayed using histograms to reveal distribution shapes.

Class: Range of values grouped together for analysis.
Histogram: Bar graph for quantitative data; bars touch to indicate continuous intervals.
Lower/Upper Class Limit: Smallest/largest value in a class.
Class Width: Difference between lower limits of consecutive classes.
Uniform Distribution: All values occur with similar frequency.
Bell-shaped Distribution: Symmetrical, with most values near the center.
Left/Right Skew: Tail extends to the left (negative skew) or right (positive skew).

Example: Heights of students grouped into intervals and displayed in a histogram.
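
Grouping values into classes can be sketched as follows. The heights (in cm), the first lower class limit of 150, and the class width of 10 are all hypothetical choices for illustration.

```python
# Hypothetical heights in centimeters
heights = [152, 158, 161, 163, 165, 167, 170, 171, 174, 176, 178, 183]

first_lower_limit = 150   # lower class limit of the first class
class_width = 10          # difference between consecutive lower limits
n_classes = 4

counts = [0] * n_classes
for h in heights:
    index = (h - first_lower_limit) // class_width  # which class h falls in
    counts[index] += 1

# Text version of the histogram: one row per class
for i, count in enumerate(counts):
    lower = first_lower_limit + i * class_width
    upper = lower + class_width - 1   # upper class limit for whole-number data
    print(f"{lower}-{upper}: {'#' * count}")
```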

Section 2.3: Other Displays of Quantitative Data

Additional displays help summarize cumulative information and trends over time.

Cumulative Frequency Distribution: Table showing the number of observations less than or equal to the upper limit of each class.
Cumulative Relative Frequency Distribution: Table showing the proportion of observations less than or equal to the upper limit of each class.
Ogive: Graph of cumulative frequencies.
Time-Series Data: Data collected over time.
Time-Series Graph: Line graph showing trends over time.

Formulas:

Cumulative frequency of class i = (Cumulative frequency of class i-1) + (frequency of class i), for i > 1; for i = 1, it is just the frequency of class 1.
Cumulative relative frequency = cumulative frequency / total number of observations
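
The running-total rule above can be checked with a short Python sketch. The class frequencies are hypothetical.

```python
from itertools import accumulate

# Hypothetical class frequencies, in class order
class_freqs = [2, 4, 5, 1]

# Each entry adds the current class frequency to the previous cumulative total
cum_freqs = list(accumulate(class_freqs))
total = cum_freqs[-1]                       # last cumulative frequency = n
cum_rel_freqs = [c / total for c in cum_freqs]

print(cum_freqs)
print([round(p, 3) for p in cum_rel_freqs])
```

The last cumulative frequency equals the total number of observations, so the last cumulative relative frequency is always 1.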

Example: Monthly sales data plotted as a time-series graph.

Numerically Summarizing Data

Section 3.1: Measures of Central Tendency

Measures of central tendency describe the center of a data set.

Arithmetic Mean (Average): Sum of all values divided by the number of values.
Population Mean (\( \mu \)): Mean of all values in the population.
Sample Mean (\( \overline{x} \)): Mean of values in a sample.
Median (M): Middle value when data is ordered.
Mode: Value(s) that occur most frequently.
Resistant Statistic: Not affected by extreme values (e.g., median).
Bimodal/Multimodal: A data set with two modes (bimodal) or more than two modes (multimodal).

Formulas:

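Population mean: \( \mu = \frac{\sum x_i}{N} \), where N is the population size.
Sample mean: \( \overline{x} = \frac{\sum x_i}{n} \), where n is the sample size.
Median (odd n): the value in position \( \frac{n+1}{2} \) of the ordered data.
Median (even n): the mean of the values in positions \( \frac{n}{2} \) and \( \frac{n}{2} + 1 \) of the ordered data.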

Example: Data set: 2, 4, 4, 5, 7. Mean = 4.4, Median = 4, Mode = 4.
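
The example values can be verified with Python's standard `statistics` module. This is only a check on the arithmetic, not a required tool for the course.

```python
import statistics

data = [2, 4, 4, 5, 7]

mean = statistics.mean(data)         # sum of values / number of values
median = statistics.median(data)     # middle value of the ordered data
modes = statistics.multimode(data)   # list of all most-frequent values

print(mean, median, modes)
```

`multimode` returns a list, so a bimodal data set would show two values there.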

Section 3.2: Measures of Spread (Dispersion)

Measures of spread describe the variability in a data set.

Range (R): Difference between maximum and minimum values.
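Population Standard Deviation (\( \sigma \)): Typical distance of data points from the population mean, computed as the square root of the average squared deviation.
Sample Standard Deviation (s): Typical distance of data points from the sample mean, computed like \( \sigma \) but dividing by n - 1 instead of N.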
Variance: Square of the standard deviation.
Degrees of Freedom: Number of values free to vary when estimating a parameter (n-1 for sample standard deviation).

Formulas:

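Range: \( R = \text{maximum} - \text{minimum} \)
Population standard deviation: \( \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \)
Sample standard deviation: \( s = \sqrt{\frac{\sum (x_i - \overline{x})^2}{n - 1}} \)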

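Example: Data set: 2, 4, 4, 5, 7. Range = 7 - 2 = 5. The squared deviations from the mean (4.4) sum to 13.2, so \( s = \sqrt{13.2/4} \approx 1.82 \); treating the data as a whole population instead, \( \sigma = \sqrt{13.2/5} \approx 1.62 \).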
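
For the same data set, the Python sketch below computes both standard deviations from their definitions. The only difference is the divisor: N for a population, n - 1 for a sample. The variable names are illustrative, and the `statistics` functions serve only as a cross-check.

```python
import math
import statistics

data = [2, 4, 4, 5, 7]
n = len(data)
mean = sum(data) / n

data_range = max(data) - min(data)
sum_sq_dev = sum((x - mean) ** 2 for x in data)  # sum of squared deviations

pop_sd = math.sqrt(sum_sq_dev / n)         # treat the data as a population
samp_sd = math.sqrt(sum_sq_dev / (n - 1))  # treat the data as a sample

print(data_range, round(pop_sd, 2), round(samp_sd, 2))

# Cross-check against the standard library
print(round(statistics.pstdev(data), 2), round(statistics.stdev(data), 2))
```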

Section 3.4: Measures of Position and Outliers

Measures of position locate a value within a data set and help identify outliers.

z-score: Number of standard deviations a value is from the mean.
Percentile (Pk): Value below which k% of data falls.
Quartiles (Q1, Q3): 25th and 75th percentiles, respectively.
Interquartile Range (IQR): Difference between Q3 and Q1.
Outlier: Value outside the range defined by the lower and upper fences.
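Upper Fence: \( Q_3 + 1.5(\text{IQR}) \); values above it are flagged as outliers.
Lower Fence: \( Q_1 - 1.5(\text{IQR}) \); values below it are flagged as outliers.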

Formulas:

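Population z-score: \( z = \frac{x - \mu}{\sigma} \)
Sample z-score: \( z = \frac{x - \overline{x}}{s} \)
IQR: \( \text{IQR} = Q_3 - Q_1 \)
Upper fence: \( Q_3 + 1.5(\text{IQR}) \)
Lower fence: \( Q_1 - 1.5(\text{IQR}) \)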

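Example: Data set: 2, 4, 4, 5, 7. Q1 = 4, Q3 = 5, IQR = 1, Upper fence = 5 + 1.5(1) = 6.5, Lower fence = 4 - 1.5(1) = 2.5. Both 2 and 7 fall outside the fences, so both are flagged as outliers. (These quartiles include the median in each half of the data; other quartile conventions give different values.)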
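
The fence check and a sample z-score for this data set can be sketched in Python. The quartiles Q1 = 4 and Q3 = 5 are taken from the example as given, because quartile conventions vary. The helper names `z_score` and `fences` are made up for illustration.

```python
import statistics

def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

def fences(q1, q3):
    """Lower and upper fences: 1.5 IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [2, 4, 4, 5, 7]
x_bar = statistics.mean(data)
s = statistics.stdev(data)   # sample standard deviation

q1, q3 = 4, 5                # quartiles as stated in the example
lower, upper = fences(q1, q3)
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper, outliers)
print(round(z_score(7, x_bar, s), 2))  # how unusual is the value 7?
```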

Section 3.5: 5 Number Summaries and Box Plots

The five-number summary and box plots provide a concise summary and visual representation of data distribution.

Five-number summary: Minimum, Q1, Median, Q3, Maximum.
Boxplot: Graphical display of the five-number summary, showing spread and skewness.

Example: Data set: 2, 4, 4, 5, 7. Five-number summary: 2, 4, 4, 5, 7.
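
A five-number summary can be computed with the standard library, as in the sketch below. It assumes the "inclusive" quartile method of `statistics.quantiles`, which happens to reproduce the quartiles used in this example. Other textbooks and calculators use conventions that can give different Q1 and Q3. The function name is illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

data = [2, 4, 4, 5, 7]
summary = five_number_summary(data)
print(summary)
```

In a boxplot, the box spans Q1 to Q3 with a line at the median.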

Describing the Relation Between Two Variables

Section 4.1: Scatter Diagrams and Correlation

Scatter diagrams and correlation coefficients are used to assess the relationship between two quantitative variables.

Scatter Plot/Diagram: Graph of paired (x, y) data points.
Linear vs Nonlinear Relations: Linear if points form a straight pattern; nonlinear otherwise.
Positive/Negative Association: Positive if y increases with x; negative if y decreases with x.
Linear Correlation Coefficient (r): Measures strength and direction of linear relationship (range: -1 to 1).
Critical Value: Threshold for determining statistical significance of r.

Formulas:

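Correlation coefficient: \( r = \frac{\sum \left( \frac{x_i - \overline{x}}{s_x} \right) \left( \frac{y_i - \overline{y}}{s_y} \right)}{n - 1} \), where \( s_x \) and \( s_y \) are the sample standard deviations of x and y.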

Example: Plotting students' heights vs weights and calculating r to assess the relationship.
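
The linear correlation coefficient can be computed as the sum of products of paired z-scores divided by n - 1, as in the Python sketch below. The paired data are hypothetical and the function name is illustrative.

```python
import statistics

def correlation(xs, ys):
    """Linear correlation coefficient r from paired sample z-scores."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Hypothetical paired data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 3))
```

A value of r near 1 or -1 indicates a strong linear relation; a value near 0 indicates little or no linear relation, although a nonlinear relation may still exist.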

Section 4.2: Least Squares Regression Line

The least-squares regression line models the linear relationship between two variables and is used for prediction.

Predicted Value (\( \hat{y} \)): Value estimated by the regression line for a given x.
Residual: Difference between observed and predicted values (\( y_i - \hat{y}_i \)).
Least-Squares Regression Line: Line that minimizes the sum of squared residuals.
Extrapolation: Predicting for x-values outside the observed range (unreliable).
Interpretation of Slope (b1): Predicted (average) change in y for a one-unit increase in x.
Interpretation of Intercept (b0): Predicted value when x = 0 (may not always be meaningful).

Formulas:

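Line through two points: \( y - y_1 = m(x - x_1) \), where \( m = \frac{y_2 - y_1}{x_2 - x_1} \)
Least-squares regression line: \( \hat{y} = b_1 x + b_0 \)
Slope: \( b_1 = r \cdot \frac{s_y}{s_x} \)
Intercept: \( b_0 = \overline{y} - b_1 \overline{x} \)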

Example: Given data on study hours (x) and test scores (y), calculate the regression line to predict scores based on hours studied.
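
A minimal Python sketch of this calculation follows. The study hours and scores are made-up numbers. The slope is computed as the sum of cross-deviations divided by the sum of squared x-deviations, which is algebraically the same as r times s_y / s_x. The function name is illustrative.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (b0, b1) for the line y-hat = b1*x + b0 that minimizes squared residuals."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Hypothetical data: hours studied and test scores
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 65, 71, 77]

b0, b1 = least_squares_line(hours, scores)
predicted = b1 * 3.5 + b0     # prediction inside the observed range of x
residuals = [y - (b1 * x + b0) for x, y in zip(hours, scores)]

print(round(b0, 2), round(b1, 2), round(predicted, 2))
```

Here the slope means each additional hour of study is associated with a predicted increase of about 6.1 points. Predicting for, say, 20 hours would be extrapolation, since the data only cover 1 to 5 hours.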