MA 113: Statistics Midterm 1 Study Guide – Data Collection, Summarizing, and Relationships
Data Collection
Section 1.1: Introduction to Data
This section introduces the foundational concepts of statistics, focusing on types of data, variables, and levels of measurement.
Population: The entire group of individuals or items of interest in a study.
Sample: A subset of the population selected for analysis.
Individual: A single member of the population or sample.
Parameter: A numerical summary describing a population.
Statistic: A numerical summary describing a sample.
Variable: A characteristic or attribute that can assume different values among individuals.
Qualitative (Categorical) Variable: Describes qualities or categories (e.g., color, type).
Quantitative Variable: Describes numerical measurements (e.g., height, age).
Discrete Variable: Quantitative variable with countable values (e.g., number of clocks).
Continuous Variable: Quantitative variable with infinitely many possible values within a range (e.g., temperature).
Levels of Measurement:
Nominal: Categories with no order (e.g., gender).
Ordinal: Categories with a meaningful order but not equal intervals (e.g., rankings).
Interval: Ordered, equal intervals, but no true zero (e.g., temperature in °F).
Ratio: Like interval, but with a meaningful zero (e.g., weight, height).
Example: The number of books on a shelf (discrete, quantitative, ratio); temperature outside (continuous, quantitative, interval).
Section 1.2: Observational Studies vs Designed Experiments
This section distinguishes between methods of data collection and the implications for causation.
Observational Study: Observes individuals without influencing them.
Designed Experiment: Applies treatments to individuals and observes responses.
Explanatory Variable: Variable manipulated or categorized to observe its effect.
Response Variable: Outcome measured in the study.
Causation vs Association: Experiments can establish causation; observational studies can only suggest association (correlation).
Confounding: When effects of two variables cannot be distinguished.
Lurking Variable: A variable not included in the study that affects the response variable.
Example: A clinical trial (experiment) vs a survey of eating habits (observational study).
Section 1.3: Simple Random Sampling
Simple random sampling ensures every member of the population has an equal chance of selection.
Random Sampling: Selection based on chance.
Simple Random Sample: Every possible sample of a given size has the same chance of being chosen.
Example: Drawing names from a hat or using a random number generator.
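The random-number-generator approach can be sketched in a few lines of Python. The population of student names below is hypothetical, purely for illustration.

```python
import random

# Hypothetical population of 20 students (illustrative data).
population = [f"Student {i}" for i in range(1, 21)]

# random.sample draws without replacement, so every possible
# 5-member subset has the same chance of being chosen --
# exactly the definition of a simple random sample.
srs = random.sample(population, k=5)
print(srs)
```

Drawing names from a hat is the physical analogue of the same procedure.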
Section 1.4: Other Sampling Methods
Alternative sampling methods are used when simple random sampling is impractical.
Stratified Sample: Population divided into subgroups (strata), and random samples taken from each.
Systematic Sample: Every k-th individual is selected from a list after a random start.
Cluster Sample: Population divided into clusters, some clusters are randomly selected, and all individuals in chosen clusters are sampled.
Convenience Sample: Individuals are chosen based on ease of access.
Example: Surveying every 10th person entering a store (systematic sampling).
Formula: Systematic sampling steps: Number the population, choose a random starting point, then select every k-th member.
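The three steps above (number the population, pick a random start, take every k-th member) can be sketched as a short Python function; the population of 100 numbered shoppers is a made-up example.

```python
import random

def systematic_sample(population, k):
    """Number the population, choose a random start in [0, k),
    then select every k-th member from that start."""
    start = random.randrange(k)
    return population[start::k]

# Hypothetical list of 100 shoppers numbered 1-100.
shoppers = list(range(1, 101))
sample = systematic_sample(shoppers, k=10)  # every 10th shopper
```

Because 100 is a multiple of 10, this sample always has exactly 10 members, regardless of the random start.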
Section 1.5: Bias
Bias refers to systematic errors in data collection that can affect the validity of results.
Sampling Bias: Sample is not representative due to poor methodology.
Nonresponse Bias: Individuals selected do not respond.
Response Bias: Responses are inaccurate due to question wording or respondent behavior.
Undercoverage: Some groups are inadequately represented.
Sampling Error: Natural variability from using a sample to estimate a population parameter.
Non-sampling Error: Errors from poor data collection, processing, or measurement.
Example: Only surveying people at a gym about exercise habits (sampling bias).
Section 1.6: Experimental Design
Experimental design involves planning how to conduct an experiment to ensure valid and reliable results.
Experiment: Study where treatments are applied to subjects.
Factors: Explanatory variables manipulated in the experiment.
Treatment: Specific condition applied to subjects.
Experimental Unit/Subject: The individual receiving the treatment.
Control Group: Group receiving no treatment or a standard treatment for comparison.
Placebo: Inactive treatment used to control for psychological effects.
Blinding: Subjects (single-blind) or both subjects and experimenters (double-blind) do not know treatment assignments.
Example: Testing a new drug with a placebo group and double-blinding.
Organizing and Summarizing Data
Section 2.1: Qualitative Data
Qualitative data is summarized using tables and graphical displays.
Raw Data: Original, unprocessed data.
Frequency Distribution: Table showing counts for each category.
Relative Frequency: Proportion of observations in each category.
Bar Graph: Visual display of frequencies for categories.
Pareto Chart: Bar graph with categories ordered by frequency.
Pie Chart: Circular chart showing proportions.
Side-by-Side Bar Graphs: Compare frequencies across groups.
Formula: Relative frequency = frequency / total number of observations.
Example: Survey results on favorite ice cream flavors displayed in a bar graph.
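A frequency distribution and the relative frequencies from the formula above can be computed with `collections.Counter`; the flavor responses below are hypothetical survey data.

```python
from collections import Counter

# Hypothetical survey responses (favorite ice cream flavor).
responses = ["vanilla", "chocolate", "vanilla", "strawberry",
             "chocolate", "vanilla", "chocolate", "mint"]

freq = Counter(responses)              # frequency distribution (counts)
total = len(responses)                 # total number of observations
rel_freq = {flavor: count / total      # relative frequency = freq / total
            for flavor, count in freq.items()}

print(freq["vanilla"])       # 3
print(rel_freq["vanilla"])   # 3 / 8 = 0.375
```

The relative frequencies always sum to 1, which is a quick sanity check on any frequency table.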
Section 2.2: Quantitative Data
Quantitative data is organized into classes and displayed using histograms and frequency tables.
Class: Interval grouping of data values.
Histogram: Bar graph for quantitative data; bars touch to indicate continuous intervals.
Lower/Upper Class Limit: Smallest/largest value in a class.
Class Width: Difference between lower limits of consecutive classes.
Uniform Distribution: All values occur with similar frequency.
Bell-Shaped Distribution: Symmetrical, most data near the center.
Left/Right Skew: Tail extends to the left (negative) or right (positive).
Example: Heights of students grouped into intervals and displayed in a histogram.
Note: All classes in a histogram should have equal width.
Section 2.3: Other Displays of Quantitative Data
Additional methods for summarizing quantitative data include cumulative tables and time-series graphs.
Cumulative Frequency Distribution: Shows the total number of observations up to each class.
Cumulative Relative Frequency Distribution: Shows the proportion of observations up to each class.
Ogive: Line graph of cumulative frequencies.
Time-Series Data: Data collected over time.
Time-Series Graph: Line graph showing trends over time.
Formulas:
Cumulative frequency of class i = cumulative frequency of class (i-1) + frequency of class i
Cumulative relative frequency = cumulative frequency / total number of observations
Example: Monthly sales data plotted as a time-series graph.
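The two cumulative formulas above amount to a running total; a minimal sketch, using made-up frequencies for five classes:

```python
# Hypothetical frequencies for five classes.
frequencies = [4, 7, 10, 6, 3]
total = sum(frequencies)  # 30 observations in all

cumulative = []
running = 0
for f in frequencies:
    running += f              # cum. freq of class i = previous cum. freq + freq of class i
    cumulative.append(running)

# Cumulative relative frequency = cumulative frequency / total.
cum_rel = [c / total for c in cumulative]
print(cumulative)  # [4, 11, 21, 27, 30]
```

The last cumulative relative frequency is always 1, since every observation falls at or below the final class.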
Numerically Summarizing Data
Section 3.1: Measures of Central Tendency
Measures of central tendency describe the center of a data set.
Arithmetic Mean (Average): Sum of all values divided by the number of values.
Population mean: μ = (Σxᵢ) / N
Sample mean: x̄ = (Σxᵢ) / n
Median (M): Middle value when data is ordered.
If n is odd: the median is the single middle value, in position (n + 1)/2.
If n is even: the median is the average of the two middle values, in positions n/2 and (n/2) + 1.
Mode: Value(s) that occur most frequently. Data can be bimodal, multimodal, or have no mode.
Resistant Statistic: Not affected by extreme values (e.g., median).
Example: Test scores: 70, 80, 80, 90. Mean = 80, Median = 80, Mode = 80.
Note: Mean is sensitive to outliers; median is preferred for skewed data.
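Python's standard-library `statistics` module can verify the worked example above (test scores 70, 80, 80, 90):

```python
import statistics

scores = [70, 80, 80, 90]  # worked example from the notes

mean = statistics.mean(scores)      # (70 + 80 + 80 + 90) / 4 = 80
median = statistics.median(scores)  # n is even: average of 80 and 80 = 80
mode = statistics.mode(scores)      # most frequent value: 80
```

Replacing 90 with 900 would pull the mean up to 282.5 while leaving the median at 80, which is what "the median is resistant" means in practice.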
Section 3.2: Measures of Spread (Dispersion)
Measures of spread describe the variability in a data set.
Range (R): R = largest data value − smallest data value.
Population Standard Deviation (σ): σ = √( Σ(xᵢ − μ)² / N )
Sample Standard Deviation (s): s = √( Σ(xᵢ − x̄)² / (n − 1) )
Variance: Square of the standard deviation.
Degrees of Freedom: For sample variance, denominator is n-1.
Bias: Dividing by n − 1 (rather than n) makes the sample variance an unbiased estimator of the population variance.
Example: Data: 2, 4, 6. Mean = 4, s = √( ((2−4)² + (4−4)² + (6−4)²) / (3 − 1) ) = √4 = 2.
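The sample standard deviation formula can be traced step by step in Python, using the data 2, 4, 6 from the example above:

```python
import math

data = [2, 4, 6]
n = len(data)
mean = sum(data) / n                            # 4.0
squared_devs = [(x - mean) ** 2 for x in data]  # [4.0, 0.0, 4.0]
s = math.sqrt(sum(squared_devs) / (n - 1))      # sqrt(8 / 2) = 2.0
```

Swapping the `n - 1` for `n` would give the population standard deviation instead, which is smaller (√(8/3) ≈ 1.63 here).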
Section 3.4: Measures of Position and Outliers
Measures of position locate a value within a data set and help identify outliers.
z-score: Number of standard deviations a value is from the mean.
Population: z = (x − μ) / σ
Sample: z = (x − x̄) / s
Percentile (Pk): Value below which k% of data falls.
Quartiles: Q1 (25th percentile), Q3 (75th percentile), Median (50th percentile).
Interquartile Range (IQR): IQR = Q3 − Q1
Outliers: Values outside the fences.
Upper fence: Q3 + 1.5(IQR)
Lower fence: Q1 − 1.5(IQR)
Example: Data: 1, 2, 3, 4, 5, 6, 100. 100 is an outlier.
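The fence rule applied to the example data can be sketched in Python. Note that the quartiles here use the median-of-halves convention common in intro courses; other conventions can give slightly different Q1 and Q3.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 100]   # already sorted; 100 looks suspicious
q1 = statistics.median(data[:3])  # median of lower half {1, 2, 3} = 2
q3 = statistics.median(data[4:])  # median of upper half {5, 6, 100} = 6
iqr = q3 - q1                     # 6 - 2 = 4
upper_fence = q3 + 1.5 * iqr      # 6 + 6 = 12
lower_fence = q1 - 1.5 * iqr      # 2 - 6 = -4
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)  # [100]
```

Since 100 lies above the upper fence of 12, it is flagged as an outlier, matching the example.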
Section 3.5: 5 Number Summaries and Box Plots
The five-number summary and boxplots provide a concise summary and visual representation of data distribution.
Five-Number Summary: Minimum, Q1, Median, Q3, Maximum.
Boxplot: Graphical display of the five-number summary, showing spread and skewness.
Example: Data: 2, 4, 6, 8, 10. Five-number summary: 2, 3, 6, 9, 10 (Q1 = 3 is the median of the lower half {2, 4}; Q3 = 9 is the median of the upper half {8, 10}).
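A small helper can compute the five-number summary; as with the outlier example, this sketch uses the median-of-halves quartile convention, and other conventions may differ slightly.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves
    convention: for odd n the overall median is excluded from both halves."""
    s = sorted(data)
    n = len(s)
    mid = n // 2
    lower = s[:mid]                          # values below the median
    upper = s[mid + 1:] if n % 2 else s[mid:]  # values above the median
    return (s[0], statistics.median(lower), statistics.median(s),
            statistics.median(upper), s[-1])

summary = five_number_summary([2, 4, 6, 8, 10])
# min = 2, Q1 = 3, median = 6, Q3 = 9, max = 10
```

These five numbers are exactly what a boxplot draws: the box spans Q1 to Q3, the line inside marks the median, and the whiskers reach toward the extremes.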
Describing the Relation Between Two Variables
Section 4.1: Scatter Diagrams and Correlation
Scatter diagrams and correlation coefficients are used to analyze the relationship between two quantitative variables.
Scatter Plot/Diagram: Graph of paired (x, y) data points.
Linear vs Nonlinear Relations: Linear if points form a straight pattern; nonlinear otherwise.
Positive/Negative Association: Positive if y increases with x; negative if y decreases as x increases.
Linear Correlation Coefficient (r): Measures strength and direction of linear relationship.
Critical Value: Used to assess significance of r; if |r| exceeds the critical value, the correlation is significant.
Properties of r: −1 ≤ r ≤ 1; r = 1 or −1 indicates a perfect linear relation; r = 0 indicates no linear relation.
Correlation ≠ Causation: A significant r does not imply causation.
Example: Height and weight plotted on a scatter diagram; r = 0.85 indicates strong positive association.
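The correlation coefficient can be computed directly from the deviations-from-the-mean definition; the data below are made up, chosen to be perfectly linear so the result is easy to check by hand.

```python
import math

def correlation(xs, ys):
    """Pearson's linear correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Perfectly linear hypothetical data (y = 2x) gives r = 1.
r = correlation([1, 2, 3, 4], [2, 4, 6, 8])
```

Flipping the sign of every y-value would give r = −1: the strength of the linear relation is the same, only the direction changes.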
Section 4.2: Least Squares Regression
The least-squares regression line models the linear relationship between two variables and is used for prediction.
Least-Squares Regression Line: Line that minimizes the sum of squared residuals.
Equation: ŷ = b₁x + b₀
Slope: b₁ = r(sy / sx), where sy and sx are the sample standard deviations of y and x.
Intercept: b₀ = ȳ − b₁x̄
Predicted Value (ŷ): Value on the regression line for a given x.
Residual: Difference between observed and predicted value: residual = y − ŷ.
Interpretation of Slope: Change in y for a one-unit increase in x.
Extrapolation: Predicting for x-values outside the data range is unreliable.
Example: Given data on study hours (x) and test scores (y), the regression line can predict test scores for a given number of study hours.
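The slope and intercept formulas above can be implemented directly; the study-hours and test-score numbers below are hypothetical, invented only to show a prediction.

```python
import statistics

def least_squares(xs, ys):
    """Return (b1, b0) so that y-hat = b1*x + b0 minimizes the
    sum of squared residuals."""
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx          # slope
    b0 = ybar - b1 * xbar   # intercept: line passes through (xbar, ybar)
    return b1, b0

# Hypothetical data: study hours (x) vs test score (y).
hours = [1, 2, 3, 4, 5]
scores = [62, 66, 71, 74, 82]
b1, b0 = least_squares(hours, scores)
predicted = b1 * 3.5 + b0   # predicted score for 3.5 study hours
```

Predicting for, say, 40 study hours with this line would be extrapolation: x = 40 is far outside the observed range of 1 to 5 hours, so the prediction is unreliable.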
Appendix: Key Tables
| Level of Measurement | Type | Example |
|---|---|---|
| Nominal | Qualitative | Gender, Color |
| Ordinal | Qualitative/Quantitative | Rankings, Letter Grades |
| Interval | Quantitative | Temperature (°F) |
| Ratio | Quantitative | Height, Weight |
Additional info: This table summarizes the main levels of measurement, their types, and examples for quick reference.