MA 113: Midterm 1 Study Guide – Data Collection, Summarizing, and Exploring Data
Data Collection
Section 1.1: Introduction to Data
This section introduces the foundational vocabulary and concepts necessary for understanding how data is collected and classified in statistics.
Population: The entire group of individuals or items of interest in a study.
Sample: A subset of the population selected for analysis.
Statistic: A numerical summary of a sample.
Parameter: A numerical summary of a population.
Variable: A characteristic or attribute that can assume different values among individuals in a population.
Qualitative (Categorical) Variable: Describes qualities or categories (e.g., color, type).
Quantitative Variable: Describes quantities or amounts (e.g., height, age).
Discrete Variable: Quantitative variable with countable values (e.g., number of clocks).
Continuous Variable: Quantitative variable with infinitely many possible values within a range (e.g., temperature).
Levels of Measurement:
Nominal: Categories with no inherent order (qualitative).
Ordinal: Categories with a meaningful order, but the differences between them are not meaningful (qualitative or quantitative; e.g., letter grades).
Interval: Quantitative, differences have meaning, but zero is arbitrary (e.g., temperature in °F).
Ratio: Quantitative, differences and ratios have meaning, and zero indicates absence of the quantity (e.g., height, weight).
Example: The number of books in a library (discrete, quantitative, ratio); the color of cars (qualitative, nominal).
Section 1.2: Observational Studies vs Designed Experiments
This section distinguishes between methods of data collection and the implications for causation and association.
Observational Study: Researchers observe outcomes without assigning treatments.
Designed Experiment: Researchers assign treatments to study their effects.
Explanatory Variable: Variable manipulated or categorized to observe its effect.
Response Variable: Outcome measured in the study.
Causation vs Association: Experiments can establish causation; observational studies can only suggest association.
Confounding Variable: Variable related to both explanatory and response variables, potentially distorting results.
Lurking Variable: Unmeasured variable influencing the relationship between studied variables.
Example: Studying the effect of a new drug (experiment) vs observing health outcomes in a population (observational study).
Section 1.3: Simple Random Sampling
Simple random sampling ensures every member of the population has an equal chance of being selected.
Random Sampling: Selection based on chance.
Simple Random Sample: Every possible sample of a given size has the same probability of selection.
Example: Drawing names from a hat to select a sample.
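The "names from a hat" idea can be sketched in Python. This is only an illustration: the population of ID numbers 1 to 20, the sample size of 5, and the seed are assumptions, not part of the course material.

```python
import random

# Hypothetical population: ID numbers 1 through 20.
population = list(range(1, 21))

# A fixed seed is used only so the sketch gives the same draw each run.
rng = random.Random(113)

# random.sample draws without replacement, so every possible group of
# 5 distinct members is equally likely to be chosen.
sample = rng.sample(population, k=5)
print(sample)
```

Because the draw is without replacement, no individual can appear in the sample twice.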
Section 1.4: Other Sampling Methods
Alternative sampling methods are used when simple random sampling is impractical.
Stratified Sample: Population divided into subgroups (strata), and random samples taken from each.
Systematic Sample: Every k-th individual is selected from a list after a random start.
Cluster Sample: Population divided into clusters, some clusters are randomly selected, and all individuals in chosen clusters are sampled.
Convenience Sample: Individuals are chosen based on ease of access.
Systematic Sampling Steps: (1) Number the population, (2) Choose a random starting point, (3) Select every k-th member.
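The systematic sampling steps, together with the stratified sample defined above, can be sketched as follows. The function names, the population of 100 numbered members, and the two strata are hypothetical choices made for illustration.

```python
import random


def systematic_sample(population, k, rng):
    """Select every k-th member after a random start among the first k."""
    start = rng.randrange(k)      # step 2: random starting index, 0 to k-1
    return population[start::k]   # step 3: take every k-th member from there


def stratified_sample(strata, n_per_stratum, rng):
    """Take a simple random sample of the same size from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}


rng = random.Random(0)  # fixed seed only so the sketch is repeatable

population = list(range(1, 101))  # step 1: members numbered 1 to 100
print(systematic_sample(population, k=10, rng=rng))

# Hypothetical strata: two subgroups of the same numbered population.
strata = {"first-year": list(range(1, 61)), "transfer": list(range(61, 101))}
print(stratified_sample(strata, n_per_stratum=3, rng=rng))
```

A cluster sample would differ from the stratified version in one key way: it would randomly pick whole groups and keep everyone in them, rather than sampling within every group.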
Section 1.5: Bias
Bias refers to systematic errors in data collection that can affect the validity of results.
Sampling Bias: Sample is not representative due to flawed methodology.
Nonresponse Bias: Individuals selected for the sample do not respond, and their views may differ from those who do.
Response Bias: Responses are inaccurate due to question wording or respondent behavior.
Undercoverage: Some groups are inadequately represented.
Sampling Error: Natural variability from using a sample to estimate a population parameter.
Non-sampling Error: Errors from poor data collection, processing, or measurement.
Example: Surveying only daytime shoppers (sampling bias); poorly worded questions (response bias).
Section 1.6: Experimental Design
Experimental design involves planning how to assign treatments and measure outcomes to ensure valid conclusions.
Experiment: Study where treatments are assigned to observe effects.
Factors: Explanatory variables in an experiment.
Treatment: Specific condition applied to subjects.
Experimental Unit/Subject: The individual receiving the treatment.
Control Group: Group receiving no treatment or a standard treatment for comparison.
Placebo: Inactive treatment used to control for psychological effects.
Blinding: Subjects (single-blind) or both subjects and researchers (double-blind) do not know treatment assignments.
Example: Testing a new medication with a placebo group and double-blind design.
Organizing and Summarizing Data
Section 2.1: Qualitative Data
Qualitative data is summarized using tables and graphical displays to reveal patterns and distributions.
Raw Data: Unprocessed data as collected.
Frequency Distribution: Table showing counts for each category.
Relative Frequency: Proportion of observations in each category.
Bar Graph: Visual display of frequencies for categories.
Pareto Chart: Bar graph with categories ordered by frequency.
Pie Chart: Circular chart showing relative frequencies as sectors.
Formula:
Relative frequency = frequency / total number of observations
Example: Survey results on favorite ice cream flavors displayed in a bar graph and pie chart.
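As a rough sketch of the relative frequency formula, the following builds a frequency and relative frequency table from raw categorical data. The eight survey responses are made up for illustration.

```python
from collections import Counter

# Hypothetical raw data: favorite ice cream flavor of 8 respondents.
responses = ["vanilla", "chocolate", "vanilla", "strawberry",
             "chocolate", "vanilla", "mint", "vanilla"]

frequency = Counter(responses)            # count per category
total = sum(frequency.values())           # total number of observations

# relative frequency = frequency / total number of observations
relative_frequency = {flavor: count / total
                      for flavor, count in frequency.items()}

# most_common() lists categories from highest to lowest frequency,
# which is the ordering a Pareto chart uses.
for flavor, count in frequency.most_common():
    print(f"{flavor:<12}{count:>3}{relative_frequency[flavor]:>8.3f}")
```

The relative frequencies always add up to 1, which is a quick check on the table.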
Section 2.2: Quantitative Data
Quantitative data is organized into classes and displayed using histograms to reveal distribution shapes.
Class: Range of values grouped together for analysis.
Histogram: Bar graph for quantitative data; bars touch to indicate continuous intervals.
Lower/Upper Class Limit: Smallest/largest value in a class.
Class Width: Difference between lower limits of consecutive classes.
Uniform Distribution: All values occur with similar frequency.
Bell-shaped Distribution: Symmetrical, with most values near the center.
Left/Right Skew: Tail extends to the left (negative skew) or right (positive skew).
Example: Heights of students grouped into intervals and displayed in a histogram.
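One way to see how classes, class limits, and class width fit together is to group a small data set by hand. The heights below, the first lower class limit of 60, and the class width of 5 are assumed values, and the upper limits are computed as if the data were whole numbers.

```python
# Hypothetical heights in inches (whole numbers).
heights = [61, 63, 64, 66, 66, 67, 68, 68, 69, 70, 71, 72, 74]

first_lower_limit = 60   # lower class limit of the first class (assumed)
class_width = 5          # difference between consecutive lower limits

# Tally each height into the class whose lower limit is at or just below it.
counts = {}
for h in heights:
    lower = first_lower_limit + ((h - first_lower_limit) // class_width) * class_width
    counts[lower] = counts.get(lower, 0) + 1

# Print a text histogram: one row per class, one star per observation.
for lower in sorted(counts):
    upper = lower + class_width - 1   # upper class limit for whole-number data
    print(f"{lower}-{upper}: {'*' * counts[lower]}")
```

The tallest row sits in the middle class, so this small sample looks roughly bell-shaped rather than skewed.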
Section 2.3: Other Displays of Quantitative Data
Additional displays help summarize cumulative information and trends over time.
Cumulative Frequency Distribution: Table showing the number of observations less than or equal to the upper limit of each class.
Cumulative Relative Frequency Distribution: Table showing the proportion of observations less than or equal to the upper limit of each class.
Ogive: Graph of cumulative frequencies.
Time-Series Data: Data collected over time.
Time-Series Graph: Line graph showing trends over time.
Formulas:
Cumulative frequency of class i = (Cumulative frequency of class i-1) + (frequency of class i), for i > 1; for i = 1, it is just the frequency of class 1.
Cumulative relative frequency = cumulative frequency / total number of observations
Example: Monthly sales data plotted as a time-series graph.
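The cumulative formulas amount to a running total of the class frequencies. A minimal sketch, using three hypothetical class frequencies:

```python
from itertools import accumulate

# Hypothetical frequencies for three consecutive classes.
class_frequencies = [3, 6, 4]

# Running total: each entry adds the current class to everything before it.
cumulative = list(accumulate(class_frequencies))

total = cumulative[-1]  # the last cumulative frequency is the total count
cumulative_relative = [c / total for c in cumulative]

print(cumulative)
print(cumulative_relative)
```

The last cumulative relative frequency is always 1, since every observation falls at or below the upper limit of the final class.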
Numerically Summarizing Data
Section 3.1: Measures of Central Tendency
Measures of central tendency describe the center of a data set.
Arithmetic Mean (Average): Sum of all values divided by the number of values.
Population Mean (\( \mu \)): Mean of all values in the population.
Sample Mean (\( \overline{x} \)): Mean of values in a sample.
Median (M): Middle value when data is ordered.
Mode: Value(s) that occur most frequently.
Resistant Statistic: Not affected by extreme values (e.g., median).
Bimodal/Multimodal: A data set with two modes (bimodal) or more than two modes (multimodal).
Formulas:
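Population mean: \( \mu = \frac{\sum x_i}{N} \), where N is the population size
Sample mean: \( \overline{x} = \frac{\sum x_i}{n} \), where n is the sample size
Median (odd n): the value in position \( \frac{n+1}{2} \) of the ordered data
Median (even n): the mean of the values in positions \( \frac{n}{2} \) and \( \frac{n}{2} + 1 \) of the ordered data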
Example: Data set: 2, 4, 4, 5, 7. Mean = 4.4, Median = 4, Mode = 4.
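The example can be checked with Python's standard `statistics` module. The second data set (with an even number of values) is an added illustration, not from the notes.

```python
import statistics

data = [2, 4, 4, 5, 7]

mean = statistics.mean(data)        # sum of values / number of values
median = statistics.median(data)    # middle value of the ordered data
modes = statistics.multimode(data)  # list of all most-frequent values

print(mean, median, modes)

# With an even number of values, the median is the mean of the two middle ones.
even_data = [2, 4, 5, 7]
even_median = statistics.median(even_data)
print(even_median)
```

`multimode` returns a list, so a bimodal data set would show both modes instead of just one.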
Section 3.2: Measures of Spread (Dispersion)
Measures of spread describe the variability in a data set.
Range (R): Difference between maximum and minimum values.
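Population Standard Deviation (\( \sigma \)): Typical distance of data points from the population mean, computed as the square root of the average squared deviation.
Sample Standard Deviation (s): Typical distance of data points from the sample mean, computed from the squared deviations with n-1 in the denominator.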
Variance: Square of the standard deviation.
Degrees of Freedom: Number of values free to vary when estimating a parameter (n-1 for sample standard deviation).
Formulas:
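Range: \( R = \text{maximum} - \text{minimum} \)
Population standard deviation: \( \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \)
Sample standard deviation: \( s = \sqrt{\frac{\sum (x_i - \overline{x})^2}{n-1}} \)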
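Example: Data set: 2, 4, 4, 5, 7. Range = 7 - 2 = 5. The squared deviations from the mean (4.4) sum to 13.2, so \( s = \sqrt{13.2/4} \approx 1.82 \) if the data is treated as a sample and \( \sigma = \sqrt{13.2/5} \approx 1.62 \) if it is treated as a population.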
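A short sketch of the spread calculations for the same data set, written out step by step so the difference between dividing by N and by n-1 is visible. The variable names are illustrative.

```python
import math

data = [2, 4, 4, 5, 7]
n = len(data)

data_range = max(data) - min(data)            # maximum - minimum

mean = sum(data) / n
sum_sq = sum((x - mean) ** 2 for x in data)   # sum of squared deviations

sigma = math.sqrt(sum_sq / n)        # population formula: divide by N
s = math.sqrt(sum_sq / (n - 1))      # sample formula: divide by n - 1

print(data_range, round(sigma, 2), round(s, 2))
```

The standard library functions `statistics.pstdev` and `statistics.stdev` compute the population and sample versions directly and should agree with these values.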
Section 3.4: Measures of Position and Outliers
Measures of position locate a value within a data set and help identify outliers.
z-score: Number of standard deviations a value is from the mean.
Percentile (Pk): Value at or below which k% of the data falls.
Quartiles (Q1, Q3): 25th and 75th percentiles, respectively.
Interquartile Range (IQR): Difference between Q3 and Q1.
Outlier: Value outside the range defined by the lower and upper fences.
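Upper Fence: Cutoff above which a value is considered an outlier.
Lower Fence: Cutoff below which a value is considered an outlier.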
Formulas:
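Population z-score: \( z = \frac{x - \mu}{\sigma} \)
Sample z-score: \( z = \frac{x - \overline{x}}{s} \)
IQR: \( \text{IQR} = Q_3 - Q_1 \)
Upper fence: \( Q_3 + 1.5 \cdot \text{IQR} \)
Lower fence: \( Q_1 - 1.5 \cdot \text{IQR} \)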
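Example: Data set: 2, 4, 4, 5, 7. With the median counted in both halves, Q1 = 4 and Q3 = 5, so IQR = 1, Upper fence = 5 + 1.5(1) = 6.5, and Lower fence = 4 - 1.5(1) = 2.5. Under these fences, 2 and 7 are both flagged as outliers. Quartile conventions vary: excluding the median from each half gives Q1 = 3 and Q3 = 6 instead.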
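The fence and z-score calculations can be sketched as below. The choice of `method="inclusive"` is an assumption made because it reproduces the quartiles stated in the example for this data set; the course may compute quartiles by a different convention.

```python
import statistics

data = [2, 4, 4, 5, 7]

# For this data set, the "inclusive" method gives Q1 = 4 and Q3 = 5.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A value is an outlier if it lies outside the fences.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)

# Sample z-score: how many sample standard deviations 7 is above the mean.
x_bar = statistics.mean(data)
s = statistics.stdev(data)
z_for_7 = (7 - x_bar) / s
print(round(z_for_7, 2))
```

Note that 7 is only about 1.4 standard deviations above the mean even though it falls outside the upper fence, so the two measures of position can give different impressions on a small data set.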
Section 3.5: 5 Number Summaries and Box Plots
The five-number summary and box plots provide a concise summary and visual representation of data distribution.
Five-number summary: Minimum, Q1, Median, Q3, Maximum.
Boxplot: Graphical display of the five-number summary, showing spread and skewness.
Example: Data set: 2, 4, 4, 5, 7. Five-number summary: 2, 4, 4, 5, 7.
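A minimal sketch of the five-number summary in Python. Quartile rules differ between textbooks and software, so both of the standard library's methods are shown; which one matches the course's rule is an assumption to check against class notes.

```python
import statistics

data = sorted([2, 4, 4, 5, 7])

# For this data set, method="inclusive" reproduces the quartiles in the example.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number_summary = (min(data), q1, median, q3, max(data))
print(five_number_summary)

# method="exclusive" (Python's default) gives different quartiles here.
q1_alt, _, q3_alt = statistics.quantiles(data, n=4, method="exclusive")
print(q1_alt, q3_alt)
```

A boxplot drawn from the first summary would have its box from 4 to 5 with the median line on the left edge of the box.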
Describing the Relation Between Two Variables
Section 4.1: Scatter Diagrams and Correlation
Scatter diagrams and correlation coefficients are used to assess the relationship between two quantitative variables.
Scatter Plot/Diagram: Graph of paired (x, y) data points.
Linear vs Nonlinear Relations: Linear if points form a straight pattern; nonlinear otherwise.
Positive/Negative Association: Positive if y increases with x; negative if y decreases with x.
Linear Correlation Coefficient (r): Measures strength and direction of linear relationship (range: -1 to 1).
Critical Value: Threshold for determining statistical significance of r.
Formulas:
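Correlation coefficient: \( r = \frac{\sum \left( \frac{x_i - \overline{x}}{s_x} \right) \left( \frac{y_i - \overline{y}}{s_y} \right)}{n-1} \), where \( s_x \) and \( s_y \) are the sample standard deviations of x and y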
Example: Plotting students' heights vs weights and calculating r to assess the relationship.
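As an illustration of the correlation coefficient, the sketch below multiplies the z-scores of each (x, y) pair, adds the products, and divides by n-1. The five height and weight pairs are invented for the example, and `correlation` is a hypothetical helper name.

```python
import statistics


def correlation(xs, ys):
    """Linear correlation coefficient r from paired sample data."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    # Sum of (z-score of x) * (z-score of y), divided by n - 1.
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: heights (inches) and weights (pounds) of 5 students.
heights = [62, 64, 66, 68, 70]
weights = [120, 136, 140, 155, 164]

r = correlation(heights, weights)
print(round(r, 3))
```

A value of r close to 1 indicates a strong positive linear association; it does not by itself show that height causes weight.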
Section 4.2: Least Squares Regression Line
The least-squares regression line models the linear relationship between two variables and is used for prediction.
Predicted Value (\( \hat{y} \)): Value estimated by the regression line for a given x.
Residual: Difference between observed and predicted values (\( y_i - \hat{y}_i \)).
Least-Squares Regression Line: Line that minimizes the sum of squared residuals.
Extrapolation: Predicting for x-values outside the observed range (unreliable).
Interpretation of Slope (b1): Predicted change in y, on average, for a one-unit increase in x.
Interpretation of Intercept (b0): Predicted value of y when x = 0 (meaningful only if x = 0 makes sense and lies within or near the observed data).
Formulas:
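Line through two points: \( y - y_1 = \frac{y_2 - y_1}{x_2 - x_1} (x - x_1) \)
Least-squares regression line: \( \hat{y} = b_1 x + b_0 \)
Slope: \( b_1 = r \cdot \frac{s_y}{s_x} \)
Intercept: \( b_0 = \overline{y} - b_1 \overline{x} \)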
Example: Given data on study hours (x) and test scores (y), calculate the regression line to predict scores based on hours studied.
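The study-hours example can be worked through in a short sketch. The five data pairs are made up, and the slope is computed from r and the standard deviations, then the intercept from the two means.

```python
import statistics

# Hypothetical data: hours studied (x) and test score (y) for 5 students.
hours = [1, 2, 3, 4, 5]
scores = [65, 70, 78, 82, 90]
n = len(hours)

x_bar, y_bar = statistics.mean(hours), statistics.mean(scores)
s_x, s_y = statistics.stdev(hours), statistics.stdev(scores)

# Linear correlation coefficient from the products of z-scores.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(hours, scores)) / (n - 1)

b1 = r * s_y / s_x          # slope
b0 = y_bar - b1 * x_bar     # intercept


def predict(x):
    """Predicted score (y-hat) for x hours of study."""
    return b1 * x + b0


# Residual = observed y - predicted y, one per data point.
residuals = [y - predict(x) for x, y in zip(hours, scores)]

print(round(b1, 2), round(b0, 2), round(predict(3.5), 2))
```

For this data the line is approximately y-hat = 6.2x + 58.4, so each extra hour of study is associated with a predicted increase of about 6.2 points. Predicting for 20 hours would be extrapolation, since the observed x-values only run from 1 to 5.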