MA 113: Midterm 1 Study Guide – Data Collection, Summarizing, and Exploring Data

Data Collection

Section 1.1: Introduction to Data

This section introduces the foundational vocabulary and concepts necessary for understanding how data is collected and classified in statistics.

  • Population: The entire group of individuals or items of interest in a study.

  • Sample: A subset of the population selected for analysis.

  • Statistic: A numerical summary of a sample.

  • Parameter: A numerical summary of a population.

  • Variable: A characteristic or attribute that can assume different values among individuals in a population.

  • Qualitative (Categorical) Variable: Describes qualities or categories (e.g., color, type).

  • Quantitative Variable: Describes quantities or amounts (e.g., height, age).

  • Discrete Variable: Quantitative variable with countable values (e.g., number of clocks).

  • Continuous Variable: Quantitative variable with infinitely many possible values within a range (e.g., temperature).

  • Levels of Measurement:

    • Nominal: Categories with no inherent order (qualitative).

    • Ordinal: Categories with a meaningful order but not evenly spaced (qualitative or quantitative).

    • Interval: Quantitative, differences have meaning, but zero is arbitrary (e.g., temperature in °F).

    • Ratio: Quantitative, differences and ratios have meaning, and zero indicates absence (e.g., test scores).

Example: The number of books in a library (discrete, quantitative, ratio); the color of cars (qualitative, nominal).

Section 1.2: Observational Studies vs Designed Experiments

This section distinguishes between methods of data collection and the implications for causation and association.

  • Observational Study: Researchers observe outcomes without assigning treatments.

  • Designed Experiment: Researchers assign treatments to study their effects.

  • Explanatory Variable: Variable manipulated or categorized to observe its effect.

  • Response Variable: Outcome measured in the study.

  • Causation vs Association: Experiments can establish causation; observational studies can only suggest association.

  • Confounding Variable: Variable related to both explanatory and response variables, potentially distorting results.

  • Lurking Variable: Unmeasured variable influencing the relationship between studied variables.

Example: Studying the effect of a new drug (experiment) vs observing health outcomes in a population (observational study).

Section 1.3: Simple Random Sampling

Simple random sampling gives every possible sample of a given size the same chance of being selected, so every member of the population is equally likely to be chosen.

  • Random Sampling: Selection based on chance.

  • Simple Random Sample: Every possible sample of a given size has the same probability of selection.

Example: Drawing names from a hat to select a sample.
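The "names from a hat" idea can be sketched in Python as below. The list of names and the `simple_random_sample` helper are hypothetical, chosen only for illustration; `random.Random.sample` draws without replacement, so each subset of the requested size is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every size-n subset is equally likely."""
    rng = random.Random(seed)  # the seed only makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical population of eight names.
names = ["Ana", "Ben", "Cho", "Dev", "Eli", "Fay", "Gus", "Hana"]
sample = simple_random_sample(names, 3, seed=1)
print(sample)
```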

Section 1.4: Other Sampling Methods

Alternative sampling methods are used when simple random sampling is impractical.

  • Stratified Sample: Population divided into subgroups (strata), and random samples taken from each.

  • Systematic Sample: Every k-th individual is selected from a list after a random start.

  • Cluster Sample: Population divided into clusters, some clusters are randomly selected, and all individuals in chosen clusters are sampled.

  • Convenience Sample: Individuals are chosen based on ease of access.
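As a rough sketch of how stratified and cluster sampling differ, the snippet below uses made-up class-year strata and dorm clusters. The helper names and data are assumptions for illustration, not part of the course materials.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample of n_per_stratum members from EVERY stratum."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters, then keep ALL members of the chosen clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

# Hypothetical strata (class year) and clusters (dorms).
strata = {"freshman": ["F1", "F2", "F3", "F4"], "senior": ["S1", "S2", "S3", "S4"]}
clusters = {"dorm_a": ["A1", "A2"], "dorm_b": ["B1", "B2"], "dorm_c": ["C1", "C2"]}

strat = stratified_sample(strata, 2, seed=0)  # 2 from each stratum
clust = cluster_sample(clusters, 1, seed=0)   # everyone in 1 dorm
print(strat, clust)
```

The key contrast: a stratified sample includes some members from every subgroup, while a cluster sample includes every member from some subgroups.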

Systematic Sampling Steps: (1) Number the population, (2) Choose a random starting point, (3) Select every k-th member.
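The three steps above can be sketched as follows, assuming a hypothetical population numbered 1 to 20 and k = 5. The `systematic_sample` helper is illustrative only.

```python
import random

def systematic_sample(population, k, seed=None):
    """Pick a random start among the first k members, then take every k-th member."""
    rng = random.Random(seed)
    start = rng.randrange(k)      # step 2: random starting index 0..k-1
    return population[start::k]   # step 3: every k-th member from the start

population = list(range(1, 21))   # step 1: members numbered 1..20
sample = systematic_sample(population, k=5, seed=0)
print(sample)
```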

Section 1.5: Bias

Bias refers to systematic errors in data collection that can affect the validity of results.

  • Sampling Bias: Sample is not representative due to flawed methodology.

  • Nonresponse Bias: Individuals selected do not respond, and those who respond may differ systematically from those who do not.

  • Response Bias: Responses are inaccurate due to question wording or respondent behavior.

  • Undercoverage: Some groups are inadequately represented.

  • Sampling Error: Natural variability from using a sample to estimate a population parameter.

  • Non-sampling Error: Errors from poor data collection, processing, or measurement.

Example: Surveying only daytime shoppers (sampling bias); poorly worded questions (response bias).

Section 1.6: Experimental Design

Experimental design involves planning how to assign treatments and measure outcomes to ensure valid conclusions.

  • Experiment: Study where treatments are assigned to observe effects.

  • Factors: Explanatory variables in an experiment.

  • Treatment: Specific condition applied to subjects.

  • Experimental Unit/Subject: The individual receiving the treatment.

  • Control Group: Group receiving no treatment or a standard treatment for comparison.

  • Placebo: Inactive treatment used to control for psychological effects.

  • Blinding: Subjects (single-blind) or both subjects and researchers (double-blind) do not know treatment assignments.

Example: Testing a new medication with a placebo group and double-blind design.

Organizing and Summarizing Data

Section 2.1: Qualitative Data

Qualitative data is summarized using tables and graphical displays to reveal patterns and distributions.

  • Raw Data: Unprocessed data as collected.

  • Frequency Distribution: Table showing counts for each category.

  • Relative Frequency: Proportion of observations in each category.

  • Bar Graph: Visual display of frequencies for categories.

  • Pareto Chart: Bar graph with categories arranged in decreasing order of frequency.

  • Pie Chart: Circular chart showing relative frequencies as sectors.

Formula:

  • Relative frequency = frequency / total number of observations

Example: Survey results on favorite ice cream flavors displayed in a bar graph and pie chart.
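A minimal sketch of the relative frequency formula, using invented survey responses (the flavor data is hypothetical). Sorting the categories by count also gives the bar order a Pareto chart would use.

```python
from collections import Counter

# Hypothetical raw data: ten survey responses.
flavors = ["vanilla", "chocolate", "vanilla", "strawberry", "chocolate",
           "vanilla", "mint", "chocolate", "vanilla", "strawberry"]

freq = Counter(flavors)            # frequency distribution
total = sum(freq.values())         # total number of observations

# relative frequency = frequency / total number of observations
rel_freq = {category: count / total for category, count in freq.items()}

# Categories in decreasing order of frequency (Pareto ordering).
pareto_order = [category for category, _ in freq.most_common()]

for category in pareto_order:
    print(f"{category:<12}{freq[category]:>3}{rel_freq[category]:>8.2f}")
```

The relative frequencies always sum to 1, which is a quick check on a hand-built table.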

Section 2.2: Quantitative Data

Quantitative data is organized into classes and displayed using histograms to reveal distribution shapes.

  • Class: Range of values grouped together for analysis.

  • Histogram: Bar graph for quantitative data; bars touch to indicate continuous intervals.

  • Lower/Upper Class Limit: Smallest/largest value in a class.

  • Class Width: Difference between lower limits of consecutive classes.

  • Uniform Distribution: All values occur with similar frequency.

  • Bell-shaped Distribution: Symmetrical, with most values near the center.

  • Left/Right Skew: Tail extends to the left (negative skew) or right (positive skew).

Example: Heights of students grouped into intervals and displayed in a histogram.
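To illustrate how classes are built, the sketch below groups hypothetical heights (in inches) into classes of width 5 starting at a lower class limit of 60, then prints a rough text histogram. The data, starting limit, and width are assumptions for illustration.

```python
# Hypothetical heights in inches.
heights = [61, 63, 64, 66, 66, 67, 68, 68, 69, 70, 71, 74]

first_lower_limit = 60   # lower class limit of the first class (assumed)
class_width = 5          # difference between consecutive lower class limits

counts = {}
for h in heights:
    # Index of the class containing h, then that class's lower limit.
    index = (h - first_lower_limit) // class_width
    lower = first_lower_limit + index * class_width
    counts[lower] = counts.get(lower, 0) + 1

# One row per class; the upper class limit is lower + width - 1 for whole-number data.
for lower in sorted(counts):
    upper = lower + class_width - 1
    print(f"{lower}-{upper}: {'*' * counts[lower]}")
```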

Section 2.3: Other Displays of Quantitative Data

Additional displays help summarize cumulative information and trends over time.

  • Cumulative Frequency Distribution: Table showing the number of observations below each class boundary.

  • Cumulative Relative Frequency Distribution: Proportion of observations below each class boundary.

  • Ogive: Graph of cumulative frequencies.

  • Time-Series Data: Data collected over time.

  • Time-Series Graph: Line graph showing trends over time.

Formulas:

  • Cumulative frequency of class i = (Cumulative frequency of class i-1) + (frequency of class i), for i > 1; for i = 1, it is just the frequency of class 1.

  • Cumulative relative frequency = cumulative frequency / total number of observations

Example: Monthly sales data plotted as a time-series graph.
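The cumulative formulas in this section amount to a running total. A short sketch, assuming three classes with made-up frequencies 3, 6, and 3:

```python
from itertools import accumulate

# Hypothetical class frequencies, in class order.
frequencies = [3, 6, 3]

# Each cumulative frequency is the previous cumulative frequency plus the class frequency.
cumulative = list(accumulate(frequencies))

total = cumulative[-1]  # the last cumulative frequency equals the total count
cumulative_relative = [c / total for c in cumulative]

print(cumulative)            # [3, 9, 12]
print(cumulative_relative)   # [0.25, 0.75, 1.0]
```

Plotting the cumulative values against the upper class limits would give the ogive.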

Numerically Summarizing Data

Section 3.1: Measures of Central Tendency

Measures of central tendency describe the center of a data set.

  • Arithmetic Mean (Average): Sum of all values divided by the number of values.

  • Population Mean (\( \mu \)): Mean of all values in the population.

  • Sample Mean (\( \overline{x} \)): Mean of values in a sample.

  • Median (M): Middle value when data is ordered.

  • Mode: Value(s) that occur most frequently.

  • Resistant Statistic: Not substantially affected by extreme values (e.g., the median is resistant; the mean is not).

  • Bimodal/Multimodal: Two or more modes.

Formulas:

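  • Population mean: \( \mu = \frac{\sum x_i}{N} \), where N is the population size.

  • Sample mean: \( \overline{x} = \frac{\sum x_i}{n} \), where n is the sample size.

  • Median (odd n): the value in position \( \frac{n+1}{2} \) of the ordered data.

  • Median (even n): the mean of the values in positions \( \frac{n}{2} \) and \( \frac{n}{2} + 1 \) of the ordered data.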

Example: Data set: 2, 4, 4, 5, 7. Mean = 4.4, Median = 4, Mode = 4.
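The example values can be checked with Python's standard `statistics` module. The second data set, with 7 replaced by 70, is an added illustration of resistance and is not from the original notes.

```python
import statistics

data = [2, 4, 4, 5, 7]

mean = statistics.mean(data)        # 22 / 5 = 4.4
median = statistics.median(data)    # middle (3rd) value of the ordered data
modes = statistics.multimode(data)  # list of all most-frequent values

print(mean, median, modes)

# Resistance: replace the largest value with an extreme one (hypothetical).
extreme = [2, 4, 4, 5, 70]
print(statistics.mean(extreme))     # the mean jumps to 17
print(statistics.median(extreme))   # the median stays at 4
```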

Section 3.2: Measures of Spread (Dispersion)

Measures of spread describe the variability in a data set.

  • Range (R): Difference between maximum and minimum values.

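  • Population Standard Deviation (\( \sigma \)): Square root of the average squared deviation from the population mean; roughly, the typical distance of data points from the mean.

  • Sample Standard Deviation (s): Square root of the sum of squared deviations from the sample mean divided by n-1; roughly, the typical distance of data points from the sample mean.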

  • Variance: Square of the standard deviation.

  • Degrees of Freedom: Number of values free to vary when estimating a parameter (n-1 for sample standard deviation).

Formulas:

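  • Range: \( R = \text{maximum} - \text{minimum} \)

  • Population standard deviation: \( \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \)

  • Sample standard deviation: \( s = \sqrt{\frac{\sum (x_i - \overline{x})^2}{n - 1}} \)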

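Example: Data set: 2, 4, 4, 5, 7. Range = 7 - 2 = 5. The squared deviations from the mean 4.4 sum to 13.2, so \( s = \sqrt{13.2/4} \approx 1.82 \) if the data are a sample, or \( \sigma = \sqrt{13.2/5} \approx 1.62 \) if they are the whole population.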
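A short sketch of these measures for the same data set, computing the sum of squared deviations by hand and then comparing against `statistics.stdev` (divides by n - 1) and `statistics.pstdev` (divides by N):

```python
import math
import statistics

data = [2, 4, 4, 5, 7]

data_range = max(data) - min(data)                    # 7 - 2 = 5

x_bar = statistics.mean(data)                         # 4.4
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)      # about 13.2

s_by_hand = math.sqrt(sum_sq_dev / (len(data) - 1))   # sample: divide by n - 1
sigma_by_hand = math.sqrt(sum_sq_dev / len(data))     # population: divide by N

s = statistics.stdev(data)        # sample standard deviation
sigma = statistics.pstdev(data)   # population standard deviation

print(data_range, round(s, 2), round(sigma, 2))
```

The sample value is always a little larger than the population value for the same data, because it divides by n - 1 rather than N.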

Section 3.4: Measures of Position and Outliers

Measures of position locate a value within a data set and help identify outliers.

  • z-score: Number of standard deviations a value is from the mean.

  • Percentile (Pk): Value below which k% of data falls.

  • Quartiles (Q1, Q3): 25th and 75th percentiles, respectively.

  • Interquartile Range (IQR): Difference between Q3 and Q1.

  • Outlier: Value outside the range defined by the lower and upper fences.

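  • Upper Fence: \( Q_3 + 1.5 \times \text{IQR} \); values above it are flagged as outliers.

  • Lower Fence: \( Q_1 - 1.5 \times \text{IQR} \); values below it are flagged as outliers.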

Formulas:

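  • Population z-score: \( z = \frac{x - \mu}{\sigma} \)

  • Sample z-score: \( z = \frac{x - \overline{x}}{s} \)

  • IQR: \( \text{IQR} = Q_3 - Q_1 \)

  • Upper fence: \( Q_3 + 1.5 \times \text{IQR} \)

  • Lower fence: \( Q_1 - 1.5 \times \text{IQR} \)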

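Example: Data set: 2, 4, 4, 5, 7. With Q1 = 4 and Q3 = 5, IQR = 5 - 4 = 1, Upper fence = 5 + 1.5(1) = 6.5, Lower fence = 4 - 1.5(1) = 2.5, so 2 and 7 fall outside the fences and are flagged as outliers. (Quartile conventions vary; a method that excludes the median from each half gives Q1 = 3 and Q3 = 6 for this data.)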
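The example can be reproduced with `statistics.quantiles`. The `"inclusive"` method is used here only because it matches the quartiles quoted in the example; the `"exclusive"` method is shown for comparison, since textbooks and software differ on how quartiles are computed. The z-score for the value 7 is an added illustration.

```python
import statistics

data = [2, 4, 4, 5, 7]

# Quartiles under the "inclusive" convention (matches Q1 = 4, Q3 = 5 above).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Sample z-score of the value 7: standard deviations above the sample mean.
z_of_7 = (7 - statistics.mean(data)) / statistics.stdev(data)

# A different convention gives different quartiles for the same data.
q1_alt, _, q3_alt = statistics.quantiles(data, n=4, method="exclusive")

print(q1, q3, iqr, lower_fence, upper_fence, outliers)
print(round(z_of_7, 2), q1_alt, q3_alt)
```

Because the fences depend on the quartiles, which values count as outliers can change with the quartile convention, so it is worth using the method your course specifies.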

Section 3.5: 5 Number Summaries and Box Plots

The five-number summary and box plots provide a concise summary and visual representation of data distribution.

  • Five-number summary: Minimum, Q1, Median, Q3, Maximum.

  • Boxplot: Graphical display of the five-number summary, showing spread and skewness.

Example: Data set: 2, 4, 4, 5, 7. Five-number summary: 2, 4, 4, 5, 7.
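A small helper for the five-number summary might look like the sketch below. `five_number_summary` is a hypothetical name, and the `"inclusive"` quartile method is an assumption chosen to agree with the example values.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # Quartile convention is an assumption; other methods can give other Q1/Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return (ordered[0], q1, median, q3, ordered[-1])

summary = five_number_summary([2, 4, 4, 5, 7])
print(summary)
```

In a boxplot, the box spans Q1 to Q3 with a line at the median, so these five numbers are all that is needed to draw the basic plot.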

Describing the Relation Between Two Variables

Section 4.1: Scatter Diagrams and Correlation

Scatter diagrams and correlation coefficients are used to assess the relationship between two quantitative variables.

  • Scatter Plot/Diagram: Graph of paired (x, y) data points.

  • Linear vs Nonlinear Relations: Linear if points form a straight pattern; nonlinear otherwise.

  • Positive/Negative Association: Positive if y increases with x; negative if y decreases with x.

  • Linear Correlation Coefficient (r): Measures strength and direction of linear relationship (range: -1 to 1).

  • Critical Value: Threshold for determining statistical significance of r.

Formulas:

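  • Correlation coefficient: \( r = \frac{\sum \left( \frac{x_i - \overline{x}}{s_x} \right) \left( \frac{y_i - \overline{y}}{s_y} \right)}{n - 1} \), where \( s_x \) and \( s_y \) are the sample standard deviations of x and y.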

Example: Plotting students' heights vs weights and calculating r to assess the relationship.
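As a sketch, r can be computed as the sum of products of z-scores divided by n - 1. The heights and weights below are invented for illustration, and `correlation` is a hypothetical helper written out by hand so each step is visible.

```python
import statistics

def correlation(xs, ys):
    """Linear correlation coefficient r from paired data, via products of z-scores."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Hypothetical paired data: heights (inches) and weights (pounds).
heights = [60, 62, 65, 68, 70, 72]
weights = [115, 120, 140, 155, 160, 180]

r = correlation(heights, weights)
print(round(r, 3))  # close to 1: strong positive linear association
```

A value of r near 1 or -1 describes how tightly the points follow a line; it does not by itself show that one variable causes the other.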

Section 4.2: Least Squares Regression Line

The least-squares regression line models the linear relationship between two variables and is used for prediction.

  • Predicted Value (\( \hat{y} \)): Value estimated by the regression line for a given x.

  • Residual: Difference between observed and predicted values (\( y_i - \hat{y}_i \)).

  • Least-Squares Regression Line: Line that minimizes the sum of squared residuals.

  • Extrapolation: Predicting for x-values outside the observed range (unreliable).

  • Interpretation of Slope (b1): Predicted (average) change in y for a one-unit increase in x.

  • Interpretation of Intercept (b0): Predicted value when x = 0 (may not always be meaningful).

Formulas:

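  • Line through two points: \( y - y_1 = m(x - x_1) \), where \( m = \frac{y_2 - y_1}{x_2 - x_1} \)

  • Least-squares regression line: \( \hat{y} = b_1 x + b_0 \)

  • Slope: \( b_1 = r \cdot \frac{s_y}{s_x} \)

  • Intercept: \( b_0 = \overline{y} - b_1 \overline{x} \)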

Example: Given data on study hours (x) and test scores (y), calculate the regression line to predict scores based on hours studied.
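A minimal sketch of that calculation, using the slope formula b1 = r(s_y / s_x) and intercept b0 = y-bar minus b1 times x-bar. The study-hours data and the `least_squares_line` helper are hypothetical.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b1 * x + b0."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    # Linear correlation coefficient r.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x          # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1

# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [60, 65, 75, 80, 90]

b0, b1 = least_squares_line(hours, scores)
predicted = b1 * 3.5 + b0       # prediction inside the observed range of x
residuals = [y - (b1 * x + b0) for x, y in zip(hours, scores)]

print(round(b1, 2), round(b0, 2), round(predicted, 2))
```

For this made-up data the slope is about 7.5 points per additional hour and the intercept about 51.5. The residuals of a least-squares line sum to zero (up to rounding), and predicting for, say, 20 hours would be extrapolation, since the data only cover 1 to 5 hours.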
