MA 113: Statistics Midterm 1 Study Guide – Data Collection, Summarizing, and Relationships

Data Collection

Section 1.1: Introduction to Data

This section introduces the foundational concepts of statistics, focusing on types of data, variables, and levels of measurement.

  • Population: The entire group of individuals or items of interest in a study.

  • Sample: A subset of the population selected for analysis.

  • Individual: A single member of the population or sample.

  • Parameter: A numerical summary describing a population.

  • Statistic: A numerical summary describing a sample.

  • Variable: A characteristic or attribute that can assume different values among individuals.

  • Qualitative (Categorical) Variable: Describes qualities or categories (e.g., color, type).

  • Quantitative Variable: Describes numerical measurements (e.g., height, age).

  • Discrete Variable: Quantitative variable with countable values (e.g., number of clocks).

  • Continuous Variable: Quantitative variable with infinitely many possible values within a range (e.g., temperature).

  • Levels of Measurement:

    • Nominal: Categories with no order (e.g., gender).

    • Ordinal: Categories with a meaningful order but not equal intervals (e.g., rankings).

    • Interval: Ordered, equal intervals, but no true zero (e.g., temperature in °F).

    • Ratio: Like interval, but with a meaningful zero (e.g., weight, height).

Example: The number of books on a shelf (discrete, quantitative, ratio); temperature outside (continuous, quantitative, interval).

Section 1.2: Observational Studies vs Designed Experiments

This section distinguishes between methods of data collection and the implications for causation.

  • Observational Study: Observes individuals without influencing them.

  • Designed Experiment: Applies treatments to individuals and observes responses.

  • Explanatory Variable: Variable manipulated or categorized to observe its effect.

  • Response Variable: Outcome measured in the study.

  • Causation vs Association: Experiments can establish causation; observational studies can only suggest association (correlation).

  • Confounding: When effects of two variables cannot be distinguished.

  • Lurking Variable: A variable not included in the study that affects the response variable.

Example: A clinical trial (experiment) vs a survey of eating habits (observational study).

Section 1.3: Simple Random Sampling

Simple random sampling gives every possible sample of a given size, and therefore every member of the population, an equal chance of selection.

  • Random Sampling: Selection based on chance.

  • Simple Random Sample: Every possible sample of a given size has the same chance of being chosen.

Example: Drawing names from a hat or using a random number generator.
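
The "random number generator" approach can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `simple_random_sample`, the 20 student IDs, and the seed are all made up for the example. The standard library's `random.sample` draws without replacement, so every subset of the requested size is equally likely.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Return n distinct individuals; every size-n subset is equally likely."""
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    return rng.sample(population, n)


# Hypothetical sampling frame: 20 numbered students.
frame = list(range(1, 21))
sample = simple_random_sample(frame, 5, seed=113)
print(sample)
```

Leaving `seed` as `None` gives a different sample on each run, which is what an actual study would want.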

Section 1.4: Other Sampling Methods

Alternative sampling methods are used when simple random sampling is impractical.

  • Stratified Sample: Population divided into subgroups (strata), and random samples taken from each.

  • Systematic Sample: Every k-th individual is selected from a list after a random start.

  • Cluster Sample: Population divided into clusters, some clusters are randomly selected, and all individuals in chosen clusters are sampled.

  • Convenience Sample: Individuals are chosen based on ease of access.

Example: Surveying every 10th person entering a store (systematic sampling).

Procedure: Systematic sampling steps: Number the population, choose a random starting point among the first k individuals, then select every k-th member after it.
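
The systematic sampling steps can be sketched as follows. The guide does not say how k is chosen; this sketch assumes k is the population size divided by the desired sample size, rounded down, and that the random start falls among the first k individuals. The function name and the customer list are hypothetical.

```python
import random


def systematic_sample(population, n, seed=None):
    """Select n individuals by taking every k-th one after a random start."""
    N = len(population)
    if not 0 < n <= N:
        raise ValueError("sample size must be between 1 and the population size")
    k = N // n                  # sampling interval (assumed: floor of N / n)
    rng = random.Random(seed)
    start = rng.randrange(k)    # random 0-based start among the first k individuals
    return [population[start + i * k] for i in range(n)]


# Hypothetical numbered population of 100 customers.
customers = list(range(1, 101))
chosen = systematic_sample(customers, 10, seed=113)
print(chosen)
```

With 100 customers and a sample of 10, k = 10, so the selected customers are always exactly 10 positions apart.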

Section 1.5: Bias

Bias refers to systematic errors in data collection that can affect the validity of results.

  • Sampling Bias: Sample is not representative due to poor methodology.

  • Nonresponse Bias: Selected individuals do not respond, and they may differ systematically from those who do.

  • Response Bias: Responses are inaccurate due to question wording or respondent behavior.

  • Undercoverage: Some groups are inadequately represented.

  • Sampling Error: Natural variability from using a sample to estimate a population parameter.

  • Non-sampling Error: Errors from poor data collection, processing, or measurement.

Example: Only surveying people at a gym about exercise habits (sampling bias).

Section 1.6: Experimental Design

Experimental design involves planning how to conduct an experiment to ensure valid and reliable results.

  • Experiment: Study where treatments are applied to subjects.

  • Factors: Explanatory variables manipulated in the experiment.

  • Treatment: Specific condition applied to subjects.

  • Experimental Unit/Subject: The individual receiving the treatment.

  • Control Group: Group receiving no treatment or a standard treatment for comparison.

  • Placebo: Inactive treatment used to control for psychological effects.

  • Blinding: Subjects (single-blind) or both subjects and experimenters (double-blind) do not know treatment assignments.

Example: Testing a new drug with a placebo group and double-blinding.

Organizing and Summarizing Data

Section 2.1: Qualitative Data

Qualitative data is summarized using tables and graphical displays.

  • Raw Data: Original, unprocessed data.

  • Frequency Distribution: Table showing counts for each category.

  • Relative Frequency: Proportion of observations in each category.

  • Bar Graph: Visual display of frequencies for categories.

  • Pareto Chart: Bar graph with categories ordered by frequency.

  • Pie Chart: Circular chart showing proportions.

  • Side-by-Side Bar Graphs: Compare frequencies across groups.

Formula: Relative frequency = frequency / total number of observations.

Example: Survey results on favorite ice cream flavors displayed in a bar graph.
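
As a small illustration of the relative frequency formula, the sketch below tallies a made-up list of survey answers. The flavor data and the function name `relative_frequencies` are hypothetical.

```python
from collections import Counter


def relative_frequencies(observations):
    """Map each category to frequency / total number of observations."""
    counts = Counter(observations)   # frequency distribution
    total = len(observations)
    return {category: count / total for category, count in counts.items()}


# Hypothetical survey responses.
flavors = ["vanilla", "chocolate", "vanilla", "strawberry",
           "chocolate", "vanilla", "mint", "vanilla"]
rel = relative_frequencies(flavors)
print(rel)
```

The relative frequencies always sum to 1, which is a quick check on any hand-built table.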

Section 2.2: Quantitative Data

Quantitative data is organized into classes and displayed using histograms and frequency tables.

  • Class: Interval grouping of data values.

  • Histogram: Bar graph for quantitative data; bars touch to indicate continuous intervals.

  • Lower/Upper Class Limit: Smallest/largest value in a class.

  • Class Width: Difference between lower limits of consecutive classes.

  • Uniform Distribution: All values occur with similar frequency.

  • Bell-Shaped Distribution: Symmetrical, most data near the center.

  • Left/Right Skew: Tail extends to the left (negative) or right (positive).

Example: Heights of students grouped into intervals and displayed in a histogram.

Note: All classes in a histogram should have equal width.
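
Grouping data into equal-width classes can be sketched as below. The heights, the starting lower limit of 150, and the width of 10 are invented for the example. The sketch treats each class as running from its lower limit up to, but not including, the next lower limit, which is one common convention for continuous data.

```python
def frequency_table(data, lower_limit, class_width, num_classes):
    """Count how many values fall in each equal-width class."""
    counts = [0] * num_classes
    for x in data:
        index = int((x - lower_limit) // class_width)  # which class x belongs to
        if 0 <= index < num_classes:
            counts[index] += 1
    return [
        (lower_limit + i * class_width, lower_limit + (i + 1) * class_width, counts[i])
        for i in range(num_classes)
    ]


# Hypothetical student heights in centimeters.
heights = [152, 158, 160, 163, 165, 167, 170, 171, 174, 178, 181, 186]
for low, high, count in frequency_table(heights, 150, 10, 4):
    print(f"{low} to under {high}: {count}")
```

The counts per class are the bar heights of the corresponding histogram.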

Section 2.3: Other Displays of Quantitative Data

Additional methods for summarizing quantitative data include cumulative tables and time-series graphs.

  • Cumulative Frequency Distribution: Shows the total number of observations up to each class.

  • Cumulative Relative Frequency Distribution: Shows the proportion of observations up to each class.

  • Ogive: Line graph of cumulative frequencies.

  • Time-Series Data: Data collected over time.

  • Time-Series Graph: Line graph showing trends over time.

Formulas:

  • Cumulative frequency of class i = cumulative frequency of class (i-1) + frequency of class i

  • Cumulative relative frequency = cumulative frequency / total number of observations

Example: Monthly sales data plotted as a time-series graph.
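
The two cumulative formulas above can be checked with a short sketch. The class frequencies used here are made up, and `cumulative_table` is a hypothetical helper name.

```python
from itertools import accumulate


def cumulative_table(frequencies):
    """Return cumulative frequencies and cumulative relative frequencies."""
    total = sum(frequencies)
    cumulative = list(accumulate(frequencies))   # running total, class by class
    cumulative_relative = [c / total for c in cumulative]
    return cumulative, cumulative_relative


# Hypothetical class frequencies from a frequency table.
cum, cum_rel = cumulative_table([2, 4, 4, 2])
print(cum)
print(cum_rel)
```

The last cumulative frequency equals the total number of observations, and the last cumulative relative frequency equals 1.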

Numerically Summarizing Data

Section 3.1: Measures of Central Tendency

Measures of central tendency describe the center of a data set.

  • Arithmetic Mean (Average): Sum of all values divided by the number of values.

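    • Population mean: $\mu = \frac{\sum x_i}{N}$

    • Sample mean: $\bar{x} = \frac{\sum x_i}{n}$

  • Median (M): Middle value when data is ordered.

    • If n is odd: M is the value in position $\frac{n+1}{2}$ of the ordered data.

    • If n is even: M is the mean of the values in positions $\frac{n}{2}$ and $\frac{n}{2} + 1$ of the ordered data.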

  • Mode: Value(s) that occur most frequently. Data can be bimodal, multimodal, or have no mode.

  • Resistant Statistic: Not substantially affected by extreme values (e.g., median).

Example: Test scores: 70, 80, 80, 90. Mean = 80, Median = 80, Mode = 80.

Note: Mean is sensitive to outliers; median is preferred for skewed data.
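
The test-score example, and the effect of an outlier on the mean, can be reproduced with Python's standard `statistics` module. The extra score of 300 is an invented extreme value added only to show resistance.

```python
import statistics

scores = [70, 80, 80, 90]
mean = statistics.mean(scores)
median = statistics.median(scores)
modes = statistics.multimode(scores)   # list of all most-frequent values
print(mean, median, modes)

# Hypothetical extreme value appended to the same data.
with_outlier = scores + [300]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)
print(mean_out, median_out)
```

The mean jumps from 80 to 124 while the median stays at 80, which is why the median is called resistant.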

Section 3.2: Measures of Spread (Dispersion)

Measures of spread describe the variability in a data set.

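  • Range (R): $R = \text{largest value} - \text{smallest value}$

  • Population Standard Deviation ($\sigma$): $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$

  • Sample Standard Deviation (s): $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$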

  • Variance: Square of the standard deviation.

  • Degrees of Freedom: For sample variance, denominator is n-1.

  • Bias: Sample variance is an unbiased estimator of population variance when dividing by n-1.

Example: Data: 2, 4, 6. Mean = 4, $s = \sqrt{\frac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3 - 1}} = \sqrt{4} = 2$.
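
The same data can be run through a short sketch that contrasts dividing by n - 1 (sample) with dividing by N (population). The function names are hypothetical; the formulas are the standard ones for sample and population standard deviation.

```python
import math


def sample_std(data):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = sum((x - mean) ** 2 for x in data)
    return math.sqrt(squared_deviations / (n - 1))


def population_std(data):
    """Population standard deviation: divide the squared deviations by N."""
    N = len(data)
    mean = sum(data) / N
    squared_deviations = sum((x - mean) ** 2 for x in data)
    return math.sqrt(squared_deviations / N)


data = [2, 4, 6]
print(sample_std(data))       # squared deviations sum to 8; 8 / 2 = 4; sqrt gives 2.0
print(population_std(data))   # 8 / 3, then square root
```

Dividing by n - 1 always gives a slightly larger value than dividing by N for the same data.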

Section 3.4: Measures of Position and Outliers

Measures of position locate a value within a data set and help identify outliers.

  • z-score: Number of standard deviations a value is from the mean.

    • Population: $z = \frac{x - \mu}{\sigma}$

    • Sample: $z = \frac{x - \bar{x}}{s}$

  • Percentile (Pk): Value below which k% of data falls.

  • Quartiles: Q1 (25th percentile), Q3 (75th percentile), Median (50th percentile).

  • Interquartile Range (IQR): $\text{IQR} = Q_3 - Q_1$

  • Outliers: Values outside the fences.

    • Upper fence: $Q_3 + 1.5 \cdot \text{IQR}$

    • Lower fence: $Q_1 - 1.5 \cdot \text{IQR}$

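Example: Data: 1, 2, 3, 4, 5, 6, 100. Taking Q1 and Q3 as the medians of the lower and upper halves gives Q1 = 2 and Q3 = 6, so IQR = 4 and the upper fence is 6 + 1.5(4) = 12. Since 100 > 12, 100 is an outlier.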
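
The outlier check for this data set can be sketched as follows. Quartile conventions vary between textbooks and software; this sketch assumes Q1 and Q3 are the medians of the lower and upper halves of the ordered data, with the overall median left out of both halves when n is odd. The function names are hypothetical.

```python
import statistics


def quartiles(data):
    """Return (Q1, median, Q3) using medians of the lower and upper halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle value when n is odd so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return statistics.median(lower), statistics.median(ordered), statistics.median(upper)


def find_outliers(data):
    """Return (outliers, lower fence, upper fence) using the 1.5 * IQR rule."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return outliers, lower_fence, upper_fence


data = [1, 2, 3, 4, 5, 6, 100]
outliers, lower_fence, upper_fence = find_outliers(data)
print(outliers, lower_fence, upper_fence)
```

Under this convention the fences are -4 and 12, so 100 is the only value flagged.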

Section 3.5: 5 Number Summaries and Box Plots

The five-number summary and boxplots provide a concise summary and visual representation of data distribution.

  • Five-Number Summary: Minimum, Q1, Median, Q3, Maximum.

  • Boxplot: Graphical display of the five-number summary, showing spread and skewness.

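Example: Data: 2, 4, 6, 8, 10. Five-number summary: 2, 3, 6, 9, 10 when Q1 and Q3 are the medians of the lower and upper halves with the overall median excluded from both; a convention that includes the median in both halves gives 2, 4, 6, 8, 10.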
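
Quartile values depend on the convention used, which matters for small data sets like this one. The sketch below computes the five-number summary with Q1 and Q3 taken as medians of the lower and upper halves (overall median excluded when n is odd), then compares it with the "inclusive" method of Python's `statistics.quantiles`. The function name `five_number_summary` is hypothetical, and which convention a course expects is an assumption to confirm against the textbook.

```python
import statistics


def five_number_summary(data):
    """Min, Q1, median, Q3, max, with quartiles as medians of the two halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Leave the middle value out of both halves when n is odd.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])


data = [2, 4, 6, 8, 10]
summary = five_number_summary(data)
inclusive = statistics.quantiles(data, n=4, method="inclusive")
print(summary)     # quartiles 3 and 9 under the halves convention
print(inclusive)   # quartiles 4 and 8 under the inclusive convention
```

Both results are legitimate; a boxplot drawn by hand and one drawn by software can differ slightly for this reason.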

Describing the Relation Between Two Variables

Section 4.1: Scatter Diagrams and Correlation

Scatter diagrams and correlation coefficients are used to analyze the relationship between two quantitative variables.

  • Scatter Plot/Diagram: Graph of paired (x, y) data points.

  • Linear vs Nonlinear Relations: Linear if points form a straight pattern; nonlinear otherwise.

  • Positive/Negative Association: Positive if y increases with x; negative if y decreases as x increases.

  • Linear Correlation Coefficient (r): Measures strength and direction of linear relationship.

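  • Critical Value: Used to assess significance of r; if $|r|$ is greater than the critical value, the correlation is significant.

  • Properties of r: $-1 \le r \le 1$; r = 1 or -1 indicates a perfect linear relation; r close to 0 indicates little or no linear relation (a nonlinear relation may still exist).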

  • Correlation ≠ Causation: A significant r does not imply causation.

Example: Height and weight plotted on a scatter diagram; r = 0.85 indicates strong positive association.
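
The guide does not print a formula for r, so the sketch below uses the standard Pearson formula: the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The height and weight values are invented for illustration, and `correlation` is a hypothetical helper name.

```python
import math


def correlation(xs, ys):
    """Pearson linear correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)   # undefined if either variable is constant


# Hypothetical heights (inches) and weights (pounds).
heights = [60, 62, 65, 68, 70, 72]
weights = [115, 120, 135, 150, 160, 175]
r = correlation(heights, weights)
print(round(r, 3))
```

Points lying exactly on an increasing line give r = 1, and points on a decreasing line give r = -1, matching the properties listed above.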

Section 4.2: Least Squares Regression

The least-squares regression line models the linear relationship between two variables and is used for prediction.

  • Least-Squares Regression Line: Line that minimizes the sum of squared residuals.

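    • Equation: $\hat{y} = b_1 x + b_0$

    • Slope: $b_1 = r \cdot \frac{s_y}{s_x}$

    • Intercept: $b_0 = \bar{y} - b_1 \bar{x}$

  • Predicted Value ($\hat{y}$): Value on the regression line for a given x.

  • Residual: Difference between observed and predicted value: $\text{residual} = y - \hat{y}$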

  • Interpretation of Slope: Predicted (average) change in y for a one-unit increase in x.

  • Extrapolation: Predicting for x-values outside the data range is unreliable.

Example: Given data on study hours (x) and test scores (y), the regression line can predict test scores for a given number of study hours.
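
A minimal sketch of fitting the line for the study-hours example follows. The hours and scores are invented, and the helper name is hypothetical. The slope is computed as the sum of products of deviations divided by the sum of squared x-deviations, which is algebraically the same as r times the ratio of the standard deviations of y and x; the intercept makes the line pass through the point of means.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # line passes through (mean_x, mean_y)
    return slope, intercept


# Hypothetical study hours and test scores.
hours = [1, 2, 3, 4, 5]
scores = [55, 60, 70, 75, 85]
slope, intercept = least_squares_line(hours, scores)

predicted = [slope * x + intercept for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]
print(slope, intercept)
print(slope * 3.5 + intercept)   # prediction inside the observed range of x
print(residuals)
```

For this made-up data the slope is 7.5, so each additional study hour is associated with a predicted increase of 7.5 points. Predicting for, say, 12 hours would be extrapolation, since the data only cover 1 to 5 hours.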

Appendix: Key Tables

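| Level of Measurement | Type                     | Example                 |
|----------------------|--------------------------|-------------------------|
| Nominal              | Qualitative              | Gender, Color           |
| Ordinal              | Qualitative/Quantitative | Rankings, Letter Grades |
| Interval             | Quantitative             | Temperature (°F)        |
| Ratio                | Quantitative             | Height, Weight          |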

Additional info: This table summarizes the main levels of measurement, their types, and examples for quick reference.
