Statistics Unit 1: Data Collection, Summarization, and Numerical Description

Study Guide - Smart Notes

Data Collection and Types of Data

Population, Sample, and Variables

Understanding the distinction between populations and samples is fundamental in statistics. A population is the entire group of individuals or items of interest, while a sample is a subset selected from the population. Parameters are numerical summaries describing populations, whereas statistics describe samples.

Descriptive statistics: Methods for summarizing and organizing data.
Inferential statistics: Methods for making predictions or inferences about a population based on sample data.
Variables: Characteristics or properties that can take on different values.
Quantitative variables: Numeric values (discrete or continuous).
Qualitative (categorical) variables: Non-numeric categories or groups.

Explanatory variables are used to explain or predict changes in response variables.

Study Designs and Sampling Methods

Statistical studies can be observational or experimental. Observational studies include cross-sectional, case-control, and cohort designs, while experiments may be completely randomized or use matched-pairs designs. Proper sampling is crucial to avoid bias and ensure representativeness.

Sampling schemes: Simple random, stratified, cluster, convenience, multistage.
Sources of bias: Sampling bias, nonresponse bias, response bias, undercoverage.
Open vs. closed questions: Open questions allow free responses; closed questions provide fixed options.
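Sampling errors: Differences between a sample statistic and the population parameter that arise because a sample contains only part of the population.
Non-sampling errors: Errors not related to the act of sampling, such as data entry mistakes, poorly worded questions, or nonresponse.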
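
As a rough illustration of three of the sampling schemes listed above, the sketch below draws a simple random, a stratified, and a cluster sample using Python's standard `random` module. The population of 100 student records, the `year` field, and all function names are hypothetical and chosen only for this example.

```python
import random

# Hypothetical population: 100 student records, each tagged with a class year.
YEARS = ["freshman", "sophomore", "junior", "senior"]
population = [{"id": i, "year": YEARS[i % 4]} for i in range(100)]

rng = random.Random(42)  # fixed seed so the sketch is reproducible


def simple_random_sample(units, n, rng):
    """Simple random sample: every subset of size n is equally likely."""
    return rng.sample(units, n)


def stratified_sample(units, key, n_per_stratum, rng):
    """Split units into strata by `key`, then draw a simple random sample from each."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(units, cluster_size, n_clusters, rng):
    """Group units into clusters, pick whole clusters at random, keep every unit in them."""
    clusters = [units[i:i + cluster_size] for i in range(0, len(units), cluster_size)]
    chosen = rng.sample(clusters, n_clusters)
    return [unit for cluster in chosen for unit in cluster]


srs = simple_random_sample(population, 12, rng)
strat = stratified_sample(population, "year", 3, rng)
clus = cluster_sample(population, 10, 2, rng)

print(len(srs), len(strat), len(clus))
```

The stratified sample is guaranteed to contain three students from each class year, while the simple random sample carries no such guarantee. That guarantee is the usual reason for stratifying.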

In experiments, terms such as experimental unit/subject, factor, and treatment are used. The placebo effect, blinding, and replication are important for validity.

Organizing and Summarizing Data

Raw Data and Frequency Distributions

Raw data can be organized into frequency distributions, which summarize data into classes or categories. Key terms include class width, class limits, class midpoints, frequency, relative frequency, and cumulative frequency.
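
To make these terms concrete, here is a minimal sketch that builds a frequency distribution from raw data. The exam scores, the class width of 10, and the lowest class limit of 50 are made-up values for illustration.

```python
from collections import Counter

# Hypothetical raw data: exam scores for 20 students.
scores = [52, 55, 61, 64, 67, 68, 70, 71, 73, 75,
          76, 78, 79, 81, 83, 85, 88, 90, 94, 97]

class_width = 10          # assumed class width
lowest_lower_limit = 50   # assumed lower limit of the first class


def lower_class_limit(x):
    """Return the lower limit of the class that contains x."""
    return lowest_lower_limit + ((x - lowest_lower_limit) // class_width) * class_width


counts = Counter(lower_class_limit(x) for x in scores)

n = len(scores)
table = []
cumulative = 0
for lower in range(lowest_lower_limit, 100, class_width):
    upper = lower + class_width - 1      # upper class limit for integer data
    midpoint = (lower + upper) / 2       # class midpoint
    freq = counts[lower]                 # frequency (0 if the class is empty)
    cumulative += freq                   # cumulative frequency
    table.append((lower, upper, midpoint, freq, freq / n, cumulative))

print("class   midpoint  freq  rel.freq  cum.freq")
for lower, upper, midpoint, freq, rel, cum in table:
    print(f"{lower}-{upper}   {midpoint:>7}  {freq:>4}  {rel:>8.2f}  {cum:>8}")
```

The relative frequencies sum to 1, and the last cumulative frequency equals the number of observations. Both are quick checks on a hand-built table.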

Qualitative data graphs: Bar graph, Pareto chart, side-by-side bar graph, pie chart.
Quantitative data graphs: Histogram, dot plot, stem-and-leaf plot, time-series graph, boxplot.
Distribution shapes: Symmetric, uniform, bell-shaped, skewed left, skewed right.

Statistical software such as StatCrunch can be used to construct these graphs and tables.

[Figure: StatCrunch frequency table menu]

Numerically Summarizing Data

Measures of Center

Measures of center describe the typical value in a data set. The most common measures are the mean, median, and mode.

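Mean (arithmetic mean): The average value. For a sample: $\bar{x} = \frac{\sum x_i}{n}$; for a population: $\mu = \frac{\sum x_i}{N}$.
Weighted mean: $\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$, where $w_i$ is the weight attached to the value $x_i$.
Median: The middle value when data are ordered. If $n$ is odd, the median is the $\frac{n+1}{2}$th value; if $n$ is even, the median is the mean of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th values.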
Mode: The most frequently occurring value(s).
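
The measures of center defined above can be computed with Python's standard `statistics` module, as in the following sketch. The data values, grade points, and credit hours are invented for illustration, and `statistics.multimode` assumes Python 3.8 or later.

```python
import statistics

# Hypothetical data set with an odd number of values.
data = [2, 3, 3, 5, 7, 8, 14]

mean = statistics.mean(data)          # sum / n = 42 / 7
median = statistics.median(data)      # n is odd, so the 4th ordered value
modes = statistics.multimode(data)    # list of all most-frequent values

# With an even number of values the median averages the two middle values.
median_even = statistics.median([2, 3, 5, 7])   # (3 + 5) / 2


def weighted_mean(values, weights):
    """Weighted mean: sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical grade points weighted by credit hours.
gpa = weighted_mean([4.0, 3.0, 2.0], [3, 4, 1])  # (12 + 12 + 2) / 8

print(mean, median, modes, median_even, gpa)
```

`multimode` returns a list, so a bimodal data set such as `[1, 1, 2, 2, 3]` yields two values rather than one.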

A data set may have no mode, one mode (unimodal), two modes (bimodal), or more than two modes (multimodal). The mean and standard deviation can be calculated for both populations and samples, but neither is resistant to outliers; the median and the interquartile range are resistant.

Measures of Dispersion

Measures of dispersion describe the spread of data values.

Range: Difference between largest and smallest values.
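Variance (sample): $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard deviation (sample): $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Variance (population): $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Standard deviation (population): $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Interquartile range (IQR): $\text{IQR} = Q_3 - Q_1$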
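
The sketch below computes the range, variance, and standard deviation directly from their definitions and cross-checks the results against the standard `statistics` module. The data set is hypothetical, and the same eight values are treated first as a sample and then as a population purely to contrast the $n - 1$ and $N$ divisors.

```python
import math
import statistics

# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

data_range = max(data) - min(data)                    # largest minus smallest value

mean = sum(data) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data)       # sum of squared deviations

sample_variance = sum_sq_dev / (n - 1)                # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_variance)

population_variance = sum_sq_dev / n                  # divide by N for a population
population_sd = math.sqrt(population_variance)

# Cross-check against the standard library implementations.
assert math.isclose(sample_variance, statistics.variance(data))
assert math.isclose(population_variance, statistics.pvariance(data))

print(data_range, sample_variance, sample_sd, population_variance, population_sd)
```

For the same data the sample variance is always slightly larger than the population variance, because it divides by $n - 1$ rather than $N$.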

The Empirical Rule

The Empirical Rule applies to bell-shaped (normal) distributions and describes the proportion of data within certain standard deviations of the mean:

Approximately 68% within 1 standard deviation ($\mu \pm \sigma$)
Approximately 95% within 2 standard deviations ($\mu \pm 2\sigma$)
Approximately 99.7% within 3 standard deviations ($\mu \pm 3\sigma$)

[Figure: Empirical Rule bell curve]

For sample data, use $\bar{x}$ and $s$ in place of $\mu$ and $\sigma$.
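
As an illustrative check of the Empirical Rule, the sketch below computes the three intervals for an assumed mean of 100 and standard deviation of 15, then measures how much of a simulated bell-shaped sample falls inside each interval. The parameter values, the seed, and the function names are assumptions made for this example.

```python
import random
import statistics


def empirical_rule_intervals(center, spread):
    """Return the intervals center +/- k * spread for k = 1, 2, 3."""
    return {k: (center - k * spread, center + k * spread) for k in (1, 2, 3)}


def fraction_within(data, center, spread, k):
    """Fraction of data values within k standard deviations of the center."""
    lo, hi = center - k * spread, center + k * spread
    return sum(lo <= x <= hi for x in data) / len(data)


# Intervals for an assumed population with mean 100 and standard deviation 15.
intervals = empirical_rule_intervals(100, 15)
print(intervals)  # {1: (85, 115), 2: (70, 130), 3: (55, 145)}

# Simulated bell-shaped sample; use the sample mean and sample standard deviation.
rng = random.Random(0)
sample = [rng.gauss(100, 15) for _ in range(10_000)]
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)

observed = {k: fraction_within(sample, x_bar, s, k) for k in (1, 2, 3)}
print(observed)  # fractions should land near 0.68, 0.95, and 0.997
```

The observed fractions only approximate 68%, 95%, and 99.7%, and the rule should not be expected to hold for skewed or otherwise non-bell-shaped data.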

Measures of Position and Outliers

Measures of position help describe the relative standing of a data value within a data set.

z-score: Indicates how many standard deviations a value is from the mean. For a sample: $z = \frac{x - \bar{x}}{s}$; for a population: $z = \frac{x - \mu}{\sigma}$.
Percentiles: The kth percentile is a value below which k percent of the data fall.
Quartiles: Divide data into four equal parts. $Q_1$ is the median of the lower half, $Q_2$ is the median, $Q_3$ is the median of the upper half.
Five-number summary: Minimum, $Q_1$, median ($Q_2$), $Q_3$, maximum.
Interquartile Range (IQR): $\text{IQR} = Q_3 - Q_1$
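
A short sketch of the z-score and percentile ideas above, using the standard `statistics` module. The data set is hypothetical, and `z_score` and `percent_below` are illustrative helper names rather than library functions.

```python
import statistics


def z_score(x, center, spread):
    """Number of standard deviations that x lies from the center."""
    return (x - center) / spread


def percent_below(data, value):
    """Percent of data values strictly below the given value."""
    return 100 * sum(x < value for x in data) / len(data)


# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Treating the data as a population: use mu and sigma.
mu = statistics.mean(data)
sigma = statistics.pstdev(data)
z_population = z_score(9, mu, sigma)

# Treating the data as a sample: use x-bar and s.
x_bar = statistics.mean(data)
s = statistics.stdev(data)
z_sample = z_score(9, x_bar, s)

pct = percent_below(data, 7)   # 6 of the 8 values are below 7

print(z_population, z_sample, pct)
```

A positive z-score means the value lies above the mean and a negative z-score means it lies below. The sample z-score is slightly smaller in magnitude here because $s$ is larger than $\sigma$ for the same data.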

To identify outliers:

Find $Q_1$ and $Q_3$.
Compute IQR = $Q_3 - Q_1$.
Calculate fences: Lower fence = $Q_1 - 1.5 \times \text{IQR}$; Upper fence = $Q_3 + 1.5 \times \text{IQR}$.
Values outside these fences are considered outliers.
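
The steps above can be sketched as follows. The quartile helper uses the median-of-halves convention described in these notes and excludes the overall median from both halves when $n$ is odd, which is an assumption; software packages may use other quartile conventions and return slightly different values. The data set is hypothetical.

```python
import statistics


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Assumption: when n is odd, the overall median is left out of both halves.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return statistics.median(lower), statistics.median(ordered), statistics.median(upper)


def outlier_fences(data):
    """Return (lower fence, upper fence) = (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical data set with one unusually large value.
data = [1, 3, 4, 5, 6, 7, 8, 9, 30]

q1, q2, q3 = quartiles(data)
five_number_summary = (min(data), q1, q2, q3, max(data))
lower_fence, upper_fence = outlier_fences(data)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(five_number_summary)       # (1, 3.5, 6, 8.5, 30)
print(lower_fence, upper_fence)  # -4.0 16.0
print(outliers)                  # [30]
```

Here IQR = 8.5 - 3.5 = 5, so the fences sit at -4 and 16, and only the value 30 falls outside them.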

Boxplots can be used to visually identify outliers using these fences.

Additional info:

Statistical software such as StatCrunch provides menu options for constructing frequency tables, graphs, and summary statistics, as shown in the included image.
All formulas are provided in LaTeX format for clarity and ease of reference.