Sampling and Sampling Distributions – Study Notes

Sampling and Sampling Distributions

7.1 Why Sample?

Sampling is a fundamental concept in statistics, especially in business contexts where measuring an entire population is often impractical. Instead, a subset (sample) is studied to make inferences about the whole group (population).

Population: All possible subjects of interest in a study.
Sample: A subset of the population, selected for analysis.
Reasons to sample:
- Measuring an entire population can be expensive or impossible.
- A properly selected sample allows for accurate assessment of the population.

7.2 Types of Sampling and Biases

Sampling methods determine how representative and reliable the results are. There are two main categories: probability and nonprobability sampling.

Probability Sampling

Probability Sample: Each member of the population has a known, nonzero chance of being selected.
Advantage: Enables inferential statistical tests for reliable conclusions about the population.

Simple Random Sampling

Every member of the population has an equal chance of being chosen.
Example: Selecting 10 students at random from a list of 1,800 using Excel's Data Analysis tool.
Without Replacement: Once selected, a member cannot be chosen again.
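
The notes do this draw with Excel; as a rough equivalent, the same selection can be sketched in Python. The roster of IDs 1 to 1,800 and the seed are assumptions made only for illustration.

```python
import random

# Hypothetical roster: student IDs 1..1800 stand in for the real list.
population = list(range(1, 1801))

rng = random.Random(42)  # fixed seed only so the example is repeatable

# random.sample draws without replacement, so no ID can appear twice.
srs = rng.sample(population, k=10)
print(sorted(srs))
```

Because the draw is without replacement, the ten IDs are always distinct.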

Systematic Sampling

Every kth member is chosen, where k is the population size (N) divided by the sample size (n).

Formula for Systematic Sampling Constant:

\[ k = \frac{N}{n} \]

Example: For a population of 1,800 and a sample of 10, \( k = 1{,}800 / 10 = 180 \) (choose every 180th student).
Advantages: Easy to implement, reduces judgment bias.
Disadvantage: Risk of periodicity bias if there is a pattern in the population matching k.
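
A minimal sketch of systematic selection for the 1,800-student example. The random starting position within the first interval is a common convention assumed here, not something the notes specify, and `systematic_sample` is a hypothetical helper name.

```python
import random

def systematic_sample(population, n, rng=random):
    """Pick every k-th member after a random start in the first interval."""
    N = len(population)
    k = N // n                    # sampling constant k = N / n (integer part)
    start = rng.randrange(k)      # assumed: random start among positions 0..k-1
    return [population[start + i * k] for i in range(n)]

students = list(range(1, 1801))   # hypothetical student IDs
sample = systematic_sample(students, 10, random.Random(7))
print(sample)
```

Consecutive picks are always exactly k = 180 positions apart, which is why a pattern in the list that repeats every 180 entries would bias the sample.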

Stratified Sampling

Population is divided into mutually exclusive groups (strata), and random samples are taken from each.
Homogeneity within strata, heterogeneity between strata.
Strata are based on important variables (e.g., age, income).
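
One way to sketch this is proportional allocation, where the same fraction is drawn from every stratum. The allocation rule, the age-group strata, and the member labels below are all assumptions for illustration.

```python
import random

def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the same fraction from every stratum."""
    sample = {}
    for name, members in strata.items():
        n_h = max(1, round(len(members) * fraction))  # at least one per stratum
        sample[name] = rng.sample(members, n_h)
    return sample

# Hypothetical strata keyed by age group; the member labels are made up.
strata = {
    "under_30": [f"u{i}" for i in range(600)],
    "30_to_49": [f"m{i}" for i in range(900)],
    "50_plus": [f"s{i}" for i in range(300)],
}
picked = stratified_sample(strata, fraction=0.01, rng=random.Random(1))
sizes = {name: len(members) for name, members in picked.items()}
print(sizes)
```

Each stratum contributes in proportion to its size, so no age group can be missed by chance.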

Cluster Sampling

Randomly select clusters (often based on geography), then sample all or some members within clusters.
Clusters are mini-populations, often heterogeneous within but similar to the overall population.
Examples: Classrooms, test-market cities.
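
A rough sketch of the one-stage version, in which every member of each chosen cluster is kept. The twelve classrooms of 25 students and the helper name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """One-stage cluster sample: pick whole clusters at random, keep every member."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical classrooms of 25 students each.
classrooms = {f"room_{r}": [f"r{r}_s{i}" for i in range(25)] for r in range(1, 13)}
selected = cluster_sample(classrooms, 3, random.Random(3))
print(sorted(selected))
```

To sample only some members within each chosen cluster, a second random draw inside each selected classroom could follow.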

Resampling

Statistical technique where many samples are repeatedly drawn from the data already available (the original sample).
Bootstrap Method: Uses computer software to extract many samples with replacement to estimate parameters (mean, proportion).
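
A minimal bootstrap sketch for the mean, assuming a small made-up data set. The values, the number of resamples (2,000), and the helper name `bootstrap_means` are illustrative choices, not taken from the notes.

```python
import random
import statistics

def bootstrap_means(data, n_resamples, rng):
    """Resample the data with replacement and record each resample's mean."""
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

# Hypothetical observed sample (e.g., order values in dollars).
observed = [12.5, 15.0, 9.8, 22.1, 18.4, 11.3, 14.7, 16.9, 13.2, 20.0]
rng = random.Random(0)
means = bootstrap_means(observed, 2000, rng)

estimate = statistics.fmean(means)   # bootstrap estimate of the mean
boot_se = statistics.stdev(means)    # bootstrap standard error of the mean
print(round(estimate, 2), round(boot_se, 2))
```

The spread of the resampled means approximates how much the sample mean would vary from sample to sample.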

Nonprobability Sampling

Probability of selection is unknown.
Convenience Sample: Members are chosen because they are easily accessible.
Advantages: Quick, easy, provides general information.
Disadvantages: May not be representative.

Biases in Sampling

Biases are systematic errors that can affect the validity of results.

Type	Description
Sampling Bias	Sample is not representative of the population.
Nonresponse Bias	Individuals who do not respond differ from those who do.
Response Bias	Respondents provide inaccurate answers (e.g., due to leading questions).
Undercoverage Bias	Certain portions of the population are insufficiently represented.
Voluntary Response Bias	Those who volunteer differ systematically from those who do not.
Cognitive Biases	Logical errors in reasoning (e.g., anchoring, availability heuristic, confirmation, recency).

7.3 Sampling and Nonsampling Errors

Errors can arise from both the sampling process and other aspects of data collection.

Parameter: Value describing a population characteristic (e.g., mean, median).
Statistic: Value calculated from a sample.
Sampling Error: Difference between a sample statistic and the population parameter.

Formula for Sampling Error of the Sample Mean:

\[ \text{Sampling Error} = \overline{x} - \mu \]

Larger sample sizes reduce average sampling error.
Nonsampling Errors: Arise from ambiguous questions, leading questions, or data collection mistakes. Not related to sampling variability.
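
The effect of sample size on sampling error can be checked with a small simulation. This is a sketch under assumptions: the population of 10,000 normally distributed values is synthetic, and the sample sizes (10 and 250) are arbitrary.

```python
import random
import statistics

rng = random.Random(11)

# Hypothetical population of 10,000 values; mu is its true mean.
population = [rng.gauss(50, 10) for _ in range(10_000)]
mu = statistics.fmean(population)

def mean_abs_sampling_error(n, trials=500):
    """Average |x_bar - mu| over repeated simple random samples of size n."""
    errors = []
    for _ in range(trials):
        x_bar = statistics.fmean(rng.sample(population, n))
        errors.append(abs(x_bar - mu))
    return statistics.fmean(errors)

small = mean_abs_sampling_error(10)
large = mean_abs_sampling_error(250)
print(round(small, 2), round(large, 2))
```

The average error for n = 250 should come out well below the one for n = 10. Nonsampling errors would not shrink this way, since they do not depend on sample size.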

7.4 The Central Limit Theorem (CLT)

The Central Limit Theorem is a cornerstone of inferential statistics, stating that the distribution of sample means approaches normality as sample size increases, regardless of the population's distribution.
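
A simulation can make this concrete. The sketch below assumes a strongly right-skewed (exponential) population and uses skewness as a rough indicator of how bell-shaped the distribution of sample means is; both choices are illustrative.

```python
import random
import statistics

rng = random.Random(5)

def sample_means(n, trials=4000):
    """Means of repeated samples of size n from an exponential population (mean 1)."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(trials)]

def skewness(values):
    """Population skewness: about 0 for a symmetric, bell-shaped distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

skew = {n: skewness(sample_means(n)) for n in (2, 30)}
for n, value in skew.items():
    print(n, round(value, 2))
```

The skewness of the sample means should be noticeably closer to zero at n = 30 than at n = 2, even though the population itself is far from normal.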

For large samples (usually \( n \geq 30 \)), the sampling distribution of the mean is approximately normal.
If the population is normal, the sampling distribution is normal for any sample size.
The mean of the sampling distribution equals the population mean (\( \mu_{\overline{x}} = \mu \)).
The standard deviation of the sampling distribution (standard error) is:

\[ \sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} \]

As sample size increases, the standard error decreases, making estimates more precise.
If sampling from a finite population (where \( n/N > 0.05 \), with N the population size), use the finite population correction:

\[ \sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}} \]
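
The two standard-error formulas above can be compared numerically. The inputs (sigma = 12, n = 100, N = 1,000) are hypothetical, and `standard_error` is a helper name introduced only for this sketch.

```python
import math

def standard_error(sigma, n, N=None):
    """Standard error of the mean; applies the finite population correction when N is given."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Hypothetical values: sigma = 12, n = 100, population of N = 1,000 (n/N = 0.10).
se_infinite = standard_error(12, 100)
se_finite = standard_error(12, 100, N=1000)
print(round(se_infinite, 3), round(se_finite, 3))
```

The correction factor is always below 1 when n > 1, so the corrected standard error is smaller: sampling a large share of a finite population leaves less room for error.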

Application Example: Testing claims about means (e.g., average drive time) using the CLT and calculating probabilities with z-scores.
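
A worked sketch of that kind of calculation follows. Every number in it is hypothetical (claimed mean of 25 minutes, sigma of 6 minutes, n = 36, observed sample mean of 27 minutes), since the notes do not give the figures.

```python
import math
from statistics import NormalDist

# All numbers are hypothetical: claimed mean 25 min, sigma 6 min, n = 36, observed mean 27 min.
mu, sigma, n, x_bar = 25.0, 6.0, 36, 27.0

se = sigma / math.sqrt(n)            # standard error of the mean
z = (x_bar - mu) / se                # z-score of the observed sample mean
p_upper = 1 - NormalDist().cdf(z)    # P(sample mean >= 27) if the claim is true
print(round(z, 2), round(p_upper, 4))
```

With these inputs the sample mean lies two standard errors above the claimed mean, and a result that extreme would occur only about 2% of the time if the claim were true.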

7.5 The Sampling Distribution of the Proportion

When dealing with proportions (e.g., percentage of successes), the sampling distribution describes the pattern of sample proportions from repeated samples.

Underlying distribution is binomial.
Conditions: \( np \geq 5 \) and \( nq \geq 5 \) (where \( q = 1 - p \)).
Sample Proportion Formula:

\[ \hat{p} = \frac{x}{n} \]

Standard Error of the Proportion:

\[ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \]

Z-score for the Sample Proportion:

\[ z = \frac{\hat{p} - p}{\sigma_{\hat{p}}} \]

Example: Testing a college's claim that 70% of graduates have jobs related to their majors using a sample of 120 students.
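
The formulas above can be applied to this example as a sketch. The claimed proportion (0.70) and sample size (120) come from the example; the observed count of 75 is a made-up value, since the notes do not give one.

```python
import math
from statistics import NormalDist

p, n = 0.70, 120      # claimed proportion and sample size from the example
x = 75                # hypothetical count of graduates with a related job

p_hat = x / n                       # sample proportion
se = math.sqrt(p * (1 - p) / n)     # standard error of the proportion
z = (p_hat - p) / se                # z-score for the sample proportion
p_lower = NormalDist().cdf(z)       # P(sample proportion <= p_hat) if the claim is true
print(round(p_hat, 3), round(se, 4), round(z, 2), round(p_lower, 4))
```

A small probability here would suggest the sample is hard to reconcile with the 70% claim; how small counts as "too small" depends on the significance level chosen.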

Summary Table: Types of Sampling Methods

Sampling Method	Description	Key Feature
Simple Random	Every member has equal chance	Random selection
Systematic	Every kth member selected	Uses interval k
Stratified	Population divided into strata, sample from each	Homogeneity within strata
Cluster	Randomly select clusters, sample within	Clusters are mini-populations
Convenience	Sample easily accessible members	Nonprobability

Key Takeaways:

Sampling allows for efficient and practical data collection.
Probability sampling methods support valid statistical inference.
Biases and errors must be minimized for reliable results.
The Central Limit Theorem justifies the use of normal probability models for sample means and proportions.