Collecting Data: Study Design and Sampling in Statistics

Study Guide - Smart Notes

Collecting Data in Statistics

Overview

Collecting data is a foundational step in statistical analysis, involving the identification of data sources, study design, sampling definitions, and sampling methods. Proper data collection ensures the reliability and validity of statistical conclusions.

Data sources: Existing databases, company records, government sources, sensors, and more.
Study design: Planning how data will be collected and analyzed.
Sampling definitions: Key terms such as population, sample, unit, sampling frame, and variable.
Sampling methods: Techniques for selecting representative samples.

Data Sources

Existing Sources and Business Applications

Data can be obtained from a variety of existing sources, each with specific uses in business and research.

Company websites: Online sales data, booking information.
Restaurants and retailers: Daily/weekly sales, costs, and other measurable data.
Wearables and sensors: Data from manufacturing and health devices.
Government and European sources: Public datasets (e.g., www.cso.ie).

Applications: Market segmentation, targeted sales, customer profiling, smart manufacturing, supply chain analysis.

Statistical Study Designs

Observational vs Experimental Studies

Statistical studies are classified based on how data is collected and the role of the researcher.

Observational studies: The researcher observes without intervention. Example: Questionnaire on customer satisfaction.
Experimental studies: The researcher deliberately manipulates one or more variables and observes the effect. Examples: Randomised Controlled Trial (RCT), Product A/B Testing.

Cross-sectional vs Longitudinal Studies

Cross-sectional studies: Data collected at a single point in time. Example: Opinion poll.
Longitudinal studies: Data collected over a period of time. Example: Cohort study (e.g., Growing up in Ireland).

Introduction to Collecting Data

Why Not Survey Everyone?

Surveying an entire population is often impractical due to constraints such as time, cost, and changing opinions. Instead, a sample is used to make inferences about the population.

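Time: Contacting every member of a large population takes too long.
Money: The cost of surveying everyone is usually prohibitive.
Changing opinions: Views can shift while a lengthy survey is still in progress, so early and late responses may not be comparable.
No need to ask everyone: A well-chosen representative sample is sufficient.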

Sampling

Representative Samples

Sampling involves selecting a subset of the population to estimate population characteristics. A representative sample accurately reflects the population, avoiding over- or under-representation of any group.

Sample values are used to estimate population values.
Obtaining a representative sample can be challenging due to potential biases.

Case Study: The 1936 Literary Digest Poll

This historical example illustrates the consequences of poor sampling methods.

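Prediction: The magazine forecast a win for Landon, giving Roosevelt only 43% of the vote; Roosevelt actually won with 62%.
Errors: High non-response (only 24% of questionnaires were returned) and a biased sample (the mailing list was drawn from telephone and social club records, which over-represented wealthier voters).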
Modern relevance: Email/Internet surveys often have low response rates; identifying non-biased samples is crucial.
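
The effect of a biased sampling frame can be imitated with a small simulation. This is only a sketch: the population size, the share of phone owners, and the support rates within each group are invented for illustration (chosen so that overall support is 62%), and the names `phone_owners`, `others` and `support` are hypothetical. It ignores non-response and looks only at frame bias.

```python
import random

rng = random.Random(1936)  # fixed seed so the sketch is reproducible

# Hypothetical electorate: each voter is (owns_phone, supports_roosevelt).
# Phone owners: 40,000 voters, 40% support. Others: 60,000 voters, about 77% support.
phone_owners = [(True, True)] * 16_000 + [(True, False)] * 24_000
others = [(False, True)] * 46_000 + [(False, False)] * 14_000
population = phone_owners + others

def support(voters):
    """Proportion of the given voters who support Roosevelt."""
    return sum(supports for _, supports in voters) / len(voters)

true_support = support(population)  # population parameter: 0.62

# A very large sample drawn only from the biased frame (phone owners).
biased_estimate = support(rng.sample(phone_owners, 20_000))

# A much smaller simple random sample drawn from the whole population.
random_estimate = support(rng.sample(population, 1_000))

print("true support:", true_support)
print("large biased sample:", round(biased_estimate, 3))
print("small random sample:", round(random_estimate, 3))
```

Under these assumptions the large sample from the biased frame lands near 40%, far from the true value, while the much smaller random sample lands close to it. A large sample does not repair a biased frame.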

Key Definitions in Sampling

Important Terms

Population: Entire group of objects/subjects about which information is required.
Sample: Subset of the population that is actually observed or measured.
Unit: Any individual member of the population.
Sampling frame: The list (or other means of identification) of the individuals in the population from which the sample is actually drawn.
Variable: Any quantity measured whose value varies from one unit to another.

Examples

Tour company survey:
Population: All customers of the tour company
Sampling frame: Email database of customers
Sample: 200 customers who responded
Unit: A customer
Variable: Customer's rating of their experience on Tour of Killarney

Cholesterol study:
Population: All first year UL Business students
Sampling frame: Class list
Sample: 20 students
Unit: A student
Variable: Cholesterol level

Parameters and Statistics

Definitions and Examples

Parameter: Numerical characteristic of the population; usually unknown and fixed.
Statistic: Numerical characteristic of the sample; computed from the sample and varies from sample to sample.

Example:

10% of all Cork residential phones are unlisted (Parameter).
7% of a sample of 100 phones are unlisted (Statistic).
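
The difference between a fixed parameter and a varying statistic can be seen in a short simulation. This is a sketch under stated assumptions: the population of 50,000 phones is invented (only the 10% unlisted rate comes from the example above), and `sample_proportion` is a hypothetical helper name.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 50,000 phones, exactly 10% unlisted (True = unlisted).
population = [True] * 5_000 + [False] * 45_000
parameter = sum(population) / len(population)  # fixed population value: 0.10

def sample_proportion(pop, n, rng):
    """Proportion of unlisted phones in a simple random sample of size n."""
    sample = rng.sample(pop, n)
    return sum(sample) / n

# Five separate samples of 100 phones give five statistics,
# which will generally differ from each other and from the parameter.
statistics = [sample_proportion(population, 100, rng) for _ in range(5)]

print("parameter:", parameter)
print("statistics:", statistics)
```

The parameter is computed once from the whole population and never changes; each new sample of 100 phones produces its own statistic.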

Representative Sample

Importance and Consequences

A representative sample mirrors the population, ensuring valid generalizations.
Non-representative samples lead to incorrect conclusions and limited applicability.

Accuracy of Sample Statistics

Precision and Bias

Precision: Increased by larger sample sizes.
Low bias: Achieved through random sampling.

Random Sampling

Each unit has an equal chance of selection.
Advantage: Protects against bias.
Disadvantage: Difficult to implement in practice.

Illustration of Bias and Precision

Type	Description
Low precision, low bias	Results are scattered but centred on the true value.
High precision, high bias	Results are close together but far from the true value.
Low precision, high bias	Results are scattered and far from the true value.
High precision, low bias	Results are close together and close to the true value.
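
The link between sample size and precision can be checked by drawing many random samples and measuring how much the estimates spread. The sketch below assumes an invented population with a true proportion of 0.10; the sample sizes (25 and 400) and the 500 repetitions are arbitrary choices, and `repeated_estimates` is a hypothetical helper.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1 = has the characteristic, true proportion 0.10.
population = [1] * 5_000 + [0] * 45_000

def repeated_estimates(n, repetitions=500):
    """Sample proportions from many simple random samples of size n."""
    return [fmean(rng.sample(population, n)) for _ in range(repetitions)]

small = repeated_estimates(25)
large = repeated_estimates(400)

for label, estimates in (("n=25", small), ("n=400", large)):
    # The mean of the estimates reflects bias; their spread reflects precision.
    print(label, "mean:", round(fmean(estimates), 3),
          "spread:", round(pstdev(estimates), 3))
```

Both sets of estimates centre near 0.10 because random sampling keeps bias low, but the estimates from the larger samples are packed much more tightly around that value.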

Types of Sampling Methods

Simple Random Sampling

All units have an equal chance of selection (e.g., names in a hat).
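
A minimal sketch of the "names in a hat" idea, assuming the sampling frame is available as a Python list. The frame of 1,000 customer IDs and the function name `simple_random_sample` are hypothetical.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the frame, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(frame, n)  # sampling without replacement

# Hypothetical sampling frame: 1,000 customer IDs.
frame = [f"customer_{i}" for i in range(1, 1001)]
sample = simple_random_sample(frame, 20, seed=42)
print(sample)
```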

Stratified Sampling

Population divided into sections (strata), such as age groups or regions; a random sample is taken from each stratum.
Common in opinion polls.
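
One way to sketch this in code is to draw a separate simple random sample inside each section. The example assumes proportional allocation (the same fraction from every section), which is only one of several possible choices; the age groups, their sizes, and the name `stratified_sample` are invented for illustration.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Take a simple random sample of the given fraction from every stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        k = max(1, round(len(units) * fraction))  # at least one unit per stratum
        sample[name] = rng.sample(units, k)
    return sample

# Hypothetical population of 1,000 voters split into three age groups.
strata = {
    "18-34": [f"a{i}" for i in range(300)],
    "35-54": [f"b{i}" for i in range(500)],
    "55+": [f"c{i}" for i in range(200)],
}
chosen = stratified_sample(strata, fraction=0.1, seed=7)
print({name: len(units) for name, units in chosen.items()})
# prints {'18-34': 30, '35-54': 50, '55+': 20}
```

Every age group is guaranteed to appear in the sample in proportion to its size, which a single simple random sample cannot guarantee.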

Cluster Sampling

Population divided into clusters; random sample of clusters selected, all subjects within chosen clusters included.
Used in medical research (e.g., clinics as clusters).
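
A minimal sketch of cluster sampling, assuming each cluster is stored with the list of its members. The six clinics, their patients, and the name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and include every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

# Hypothetical frame: six clinics with five patients each.
clinics = {
    f"clinic_{c}": [f"clinic_{c}_patient_{p}" for p in range(1, 6)]
    for c in "ABCDEF"
}
chosen_clinics, patients = cluster_sample(clinics, 2, seed=3)
print(chosen_clinics)
print(len(patients), "patients included")
```

The randomness applies to the choice of clinics, not to individual patients: once a clinic is selected, all of its patients are in the sample.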

Systematic Sampling

Sample selected by moving systematically through the population from a random start point (e.g., every 10th car).
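
The "every 10th car" rule can be sketched as follows, assuming the units are available in order as a list. The list of 100 cars and the name `systematic_sample` are hypothetical.

```python
import random

def systematic_sample(frame, step, seed=None):
    """Pick a random start among the first `step` units, then every step-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # random start index: 0 .. step - 1
    return frame[start::step]

# Hypothetical ordered frame: cars numbered 1 to 100 as they pass a checkpoint.
cars = list(range(1, 101))
sample = systematic_sample(cars, step=10, seed=5)
print(sample)
```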

Convenience Sampling

Sample identified by convenience, not randomization.

Judgement Sampling

Sample identified by expert judgement.

Voluntary Response Sampling

Participants volunteer to be part of the sample (e.g., radio polls).

Potential Errors in Data Collection

Sampling Errors

Biased sampling method.
Sampling frame differs from the population.

Non-sampling Errors

Wording of questions.
Method of contacting subjects (email, post, interviews).
Non-response.
Missing data.
Processing errors.

Summary

Aim: Take a representative sample from a population.
Sample statistics estimate unknown population parameters.
Random sampling should be used to avoid bias.
Increasing the sample size increases the precision of a sample statistic.

Key Formulas

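Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the sample size.
Population Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, where $N$ is the population size.
Sample Proportion: $\hat{p} = \frac{k}{n}$, where $k$ is the number of sampled units with the characteristic of interest.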
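
The three quantities listed under Key Formulas can be computed directly. The sketch below uses invented cholesterol readings (in mmol/L) for a tiny population and a sample drawn from it; the threshold of 5.0 used for the proportion is likewise an assumption made only for illustration.

```python
from statistics import fmean

# Hypothetical cholesterol readings (mmol/L) for a population of 10 students.
population = [4.1, 5.3, 4.8, 6.0, 5.1, 4.4, 5.6, 4.9, 5.2, 4.6]

# Hypothetical sample of 4 students taken from that population.
sample = [5.3, 4.4, 5.6, 4.6]

population_mean = fmean(population)  # parameter: sum of all N values divided by N
sample_mean = fmean(sample)          # statistic: sum of the n sampled values divided by n

# Sample proportion: share of sampled students above an assumed threshold.
threshold = 5.0
sample_proportion = sum(x > threshold for x in sample) / len(sample)

print("population mean:", round(population_mean, 3))
print("sample mean:", round(sample_mean, 3))
print("sample proportion above threshold:", sample_proportion)
```

In practice the population mean is usually unknown, and the sample mean is used to estimate it.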
