
Chapter 1 - Defining and Collecting Data in Business Statistics

Study Guide - Smart Notes

Defining and Collecting Data

Introduction

Defining and collecting data is the foundational step in business statistics. Accurate data collection and clear variable definitions are essential for meaningful analysis and sound business decisions. This section covers the types of variables, measurement scales, sampling methods, data cleaning, and survey errors, providing a comprehensive overview for students beginning their study of business statistics.

Variables and Measurement Scales

Defining Variables

  • Variable: A characteristic or property that can take on different values among subjects in a study.

  • Operational Definition: A clear explanation of how a variable will be measured or observed, ensuring consistency and clarity among stakeholders.

  • Example: "Monthly sales" must specify whether it refers to the entire chain or individual stores, net or gross sales, and the unit of measurement (e.g., U.S. dollars).

Types of Variables

  • Numerical (Quantitative) Variables: Represent counted or measured quantities (e.g., monthly sales, age).

  • Categorical (Qualitative) Variables: Represent categories or groups (e.g., gender, product type).

  • Discrete Numerical Variables: Result from counting processes (e.g., number of smartphones sold).

  • Continuous Numerical Variables: Result from measuring processes and can take any value within an interval (e.g., time spent waiting in line).

  • Contextual Classification: The classification of a variable can depend on the analysis context (e.g., age as a number or as age groups).

Measurement Scales

  • Nominal Scale: Categorical, no inherent order (e.g., beverage type: coffee, tea, water).

  • Ordinal Scale: Categorical, with an implied order (e.g., business size: small, medium, large).

  • Interval Scale: Numerical, equal differences between values, no true zero (e.g., temperature in °F).

  • Ratio Scale: Numerical, ordered, equal intervals, true zero (e.g., income, download time).

Table: Comparison of Measurement Scales

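Scale    | Type        | Order | Equal Intervals | True Zero | Example
---------|-------------|-------|-----------------|-----------|-----------------
Nominal  | Categorical | No    | No              | No        | Gender
Ordinal  | Categorical | Yes   | No              | No        | Business Size
Interval | Numerical   | Yes   | Yes             | No        | Temperature (°F)
Ratio    | Numerical   | Yes   | Yes             | Yes       | Income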
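
The practical meaning of "true zero" can be checked with a little arithmetic. The sketch below is illustrative only (the temperatures and incomes are made-up values). It shows that ratios of interval-scale values change when the unit changes, while differences keep their relative size, and that ratios of ratio-scale values survive a unit change.

```python
import math

def fahrenheit_to_celsius(f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

# Interval scale: the ratio depends on the unit, so "twice as hot" is not meaningful.
ratio_f = 80 / 40                                                # 2.0 in °F
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)  # about 6 in °C

# Equal differences stay equal after conversion (80-40 and 60-20 are both 40 °F).
diff_c_1 = fahrenheit_to_celsius(80) - fahrenheit_to_celsius(40)
diff_c_2 = fahrenheit_to_celsius(60) - fahrenheit_to_celsius(20)

# Ratio scale: income has a true zero, so the ratio is the same in dollars or cents.
income_ratio_dollars = 60_000 / 30_000
income_ratio_cents = (60_000 * 100) / (30_000 * 100)

print(ratio_f, round(ratio_c, 2))
print(math.isclose(diff_c_1, diff_c_2))
print(income_ratio_dollars == income_ratio_cents)
```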

Data Collection Methods

Populations and Samples

  • Population: All items or individuals of interest (e.g., all sales transactions in a year).

  • Sample: A subset of the population, used for practical data collection and analysis.

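  • Statistic: A numerical summary computed from a sample (e.g., the mean of 50 sampled transactions).

  • Parameter: A numerical summary that describes the entire population (e.g., the mean of all transactions in the year).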
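
The distinction between a parameter and a statistic can be sketched in a few lines. The population below is simulated, so the invoice amounts, the seed, and the sample size of 50 are assumptions chosen only for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 5,000 invoice amounts in U.S. dollars.
population = [round(rng.uniform(10, 500), 2) for _ in range(5000)]

# Parameter: a summary value computed from the whole population.
population_mean = statistics.mean(population)

# Statistic: the same summary computed from a sample of 50 invoices.
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"Parameter (population mean): {population_mean:.2f}")
print(f"Statistic (sample mean):     {sample_mean:.2f}")
```

In practice the parameter is usually unknown, and the statistic serves as its estimate.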

Data Sources

  • Primary Data: Collected directly by the researcher or organization.

  • Secondary Data: Collected by others and used by the researcher (e.g., industry reports).

  • Observational Study: Data collected in a natural setting without intervention.

  • Designed Experiment: Researcher assigns treatments and observes outcomes.

Sampling Methods

Sampling Frames and Bias

  • Frame: The list of items from which a sample is drawn; must accurately represent the population.

  • Bias: Systematic error introduced by improper sampling or frame selection.

Types of Sampling Methods

Table: Sampling Methods Overview

Sampling Method | Probability Known? | Description                                                       | Example
----------------|--------------------|-------------------------------------------------------------------|-------------------------------------------------------------
Simple Random   | Yes                | Every item has an equal chance of selection                       | Randomly select 50 invoices from 5,000
Systematic      | Yes                | Select every k-th item after a random start                       | Every 20th item from a list
Stratified      | Yes                | Divide population into strata, sample from each                   | Sample students by class year
Cluster         | Yes                | Divide into clusters, sample clusters, study all items in each    | Sample city blocks, survey all households in selected blocks
Convenience     | No                 | Sample items that are easy to reach                               | Survey people at a mall
Judgment        | No                 | Sample selected by expert opinion                                 | Interview industry experts

Simple Random Sampling

  • Each item and each possible sample of a fixed size has an equal chance of selection.

  • Can be done with or without replacement.

  • Random number tables or software functions (e.g., Excel's RANDBETWEEN) are used for selection.

  • Formula: The probability of selecting any particular item on the first draw is $1/N$, where $N$ is the population size.
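
A minimal sketch of simple random sampling with Python's standard `random` module, assuming a hypothetical frame of 5,000 numbered invoices. `random.sample` draws without replacement and `random.choices` draws with replacement.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

N = 5000                       # population size (hypothetical invoice numbers 1..N)
frame = list(range(1, N + 1))  # the sampling frame
n = 50                         # desired sample size

# Without replacement: an item can be selected at most once.
srs_without = rng.sample(frame, n)

# With replacement: the same item can be selected more than once.
srs_with = rng.choices(frame, k=n)

# Chance that any particular item is chosen on the first draw.
first_draw_probability = 1 / N

print(sorted(srs_without)[:5])
print(first_draw_probability)
```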

Systematic Sampling

  • Divide the population of size $N$ into $n$ groups of $k$ items, where $k = N/n$ (rounded as needed) and $n$ is the desired sample size.

  • Randomly select the first item from the first $k$ items, then select every $k$-th item thereafter.

  • Efficient for ordered lists, but can be biased if the list has a hidden periodic pattern that lines up with the sampling interval.
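
The procedure can be sketched as follows, assuming the frame is an ordered Python list. The function name `systematic_sample` and the choice to round the interval down are illustrative assumptions, not a prescribed method.

```python
import random

def systematic_sample(frame, n, rng=random):
    """Return a systematic sample of n items from an ordered frame."""
    N = len(frame)
    if not 0 < n <= N:
        raise ValueError("sample size must be between 1 and the frame size")
    k = N // n                # sampling interval: k = N / n, rounded down
    start = rng.randrange(k)  # random start among the first k items (0-based index)
    # Take every k-th item from the start; trim extras when N is not a multiple of n.
    return frame[start::k][:n]

# Hypothetical frame of 800 numbered invoices, sample of 40, so k = 20.
invoices = list(range(1, 801))
picked = systematic_sample(invoices, 40, random.Random(3))
print(picked[:3])
```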

Stratified Sampling

  • Divide population into homogeneous subgroups (strata), then randomly sample from each stratum.

  • Ensures representation from all subgroups, increasing precision.
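
One way to sketch this is shown below. It assumes proportional allocation (the same sampling fraction in every stratum), which is only one of several possible allocation rules. The student data and the helper `stratified_sample` are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(items, stratum_of, fraction, rng=random):
    """Draw a simple random sample of the given fraction from each stratum."""
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))  # at least one item per stratum
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical population of 1,000 students as (student_id, class_year) pairs.
sizes = {"freshman": 400, "sophomore": 300, "junior": 200, "senior": 100}
students = [(f"{year}-{i}", year) for year, count in sizes.items() for i in range(count)]

picked = stratified_sample(students, lambda s: s[1], 0.10, random.Random(5))
counts = Counter(year for _, year in picked)
print(counts)
```

Every class year appears in the sample in proportion to its share of the population, which a simple random sample of the same size would not guarantee.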

Cluster Sampling

  • Divide population into clusters (e.g., geographic areas), randomly select clusters, and study all items within selected clusters.

  • Cost-effective for large, dispersed populations, but often less efficient: a larger sample may be needed to match the precision of simple random or stratified sampling.
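
A minimal sketch of one-stage cluster sampling, assuming the population is already grouped into a dictionary keyed by cluster. The block and household labels and the helper `cluster_sample` are made up for illustration.

```python
import random

def cluster_sample(clusters, num_clusters, rng=random):
    """Randomly pick whole clusters and return every item inside them."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    items = [item for cluster_id in chosen for item in clusters[cluster_id]]
    return chosen, items

# Hypothetical city: 20 blocks with 8 households each.
blocks = {
    f"block-{b:02d}": [f"block-{b:02d}/house-{h}" for h in range(1, 9)]
    for b in range(1, 21)
}

chosen_blocks, households = cluster_sample(blocks, 4, random.Random(11))
print(chosen_blocks)
print(len(households))
```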

Data Cleaning and Preprocessing

Data Cleaning

  • Essential for accuracy and quality before analysis.

  • Addresses invalid values, coding errors, integration errors, and missing values.

  • Manual review is often necessary; always preserve the original data.

Types of Data Issues

  • Invalid Variable Values: Entries not matching operational definitions or outside valid ranges.

  • Coding Errors: Inconsistent or incorrect data entries (e.g., "Female" vs. "F").

  • Data Integration Errors: Redundant columns, duplicated rows, inconsistent units.

  • Missing Values: Data not collected or recorded; distinct from miscoded values.

  • Outliers: Extreme values identified using descriptive statistics (e.g., standard deviation, interquartile range).
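
The checks above can be sketched with plain Python. Everything in this example is an assumption made for illustration: the records, the valid age range of 0 to 120, the gender code table, and the common 1.5 × IQR rule for flagging outliers.

```python
import statistics

# Hypothetical raw records; None marks a value that was never recorded.
raw_rows = [
    {"id": 1, "gender": "Female", "age": 34, "monthly_sales": 1200.0},
    {"id": 2, "gender": "F", "age": 41, "monthly_sales": 1350.0},
    {"id": 3, "gender": "male", "age": -5, "monthly_sales": 1100.0},    # invalid age
    {"id": 4, "gender": "M", "age": 29, "monthly_sales": None},         # missing value
    {"id": 5, "gender": "Male", "age": 52, "monthly_sales": 1280.0},
    {"id": 6, "gender": "Female", "age": 47, "monthly_sales": 9800.0},  # extreme value
    {"id": 6, "gender": "Female", "age": 47, "monthly_sales": 9800.0},  # duplicated row
    {"id": 7, "gender": "F", "age": 38, "monthly_sales": 1250.0},
    {"id": 8, "gender": "M", "age": 45, "monthly_sales": 1180.0},
]

# Work on copies so the original data is preserved.
rows = [dict(r) for r in raw_rows]

# Data integration error: drop exact duplicate rows, keeping the first copy.
seen, deduped = set(), []
for r in rows:
    key = tuple(r.items())
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Coding error: map inconsistent labels onto one code per category.
GENDER_CODES = {"f": "Female", "female": "Female", "m": "Male", "male": "Male"}
for r in deduped:
    r["gender"] = GENDER_CODES.get(r["gender"].strip().lower(), r["gender"])

# Invalid values: ages outside the range allowed by the operational definition.
invalid_age_ids = [r["id"] for r in deduped if not 0 <= r["age"] <= 120]

# Missing values: recorded as None, which is distinct from a miscoded entry.
missing_sales_ids = [r["id"] for r in deduped if r["monthly_sales"] is None]

# Outliers: flag values more than 1.5 * IQR outside the quartiles.
sales = [r["monthly_sales"] for r in deduped if r["monthly_sales"] is not None]
q1, _, q3 = statistics.quantiles(sales, n=4)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
outlier_ids = [
    r["id"] for r in deduped
    if r["monthly_sales"] is not None and not low <= r["monthly_sales"] <= high
]

print(invalid_age_ids, missing_sales_ids, outlier_ids)
```

Flagged rows still need manual review: an outlier may be a genuine large sale rather than an error.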

Other Data Preprocessing Tasks

  • Data Formatting: Adjusting structure or encoding for analysis (e.g., converting images to spreadsheets).

  • Stacking Data: Combining multiple columns into one with a group label.

  • Unstacking Data: Splitting a column into multiple columns based on a grouping variable.

  • Recoding Variables: Redefining categories or grouping numerical values into ranges for analysis.
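
Stacking, unstacking, and recoding can be sketched with basic Python structures. The store names, sales figures, and age ranges below are hypothetical.

```python
# Unstacked layout: one column of values per group.
unstacked = {"Store A": [120, 135, 128], "Store B": [98, 110, 105]}

# Stacking: one column of values plus a column holding the group label.
stacked = [(group, value) for group, values in unstacked.items() for value in values]

# Unstacking: split the single column back into one column per group.
restored = {}
for group, value in stacked:
    restored.setdefault(group, []).append(value)

# Recoding: group a numerical variable into ranges (cut points are assumptions).
def recode_age(age):
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50 and over"

ages = [22, 35, 49, 50, 67]
age_groups = [recode_age(a) for a in ages]

print(stacked)
print(restored == unstacked)
print(age_groups)
```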

Survey Errors and Ethical Issues

Types of Survey Errors

  • Coverage Error: Some groups are excluded from the sampling frame, leading to selection bias.

  • Nonresponse Error: Not all selected individuals respond, possibly biasing results.

  • Sampling Error: Chance variation from sample to sample that arises because only part of the population is observed; its size is commonly reported as the margin of error.

  • Measurement Error: Errors from question design, respondent misunderstanding, or data recording.
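
Sampling error can be made concrete with a small simulation. This is a sketch under assumed values: the population is simulated, and the sample size of 100 and the 200 repetitions are arbitrary. Repeated samples from the same population give different means even when nothing else goes wrong.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 customer satisfaction scores.
population = [rng.gauss(100, 15) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Draw 200 independent samples of 100 and record each sample mean.
sample_means = [statistics.mean(rng.sample(population, 100)) for _ in range(200)]

# The spread of the sample means reflects sampling error alone.
spread = statistics.stdev(sample_means)

print(f"Population mean:               {population_mean:.2f}")
print(f"Smallest / largest sample mean: {min(sample_means):.2f} / {max(sample_means):.2f}")
print(f"Std. dev. of sample means:      {spread:.2f}")
```

Coverage, nonresponse, and measurement errors are not captured by this spread, which is why a small margin of error alone does not guarantee an accurate survey.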

Ethical Issues in Surveys

  • Intentionally excluding groups from the frame (coverage error) or designing a survey so that certain groups are unlikely to respond (nonresponse error) is unethical.

  • Failure to disclose sample size or margin of error can mislead stakeholders.

  • Leading questions or interviewer influence can bias results.

  • Generalizing from a nonprobability sample to the population without disclosing the sampling method is unethical.

Applications and Case Studies

Business Case Examples

  • Coca-Cola "New Coke": Focusing on taste preference in blind tests ignored actual purchase intent, leading to a failed product launch.

  • AMS Telecommunications: Uses internal and external data sources, emphasizing the need for clear operational definitions and appropriate data collection methods.

  • CardioGood Fitness: Identifies customer profiles using both categorical and numerical variables for targeted marketing.

  • Clear Mountain State Student Survey: Demonstrates the importance of variable classification for appropriate statistical analysis.

Software Tools for Data Handling

Excel

  • Automatically infers variable types; use leading apostrophes to force categorical treatment of numbers.

  • Functions like RANDBETWEEN support random sampling, but they can return the same value more than once (sampling with replacement); add-ins are used for sampling without replacement.

  • Formulas and lookup tables for data cleaning and recoding.

JMP

  • Allows manual adjustment of variable type and scale.

  • Subset and stratify functions for sampling.

  • Stacking and unstacking via dialog boxes.

Minitab

  • Automatic and manual variable type assignment.

  • Data cleaning during import and with column formulas.

  • Replace and Calculator commands for recoding variables.

Summary Table: Variable Classification Examples

Variable             | Type        | Discrete/Continuous | Scale
---------------------|-------------|---------------------|---------------
Number of cellphones | Numerical   | Discrete            | Ratio
Monthly data usage   | Numerical   | Continuous          | Ratio
Academic major       | Categorical | N/A                 | Nominal
Gender               | Categorical | N/A                 | Nominal
Income               | Numerical   | Discrete/Continuous | Ratio
Test scores          | Numerical   | Continuous          | Interval/Ratio

Key Takeaways

  • Clearly define variables and their measurement scales before collecting data.

  • Choose appropriate sampling methods to ensure representativeness and minimize bias.

  • Data cleaning and preprocessing are critical for reliable analysis.

  • Be aware of potential survey errors and ethical considerations in data collection and reporting.

  • Use software tools effectively for data handling, but always verify and clean data manually as needed.

Additional info: Some explanations and examples have been expanded for clarity and completeness, following academic best practices for introductory business statistics.
