Chapter 1 - Defining and Collecting Data in Business Statistics
Study Guide - Smart Notes
Defining and Collecting Data
Introduction
Defining and collecting data is the foundational step in business statistics. Accurate data collection and clear variable definitions are essential for meaningful analysis and sound business decisions. This section covers the types of variables, measurement scales, sampling methods, data cleaning, and survey errors, providing a comprehensive overview for students beginning their study of business statistics.
Variables and Measurement Scales
Defining Variables
Variable: A characteristic or property that can take on different values among subjects in a study.
Operational Definition: A clear explanation of how a variable will be measured or observed, ensuring consistency and clarity among stakeholders.
Example: "Monthly sales" must specify whether it refers to the entire chain or individual stores, net or gross sales, and the unit of measurement (e.g., U.S. dollars).
Types of Variables
Numerical (Quantitative) Variables: Represent counted or measured quantities (e.g., monthly sales, age).
Categorical (Qualitative) Variables: Represent categories or groups (e.g., gender, product type).
Discrete Numerical Variables: Result from counting processes (e.g., number of smartphones sold).
Continuous Numerical Variables: Result from measuring processes and can take any value within an interval (e.g., time spent waiting in line).
Contextual Classification: The classification of a variable can depend on the analysis context (e.g., age as a number or as age groups).
Measurement Scales
Nominal Scale: Categorical, no inherent order (e.g., beverage type: coffee, tea, water).
Ordinal Scale: Categorical, with an implied order (e.g., business size: small, medium, large).
Interval Scale: Numerical, ordered, equal differences between values, no true zero (e.g., temperature in °F).
Ratio Scale: Numerical, ordered, equal intervals, true zero (e.g., income, download time).
Table: Comparison of Measurement Scales
| Scale | Type | Order | Equal Intervals | True Zero | Example |
|---|---|---|---|---|---|
| Nominal | Categorical | No | No | No | Gender |
| Ordinal | Categorical | Yes | No | No | Business Size |
| Interval | Numerical | Yes | Yes | No | Temperature (°F) |
| Ratio | Numerical | Yes | Yes | Yes | Income |
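Example: The four scales can be encoded as a small lookup to check which summaries make sense for a variable. This is a minimal sketch; the `Scale` class, the `SCALES` dictionary, and the `meaningful_operations` helper are hypothetical names, and the mapping from scale properties to summaries follows the usual textbook convention rather than anything stated in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scale:
    """Properties of a measurement scale (hypothetical helper class)."""
    name: str
    numerical: bool
    ordered: bool
    equal_intervals: bool
    true_zero: bool


SCALES = {
    "nominal": Scale("nominal", False, False, False, False),
    "ordinal": Scale("ordinal", False, True, False, False),
    "interval": Scale("interval", True, True, True, False),
    "ratio": Scale("ratio", True, True, True, True),
}


def meaningful_operations(scale):
    """List summaries that are meaningful for the given scale (assumed convention)."""
    ops = ["counts and mode"]
    if scale.ordered:
        ops.append("ranking and median")
    if scale.equal_intervals:
        ops.append("differences and mean")
    if scale.true_zero:
        ops.append("ratios")
    return ops


for scale in SCALES.values():
    print(scale.name, "->", ", ".join(meaningful_operations(scale)))
```

Each scale inherits the summaries of the one above it, which is why a ratio variable such as income supports every operation in the list.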
Data Collection Methods
Populations and Samples
Population: All items or individuals of interest (e.g., all sales transactions in a year).
Sample: A subset of the population, used for practical data collection and analysis.
Statistic: A summary value for a sample.
Parameter: A summary value for a population.
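Example: The difference between a parameter and a statistic can be illustrated with a short sketch. The invoice amounts below are made-up values generated for illustration, and the mean is used only as one possible summary value.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 5,000 invoice amounts (made-up values).
population = [round(rng.uniform(10, 500), 2) for _ in range(5000)]

# Parameter: a summary value computed from every item in the population.
population_mean = statistics.mean(population)

# Statistic: the same summary computed from a sample of the population.
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two values will usually be close but not identical; the gap between them is what the notes later call sampling error.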
Data Sources
Primary Data: Collected directly by the researcher or organization.
Secondary Data: Collected by others and used by the researcher (e.g., industry reports).
Observational Study: Data collected in a natural setting without intervention.
Designed Experiment: Researcher assigns treatments and observes outcomes.
Sampling Methods
Sampling Frames and Bias
Frame: The list of items from which a sample is drawn; must accurately represent the population.
Bias: Systematic error introduced by improper sampling or frame selection.
Types of Sampling Methods
Table: Sampling Methods Overview
| Sampling Method | Probability Known? | Description | Example |
|---|---|---|---|
| Simple Random | Yes | Every item has an equal chance of selection | Randomly select 50 invoices from 5,000 |
| Systematic | Yes | Select every k-th item after a random start | Every 20th item from a list |
| Stratified | Yes | Divide population into strata, sample from each | Sample students by class year |
| Cluster | Yes | Divide into clusters, sample clusters, study all items in each selected cluster | Sample city blocks, survey all households in selected blocks |
| Convenience | No | Sample items that are easy to reach | Survey people at a mall |
| Judgment | No | Sample selected by expert opinion | Interview industry experts |
Simple Random Sampling
Each item and each possible sample of a fixed size has an equal chance of selection.
Can be done with or without replacement.
Random number tables or software functions (e.g., Excel's RANDBETWEEN) are used for selection.
Formula: Probability of selecting any item on the first draw is $1/N$ (where $N$ is the population size).
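Example: A minimal sketch of simple random sampling with Python's standard `random` module, standing in for a random number table or a spreadsheet function. The frame of invoice numbers and the sizes `N = 5000` and `n = 50` are assumptions chosen for illustration.

```python
import random

rng = random.Random(7)          # fixed seed so the sketch is repeatable
N = 5000                        # population size (items in the frame)
n = 50                          # sample size
frame = list(range(1, N + 1))   # hypothetical invoice numbers 1..N

# Without replacement: an invoice can be selected at most once.
without_replacement = rng.sample(frame, n)

# With replacement: the same invoice may be selected more than once.
with_replacement = rng.choices(frame, k=n)

# Chance that any particular invoice is selected on the first draw.
first_draw_probability = 1 / N

print(sorted(without_replacement)[:5])
print(f"first-draw probability: {first_draw_probability}")
```

Sampling without replacement is the usual choice when each invoice should be audited only once.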
Systematic Sampling
Divide the population of size $N$ into $n$ groups of $k$ items, where $k = N/n$ (rounded as needed).
Randomly select the first item from the first $k$ items, then every $k$-th item thereafter.
Efficient for ordered lists but can be biased if there is a hidden pattern.
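Example: A minimal sketch of systematic sampling, assuming a frame of `N` items, a desired sample size `n`, and a step of `k = N // n` (rounded down here; other rounding rules are possible). The function name `systematic_sample` is hypothetical.

```python
import random


def systematic_sample(frame, n, rng):
    """Return every k-th item of `frame` after a random start, with k = N // n."""
    N = len(frame)
    if not 0 < n <= N:
        raise ValueError("sample size must be between 1 and the frame size")
    k = N // n                  # step between selected items
    start = rng.randrange(k)    # random position within the first k items
    return [frame[start + i * k] for i in range(n)]


rng = random.Random(3)
frame = list(range(1, 801))                  # hypothetical frame of 800 items
sample = systematic_sample(frame, 40, rng)   # k = 800 // 40 = 20
print(sample[:5])
```

If the frame has a repeating pattern whose period matches `k` (for example, every 20th record is a month-end total), the sample inherits that pattern, which is the bias risk noted above.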
Stratified Sampling
Divide population into homogeneous subgroups (strata), then randomly sample from each stratum.
Ensures representation from all subgroups, increasing precision.
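Example: A minimal sketch of stratified sampling with class year as the stratum. The student list, the 10% sampling fraction, and the helper name `stratified_sample` are assumptions for illustration; taking the same fraction from each stratum is one common allocation, not the only one.

```python
import random
from collections import defaultdict


def stratified_sample(items, stratum_of, fraction, rng):
    """Draw a simple random sample of roughly `fraction` of each stratum."""
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    sample = []
    for members in strata.values():
        size = max(1, round(fraction * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, size))
    return sample


rng = random.Random(11)
# Hypothetical students as (id, class_year) pairs.
years = ["freshman"] * 400 + ["sophomore"] * 300 + ["junior"] * 200 + ["senior"] * 100
students = list(enumerate(years))

sample = stratified_sample(students, lambda s: s[1], 0.10, rng)
print(len(sample))
```

Because every class year contributes its own random sample, a small stratum such as seniors cannot be missed by chance.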
Cluster Sampling
Divide population into clusters (e.g., geographic areas), randomly select clusters, and study all items within selected clusters.
Cost-effective for large, dispersed populations but may require larger sample sizes for precision.
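Example: A minimal sketch of cluster sampling using city blocks as clusters. The block and household labels, the number of clusters, and the helper name `cluster_sample` are hypothetical.

```python
import random


def cluster_sample(clusters, num_clusters, rng):
    """Randomly pick whole clusters and return every item inside them."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    items = [item for name in chosen for item in clusters[name]]
    return chosen, items


rng = random.Random(5)
# Hypothetical city blocks, each holding the households on that block.
blocks = {
    f"block_{b}": [f"block_{b}_house_{h}" for h in range(1, 9)]
    for b in range(1, 21)
}

chosen_blocks, households = cluster_sample(blocks, 4, rng)
print(chosen_blocks)
print(len(households))
```

Only the clusters are chosen at random; every household in a chosen block is surveyed, which keeps travel costs low but makes results depend heavily on which blocks are drawn.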
Data Cleaning and Preprocessing
Data Cleaning
Essential for accuracy and quality before analysis.
Addresses invalid values, coding errors, integration errors, and missing values.
Manual review is often necessary; always preserve the original data.
Types of Data Issues
Invalid Variable Values: Entries not matching operational definitions or outside valid ranges.
Coding Errors: Inconsistent or incorrect data entries (e.g., "Female" vs. "F").
Data Integration Errors: Redundant columns, duplicated rows, inconsistent units.
Missing Values: Data not collected or recorded; distinct from miscoded values.
Outliers: Extreme values identified using descriptive statistics (e.g., standard deviation, interquartile range).
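Example: A minimal sketch of checking for several of these issues in plain Python. The records, the valid age range of 0 to 120, the `GENDER_CODES` mapping, and the 1.5 × IQR outlier rule are all assumptions for illustration; a real project would take the valid ranges from its operational definitions.

```python
import statistics

# Hypothetical raw records; None marks a missing value.
raw = [
    {"gender": "Female", "age": 34},
    {"gender": "F", "age": 29},
    {"gender": "male", "age": -5},   # invalid: outside the valid range
    {"gender": "M", "age": None},    # missing value
    {"gender": "Male", "age": 41},
]

GENDER_CODES = {"f": "Female", "female": "Female", "m": "Male", "male": "Male"}

cleaned = [dict(row) for row in raw]  # work on a copy; preserve the original
issues = []
for i, row in enumerate(cleaned):
    # Coding errors: map inconsistent entries to one standard code.
    key = row["gender"].strip().lower()
    row["gender"] = GENDER_CODES.get(key, row["gender"])
    # Missing and invalid values are flagged for review, not silently fixed.
    if row["age"] is None:
        issues.append((i, "missing age"))
    elif not 0 <= row["age"] <= 120:
        issues.append((i, "invalid age"))


def iqr_outliers(values):
    """Flag values more than 1.5 * IQR beyond the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


wait_times = [2, 3, 3, 4, 4, 5, 5, 6, 30]  # hypothetical minutes in line
print(issues)
print(iqr_outliers(wait_times))
```

Flagged rows are listed rather than deleted so that a person can decide whether each one is an error or a genuine extreme value.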
Other Data Preprocessing Tasks
Data Formatting: Adjusting structure or encoding for analysis (e.g., converting images to spreadsheets).
Stacking Data: Combining multiple columns into one with a group label.
Unstacking Data: Splitting a column into multiple columns based on a grouping variable.
Recoding Variables: Redefining categories or grouping numerical values into ranges for analysis.
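Example: A minimal sketch of stacking, unstacking, and recoding with built-in Python data structures. The store names, sales figures, age cutoffs, and the `recode_age` helper are hypothetical.

```python
# Unstacked layout: one column per group (hypothetical store sales).
unstacked = {"store_a": [120, 135], "store_b": [98, 110, 105]}

# Stacking: one value column plus a column holding the group label.
stacked = [(group, value) for group, values in unstacked.items() for value in values]

# Unstacking: split the stacked column back out by the grouping variable.
rebuilt = {}
for group, value in stacked:
    rebuilt.setdefault(group, []).append(value)


def recode_age(age):
    """Recode a numerical age into an ordinal age group (assumed cutoffs)."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50 and over"


print(stacked)
print(rebuilt == unstacked)
print([recode_age(a) for a in (22, 30, 49, 50)])
```

Stacking and unstacking change only the layout, so converting back and forth should reproduce the original values exactly.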
Survey Errors and Ethical Issues
Types of Survey Errors
Coverage Error: Some groups are excluded from the sampling frame, leading to selection bias.
Nonresponse Error: Not all selected individuals respond, possibly biasing results.
Sampling Error: Chance variation from sample to sample that arises because only part of the population is observed; commonly reported as the margin of error.
Measurement Error: Errors from question design, respondent misunderstanding, or data recording.
Ethical Issues in Surveys
Intentionally excluding groups (coverage error) or designing a survey in a way that encourages nonresponse is unethical.
Failure to disclose sample size or margin of error can mislead stakeholders.
Leading questions or interviewer influence can bias results.
Using nonprobability samples for generalization without disclosure is unethical.
Applications and Case Studies
Business Case Examples
Coca-Cola "New Coke": Blind taste tests measured taste preference but not actual purchase intent, leading to a failed product launch.
AMS Telecommunications: Uses internal and external data sources, emphasizing the need for clear operational definitions and appropriate data collection methods.
CardioGood Fitness: Identifies customer profiles using both categorical and numerical variables for targeted marketing.
Clear Mountain State Student Survey: Demonstrates the importance of variable classification for appropriate statistical analysis.
Software Tools for Data Handling
Excel
Automatically infers variable types; use leading apostrophes to force categorical treatment of numbers.
Functions like RANDBETWEEN for random sampling; add-ins for sampling without replacement.
Formulas and lookup tables for data cleaning and recoding.
JMP
Allows manual adjustment of variable type and scale.
Subset and stratify functions for sampling.
Stacking and unstacking via dialog boxes.
Minitab
Automatic and manual variable type assignment.
Data cleaning during import and with column formulas.
Replace and Calculator commands for recoding variables.
Summary Table: Variable Classification Examples
| Variable | Type | Discrete/Continuous | Scale |
|---|---|---|---|
| Number of cellphones | Numerical | Discrete | Ratio |
| Monthly data usage | Numerical | Continuous | Ratio |
| Academic major | Categorical | N/A | Nominal |
| Gender | Categorical | N/A | Nominal |
| Income | Numerical | Discrete/Continuous | Ratio |
| Test scores | Numerical | Continuous | Interval/Ratio |
Key Takeaways
Clearly define variables and their measurement scales before collecting data.
Choose appropriate sampling methods to ensure representativeness and minimize bias.
Data cleaning and preprocessing are critical for reliable analysis.
Be aware of potential survey errors and ethical considerations in data collection and reporting.
Use software tools effectively for data handling, but always verify and clean data manually as needed.
Additional info: Some explanations and examples have been expanded for clarity and completeness, following academic best practices for introductory business statistics.