Introduction to Statistics: Key Concepts and Types of Data

Study Guide - Smart Notes


Introduction to Statistics

Statistical and Critical Thinking

Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. A major use of statistics is to collect sample data and use it to draw conclusions about populations. Critical thinking in statistics means questioning the validity of the data, the methods used to collect it, and whether the conclusions drawn actually follow from the data.

Population: The complete set of all individuals or items of interest in a statistical study.
Sample: A subset of the population, selected for analysis.
Parameter: A numerical measurement describing some characteristic of a population.
Statistic: A numerical measurement describing some characteristic of a sample.
Key Concept: Statistics often uses information from a sample to make inferences about a population.
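
Illustration: The difference between a parameter and a statistic can be sketched in a few lines of Python. This is only a minimal sketch: the population of ages is made up with a random number generator, and the sample size of 200 is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 10,000 people (made-up values)
population = [random.randint(18, 90) for _ in range(10_000)]

# Parameter: a measurement computed from every member of the population
parameter_mean = statistics.mean(population)

# Sample: a simple random subset of 200 members of the population
sample = random.sample(population, 200)

# Statistic: the same measurement, computed from the sample only
statistic_mean = statistics.mean(sample)

print(f"Population mean (parameter): {parameter_mean:.2f}")
print(f"Sample mean (statistic):     {statistic_mean:.2f}")
```

The two means are usually close but not identical. In practice the parameter is unknown, and the statistic is used to estimate it.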

Types of Data

Quantitative and Categorical Data

Data can be classified into two main types: quantitative (numerical) and categorical (qualitative or attribute) data.

Quantitative Data: Consists of numbers representing counts or measurements. Examples: The weights of supermodels, the ages of respondents.
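Categorical Data: Consists of names or labels (not numbers that represent counts or measurements). Examples: The gender (male/female) of professional athletes, shirt numbers on professional athletes (as substitutes for names). Shirt numbers look numeric, but they only identify players, so calculations such as an average shirt number are meaningless.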

Working with Quantitative Data: Discrete vs. Continuous

Quantitative data can be further described by distinguishing between discrete and continuous types.

Discrete Data: Quantitative data whose possible values are finite or countable (such as 0, 1, 2, and so on). Example: The number of tosses of a coin before getting heads.
Continuous Data: Quantitative data with infinitely many possible values that are not countable, such as every value in an interval. Example: Lengths anywhere from 0 cm to 12 cm.
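
Illustration: The two examples above can be simulated to show the contrast. This is a rough sketch under stated assumptions: the function name tosses_before_heads is hypothetical, a fair coin is assumed, and the lengths are drawn uniformly only for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible


def tosses_before_heads():
    """Count the tails that occur before the first head of a fair coin."""
    count = 0
    while random.random() < 0.5:  # treat a draw below 0.5 as tails
        count += 1
    return count


# Discrete: only whole-number counts (0, 1, 2, ...) can occur
discrete_values = [tosses_before_heads() for _ in range(10)]

# Continuous: any length between 0 cm and 12 cm can occur
continuous_values = [random.uniform(0, 12) for _ in range(5)]

print(discrete_values)
print(continuous_values)
```

The counts are always whole numbers, while the lengths can fall anywhere in the interval and are limited only by the precision of the measurement.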

Levels of Measurement

Overview of Measurement Levels

Another way of classifying data is to use four levels of measurement: nominal, ordinal, interval, and ratio. These levels determine the types of statistical analyses that are appropriate for the data.

Nominal Level: Data consist of names, labels, or categories only. The data cannot be arranged in any meaningful order. Example: Survey responses of yes, no, and undecided.
Ordinal Level: Data can be arranged in some order, but differences between data values either cannot be determined or are meaningless. Example: Course grades A, B, C, D, or F.
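Interval Level: Data can be arranged in order, and differences between data values can be found and are meaningful. However, there is no natural zero starting point, so ratios are not meaningful. Example: Years 1000, 2000, 1776, and 1492. The year 0 is an arbitrary reference point and does not mean "no time," so the year 2000 is not "twice" the year 1000.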
Ratio Level: Data can be arranged in order, differences can be found and are meaningful, and there is a natural zero starting point. Both differences and ratios are meaningful. Example: Class times of 50 minutes and 100 minutes.

Summary Table: Levels of Measurement

Level	Description	Example
Nominal	Categories only, no meaningful order	Yes/No/Undecided
Ordinal	Categories with a meaningful order, but differences are not meaningful	Course grades (A, B, C, D, F)
Interval	Ordered, with meaningful differences but no natural zero point	Years (1000, 2000, etc.)
Ratio	Ordered, with meaningful differences and a natural zero point	Class times (minutes)
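
Illustration: One way to see why ratios are meaningful only at the ratio level is to move the zero point and check what changes. The sketch below reuses the years and class times from the examples above. The offset of 500 is a hypothetical alternative choice of "year zero," not a real calendar.

```python
# Interval-level data: calendar years (the zero point is arbitrary)
year_a, year_b = 1000, 2000

# Ratio-level data: class times in minutes (0 really means "no time")
time_a, time_b = 50, 100

# Differences are meaningful at both levels
print(year_b - year_a)  # 1000 (years elapsed)
print(time_b - time_a)  # 50 (minutes longer)

# Hypothetical: start counting years 500 years later
offset = 500
shifted_a, shifted_b = year_a - offset, year_b - offset

# The difference is unchanged by moving the zero point ...
print(shifted_b - shifted_a)  # 1000

# ... but the ratio depends on where zero was placed, so it carries no meaning
print(year_b / year_a)        # 2.0
print(shifted_b / shifted_a)  # 3.0

# For ratio-level data the zero is fixed, so the ratio survives a change of units
print((time_b * 60) / (time_a * 60))  # 2.0 (same ratio in seconds)
```

A 100-minute class really is twice as long as a 50-minute class, whatever the unit. No comparable statement holds for the years.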

Big Data and Data Science

Big Data

Big data refers to data sets so large and complex that traditional software tools are inadequate for analysis. Analysis of big data may require software running in parallel on many different computers.

Data Science

Data science involves the application of statistics, computer science, and software engineering, along with other relevant fields such as sociology or finance, to analyze and interpret complex data sets.

Handling Missing Data

Types of Missing Data

Missing Completely at Random (MCAR): The likelihood of a data value being missing is independent of its value or any other values in the data set. Any data value is just as likely to be missing as any other.
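Missing Not at Random (MNAR): The missing value is related to the reason that it is missing. In other words, the chance that a value is missing depends on the value itself. Example: Respondents with very high incomes may be more likely to skip a survey question about income.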
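
Illustration: A small simulation can show why the distinction matters. This is a sketch under assumed numbers, not a model of any real survey: the income values are made up, and the missingness rates (30% for MCAR; 60% versus 10% for MNAR) are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: 10,000 made-up annual incomes
incomes = [random.uniform(20_000, 100_000) for _ in range(10_000)]
true_mean = statistics.mean(incomes)
cutoff = statistics.median(incomes)

# MCAR: every value has the same 30% chance of being missing
mcar_observed = [x for x in incomes if random.random() >= 0.30]


def is_missing_mnar(value):
    """MNAR: values above the median are more likely to be missing."""
    p_missing = 0.60 if value > cutoff else 0.10
    return random.random() < p_missing


mnar_observed = [x for x in incomes if not is_missing_mnar(x)]

mcar_mean = statistics.mean(mcar_observed)
mnar_mean = statistics.mean(mnar_observed)

print(f"True mean:          {true_mean:,.0f}")
print(f"Mean under MCAR:    {mcar_mean:,.0f}")
print(f"Mean under MNAR:    {mnar_mean:,.0f}")
```

Under MCAR the observed mean stays close to the true mean because the missing values are a random slice of the data. Under MNAR the observed mean is pulled downward because high values are missing more often.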

Correcting for Missing Data

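Delete Cases: One common method is to delete all subjects having any missing values. This reduces the sample size, and it can bias results if the values are not missing completely at random.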
Impute Missing Values: Substitute values for missing data, a process known as imputation.

Additional info: In practice, the choice of method for handling missing data depends on the nature of the data and the reason for missingness. Imputation methods can range from simple (mean substitution) to complex (multiple imputation, regression-based methods).
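
Illustration: The two corrections can be sketched with plain Python. The records, the field names (age, weight_kg), and the helper impute_mean are all hypothetical and exist only for this example. Mean substitution is shown because it is the simplest form of imputation, not because it is the recommended one.

```python
import statistics

# Hypothetical records; None marks a missing value
records = [
    {"age": 34, "weight_kg": 70.5},
    {"age": None, "weight_kg": 82.0},
    {"age": 51, "weight_kg": None},
    {"age": 29, "weight_kg": 64.0},
]

# Delete cases: keep only subjects with no missing values
complete_cases = [r for r in records if None not in r.values()]


def impute_mean(rows, field):
    """Return copies of rows with missing `field` values replaced by the observed mean."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = statistics.mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


# Impute missing values: apply mean substitution to each field in turn
imputed = impute_mean(impute_mean(records, "age"), "weight_kg")

print(len(complete_cases))  # 2 subjects remain after deletion
print(imputed[1]["age"])    # 38, the mean of 34, 51, and 29
```

Deleting cases discards half of this small data set, while imputation keeps all four subjects at the cost of inserting values that were never observed.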