Introduction to Statistics: Key Concepts and Types of Data
Introduction to Statistics
Statistical and Critical Thinking
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. A major use of statistics is to collect sample data and use it to draw conclusions about populations. Critical thinking in statistics means questioning the validity of the data, the way it was collected, and whether the conclusions drawn from it are justified.
Population: The complete set of all individuals or items of interest in a statistical study.
Sample: A subset of the population, selected for analysis.
Parameter: A numerical measurement describing some characteristic of a population.
Statistic: A numerical measurement describing some characteristic of a sample.
Key Concept: Statistics often uses information from a sample to make inferences about a population.
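The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the population of ages is randomly generated (hypothetical data), and the helper name `parameter_and_statistic` is an assumption made for this example.

```python
import random
import statistics

def parameter_and_statistic(population, sample_size, seed=0):
    """Return (population mean, sample mean) for one simple random sample."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)  # sampling without replacement
    parameter = statistics.mean(population)       # describes the whole population
    statistic = statistics.mean(sample)           # describes only the sample
    return parameter, statistic

# Hypothetical population: ages of 1,000 people (made-up values for illustration)
rng = random.Random(42)
ages = [rng.randint(18, 90) for _ in range(1000)]

mu, x_bar = parameter_and_statistic(ages, sample_size=50)
print(f"parameter (population mean): {mu:.2f}")
print(f"statistic (sample mean):     {x_bar:.2f}")
```

The two printed values will usually be close but not identical, which is why a sample statistic is treated as an estimate of the population parameter rather than as the parameter itself.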
Types of Data
Quantitative and Categorical Data
Data can be classified into two main types: quantitative (numerical) and categorical (qualitative or attribute) data.
Quantitative Data: Consists of numbers representing counts or measurements. Examples: The weights of supermodels, the ages of respondents.
Categorical Data: Consists of names or labels (not numbers that represent counts or measurements). Examples: The gender (male/female) of professional athletes, shirt numbers on professional athletes. Shirt numbers are numerals, but they act as substitutes for names, so arithmetic on them (such as an average shirt number) is meaningless.
Working with Quantitative Data: Discrete vs. Continuous
Quantitative data can be further described by distinguishing between discrete and continuous types.
Discrete Data: Result when the data values are quantitative and the number of possible values is finite or countable (the values can be listed as 0, 1, 2, and so on, even if the list never ends). Example: The number of tosses of a coin before getting heads.
Continuous Data: Result from infinitely many possible quantitative values, where the collection of values is not countable. Example: Lengths anywhere between 0 cm and 12 cm, since any value in that range is possible.
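The two examples above can be simulated as a rough sketch. The function names are assumptions made for this illustration, and a computer's floating-point numbers only approximate a truly continuous range.

```python
import random

def tosses_before_heads(rng):
    """Count the tosses that come up tails before the first heads.

    The result is discrete: it can only be 0, 1, 2, ...
    """
    count = 0
    while rng.random() < 0.5:  # treat a draw below 0.5 as tails (fair coin)
        count += 1
    return count

def random_length_cm(rng, low=0.0, high=12.0):
    """Draw a length between low and high; any value in the range is possible."""
    return rng.uniform(low, high)

rng = random.Random(1)
print("discrete:  ", [tosses_before_heads(rng) for _ in range(5)])
print("continuous:", [round(random_length_cm(rng), 3) for _ in range(5)])
```

The discrete values are always whole numbers, while the continuous values can fall anywhere in the interval and are only rounded here for display.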
Levels of Measurement
Overview of Measurement Levels
Another way of classifying data is to use four levels of measurement: nominal, ordinal, interval, and ratio. These levels determine the types of statistical analyses that are appropriate for the data.
Nominal Level: Data consist of names, labels, or categories only. The data cannot be arranged in any meaningful order. Example: Survey responses of yes, no, and undecided.
Ordinal Level: Data can be arranged in some order, but differences between data values either cannot be determined or are meaningless. Example: Course grades A, B, C, D, or F.
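Interval Level: Data can be arranged in order, and differences between data values can be found and are meaningful. However, there is no natural zero starting point, so ratios are not meaningful. Example: Years 1000, 2000, 1776, and 1492. The year 0 does not mean "no time", so the year 2000 is not "twice" the year 1000.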
Ratio Level: Data can be arranged in order, differences can be found and are meaningful, and there is a natural zero starting point. Both differences and ratios are meaningful. Example: Class times of 50 minutes and 100 minutes.
Summary Table: Levels of Measurement
| Level | Description | Example |
|---|---|---|
| Nominal | Categories only | Yes/No/Undecided |
| Ordinal | Categories with some order | Course grades (A, B, C, D, F) |
| Interval | Differences but no natural zero point | Years (1000, 2000, etc.) |
| Ratio | Differences and a natural zero point | Class times (minutes) |
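One way to read the four levels is as a ladder, where each level permits every operation of the levels below it plus one more. The sketch below encodes that reading. The names `LEVELS`, `REQUIRED_LEVEL`, and `supports` are hypothetical helpers for this illustration, not part of any library.

```python
# Levels listed from least to most informative
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Lowest level at which each operation becomes meaningful
REQUIRED_LEVEL = {
    "order": "ordinal",        # ranking values
    "difference": "interval",  # subtracting values
    "ratio": "ratio",          # dividing values ("twice as much")
}

def supports(level, operation):
    """Return True if `operation` is meaningful for data at `level`."""
    return LEVELS.index(level) >= LEVELS.index(REQUIRED_LEVEL[operation])

# Interval example from the notes: a difference between years is meaningful
print("years apart:", 2000 - 1776, "| ratio meaningful?", supports("interval", "ratio"))

# Ratio example from the notes: 100 minutes is twice as long as 50 minutes
print("times ratio:", 100 / 50, "| ratio meaningful?", supports("ratio", "ratio"))
```

A check like `supports("ordinal", "difference")` returns `False`, matching the point that course grades can be ranked but not subtracted.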
Big Data and Data Science
Big Data
Big data refers to data sets so large and complex that traditional software tools are inadequate for analysis. Analysis of big data may require software running in parallel on many different computers.
Data Science
Data science involves the application of statistics, computer science, and software engineering, along with other relevant fields such as sociology or finance, to analyze and interpret complex data sets.
Handling Missing Data
Types of Missing Data
Missing Completely at Random (MCAR): The likelihood of a data value being missing is independent of its value or any other values in the data set. Any data value is just as likely to be missing as any other.
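Missing Not at Random (MNAR): The missing value is related to the reason that it is missing. Example: Survey respondents with very high incomes may be less likely to report their income, so the missing values are systematically different from the observed ones.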
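The practical difference between the two mechanisms can be seen in a small simulation. This is a sketch under stated assumptions: the income figures are randomly generated (hypothetical data), the threshold and probabilities are arbitrary, and the function names are made up for this example.

```python
import random
import statistics

def make_mcar(values, p_missing, rng):
    """Drop each value with the same probability, regardless of the value."""
    return [None if rng.random() < p_missing else v for v in values]

def make_mnar(values, threshold, p_missing, rng):
    """Drop only values above `threshold`, each with probability p_missing."""
    return [
        None if (v > threshold and rng.random() < p_missing) else v
        for v in values
    ]

def observed_mean(values):
    """Mean of the values that are not missing."""
    return statistics.mean(v for v in values if v is not None)

# Hypothetical incomes in thousands (made-up values for illustration)
rng = random.Random(0)
incomes = [rng.uniform(20, 200) for _ in range(10_000)]

print(f"true mean:          {statistics.mean(incomes):.1f}")
print(f"MCAR observed mean: {observed_mean(make_mcar(incomes, 0.3, rng)):.1f}")
print(f"MNAR observed mean: {observed_mean(make_mnar(incomes, 150, 0.8, rng)):.1f}")
```

Under MCAR the observed mean should stay close to the true mean, while under this MNAR setup it should come out lower, because the high values are the ones that tend to go missing.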
Correcting for Missing Data
Delete Cases: One common method is to delete all subjects having any missing values.
Impute Missing Values: Substitute values for missing data, a process known as imputation.
Additional info: In practice, the choice of method for handling missing data depends on the nature of the data and the reason for missingness. Imputation methods can range from simple (mean substitution) to complex (multiple imputation, regression-based methods).
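The two correction approaches, deleting cases and simple mean substitution, can be sketched as follows. The records and the helper names `delete_cases` and `impute_mean` are assumptions made for this illustration. More complex methods such as multiple imputation are not shown.

```python
import statistics

def delete_cases(rows):
    """Delete cases: keep only rows with no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row.values())]

def impute_mean(rows, column):
    """Mean substitution: fill missing values in `column` with the observed mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = statistics.mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]

# Hypothetical records (made-up values); None marks a missing value
subjects = [
    {"age": 20, "weight_kg": 60},
    {"age": None, "weight_kg": 70},
    {"age": 40, "weight_kg": 80},
]

print(delete_cases(subjects))        # the subject with the missing age is dropped
print(impute_mean(subjects, "age"))  # the missing age is replaced by the observed mean
```

Deleting cases shrinks the data set, while mean substitution keeps every subject but reduces the variability of the imputed column, which is one reason the choice depends on why the values are missing.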