Biostatistics and Public Health: Week 1 Study Guide

Introduction to Biostatistics and Public Health

Overview of Biostatistics

Biostatistics is the application of statistical methods to biological, medical, and public health research. It enables researchers to analyze data, draw conclusions, and make informed decisions about health and medicine.

Definition: Biostatistics uses statistical techniques to interpret and analyze data from biological and health-related studies.
Applications: Used in epidemiology, clinical trials, public health policy, and health disparities research.
Example: Analyzing the relationship between smoking and lung cancer incidence.

Course Structure and Expectations

Course Organization

This course (PUBHLTH 500: Investigating Public Health) introduces students to biostatistics in the context of public health. The course includes lectures, lab assignments, quizzes, and a group paper/presentation.

Assignments: Lab assignments (due Fridays), quizzes (multiple choice, via Canvas), and a group paper/presentation.
Grading: Categories are weighted; see syllabus for details.
Communication: All course activities are managed through Canvas.

Key Concepts in Biostatistics

Variables and Data Types

Understanding variable types is fundamental in biostatistics, as it determines the appropriate statistical methods for analysis.

Numerical Variables: Quantitative measurements that are either continuous (e.g., height, weight) or discrete (e.g., number of children).
Categorical Variables: Qualitative characteristics that are either nominal (unordered, e.g., gender) or ordinal (ordered, e.g., education level).
Example: Survey data may include gender (nominal categorical), cigarettes smoked per week (discrete numerical), and education level (ordinal categorical).
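
The variable types above map onto R's storage types. The following is a minimal sketch using a made-up data frame; the column names and values are hypothetical and not taken from the course data.

```r
# Toy survey data; all names and values are invented for illustration.
survey <- data.frame(
  gender = factor(c("F", "M", "F", "M")),        # nominal categorical
  cigs_per_week = c(0L, 14L, 3L, 0L),            # discrete numerical
  height_cm = c(162.5, 178.0, 170.2, 181.3),     # continuous numerical
  education = factor(
    c("HS", "College", "Graduate", "HS"),
    levels = c("HS", "College", "Graduate"),
    ordered = TRUE                               # ordinal categorical
  )
)

str(survey)                                # shows how R stores each column
is.numeric(survey$cigs_per_week)           # TRUE
is.ordered(survey$education)               # TRUE
survey$education[1] < survey$education[2]  # TRUE: "HS" ranks below "College"
```

Storing an ordinal variable as an ordered factor lets R respect the ranking, while a plain factor treats the categories as unordered labels.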

Study Designs in Public Health

Different study designs are used to investigate relationships between exposures and outcomes in public health research.

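Cohort Study: Follows a group of subjects over time and compares outcomes between those with and without an exposure. Because it is observational, confounding can produce an association that is not causal, so it cannot by itself prove causation.
Case-Control Study: Compares individuals with a condition (cases) to those without (controls), looking back at past exposures. It is also observational, so the same limit on causal claims applies.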
Cross-Sectional Study: Examines a sample at one point in time; useful for assessing associations but not causation.
Randomized Controlled Trial (RCT): Subjects are randomly assigned to groups; considered the gold standard for establishing causation.
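
Random assignment, the feature that sets an RCT apart from the observational designs, can be sketched in a few lines of R. This is only an illustration: the number of participants and the arm labels are hypothetical.

```r
set.seed(500)  # arbitrary seed so the assignment is reproducible

n <- 20                                   # hypothetical number of participants
participants <- data.frame(id = seq_len(n))

# Shuffle a balanced vector of labels so each arm gets exactly n / 2 people.
participants$arm <- sample(rep(c("treatment", "control"), each = n / 2))

table(participants$arm)                   # 10 in each arm
```

Because chance alone decides who lands in which arm, known and unknown confounders tend to be balanced across the groups.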

Correlation vs. Causation

It is essential to distinguish between correlation (association) and causation in statistical analysis.

Correlation: Indicates a relationship between two variables, but does not imply one causes the other.
Causation: Implies that changes in one variable directly result in changes in another; best established through RCTs.
Example: Observational studies may show that seafood consumption is associated with lower cancer rates, but cannot prove causation due to potential confounding variables.
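
A small simulation can show how a confounder creates an association without causation. The sketch below assumes a hypothetical "lifestyle" score that drives both seafood consumption and cancer risk; the variable names and effect sizes are invented, and seafood is given no direct effect on risk.

```r
set.seed(1)
n <- 5000

# Hypothetical confounder, e.g. a general "healthy lifestyle" score.
lifestyle <- rnorm(n)

# Both variables depend on lifestyle; seafood has NO direct effect on risk here.
seafood <- 0.8 * lifestyle + rnorm(n)
cancer_risk <- -0.8 * lifestyle + rnorm(n)

# The raw correlation is clearly negative despite the absence of a causal link.
r <- cor(seafood, cancer_risk)

# Regression slope for seafood, without and with adjustment for the confounder.
unadjusted <- coef(lm(cancer_risk ~ seafood))["seafood"]
adjusted <- coef(lm(cancer_risk ~ seafood + lifestyle))["seafood"]

c(correlation = r, unadjusted = unadjusted, adjusted = adjusted)
```

Once the confounder is included in the model, the seafood coefficient shrinks toward zero. In real observational data the confounders are often unmeasured, which is why adjustment cannot fully replace randomization.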

Sampling Methods

Importance of Sampling

Sampling is critical because it is often impractical to study entire populations. A good sample should be representative to allow valid generalization.

Population: The entire group of interest.
Sample: A subset of the population used to make inferences about the whole.
Statistical Inference: The process of using sample data to estimate population parameters.
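
The relationship between a population parameter and a sample estimate can be illustrated by simulation. The population below (blood pressure values for 100,000 adults) is entirely hypothetical and exists only so that the true mean is known.

```r
set.seed(42)

# Hypothetical population: systolic blood pressure (mmHg) for 100,000 adults.
population <- rnorm(100000, mean = 120, sd = 15)
mu <- mean(population)              # population parameter (usually unknown)

# One simple random sample of 200 people, drawn without replacement.
smp <- sample(population, size = 200)
x_bar <- mean(smp)                  # sample statistic used to estimate mu

c(population_mean = mu, sample_mean = x_bar)

# Repeating the sampling shows how much the estimate varies between samples.
estimates <- replicate(1000, mean(sample(population, size = 200)))
range(estimates)
```

Each sample gives a slightly different estimate, but the estimates cluster around the population mean, which is what makes inference from a single representative sample reasonable.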

Sampling Techniques

Several sampling methods are used to ensure representativeness and minimize bias.

Simple Random Sampling: Every member of the population has an equal chance of being selected.
Stratified Sampling: Population is divided into strata (groups that share a characteristic, such as age band or income level), and a random sample is taken from each stratum.
Cluster Sampling: Population is divided into clusters, some clusters are randomly selected, and all individuals in those clusters are studied.
Example: For a household survey in a metropolitan area, cluster sampling may be used for practicality, but care must be taken to avoid missing important subgroups.
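
The three techniques can be compared side by side in base R. The sampling frame below is hypothetical (1,000 made-up households in 20 neighborhoods), and the sample sizes are arbitrary choices for illustration.

```r
set.seed(7)

# Hypothetical sampling frame: 1,000 households in 20 neighborhoods,
# each labeled with an income stratum. All values are made up.
frame <- data.frame(
  household = 1:1000,
  neighborhood = rep(1:20, each = 50),
  income = sample(c("low", "middle", "high"), 1000, replace = TRUE)
)

# Simple random sampling: 100 households, each equally likely to be chosen.
srs <- frame[sample(nrow(frame), 100), ]

# Stratified sampling: 10% of the households within every income stratum.
stratified <- do.call(rbind, lapply(split(frame, frame$income), function(s) {
  s[sample(nrow(s), ceiling(0.10 * nrow(s))), ]
}))

# Cluster sampling: pick 2 neighborhoods at random, keep every household in them.
chosen <- sample(unique(frame$neighborhood), 2)
cluster <- frame[frame$neighborhood %in% chosen, ]

table(stratified$income)        # every stratum is represented
unique(cluster$neighborhood)    # only the selected clusters appear
```

The cluster sample covers only two neighborhoods, which is the practical trade-off noted above: it is cheaper to collect but can miss subgroups concentrated elsewhere.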

Introduction to R and Data Analysis

Using R and RStudio

R is a statistical programming language widely used in biostatistics. RStudio is an integrated development environment (IDE) for R.

Importing Data: Use the Import Wizard or packages like tidyverse to load datasets (e.g., BRFSS23).
Basic Commands: summary() for summary statistics, table() for frequency counts, boxplot() and hist() for visualizations.
R Markdown: Allows integration of code, output, and formatted text for reproducible reports. Equations can be written in LaTeX between dollar signs.
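
A short session with the basic commands might look like the sketch below. The data frame is made up and stands in for an imported dataset; the column names are hypothetical.

```r
# Made-up data standing in for an imported dataset.
dat <- data.frame(
  bmi = c(22.1, 27.5, 31.0, 24.3, 29.8, 35.2, 19.9, 26.4),
  smoker = c("no", "yes", "no", "no", "yes", "yes", "no", "no")
)

summary(dat$bmi)     # min, quartiles, median, mean, max
table(dat$smoker)    # frequency count per category

# Distribution of one numerical variable.
hist(dat$bmi, main = "BMI", xlab = "BMI (kg/m^2)")

# Compare a numerical variable across the groups of a categorical one.
boxplot(bmi ~ smoker, data = dat, ylab = "BMI (kg/m^2)")
```

As a rule of thumb, summary(), hist(), and boxplot() suit numerical variables, while table() suits categorical ones.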

Example R Code

Creating a Numeric Vector: x <- c(1,2,3,4)
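Calculating Mean by Group: tapply(chicago$y, chicago$SEXVAR, mean, na.rm=TRUE), where y is a placeholder for the numeric column being averaged and chicago$SEXVAR defines the groups
Barplot of Categorical Variable: barplot(table(chicago$`_BMI5CAT`))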
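
The group mean and bar plot commands can be tried on a small made-up data frame before running them on the course data. Everything in this sketch is hypothetical: the data frame toy, its columns, and its values are invented, and `_DEMOCAT` only illustrates a column name that starts with an underscore.

```r
# Hypothetical stand-in data; check.names = FALSE keeps the leading underscore.
toy <- data.frame(
  SEXVAR = c(1, 2, 2, 1, 2, 1),
  weight_kg = c(82, 65, NA, 90, 70, 77),   # hypothetical numeric column
  `_DEMOCAT` = c(3, 2, 2, 4, 3, 3),        # hypothetical categorical column
  check.names = FALSE
)

# tapply(values, grouping variable, function, ...):
# mean weight within each SEXVAR group, ignoring missing values.
means <- tapply(toy$weight_kg, toy$SEXVAR, mean, na.rm = TRUE)
means

# Bar chart of how many rows fall in each category.
# Backticks are needed because the name is not a standard R identifier.
barplot(table(toy$`_DEMOCAT`))
```

Note the argument order in tapply(): the values to summarize come first, the grouping variable second, and the function third.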

Summary Table: Sampling Methods

Sampling Method	Description	Advantages	Disadvantages
Simple Random	Each member has equal chance	Unbiased, easy to analyze	May be impractical for large populations
Stratified	Population divided into strata, sampled from each	Ensures representation of all groups	Requires knowledge of strata
Cluster	Population divided into clusters, some clusters selected	Practical for large populations	May miss important subgroups

Key Formulas

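Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where n is the sample size
Population Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, where N is the population size
Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$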
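
The sample mean and sample variance can be worked by hand in R and checked against the built-in functions. The five observations below are made up for the example.

```r
x <- c(4, 8, 6, 5, 7)                # made-up sample of n = 5 observations
n <- length(x)

x_bar <- sum(x) / n                  # sample mean: 30 / 5 = 6
s2 <- sum((x - x_bar)^2) / (n - 1)   # sample variance: 10 / 4 = 2.5

x_bar
s2

# The built-in functions give the same answers.
all.equal(x_bar, mean(x))
all.equal(s2, var(x))
```

R's var() uses the n - 1 divisor, so it returns the sample variance rather than the population variance.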

Additional info:

BRFSS23 refers to data from the Behavioral Risk Factor Surveillance System (BRFSS), a CDC health survey of U.S. adults that is commonly used in public health research.