Biostatistics and Public Health: Week 1 Study Guide
Introduction to Biostatistics and Public Health
Overview of Biostatistics
Biostatistics is the application of statistical methods to biological, medical, and public health research. It enables researchers to analyze data, draw conclusions, and make informed decisions about health and medicine.
Definition: Biostatistics uses statistical techniques to interpret and analyze data from biological and health-related studies.
Applications: Used in epidemiology, clinical trials, public health policy, and health disparities research.
Example: Analyzing the relationship between smoking and lung cancer incidence.
Course Structure and Expectations
Course Organization
This course (PUBHLTH 500: Investigating Public Health) introduces students to biostatistics in the context of public health. The course includes lectures, lab assignments, quizzes, and a group paper/presentation.
Assignments: Lab assignments (due Fridays), quizzes (multiple choice, via Canvas), and a group paper/presentation.
Grading: Categories are weighted; see syllabus for details.
Communication: All course activities are managed through Canvas.
Key Concepts in Biostatistics
Variables and Data Types
Understanding variable types is fundamental in biostatistics, as it determines the appropriate statistical methods for analysis.
Numerical Variables: Quantitative measurements, can be continuous (e.g., height, weight) or discrete (e.g., number of children).
Categorical Variables: Qualitative characteristics, can be nominal (unordered, e.g., gender) or ordinal (ordered, e.g., education level).
Example: Survey data may include variables such as gender (nominal categorical), cigarettes smoked per week (discrete numerical), and education level (ordinal categorical).
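A minimal sketch of how these variable types might be represented in R. The data frame `survey`, its column names, and all values are invented for illustration and do not come from the course dataset.

```r
# Hypothetical survey data (values invented for illustration)
survey <- data.frame(
  gender    = factor(c("F", "M", "F", "M")),          # nominal categorical
  cigs_week = c(0L, 14L, 3L, 0L),                     # discrete numerical
  height_cm = c(162.5, 178.1, 170.0, 181.3),          # continuous numerical
  education = factor(c("HS", "College", "Grad", "HS"),
                     levels = c("HS", "College", "Grad"),
                     ordered = TRUE)                  # ordinal categorical
)

str(survey)                    # shows how R stores each column
is.ordered(survey$education)   # ordered factors carry a ranking of levels
survey$education > "HS"        # so comparisons between levels are allowed
table(survey$gender)           # counts are the natural summary for nominal data
mean(survey$cigs_week)         # means make sense for numerical data
```

Storing ordinal variables as ordered factors keeps the ranking of the levels, which matters later when choosing summaries and tests.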
Study Designs in Public Health
Different study designs are used to investigate relationships between exposures and outcomes in public health research.
Cohort Study: Follows subjects over time to assess the relationship between exposure and outcome. Because it is observational, it cannot by itself prove causation; confounding variables may explain the association.
Case-Control Study: Compares individuals with a condition (cases) to those without (controls), looking back at exposures.
Cross-Sectional Study: Examines a sample at one point in time; useful for assessing associations but not causation.
Randomized Controlled Trial (RCT): Subjects are randomly assigned to groups; considered the gold standard for establishing causation.
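The random assignment step that distinguishes an RCT from the observational designs can be sketched in base R. This is only an illustration under assumed numbers: the 20 subject IDs, the two arm labels, and the seed are all hypothetical.

```r
set.seed(1)                      # arbitrary seed so the allocation is reproducible
subjects <- paste0("S", 1:20)    # hypothetical subject IDs

# Shuffle a balanced vector of arm labels: 10 treatment, 10 control
arm <- sample(rep(c("treatment", "control"), each = 10))

assignment <- data.frame(subject = subjects, arm = arm)
table(assignment$arm)            # confirms the arms are balanced
```

Because chance alone decides who lands in each arm, measured and unmeasured characteristics tend to balance out between the groups, which is what supports a causal interpretation.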
Correlation vs. Causation
It is essential to distinguish between correlation (association) and causation in statistical analysis.
Correlation: Indicates a relationship between two variables, but does not imply one causes the other.
Causation: Implies that changes in one variable directly result in changes in another; best established through RCTs.
Example: Observational studies may show that seafood consumption is associated with lower cancer rates, but cannot prove causation due to potential confounding variables.
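A small simulation can illustrate how a confounder produces correlation without causation. This is a sketch with made-up variables: `z` plays the role of a confounder, `x` an exposure, and `y` an outcome, and by construction `x` has no effect on `y`.

```r
set.seed(42)         # arbitrary seed for reproducibility
n <- 5000
z <- rnorm(n)        # confounder (for example, a lifestyle factor)
x <- z + rnorm(n)    # exposure depends on z only
y <- z + rnorm(n)    # outcome depends on z only, not on x

r_xy <- cor(x, y)    # clearly positive even though x does not cause y
r_xy

# Adjusting for the confounder: the coefficient on x is close to zero
adj <- coef(lm(y ~ x + z))["x"]
adj
```

In real observational data the confounder is often unmeasured, so this kind of adjustment is not always possible; randomization avoids the problem by design.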
Sampling Methods
Importance of Sampling
Sampling is critical because it is often impractical to study entire populations. A good sample should be representative to allow valid generalization.
Population: The entire group of interest.
Sample: A subset of the population used to make inferences about the whole.
Statistical Inference: The process of using sample data to estimate population parameters.
Sampling Techniques
Several sampling methods are used to ensure representativeness and minimize bias.
Simple Random Sampling: Every member of the population has an equal chance of being selected.
Stratified Sampling: Population is divided into strata (groups that share a characteristic, such as age group), and a random sample is taken from each stratum.
Cluster Sampling: Population is divided into clusters, some clusters are randomly selected, and all individuals in those clusters are studied.
Example: For a household survey in a metropolitan area, cluster sampling may be used for practicality, but care must be taken to avoid missing important subgroups.
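The three techniques can be sketched in base R on a simulated population. Everything here is hypothetical: the population of 1,000 people, the 20 neighborhoods used as clusters, the two age groups used as strata, and the sample sizes.

```r
set.seed(123)  # arbitrary seed for reproducibility

# Hypothetical population: 1,000 people in 20 neighborhoods, two age groups
pop <- data.frame(
  id           = 1:1000,
  neighborhood = rep(1:20, each = 50),
  age_group    = sample(c("under 50", "50+"), 1000, replace = TRUE)
)

# Simple random sample: 100 people drawn from the whole population
srs <- pop[sample(nrow(pop), 100), ]

# Stratified sample: about 10% drawn at random within each age group
strata <- split(pop, pop$age_group)
stratified <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), round(0.10 * nrow(s))), ]
}))

# Cluster sample: 4 neighborhoods chosen at random, everyone in them included
chosen  <- sample(unique(pop$neighborhood), 4)
cluster <- pop[pop$neighborhood %in% chosen, ]

table(stratified$age_group)   # both strata are guaranteed to appear
table(cluster$neighborhood)   # only the 4 chosen clusters appear
```

The cluster sample is the cheapest to collect here, but it covers only 4 of the 20 neighborhoods, which is how important subgroups can be missed.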
Introduction to R and Data Analysis
Using R and RStudio
R is a statistical programming language widely used in biostatistics. RStudio is an integrated development environment (IDE) for R.
Importing Data: Use RStudio's Import Dataset wizard or a tidyverse package such as readr to load datasets (e.g., BRFSS23).
Basic Commands: summary() for summary statistics, table() for frequency counts, boxplot() and hist() for visualizations.
R Markdown: Allows integration of code, output, and formatted text for reproducible reports. LaTeX can be used for equations.
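The basic commands can be tried on a toy data frame. The data frame `df` and its values are invented for illustration; with the course data, the same calls would be applied to the imported dataset's columns.

```r
# Toy data (invented for illustration)
df <- data.frame(
  bmi = c(22.1, 27.4, 31.0, 24.8, 29.3, NA),
  sex = c("F", "M", "M", "F", "F", "M")
)

bmi_summary <- summary(df$bmi)   # min, quartiles, mean, max, and NA count
bmi_summary
sex_counts <- table(df$sex)      # frequency counts for a categorical variable
sex_counts

hist(df$bmi, main = "BMI", xlab = "BMI")       # distribution of one numeric variable
boxplot(bmi ~ sex, data = df, ylab = "BMI")    # numeric variable compared across groups
```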
Example R Code
Creating a Numeric Vector: x <- c(1,2,3,4)
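Calculating Mean by Group: tapply(chicago$y, chicago$SEXVAR, mean, na.rm=TRUE), where y stands for whichever numeric column is being averaged. tapply() takes the values first, then the grouping variable, then the function.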
Barplot of Categorical Variable: barplot(table(chicago$`_BMISCAT`))
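A runnable version of the group-mean and barplot commands, using a toy stand-in for the `chicago` data frame. The columns `sleep_hrs` and `bmi_cat` and all values are hypothetical; only `SEXVAR` is a name taken from these notes.

```r
# Toy stand-in for the course data frame (values invented for illustration)
chicago <- data.frame(
  SEXVAR    = c(1, 2, 2, 1, 2, 1),
  sleep_hrs = c(7, 6, NA, 8, 7, 6),
  bmi_cat   = c("Normal", "Overweight", "Obese",
                "Normal", "Overweight", "Normal")
)

# tapply(values, grouping variable, function, extra arguments for the function)
means <- tapply(chicago$sleep_hrs, chicago$SEXVAR, mean, na.rm = TRUE)
means

# Bar heights are the frequency counts of each category
barplot(table(chicago$bmi_cat), ylab = "Count")
```

Without `na.rm = TRUE`, any group containing a missing value would return `NA` for its mean.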
Summary Table: Sampling Methods
| Sampling Method | Description | Advantages | Disadvantages |
|---|---|---|---|
| Simple Random | Each member has equal chance | Unbiased, easy to analyze | May be impractical for large populations |
| Stratified | Population divided into strata, sampled from each | Ensures representation of all groups | Requires knowledge of strata |
| Cluster | Population divided into clusters, some clusters selected | Practical for large populations | May miss important subgroups |
Key Formulas
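Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the sample size.
Population Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, where $N$ is the population size.
Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$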
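As a quick check, the sample mean (the sum of the observations divided by n) and the sample variance (the sum of squared deviations from the mean divided by n - 1) can be computed by hand in R and compared with the built-in functions. The vector `x` is a made-up sample.

```r
x <- c(4, 8, 6, 2)                    # made-up sample
n <- length(x)

x_bar <- sum(x) / n                   # sample mean
s2 <- sum((x - x_bar)^2) / (n - 1)    # sample variance (divides by n - 1)

all.equal(x_bar, mean(x))             # mean() uses the same formula
all.equal(s2, var(x))                 # var() also divides by n - 1
```

Note that R's `var()` always computes the sample variance; there is no built-in argument to divide by n instead.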
Additional info:
Some content inferred from context and standard biostatistics curriculum, such as definitions and examples of study designs and sampling methods.
BRFSS23 refers to the Behavioral Risk Factor Surveillance System dataset, commonly used in public health research.