Statistics and Biostatistics in General Biology: Concepts, Applications, and Analysis
Study Guide - Smart Notes
Statistics and Biostatistics
Introduction to Statistics in Biology
Statistics and biostatistics are essential tools in biology for analyzing data, drawing conclusions, and making informed decisions about health and disease. They provide methods to extract meaningful information from raw data and help scientists understand complex biological phenomena.
Statistics is the science of collecting, analyzing, and interpreting numerical data.
Biostatistics applies statistical methods to biological, medical, and health-related research.
Statisticians use concepts and methods to translate data into usable information, such as identifying cause and effect, assessing health risks, evaluating disease cures, and determining cost effectiveness of programs.
Statistical analysis allows scientists to systematically isolate and examine factors influencing health.
Statistical Data and Uncertainty
Nature of Statistical Data
Statistical data in biology is not always accurate or certain. The reliability of statistical findings depends on the quality of the data collected and the methods used for analysis.
Statistical findings are only as good as the underlying data.
The public often desires certainty, but scientific research is based on probability, not absolute certainty.
Political, cultural, and social factors can influence the interpretation and reporting of statistical data.
Example: The debate over mammograms for women aged 40-50 illustrates how statistical findings can be controversial and subject to public and political scrutiny.
Probability in Statistics
P-Value and Statistical Significance
Probability is a central concept in statistics, used to assess the likelihood that an observed result occurred by chance. The p-value is a key measure in hypothesis testing.
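P-value: The probability of obtaining a result at least as extreme as the one observed if chance alone were at work. For example, $p = 0.80$ means such a result would arise by chance about 80% of the time, so it gives little evidence of a real effect.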
P-value is influenced by the quality of research data and whether a true cause and effect exists.
When $p < 0.05$, the result is considered statistically significant, meaning it is unlikely to have occurred by chance.
Even at $p = 0.05$, a result this extreme would still arise by chance about 5% of the time, so a significant finding can still be wrong.
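One way to make the idea of "could this happen by chance" concrete is an exact permutation test, sketched below. This is an illustration and not a method named in these notes: the function name `permutation_p_value` and both sets of measurements are made up. The sketch reshuffles the group labels in every possible way and counts how often the reshuffled difference in means is at least as large as the observed one.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    splits = 0
    # Try every way of assigning n_a of the pooled values to "group A".
    for indices in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in indices)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        # Small tolerance so floating-point rounding does not drop exact ties.
        if diff >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Hypothetical measurements, invented for illustration only.
treated = [5.1, 5.8, 6.0, 6.3, 6.9]
control = [4.0, 4.4, 4.9, 5.0, 5.5]

p = permutation_p_value(treated, control)
print(f"p = {p:.3f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

The returned fraction is the p-value: the share of label reshuffles that produce a difference as large as the real one when group membership is assigned purely by chance.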
Confidence Interval
A confidence interval provides a range of values within which a population parameter is likely to fall, with a certain level of confidence (commonly 95%).
Indicates the reliability of an estimate.
For a 95% confidence interval, about 95% of intervals constructed this way from repeated samples would contain the true value.
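As a rough sketch of how such an interval can be computed for a sample mean, the example below uses the normal approximation (mean plus or minus about 1.96 standard errors). The helper name `mean_confidence_interval` and the ages are hypothetical. For small samples a t-distribution multiplier would normally replace the normal one, so the interval here is likely somewhat too narrow.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for a sample mean."""
    n = len(values)
    sample_mean = mean(values)
    standard_error = stdev(values) / sqrt(n)
    # Two-sided multiplier: about 1.96 when confidence is 0.95.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return sample_mean - z * standard_error, sample_mean + z * standard_error


# Hypothetical ages in years, invented for illustration only.
ages = [19, 20, 20, 21, 21, 22, 19, 20, 21, 22]

low, high = mean_confidence_interval(ages)
print(f"mean = {mean(ages):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

A larger sample shrinks the standard error and therefore narrows the interval, which is one reason sample size matters for the reliability of an estimate.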
Law of Small Probabilities
The law of small probabilities states that rare events are unlikely to occur frequently, and unusual results may be due to chance rather than a true effect.
Types of Statistical Data
Categorical vs. Continuous Data
Statistical data can be classified into two main types: categorical and continuous.
Categorical Data: Data sorted into groups or categories (e.g., yes/no, male/female).
Continuous Data: Numerical data that can take any value within a range (e.g., age, height, weight).
Descriptive and Inferential Statistics
Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset.
Used to describe a sample or population in a straightforward way.
Simplifies large amounts of data (e.g., mean age: 20.5 ± 1.5 years).
Standard deviation ($s.d.$) indicates variability: for roughly normally distributed data, about 68% of values fall within 1 $s.d.$ of the mean and about 95% within 2 $s.d.$
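The summary "mean ± s.d." and the 68%/95% rule can be checked with a few lines of code. The sketch below uses made-up heights and a hypothetical helper `share_within`. The theoretical shares come from the standard normal distribution, so real samples will only match them approximately.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical heights in cm, invented for illustration only.
heights = [158, 162, 165, 167, 168, 170, 171, 173, 176, 182]

m = mean(heights)
sd = stdev(heights)  # sample standard deviation
print(f"mean height: {m:.1f} ± {sd:.1f} cm")


def share_within(values, center, spread, k):
    """Fraction of values lying within k spreads of the center."""
    inside = [v for v in values if abs(v - center) <= k * spread]
    return len(inside) / len(values)


print(f"within 1 s.d. (sample): {share_within(heights, m, sd, 1):.0%}")
print(f"within 2 s.d. (sample): {share_within(heights, m, sd, 2):.0%}")

# Theoretical shares for an exactly normal distribution.
std_normal = NormalDist()
within_1 = std_normal.cdf(1) - std_normal.cdf(-1)
within_2 = std_normal.cdf(2) - std_normal.cdf(-2)
print(f"within 1 s.d. (normal): {within_1:.1%}")
print(f"within 2 s.d. (normal): {within_2:.1%}")
```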
Inferential Statistics
Inferential statistics use data from a sample to make inferences about a larger population and test hypotheses about relationships and differences.
Determines if observed relationships and associations are real.
Common tests include:
T-Test: Compares means between two groups (numerical data).
Correlation Test: Assesses relationship between two numerical variables.
ANOVA: Compares means among multiple groups (numerical and categorical data).
Chi-Squared Test: Compares categorical variables (e.g., male vs. female smoking rates).
Multiple Regression: Examines the relationship between multiple independent variables and a single dependent variable.
Logistic Regression: Used when the dependent variable is categorical (e.g., yes/no outcomes).
Significance is determined by p-value ($p < 0.05$ indicates a significant result).
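To show how one of these tests turns data into a p-value, here is a minimal sketch of a chi-squared test on a 2x2 table, in the spirit of the male vs. female smoking example. The counts are invented and the function name `chi_squared_2x2` is hypothetical. The sketch uses the plain Pearson statistic without a continuity correction, and relies on the fact that with 1 degree of freedom the upper-tail probability equals `erfc(sqrt(chi2 / 2))`.

```python
from math import erfc, sqrt


def chi_squared_2x2(table):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Rows: male, female. Columns: smoker, non-smoker. Hypothetical counts.
counts = [[30, 70], [15, 85]]

chi2, p = chi_squared_2x2(counts)
print(f"chi-squared = {chi2:.2f}, p = {p:.4f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

The other tests in the list follow the same pattern: compute a test statistic from the data, then convert it to a p-value under the assumption that there is no real difference or relationship.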
Statistics of Screening Tests
Sensitivity and Specificity
Screening tests are evaluated based on their ability to correctly identify individuals with and without a condition.
Sensitivity: The ability of a test to correctly identify true positives (those with the condition).
Specificity: The ability of a test to correctly identify true negatives (those without the condition).
False Negatives: Cases where the test fails to detect a condition that is present.
False Positives: Cases where the test incorrectly indicates a condition that is not present.
Increasing sample size and improving measurement accuracy can increase the power of a test (the probability of detecting a true effect).
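The definitions above translate directly into ratios of counts. The sketch below assumes the results of a screening test have been tallied into true positives, false positives, false negatives, and true negatives. The function name `screening_metrics` and the counts are made up for illustration.

```python
def screening_metrics(tp, fp, fn, tn):
    """Basic screening-test measures from confusion-matrix counts."""
    with_condition = tp + fn      # everyone who truly has the condition
    without_condition = tn + fp   # everyone who truly does not
    return {
        "sensitivity": tp / with_condition,
        "specificity": tn / without_condition,
        "false_negative_rate": fn / with_condition,
        "false_positive_rate": fp / without_condition,
    }


# Hypothetical screening results for 1,000 people.
metrics = screening_metrics(tp=90, fp=50, fn=10, tn=850)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and the false negative rate always sum to 1, as do specificity and the false positive rate, so raising one of each pair lowers the other.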
Examples of Screening Tests
Mammography for breast cancer
HIV tests
Newborn screening
Rates and Indicators in Community Health
Common Health Rates
Rates are used to relate raw numbers to the size of the population and are important indicators of community health.
Birth rates
Mortality rates
Infant mortality rate
Crude rates vs. Adjusted rates
Group-specific rates
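The difference between crude, group-specific, and adjusted rates is easiest to see with numbers. The sketch below uses direct standardization, which is one common way to adjust a rate and is an assumption here since the notes do not name a method. All counts, the age groups, and the standard population are invented.

```python
def crude_rate(events, population, per=1000):
    """Events per `per` people in the whole population."""
    return events / population * per


def age_adjusted_rate(strata, standard_population, per=1000):
    """Direct standardization: weight each group-specific rate by the
    standard population's share of that group."""
    total_standard = sum(standard_population.values())
    adjusted = 0.0
    for group, (events, population) in strata.items():
        group_rate = events / population
        adjusted += group_rate * standard_population[group] / total_standard
    return adjusted * per


# Hypothetical community: age group -> (deaths, people).
community = {"0-39": (20, 10_000), "40-64": (60, 6_000), "65+": (200, 4_000)}
# Hypothetical standard population with a younger age structure.
standard = {"0-39": 60_000, "40-64": 30_000, "65+": 10_000}

total_deaths = sum(events for events, _ in community.values())
total_people = sum(people for _, people in community.values())

crude = crude_rate(total_deaths, total_people)
adjusted = age_adjusted_rate(community, standard)

for group, (events, people) in community.items():
    print(f"{group}: {crude_rate(events, people):.1f} per 1,000")
print(f"crude: {crude:.1f} per 1,000")
print(f"age-adjusted: {adjusted:.1f} per 1,000")
```

In this made-up community the crude rate is higher than the adjusted rate because the community is older than the standard population, which is the kind of distortion adjustment is meant to remove when comparing communities.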
Risk Assessment and Risk Perception
Risk Assessment
Risk assessment identifies events and exposures that may be harmful to humans, estimates the probability of their occurrence, and evaluates the extent of harm they may cause.
For well-known risks, calculations are based on historical data.
Poorly understood risks require assumptions and estimates.
Risk Perception
Risk perception involves psychological factors and reflects the public's response to risks, which may differ from expert assessments.
Often classified on two scales: dread and knowability.
Public responses can be irrational compared to expert estimates.
Cost-Benefit and Cost-Effectiveness Analysis
Evaluating Policies and Programs
Cost-benefit analysis weighs the estimated cost of implementing a policy against the estimated benefit, usually in monetary terms. Cost-effectiveness analysis compares the efficiency of different methods to achieve the same objective.
Costs are generally easier to calculate than benefits.
Assigning monetary value to benefits, such as a life saved, is controversial.
Analysis helps determine the most efficient use of resources.
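The two kinds of analysis reduce to simple arithmetic once costs and outcomes have been estimated. The sketch below is illustrative only: the program names, dollar amounts, outcome counts, and the helper names `net_benefit` and `cost_per_outcome` are all hypothetical, and the hard part in practice is producing defensible estimates to plug in.

```python
def net_benefit(estimated_benefit, estimated_cost):
    """Cost-benefit view: both sides expressed in monetary terms."""
    return estimated_benefit - estimated_cost


def cost_per_outcome(total_cost, outcomes):
    """Cost-effectiveness view: cost per unit of the shared objective."""
    return total_cost / outcomes


# Hypothetical programs aiming at the same objective:
# name -> (total cost in dollars, cases prevented).
programs = {
    "Program A": (500_000, 25),
    "Program B": (300_000, 10),
}

ratios = {name: cost_per_outcome(cost, cases)
          for name, (cost, cases) in programs.items()}
most_efficient = min(ratios, key=ratios.get)

for name, ratio in ratios.items():
    print(f"{name}: ${ratio:,.0f} per case prevented")
print(f"most cost-effective: {most_efficient}")

# Cost-benefit needs a monetary value for the benefit, which is the
# controversial step; the figure below is purely illustrative.
print(f"net benefit: ${net_benefit(750_000, 500_000):,.0f}")
```

Note that the cheaper program is not the more cost-effective one in this example, which is why the comparison is made per unit of outcome and not on total cost.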
Summary Table: Types of Statistical Tests
| Test | Type of Data | Main Purpose | Significance Criterion |
|---|---|---|---|
| T-Test | Numerical | Compare means between two groups | $p < 0.05$ |
| Correlation Test | Numerical | Assess relationship between two variables | $p < 0.05$ |
| ANOVA | Numerical & Categorical | Compare means among multiple groups | $p < 0.05$ |
| Chi-Squared | Categorical | Compare proportions between groups | $p < 0.05$ |
| Multiple Regression | Numerical (dependent), Multiple independent | Assess effect of multiple variables on one outcome | $p < 0.05$ |
| Logistic Regression | Categorical (dependent), Multiple independent | Assess effect of multiple variables on categorical outcome | $p < 0.05$ |
Additional Resources
Rice Virtual Lab in Statistics by statistician David Lane: onlinestatbook.com/rvls.html
Additional info: Some explanations and definitions have been expanded for clarity and completeness, including the summary table of statistical tests and more detailed descriptions of sensitivity, specificity, and risk analysis.