Statistics and Biostatistics in General Biology: Concepts, Applications, and Analysis


Statistics and Biostatistics

Introduction to Statistics in Biology

Statistics and biostatistics are essential tools in biology for analyzing data, drawing conclusions, and making informed decisions about health and disease. They provide methods to extract meaningful information from raw data and help scientists understand complex biological phenomena.

Statistics is the science of collecting, analyzing, and interpreting numerical data.
Biostatistics applies statistical methods to biological, medical, and health-related research.
Statisticians use concepts and methods to translate data into usable information, such as identifying cause and effect, assessing health risks, evaluating disease cures, and determining cost effectiveness of programs.
Statistical analysis allows scientists to systematically isolate and examine factors influencing health.

Statistical Data and Uncertainty

Nature of Statistical Data

Statistical data in biology is not always accurate or certain. The reliability of statistical findings depends on the quality of the data collected and the methods used for analysis.

Statistical findings are only as good as the underlying data.
The public often desires certainty, but scientific research is based on probability, not absolute certainty.
Political, cultural, and social factors can influence the interpretation and reporting of statistical data.
Example: The debate over mammograms for women aged 40-50 illustrates how statistical findings can be controversial and subject to public and political scrutiny.

Probability in Statistics

P-Value and Statistical Significance

Probability is a central concept in statistics, used to assess how likely it is that an observed result could have arisen by chance alone. The p-value is a key measure in hypothesis testing.

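P-value: The probability of obtaining a result at least as extreme as the one observed if chance alone were at work (that is, if there were no true effect). For example, $p = 0.80$ means that chance alone would produce such a result about 80% of the time, so the result is weak evidence of a real effect.
The p-value is influenced by the quality of the research data and by whether a true cause-and-effect relationship exists.
When $p < 0.05$, the result is conventionally considered statistically significant, meaning it would be unlikely to occur by chance alone.
When $p = 0.05$, a result at least this extreme would still arise by chance alone in 5% of studies, so a statistically significant finding can still be wrong.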
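
To make "occurring by chance" concrete, here is a minimal sketch of an exact permutation test, which is one of several ways to obtain a p-value and is not necessarily the method used in any particular study. The function name `permutation_p_value` and the measurements are hypothetical, chosen only for illustration. The idea: reshuffle the group labels in every possible way and report the fraction of reshufflings whose difference in means is at least as large as the one actually observed.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    as_extreme = 0
    n_splits = 0
    # Try every way of assigning n_a of the pooled values to "group A".
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        n_splits += 1
        # Small tolerance so the observed split itself is always counted.
        if diff >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / n_splits


# Hypothetical measurements, for illustration only.
treated = [5.1, 6.3, 5.8, 6.9, 6.0]
control = [4.2, 5.0, 4.8, 5.5, 4.4]
p = permutation_p_value(treated, control)
print(f"p = {p:.3f}", "(significant)" if p < 0.05 else "(not significant)")
```

A small p-value means that few reshufflings match the observed difference. With only three distinct values per group there are just 20 possible splits, so the smallest achievable two-sided p-value is 2/20 = 0.10. Very small samples therefore cannot reach $p < 0.05$ with this method, however large the difference.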

Confidence Interval

A confidence interval provides a range of values within which a population parameter is likely to fall, with a certain level of confidence (commonly 95%).

Indicates the reliability of an estimate.
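For a 95% confidence interval, the procedure used to build the interval captures the true value in about 95% of repeated studies; informally, we can be 95% confident that the true value lies within the specified range.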
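
As a rough illustration, a 95% confidence interval for a mean can be sketched with the normal approximation (mean ± about 1.96 standard errors). The function name `mean_confidence_interval` and the sample of ages are hypothetical. The normal approximation is an assumption made here for simplicity.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for a mean (normal approximation)."""
    n = len(sample)
    centre = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # Critical value: about 1.96 when confidence is 0.95.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return centre - margin, centre + margin


# Hypothetical sample of ages, for illustration only.
ages = [19, 20, 21, 22, 20, 21, 19, 22, 20, 21]
low, high = mean_confidence_interval(ages)
print(f"mean = {mean(ages):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

For small samples, a t-distribution critical value is usually preferred and gives a somewhat wider interval. Larger samples shrink the standard error and therefore narrow the interval.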

Law of Small Probabilities

The law of small probabilities states that even very improbable events will occasionally occur when there are enough opportunities for them to happen, so an unusual result may be due to chance rather than a true effect.

Types of Statistical Data

Categorical vs. Continuous Data

Statistical data can be classified into two main types: categorical and continuous.

Categorical Data: Data sorted into groups or categories (e.g., yes/no, male/female).
Continuous Data: Numerical data that can take any value within a range (e.g., age, height, weight).

Descriptive and Inferential Statistics

Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset.

Used to describe a sample or population in a straightforward way.
Simplifies large amounts of data (e.g., mean age: 20.5 ± 1.5 years).
Standard deviation ($s.d.$) indicates variability: for approximately normally distributed data, about 68% of values fall within 1 $s.d.$ of the mean and about 95% within 2 $s.d.$
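
The summary measures above can be sketched in a few lines using Python's standard library. The function name `summarize` and the list of ages are hypothetical. The 68% and 95% figures apply to bell-shaped (normal) data, so a small or skewed sample will only approximate them.

```python
from statistics import mean, stdev


def summarize(values):
    """Mean, sample standard deviation, and share of values within 1 and 2 s.d."""
    m = mean(values)
    sd = stdev(values)

    def share_within(k):
        # Fraction of values no more than k standard deviations from the mean.
        return sum(1 for v in values if abs(v - m) <= k * sd) / len(values)

    return {
        "mean": m,
        "sd": sd,
        "within_1_sd": share_within(1),
        "within_2_sd": share_within(2),
    }


# Hypothetical ages of students in a class, for illustration only.
ages = [19, 20, 20, 21, 21, 21, 22, 22, 23, 24]
summary = summarize(ages)
print(f"mean age: {summary['mean']:.1f} ± {summary['sd']:.1f} years")
print(f"within 1 s.d.: {summary['within_1_sd']:.0%}, "
      f"within 2 s.d.: {summary['within_2_sd']:.0%}")
```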

Inferential Statistics

Inferential statistics use data from a sample to make inferences about a larger population and test hypotheses about relationships and differences.

Helps determine whether observed relationships and associations are likely to be real rather than the result of chance.
Common tests include:
- T-Test: Compares means between two groups (numerical data).
- Correlation Test: Assesses relationship between two numerical variables.
- ANOVA: Compares means among multiple groups (numerical and categorical data).
- Chi-Squared Test: Compares categorical variables (e.g., male vs. female smoking rates).
- Multiple Regression: Examines the relationship between multiple independent variables and a single dependent variable.
- Logistic Regression: Used when the dependent variable is categorical (e.g., yes/no outcomes).
Significance is determined by the p-value ($p < 0.05$ indicates a significant result).
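
As one worked instance of these tests, the sketch below runs a chi-squared test on a 2×2 table, such as smoking status by sex. The function name `chi_squared_2x2` and all counts are hypothetical. It uses the uncorrected Pearson statistic with one degree of freedom. Statistical packages often apply a continuity correction, so their output may differ slightly.

```python
from math import erfc, sqrt


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the table

                 outcome yes   outcome no
        group 1       a             b
        group 2       c             d

    Returns (chi-squared statistic, p-value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 30 of 100 men smoke, 20 of 100 women smoke.
chi2, p = chi_squared_2x2(30, 70, 20, 80)
verdict = "significant" if p < 0.05 else "not significant"
print(f"chi-squared = {chi2:.2f}, p = {p:.3f} ({verdict})")
```

The approximation behind this test is usually considered unreliable when expected cell counts are small. An exact test is the usual alternative in that case.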

Statistics of Screening Tests

Sensitivity and Specificity

Screening tests are evaluated based on their ability to correctly identify individuals with and without a condition.

Sensitivity: The proportion of individuals with the condition whom the test correctly identifies as positive (true positives).
Specificity: The proportion of individuals without the condition whom the test correctly identifies as negative (true negatives).
False Negatives: Cases where the test fails to detect a condition that is present.
False Positives: Cases where the test incorrectly indicates a condition that is not present.
Increasing sample size and improving measurement accuracy can increase the power of a test (the probability of detecting a true effect).
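
Sensitivity and specificity follow directly from the four possible outcomes of a screening test. The sketch below computes them from counts of true and false positives and negatives. The function name `screening_metrics` and the counts are hypothetical.

```python
def screening_metrics(true_pos, false_pos, false_neg, true_neg):
    """Sensitivity, specificity, and error rates of a screening test."""
    with_condition = true_pos + false_neg
    without_condition = true_neg + false_pos
    return {
        # Share of people with the condition who test positive.
        "sensitivity": true_pos / with_condition,
        # Share of people without the condition who test negative.
        "specificity": true_neg / without_condition,
        # Share of people with the condition whom the test misses.
        "false_negative_rate": false_neg / with_condition,
        # Share of people without the condition who are wrongly flagged.
        "false_positive_rate": false_pos / without_condition,
    }


# Hypothetical screening results for 1,100 people, for illustration only.
metrics = screening_metrics(true_pos=90, false_pos=50, false_neg=10, true_neg=950)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and the false negative rate always add up to 1, as do specificity and the false positive rate. This is why reducing one type of error in a given test tends to mean accepting more of the other.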

Examples of Screening Tests

Mammography for breast cancer
HIV tests
Newborn screening

Rates and Indicators in Community Health

Common Health Rates

Rates are used to relate raw numbers to the size of the population and are important indicators of community health.

Birth rates
Mortality rates
Infant mortality rate
Crude rates vs. Adjusted rates
Group-specific rates
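
The difference between crude, group-specific, and adjusted rates is easiest to see with numbers. The sketch below uses direct standardization, which is one common way to adjust a rate and is assumed here for illustration. The function names, the population counts, and the standard-population weights are all hypothetical.

```python
def rate_per(events, population, per=1000):
    """Events per `per` people (for example, deaths per 1,000 population)."""
    return events / population * per


def directly_adjusted_rate(groups, standard_weights, per=1000):
    """Weighted average of group-specific rates.

    groups: {group name: (events, population)}
    standard_weights: {group name: share of a standard population}, summing to 1
    """
    return sum(
        standard_weights[name] * rate_per(events, population, per)
        for name, (events, population) in groups.items()
    )


# Hypothetical community: (deaths, population) by age group.
community = {"under 65": (10, 10_000), "65 and over": (100, 5_000)}
# Hypothetical standard population, split evenly between the two age groups.
standard = {"under 65": 0.5, "65 and over": 0.5}

total_deaths = sum(deaths for deaths, _ in community.values())
total_population = sum(pop for _, pop in community.values())

crude = rate_per(total_deaths, total_population)
specific = {name: rate_per(d, p) for name, (d, p) in community.items()}
adjusted = directly_adjusted_rate(community, standard)

print(f"crude rate: {crude:.1f} per 1,000")
print(f"group-specific rates: {specific}")
print(f"age-adjusted rate: {adjusted:.1f} per 1,000")
```

The crude rate depends on how old the community is, while the adjusted rate shows what the rate would be if the community had the age structure of the standard population. That makes adjusted rates the fairer basis for comparing communities with different age mixes.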

Risk Assessment and Risk Perception

Risk Assessment

Risk assessment identifies events and exposures that may be harmful to humans, estimates the probability of their occurrence, and evaluates the extent of harm they may cause.

For well-known risks, calculations are based on historical data.
Poorly understood risks require assumptions and estimates.

Risk Perception

Risk perception involves psychological factors and reflects the public's response to risks, which may differ from expert assessments.

Risks are often classified on two scales: dread (how frightening the risk feels) and knowability (how well the risk is understood).
Public responses can be irrational compared to expert estimates.

Cost-Benefit and Cost-Effectiveness Analysis

Evaluating Policies and Programs

Cost-benefit analysis weighs the estimated cost of implementing a policy against the estimated benefit, usually in monetary terms. Cost-effectiveness analysis compares the efficiency of different methods to achieve the same objective.

Costs are generally easier to calculate than benefits.
Assigning monetary value to benefits, such as a life saved, is controversial.
Analysis helps determine the most efficient use of resources.
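
The two analyses can be contrasted with simple arithmetic. In the sketch below, cost-benefit analysis is reduced to a net benefit in monetary terms, and cost-effectiveness analysis to a cost per unit of outcome. The function names, program names, and every figure are hypothetical.

```python
def net_benefit(benefit, cost):
    """Cost-benefit view: monetary benefit minus monetary cost."""
    return benefit - cost


def cost_per_outcome(cost, outcomes):
    """Cost-effectiveness view: cost per unit of outcome (e.g. per case prevented)."""
    return cost / outcomes


# Hypothetical programs with the same objective: (cost in dollars, cases prevented).
programs = {
    "Program A": (100_000, 50),
    "Program B": (150_000, 100),
}

ratios = {name: cost_per_outcome(cost, cases) for name, (cost, cases) in programs.items()}
most_efficient = min(ratios, key=ratios.get)

for name, ratio in ratios.items():
    print(f"{name}: ${ratio:,.0f} per case prevented")
print(f"most cost-effective: {most_efficient}")

# Cost-benefit analysis needs a monetary value for the benefit,
# which is the controversial step (hypothetical value used here).
print(f"net benefit: ${net_benefit(benefit=500_000, cost=300_000):,}")
```

In this example the more expensive program is the more cost-effective one, because it prevents each case at a lower cost.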

Summary Table: Types of Statistical Tests

Test	Type of Data	Main Purpose
T-Test	Numerical	Compare means between two groups
Correlation Test	Numerical	Assess relationship between two variables
ANOVA	Numerical & Categorical	Compare means among multiple groups
Chi-Squared	Categorical	Compare proportions between groups
Multiple Regression	Numerical (dependent), Multiple independent	Assess effect of multiple variables on one outcome
Logistic Regression	Categorical (dependent), Multiple independent	Assess effect of multiple variables on categorical outcome

Additional Resources

Rice Virtual Lab in Statistics by statistician David Lane: onlinestatbook.com/rvls.html
