The Human Genome and Human Genetic Variation: Structure, Sequencing, and Implications

Study Guide - Smart Notes

Introduction to the Human Genome

The human genome is the complete set of genetic information for Homo sapiens, encoded as DNA within the 23 pairs of chromosomes. Understanding the human genome is a central goal of modern biology, providing insights into gene function, evolution, and the genetic basis of human diversity and disease.

Genome Size: Approximately 3 billion base pairs (bp).
Protein-Coding Genes: About 20,000 genes distributed across the 23 chromosome pairs.
Key Questions: How many genes do we have? What do they do? How do our genomes compare to other species? How much variation exists between individuals and populations?

Sequencing the Human Genome

The Human Genome Project

The Human Genome Project (HGP) was an international, publicly funded effort launched in 1990 to determine the sequence of the entire human genome. It was declared complete in 2003. The private company Celera Genomics ran a parallel sequencing effort. Both efforts built on years of research in model organisms and on advances in sequencing technology.

Public Project: International collaboration (US, UK, Japan, China, Germany, France).
Private Project: Celera Genomics, led by Craig Venter.
Outcome: Both teams published draft genome sequences in February 2001, and the public project was declared complete in 2003. The public project cost about $3 billion; Celera spent about $300 million and drew on the public data.

[Figure: Public and private leaders of the Human Genome Project]

Whole Genome Shotgun Sequencing

Whole genome shotgun sequencing is a method used to sequence the human genome by breaking it into small fragments, sequencing each fragment, and then assembling the sequences using computational methods.

Library Construction: The genome is digested into random fragments (~2000 bp), each cloned into a bacterial plasmid to create a genomic library.
Sequencing: Each fragment is sequenced, originally using Sanger (dideoxy) sequencing.
Assembly: Computer algorithms identify overlapping sequences to reconstruct the entire genome.

[Figures: Steps in shotgun sequencing; shotgun sequencing assembly process]

Making a Genomic DNA Library

To sequence the genome, DNA is partially digested to produce overlapping fragments, which are then cloned into plasmids and stored in bacteria. This collection is called a genomic library.

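Partial Digestion: A restriction enzyme cuts only at its specific recognition sequence. Under limiting conditions it cuts just a subset of those sites in each copy of the genome, and because that subset differs from copy to copy, the resulting fragments overlap.
Cloning: Each fragment is inserted into a plasmid and transformed into bacteria.
Storage: The bacterial clones are stored in multi-well plates, with each well holding a clone that carries a single DNA fragment.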
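
A minimal Python sketch of partial digestion, to show why the fragments overlap. It assumes the enzyme is EcoRI (recognition sequence GAATTC, cut after the first G); the toy genome, the cut probability, and the function names are illustrative choices, not part of the original notes.

```python
import random

def find_sites(genome, site="GAATTC"):
    """Return the start index of every occurrence of the recognition site."""
    positions = []
    start = genome.find(site)
    while start != -1:
        positions.append(start)
        start = genome.find(site, start + 1)
    return positions

def partial_digest(genome, site="GAATTC", cut_offset=1, cut_prob=0.5, rng=None):
    """Cut one copy of the genome at a random subset of its recognition sites."""
    rng = rng or random.Random()
    # Each site is cut independently with probability cut_prob (an assumption).
    cuts = [p + cut_offset for p in find_sites(genome, site) if rng.random() < cut_prob]
    bounds = [0] + cuts + [len(genome)]
    return [genome[a:b] for a, b in zip(bounds, bounds[1:])]

genome = "AAGAATTCTTTTGAATTCCCCCGAATTCGG"  # toy sequence with three EcoRI sites
rng = random.Random(0)                      # seeded so the run is repeatable
copies = [partial_digest(genome, rng=rng) for _ in range(3)]
for fragments in copies:
    print(fragments)
```

Each copy is cut at a different subset of the three sites, so a fragment from one copy can span a cut point in another. That overlap is what later lets assembly software order the fragments.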

Sanger (Dideoxy) DNA Sequencing

Sanger sequencing, developed by Fred Sanger, uses dideoxynucleotides (ddNTPs) to terminate DNA synthesis at specific bases, allowing the sequence to be determined by fragment length and fluorescent labeling.

Components: DNA template, primer, DNA polymerase, four dNTPs, four fluorescently labeled ddNTPs.
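Mechanism: A ddNTP lacks the 3'-OH group needed to attach the next nucleotide, so its incorporation terminates chain elongation. Because termination can happen at any position, the reaction produces fragments of every possible length.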
Detection: Fragments are separated by capillary electrophoresis, and the terminal base is identified by fluorescence.
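
The read-out logic can be sketched in a few lines of Python. This is a simplified model, not a simulation of the chemistry: it ignores the primer, assumes exactly one terminated product for every position, and uses made-up function names.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def synthesized_strand(template):
    """Strand the polymerase builds (5'->3') while copying the template."""
    return "".join(COMPLEMENT[base] for base in reversed(template))

def terminated_fragments(template):
    """One chain-terminated product per position: (length, terminal ddNTP base)."""
    new_strand = synthesized_strand(template)
    return [(i + 1, new_strand[i]) for i in range(len(new_strand))]

def read_sequence(fragments):
    """Mimic electrophoresis: shortest fragment first, then read each label."""
    return "".join(base for _, base in sorted(fragments))

template = "TACGGATC"            # toy template, written 5'->3'
fragments = terminated_fragments(template)
fragments.reverse()              # order in the reaction tube is arbitrary
read = read_sequence(fragments)
print(read)                      # GATCCGTA
```

Sorting by length stands in for capillary electrophoresis, and the base stored with each fragment stands in for the fluorescent label. The read is the newly synthesized strand, which is the reverse complement of the template.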

[Figures: Sanger sequencing reactions with ddCTP, ddGTP, ddTTP, and ddATP; capillary electrophoresis of sequencing products; capillary electrophoresis readout]

Genome Assembly

After sequencing, computational algorithms assemble the short DNA sequences into longer contiguous sequences (contigs) by identifying overlaps. Gaps are filled manually or with additional sequencing.
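
A toy Python sketch of overlap-based assembly. It uses a simple greedy strategy (repeatedly merge the pair of reads with the longest exact overlap), which is a stand-in chosen for clarity rather than the algorithm used by the genome projects. It assumes error-free reads from one strand; the reads and the `min_len` parameter are invented for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain; return contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return reads  # whatever is left cannot be joined: these are the contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)

reads = ["ATGCGTAC", "GTACGGAT", "GGATCCTA"]
print(greedy_assemble(reads))  # ['ATGCGTACGGATCCTA']
```

If two groups of reads share no overlap, the function returns more than one contig, which corresponds to a gap that has to be closed by further sequencing.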

[Figure: Genome assembly from contigs]

What Did We Learn from the Human Genome Sequence?

Genome Composition

Only about 2% of human DNA codes for proteins.
Approximately 26% is introns, and about 20% is intergenic sequence, which includes regulatory elements such as promoters and enhancers.
About 44% consists of repetitive elements, including transposons and viral relics.

[Figure: Pie chart of human genome composition]

Gene Number and Comparison with Other Species

Humans have about 20,000–22,000 protein-coding genes, similar to other complex eukaryotes.
Gene prediction relies on bioinformatics algorithms to identify coding regions, start/stop codons, and splice sites.
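
As a rough illustration of the start/stop-codon part of gene prediction, the Python sketch below scans the three forward reading frames for open reading frames (ATG through the next in-frame stop codon). It is deliberately simplified: it ignores introns and splice sites, the reverse strand, and nested start codons, so it resembles a bacterial-style ORF scan more than a real human gene predictor. The function name and `min_codons` cutoff are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, orf) for each ATG...stop run in the forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None                   # look for the next ORF in this frame
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACC"))
# [(2, 17, 'ATGAAATTTGGGTAA')]
```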

[Figure: Gene numbers in various species]

Bioinformatics Tools

GenBank: A public database at NCBI that collects publicly available DNA sequence data.
BLAST (Basic Local Alignment Search Tool): Compares a query sequence to known sequences to identify similarities and predict function.
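
To make "local alignment" concrete, here is a small Python sketch that scores the best local alignment between two sequences using the Smith-Waterman dynamic-programming method. This is not BLAST itself: BLAST uses a faster seed-and-extend heuristic and reports statistical significance, neither of which is modeled here. The scoring values and the three-entry "database" are invented for illustration.

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between two sequences."""
    best = 0
    prev = [0] * (len(subject) + 1)
    for i in range(1, len(query) + 1):
        curr = [0] * (len(subject) + 1)
        for j in range(1, len(subject) + 1):
            step = match if query[i - 1] == subject[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# Hypothetical mini-database: rank entries by similarity to the query.
database = {"seq_a": "GGACGTGG", "seq_b": "CCCCCCCC", "seq_c": "TTACGTTA"}
query = "TTACGTTT"
hits = sorted(database,
              key=lambda name: local_alignment_score(query, database[name]),
              reverse=True)
print(hits)  # ['seq_c', 'seq_a', 'seq_b']
```

Ranking database entries by score mirrors how a similarity search surfaces the closest known sequences, from which a function for the query can be proposed.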

[Figures: BLAST alignment output; BLAST graphical output; BLAST sequence alignment]

Gene Function Categories

Largest categories: cell signaling, gene regulation, unknown function (about 20%).

[Figure: Pie chart of gene function categories]

Comparative Genomics

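Humans share a high percentage of DNA with other species (e.g., 99% with Neanderthals, 96% with chimpanzees, 50% with bananas). These figures depend on how similarity is measured, for example across alignable sequence or across shared genes, so they are best read as approximations.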

Human Genetic Variation

Types of Genetic Variation

Polymorphic Sites: Sites with two or more alleles, each present in >1% of the population.
Single Nucleotide Polymorphisms (SNPs): The most common type, accounting for 90% of human genetic variation.
Other Variants: Insertions, deletions, copy number variants, and structural variants.
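
The 1% rule can be checked directly from genotype data. The Python sketch below assumes genotypes are written as strings like "G/C" and interprets "present in >1% of the population" as an allele frequency above 0.01; the sample data and function names are made up for the example.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Frequency of each allele among all chromosomes sampled."""
    counts = Counter(allele for g in genotypes for allele in g.split("/"))
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def is_polymorphic(genotypes, threshold=0.01):
    """True if at least two alleles each exceed the frequency threshold."""
    freqs = allele_frequencies(genotypes)
    return sum(f > threshold for f in freqs.values()) >= 2

common = ["G/G"] * 90 + ["G/C"] * 9 + ["C/C"] * 1   # C allele at 11/200 = 5.5%
rare = ["G/G"] * 199 + ["G/C"] * 1                  # C allele at 1/400 = 0.25%
print(is_polymorphic(common), is_polymorphic(rare))  # True False
```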

The 1000 Genomes Project

This project sequenced genomes from diverse populations to catalog human genetic variation. A typical person’s genome differs from the reference at over 4 million sites, with most variation shared globally.

Each genome contains 2,100–2,500 structural variants, 10,000–12,000 protein-altering variants, and 150–200 truncating variants.
There is no single 'human genome'; variation is extensive and complex.

Applications of Human Genetic Variation

Mapping disease genes
Determining disease susceptibility
Forensics
Understanding human evolution and ancestry

Genome-Wide Association Studies (GWAS)

GWAS identify associations between SNPs and diseases by comparing the genomes of affected and unaffected individuals using SNP chips, which can analyze over 900,000 SNPs simultaneously.

SNP Chips: Contain allele-specific probes; DNA fragments hybridize to matching probes and are detected by fluorescence.
Genotypes: Individuals can be homozygous or heterozygous at each SNP (e.g., G/G, C/C, G/C).
Polygenic Risk Scores: Estimate disease risk based on the combination of multiple SNPs.
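
The core comparison in a GWAS can be sketched for a single SNP as an allelic chi-square test on a 2x2 table (risk allele vs. other allele, cases vs. controls). This is one common and simple test, shown under the assumption of unrelated individuals and no covariates; the genotype counts are invented, and a real study repeats this for every SNP on the chip and applies a much stricter significance threshold (conventionally around 5 × 10⁻⁸) to account for the number of tests.

```python
import math

def allele_counts(genotypes, risk_allele):
    """Count (risk allele, other allele) copies in a list like ['G/G', 'G/C']."""
    risk = sum(g.split("/").count(risk_allele) for g in genotypes)
    return risk, 2 * len(genotypes) - risk

def allelic_chi_square(cases, controls, risk_allele):
    """Pearson chi-square statistic (1 degree of freedom) for the 2x2 allele table."""
    a, b = allele_counts(cases, risk_allele)
    c, d = allele_counts(controls, risk_allele)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

cases = ["G/G"] * 30 + ["G/C"] * 50 + ["C/C"] * 20
controls = ["G/G"] * 10 + ["G/C"] * 40 + ["C/C"] * 50
chi2 = allelic_chi_square(cases, controls, "G")
p = p_value_1df(chi2)
print(round(chi2, 2), p)
```

Here the G allele is more frequent in cases (110 of 200 chromosomes) than in controls (60 of 200), which gives a large statistic. An association like this flags a region of the genome; it does not by itself show that the SNP causes the disease.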

Examples of GWAS Findings

One of the first successful GWAS (2005) linked a variant in the complement factor H (CFH) gene to age-related macular degeneration.
Studies have found SNPs associated with heart disease, diabetes, cancer, neurological disorders, and more.
For stuttering, for example, GWAS findings point to a neurological rather than a purely behavioral basis.

Personal Genomics and Direct-to-Consumer Testing

Companies like 23andMe offer SNP analysis and genome sequencing to assess disease risk and ancestry.
Reports include monogenic and polygenic risk predictions.
Ethical concerns include reporting risk for conditions that cannot be treated and making predictions about behavioral traits.

Pharmacogenomics

Pharmacogenomics studies how genetic variation affects drug response. For example, variants in the CYP2D6 gene affect codeine metabolism: the CYP2D6 enzyme converts codeine into morphine, its active form, so these variants influence both drug efficacy and safety.

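About 10% of people lack functional CYP2D6 and get little or no pain relief from codeine.
About 30% of East Africans carry multiple functional copies of the gene and convert codeine to morphine unusually quickly, which puts them at risk of morphine toxicity at standard doses.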
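
A deliberately simplified Python sketch of how a genotype might be turned into a metabolizer category. It assumes that only the number of functional CYP2D6 copies matters and uses cutoffs chosen for illustration; clinical practice relies on allele-specific activity scores, which are not modeled here. The function name and category labels are hypothetical.

```python
def cyp2d6_phenotype(functional_copies):
    """Classify codeine metabolism from the number of functional CYP2D6 copies.

    Illustrative assumption: 0 copies = poor, 1-2 = typical, 3+ = ultrarapid.
    """
    if functional_copies < 0:
        raise ValueError("copy number cannot be negative")
    if functional_copies == 0:
        return "poor metabolizer: little or no codeine converted to morphine"
    if functional_copies <= 2:
        return "typical range"
    return "ultrarapid metabolizer: codeine converted to morphine faster than usual"

for copies in (0, 2, 3):
    print(copies, "->", cyp2d6_phenotype(copies))
```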

Preimplantation Genetic Diagnosis (PGD)

PGD is used during in vitro fertilization to test embryos for genetic variants or chromosomal abnormalities before implantation, allowing selection of embryos with desired genetic traits.

Key Equations and Definitions

Polymorphic Site: A DNA locus with two or more alleles, each present in >1% of the population.
SNP (Single Nucleotide Polymorphism): A single-base-pair site in the genome at which different alleles occur in the population.
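Polygenic Risk Score: $\mathrm{PRS} = \sum_{i=1}^{n} \beta_i G_i$, where $\beta_i$ is the effect size of SNP $i$, $G_i$ is the genotype at SNP $i$ (0, 1, or 2 risk alleles), and $n$ is the number of SNPs in the score.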
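
A short Python sketch of the polygenic risk score as a weighted sum: for each SNP, multiply its effect size by the number of risk alleles the person carries (0, 1, or 2), then add the products. The SNP names, effect sizes, and risk alleles below are hypothetical values for illustration only.

```python
def count_risk_alleles(genotype, risk_allele):
    """Number of risk alleles (0, 1, or 2) in a genotype string like 'G/C'."""
    return genotype.split("/").count(risk_allele)

def polygenic_risk_score(effect_sizes, risk_alleles, genotypes):
    """Sum over SNPs of effect size times risk-allele count."""
    return sum(
        beta * count_risk_alleles(genotypes[snp], risk_alleles[snp])
        for snp, beta in effect_sizes.items()
    )

effect_sizes = {"snp_1": 0.2, "snp_2": -0.1, "snp_3": 0.05}   # hypothetical weights
risk_alleles = {"snp_1": "G", "snp_2": "T", "snp_3": "A"}
genotypes = {"snp_1": "G/G", "snp_2": "T/C", "snp_3": "C/C"}  # one person's genotypes

score = polygenic_risk_score(effect_sizes, risk_alleles, genotypes)
print(round(score, 3))  # 0.2*2 + (-0.1)*1 + 0.05*0 = 0.3
```

A score on its own has no units. It is usually interpreted by comparing it with the distribution of scores in a reference population.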

Summary Table: Types of Human Genetic Variation

Type	Description	Frequency
SNP	Single base pair change	Most common (90%)
Insertion/Deletion	Gain or loss of small DNA segments	~10%
Copy Number Variant	Variation in the number of copies of a gene or region	Common
Structural Variant	Large-scale rearrangements (deletions, inversions, duplications)	2,100–2,500 per genome

Conclusion

The sequencing of the human genome and the study of human genetic variation have revolutionized our understanding of biology, medicine, and evolution. Advances in sequencing technology and bioinformatics continue to drive discoveries in disease genetics, pharmacogenomics, and personalized medicine.