The Human Genome and Human Genetic Variation: Structure, Sequencing, and Implications
The Human Genome and Human Genetic Variation
Introduction to the Human Genome
The human genome is the complete set of genetic information for Homo sapiens, encoded as DNA within the 23 pairs of chromosomes. Understanding the human genome is a central goal of modern biology, providing insights into gene function, evolution, and the genetic basis of human diversity and disease.
Genome Size: Approximately 3 billion base pairs (bp).
Protein-Coding Genes: About 20,000 genes scattered across the 23 chromosome pairs.
Key Questions: How many genes do we have? What do they do? How do our genomes compare to other species? How much variation exists between individuals and populations?
Sequencing the Human Genome
The Human Genome Project
The Human Genome Project (HGP) was an international, publicly funded effort launched in 1990 to determine the sequence of the entire human genome. It was completed in 2003, with a parallel effort by the private company Celera Genomics. The project relied on years of research using model organisms and advanced sequencing technologies.
Public Project: International collaboration (US, UK, Japan, China, Germany, France).
Private Project: Celera Genomics, led by Craig Venter.
Outcome: Both teams published draft genome sequences in 2001, and the public project was declared complete in 2003. The public project cost about $3 billion; Celera spent about $300 million and drew on the public data.

Whole Genome Shotgun Sequencing
Whole genome shotgun sequencing is a method used to sequence the human genome by breaking it into small fragments, sequencing each fragment, and then assembling the sequences using computational methods.
Library Construction: The genome is digested into random fragments (~2000 bp), each cloned into a bacterial plasmid to create a genomic library.
Sequencing: Each fragment is sequenced, originally using Sanger (dideoxy) sequencing.
Assembly: Computer algorithms identify overlapping sequences to reconstruct the entire genome.
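The fragmentation step can be illustrated with a small simulation. This is a minimal sketch, not the procedure used by either project: the function name `shotgun_fragments`, the toy 40 bp "genome", and the `coverage` parameter (average number of fragments covering each base) are assumptions made for illustration.

```python
import random

def shotgun_fragments(genome, fragment_length, coverage, seed=0):
    """Sample random fragments until the requested average coverage is reached.

    Returns (start, sequence) pairs; the start position is kept only so the
    example can be checked, a real experiment does not know it.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_fragments = (coverage * len(genome)) // fragment_length
    fragments = []
    for _ in range(n_fragments):
        start = rng.randrange(0, len(genome) - fragment_length + 1)
        fragments.append((start, genome[start:start + fragment_length]))
    return fragments

# Toy example: a 40 bp sequence sampled at 5x coverage with 10 bp fragments.
genome = "ATGCGTACGTTAGCCTAGGATCCGATTACAGGCTTAACGT"
fragments = shotgun_fragments(genome, fragment_length=10, coverage=5)
for start, sequence in sorted(fragments)[:5]:
    print(start, sequence)
```

Because the start positions are random, neighboring fragments usually share some sequence, and that shared sequence is what the assembly step relies on.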


Making a Genomic DNA Library
To sequence the genome, DNA is partially digested to produce overlapping fragments, which are then cloned into plasmids and stored in bacteria. This collection is called a genomic library.
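Partial Digestion: A restriction enzyme cuts DNA at its specific recognition sites, but the reaction is stopped before every site has been cut. Because different copies of the genome are cut at different subsets of sites, the resulting fragments overlap.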
Cloning: Each fragment is inserted into a plasmid and transformed into bacteria.
Storage: Bacteria are stored in plates, each well containing a unique DNA fragment.
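Partial digestion can be sketched as cutting each copy of the DNA at only some of an enzyme's recognition sites. The example below assumes the EcoRI site GAATTC (cut after the first G); the function name `partial_digest` and the `cut_probability` parameter are hypothetical and stand in for stopping the reaction early.

```python
import random

def partial_digest(dna, site="GAATTC", cut_offset=1, cut_probability=0.5, seed=0):
    """Cut one DNA copy at a random subset of its recognition sites."""
    rng = random.Random(seed)
    fragments = []
    previous_cut = 0
    position = dna.find(site)
    while position != -1:
        # Each site is cut only with some probability (incomplete digestion).
        if rng.random() < cut_probability:
            cut = position + cut_offset
            fragments.append(dna[previous_cut:cut])
            previous_cut = cut
        position = dna.find(site, position + 1)
    fragments.append(dna[previous_cut:])  # remainder after the last cut
    return fragments

dna = "AAGAATTCCCGAATTCTT"
# Different copies (different seeds) may be cut at different sites,
# which is what produces overlapping fragments across the library.
for copy_number in range(3):
    print(partial_digest(dna, seed=copy_number))
```

Setting `cut_probability=1.0` models a complete digest, in which every copy yields the same non-overlapping fragments.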
Sanger (Dideoxy) DNA Sequencing
Sanger sequencing, developed by Fred Sanger, uses dideoxynucleotides (ddNTPs) to terminate DNA synthesis at specific bases, allowing the sequence to be determined by fragment length and fluorescent labeling.
Components: DNA template, primer, DNA polymerase, four dNTPs, four fluorescently labeled ddNTPs.
Mechanism: Incorporation of a ddNTP terminates chain elongation, producing fragments of varying lengths.
Detection: Fragments are separated by capillary electrophoresis, and the terminal base is identified by fluorescence.
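The logic of chain termination can be sketched in a few lines. This is a simplified model, not a description of any instrument: `synthesized` is the strand being built from the template, `ddntp_fraction` is an assumed chance that a ddNTP rather than a dNTP is incorporated at each position, and both function names are hypothetical.

```python
import random

def sanger_fragments(synthesized, n_molecules=2000, ddntp_fraction=0.1, seed=0):
    """Return (length, terminal_base) pairs for chain-terminated fragments."""
    rng = random.Random(seed)
    fragments = []
    for _ in range(n_molecules):
        for i, base in enumerate(synthesized):
            # Incorporating a ddNTP stops elongation at this position.
            if rng.random() < ddntp_fraction:
                fragments.append((i + 1, base))
                break
    return fragments

def read_sequence(fragments):
    """Order fragments by length (as electrophoresis does) and read the labels."""
    label_by_length = {}
    for length, base in fragments:
        label_by_length[length] = base
    return "".join(label_by_length[n] for n in sorted(label_by_length))

fragments = sanger_fragments("ATGCCGTAAGCT")
print(read_sequence(fragments))
```

With many template molecules, every length is represented at least once, so sorting by length and reading the terminal label recovers the sequence one base at a time.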







Genome Assembly
After sequencing, computational algorithms assemble the short DNA sequences into longer contiguous sequences (contigs) by identifying overlaps. Gaps are filled manually or with additional sequencing.
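Overlap-based assembly can be illustrated with a greedy merge: repeatedly join the two sequences with the longest suffix-to-prefix overlap. This is a toy sketch under simplifying assumptions (error-free reads, one strand, no repeats), not the algorithm used by production assemblers; `overlap`, `greedy_assemble`, and `min_overlap` are names chosen for the example.

```python
def overlap(a, b, min_length):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Merge reads into contigs, always joining the best-overlapping pair first."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_length, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_overlap)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:
            return contigs  # nothing left to merge; remaining breaks are gaps
        merged = contigs[best_i] + contigs[best_j][best_length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```

If two groups of reads share no overlap of at least `min_overlap` bases, the function returns more than one contig, which corresponds to a gap that needs additional sequencing.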

What Did We Learn from the Human Genome Sequence?
Genome Composition
Only about 2% of human DNA codes for proteins.
Approximately 26% is introns, and 20% is intergenic regions (promoters, enhancers).
About 44% consists of repetitive elements, including transposons and viral relics.
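To put these percentages in perspective, they can be converted to approximate base-pair counts. The sketch below uses only the rounded figures from these notes (a 3 billion bp genome and the four listed categories); the variable names are illustrative.

```python
GENOME_SIZE_BP = 3_000_000_000  # approximate genome size from these notes

# Rounded percentages as listed in the notes.
COMPOSITION_PERCENT = {
    "protein-coding": 2,
    "introns": 26,
    "intergenic": 20,
    "repetitive elements": 44,
}

def base_pairs(percent):
    """Convert a percentage of the genome into base pairs."""
    return GENOME_SIZE_BP * percent // 100

sizes = {name: base_pairs(p) for name, p in COMPOSITION_PERCENT.items()}
unlisted_percent = 100 - sum(COMPOSITION_PERCENT.values())

for name, size in sizes.items():
    print(f"{name}: about {size:,} bp")
print(f"not itemized in these notes: {unlisted_percent}%")
```

On these figures, protein-coding sequence amounts to roughly 60 million bp, and the listed categories leave about 8% of the genome unaccounted for.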

Gene Number and Comparison with Other Species
Humans have about 20,000–22,000 protein-coding genes, similar to other complex eukaryotes.
Gene prediction relies on bioinformatics algorithms to identify coding regions, start/stop codons, and splice sites.
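The start/stop codon part of gene prediction can be sketched as a search for open reading frames (ORFs). This is a deliberately minimal example: it scans only the forward strand and ignores introns and splice sites, which real gene-prediction software must handle. The function name `find_orfs` and the `min_codons` threshold are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for forward-strand ORFs in all three frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # An in-frame stop codon closes the ORF (stop codon included).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs

print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 'ATGAAATTTTAG')]
```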

Bioinformatics Tools
GenBank: A public database at NCBI containing publicly available DNA sequence data.
BLAST (Basic Local Alignment Search Tool): Compares a query sequence to known sequences to identify similarities and predict function.
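The idea of local alignment behind BLAST can be illustrated with the Smith-Waterman algorithm, which finds the best-scoring matching region between two sequences. Note the substitution: BLAST itself uses a faster heuristic (seed matches that are then extended), so this is a sketch of the underlying concept rather than of BLAST. The scoring values and the toy "database" entries are assumptions.

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score of the best local alignment between two sequences."""
    rows, cols = len(query) + 1, len(subject) + 1
    scores = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = query[i - 1] == subject[j - 1]
            diagonal = scores[i - 1][j - 1] + (match if same else mismatch)
            # Scores never drop below zero, so an alignment can start anywhere.
            scores[i][j] = max(0, diagonal,
                               scores[i - 1][j] + gap,
                               scores[i][j - 1] + gap)
            best = max(best, scores[i][j])
    return best

# Hypothetical mini-database: rank entries by similarity to the query.
database = {"seq_1": "CCGATTACATT", "seq_2": "TTTTTTTTTT", "seq_3": "GGGATTAGGG"}
query = "GATTACA"
ranked = sorted(database, key=lambda name: local_alignment_score(query, database[name]),
                reverse=True)
print(ranked[0])  # seq_1
```

A high-scoring hit against a gene of known function is what suggests a possible function for the query sequence.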



Gene Function Categories
Largest categories: cell signaling, gene regulation, unknown function (about 20%).

Comparative Genomics
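Humans share a high percentage of DNA with other species (e.g., about 99% with Neanderthals, 96% with chimpanzees, 50% with bananas). The exact figures depend on what is compared: values for close relatives reflect identity across aligned sequence, while the banana figure refers to genes with a recognizable counterpart rather than identical DNA.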
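A "percent shared" figure of this kind is typically a percent identity over aligned positions. The sketch below assumes the two sequences have already been aligned, with `-` marking gaps, and skips gapped columns; that convention and the function name `percent_identity` are assumptions, and different conventions give different numbers.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned, ungapped positions at which two sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for base_a, base_b in zip(aligned_a, aligned_b):
        if base_a == "-" or base_b == "-":
            continue  # gap columns are excluded under this convention
        compared += 1
        matches += base_a == base_b
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * matches / compared

print(percent_identity("ACGTACGT", "ACGTACGA"))  # 87.5
```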
Human Genetic Variation
Types of Genetic Variation
Polymorphic Sites: Sites with two or more alleles present in >1% of the population.
Single Nucleotide Polymorphisms (SNPs): The most common type, accounting for about 90% of human genetic variation.
Other Variants: Insertions, deletions, copy number variants, and structural variants.
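The 1% threshold in the definition of a polymorphic site can be applied directly to allele counts. The sketch below assumes a list of alleles observed at one site across sampled chromosomes; the function names and the sample data are made up for illustration.

```python
from collections import Counter

def allele_frequencies(alleles):
    """Frequency of each allele among the sampled chromosomes."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}

def is_polymorphic(alleles, threshold=0.01):
    """True if at least two alleles each exceed the frequency threshold."""
    frequencies = allele_frequencies(alleles)
    common = [a for a, freq in frequencies.items() if freq > threshold]
    return len(common) >= 2

# Hypothetical site sampled on 200 chromosomes: the C allele is at 2.5%.
site = ["G"] * 195 + ["C"] * 5
print(allele_frequencies(site)["C"], is_polymorphic(site))  # 0.025 True
```

A variant seen on only 1 of 1,000 chromosomes (0.1%) would fail the test and count as a rare variant rather than a polymorphism.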
The 1000 Genomes Project
This project sequenced genomes from diverse populations to catalog human genetic variation. A typical person’s genome differs from the reference at over 4 million sites, with most variation shared globally.
Each genome contains 2,100–2,500 structural variants, 10,000–12,000 protein-altering variants, and 150–200 truncating variants.
There is no single 'human genome'; variation is extensive and complex.
Applications of Human Genetic Variation
Mapping disease genes
Determining disease susceptibility
Forensics
Understanding human evolution and ancestry
Genome-Wide Association Studies (GWAS)
GWAS identify associations between SNPs and diseases by comparing the genomes of affected and unaffected individuals using SNP chips, which can analyze over 900,000 SNPs simultaneously.
SNP Chips: Contain allele-specific probes; DNA fragments hybridize to matching probes and are detected by fluorescence.
Genotypes: Individuals can be homozygous or heterozygous at each SNP (e.g., G/G, C/C, G/C).
Polygenic Risk Scores: Estimate disease risk based on the combination of multiple SNPs.
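The case-versus-control comparison at a single SNP can be sketched as a chi-square test on allele counts. This is one simple form of association test, assumed here for illustration (real studies typically use regression models with covariates); the genotype data are invented and the function names are hypothetical.

```python
import math

def allele_counts(genotypes, allele):
    """Count copies of one allele, and of all other alleles, in 'G/C'-style genotypes."""
    carried = sum(genotype.split("/").count(allele) for genotype in genotypes)
    return carried, 2 * len(genotypes) - carried

def allelic_chi_square(cases, controls, allele):
    """Chi-square statistic (1 degree of freedom) and p-value for a 2x2 allele table."""
    a, b = allele_counts(cases, allele)
    c, d = allele_counts(controls, allele)
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0, 1.0  # a row or column is empty; no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value

# Invented data: the G allele is more frequent in cases than in controls.
cases = ["G/G"] * 30 + ["G/C"] * 50 + ["C/C"] * 20
controls = ["G/G"] * 10 + ["G/C"] * 40 + ["C/C"] * 50
chi2, p_value = allelic_chi_square(cases, controls, "G")
print(round(chi2, 2), p_value)
```

Because a chip tests hundreds of thousands of SNPs at once, a very strict significance threshold is needed; a common convention is p < 5 × 10^-8.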
Examples of GWAS Findings
The first GWAS identified a gene associated with age-related macular degeneration.
Studies have found SNPs associated with heart disease, diabetes, cancer, neurological disorders, and more.
Stuttering, for example, is now linked to neurological rather than behavioral causes.
Personal Genomics and Direct-to-Consumer Testing
Companies like 23andMe offer SNP analysis and genome sequencing to assess disease risk and ancestry.
Reports include monogenic and polygenic risk predictions.
Ethical concerns exist regarding information for untreatable conditions and behavioral genetics.
Pharmacogenomics
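Pharmacogenomics studies how genetic variation affects drug response. For example, the enzyme encoded by the CYP2D6 gene converts codeine into morphine, its active form, so variants in CYP2D6 influence the drug's efficacy and safety.
About 10% of people lack functional CYP2D6 and do not benefit from codeine.
About 30% of East Africans have multiple copies of the gene. They convert codeine to morphine unusually quickly, which raises the risk of toxicity at standard doses; for drugs that CYP2D6 inactivates, they may instead require higher doses.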
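The link between gene copy number and drug response can be sketched as a lookup from the number of functional CYP2D6 copies to a metabolizer category. This is a deliberately simplified illustration and not clinical guidance: real classifications score the activity of each allele, and the function name and cutoffs here are assumptions.

```python
def cyp2d6_phenotype(functional_copies):
    """Map a count of functional CYP2D6 gene copies to a simplified category."""
    if functional_copies < 0:
        raise ValueError("copy number cannot be negative")
    if functional_copies == 0:
        return "poor metabolizer"        # no functional enzyme
    if functional_copies <= 2:
        return "normal metabolizer"      # one or two functional copies
    return "ultrarapid metabolizer"      # extra functional copies

for copies in (0, 2, 3):
    print(copies, cyp2d6_phenotype(copies))
```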
Preimplantation Genetic Diagnosis (PGD)
PGD is used during in vitro fertilization to test embryos for genetic variants or chromosomal abnormalities before implantation, allowing selection of embryos with desired genetic traits.
Key Equations and Definitions
Polymorphic Site: A DNA locus with two or more alleles, each present in >1% of the population.
SNP (Single Nucleotide Polymorphism): A single base pair change in the genome.
Polygenic Risk Score: $\mathrm{PRS} = \sum_{i} \beta_i G_i$, where $\beta_i$ is the effect size of SNP $i$, and $G_i$ is the genotype (0, 1, or 2 risk alleles).
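A polygenic risk score is commonly computed as a weighted sum: each SNP's effect size multiplied by the number of risk alleles carried (0, 1, or 2). The sketch below assumes that form; the SNP identifiers and effect sizes are invented for illustration and do not come from any study.

```python
def polygenic_risk_score(effect_sizes, genotypes):
    """Sum of effect size times risk-allele count over all scored SNPs."""
    missing = set(effect_sizes) - set(genotypes)
    if missing:
        raise KeyError(f"no genotype for SNPs: {sorted(missing)}")
    return sum(beta * genotypes[snp] for snp, beta in effect_sizes.items())

# Hypothetical effect sizes and one person's risk-allele counts.
effect_sizes = {"snp_a": 0.2, "snp_b": -0.1, "snp_c": 0.5}
genotypes = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
score = polygenic_risk_score(effect_sizes, genotypes)
print(round(score, 3))  # 0.3
```

A score on its own is hard to interpret; it is usually compared with the distribution of scores in a reference population.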
Summary Table: Types of Human Genetic Variation
| Type | Description | Frequency |
|---|---|---|
| SNP | Single base pair change | Most common (90%) |
| Insertion/Deletion | Gain or loss of small DNA segments | ~10% |
| Copy Number Variant | Variation in the number of copies of a gene or region | Common |
| Structural Variant | Large-scale rearrangements (deletions, inversions, duplications) | 2,100–2,500 per genome |
Conclusion
The sequencing of the human genome and the study of human genetic variation have revolutionized our understanding of biology, medicine, and evolution. Advances in sequencing technology and bioinformatics continue to drive discoveries in disease genetics, pharmacogenomics, and personalized medicine.