Genomics, Sequencing Technologies, and Haplotype Analysis: Study Notes
Genomics and the Human Genome Project
Types of Genomics Studies
Genomics is the study of the structure, function, evolution, and mapping of genomes. Different types of genomics studies include:
Whole Genome Sequencing (WGS): Sequencing the entire DNA content of an organism.
Whole Exome Sequencing (WES): Sequencing only the protein-coding regions (exons) of the genome.
Transcriptome Analysis (RNA-seq): Sequencing all RNA transcripts present in a cell or tissue at a given time.
Example: WGS is used to identify all genetic variants in a patient with a rare disease.
The Human Genome Project (HGP)
The Human Genome Project was an international effort to sequence the entire human genome. It used shotgun genome sequencing, which involves breaking DNA into small fragments, sequencing them, and assembling the sequences using computational methods.
Genome size: 3.2 billion base pairs
Protein-coding genes: ~25,000 (about 2% of the genome)
Gene average size: 3,000 bp
Exon average size: 150 bp
Introns: Can be >100 kb, much larger than exons
Repetitive DNA: Large portion of the genome
Genetic similarity: Two unrelated people share 99.5% of their DNA sequence
Example: The HGP revealed that only a small fraction of the genome codes for proteins, with many regions consisting of non-coding or repetitive DNA.
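As a rough check, the figures above can be combined arithmetically. The sketch below uses only the rounded numbers quoted in these notes, so the results are back-of-the-envelope estimates; the variable names are illustrative.

```python
# Rounded figures taken from the notes above.
GENOME_SIZE_BP = 3_200_000_000   # 3.2 billion base pairs
N_GENES = 25_000                 # approximate protein-coding gene count
AVG_GENE_BP = 3_000              # average gene size quoted in the notes
SHARED_FRACTION = 0.995          # similarity between two unrelated people

# Total protein-coding DNA and its share of the genome.
coding_bp = N_GENES * AVG_GENE_BP
coding_fraction = coding_bp / GENOME_SIZE_BP

# Number of positions expected to differ between two unrelated people.
differing_bp = round(GENOME_SIZE_BP * (1 - SHARED_FRACTION))

print(f"Protein-coding DNA: {coding_bp:,} bp ({coding_fraction:.1%} of the genome)")
print(f"Expected differences between two people: ~{differing_bp:,} bp")
```

With these inputs the coding fraction comes out at about 2.3%, consistent with the "about 2%" figure, and a 0.5% difference corresponds to roughly 16 million base pairs.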
Shotgun Sequencing and Contig Assembly
Shotgun sequencing involves randomly breaking up DNA sequences into small pieces, sequencing them, and then assembling the overlapping fragments into a continuous sequence (contig).
Contig: A set of overlapping DNA segments that together represent a consensus region of DNA.
Genomic library: A collection of DNA fragments cloned into vectors (e.g., plasmids) for sequencing.
Why clone DNA fragments into plasmid vectors?
To provide a universal primer binding site for sequencing
To allow reproducible sequencing of each fragment
To facilitate assembly of overlapping sequences
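The assembly step can be illustrated with a toy example. This is a minimal sketch of greedy overlap merging, assuming error-free reads from a single strand and a hypothetical minimum overlap of 3 bases; production assemblers use far more sophisticated graph-based methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # no overlaps left; each remaining string is a contig
        # Join the two sequences, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGCGTAC", "GTACCTGA", "CTGATTGC"]
print(greedy_assemble(reads))  # ['ATGCGTACCTGATTGC']
```

Each read overlaps the next by four bases, so the three reads collapse into a single contig. Repetitive sequences break this simple approach, because a repeat creates overlaps between reads that are not actually adjacent in the genome.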
DNA Sequencing Technologies
Sanger Sequencing
Sanger sequencing, also known as chain-termination sequencing, is a method for determining the nucleotide sequence of DNA. It was the primary technology used in the HGP.
Relies on incorporation of dideoxynucleotides (ddNTPs) to terminate DNA synthesis at specific bases
Produces DNA fragments of varying lengths, which are separated by electrophoresis
The sequence is read from the pattern of terminated fragments
Limitation: Sanger sequencing is low-throughput and best for sequencing small DNA fragments.
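The read-out logic of chain termination can be sketched in a few lines. This is an idealized simulation, assuming that termination occurs once at every position and that the template is written 3' to 5' so the new strand reads 5' to 3'; the function names are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def terminated_fragments(template_3to5: str) -> list[tuple[int, str]]:
    """Return (fragment length, terminal ddNTP base) for every termination point."""
    # The polymerase builds the strand complementary to the template.
    new_strand = "".join(COMPLEMENT[base] for base in template_3to5)
    return [(i + 1, base) for i, base in enumerate(new_strand)]


def read_gel(fragments: list[tuple[int, str]]) -> str:
    """Order fragments by length, as electrophoresis does, and read the terminal bases."""
    return "".join(base for _, base in sorted(fragments))


fragments = terminated_fragments("TACGGATC")
random.shuffle(fragments)   # fragments are mixed together in the reaction tube
print(read_gel(fragments))  # ATGCCTAG
```

Sorting by length stands in for electrophoresis: the shortest fragment carries the first base of the new strand, the next shortest carries the second, and so on.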
Next-Generation Sequencing (NGS)
NGS allows massively parallel sequencing of millions of DNA fragments. The most widely used approach is sequencing by synthesis.
DNA fragments are attached to beads or a solid surface and amplified
Nucleotides are added one at a time; incorporation emits a signal detected by a computer
Short read lengths (300–700 bp) require computational assembly
Challenges: Difficulties with repetitive regions, insertions/deletions, and computational reassembly
Example: Illumina sequencing is a widely used NGS platform.
Third-Generation Sequencing
Third-generation sequencing, such as nanopore sequencing, reads single DNA molecules without amplification or cloning.
DNA is passed through a nanopore; changes in electrical current indicate the identity of each base
Allows for much longer read lengths (tens of kilobases)
Can detect modified bases (e.g., methylated CpG)
Reduces the need for sequence assembly
Advantages over NGS:
No need for shotgun cloning
Longer reads simplify assembly
Direct detection of base modifications
Gene Annotation and Functional Genomics
Gene Annotation
Gene annotation is the process of identifying the locations and functions of genes within a genome.
Distinguishes between genes and pseudogenes (non-functional gene copies)
Determines gene structure: exons, introns, regulatory elements
Describes gene expression patterns and functional roles
Functional genomics combines molecular biology, cell biology, and biochemistry to study gene function.
Finding Genes in the Genome
Genes are identified by characteristic sequence features:
Promoters (e.g., TATA box, CAAT box)
Exons and introns
Splice sites (GT/AG rule)
Transcription start and termination sites
Polyadenylation signals
Example: Computational gene prediction algorithms scan for these features to annotate genes.
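A very simplified version of such a scan is shown below. It is a sketch only: the consensus motifs are textbook approximations, the test sequence is invented, and real gene-prediction tools rely on probabilistic models rather than exact pattern matching. Splice-site dinucleotides (GT and AG) are left out because they occur too often by chance to be informative on their own.

```python
import re

# Simplified consensus motifs (assumed for illustration).
MOTIFS = {
    "CAAT box": r"CCAAT",
    "TATA box": r"TATA[AT]A",
    "polyadenylation signal": r"AATAAA",
}


def scan_features(seq: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, feature name, matched text), sorted by position."""
    hits = []
    for name, pattern in MOTIFS.items():
        # The lookahead lets overlapping occurrences be reported as well.
        for m in re.finditer(f"(?=({pattern}))", seq.upper()):
            hits.append((m.start(), name, m.group(1)))
    return sorted(hits)


# Invented sequence containing one copy of each motif.
seq = "GGCCAATCTGCGTATAAAGGCATGGTAAGTCCTTTCAGGCTAAAATAAAGC"
for pos, name, text in scan_features(seq):
    print(pos, name, text)
```

On this sequence the scan reports a CAAT box at position 2, a TATA box at position 12, and a polyadenylation signal at position 43.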
The -Omics Revolution
Major -Omics Fields
Genomics: Study of entire genomes
Transcriptomics: Study of all RNA transcripts (the transcriptome)
Metagenomics: Study of genetic material recovered directly from environmental samples
Other fields: proteomics, pharmacogenomics, metabolomics, glycomics, epigenomics, toxicogenomics, interactomics
All -omics fields rely heavily on bioinformatics for data analysis.
Bioinformatics and Genomic Data Analysis
Bioinformatics
Bioinformatics combines biology and computer science to analyze and interpret biological data, especially large datasets from sequencing projects.
BLAST (Basic Local Alignment Search Tool): Compares nucleotide or protein sequences to sequence databases and calculates statistical significance
OMIM (Online Mendelian Inheritance in Man): Database of human genes and genetic disorders
Example: BLAST is used to identify homologous genes in different species.
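The idea of local alignment scoring can be illustrated with the Smith-Waterman algorithm. This is not how BLAST itself works: BLAST uses a faster seed-and-extend heuristic and also reports statistical significance, which this sketch omits. The scoring values and the tiny "database" below are assumptions chosen for illustration.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


query = "GATTACA"
database = {  # hypothetical sequences
    "human": "CCGATTACAGG",
    "mouse": "CCGATCACAGG",
    "unrelated": "TTTTTTTT",
}
ranked = sorted(database, key=lambda name: smith_waterman_score(query, database[name]),
                reverse=True)
for name in ranked:
    print(name, smith_waterman_score(query, database[name]))
```

The exact match scores 14, the sequence with one substitution scores 11, and the unrelated sequence scores 4, so ranking by score places the likely homologs first.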
Genotyping and Haplotype Analysis
Pre-Genome Sequencing Genotyping Methods
Cytogenetics: Chromosome staining to visualize chromosomal abnormalities
Single gene sequencing: Using Sanger sequencing for specific genes
Haplotype mapping: Identifying patterns of genetic variation using RFLP (restriction fragment length polymorphism) analysis or DNA arrays
Haplotypes and Haplogroups
Haplotype: A group of alleles or SNPs inherited together from a single parent
Haplogroup: A group of similar haplotypes that share a common ancestor
Haplotype Mapping and DNA Markers
Haplotype mapping uses DNA markers, such as SNPs, to create detailed maps of genetic variation.
On average, one SNP occurs about every 1,000 bp in human DNA, and over 13 million SNPs have been catalogued across the population
Tag SNP: A representative SNP in a region of the genome with high linkage disequilibrium, genotyped in place of the other SNPs linked to it
Used in genetic screening and association studies
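Tag SNP selection can be sketched as follows. The example measures linkage disequilibrium with the standard r² statistic and then greedily picks SNPs that capture the most others. The 0.8 threshold, the function names, and the haplotype data are assumptions for illustration; alleles are coded 0 or 1.

```python
def r_squared(haplotypes, i, j):
    """LD r^2 between SNP columns i and j; alleles are coded 0/1."""
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n
    p_b = sum(h[j] for h in haplotypes) / n
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # one of the sites does not vary in this sample
    d = p_ab - p_a * p_b
    return d * d / denom


def pick_tag_snps(haplotypes, threshold=0.8):
    """Greedy selection: map each chosen tag SNP to the SNPs it captures."""
    untagged = set(range(len(haplotypes[0])))
    tags = {}
    while untagged:
        # Choose the SNP that is in high LD with the most untagged SNPs.
        best = max(sorted(untagged),
                   key=lambda s: sum(r_squared(haplotypes, s, t) >= threshold
                                     for t in untagged))
        captured = {t for t in untagged
                    if t == best or r_squared(haplotypes, best, t) >= threshold}
        tags[best] = sorted(captured)
        untagged -= captured
    return tags


# Invented data: SNPs 0, 1 and 2 are inherited together; SNP 3 is independent.
haplotypes = [
    (0, 0, 1, 0), (0, 0, 1, 1), (0, 0, 1, 0), (0, 0, 1, 1),
    (1, 1, 0, 0), (1, 1, 0, 1), (1, 1, 0, 0), (1, 1, 0, 1),
]
print(pick_tag_snps(haplotypes))  # {0: [0, 1, 2], 3: [3]}
```

Genotyping SNPs 0 and 3 is enough to infer all four sites in this example, which is why arrays can cover a region with far fewer probes than there are SNPs.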
Array-Based Genotyping (DNA Microarrays)
DNA microarrays are used to genotype thousands of SNPs simultaneously.
Solid support contains ssDNA probes complementary to tag SNPs
Sample DNA is fragmented, labeled, and hybridized to the array
Fluorescence indicates the presence of specific SNPs
Computer scans and maps fluorescence across the chip
| Step | Description |
|---|---|
| Probe attachment | ssDNA probes fixed to chip at known locations |
| Sample preparation | Genomic DNA fragmented and labeled |
| Hybridization | Labeled DNA binds to complementary probes |
| Detection | Fluorescence measured to determine SNP presence |
Example: 23andMe uses a proprietary DNA array to genotype 600,000 SNPs for direct-to-consumer genetic testing.
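The detection step can be illustrated with a simple genotype-calling rule. This sketch assumes each SNP has two probes (one per allele, labeled A and B) and uses made-up intensity thresholds and SNP names; commercial platforms use proprietary clustering algorithms rather than fixed cut-offs.

```python
def call_genotype(signal_a: float, signal_b: float,
                  min_total: float = 200.0,
                  het_band: tuple[float, float] = (0.3, 0.7)) -> str:
    """Call AA, AB, or BB from the fluorescence of the two allele probes."""
    total = signal_a + signal_b
    if total < min_total:
        return "no call"  # too little signal to trust
    frac_b = signal_b / total
    if frac_b < het_band[0]:
        return "AA"
    if frac_b > het_band[1]:
        return "BB"
    return "AB"


# Hypothetical intensities: (allele A probe, allele B probe).
intensities = {
    "snp_1": (1500.0, 90.0),
    "snp_2": (800.0, 760.0),
    "snp_3": (60.0, 1400.0),
    "snp_4": (50.0, 40.0),
}
for snp, (a, b) in intensities.items():
    print(snp, call_genotype(a, b))
```

The four example SNPs are called AA, AB, BB, and "no call" respectively. Using the fraction of signal from one allele, rather than raw intensity, makes the call less sensitive to how much DNA hybridized overall.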
Summary Table: Sequencing Technologies
| Technology | Key Features | Read Length | Cloning Required? |
|---|---|---|---|
| Sanger Sequencing | Chain-termination, low-throughput | ~700 bp | Yes |
| Next-Generation Sequencing | Massively parallel, short reads | 300–700 bp | No |
| Third-Generation Sequencing | Single-molecule, long reads, direct detection | 10,000+ bp | No |
Key Terms and Definitions
Contig: Overlapping DNA segments that together represent a consensus region
Pseudogene: A DNA sequence that resembles a functional gene but is itself non-functional
Gene annotation: Process of identifying gene locations and functions
Bioinformatics: Application of computational tools to analyze biological data
Haplotype: Set of DNA variations inherited together
Tag SNP: Representative SNP used to infer the presence of other linked SNPs
Equations and Concepts
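Genetic similarity between individuals: The percentage of compared positions that are identical (identical positions ÷ positions compared × 100); about 99.5% for two unrelated people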
Linkage disequilibrium (LD): Non-random association of alleles at different loci
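LD is commonly quantified with the following standard population-genetics measures, stated here as general definitions rather than anything specific to these notes. For alleles A and B at two loci, let p_A and p_B be their frequencies and p_AB the frequency of haplotypes carrying both:

$$
D = p_{AB} - p_A\,p_B, \qquad r^2 = \frac{D^2}{p_A\,(1 - p_A)\,p_B\,(1 - p_B)}
$$

D = 0 means the two loci are statistically independent. An r² close to 1 means that the allele at one locus almost fully predicts the allele at the other, which is the property a tag SNP relies on.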