DNA Sequencing and Gene Annotation: Methods and Applications in Biochemistry


DNA Sequencing Methods

Overview of Sequencing Technologies

DNA sequencing is a fundamental technique in biochemistry for determining the precise order of nucleotides within a DNA molecule. Modern sequencing methods enable the analysis of entire genomes, gene expression, and protein-coding regions.

Sanger Sequencing: Uses chain-terminating dideoxynucleotides (ddNTPs) and fluorescent detection to read DNA sequences. Best suited to individual fragments up to roughly 700 bp.
Illumina Sequencing: Employs massively parallel sequencing of short DNA fragments, generating millions of reads per run. Used for whole-genome sequencing and gene expression analysis.
Oxford Nanopore Sequencing: Directly reads long DNA fragments by measuring changes in electrical current as DNA passes through a nanopore.

Sanger Sequencing: Principles and Workflow

Sanger sequencing relies on the incorporation of dideoxynucleotides (ddNTPs) during DNA synthesis, which terminate chain elongation. Fluorescently labeled ddNTPs allow for detection and analysis of the resulting DNA fragments.

Standard PCR: Amplifies the DNA region of interest using specific primers.
Chain Termination: ddNTPs are incorporated at random positions in place of the corresponding dNTPs, producing a population of fragments of every possible length, each ending in a labeled base.
Detection: Fragments are separated by capillary electrophoresis, and fluorescence is measured to determine the sequence.
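
The chain-termination and detection steps above can be illustrated with a toy simulation. This is a minimal sketch, not a model of real chemistry: it assumes one terminated fragment per position, writes the sequence as the newly synthesized strand, and uses hypothetical function names.

```python
import random


def terminated_fragments(synthesized_strand):
    """Return (length, terminal_base) for each possible ddNTP-terminated product.

    Simplifying assumption: a fragment of length k ends in the k-th base
    of the newly synthesized strand.
    """
    return [(k, synthesized_strand[k - 1])
            for k in range(1, len(synthesized_strand) + 1)]


def read_sequence(fragments):
    """Mimic capillary electrophoresis: order by length, read the end labels."""
    return "".join(base for _, base in sorted(fragments))


fragments = terminated_fragments("GATTACA")
random.shuffle(fragments)  # fragments start out as an unordered mixture
print(read_sequence(fragments))
```

Sorting by length stands in for size separation, and the terminal base stands in for the fluorescent label detected at each position.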

Quality Scores: Phred-scaled scores indicate the confidence in each base call. Higher scores reflect a lower estimated probability that the call is wrong.

Q20: 99% accuracy (1% error rate)
Q40: 99.99% accuracy (0.01% error rate)
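
The Q20 and Q40 values above follow from the Phred relation Q = -10 log10(P), where P is the estimated probability that a base call is wrong. A short sketch of the conversion in both directions (function names are illustrative):

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


for q in (20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p}, accuracy {1 - p}")
```

Each increase of 10 in Q corresponds to a tenfold drop in the error probability.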

Applications: Sanger sequencing is used for targeted sequencing, mutation analysis, and validation of next-generation sequencing results.

Next Generation Sequencing (NGS): Illumina Platform

Illumina sequencing enables high-throughput analysis of entire genomes by sequencing millions of short DNA fragments in parallel.

Library Preparation: Genomic DNA is fragmented and adapters are added.
Bridge Amplification: DNA fragments are amplified on a flow cell to form clusters.
Sequencing by Synthesis: Fluorescently labeled nucleotides are incorporated, and signals are detected cycle by cycle.
Data Output: Sequencing produces FASTQ files containing reads and quality scores.

Quality Control: Reads are filtered and trimmed to remove low-quality bases and adapter sequences.
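
As a rough illustration of the quality-control step, the sketch below decodes the quality line of one FASTQ record and trims low-quality bases from the 3' end. It assumes the common four-line record layout and Phred+33 encoding. The Q20 threshold, the example record, and the function names are hypothetical, and real pipelines typically rely on dedicated trimming tools.

```python
def decode_qualities(quality_line, offset=33):
    """Convert ASCII quality characters to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def trim_3prime(seq, quals, min_q=20):
    """Drop bases from the 3' end until one meets the quality threshold."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


# One made-up FASTQ record: header, sequence, separator, qualities.
record = ["@read1", "ACGTACGTAC", "+", "IIIIIIII#!"]
header, seq, _, quality_line = record

quals = decode_qualities(quality_line)
trimmed_seq, trimmed_quals = trim_3prime(seq, quals)
print(trimmed_seq)
```

Here `I` decodes to Q40 while `#` and `!` decode to Q2 and Q0, so the last two bases are removed.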

Oxford Nanopore Sequencing: Long-Read Technology

Nanopore sequencing allows for direct, real-time analysis of long DNA molecules by measuring changes in ionic current as DNA passes through a protein nanopore.

Advantages: Can sequence very long fragments (up to hundreds of kilobases).
Applications: Useful for resolving repetitive regions and structural variants in genomes.

Sequence Assembly and Genome Annotation

Sequence Assembly

Sequence assembly is the process of combining short DNA reads into longer contiguous sequences (contigs) and scaffolds, ultimately reconstructing the original genome.

Trimming and Quality Filtering: Remove adapters and low-quality reads.
Alignment: Overlapping reads are aligned to form contigs.
Scaffolding: Contigs are ordered and oriented to form larger scaffolds.
Challenges: Repetitive regions can complicate assembly, especially with short reads.
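
The idea of joining overlapping reads into contigs can be sketched with a toy greedy assembler. This is only an illustration under strong assumptions (error-free reads, a single strand, no repeats, a hypothetical minimum overlap of 3 bases); production assemblers use overlap or de Bruijn graphs and handle sequencing errors.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    seqs = list(reads)
    while len(seqs) > 1:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no overlaps left: remaining sequences stay separate contigs
        merged = seqs[i] + seqs[j][k:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (i, j)] + [merged]
    return seqs


reads = ["ATGGCGT", "GCGTACG", "TACGTTA"]
contigs = greedy_assemble(reads)
print(contigs)
```

With repeats longer than the reads, several merges can look equally good, which is why repetitive regions are hard to assemble from short reads.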

Equation for Coverage:

Coverage ($C$) is defined as the number of times the genome is sequenced:

$$C = \frac{N \times L}{G}$$

Where:

$N$ = number of reads
$L$ = length of each read (bp)
$G$ = genome length (bp)

Probability of Coverage: Under a Poisson model with mean coverage $C$, the probability that a given nucleotide is not sequenced is $P_0 = e^{-C}$. For example, at 5x coverage roughly 0.7% of bases are expected to be missed.
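
The coverage and Poisson calculations can be sketched in Python. This assumes the standard relations (coverage = number of reads × read length / genome length, and probability a base is missed = e^(-coverage)). The function names and the input numbers are hypothetical.

```python
import math


def coverage(num_reads, read_length, genome_length):
    """Mean coverage: total sequenced bases divided by genome length."""
    return num_reads * read_length / genome_length


def prob_uncovered(c):
    """Poisson probability that a given base is covered by zero reads."""
    return math.exp(-c)


# Hypothetical run: 1,000,000 reads of 150 bp over a 5 Mb genome.
c = coverage(num_reads=1_000_000, read_length=150, genome_length=5_000_000)
print(f"Coverage: {c:.1f}x")
print(f"P(base not covered): {prob_uncovered(c):.2e}")
```

Because the missed fraction falls exponentially with coverage, modest increases in sequencing depth sharply reduce the number of gaps expected by chance.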

Genome Annotation

Genome annotation identifies and labels important features in DNA sequences, such as genes, RNA elements, and protein-coding regions.

Intrinsic (ab initio) Methods: Use algorithms to predict gene boundaries and coding regions based on sequence characteristics.
Extrinsic Methods: Incorporate external evidence, such as homology to known genes (BLAST), conserved domains, comparative genomics, and experimental data (RNA-seq, proteomics).

Intrinsic Methods: Gene Prediction

Intrinsic methods analyze raw DNA sequences to identify open reading frames (ORFs), start/stop codons, and regulatory elements.

Prokaryotic Genes: Typically lack introns; gene prediction focuses on start codons (ATG, TTG, GTG), stop codons, and promoter regions.
Eukaryotic Genes: Contain introns and exons; prediction includes splice sites, polyadenylation signals, and untranslated regions (UTRs).
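
A minimal sketch of the prokaryotic case is an open reading frame scan. It assumes the start codons listed above (ATG, TTG, GTG) and the standard stop codons (TAA, TAG, TGA), and searches only the three forward-strand frames. The minimum length and function name are hypothetical. Real gene finders also score ribosome-binding sites, codon usage, and the reverse strand.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    For each ORF, only the first start codon after the previous stop is used.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


seq = "CCATGAAACCCTAGGG"
print(find_orfs(seq))
```

For eukaryotic genes this scan is not sufficient on its own, since introns interrupt the reading frame in genomic DNA.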

Extrinsic Methods: Homology and Experimental Evidence

Extrinsic methods use information from outside the raw DNA sequence to improve annotation accuracy.

Homology: BLAST searches for similar sequences, conserved domains indicate function, and comparative genomics identifies synteny.
Experimental Evidence: RNA-seq identifies transcribed regions; proteomics detects expressed proteins.

Annotation Software Tools

Several bioinformatics tools are used for genome annotation:

Prokka: Prokaryotic genome annotation pipeline; integrates intrinsic and extrinsic methods.
GENSCAN: Eukaryotic gene prediction using intrinsic methods.
AUGUSTUS: Gene predictor used mainly for eukaryotic genomes; runs ab initio (intrinsic) prediction and can also integrate extrinsic evidence such as RNA-seq or protein alignments.

Gene Expression Analysis

Central Dogma of Molecular Biology

The central dogma describes the flow of genetic information from DNA to RNA to protein.

Transcription: DNA is transcribed into RNA.
Translation: RNA is translated into protein.
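
The two steps can be sketched in Python using the standard genetic code. This is a simplified illustration: it takes the coding (sense) strand as input, starts translation at the first AUG, and ignores RNA processing. The function names and example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered by base in the sequence T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon).replace("T", "U"): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """mRNA matches the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate from the first AUG until a stop codon (*) is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCAAATAA"
mrna = transcribe(dna)
print(mrna)
print(translate(mrna))
```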

Transcription in Prokaryotes and Eukaryotes

Prokaryotes: Genes are often organized in operons; transcription and translation are coupled.
Eukaryotes: Genes contain introns and exons; mRNA is processed (splicing, polyadenylation) before translation.

Gene Regulation in Prokaryotes

Gene expression is tightly regulated to ensure proteins are produced only when needed.

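Negative Regulation: A repressor protein binds at the operator and blocks transcription, turning genes "off" when they are not needed (for example, the lac repressor in the absence of lactose).
Positive Regulation: An activator protein binds near the promoter and stimulates transcription, so genes are expressed when required (for example, CAP-cAMP at the lac operon when glucose is scarce).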

Measuring Gene Expression

Gene expression can be measured directly or indirectly using various biochemical techniques.

Direct Methods: Northern blot, qPCR, RNA-seq.
Indirect Methods: Microarrays, expressed sequence tags (ESTs).

qPCR: Quantitative PCR measures the abundance of specific RNA transcripts after reverse transcription to cDNA (RT-qPCR), allowing relative expression to be compared between control and experimental conditions.
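
One common way to compare control and experimental conditions is the 2^(-ΔΔCt) method, which these notes do not describe. The sketch below introduces it as an assumption. It normalizes the target gene's threshold cycle (Ct) to a reference gene and assumes roughly 100% amplification efficiency. The Ct values and function name are hypothetical.

```python
def fold_change(ct_target_exp, ct_ref_exp, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression (experimental vs control) by the 2^(-ddCt) method."""
    delta_exp = ct_target_exp - ct_ref_exp      # dCt, experimental sample
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl   # dCt, control sample
    delta_delta = delta_exp - delta_ctrl        # ddCt
    return 2 ** -delta_delta


# Made-up Ct values: the target crosses threshold 3 cycles earlier
# in the experimental sample, while the reference gene is unchanged.
print(fold_change(ct_target_exp=22.0, ct_ref_exp=18.0,
                  ct_target_ctrl=25.0, ct_ref_ctrl=18.0))
```

A lower Ct means more starting template, so a ΔΔCt of -3 corresponds to about an eightfold increase in expression.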

Summary Table: Sequencing Methods Comparison

Method	Read Length	Throughput	Applications	Limitations
Sanger	~700 bp	Low	Targeted sequencing, mutation analysis	Not suitable for large genomes
Illumina	100-300 bp	High	Whole-genome sequencing, gene expression	Short reads, issues with repeats
Nanopore	Up to 100 kb+	Medium-High	Structural variants, long reads	Higher error rates, complex data analysis

Key Equations

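Coverage: $C = \frac{N \times L}{G}$, where $N$ is the number of reads, $L$ the read length (bp), and $G$ the genome length (bp)
Probability of Coverage (Poisson): $P_0 = e^{-C}$ (probability a base is not covered)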

Conclusion

DNA sequencing and genome annotation are essential tools in biochemistry for understanding genetic information, gene expression, and protein function. Advances in sequencing technologies and bioinformatics have enabled comprehensive analysis of genomes, transcriptomes, and proteomes, facilitating research in molecular biology, genetics, and biotechnology.