Introduction
DNA sequencing: determining the precise order of nucleotides (A, T, C, G) in a DNA molecule. History: Sanger method (1977) → Human Genome Project (2003, $2.7 billion) → next-generation sequencing (2005) → $100 genome (2022). Speed: human genome now sequenced in hours (vs. 13 years for HGP). Impact: personalized medicine, rare disease diagnosis, cancer genomics, evolutionary biology. Current: >10 million human genomes sequenced worldwide.
"DNA sequencing is the most transformative technology in biology since the microscope. Reading the code of life enables us to understand disease, trace evolution, identify organisms, and ultimately guide personalized treatment." -- Genomics pioneer
Sanger Sequencing
Chain Termination Method
Principle: DNA synthesis using dideoxynucleotides (ddNTPs) that terminate chain. Components: template DNA, primer, DNA polymerase, dNTPs + ddNTPs. Mechanism: ddNTPs lack 3'-OH → cannot form next phosphodiester bond → chain terminates. Result: fragments of every possible length, each ending at specific base.
Modern Capillary Electrophoresis
Separation: fragments migrate through polymer-filled capillary (by size). Detection: laser excites fluorescent ddNTP at end of each fragment. Four colors: one per base (A, T, C, G). Readout: chromatogram (peaks indicate base identity at each position). Read length: 700-1000 bases per reaction (high quality).
Applications Today
Validation: confirm variants found by NGS. Single-gene testing: specific mutations in known genes. Plasmid verification: confirm cloned sequences. Small projects: < 100 sequences. Cost: ~$5-15 per reaction. Limitation: one sequence at a time (low throughput).
Limitations
Throughput: ~96 samples per run (one capillary per sample). Cost per base: ~$0.01-0.001 (expensive for genomes). Read length: limited to ~1000 bp. Speed: 2-3 hours per run. Not suitable: large-scale projects (genomes, transcriptomes).
Next-Generation Sequencing Overview
Massively Parallel Sequencing
Concept: millions to billions of DNA fragments sequenced simultaneously. Throughput: 100 Gb to 6 Tb per run (vs. ~100 Kb for Sanger). Cost: $0.01-1.00 per million bases (1000x cheaper than Sanger). Speed: 1-14 days per run (depends on platform). Revolution: enabled genome-scale studies to become routine.
General Workflow
1. Sample preparation: extract DNA/RNA2. Library preparation: fragment, add adapters3. Amplification: clonal amplification on surface/beads4. Sequencing: iterative base detection5. Data analysis: base calling → alignment → variant calling6. Interpretation: biological/clinical significancePlatform Comparison
| Platform | Read Length | Output/Run | Error Rate | Cost/Gb |
|---|---|---|---|---|
| Illumina NovaSeq X | 2×150 bp | 6 Tb | 0.1% | ~$2 |
| PacBio Revio | 15-25 kb | 90 Gb (HiFi) | 0.1% (HiFi) | ~$15 |
| ONT PromethION | N50 >30 kb | 290 Gb | ~1-5% | ~$10 |
| Element AVITI | 2×150 bp | 1 Tb | 0.1% | ~$4 |
Illumina Sequencing
Sequencing by Synthesis (SBS)
Principle: detect fluorescent label as each nucleotide is incorporated. Bridge amplification: create clonal clusters on flow cell surface. Cycle: add all 4 labeled reversible terminators → image → cleave terminator → repeat. Read length: 2 × 150 bp (paired-end, most common), up to 2 × 300 bp. Quality: highest accuracy of any NGS platform (>99.9% per base).
Platforms
MiSeq: benchtop, 15 Gb, targeted sequencing. NextSeq 2000: mid-range, 360 Gb. NovaSeq X Plus: production-scale, 16 Tb, $200 genome. Market share: ~80% of all sequencing worldwide (dominant). Ecosystem: extensive software, reagent, and analysis tools.
Paired-End Sequencing
Concept: sequence both ends of each DNA fragment. Insert size: 300-500 bp (distance between reads). Advantage: improved alignment, detect structural variants, span repetitive regions. Application: whole genome, RNA-seq, ChIP-seq. Standard: most Illumina applications use paired-end.
Index/Barcode Multiplexing
Concept: add unique barcode to each sample during library prep. Pooling: combine multiple samples in single run. Demultiplexing: sort reads by barcode after sequencing. Capacity: 96-384+ samples per run. Benefit: share run cost across many samples (reduce per-sample cost).
Ion Torrent Sequencing
Semiconductor Sequencing
Principle: detect H+ ion released during nucleotide incorporation. No optics: purely electronic detection (simpler, cheaper instrument). Process: flood with one nucleotide type → measure pH change at each well. Speed: 2-4 hours per run (fastest NGS platform). Application: targeted panels, microbial genomics, clinical diagnostics.
Platforms
Ion GeneStudio S5: benchtop clinical sequencer. Ion Torrent Genexus: sample-to-report in one day. Read length: 200-600 bp. Output: 1-50 Gb per run. Strength: fast turnaround, simple workflow. Weakness: homopolymer errors (insertions/deletions in runs of same base).
Long-Read Sequencing
PacBio (SMRT Sequencing)
Principle: single molecule real-time sequencing. ZMW (zero-mode waveguide): tiny well containing single polymerase. Detection: fluorescent nucleotide incorporation observed in real-time. Read length: 15-25 kb average (HiFi mode), up to >100 kb (CLR mode). HiFi: circular consensus (multiple passes → 99.9% accuracy). Application: genome assembly, structural variants, phasing, methylation.
Advantages of Long Reads
Repetitive regions: span repeats that short reads cannot (telomeres, centromeres). Structural variants: detect insertions, deletions, inversions, translocations. Phasing: determine which variants are on same chromosome. Methylation: detect DNA modification directly (no bisulfite conversion). Assembly: fewer, larger contigs (often chromosome-scale).
T2T (Telomere-to-Telomere) Assembly
Achievement: first complete human genome assembly (2022, T2T Consortium). Technology: PacBio HiFi + Oxford Nanopore ultra-long reads. Coverage: 100% of genome (previous assemblies missed ~8%). New sequence: 200 Mb of previously unresolved regions. Significance: complete reference genome for first time.
Oxford Nanopore
Nanopore Sequencing Principle
Concept: single DNA strand passes through protein nanopore in membrane. Detection: ionic current changes as each base transits pore. No amplification: sequence native DNA directly (single molecule). Read length: no theoretical limit (record >4 Mb single read). Speed: ~400 bases/second per pore. Portability: MinION (USB-powered, pocket-sized).
Platforms
MinION: portable (512 pores), $1,000 device. GridION: benchtop (5 flow cells simultaneously). PromethION: production (48-96 flow cells, 290 Gb per flow cell). Flongle: disposable (rapid, cheap testing). Application: field sequencing, real-time pathogen identification, structural variants.
Unique Capabilities
Direct RNA sequencing: sequence RNA without cDNA conversion. Modified bases: detect methylation directly (5mC, 6mA). Real-time: data available during sequencing (adaptive sampling). Portable: sequenced DNA in space (ISS), Antarctic, Ebola outbreak field sites. Speed: pathogen identified in minutes from raw sample.
Accuracy
Raw: ~95-99% per read (improving rapidly with new chemistry). Consensus: >99.9% with sufficient coverage (multiple reads). Duplex: sequence both strands → >99% per-read accuracy. Improvement: new basecallers (AI-based) continuously improving quality. Challenge: still lower per-read accuracy than Illumina/PacBio HiFi.
Library Preparation
Fragmentation
Mechanical: sonication (Covaris), nebulization. Enzymatic: tagmentation (Tn5 transposase, Illumina Nextera). Size selection: gel, bead-based (SPRI beads). Target: 300-500 bp for Illumina, longer for long-read. Quality: uniform size distribution important for even coverage.
Adapter Ligation
Adapters: platform-specific oligonucleotides ligated to fragment ends. Function: enable amplification, sequencing initiation, surface binding. Index/barcode: unique sequence for sample identification (multiplexing). End repair: blunt-end fragments before ligation. A-tailing: add adenine overhang for T-adapter ligation.
Target Enrichment
Whole genome: no enrichment (sequence everything). Exome: capture protein-coding regions (~1.5% of genome, 50 Mb). Panel: targeted capture of specific genes (10-500 genes). Amplicon: PCR-based targeting of specific regions. Hybrid capture: biotinylated probes hybridize to targets → streptavidin pulldown. Application: clinical panels (cancer genes), exome for rare disease.
Quality Control
Fragment size: Bioanalyzer or TapeStation (electropherogram). Quantity: fluorometric (Qubit) or qPCR. Library complexity: unique molecules (avoid PCR duplicates). Sequencing QC: PhiX spike-in control, quality scores, cluster density.
Bioinformatics Analysis
Primary Analysis
Base calling: convert raw signal to nucleotide sequence + quality scores. FASTQ: standard format (sequence + quality per base). Quality scores: Phred scale (Q30 = 1 error per 1000 bases). Filtering: remove low-quality reads, adapter sequences. Tools: bcl2fastq (Illumina), Guppy (Nanopore), SMRT Link (PacBio).
Secondary Analysis
Alignment: map reads to reference genome. Tools: BWA-MEM2 (short reads), minimap2 (long reads). SAM/BAM format: aligned read storage. Variant calling: identify differences from reference. SNVs: GATK HaplotypeCaller, DeepVariant. Structural variants: DELLY, SVIM, Sniffles. Output: VCF (Variant Call Format).
De Novo Assembly
No reference: assemble reads into contigs from scratch. Short-read: SPAdes, MEGAHIT. Long-read: hifiasm, Flye, Canu. Hybrid: combine short + long reads. Quality: N50 (half of assembly in contigs this size or longer). Application: new species, novel genomes, metagenomics.
Tertiary Analysis
Annotation: identify variant consequences (missense, nonsense, splice). Databases: ClinVar, gnomAD, COSMIC (clinical significance). Filtering: remove common variants, prioritize damaging. Classification: ACMG guidelines (pathogenic, likely pathogenic, VUS, likely benign, benign). Reporting: clinical report for physician/patient.
Whole Genome Sequencing
Coverage
Definition: average number of reads covering each base. Clinical WGS: 30-40x coverage (standard). Research: 15-30x (adequate for common variant calling). Deep sequencing: 100x+ (detect rare variants, somatic mutations). Calculation: (read count × read length) / genome size.
Applications
Rare disease diagnosis: find causative mutations. Cancer genomics: somatic mutation profiling. Population genetics: large-scale cohort studies (UK Biobank, All of Us). Pharmacogenomics: comprehensive drug response prediction. Prenatal: non-invasive prenatal testing (cell-free fetal DNA). Pathogen: whole genome epidemiology (outbreak tracking).
$100 Genome
Milestone: Illumina NovaSeq X (2022) approaches $200 per genome. Projection: $100 genome by 2025. Impact: enables population-scale sequencing. Challenge: analysis and interpretation cost now exceeds sequencing cost. Storage: each genome = ~100 GB raw data (storage infrastructure needed).
Clinical Genomics
Rare Disease Diagnosis
Diagnostic odyssey: average 5-7 years to diagnosis. WGS/WES: diagnoses ~25-40% of previously undiagnosed patients. Trio analysis: patient + both parents (identify de novo and recessive variants). Impact: ends diagnostic odyssey, enables targeted treatment. Example: 100,000 Genomes Project (UK), Undiagnosed Diseases Program (NIH).
Cancer Genomics
Tumor profiling: identify actionable mutations for targeted therapy. Panels: 50-500 genes (Foundation Medicine, Tempus). Liquid biopsy: detect circulating tumor DNA (minimal invasion). Treatment selection: match mutations to approved drugs. Monitoring: track tumor evolution during treatment.
Pharmacogenomics
Drug metabolism: CYP2D6, CYP2C19 variants affect drug processing. Dosing: warfarin, clopidogrel, codeine dosing guided by genotype. Adverse reactions: HLA-B*5701 → abacavir hypersensitivity. Implementation: pre-emptive genotyping programs (St. Jude, Vanderbilt). Impact: reduce adverse drug reactions, optimize efficacy.
Newborn Screening
Current: heel-prick biochemical tests (~30-50 conditions). Future: genome sequencing at birth (screen thousands of conditions). Pilot: BabySeq, Genomics England Newborn Screening. Ethical: incidental findings, consent, data storage. Potential: earliest possible diagnosis for treatable conditions.
Future Directions
Single-Cell Sequencing
Concept: sequence genome/transcriptome of individual cells. Technology: 10x Genomics Chromium (droplet-based). Application: tumor heterogeneity, immune cell profiling, developmental biology. Scale: 10,000-1,000,000 cells per experiment. Impact: revealed cellular diversity invisible to bulk sequencing.
Spatial Transcriptomics
Concept: measure gene expression while preserving tissue location. Technology: Visium (10x Genomics), MERFISH, Slide-seq. Resolution: approaching single-cell in situ. Application: tumor microenvironment, brain mapping, developmental biology. Impact: add spatial context to genomic data.
Epigenomic Sequencing
DNA methylation: bisulfite sequencing, nanopore direct detection. Chromatin accessibility: ATAC-seq. Histone modifications: ChIP-seq. Single-cell epigenomics: emerging (scATAC-seq). Application: cancer epigenetics, aging, development.
Population Genomics
Biobanks: UK Biobank (500,000), All of Us (1,000,000), Our Future Health (5,000,000). GWAS: genome-wide association studies at scale. Polygenic risk scores: predict disease risk from thousands of variants. Diversity: expanding beyond European populations (critical for equity). Impact: transform disease risk prediction and prevention.
References
- Sanger, F., Nicklen, S., and Coulson, A. R. "DNA Sequencing with Chain-Terminating Inhibitors." Proceedings of the National Academy of Sciences, vol. 74, no. 12, 1977, pp. 5463-5467.
- Goodwin, S., McPherson, J. D., and McCombie, W. R. "Coming of Age: Ten Years of Next-Generation Sequencing Technologies." Nature Reviews Genetics, vol. 17, 2016, pp. 333-351.
- Nurk, S., Koren, S., Rhie, A., et al. "The Complete Sequence of a Human Genome." Science, vol. 376, no. 6588, 2022, pp. 44-53.
- Jain, M., Koren, S., Miga, K. H., et al. "Nanopore Sequencing and Assembly of a Human Genome with Ultra-Long Reads." Nature Biotechnology, vol. 36, 2018, pp. 338-345.
- Shendure, J., Balasubramanian, S., Church, G. M., et al. "DNA Sequencing at 40." Nature, vol. 550, 2017, pp. 345-353.