Dna Sequencing (Analytical Techniques)

Introduction

DNA sequencing: determining the precise order of nucleotides (A, T, C, G) in a DNA molecule. History: Sanger method (1977) → Human Genome Project (2003, $2.7 billion) → next-generation sequencing (2005) → $100 genome (2022). Speed: human genome now sequenced in hours (vs. 13 years for HGP). Impact: personalized medicine, rare disease diagnosis, cancer genomics, evolutionary biology. Current: >10 million human genomes sequenced worldwide.

"DNA sequencing is the most transformative technology in biology since the microscope. Reading the code of life enables us to understand disease, trace evolution, identify organisms, and ultimately guide personalized treatment." -- Genomics pioneer

Sanger Sequencing

Chain Termination Method

Principle: DNA synthesis using dideoxynucleotides (ddNTPs) that terminate chain. Components: template DNA, primer, DNA polymerase, dNTPs + ddNTPs. Mechanism: ddNTPs lack 3'-OH → cannot form next phosphodiester bond → chain terminates. Result: fragments of every possible length, each ending at specific base.

Modern Capillary Electrophoresis

Separation: fragments migrate through polymer-filled capillary (by size). Detection: laser excites fluorescent ddNTP at end of each fragment. Four colors: one per base (A, T, C, G). Readout: chromatogram (peaks indicate base identity at each position). Read length: 700-1000 bases per reaction (high quality).

Applications Today

Validation: confirm variants found by NGS. Single-gene testing: specific mutations in known genes. Plasmid verification: confirm cloned sequences. Small projects: < 100 sequences. Cost: ~$5-15 per reaction. Limitation: one sequence at a time (low throughput).

Limitations

Throughput: ~96 samples per run (one capillary per sample). Cost per base: ~$0.01-0.001 (expensive for genomes). Read length: limited to ~1000 bp. Speed: 2-3 hours per run. Not suitable: large-scale projects (genomes, transcriptomes).

Next-Generation Sequencing Overview

Massively Parallel Sequencing

Concept: millions to billions of DNA fragments sequenced simultaneously. Throughput: 100 Gb to 6 Tb per run (vs. ~100 Kb for Sanger). Cost: $0.01-1.00 per million bases (1000x cheaper than Sanger). Speed: 1-14 days per run (depends on platform). Revolution: enabled genome-scale studies to become routine.

General Workflow

1. Sample preparation: extract DNA/RNA2. Library preparation: fragment, add adapters3. Amplification: clonal amplification on surface/beads4. Sequencing: iterative base detection5. Data analysis: base calling → alignment → variant calling6. Interpretation: biological/clinical significance

Platform Comparison

Platform	Read Length	Output/Run	Error Rate	Cost/Gb
Illumina NovaSeq X	2×150 bp	6 Tb	0.1%	~$2
PacBio Revio	15-25 kb	90 Gb (HiFi)	0.1% (HiFi)	~$15
ONT PromethION	N50 >30 kb	290 Gb	~1-5%	~$10
Element AVITI	2×150 bp	1 Tb	0.1%	~$4

Illumina Sequencing

Sequencing by Synthesis (SBS)

Principle: detect fluorescent label as each nucleotide is incorporated. Bridge amplification: create clonal clusters on flow cell surface. Cycle: add all 4 labeled reversible terminators → image → cleave terminator → repeat. Read length: 2 × 150 bp (paired-end, most common), up to 2 × 300 bp. Quality: highest accuracy of any NGS platform (>99.9% per base).

Platforms

MiSeq: benchtop, 15 Gb, targeted sequencing. NextSeq 2000: mid-range, 360 Gb. NovaSeq X Plus: production-scale, 16 Tb, $200 genome. Market share: ~80% of all sequencing worldwide (dominant). Ecosystem: extensive software, reagent, and analysis tools.

Paired-End Sequencing

Concept: sequence both ends of each DNA fragment. Insert size: 300-500 bp (distance between reads). Advantage: improved alignment, detect structural variants, span repetitive regions. Application: whole genome, RNA-seq, ChIP-seq. Standard: most Illumina applications use paired-end.

Index/Barcode Multiplexing

Concept: add unique barcode to each sample during library prep. Pooling: combine multiple samples in single run. Demultiplexing: sort reads by barcode after sequencing. Capacity: 96-384+ samples per run. Benefit: share run cost across many samples (reduce per-sample cost).

Ion Torrent Sequencing

Semiconductor Sequencing

Principle: detect H+ ion released during nucleotide incorporation. No optics: purely electronic detection (simpler, cheaper instrument). Process: flood with one nucleotide type → measure pH change at each well. Speed: 2-4 hours per run (fastest NGS platform). Application: targeted panels, microbial genomics, clinical diagnostics.

Platforms

Ion GeneStudio S5: benchtop clinical sequencer. Ion Torrent Genexus: sample-to-report in one day. Read length: 200-600 bp. Output: 1-50 Gb per run. Strength: fast turnaround, simple workflow. Weakness: homopolymer errors (insertions/deletions in runs of same base).

Long-Read Sequencing

PacBio (SMRT Sequencing)

Principle: single molecule real-time sequencing. ZMW (zero-mode waveguide): tiny well containing single polymerase. Detection: fluorescent nucleotide incorporation observed in real-time. Read length: 15-25 kb average (HiFi mode), up to >100 kb (CLR mode). HiFi: circular consensus (multiple passes → 99.9% accuracy). Application: genome assembly, structural variants, phasing, methylation.

Advantages of Long Reads

Repetitive regions: span repeats that short reads cannot (telomeres, centromeres). Structural variants: detect insertions, deletions, inversions, translocations. Phasing: determine which variants are on same chromosome. Methylation: detect DNA modification directly (no bisulfite conversion). Assembly: fewer, larger contigs (often chromosome-scale).

T2T (Telomere-to-Telomere) Assembly

Achievement: first complete human genome assembly (2022, T2T Consortium). Technology: PacBio HiFi + Oxford Nanopore ultra-long reads. Coverage: 100% of genome (previous assemblies missed ~8%). New sequence: 200 Mb of previously unresolved regions. Significance: complete reference genome for first time.

Oxford Nanopore

Nanopore Sequencing Principle

Concept: single DNA strand passes through protein nanopore in membrane. Detection: ionic current changes as each base transits pore. No amplification: sequence native DNA directly (single molecule). Read length: no theoretical limit (record >4 Mb single read). Speed: ~400 bases/second per pore. Portability: MinION (USB-powered, pocket-sized).

Platforms

MinION: portable (512 pores), $1,000 device. GridION: benchtop (5 flow cells simultaneously). PromethION: production (48-96 flow cells, 290 Gb per flow cell). Flongle: disposable (rapid, cheap testing). Application: field sequencing, real-time pathogen identification, structural variants.

Unique Capabilities

Direct RNA sequencing: sequence RNA without cDNA conversion. Modified bases: detect methylation directly (5mC, 6mA). Real-time: data available during sequencing (adaptive sampling). Portable: sequenced DNA in space (ISS), Antarctic, Ebola outbreak field sites. Speed: pathogen identified in minutes from raw sample.

Accuracy

Raw: ~95-99% per read (improving rapidly with new chemistry). Consensus: >99.9% with sufficient coverage (multiple reads). Duplex: sequence both strands → >99% per-read accuracy. Improvement: new basecallers (AI-based) continuously improving quality. Challenge: still lower per-read accuracy than Illumina/PacBio HiFi.

Library Preparation

Fragmentation

Mechanical: sonication (Covaris), nebulization. Enzymatic: tagmentation (Tn5 transposase, Illumina Nextera). Size selection: gel, bead-based (SPRI beads). Target: 300-500 bp for Illumina, longer for long-read. Quality: uniform size distribution important for even coverage.

Adapter Ligation

Adapters: platform-specific oligonucleotides ligated to fragment ends. Function: enable amplification, sequencing initiation, surface binding. Index/barcode: unique sequence for sample identification (multiplexing). End repair: blunt-end fragments before ligation. A-tailing: add adenine overhang for T-adapter ligation.

Target Enrichment

Whole genome: no enrichment (sequence everything). Exome: capture protein-coding regions (~1.5% of genome, 50 Mb). Panel: targeted capture of specific genes (10-500 genes). Amplicon: PCR-based targeting of specific regions. Hybrid capture: biotinylated probes hybridize to targets → streptavidin pulldown. Application: clinical panels (cancer genes), exome for rare disease.

Quality Control

Fragment size: Bioanalyzer or TapeStation (electropherogram). Quantity: fluorometric (Qubit) or qPCR. Library complexity: unique molecules (avoid PCR duplicates). Sequencing QC: PhiX spike-in control, quality scores, cluster density.

Bioinformatics Analysis

Primary Analysis

Base calling: convert raw signal to nucleotide sequence + quality scores. FASTQ: standard format (sequence + quality per base). Quality scores: Phred scale (Q30 = 1 error per 1000 bases). Filtering: remove low-quality reads, adapter sequences. Tools: bcl2fastq (Illumina), Guppy (Nanopore), SMRT Link (PacBio).

Secondary Analysis

Alignment: map reads to reference genome. Tools: BWA-MEM2 (short reads), minimap2 (long reads). SAM/BAM format: aligned read storage. Variant calling: identify differences from reference. SNVs: GATK HaplotypeCaller, DeepVariant. Structural variants: DELLY, SVIM, Sniffles. Output: VCF (Variant Call Format).

De Novo Assembly

No reference: assemble reads into contigs from scratch. Short-read: SPAdes, MEGAHIT. Long-read: hifiasm, Flye, Canu. Hybrid: combine short + long reads. Quality: N50 (half of assembly in contigs this size or longer). Application: new species, novel genomes, metagenomics.

Tertiary Analysis

Annotation: identify variant consequences (missense, nonsense, splice). Databases: ClinVar, gnomAD, COSMIC (clinical significance). Filtering: remove common variants, prioritize damaging. Classification: ACMG guidelines (pathogenic, likely pathogenic, VUS, likely benign, benign). Reporting: clinical report for physician/patient.

Whole Genome Sequencing

Coverage

Definition: average number of reads covering each base. Clinical WGS: 30-40x coverage (standard). Research: 15-30x (adequate for common variant calling). Deep sequencing: 100x+ (detect rare variants, somatic mutations). Calculation: (read count × read length) / genome size.

Applications

Rare disease diagnosis: find causative mutations. Cancer genomics: somatic mutation profiling. Population genetics: large-scale cohort studies (UK Biobank, All of Us). Pharmacogenomics: comprehensive drug response prediction. Prenatal: non-invasive prenatal testing (cell-free fetal DNA). Pathogen: whole genome epidemiology (outbreak tracking).

$100 Genome

Milestone: Illumina NovaSeq X (2022) approaches $200 per genome. Projection: $100 genome by 2025. Impact: enables population-scale sequencing. Challenge: analysis and interpretation cost now exceeds sequencing cost. Storage: each genome = ~100 GB raw data (storage infrastructure needed).

Clinical Genomics

Rare Disease Diagnosis

Diagnostic odyssey: average 5-7 years to diagnosis. WGS/WES: diagnoses ~25-40% of previously undiagnosed patients. Trio analysis: patient + both parents (identify de novo and recessive variants). Impact: ends diagnostic odyssey, enables targeted treatment. Example: 100,000 Genomes Project (UK), Undiagnosed Diseases Program (NIH).

Cancer Genomics

Tumor profiling: identify actionable mutations for targeted therapy. Panels: 50-500 genes (Foundation Medicine, Tempus). Liquid biopsy: detect circulating tumor DNA (minimal invasion). Treatment selection: match mutations to approved drugs. Monitoring: track tumor evolution during treatment.

Pharmacogenomics

Drug metabolism: CYP2D6, CYP2C19 variants affect drug processing. Dosing: warfarin, clopidogrel, codeine dosing guided by genotype. Adverse reactions: HLA-B*5701 → abacavir hypersensitivity. Implementation: pre-emptive genotyping programs (St. Jude, Vanderbilt). Impact: reduce adverse drug reactions, optimize efficacy.

Newborn Screening

Current: heel-prick biochemical tests (~30-50 conditions). Future: genome sequencing at birth (screen thousands of conditions). Pilot: BabySeq, Genomics England Newborn Screening. Ethical: incidental findings, consent, data storage. Potential: earliest possible diagnosis for treatable conditions.

Future Directions

Single-Cell Sequencing

Concept: sequence genome/transcriptome of individual cells. Technology: 10x Genomics Chromium (droplet-based). Application: tumor heterogeneity, immune cell profiling, developmental biology. Scale: 10,000-1,000,000 cells per experiment. Impact: revealed cellular diversity invisible to bulk sequencing.

Spatial Transcriptomics

Concept: measure gene expression while preserving tissue location. Technology: Visium (10x Genomics), MERFISH, Slide-seq. Resolution: approaching single-cell in situ. Application: tumor microenvironment, brain mapping, developmental biology. Impact: add spatial context to genomic data.

Epigenomic Sequencing

DNA methylation: bisulfite sequencing, nanopore direct detection. Chromatin accessibility: ATAC-seq. Histone modifications: ChIP-seq. Single-cell epigenomics: emerging (scATAC-seq). Application: cancer epigenetics, aging, development.

Population Genomics

Biobanks: UK Biobank (500,000), All of Us (1,000,000), Our Future Health (5,000,000). GWAS: genome-wide association studies at scale. Polygenic risk scores: predict disease risk from thousands of variants. Diversity: expanding beyond European populations (critical for equity). Impact: transform disease risk prediction and prevention.

References

Sanger, F., Nicklen, S., and Coulson, A. R. "DNA Sequencing with Chain-Terminating Inhibitors." Proceedings of the National Academy of Sciences, vol. 74, no. 12, 1977, pp. 5463-5467.
Goodwin, S., McPherson, J. D., and McCombie, W. R. "Coming of Age: Ten Years of Next-Generation Sequencing Technologies." Nature Reviews Genetics, vol. 17, 2016, pp. 333-351.
Nurk, S., Koren, S., Rhie, A., et al. "The Complete Sequence of a Human Genome." Science, vol. 376, no. 6588, 2022, pp. 44-53.
Jain, M., Koren, S., Miga, K. H., et al. "Nanopore Sequencing and Assembly of a Human Genome with Ultra-Long Reads." Nature Biotechnology, vol. 36, 2018, pp. 338-345.
Shendure, J., Balasubramanian, S., Church, G. M., et al. "DNA Sequencing at 40." Nature, vol. 550, 2017, pp. 345-353.