What is insertional mutagenesis?
Insertional mutagenesis is a forward-genetic technique in which a fragment of foreign DNA is introduced into a host and integrates into its genome at near-random positions. The fragment can be a transposable element (Tn5, P-element, piggyBac, Sleeping Beauty), a retroviral provirus, T-DNA delivered by Agrobacterium tumefaciens, or a synthetic gene-trap cassette. When an insert lands inside a gene body it usually disrupts the gene: a premature stop codon, a frameshift, disrupted splicing, or capture of the transcript by a trap cassette is enough to abolish or alter the protein product. Crucially, every mutant carries a known DNA tag, so once a phenotype is observed the responsible gene can be identified by sequencing the regions flanking the insertion. This makes insertional screens fundamentally different from chemical mutagenesis (EMS, ENU), where a typical mutation is an untagged single base change that must be mapped by linkage.
Why the saturation curve has the shape it does
For one insertion at a uniformly random base, the probability of hitting a specific gene equals that gene's body length s divided by the genome size L: p = s / L. After n independent insertions the gene is still untouched with probability (1 − p)ⁿ, so the probability that it has been hit at least once is 1 − (1 − p)ⁿ. Average this over every gene in a class (all protein-coding genes, say) and you get the expected knockout fraction the simulation plots as a dashed line. Two practical consequences fall out. Long genes saturate first because they are bigger targets, and gene density sets the absolute number of insertions needed for a useful screen: orders of magnitude more in a sparse vertebrate genome than in a compact bacterial one. The curve is concave, rising steeply and then flattening, which is why the marginal yield drops: each new insertion is increasingly likely to land in a gene that is already disrupted, so the first thousand insertions knock out far more new genes than the thousand that bring the total to one million.
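The formula above can be evaluated directly. The following is a minimal Python sketch, not the simulation's own code: the function names and the three example gene lengths are hypothetical, and it assumes insertions are independent and uniformly distributed over the genome, as the text does.

```python
import math

def knockout_probability(gene_length_bp, genome_size_bp, n_insertions):
    """Probability that a gene is hit at least once: 1 - (1 - p)**n, with p = s / L."""
    p = gene_length_bp / genome_size_bp
    if not 0.0 <= p < 1.0:
        raise ValueError("gene length must be non-negative and smaller than the genome")
    # log1p/expm1 evaluate 1 - (1 - p)**n without losing precision when p is tiny.
    return -math.expm1(n_insertions * math.log1p(-p))

def expected_knockout_fraction(gene_lengths_bp, genome_size_bp, n_insertions):
    """Average the per-gene probability over a class of genes."""
    probs = [knockout_probability(s, genome_size_bp, n_insertions) for s in gene_lengths_bp]
    return sum(probs) / len(probs)

# Hypothetical gene-body lengths (bp) in a 4.64 Mb genome, the E. coli K-12 size.
example_lengths = [500, 1_000, 3_000]
genome_size = 4_640_000
for n in (1_000, 10_000, 100_000):
    frac = expected_knockout_fraction(example_lengths, genome_size, n)
    print(f"n = {n:>7,d}: expected knockout fraction = {frac:.3f}")
```

Printing the fraction at several values of n shows the flattening described above: each tenfold increase in insertions adds less than the one before.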
Why density varies so wildly across species
Genome size and gene count are largely decoupled, an observation known as the C-value paradox. Bacterial genomes are essentially wall-to-wall genes, and the smallest known cellular genomes belong to obligate endosymbionts that have shed everything their host can supply: Carsonella ruddii (160 kb, ~180 genes) and Nasuia deltocephalinicola (112 kb) pack coding sequence into more than 95% of their DNA. At the other extreme, the marbled lungfish (Protopterus aethiopicus, ~130 Gb) and the plant Paris japonica (~149 Gb) have genomes roughly 40 to 50 times the size of a human's (3.1 Gb) without proportionally more genes. The bulk is repetitive elements, transposon graveyards, expanded introns, and other non-coding DNA. The same number of random insertions will saturate Carsonella almost completely while sampling only a tiny fraction of Paris japonica's gene set. The species selector in the simulation lets you watch this play out: E. coli's curves approach 100% within tens of thousands of insertions, while the human curves still have room to grow at n = 100,000.
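To put rough numbers on the contrast, the sketch below estimates the knockout fraction per species from two figures in the species table in this section: the genic fraction and the total gene count. It is a deliberately crude model, not the simulation's method. It assumes every gene has the same length and that genes do not overlap, so the per-insertion hit probability for any one gene is genic fraction divided by gene count, and it uses the approximation (1 − p)ⁿ ≈ exp(−np). The function name is hypothetical, and the "< 5%" entry for Paris japonica is treated as 5%.

```python
import math

# (genic fraction, total genes), rounded values read from the species table.
SPECIES = {
    "Carsonella ruddii": (0.97, 182),
    "Escherichia coli K-12": (0.88, 4_460),
    "Homo sapiens": (0.45, 44_800),
    "Paris japonica": (0.05, 30_000),  # "< 5%" taken as an upper bound
}

def approx_knockout_fraction(genic_fraction, gene_count, n_insertions):
    """Approximate fraction of genes hit at least once, assuming equal-length genes."""
    # Chance that one insertion lands in one particular average-length gene.
    p = genic_fraction / gene_count
    # 1 - exp(-n * p), written with expm1 for accuracy when n * p is small.
    return -math.expm1(-n_insertions * p)

for name, (genic, genes) in SPECIES.items():
    for n in (1_000, 100_000):
        frac = approx_knockout_fraction(genic, genes, n)
        print(f"{name:24s} n = {n:>7,d}  ~{frac:.1%} of genes hit")
```

Because the saturation curve is concave, replacing a mix of gene lengths with their average can only overestimate the mean knockout fraction, so these figures are best read as upper bounds under the uniform-insertion assumption.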
Gene density across species
Selected reference genomes ordered by genome size, smallest to largest, which roughly tracks gene density from densest to sparsest. Genes / Mb is total annotated genes (protein-coding + ncRNA) divided by genome size in megabases. Genic % is the fraction of the genome occupied by gene bodies, including introns. Numbers are rounded; sources are noted below the table.
| Species | Domain | Genome size | Total genes | % ncRNA | Genes / Mb | Genic % | Note |
|---|---|---|---|---|---|---|---|
| Carsonella ruddii PV | Bacteria | 160 kb | 182 | ~2% | ~1,140 | ~97% | Densest known cellular genome (psyllid endosymbiont) |
| Mycoplasma genitalium G37 | Bacteria | 580 kb | 525 | ~7% | ~905 | ~89% | Smallest free-living bacterium; often cited as a minimal cell |
| Escherichia coli K-12 | Bacteria | 4.64 Mb | ~4,460 | ~4% | ~960 | ~88% | Operons packed back-to-back; the canonical model bacterium |
| Saccharomyces cerevisiae | Fungi | 12.1 Mb | ~6,600 | ~13% | ~545 | ~72% | Introns rare; densest commonly studied eukaryote |
| Caenorhabditis elegans | Animalia | 100 Mb | ~24,000 | ~17% | ~240 | ~27% | Compact for a metazoan; small introns |
| Arabidopsis thaliana | Plantae | 135 Mb | ~32,800 | ~16% | ~245 | ~50% | Reference angiosperm; kilobase intergenic gaps |
| Drosophila melanogaster | Animalia | 144 Mb | ~17,500 | ~20% | ~120 | ~30% | Long-running insertional screen target (P-element, piggyBac) |
| Homo sapiens | Animalia | 3.10 Gb | ~44,800 | ~55% | ~14 | ~45% | Protein-coding exons cover only ~1.5%; introns inflate "genic" fraction |
| Protopterus aethiopicus | Animalia | ~130 Gb | ~20–30k | — | < 0.3 | < 5% | Marbled lungfish — largest known animal genome |
| Paris japonica | Plantae | ~149 Gb | ~30k (est.) | — | < 0.3 | < 5% | Largest known eukaryotic genome (Japanese canopy plant) |
Sources: NCBI RefSeq, Ensembl, Araport11, WormBase, FlyBase, GENCODE 45, and the original genome papers — Carsonella ruddii (Nakabachi et al. 2006), Mycoplasma genitalium (Fraser et al. 1995), P. aethiopicus C-value (Pedersen 1971; Meyer et al. 2021 for related species), Paris japonica C-value (Pellicer et al. 2010). Gene counts for the two extreme low-density species are approximate — annotation is incomplete and most of the genome is repetitive DNA.
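The Genes / Mb column follows from two other columns, so it can be recomputed as a consistency check. The sketch below copies the rounded genome sizes and gene counts from the table and divides one by the other; the variable and function names are illustrative, and the two largest genomes are left out because their gene counts are only given as ranges or estimates.

```python
# (species, genome size in Mb, total genes), copied from the table above.
ROWS = [
    ("Carsonella ruddii PV", 0.160, 182),
    ("Mycoplasma genitalium G37", 0.580, 525),
    ("Escherichia coli K-12", 4.64, 4_460),
    ("Saccharomyces cerevisiae", 12.1, 6_600),
    ("Caenorhabditis elegans", 100.0, 24_000),
    ("Arabidopsis thaliana", 135.0, 32_800),
    ("Drosophila melanogaster", 144.0, 17_500),
    ("Homo sapiens", 3_100.0, 44_800),
]

def genes_per_mb(total_genes, genome_size_mb):
    """Gene density: annotated genes divided by genome size in megabases."""
    return total_genes / genome_size_mb

density = {name: genes_per_mb(genes, size_mb) for name, size_mb, genes in ROWS}

# Rank from densest to sparsest and print one line per species.
ranked = sorted(density.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(f"{name:28s} {value:8.1f} genes/Mb")
```

Sorting by the computed density rather than by genome size shows where the two orderings differ: E. coli ranks above M. genitalium, and Arabidopsis edges out C. elegans.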