The biology behind the simulation

What insertional mutagenesis is, why its yield follows a saturation curve, and how gene density spans roughly four orders of magnitude across cellular life.

What is insertional mutagenesis?

Insertional mutagenesis is a forward-genetic technique in which a fragment of foreign DNA (a transposable element such as Tn5, P-element, piggyBac, or Sleeping Beauty; a retroviral provirus; T-DNA delivered by Agrobacterium tumefaciens; or a synthetic gene-trap cassette) is introduced into a host and integrates into its genome at near-random positions. When an insert lands inside a gene's body it usually disrupts the gene: a premature stop codon, a frameshift, disrupted splicing, or a transcript truncated by a trap cassette is enough to abolish or alter the protein product. Crucially, every mutant carries a known DNA tag, so once a phenotype is observed the responsible gene can be identified by sequencing the regions flanking the insertion. This makes insertional screens fundamentally different from chemical mutagenesis (EMS, ENU), where each mutation is typically a single base change that must be mapped by linkage.
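The idea of near-random integration can be sketched as a small Monte Carlo experiment. This is a minimal illustration, not the simulation's actual code: the function name `simulate_insertions`, the toy genome, and the simplification that genes are sorted, non-overlapping half-open intervals are all assumptions made for the example. Real insertion elements also show target-site preferences that a uniform draw ignores.

```python
import bisect
import random


def simulate_insertions(genes, genome_size, n_insertions, seed=0):
    """Return the set of gene indices hit by at least one random insertion.

    genes: list of (start, end) half-open intervals, sorted and non-overlapping
           (a simplifying assumption; real annotations can overlap).
    """
    rng = random.Random(seed)
    starts = [start for start, _ in genes]
    hit = set()
    for _ in range(n_insertions):
        pos = rng.randrange(genome_size)          # uniformly random base
        i = bisect.bisect_right(starts, pos) - 1  # last gene starting at or before pos
        if i >= 0 and pos < genes[i][1]:          # does pos fall inside that gene body?
            hit.add(i)
    return hit


# Hypothetical toy genome: 100 kb with ten 1 kb genes, one every 10 kb.
toy_genes = [(k * 10_000, k * 10_000 + 1_000) for k in range(10)]
hit = simulate_insertions(toy_genes, genome_size=100_000, n_insertions=500)
print(f"{len(hit)} of {len(toy_genes)} genes disrupted")
```

Insertions that fall between genes are simply discarded here, which is why a sparse genome needs many more insertions to disrupt the same share of its genes.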

Why the saturation curve has the shape it does

For one insertion at a uniformly random base, the probability of hitting a specific gene equals that gene's body length s divided by the genome size L: p = s / L. After n independent insertions the gene is still untouched with probability (1 − p)ⁿ, so the probability that it has been knocked out at least once is 1 − (1 − p)ⁿ. Average this over every gene in a class (all protein-coding genes, say) and you get the expected knockout fraction the simulation plots as a dashed line. Two practical consequences fall out: long genes saturate first because they are bigger targets, and gene density sets the absolute number of insertions needed for a useful screen, which is orders of magnitude higher in a sparse vertebrate genome than in a compact bacterial one. The concavity of the curve (it rises quickly, then flattens toward 100%) is also why the marginal yield drops steeply: the first thousand insertions do far more useful work than the millionth.
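The formula above translates directly into a few lines of Python. This is a sketch under the same assumptions as the text (independent, uniformly random insertions). The function names are chosen for illustration, and the closed form in `insertions_for_target` comes from solving 1 − (1 − p)ⁿ ≥ target for n.

```python
import math


def knockout_probability(gene_length, genome_size, n):
    """P(gene hit at least once) = 1 - (1 - p)^n, with p = s / L."""
    p = gene_length / genome_size
    return 1.0 - (1.0 - p) ** n


def expected_knockout_fraction(gene_lengths, genome_size, n):
    """Average the per-gene knockout probability over a class of genes."""
    total = sum(knockout_probability(s, genome_size, n) for s in gene_lengths)
    return total / len(gene_lengths)


def insertions_for_target(gene_length, genome_size, target):
    """Smallest n with 1 - (1 - p)^n >= target: n >= ln(1 - target) / ln(1 - p)."""
    p = gene_length / genome_size
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


# Hypothetical 1 Mb genome: compare a 1 kb gene with a 10 kb gene.
genome = 1_000_000
for n in (1_000, 10_000):
    short = knockout_probability(1_000, genome, n)
    long_ = knockout_probability(10_000, genome, n)
    print(f"n={n}: 1 kb gene {short:.3f}, 10 kb gene {long_:.3f}")
```

Evaluating `expected_knockout_fraction` at evenly spaced values of n shows the diminishing returns: each additional block of insertions adds less to the knockout fraction than the block before it.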

Why density varies so wildly across species

Genome size and gene count are decoupled, an observation known as the C-value paradox. Bacterial genomes are essentially wall-to-wall genes, and the smallest known cellular genomes belong to obligate endosymbionts that have shed everything their host can supply: Carsonella ruddii (160 kb, ~180 genes) and Nasuia deltocephalinicola (112 kb) pack coding sequence into more than 95% of their DNA. At the other extreme, the marbled lungfish (Protopterus aethiopicus, ~130 Gb) and the plant Paris japonica (~149 Gb) have genomes roughly 40 to 50 times the size of a human's (3.10 Gb) without proportionally more genes; the bulk is repetitive elements, transposon graveyards, expanded introns, and other non-coding DNA. The same number of random insertions will saturate Carsonella almost completely while sampling only a tiny fraction of Paris japonica's gene set. The species selector in the simulation lets you watch this play out: E. coli's curves climb to 100% within thousands of insertions, while the human curves still have room to grow at n = 100,000.
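A rough way to compare species is to ask how many insertions it takes to hit a gene of mean length with a given probability. If every gene had the mean body length, the per-insertion hit probability would be p = (genic fraction) / (gene count), because mean length is genic fraction × L / gene count and p = s / L. The sketch below makes that simplifying assumption (real gene lengths are skewed, so short genes lag behind) and uses the rounded genic % and gene counts from the table in this section. The function name and the 95% target are illustrative choices.

```python
import math

# (genic fraction, total genes): rounded values from the table in this section.
SPECIES = {
    "Carsonella ruddii PV": (0.97, 182),
    "Escherichia coli K-12": (0.88, 4_460),
    "Homo sapiens": (0.45, 44_800),
}


def insertions_for_mean_gene(genic_fraction, gene_count, target=0.95):
    """Insertions needed to hit a mean-length gene with probability >= target."""
    p = genic_fraction / gene_count  # per-insertion hit probability, s_mean / L
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


for name, (genic, count) in SPECIES.items():
    print(f"{name}: ~{insertions_for_mean_gene(genic, count):,} insertions")
```

The ordering across species is the point rather than the exact counts, since the inputs are rounded and the uniform-length assumption is crude.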

Gene density across species

Selected reference genomes ordered by genome size, which runs roughly from densest to sparsest. Genes/Mb is total annotated genes (protein-coding + ncRNA) per megabase. Genic % is the fraction of the genome occupied by gene bodies, including introns. Numbers are rounded; sources are noted below.

| Species | Domain / kingdom | Genome size | Total genes | % ncRNA | Genes / Mb | Genic % | Note |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Carsonella ruddii PV | Bacteria | 160 kb | 182 | ~2% | ~1,140 | ~97% | Densest known cellular genome (psyllid endosymbiont) |
| Mycoplasma genitalium G37 | Bacteria | 580 kb | 525 | ~7% | ~905 | ~89% | Smallest free-living bacterium often cited as a minimal cell |
| Escherichia coli K-12 | Bacteria | 4.64 Mb | ~4,460 | ~4% | ~960 | ~88% | Operons packed back-to-back; the canonical model bacterium |
| Saccharomyces cerevisiae | Fungi | 12.1 Mb | ~6,600 | ~13% | ~545 | ~72% | Introns rare; densest commonly studied eukaryote |
| Caenorhabditis elegans | Animalia | 100 Mb | ~24,000 | ~17% | ~240 | ~27% | Compact for a metazoan; small introns |
| Arabidopsis thaliana | Plantae | 135 Mb | ~32,800 | ~16% | ~245 | ~50% | Reference angiosperm; kilobase intergenic gaps |
| Drosophila melanogaster | Animalia | 144 Mb | ~17,500 | ~20% | ~120 | ~30% | Long-running insertional screen target (P-element, piggyBac) |
| Homo sapiens | Animalia | 3.10 Gb | ~44,800 | ~55% | ~14 | ~45% | Protein-coding exons cover only ~1.5%; introns inflate "genic" fraction |
| Protopterus aethiopicus | Animalia | ~130 Gb | ~20–30k | | < 0.3 | < 5% | Marbled lungfish; largest known animal genome |
| Paris japonica | Plantae | ~149 Gb | ~30k (est.) | | < 0.3 | < 5% | Largest known eukaryotic genome (Japanese canopy plant) |
Gene density range: ~3,800× (Carsonella vs Paris)
Genome size range: ~10⁶× (160 kb → 149 Gb)
Coefficient of variation, genic %: ~0.65 across the 10 species (from the rounded table values, counting each "< 5%" entry as 5%)
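These summary figures can be recomputed from the rounded table values. The sketch below is one way to do it, with two labeled assumptions: each "< 5%" genic entry is taken as 5 and the "< 0.3" density is taken as 0.3, so the density ratio is a lower bound. The coefficient of variation is the population standard deviation divided by the mean; using the sample standard deviation (`statistics.stdev`) gives a slightly larger value.

```python
import statistics

# Genic % column, top to bottom; the two "< 5%" entries are taken as 5 (an assumption).
genic_pct = [97, 89, 88, 72, 27, 50, 30, 45, 5, 5]

density_ratio = 1_140 / 0.3   # Carsonella vs Paris, genes per Mb ("< 0.3" taken as 0.3)
size_ratio = 149e9 / 160e3    # Paris vs Carsonella, in base pairs
cv = statistics.pstdev(genic_pct) / statistics.mean(genic_pct)

print(f"gene density range: ~{density_ratio:,.0f}x")
print(f"genome size range: ~{size_ratio:.1e}x")
print(f"coefficient of variation, genic %: ~{cv:.2f}")
```

Because the inputs are rounded and the two sparsest genomes only have upper bounds, the results are order-of-magnitude summaries rather than precise statistics.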

Sources: NCBI RefSeq, Ensembl, Araport11, WormBase, FlyBase, GENCODE 45, and the original genome papers — Carsonella ruddii (Nakabachi et al. 2006), Mycoplasma genitalium (Fraser et al. 1995), P. aethiopicus C-value (Pedersen 1971; Meyer et al. 2021 for related species), Paris japonica C-value (Pellicer et al. 2010). Gene counts for the two extreme low-density species are approximate — annotation is incomplete and most of the genome is repetitive DNA.