1. The question
Imagine throwing darts at a wall. Somewhere on the wall there is a small target. Each throw lands at a completely random spot on the wall.
If you throw n darts, what is the probability that at least one hits the target?
This is the same question as: if random insertions are scattered along a DNA segment, what is the probability a given gene gets hit at least once?
A caveat on the analogy. Real darts aren't thrown randomly — even a poor player aims at the board, so their throws cluster around an intended point rather than spreading evenly across the wall. The maths below only works for genuinely uniform-random landings; a truer mental picture is raindrops falling on a roof, or buckshot fired blindfolded. The same idealisation reappears in the biology: most insertional mutagens (Sleeping Beauty, many retroviruses, Agrobacterium T-DNA) have target-site preferences, so "uniform random insertion" is a working approximation rather than a literal description.
2. One throw
Let the target cover a fraction p of the wall's area. Then for a single random throw:

P(hit) = p,  P(miss) = 1 − p
3. Throwing n darts — the independence trick
Each throw is independent. So the probability that all n darts miss is the miss-probability multiplied by itself n times:

P(all n miss) = (1 − p) × (1 − p) × … × (1 − p) = (1 − p)ⁿ

Now use the complement rule: "at least one hit" is the opposite of "every single one missed", so

P(at least one hit) = 1 − (1 − p)ⁿ
That's it — the whole formula. It's the CDF of the geometric distribution, which asks: "by throw number n, have we had our first success?"
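The formula can be checked against a brute-force simulation. The sketch below is illustrative only: the function names, the example values p = 0.1 and n = 10, and the trial count are assumptions, not part of the original page.

```python
import random

def p_at_least_one_hit(p: float, n: int) -> float:
    """Closed form from the complement rule: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n

def simulate(p: float, n: int, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the same probability by playing `trials` rounds of n throws."""
    rng = random.Random(seed)
    rounds_with_hit = 0
    for _ in range(trials):
        # A throw hits when a uniform draw in [0, 1) lands below p.
        if any(rng.random() < p for _ in range(n)):
            rounds_with_hit += 1
    return rounds_with_hit / trials

if __name__ == "__main__":
    p, n = 0.1, 10  # hypothetical example values
    print(f"formula:    {p_at_least_one_hit(p, n):.4f}")
    print(f"simulation: {simulate(p, n):.4f}")
```

The two numbers should agree to within sampling noise, and the gap should shrink as `trials` grows.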
4. Play with it
Move the sliders and watch the probability change.
5. A worked example
Suppose a gene covers 2% of a DNA segment, so p = 0.02. Then after n random insertions:
| n (insertions) | (1 − p)ⁿ | P(gene knocked out) |
|---|---|---|
Notice how quickly the probability climbs — but it only approaches 1, it never quite reaches it. There is always a tiny chance a gene survives, even after hundreds of insertions.
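A short script can tabulate the worked example for p = 0.02. The values of n chosen here and the helper `insertions_needed` are assumptions for illustration; the helper simply solves 1 − (1 − p)ⁿ ≥ target for n by taking logarithms.

```python
import math

def knockout_probability(p: float, n: int) -> float:
    """Probability that a gene covering fraction p is hit at least once in n insertions."""
    return 1.0 - (1.0 - p) ** n

def insertions_needed(p: float, target: float) -> int:
    """Smallest n with 1 - (1 - p)^n >= target, for 0 < p < 1 and 0 < target < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

if __name__ == "__main__":
    p = 0.02
    print("   n  (1-p)^n  P(knocked out)")
    for n in (1, 10, 50, 100, 200, 500):  # illustrative choices of n
        miss = (1.0 - p) ** n
        print(f"{n:>4}  {miss:7.4f}  {knockout_probability(p, n):14.4f}")
    # How many insertions before the gene is knocked out with 95% probability?
    print("insertions for 95%:", insertions_needed(p, 0.95))
```

The logarithm form also shows why the curve never reaches 1: a target of exactly 1 would require log(0), which is undefined.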
6. From one gene to thirty
The simulation has 30 genes with different sizes s₁, s₂, …, s₃₀, on DNA of total length L. For gene i, its "target fraction" is

pᵢ = sᵢ / L

so the probability that gene i has been knocked out by insertion number n is

P(gene i knocked out) = 1 − (1 − pᵢ)ⁿ

The expected number of knocked-out genes is just the sum of these probabilities (by linearity of expectation, which holds even though the knockout events for different genes aren't independent):

E[number of genes knocked out] = Σᵢ [1 − (1 − pᵢ)ⁿ], summed over i = 1, …, 30
That sum is exactly what the green dashed line on the simulation plots. Each gene contributes its own geometric CDF; because smaller genes have smaller pᵢ, they take longer to saturate, which is why the last genes to be knocked out are almost always the tiny ones.
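The expected-count sum can be compared with a direct simulation. This is a sketch under stated assumptions: the 30 gene sizes, the total length, and the layout (genes placed end to end from position 0, with the rest of the DNA intergenic) are all hypothetical and are not taken from the page's own simulation.

```python
import bisect
import random

def expected_knocked_out(sizes, total_length, n):
    """Sum over genes of 1 - (1 - s_i / L)^n (linearity of expectation)."""
    return sum(1.0 - (1.0 - s / total_length) ** n for s in sizes)

def simulate_knocked_out(sizes, total_length, n, trials=2000, seed=0):
    """Average number of distinct genes hit by n uniform insertions on [0, L)."""
    rng = random.Random(seed)
    # Assumed layout: gene i occupies [ends[i-1], ends[i]); the remainder is intergenic.
    ends, position = [], 0
    for s in sizes:
        position += s
        ends.append(position)
    total = 0
    for _ in range(trials):
        hit = set()
        for _ in range(n):
            x = rng.uniform(0, total_length)
            i = bisect.bisect_right(ends, x)  # index of the gene containing x
            if i < len(sizes):                # otherwise the insertion is intergenic
                hit.add(i)
        total += len(hit)
    return total / trials

if __name__ == "__main__":
    sizes = [100 + 50 * i for i in range(30)]  # hypothetical gene sizes
    L = 100_000                                # hypothetical total length
    for n in (50, 200, 1000):
        print(n, round(expected_knocked_out(sizes, L, n), 2),
              round(simulate_knocked_out(sizes, L, n, trials=500), 2))
```

Hits on different genes are not independent here, since one insertion can land in at most one gene, yet the simulated average still tracks the sum.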
7. Sanity checks
- n = 0: formula gives 1 − (1 − p)⁰ = 1 − 1 = 0. ✓ No throws, no hits.
- p = 0: formula gives 1 − 1ⁿ = 0. ✓ Zero-size target, impossible to hit.
- p = 1: formula gives 1 − 0 = 1 for any n ≥ 1. ✓ Target covers the whole wall.
- n → ∞: (1 − p)ⁿ → 0 whenever p > 0, so the probability → 1. ✓ Given enough throws, a hit becomes all but certain.
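The same four checks can be written as executable assertions. The function name `p_hit` and the specific test values are assumptions chosen for illustration.

```python
def p_hit(p: float, n: int) -> float:
    """Probability of at least one hit in n throws at a target of fraction p."""
    return 1.0 - (1.0 - p) ** n

# n = 0: no throws, no hits.
assert p_hit(0.3, 0) == 0.0

# p = 0: a zero-size target cannot be hit.
assert p_hit(0.0, 25) == 0.0

# p = 1: the target covers the whole wall, so any n >= 1 hits.
assert all(p_hit(1.0, n) == 1.0 for n in range(1, 10))

# n -> infinity: the probability does not decrease with n and approaches 1.
values = [p_hit(0.02, n) for n in (10, 100, 1000, 10_000)]
assert values == sorted(values)
assert values[-1] > 0.999999

print("all sanity checks passed")
```

For very large n, floating-point arithmetic rounds the result to exactly 1.0 even though the true value stays below 1.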