Where does 1 − (1 − p)ⁿ come from?

An A-level walk-through of the geometric CDF — the formula behind the expected-knockout curve.

1. The question

Imagine throwing darts at a wall. Somewhere on the wall there is a small target. Each throw lands at a completely random spot on the wall.

If you throw n darts, what is the probability that at least one hits the target?

This is the same question as: if random insertions are scattered along a DNA segment, what is the probability a given gene gets hit at least once?

A caveat on the analogy. Real darts aren't thrown randomly — even a poor player aims at the board, so their throws cluster around an intended point rather than spreading evenly across the wall. The maths below only works for genuinely uniform-random landings; a truer mental picture is raindrops falling on a roof, or buckshot fired blindfolded. The same idealisation reappears in the biology: most insertional mutagens (Sleeping Beauty, many retroviruses, Agrobacterium T-DNA) have target-site preferences, so "uniform random insertion" is a working approximation rather than a literal description.

2. One throw

Let the target cover a fraction p of the wall's area. Then for a single random throw:

P(hit) = p
P(miss) = 1 − p

[Figure: the target inside the wall (whole area). Fraction of wall covered by the target = p.]

3. Throwing n darts — the independence trick

Each throw is independent of the others, so the probability that all n darts miss is the product of n copies of the miss-probability:

P(all n miss) = (1 − p) × (1 − p) × … × (1 − p)  =  (1 − p)ⁿ

Now use the complement rule: "at least one hit" is the opposite of "every single one missed", so

P(at least one hit in n throws) = 1 − (1 − p)ⁿ

That's it — the whole formula. It's the CDF of the geometric distribution, which asks: "by throw number n, have we had our first success?"
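The formula can be checked against a direct simulation. The sketch below is illustrative only: the function names, the choice p = 0.1 and n = 10, and the trial count are assumptions, not part of the original page. Each throw is modelled as a uniform draw on [0, 1) that counts as a hit when it falls below p.

```python
import random


def p_at_least_one_hit(p: float, n: int) -> float:
    """Closed form from the derivation: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n


def simulate(p: float, n: int, trials: int = 20_000, seed: int = 0) -> float:
    """Estimate the same probability by repeating the n-throw experiment."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        # A throw hits when a uniform draw on [0, 1) lands below p.
        if any(rng.random() < p for _ in range(n)):
            successes += 1
    return successes / trials


# Hypothetical example values: target covers 10% of the wall, 10 throws.
exact = p_at_least_one_hit(0.1, 10)
estimate = simulate(0.1, 10)
print(f"exact = {exact:.4f}, simulated = {estimate:.4f}")
```

The two numbers should agree to within sampling noise; increasing `trials` tightens the agreement.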

4. Play with it

[Interactive figure: darts landing randomly. Red = hit, grey = miss.]
[Interactive plot: P(at least one hit) as n increases. Dot = current n.]

5. A worked example

Suppose a gene covers 2% of a DNA segment, so p = 0.02. Then after n random insertions:

n (insertions) | (1 − p)ⁿ | P(gene knocked out)

Notice how quickly the probability climbs — but it only approaches 1, it never quite reaches it. There is always a tiny chance a gene survives, even after hundreds of insertions.
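The numbers for this example can be generated with a few lines of Python. This is a sketch: the values of n chosen below and the helper names are assumptions made for illustration, and only p = 0.02 comes from the example. The second helper rearranges 1 − (1 − p)ⁿ ≥ target to find the smallest n that reaches a given knockout probability.

```python
import math

p = 0.02  # the gene covers 2% of the segment, as in the example


def knockout_probability(p: float, n: int) -> float:
    """P(gene hit at least once in n insertions) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n


# Illustrative values of n (an assumption, chosen to show the curve's shape).
rows = [(n, (1.0 - p) ** n, knockout_probability(p, n))
        for n in (1, 10, 50, 100, 200, 500)]

print("    n  (1-p)^n  P(KO)")
for n, miss, ko in rows:
    print(f"{n:>5}  {miss:.4f}   {ko:.4f}")


def insertions_needed(p: float, target: float) -> int:
    """Smallest n with 1 - (1 - p)^n >= target, from n >= ln(1 - target) / ln(1 - p)."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


n95 = insertions_needed(p, 0.95)
print(f"insertions needed for a 95% knockout chance: {n95}")
```

With p = 0.02 the 95% threshold comes out at 149 insertions, and even at n = 500 the miss probability is small but not zero.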

6. From one gene to thirty

The simulation has 30 genes with different sizes s₁, s₂, …, s₃₀, on DNA of total length L. For gene i, its "target fraction" is

pᵢ = sᵢ / L

so the probability that gene i has been knocked out by insertion number n is

P(gene i KO) = 1 − (1 − sᵢ/L)ⁿ

The expected number of knocked-out genes is just the sum of these probabilities (by linearity of expectation — a result that works even though the events aren't fully independent):

E[K(n)] = Σᵢ₌₁³⁰ [ 1 − (1 − sᵢ/L)ⁿ ]

That sum is exactly what the green dashed line on the simulation plots. Each gene contributes its own geometric CDF; because smaller genes have smaller pᵢ, they take longer to saturate, which is why the last genes to be knocked out are almost always the tiny ones.
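As a rough sketch, the sum can be computed directly and compared with a simulated average. Everything concrete here is hypothetical: the 30 gene sizes, the total length, the layout of genes end to end from position 0, and the function names are assumptions for illustration, not the values used by the simulation on the page.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def expected_knockouts(sizes, total_length, n):
    """Sum of per-gene geometric CDFs: sum_i [1 - (1 - s_i / L)^n]."""
    return sum(1.0 - (1.0 - s / total_length) ** n for s in sizes)


def simulate_knockouts(sizes, total_length, n, trials=2000, seed=0):
    """Average number of distinct genes hit by n uniform insertions."""
    rng = random.Random(seed)
    # Assumed layout: genes sit end to end from position 0, the rest is non-genic.
    ends = list(accumulate(sizes))
    total = 0
    for _ in range(trials):
        hit = set()
        for _ in range(n):
            x = rng.uniform(0.0, total_length)
            idx = bisect_right(ends, x)  # index of the gene containing x, if any
            if idx < len(sizes):
                hit.add(idx)
        total += len(hit)
    return total / trials


# Hypothetical inputs: 30 genes of increasing size on a segment of length 100,000.
sizes = [100 + 50 * i for i in range(30)]
total_length = 100_000
n = 100

expected = expected_knockouts(sizes, total_length, n)
simulated = simulate_knockouts(sizes, total_length, n)
print(f"expected = {expected:.3f}, simulated mean = {simulated:.3f}")
```

The simulated mean should track the formula closely even though a single insertion can hit at most one gene, which is the dependence that linearity of expectation lets us ignore.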

7. Sanity checks
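
A few limiting cases that the formula should satisfy can be written as assertions. The particular checks below are an assumption chosen for illustration, and `p_hit` is a hypothetical helper name.

```python
import math


def p_hit(p: float, n: int) -> float:
    """P(at least one hit in n throws) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n


# No throws: there cannot be a hit.
assert p_hit(0.3, 0) == 0.0
# One throw: the formula reduces to p.
assert math.isclose(p_hit(0.3, 1), 0.3)
# Target of zero area: never hit, however many throws.
assert p_hit(0.0, 50) == 0.0
# Target covering the whole wall: the first throw always hits.
assert p_hit(1.0, 1) == 1.0
# More throws never lower the probability, and it stays below 1 for p < 1.
values = [p_hit(0.02, n) for n in range(0, 501)]
assert all(a <= b for a, b in zip(values, values[1:]))
assert values[-1] < 1.0
# For very small p the result is close to n * p.
assert math.isclose(p_hit(1e-6, 10), 10 * 1e-6, rel_tol=1e-4)

print("all sanity checks passed")
```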
