The birthday paradox
is the observation that, in a set of 23 people
chosen independently of where in the year their birthdays fall (the manner of
choosing may depend on other attributes, such as their age; a class of pupils
in a school is the common example), you have about an even chance that two of
them share their birthday. The paradox here is that one might naïvely expect
to need a group with about half as many members as there are days in the year
to get odds as good as (about) one in two of a collision. After all, to be
certain of having at least two people with the same birthday, you'd need to
select one person more than the number of days in the year.
The error in that naïve expectation is that it isn't analyzing probabilities competently. Later, I'll look at the theory for the general case; but first, let's look at the simplest case, where all days are (presumed to be) equally probable.
A typical arrangement, at least in much of the world, admits each child to the first year of school at the start of a particular academic year precisely if the child completes some specified number of years between a specified date in that year and the corresponding date a year later. For example, if the child's seventh birthday falls between September 1st this year (included) and September 1st next year (excluded), then the child joins this year's new intake of pupils. All pupils move up one class through the system each year, with the result that any class contains pupils who were all born between two dates exactly a year apart; either that year included a February 29th or it didn't, so there may be 365 or 366 days in that year, but either way each day of that year is equally likely. So let's consider a class of children all born in a particular interval of Y days, each of them equally probable; whether Y is 365 or 366 doesn't matter.
So let's go round the class asking for dates of birth and keep track of which ones we've seen. If the first n pupils all have distinct birthdays then the probability that the next pupil's birthday coincides with that of some earlier pupil is n/Y and the probability that it doesn't is 1−n/Y; so the probability that the first 1+n pupils have distinct birthdays is this multiplied by the probability that the first n have distinct birthdays. We can thus infer that the probability that our first n pupils have distinct birthdays is simply product(: 1−i/Y ←i |n); subtracting this from 1, we get the probability that at least two children share a common birthday. So let's define a function,
which gives us collide(Y, n) as the probability that, among n random samples from a uniformly distributed discrete random variate with Y possible outcomes, at least two of the samples gave the same outcome. Brute computation reveals that, to three decimal places, collide(365, 22) = 0.476, collide(365, 23) = 0.507, collide(366, 22) = 0.475, collide(366, 23) = 0.506; so, for Y in {365, 366}, collide(Y, n) first exceeds 1/2 when n is 23. Thus a class of 23 pupils has roughly an even chance of containing at least one pair of pupils with the same birthday; a class of 30 has about a 70% chance.
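Here, for instance, is a minimal sketch of such a function (in Python, say); its name and arguments follow the text, and the loop at the end reproduces the figures just quoted:

    from math import prod

    def collide(Y, n):
        """Probability that, among n samples drawn uniformly from Y equally
        likely outcomes, at least two give the same outcome."""
        return 1 - prod(1 - i / Y for i in range(n))

    # Reproduce the figures quoted above:
    for Y in (365, 366):
        for n in (22, 23, 30):
            print(Y, n, round(collide(Y, n), 3))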
Before moving on to the more complex case, where the possible outcomes aren't all equally likely, let's
consider another example: the SHA1 fingerprint of a text (used, for example,
as the identifier for an object in the git
revision-control
system) is a 160-bit summary
of the data. Assuming SHA1 is actually
ideal (which is a little optimistic), each of the 2^160 possible
fingerprints is equally likely to be the SHA1 value of a randomly chosen
text. So, given a collection of n texts, the probability of two of them
having the same SHA1 fingerprint is collide(Y, n), as for the birthday
problem, only now with a much larger Y = 2^160 – which is
about 10^48, so any realistic git
repository's number
of objects is going to be tiny compared to it.
On my 64-bit computer, 1 − 1/2^n is indistinguishable from 1 for any n > 53; this is quite normal for computers. This means that, if I try to compute 1 − i/2^160 for any i less than about 2^107, my computer shall find the value to be exactly 1. So brute computation isn't going to give us useful answers unless we apply a bit of ingenuity; it'll claim collide(Y, n) is zero even when n is almost 2^107, which (as we'll shortly see) is wildly wrong.
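To illustrate (assuming IEEE double precision, whose 53-bit significand is where the n > 53 threshold comes from):

    from math import prod

    print(1 - 2.0**-53 == 1.0)   # False: 1 - 1/2^53 is still representable
    print(1 - 2.0**-54 == 1.0)   # True: for n > 53, 1 - 1/2^n rounds to 1
    # The brute-force product for collide(2**160, n): every factor rounds to exactly 1.
    print(1 - prod(1 - i / 2**160 for i in range(10**6)))   # 0.0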
So let's start by thinking about a modest-sized n; consider collide(Y, n) for n = 3, say. We have 1 −1.(1−1/Y).(1−2/Y) = 1−( 1 −(1+2)/Y +1×2/Y/Y) = 3/Y −2/Y/Y. Since Y is huge, 2/Y is tiny compared to 3, so this value is effectively just 3/Y. Indeed, for n suitably tiny compared to Y, we can write product(: 1−i/Y ←i :n) as simply 1 −sum(n)/Y plus and minus assorted terms with at least two factors of Y in the denominator. Ignoring these later terms, we estimate collide(Y, n) ≈ sum(n)/Y = n.(n−1)/2/Y. This suggests that we can expect the probability of a collision to reach about 1/2 when n is about √Y.
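For a sense of how rough this gets as n approaches √Y, compare the estimate with the exact value (a quick check, assuming Python):

    from math import prod

    Y = 365
    for n in (5, 10, 23):
        exact = 1 - prod(1 - i / Y for i in range(n))
        print(n, round(exact, 4), round(n * (n - 1) / (2 * Y), 4))
    # The estimate n.(n-1)/2/Y tracks the exact probability closely for small n,
    # but by n = 23 it has drifted well above it.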
Unfortunately, it's not quite that simple. Each of the terms we ignored with two Y factors in the denominator was of form +i.j/Y/Y with i < j in n; so a typical one of these terms is of order n.n/Y/Y/4. That is, indeed, tiny: but there are n.(n−1)/2 choices of i < j in n, so the sum of these terms is of order power(2, n.n/Y)/8. Thus, once n is of order √Y, these ignored terms actually add up to a significant contribution to collide(Y, n). Furthermore, when we do a similar analysis for each of the subsequent orders of terms, we find each to depend on n and Y only via a power of n.n/Y; so all of the terms we've been ignoring actually turn out to matter, once n is of order √Y. That doesn't undermine the general point that we can expect collide(Y, n) to be non-tiny roughly when n is of order √Y; it just means that this approximation is likely to be very rough indeed.
To refine the approximation, we need to quantify the terms of each order in Y. The terms with k factors of Y in their denominator are products of k terms −i/Y for distinct i in n. Let grow(n, k) = {list (n:q:k): natural i < j < k implies q(i) < q(j)} be the collection of strictly increasing lists, of length k, of members of n. Then our sum of terms with k factors of Y can be broken down in terms of the highest entry in each such list as:

    sum(: product(: −q(i)/Y ←i |k) ←q :grow(n, k))
in which we can separate out the k factors of −1/Y, leaving:

    power(k, −1/Y).T(k, n), writing T(k, n) = sum(: product(q) ←q :grow(n, k))
which we can simplify, by partitioning the sum according to the last entry in q, to:

    sum(: sum(: product(q) ←q :{q in grow(n, k): q(k−1) = h}) ←h :n)
but, given q(k−1) = h, q in grow(n, k) iff (: q :k−1) in grow(h, k−1); so this is just

    sum(: h.sum(: product(r) ←r :grow(h, k−1)) ←h :n) = sum(: h.T(k−1, h) ←h :n)
so T(1+k, n) = sum(: h.T(k, h) ←h :n) and we've already established that T(1, n) = n.(n−1)/2, which we could now infer from T(0, n) = 1. Given that grow(n, k) is empty for k>n, we can infer that T(k, n) = 0 for k>n. Once we work out the details of T(k, n), we'll be able to write our probability of a collision among the first n samples of a random variate uniformly distributed over Y possible outcomes as:

    collide(Y, n) = 1 − sum(: power(k, −1/Y).T(k, n) ←k :{naturals})
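As a numerical check on the recurrence and on this expansion (a sketch in Python, using exact rationals; the names T and collide_via_T are mine):

    from fractions import Fraction
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def T(k, n):
        """sum(: product(q) <-q :grow(n, k)), via the recurrence
        T(0, n) = 1, T(1+k, n) = sum(: h.T(k, h) <-h :n)."""
        return 1 if k == 0 else sum(h * T(k - 1, h) for h in range(n))

    def collide_via_T(Y, n):
        # 1 - product(: 1 - i/Y <-i |n), expanded as 1 - sum of power(k, -1/Y).T(k, n)
        return 1 - sum(Fraction(-1, Y)**k * T(k, n) for k in range(n + 1))

    print(T(1, 23), 23 * 22 // 2)          # T(1, n) = n.(n-1)/2
    print(float(collide_via_T(365, 23)))   # ~0.507, matching the earlier figure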
So now let's look at the T(k, n) for successive k. Fortunately, we only really need to concern ourselves with the leading power of n and its coefficient. Multiplying T(k, h), a polynomial in h, by h necessarily yields a polynomial of rank one higher; and (: sum(: p :n) ←p :) also increases the rank of any polynomial by one; so we can immediately see that the rank of T(k, n) is simply 2.k, confirming my earlier claim that each term in collide(Y, n)'s sum depends on n and Y only via a power of n.n/Y, to leading order. Furthermore, the leading-order term in sum(: power(k) :n) is power(1+k, n)/(1+k). For k = 0 and k = 1, we have already seen that T(k, n)'s leading order term is power(k, n.n/2)/k!; consider any natural k for which this arises. The leading-order term of T(1+k, n) = sum(: h.T(k, h) ←h :n) is thus the leading order term in sum(: h.power(k, h.h/2)/k! ←h :n) = sum(: power(2.k+1, h) ←h :n)/power(k, 2)/k!. This has leading-order term power(2.k+2, n)/(2.k+2)/power(k, 2)/k! = power(k+1, n.n/2)/(k+1)!, which is the anticipated value for k+1. We are thus able to induce that T(k, n)'s leading-order term is power(k, n.n/2)/k! for every natural k. Thus, ignoring terms with more factors of 1/Y than of n.n,

    collide(Y, n) ≈ 1 − sum(: power(k, −n.n/2/Y)/k! ←k :{naturals}) = 1 − exp(−n.n/2/Y)
which gives us a revised estimate of when collide(Y, n) is about
half, namely when n is about √(2.ln(2).Y). With Y = 365, this is almost
22.5, right in the interval between the last natural with collide(Y, n) below
half and the first above. Still, what matters is that this is of order
√Y. So the birthday paradox
can be summarized as: collide(Y, n)
gets to be significant when n is of order √Y, notwithstanding any
naïve expectation that this would need n of order Y/2.
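A quick numerical check of this refined estimate (assuming Python):

    from math import exp, log, prod, sqrt

    Y = 365
    print(sqrt(2 * log(2) * Y))   # about 22.49, between 22 and 23
    for n in (22, 23):
        exact = 1 - prod(1 - i / Y for i in range(n))
        print(n, round(exact, 4), round(1 - exp(-n * n / (2 * Y)), 4))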
For our SHA1 fingerprints, then, we can expect the probability of a
collision to remain ignorable as long as the number of texts involved is tiny
compared to √Y: that's still a crazy-big number, 2^80 ≈ 10^24, but nowhere
near as crazy-big as Y itself. We shouldn't need to worry about collisions
among git's identifiers until the number of texts we're dealing with
approaches millions of millions of millions of millions.
By computational experiment, I find that T(k) always has a factor of (: n!/(n−k−1)! ←n :) for k > 0. From our earlier leading-order analysis, we can infer that the quotient's leading-order term is power(k−1, n/2)/2/k!; so use the standard function chose(n, k) = n!/k!/(n−k)! to define S = (: (: T(k, n).power(k, 2)/chose(n, 1+k) ←n :) ←k :{positive naturals}), with S(1) = 2 and the leading-order term in S(k, n) being power(k−1, n).(1+k). Our recurrence relation becomes:

    S(1+k, n).chose(n, 2+k) = 2.sum(: h.chose(h, 1+k).S(k, h) ←h :n)
and computational experiment (again) reveals that S(k) also has, for odd k > 1, factors of (: 2.n.(n−1) ←n :).
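A sketch of the sort of computational experiment involved (assuming Python; T is as in the earlier sketch, math.comb stands in for chose, and exact rationals keep the quotients visible):

    from fractions import Fraction
    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def T(k, n):
        # T(0, n) = 1; T(1+k, n) = sum(: h.T(k, h) <-h :n), as before
        return 1 if k == 0 else sum(h * T(k - 1, h) for h in range(n))

    def S(k, n):
        # S(k, n) = T(k, n).power(k, 2)/chose(n, 1+k), for positive natural k
        return Fraction(T(k, n) * 2**k, comb(n, 1 + k))

    for k in (1, 2, 3):
        print(k, [str(S(k, n)) for n in range(k + 1, 12)])
    # The k = 1 row is constantly 2; patterns in the later rows are the sort of
    # thing the computational experiments above pick out.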
The earlier analysis's assumption that all candidate outcomes are equally likely is a simplification. Let's now consider the case of a random variate, with a set Y of possible outcomes whose relative frequencies are given by a function ({scalar x: 0≤x≤1}: p |Y), which we'll no longer assume to be constant, with sum(:p|Y) = 1. If the set of values we've seen from prior samples is N then the probability that our next sample isn't in that set is 1−sum(:p|N).
Suppose we look at a random sample of people born under jurisdictions using the Gregorian calendar: of any four hundred years under this calendar, 97 are leap years and the other 303 are ordinary years. Of course, leap years are slightly longer, so there's a marginally higher probability of being born in one than in any given non-leap year. In 400 years, February 29th happens 97 times, out of a total of 400×365 + 97 = 146097 days; while each of the other 365 days of the calendar happens exactly 400 times. So the probability of a randomly chosen person, from our population, having February 29th as date of birth is 97/146097, while that for any other day is 400/146097.
If, instead, we look at a random sample of people who are alive today, however, the proportion of them who were born in 1900 is tiny compared to the proportion born in any one of the last 50 years – unless our sample includes a strong bias in favour of old people – and none were born in an earlier year which was a multiple of 100 (hence of four, yet not a leap year). Thus it's more accurate to ignore the century anomaly in the leap year rule than to take it into account. So let's simply ignore the anomaly: of any 1461 days (four years), one is February 29th and each of the other possible dates of birth shows up four times. Thus the probability of a randomly chosen living person having February 29th as birthday is 1/1461 while that for any other date is 4/1461.
So, in general, the probabilities for various dates aren't all the same and depend on the population from which we draw our sample. In fact, even when the year-length variation is eliminated, as explained above for a school class, not all days are actually equally probable, since there are variations in the birth-rate through the course of each year, and there are some patterns to this that vary from year to year (because rates of conception are higher during holidays, for example). Thus some birthdays are more common than others. Likewise, for SHA1, the fingerprinting algorithm probably isn't exactly ideal, so there probably is some variation among the possible 160-bit summaries in how often each actually arises, in a large body of texts. We thus need to know how much this is apt to perturb the results.
We can reasonably expect the probability of collision to rise faster, with number of samples, when the distribution departs from uniformity: those values which arise relatively frequently are apt to appear both in your earlier samples and as your next; while those which arise less frequently are both less likely to already be in your sample and less likely to be what comes next. The net effect is that the probability distribution for what you select next is biassed in favour of the samples you've (most probably) already seen, increasing the chance of a collision. Indeed, in the extreme case where some of the nominally possible values have negligible frequency, this effectively reduces the number of possible values and hence its square root, which is the scale of number of samples needed to make a collision likely.
Consider a sequence I(0), I(1), … of sets of values, with I(0) = {} and each I(1+n) being the result of adding, to I(n), one more member i(n) in Y but i(n) not in I(n). The probability of witnessing the sequence i(0), i(1), …, i(n−1), for any natural n, is product(:p&on;i|n) = product(:p:I(n)); but if we witness the same values in any of their n! possible orders, each with this same probability, we still end up with the same set I(n) of values at the end. Thus, for a set N of n distinct members of Y, the probability that the first n samples have, as their values, the members of N is n!.product(:p:N); and the probability of the first n samples being distinct is just the sum of this over all candidate sets, N, subsumed by Y, of size n; if Y has C members, there are C!/n!/(C−n)! of these. When p is constant it's 1/C so the sum of these is just n!.power(n, 1/C).product(: C−i ←i |n)/n! = product(: 1−i/C ←i |n) as previously.
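A brute-force sketch of that count (in Python; the name distinct anticipates the function introduced next, and p is here a dictionary mapping each outcome to its relative frequency):

    from itertools import combinations
    from math import factorial, prod

    def distinct(p, n):
        """Probability that the first n samples, drawn with relative frequencies
        p[y], are all distinct: n! times the sum, over all size-n subsets N of
        the outcomes, of the product of p over N."""
        return factorial(n) * sum(prod(p[y] for y in N) for N in combinations(p, n))

    # Sanity check against the uniform case: with C equally likely outcomes this
    # should match product(: 1 - i/C <-i |n).
    C, n = 6, 4
    print(distinct({y: 1 / C for y in range(C)}, n), prod(1 - i / C for i in range(n)))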
So let distinct(p, n) be the probability that, given relative frequency function (:p:Y), our first n samples are distinct. Impose an order < on the C members of Y, whether arbitrarily or by exploiting some intrinsic property, and define grow(Y, k) = {monic list (Y:q|k): i<j<k implies q(i)<q(j)} as before (albeit now applied to Y rather than a natural). Then we can write the foregoing as, for any natural n:

    distinct(p, n) = n!.sum(: product(p&on;q) ←q :grow(Y, n))
As before, we can decompose this in terms of the last item in q. (Note that, although q is a list here, its order has nothing to do with the order in which we actually see its entries when we select our n samples; grow(Y, n) is merely serving as the set of subsets, of size n, of Y – except that the ordering helps us decompose the sum.) For convenience, I'll offset n by 1, so we now examine:
Now introduce D = ({subsets of Y}: {b in Y: b<c} ←c :Y) to make this distinct(p, 1+n) = sum(: p(c).distinct((:p:D(c)), n) ←c :Y) and it should be clear that, for any a in Y, distinct((:p:D(a)), 1+n) = sum(: p(c).distinct((:p:D(c)), n) ←c :D(a)).
This gives us a framework for computing distinct(p, 1+n) but the details are inevitably very dependent on p. So let's consider an example: the birthday problem, allowing for February 29th's lower incidence but assuming all other days to be equally likely. So p(Feb 29) = q and p(d) = (1−q)/365 = r for any other date within the year, d; and q is approximately r/4. Impose an ordering on dates for which any other date is < Feb 29, so all other dates are in D(Feb 29) and Feb 29 isn't in D(d) for any date d. As for the uniform distribution analysis, we can introduce T(k, n) ≡ sum(: product(q) ←q :grow(n, k)) and, when D(d) has 1+m members, infer that