# Information content

The notion dual to information content is: how random is it ? For discrete probability distributions, I have a notion of this, whose form may potentially be extended to non-discrete situations. It goes like this:

Consider a finite collection, of size M, whose members I'll call buckets; and a finite collection, of size N, whose members I'll call blobs. If I throw each blob into a random bucket, ignoring such matters as where the buckets are and how many blobs are already in them, I can examine the probability of any given distribution of blobs arising among the buckets. This lets us infer a probability distribution on distributions of blobs among buckets. If we characterize this probability distribution, we can discover what characteristics the most probable distributions have.

All choices being independent, when I examine a randomly chosen bucket after such a distribution, I can ask what the probability is of finding 0, 1, 2 or … blobs in the chosen bucket. The probability of finding i blobs in the chosen bucket is a product of three terms: each of the i blobs in our bucket had a probability of 1/M of landing in that bucket, contributing pow(1/M, i); each of the N−i blobs not in our bucket had a probability of (M−1)/M of not landing in that bucket, contributing pow((M−1)/M, N−i); and there are chose(N,i) = N!/i!/(N−i)! ways of choosing which i of the available N were to be the ones which did land in our given bucket.
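As a quick sanity check on this binomial form (the sizes here are illustrative, not from the text), the terms can be computed directly and summed over i:

```python
from math import comb

M, N = 5, 20  # illustrative bucket and blob counts

def prob_i_blobs(i, N=N, M=M):
    # chose(N, i) ways to pick which i blobs land in our bucket, each
    # with probability 1/M; the other N-i each miss it with probability
    # (M-1)/M.
    return comb(N, i) * (1 / M) ** i * ((M - 1) / M) ** (N - i)

probs = [prob_i_blobs(i) for i in range(N + 1)]
assert abs(sum(probs) - 1) < 1e-12  # some count from 0 to N must arise
# the expected count in any one bucket is N/M
assert abs(sum(i * q for i, q in enumerate(probs)) - N / M) < 1e-9
```

The two assertions confirm that the terms form a probability distribution over i and that its mean is N/M, as one would expect from N blobs spread over M buckets.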

We obtain a random distribution over our buckets, which we can view as a probability distribution scaled by N, characterized by a mapping from buckets to how many blobs landed in each, ({naturals}: n |M). We have sum(n) = N, and the probability of any given n arising is proportional to the number of ways of partitioning our N blobs over M for which the distribution is our given n, times pow(1/M, N) for the probability of each blob landing in the slot in which it did land. The number of partitions involved is simply N!/product(: n(i)! ←i |M), so the probability of seeing distribution n is N! / pow(M, N) / product(n!), in which N = sum(n).
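To see this formula in action (again with small, illustrative sizes so that brute-force enumeration is cheap), one can list every distribution n with sum(n) = N and check that the probabilities N!/pow(M,N)/product(n!) sum to one:

```python
from itertools import product
from math import factorial

M, N = 4, 8  # illustrative sizes, kept small for enumeration

def prob_of(n):
    # N!/product(n(i)!) partitions of the blobs realise this n,
    # each arising with probability pow(1/M, N).
    total = sum(n)
    ways = factorial(total)
    for k in n:
        ways //= factorial(k)  # exact: multinomial coefficients are integers
    return ways / len(n) ** total

dists = [n for n in product(range(N + 1), repeat=M) if sum(n) == N]
assert abs(sum(prob_of(n) for n in dists) - 1) < 1e-9
```

That the sum comes to one is just the multinomial theorem: summing N!/product(n!) over all n with sum(n) = N gives pow(M, N).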

Let p = (: n(i)/N ←i |M), so sum(p) = 1, and use the leading terms in Stirling's formula, log(N!) ≈ (N+1/2).log(N) −N, to obtain

log(N!/product(n!))
≈ log(N).(N+1/2) −N −sum(: log(n(i)).(n(i)+1/2) −n(i) ←i |M)
= log(N).(N +1/2) −sum(log(n).(n+1/2))
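The quality of the truncated Stirling approximation is easy to check numerically against the exact log(N!) (available via lgamma); the truncation drops a constant log(√(2·π)) ≈ 0.919, so the absolute error tends to that constant while the relative error shrinks as N grows:

```python
from math import lgamma, log

def stirling(n):
    # the leading terms used above: (n + 1/2).log(n) - n
    return (n + 0.5) * log(n) - n

# lgamma(n + 1) is the exact log(n!); the dropped constant
# log(sqrt(2*pi)) ~ 0.919 dominates the error, so the relative
# error falls off roughly like 0.919/(n.log(n)).
rel = {n: abs(stirling(n) - lgamma(n + 1)) / lgamma(n + 1)
       for n in (10, 100, 10000)}
```

For n = 10 the relative error is around six per cent; by n = 10000 it is barely a hundredth of a per mille, which is why only large n(i) are safe to treat this way.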

albeit with some complications needed to deal with those i for which n(i) isn't large enough for Stirling's formula to apply. We can eliminate these by ignoring the i for which n(i) is zero (these contribute nothing to product(n!)) and taking N large enough to ensure that all but negligibly few of the other n(i) are large. This also lets us ignore the +1/2 terms, leaving us with N.log(N) −sum(n.log(n)). We know log(pow(M,N)) = N.log(M), so the probability of seeing n has, as its log,

N.log(N/M) −sum(n.log(n))
= sum(n).log(N/M) −sum(n.log(n))
= −sum(n.log(n/N)) −N.log(M)
= N.sum(p.log(1/p)) −N.log(M)
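One can check this chain of approximations numerically (sizes again illustrative): compute the log of the exact probability N!/pow(M,N)/product(n!) via lgamma and compare it with N.sum(p.log(1/p)) −N.log(M). The two agree to within the log-order terms dropped above, which grow only logarithmically in N while both quantities grow linearly with N:

```python
from math import lgamma, log

n = [400, 300, 200, 100]  # an illustrative distribution, N = 1000
N, M = sum(n), len(n)

# log of the exact probability N!/pow(M, N)/product(n!):
exact = lgamma(N + 1) - N * log(M) - sum(lgamma(k + 1) for k in n)
# the approximation N.sum(p.log(1/p)) - N.log(M), with p = n/N:
approx = sum(k * log(N / k) for k in n if k) - N * log(M)

# both are large and negative; they differ only by the dropped
# log-order corrections, a small fraction of either value here
assert abs(exact - approx) < 0.12 * abs(exact)
```

Scaling n up (say to N = 10000 with the same p) shrinks the relative discrepancy further, as the dropped terms grow like log(N) against the leading N.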

Thus the probability of seeing distribution p depends on p only via sum(p.log(1/p)). A sufficiently well-behaved integration (or procedure for infinite summation) over some domain can reduce probability measures on that domain to density functions, like p, for which we have a measure of probability (relative to the integration, in fact) expressed by integral(p.log(1/p)). The original probability measure is p@integral and we can equally construct (p.log(1/p))@integral, whose total measure is thus our indicator of the probability of seeing distribution p. In any case, for any distribution p, define I(p) = sum(p.log(1/p)), with sum replaced by a suitable integration where appropriate.
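In the discrete case I(p) is a one-liner; the convention that a term with p(i) = 0 contributes nothing matches the treatment of empty buckets above. A certain outcome gives I(p) = 0, while the uniform distribution on M outcomes gives log(M):

```python
from math import log

def I(p):
    # sum(p.log(1/p)), with terms at p(i) = 0 contributing 0
    return sum(q * log(1 / q) for q in p if q)

assert I([1, 0, 0, 0]) == 0               # no randomness at all
assert abs(I([0.25] * 4) - log(4)) < 1e-12  # maximal for uniform p
```

These two extremes bracket the range: 0 ≤ I(p) ≤ log(M) for any distribution p over M outcomes.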

What kinds of distribution are most probable ? We can use Lagrange's method of multipliers to solve the problem. Let p encode the system's state, so D is the collection of distributions on M, V and W are {scalars}, f is (V: sum(p.log(1/p)) ←p |D) and g is sum, with constraint value g(p) = 1. We have h as linear (V: h |W) with W = {scalars} = V, so h is just a scalar. Differentiation with respect to p(i), for some i in M, takes f to −1−log(p(i)) and g to 1, so ∂f = h·∂g becomes 0 = 1+log(p(i))+h, which makes p(i) = exp(−1−h), so p is constant. This is exactly the uniform distribution we used when throwing blobs into buckets at random, so this shouldn't be surprising. Note that h.g−f, with derivative h+1+log(p), has diagonal second derivative, with entries 1/p all positive: so h.g−f is at a minimum and f−h.g is at a maximum: we've maximized f with respect to the constraint (the method of multipliers could as readily have given us a minimum, in general).
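The conclusion is easy to test numerically: perturb the uniform distribution in any way that preserves sum(p) = 1 and I(p) drops, strict concavity of p.log(1/p) making the uniform distribution the unique maximum:

```python
from math import log

def I(p):
    # I(p) = sum(p.log(1/p)), empty terms contributing 0
    return sum(q * log(1 / q) for q in p if q)

M = 6
uniform = [1 / M] * M
# shift a little probability from one outcome to another, keeping sum(p) = 1
shifted = [1 / M + 0.02, 1 / M - 0.02] + [1 / M] * (M - 2)

assert abs(I(uniform) - log(M)) < 1e-12  # the maximal value, log(M)
assert I(shifted) < I(uniform)           # any deviation lowers I
```

The same check passes for any valid shift and any M ≥ 2, in line with the second-derivative argument above.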

When we have more constraints, W has higher dimension, h is more interesting and p can take quite different forms.