]> Statistical Physics

Quantum Distribution

Statistical physics describes what happens when a system has (typically infinitely) many more degrees of freedom than constraints; from the basic foundation I'll lay out below, one may specialize to look at what happens when the system is made up of many sub-systems (typically thought of as particles) and how the energy of the system expresses itself (via temperature). The foundation, however, is more general and only depends on what quantum-mechanics has to say about the system's possible states and how these relate to the physically observed properties of the system.


Quantum mechanics models any system in terms of a Hilbert space, S, equipped with a non-singular Hermitian metric, s. Every solution to the underlying dynamical equations of the system is represented by a unit vector, ν in S with s(ν, ν) = 1. Each physically observable property of the system is encoded as a linear map (S:X|S) for which s&on;X is also Hermitian; such a linear map is termed an observable of the system. The expected value of the outcome of an experiment to measure X, when the state is initially in state ν in S with s(ν, ν) = 1, is s(ν, X(ν)). Such an observation forces the system into a state which actually has the observed value for X; if the system was not previously in such a state, this changes the state of the system and, potentially, the values of any previously-measured observables, thereby making those prior measurements out of date. This imposes a compatibility constraint on what we can experimentally observe. Our first task shall be to find which ν in S represent states of the system in which some compatible actual physically measured quantities take the values actually found by experiment.

When there are many such solutions, Gell-Mann's totalitarian principle (that which is not forbidden is compulsory) implies that the system shall be in a superposition of all of the possible solutions. We must then ask: which superpositions are actually selected ? To even pose that question properly requires the theory of distributions; to answer it we shall need the notion of information content; but, first, let us look a little more closely at our physically observed data.


The foregoing account of our state space, S, its metric, s, and their observables was deliberately terse (yet still rather longer than I would have it). Let me now expand somewhat upon it. A Hilbert space is a vector space over the complex numbers with certain topological properties that ensure that, even if the space is infinite-dimensional, it still follows some rules that would apply naturally were it finite-dimensional.

A Hermitian product on a Hilbert space S is an antilinear mapping (dual(S):x|S) with the nearest thing to symmetry this lets it have. that it is antilinear means that, for any u in S and complex scalar k, x(u.k) = x(u).*k; multiplying x's input by a complex scalar has the same effect as multiplying its output by the complex conjugate of this scalar. As for a linear map, an antilinear map also respects addition, so x(u+v) = x(u)+x(v) for all u and v in S. That x is a mapping (dual(S): |S) says that it accepts any u in S as an input and its output is always in dual(S), the vector space of linear maps from S to the complex numbers. Thus, for u and v in S, x(u) is a linear map from S to complex numbers, so we can give it v as input and get x(u, v), which is a complex number. Scaling u by some complex number k scales x(u, v) by *k, the complex conjugate of k; scaling v by such a k scales x(u, v) simply by k. Consequently, for any u, v with x(u, v) non-zero, there are some complex k for which x(k.u, v) and x(v, k.u) are distinct; even if they are equal for some k, varying k's phase shall vary their phases in opposite ways. Thus the nearest thing to symmetry available to a non-zero antilinear (dual(S):x|S) is to have x(u, v) equal to the complex conjugate of x(v, u), for all u, v in S.

So a Hermitian product combines two members of S – one antilinearly, the other linearly – to produce a complex number as output; and swapping the order of its inputs (hence which is taken linearly and which antilinearly) merely conjugates the output. In particular, for any u in S and Hermitian x on S, this implies x(u, u) is real: swapping the order of inputs leaves the expression (and hence its value) unchanged, yet also complex conjugates it; so its value must be self-conjugate, i.e. real. Our metric, s, is a Hermitian product on S; indeed, it is a positive-definite Hermitian product, which means s(u, u) is always positive unless u is zero. In particular the zero member of dual(S), which maps every member of S to 0, only appears as the output of s if the input was itself the zero member of S; any other u in S gives s(u, u) ≠ 0 whence s(u) maps at least some member of S to a non-zero output, so s(u) is not the zero member of dual(S). Thus if two inputs to s yield the same output, s(u) = s(v), we can infer s(u−v) = s(u) −s(v) = 0 whence u−v is the zero member of S, thus u = v; so positive-definite s is necessarily monic. For a finite-dimensional S this suffices to make s an isomorphism between S and dual(S); one of the effects of S being Hermitian is to ensure that any positive-definite Hermitian (dual(S):s|S) likewise has an inverse (S:|dual(S)); this is necessarily antilinear, like s, and posesses the same conjugated symmetry, hence a Hermitian product on dual(S).

Now, as long as s is invertible, we can compose its inverse after any other Hermitian (dual(S):x|S); the resulting s\x (reading \ as division from the left in the same way that a/b divides a by b on the right; if b is a number, a/b and b\a are the same thing, but the order matters when b is a mapping) then takes any u in S, consumes it antilinearly to give x(u) in dual(S), then feeds it backwards through s, also antilinearly to get a member of S; if we scale u by complex k, the intermediate value is *k.x(u) and the inverse of s duly conjugates *k back to k, so (s\x)(k.u) = k.(s\x)(u) and (S: s\x |S) is simply linear. Thus, given s, every Hermitian x on S can be turned into a linear (S: s\x |S); call this X and observe that, in the composite s&on;X = s·(s\x), s and its inverse cancel, so s&on;X = x. Any linear (S:X|S) for which s&on;X is Hermitian is an observable, so s mediates an isomorphism between Hermitian products on S and observables.

In quantum mechanics we model any system by a state vector, ν, in a Hilbert space, S, on which we have a non-singular Hermitian metric (dual(S): s |S) with s(ν, ν) = 1. [Hermitian means s is antilinear – for complex scalar k and any v in S, s(k.v) = *k.s(v), where *k is the complex conjugate of k – and as symmetric as that allows it to be – for u, v in S, s(u,v) is equal to the complex conjugate of s(v,u).] The quantities we can know are expressed as linear maps (S:X|S) for which s&on;X is Hermitian; such a linear map is called an observable; the ones we do know (i.e., not only are they observable, but we have observed them) necessarily commute with one another. This implies that they are diagonal with respect to some basis which unit-diagonalises s. Call one such basis (S: b |dim) and its dual (dual(S): q |dim), with dim the dimension of our Hilbert space; then

in which ({scalars}: * |{scalars}) is complex conjugation; and each q(i) is in dual(S) = {linear ({scalars}: |S)}, so *&on;q(i) is antilinear ({scalars}: |S).

Thus we have a family of observables, which we can model as a set Observed of linear maps (S: |S); and we're given the value of s(ν, X(ν)) for each X in Observed, which we can encode as a mapping Val = (: s(ν, X(ν)) ←X |Observed). For any Hermitian t and vector u in S, swapping the two inputs of t(u,u) conjugates the output (because t is Hermitian) without actually changing the expression, so *(t(u,u)) = t(u,u), i.e. t(u, u) is real. For each X in Observed, s&on;X is Hermitian, whence s(X(ν), ν) = (s&on;X)(ν, ν) is real, hence equal to its conjugate *(s(X(ν), ν)) = s(ν, X(ν)) = Val(X) thanks to s being Hermitian. Thus, in fact, ({reals}: Val |Observed).

One observable always available to us is the identity, which commutes with all observables. Any observation of the identity renders an answer of 1, which expresses the above-mentioned truth that s(ν, ν) = 1, so we include the identity among Observed as a way of encoding this constraint. Since S, as a collection, is synonymous with its identity, we thus have S in Observed, with Val(S) = 1. For convenience, take J = {X in Observed: X is not S}, so that Observed = J&unite;{S} and S is not in J. Thus J is the collection of real physical observeds.

Note that J will typically be a collection of observables which contain as much information about our system as we know how to obtain: if we identified an observable compatible with but independent of the others, we would observe it. Consequently, there's a sense in which we should expect to find little information content in the degrees of freedom remaining to the solution, given the data actually observed. This is an anthropic contribution to the situation.

We have as many degrees of freedom as we have members of dim (which is typically infinite): we have as many constraints as there are members of Observed (which is typically finite). There are typically plenty more degrees of freedom than constraints.


When we have more degrees of freedom than constraints, we can expect to find many solutions; our system is capable of being in any of some very large collection of actual states. When we have many solutions (especially when we have so many that Avagadro's number doesn't seem big), the study of the situation is called Statistical Physics. Even when only few solutions are available, however, we need to describe how the system choses among the available solutions.

The Hilbert space S is the span of all possible solutions of our underlying dynamical system; any s-unit vector in it is a possible state of that system. Our observation of the physical properties of the system constrains the state to the (s-unit vectors in) a sub-space of S on which each observable has, as eigenvalue, its observed value. Any suitably normalised linear combination of states in this sub-space shall also be a state in this sub-space; and, between such states, we expect there to be scattering channels by which a solution at one moment evolves, over time, to form a superposition of solutions, in which the original is just one possible contributor.

The appropriate tool for doing superpositions of solutions is a distribution (or measure) on the solution-set, E, of our equations: formally, this is a linear map ({scalars}: :{mappings ({scalars}: :E)}), which may be though of as integrating over E. We can always extend a distribution, m, to also serve as a linear map (V: :{mappings (V::E)}) for any linear space V, by taking components in V, applying m to the scalar functions from E that result, then using the resulting scalars as components in V to reconstitute an answer. For a given distribution, m, on E and any mapping ({scalars}: f :E), we can form a new distribution on E, which I'll write as f@m, defined by:

On a discrete set, there is a standard distribution, sum, which simply adds together the outputs of the function: and any function f from the discrete set to scalars thus yields a distribution f@sum. A superposition of solutions has the form m(S: e←e |E), the integral with respect to some distribution m of E's natural embedding in S. The available scattering processes will imply some kind of redistribution over E, collectively implying a flow dynamics for m whose character we know poorly save that it is jam packed full of randomness. The basic idea of statistical physics is to reason that the actual superposition of solutions we see will be as random as it can get while yielding a solution (i.e. a member of E) as its integral of (S: e ←e :E).

Now, two distributions on E may produce the same member of S as their superposition: and the form our constraints take becomes complex and unhelpful. However, each member of E must be some superposition of basis vectors (because b spans S), so any superposition of solutions to the equations is equally a superposition of b: and two distributions superposing b to the same member of S must be equal (because b is linearly independent), so the ambiguity among distributions on E presents no problem. Because b diagonalises our constraints, it will render them into a simpler form than E could offer.

We thus deal with (S:b|dim) rather than (S: e ←e|E), so the corresponding tool for superpositions becomes a distribution on (|b:), which (since b is monic) amounts to a distribution on dim. We already have a standard distribution on dim, namely sum (if dim is discrete; if continuous, we need to use integral in place of sum; but this makes no difference to the form of the discussion). As noted above, any mapping ({scalars}: f :dim) yields f@sum as another distribution on dim. [When dim is a continuum, some distributions on it might not be of form f@integral; but we suppose that distributions we actually want to use will be of this form.] We can thus identify any distribution with the ({scalars}: f :dim) for which f@sum is the given distribution; indeed, doing so is exactly the process of taking coordinates in S, using b as basis, of the superposition that results from the distribution. So, now, let's describe an arbitrary solution in terms of its co-ordinates, and see what form our constraints take.

Co-ordinates and randomness

We can use our bases, b and q, of S and dual(S) to express ν in co-ordinates: define ρ = ({complex}: q(i)·ν ←i |dim) and we obtain ν = sum(S: ρ(i).b(i) ←i |dim). Since q unit-diagonalised s, we have s(b(i)) = q(i), so we can infer s(ν) = sum(: *(ρ(i)).q(i) ←i |dim). Since b and q also real-diagonalise each X in Observed, we can define


= s(ν, X(ν)) = s(ν)·X(ν)
= sum(: *(ρ(i)).q(i)·X(ν) ←i |dim)
= sum(: *(ρ(i)).Ob(X,i).ρ(i) ←i |dim)

so define

r = (: ρ(i).*(ρ(i)) ←i |dim)
noting that its outputs are non-negative reals, and obtain
Val(X) = sum(: r.Ob(X) :)

From S in Observed, as the identity, we get Ob(S) = (: 1←i |dim) whence 1 = s(ν, ν) = sum(r), and we know r's outputs are non-negative, so each output of r is at most 1 (at least when dim is discrete). This makes r@sum a (candidate to be used as a) probability distribution: equally one may think of r as indicating the proportions in which we are mixing the base states represented by b when we build up ν (but this should not be treated too literally, since r lacks the phase information of ρ).

For each i in dim we have b(i)×q(i) in {linear (S:|S)} and composing it before s yields q(i)×(*&on;q(i)) which is Hermitian, so b(i)×q(i) is an observable (in the technical sense, not necessarily a physical one); it also commutes with each of X in Observed, since these are diagonal. Its expected value in state ν is simply r(i), and quantum mechanics regards it as the probability of observing a state (previously) in state ν to be (when observed) in state b(i). In principle, at least, r is consequently observable.

Now, r is defined in terms of b, our basis, but none of the definition of either cares about any relationships among the labels (members of dim): so we should likewise consider all members of dim independently. Our constraints would complicate that, but the optimisation technique I'll be using allows us to ignore the complication until we've done the optimisation.

We need a quantifiable notion of how random r is: information theory comes to the rescue, with sum(: r.log(1/r) :), that is sum(: r(i).log(1/r(i)) ←i :). This is the information-theoretic entropy of the distribution described by r. The probability of getting distribution r at random depends on r only via its entropy, and increases with it. [Aside: log(x)/x tends to zero as x tends to infinity (because log grows slower than k.x for any positive k), so t.log(1/t) does likewise as t tends to zero (from above), so treat 0.log(1/0) as 0 if r takes the value 0 anywhere.] So we can use f(r) = sum(r.log(1/r)) as the function we aim to maximise, within our constraints.

Our constraints are (: sum(r.Ob(X)) ←X |Observed) = Val, which has the form g(r) = given required to use a standard optimisation technique. [K(i) is the value Val would have if ν were b(i).] Introduce K = transpose(Ob) = (: (: Ob(X, i) ←X |Observed) ←i |dim), so K(i,X) = Ob(X,i), and obtain

= (: s(ν, X(ν)) ←X |Observed)
= (: sum(: r(i).K(i,X) ←i |dim) ←X |Observed)
= sum(: r(i).K(i) ←i |dim)
= sum(r.K) in {mappings ({reals}: |Observed)}

We'll be looking for some r for which f'(r) and h&on;g'(r) are parallel for some linear ({reals}: h :{mappings ({reals}: :Observed)}), which is a distribution on Observed. Given that Observed is finite (so that we can sum over it, for instance), we have a measure called sum on it already, and we can express any measure on Observed as a scalar function by which to weight its members when summing over them. This leads us to some ({reals}: H |Observed) equivalent to h as follows:


Any sufficiently small change, d, in r produces a change in g of near enough g'(r)·d, which must be zero if g(r) and g(r+d) both have the correct value. At a stationary point (e.g. any maximal or minimal point), no allowed displacement changes the value of f, which requires that every d with g'(r)·d zero also has f'(r)·d likewise zero. This implies that f'(r) can be factorised as −h&on;g'(r) for some h, as above (I could equally have used f'(r)=h&on;g'(r), but we have a choice here and f'(r)+h&on;g'(r)=0 gives H positive when we work out the details).

Now, (: t.log(1/t) = −t.log(t) ←t :)' = (: −1 + log(1/t) ←t :) so differentiation with respect to r(i), for any given i in dim, turns f(r) = sum(r.log(1/r)) into −1+log(1/r(i)) and g(r) = sum(r.K) into K(i) = (: Ob(X,i) ←X :). We have h&on;g'(r) = h&on;K = (: h(K(i)) = sum(: H(X).Ob(X,i) ←X :) ←i :) = sum(H.Ob), so we obtain 1+log(r(i)) = sum(: H(X).Ob(X,i) ←X :) or

Now, S is the identity, and each s(b(i), b(i)) is 1, so Ob(S, i) is always 1 and we have log(r(i)) = H(S)−1 + (sum(:H.Ob:J))(i), so the constraint given as S becomes 1= sum(r) = exp(H(S)−1).sum(: (exp&on;sum)(:H.Ob:J) |dim), so exp(1−H(S)) = sum(: exp&on;sum(:H.Ob:J) |dim) so we are able to express one of our constraints as a constraint on H(S), in terms of the remaining (:H:J) independent Observeds.

Pause to consider whether we got a maximum: differentiating f with respect to first r(i) then r(n), for n, i in dim, we get r(n)'s derivative of −(1+log(r(i))), which is zero unless i=n, when it is simply −1/r(i), which is negative; g's derivative is independent of r, so its second derivative (hence that of h&on;g) is zero. Thus f+h&on;g has negative definite second derivative with respect to r, implying that our stationary point was a maximum (when you move far enough away from r to get a change in f+h&on;g, it'll be negative no matter what direction you took), which is what we wanted.

Further analysis


Our solution now becomes r(i) = zed(H,i) / exp(Y(h)), with both zed and Y ignoring H(S) so, in effect, H is only defined ({scalars}:|J) from here onwards; and h ignores the S = 1 component of Val. Pause, now, to consider the derivative of zed and, thereby, of Y.

For any X in J, differentiation with respect to H(X) turns log&on;zed(H) = (: sum(:H.K(i):J) ←i |dim) into (: K(i,X) ←i |dim) = Ob(X), hence zed(H) into Ob(X).exp&on;sum(:H.Ob:J). Consequently, exp(Y(h)) = sum(: zed(H) |dim) has derivative, with respect to H(X), sum(: Ob(X).exp&on;sum(:H.Ob:J) |dim). Now, from g(r) = Val,

= g(r,X)
= sum(:
= Ob(X,i).exp(sum(:H.K(i):J)) / exp(Y(h))
←i |dim)
= sum(: Ob(X).exp&on;sum(:H.Ob:J) |dim) / exp(Y(h))

This just divides our derivative of exp&on;Y by exp&on;Y, which is thus the simple derivative of Y with respect to H(X), so (:Val:J) is Y's derivative. Thus

We have exp(Y(h)) = sum(: zed(H) |dim) and, for each X in J ⊂ Observed, X = sum(: Ob(X, i).b(i)×q(i) ←i |dim) so, reading J as the (linear) identity mapping on J,

h(J) = sum(:H.J:)
= sum(: H(X).X ←X :J)
= sum(: H(X).sum(: Ob(X,i).b(i)×q(i) ←i |dim) ←X :J)
= sum(: sum(: H(X).Ob(X,i) ←X :J).b(i)×q(i) ←i |dim)
= sum(: log(zed(H,i)).b(i)×q(i) ←i |dim),

which is diagonal with respect to our basis, hence formally an observable. In particular, we can define exp() on {linear (S:|S)}, and it maps one diagonal mapping to another, mapping each diagonal entry to its own exp() as a scalar: so exp(sum(:H.J:)) is just sum(: zed(H,i).b(i)×q(i) ←i |dim) = sum(: zed(H).b×q :dim) and its trace is exp(Y(h)). Let Z = exp(sum(:H.J:)) = exp(h(J)) and observe that Z is defined independently of our choice of basis, b (albeit we compute it using that basis) and is an observable compatible with all those in Observed. We have Y = log&on;trace&on;Z so, again, Y doesn't depend on our choice of basis (whereas zed manifestly does – it's the co-ordinates of Z wrt b, q).

Final entropy

It is interesting, now, to look at the final value of the function whose maximum we've found. That's

= sum(r.log(1/r))
= sum(: r(i).(log(1/zed(H,i)) + Y(h)) ←i |dim)
= Y(h).sum(r) −sum(: r(i).log(zed(H,i)) ←i |dim)
= Y(h) −sum(: r(i).sum(: H(X).Ob(X, i) ←X :J) ←i |dim)
= Y(h) −sum(: H(X).sum(: r(i).Ob(X, i) ←i |dim) ←X :J)
= Y(h) −sum(: H(X).Val(X) ←X :J)
= Y(h) −h(Val)

which is manifestly independent (because Y and Z are) of our choice of simultaneous eigenbasis for our observables, even though f was defined in terms of that basis: the final value depends only on the observables and their values. Furthermore, h(Val) = h·Val = h·Y'(h), so Y(h) −h(Val) is just the estimate we would make, by linear extrapolation from Y's value and derivative at h, of the value of Y(zero).

Lack of large numbers

Thus far the only time I've needed many of anything is hidden inside how I assess the randomness of the function r. The derivation of the measure of randomness, as sum(r.log(1/r)), involved making up r of a large number of dollops. Its validity may involve more measure theory than I know, but the dollops of which it needs large numbers are purely a fiction of the construction: they make no pretence at having anything at all to do with anything physical. Any distribution which arises as a result given one number of dollops also arises for at least some larger number of dollops (e.g. any multiple of the original number), indeed the distributions made of many dollops are taken to be dense among the possible distributions. Since, for these distributions at least, we know the probability will depend on the value of r only via sum(r.log(1/r)), we infer that the same will apply to any probability distribution we manage to infer on the space of (: distributions :{mappings ({scalars}:|dim)}) will depend on position, r, in this space, only via the same sum: which may be evaluated without reference to how many dollops were involved. So the only large number needed thus far was used in an abstract context to justify using a measure of randomness which can be evaluated without reference to the large number in question.

Thus the conclusions derived here apply to any quantum system, even if it consists only of one molecule or, indeed, a single atom. In so far as its internal dynamics permit it various states with its observed properties, we can expect it to be in an entropy-maximising superposition, of these states, of the kind described here.

Insensitivity to redundant observables

Suppose we add a new observable, M, to J, but that it's entirely determined by the other observables, i.e. its expressible as a function of the observables already in J. Intuitively, this shouldn't make any difference; so let's see what the theory predicts. Let I be the set of independent X in J, excluding M, and let μ be the function which computes M from the observables in I; so K(i, M) = μ(:K(i):I) for each i in dim and can reasonably expect the matching result for its expectation, Val(M) = μ(:Val:I).

We have Val(X) = ∂Y/∂H(X) = ∂(log(sum(: exp(sum(H.K(i))) ←i :dim)))/∂H(X) and we're adding an extra term to the sum(H.K(i)), H(M).K(i,M). Our partial differentiation formally treats H(M) as being independent of (:H:I), so this won't change Val(X) for X in I; but shall imply a value for Val(M) which needs to agree with what μ(:Val:I) returns. Each K(i) gets a new component, K(i, M), which can be expressed as μ(:K(i):I); we get H(M) as multiplier for this new component. So, differentiating formally with respect to H(M) while formally keeping all other H(X) constant, we expect μ(:Val:I) to be equal to:

= ∂(log(trace(exp(sum(:H.J:))))/∂H(M)
= ∂(log(sum(: exp(sum(: H.K(i) :J)) ←i :dim)))/∂H(M)
= ∂(log(sum(: exp(sum(: H.K(i) :I) + H(M).μ(: K(i) :I)) ←i :dim)))/∂H(M)
= sum(: μ(: K(i) :I).exp(sum(H.K(i))) ←i :dim)/exp(Y)
= sum(: Ob(M, i).q(i)·Z(h)·b(i) ←i :dim)/trace(Z(h))
= trace(M·Z(h))/trace(Z(h))



What's shown above is that:


Valid CSSValid XHTML 1.1 Written by Eddy.