Statistical physics describes what happens when a system has (typically infinitely) many more degrees of freedom than constraints; from the basic foundation I'll lay out below, one may specialize to look at what happens when the system is made up of many sub-systems (typically thought of as particles) and how the energy of the system expresses itself (via temperature). The foundation, however, is more general and only depends on what quantum mechanics has to say about the system's possible states and how these relate to the physically observed properties of the system.

Quantum mechanics models any system in terms of a Hilbert space, S,
equipped with a non-singular Hermitian metric, s. Every solution to the
underlying dynamical equations of the system is represented by a unit vector,
ν in S with s(ν, ν) = 1. Each physically observable property of the
system is encoded as a linear map (S:X|S) for which s&on;X is also Hermitian;
such a linear map is termed an **observable** of the system. The
expected value of the outcome of an experiment to measure X, when the state is
initially in state ν in S with s(ν, ν) = 1, is s(ν,
X(ν)). Such an observation forces the system into a state which actually
has the observed value for X; if the system was not previously in such a
state, this changes the state of the system and, potentially, the values of
any previously-measured observables, thereby making those prior measurements
out of date. This imposes a compatibility constraint on what we *can*
experimentally observe. Our first task shall be to find which ν in S
represent states of the system in which some compatible actual physically
measured quantities take the values actually found by experiment.

When there are many such solutions,
Gell-Mann's totalitarian
principle (that which is not forbidden is compulsory) implies that the
system shall be in a superposition of *all* of the possible
solutions. We must then ask: which superpositions are actually selected?
To even pose that question properly requires the theory of distributions; to
answer it we shall need the notion of information content; but, first, let us
look a little more closely at our physically observed data.

The foregoing account of our state space, S, its metric, s, and their observables was deliberately terse (yet still rather longer than I would have it). Let me now expand somewhat upon it. A Hilbert space is a vector space over the complex numbers with certain topological properties that ensure that, even if the space is infinite-dimensional, it still follows some rules that would apply naturally were it finite-dimensional.

A Hermitian product on a Hilbert space S is an antilinear mapping
(dual(S):x|S) with the nearest thing to symmetry this lets it have. That it
is antilinear means that, for any u in S and complex scalar k, x(u.k) =
x(u).*k; multiplying x's input by a complex scalar has the same effect as
multiplying its output by the complex conjugate of this scalar. As for a
linear map, an antilinear map also respects addition, so x(u+v) = x(u)+x(v)
for all u and v in S. That x is a mapping (dual(S): |S) says that it accepts
any u in S as an input and its output is always in dual(S), the vector space
of linear maps from S to the complex numbers. Thus, for u and v in S, x(u) is
a linear map from S to complex numbers, so we can give it v as input and get
x(u, v), which is a complex number. Scaling u by some complex number k scales
x(u, v) by *k, the complex conjugate of k; scaling v by such a k scales x(u,
v) simply by k. Consequently, for any u, v with x(u, v) non-zero, there are
some complex k for which x(k.u, v) and x(v, k.u) are distinct; even if they
are equal for some k, varying k's phase shall vary their phases in opposite
ways. Thus the nearest thing to symmetry available to a non-zero antilinear
(dual(S):x|S) is to have x(u, v) equal to the complex conjugate of x(v, u),
for all u, v in S.
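
None of this needs more than arithmetic to verify; the following Python sketch builds a Hermitian product on a two-dimensional S from a made-up self-conjugate matrix (nothing here is part of the physics, just a numerical check) and confirms the antilinearity, linearity and conjugate-symmetry properties just described:

```python
# Throw-away sketch on a two-dimensional S: x(u, v) is the sum over i, j of
# conj(u[i]).M[i][j].v[j], with M (invented) equal to its conjugate-transpose.

M = [[2 + 0j, 1 - 1j],
     [1 + 1j, 3 + 0j]]        # M[j][i] == conj(M[i][j]), so x is Hermitian

def x(u, v):
    return sum(u[i].conjugate() * M[i][j] * v[j]
               for i in range(2) for j in range(2))

u, v, k = [1 + 2j, 0 - 1j], [2 - 1j, 1 + 1j], 0.5 + 0.25j

ku = [k * c for c in u]
kv = [k * c for c in v]
assert abs(x(ku, v) - k.conjugate() * x(u, v)) < 1e-12   # antilinear in u
assert abs(x(u, kv) - k * x(u, v)) < 1e-12               # linear in v
assert abs(x(u, v) - x(v, u).conjugate()) < 1e-12        # conjugate symmetry
assert abs(x(u, u).imag) < 1e-12                         # x(u, u) is real
```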

So a Hermitian product combines two members of S – one antilinearly, the other linearly – to produce a complex number as output; and swapping the order of its inputs (hence which is taken linearly and which antilinearly) merely conjugates the output. In particular, for any u in S and Hermitian x on S, this implies x(u, u) is real: swapping the order of inputs leaves the expression (and hence its value) unchanged, yet also complex conjugates it; so its value must be self-conjugate, i.e. real. Our metric, s, is a Hermitian product on S; indeed, it is a positive-definite Hermitian product, which means s(u, u) is always positive unless u is zero. In particular the zero member of dual(S), which maps every member of S to 0, only appears as the output of s if the input was itself the zero member of S; any other u in S gives s(u, u) ≠ 0 whence s(u) maps at least some member of S to a non-zero output, so s(u) is not the zero member of dual(S). Thus if two inputs to s yield the same output, s(u) = s(v), we can infer s(u−v) = s(u) − s(v) = 0 whence u−v is the zero member of S, thus u = v; so positive-definite s is necessarily monic. For a finite-dimensional S this suffices to make s an isomorphism between S and dual(S); one of the effects of S being a Hilbert space is to ensure that any positive-definite Hermitian (dual(S):s|S) likewise has an inverse (S:|dual(S)); this is necessarily antilinear, like s, and possesses the same conjugated symmetry, hence is itself a Hermitian product on dual(S).

Now, as long as s is invertible, we can compose its inverse after any
other Hermitian (dual(S):x|S); the resulting s\x (reading \ as division
from the left, in the same way that a/b divides a by b on the right; if b
is a number, a/b and b\a are the same thing, but the order matters when b is a
mapping) then takes any u in S, consumes it antilinearly to give x(u) in
dual(S), then feeds it backwards through s, also antilinearly to get a member
of S; if we scale u by complex k, the intermediate value is *k.x(u) and the
inverse of s duly conjugates *k back to k, so (s\x)(k.u) = k.(s\x)(u) and (S:
s\x |S) is simply linear. Thus, given s, every Hermitian x on S can be turned
into a linear (S: s\x |S); call this X and observe that, in the composite
s&on;X = s·(s\x), s and its inverse cancel, so s&on;X = x. Any linear
(S:X|S) for which s&on;X is Hermitian is an observable, so s mediates an
isomorphism between Hermitian products on S and observables.
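
In a basis, with the antilinear book-keeping absorbed into conjugating the left-hand vector, s and x become matrices equal to their own conjugate-transposes and s\x is computed by inverting s's matrix and composing; a throw-away 2×2 Python sketch, with invented entries, to illustrate:

```python
# 2x2 sketch with invented entries: the metric s and another Hermitian x, as
# matrices; X = s\x solves s.X = x, so composing s before X recovers x.

s = [[1 + 0j, 0 + 0j],
     [0 + 0j, 2 + 0j]]        # positive-definite Hermitian metric
x = [[1 + 0j, 0 + 1j],
     [0 - 1j, 4 + 0j]]        # another Hermitian form

det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
s_inv = [[ s[1][1] / det, -s[0][1] / det],
         [-s[1][0] / det,  s[0][0] / det]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

X = matmul(s_inv, x)          # the observable X = s\x
back = matmul(s, X)           # s&on;X should be x again
assert all(abs(back[i][j] - x[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```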

In quantum mechanics we model any system by a state vector, ν, in a
Hilbert space, S, on which we have a non-singular Hermitian metric (dual(S): s
|S) with s(ν, ν) = 1. [Hermitian means s is antilinear – for
complex scalar k and any v in S, s(k.v) = *k.s(v), where *k is the complex
conjugate of k – and as symmetric as that allows it to be – for u,
v in S, s(u,v) is equal to the complex conjugate of s(v,u).] The quantities
we *can* know are expressed as linear maps (S:X|S) for which s&on;X is
Hermitian; such a linear map is called an **observable**; the
ones we *do* know (i.e., not only are they observ*able*, but
we *have* observ*ed* them) necessarily commute with one
another. This implies that they are diagonal with respect to some basis which
unit-diagonalises s. Call one such basis (S: b |dim) and its dual (dual(S): q
|dim), with dim the dimension of our Hilbert space; then

- s = sum(: q(i)×(*&on;q(i)) ←i |dim)

in which ({scalars}: * |{scalars}) is complex conjugation; and each q(i) is in dual(S) = {linear ({scalars}: |S)}, so *&on;q(i) is antilinear ({scalars}: |S).

Thus we have a family of observables, which we can model as a set Observed of linear maps (S: |S); and we're given the value of s(ν, X(ν)) for each X in Observed, which we can encode as a mapping Val = (: s(ν, X(ν)) ←X |Observed). For any Hermitian t and vector u in S, swapping the two inputs of t(u,u) conjugates the output (because t is Hermitian) without actually changing the expression, so *(t(u,u)) = t(u,u), i.e. t(u, u) is real. For each X in Observed, s&on;X is Hermitian, whence s(X(ν), ν) = (s&on;X)(ν, ν) is real, hence equal to its conjugate *(s(X(ν), ν)) = s(ν, X(ν)) = Val(X) thanks to s being Hermitian. Thus, in fact, ({reals}: Val |Observed).

One observable always available to us is the identity, which commutes with
all observables. Any observation of the identity renders an answer of 1,
which expresses the above-mentioned truth that s(ν, ν) = 1, so we
include the identity among Observed as a way of encoding this
constraint. Since S, as a collection, is synonymous with its identity, we
thus have S in Observed, with Val(S) = 1. For convenience, take J = {X in
Observed: X is not S}, so that Observed = J&unite;{S} and S is not in J. Thus
J is the collection of real physical observeds.

Note that J will typically be a collection of observables which contain as much information about our system as we know how to obtain: if we identified an observable compatible with but independent of the others, we would observe it. Consequently, there's a sense in which we should expect to find little information content in the degrees of freedom remaining to the solution, given the data actually observed. This is an anthropic contribution to the situation.

We have as many degrees of freedom as we have members of dim (which is typically infinite); we have as many constraints as there are members of Observed (which is typically finite). There are typically plenty more degrees of freedom than constraints.

When we have more degrees of freedom than constraints, we can expect to
find many solutions; our system is capable of being in any of some very large
collection of actual states. When we have many solutions (especially when we
have so many that Avogadro's number doesn't seem big), the study of the
situation is called Statistical Physics. Even when only a few solutions are
available, however, we need to describe how the system chooses among the
available solutions.

The Hilbert space S is the span of all possible solutions of our
underlying dynamical system; any s-unit vector in it is a possible state of
that system. Our observation of the physical properties of the system
constrains the state to the (s-unit vectors in) a sub-space of S on which each
observable has, as eigenvalue, its observed value. Any suitably normalised
linear combination of states in this sub-space shall also be a state in this
sub-space; and, between such states, we expect there to be scattering
channels by which a solution at one moment evolves, over time, to form a
superposition of solutions, in which the original is just one possible
contributor.

The appropriate tool for doing superpositions of solutions is
a distribution (or measure) on the
solution-set, E, of our equations: formally, this is a linear map ({scalars}:
:{mappings ({scalars}: :E)}), which may be thought of as integrating over
E. We can always extend a distribution, m, to also serve as a linear map
(V: :{mappings (V::E)}) for any linear space V, by taking components in V,
applying m to the scalar functions from E that result, then using the
resulting scalars as components in V to reconstitute an answer. For a given
distribution, m, on E and any mapping ({scalars}: f :E), we can form a new
distribution on E, which I'll write as f@m, defined by:

- f@m = (: m(: f(e).g(e) ←e :E) ←(: g :E) :)
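
For the concretely-minded, here is a small Python transcription of f@m on a four-element discrete set (the weights are invented; the only point is the shape of the definition):

```python
# Pure-Python transcription of f@m on a discrete four-element E: the standard
# distribution is sum, f reweights it, and (f@m)(g) = m(: f(e).g(e) <-e :E).

E = [0, 1, 2, 3]

def msum(g):                 # the standard distribution, sum, on discrete E
    return sum(g(e) for e in E)

def at(f, m):                # f@m = (: m(: f(e).g(e) <-e :E) <-(: g :E) :)
    return lambda g: m(lambda e: f(e) * g(e))

f = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}.get    # invented weights, summing to 1
d = at(f, msum)

assert abs(d(lambda e: 1) - 1.0) < 1e-12    # total weight of f@sum
assert abs(d(lambda e: e) - 2.0) < 1e-12    # the mean of e under f@sum
```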

On a discrete set, there is a standard distribution, sum, which
simply adds together the outputs of the function: and any function f from the
discrete set to scalars thus yields a distribution f@sum. A superposition of
solutions has the form m(S: e←e |E), the integral with respect to some
distribution m of E's natural embedding in S. The available scattering
processes will imply some kind of redistribution over E, collectively
implying a flow dynamics for m whose character we know poorly save that it is
jam packed full of randomness. The basic idea of statistical physics is to
reason that the actual superposition of solutions we see will be as random
as it can get while yielding a solution (i.e. a member of E) as its
integral of (S: e ←e :E).

Now, two distributions on E may produce the same member of S as their superposition: and the form our constraints take becomes complex and unhelpful. However, each member of E must be some superposition of basis vectors (because b spans S), so any superposition of solutions to the equations is equally a superposition of b: and two distributions superposing b to the same member of S must be equal (because b is linearly independent), so the ambiguity among distributions on E presents no problem. Because b diagonalises our constraints, it will render them into a simpler form than E could offer.

We thus deal with (S:b|dim) rather than (S: e ←e|E), so the corresponding tool for superpositions becomes a distribution on (|b:), which (since b is monic) amounts to a distribution on dim. We already have a standard distribution on dim, namely sum (if dim is discrete; if continuous, we need to use integral in place of sum; but this makes no difference to the form of the discussion). As noted above, any mapping ({scalars}: f :dim) yields f@sum as another distribution on dim. [When dim is a continuum, some distributions on it might not be of form f@integral; but we suppose that distributions we actually want to use will be of this form.] We can thus identify any distribution with the ({scalars}: f :dim) for which f@sum is the given distribution; indeed, doing so is exactly the process of taking coordinates in S, using b as basis, of the superposition that results from the distribution. So, now, let's describe an arbitrary solution in terms of its co-ordinates, and see what form our constraints take.

We can use our bases, b and q, of S and dual(S) to express ν in co-ordinates: define ρ = ({complex}: q(i)·ν ←i |dim) and we obtain ν = sum(S: ρ(i).b(i) ←i |dim). Since b unit-diagonalises s, we have s(b(i)) = q(i), so we can infer s(ν) = sum(: *(ρ(i)).q(i) ←i |dim). Since b and q also real-diagonalise each X in Observed, we can define

- Ob = (: ({reals}: q(i)·X(b(i)) ←i |dim) ←X |Observed), yielding
- X = sum(: Ob(X, i).b(i)×q(i) ←i |dim) for each X in Observed, so
- X(ν) = sum(: Ob(X, i).b(i).ρ(i) ←i |dim)

whence

- Val(X)
- = s(ν, X(ν)) = s(ν)·X(ν)
- = sum(: *(ρ(i)).q(i)·X(ν) ←i |dim)
- = sum(: *(ρ(i)).Ob(X,i).ρ(i) ←i |dim)
so define

- r = (: ρ(i).*(ρ(i)) ←i |dim)
- noting that its outputs are non-negative reals, and obtain
- Val(X) = sum(: r.Ob(X) :)
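
Here is a three-state Python sketch (the co-ordinates and eigenvalues are invented) confirming that r sums to 1 and that sum(: r.Ob(X) :) agrees with the direct co-ordinate computation of s(ν, X(ν)):

```python
# Three-state sketch with invented co-ordinates rho for an s-unit state:
# r(i) = rho(i).conj(rho(i)), sum(r) = 1, and the expectation of a diagonal
# observable X agrees with sum(: r.Ob(X) :).

rho = [0.5 + 0.5j, 0.5 - 0.5j, 0 + 0j]      # sum of |rho(i)|^2 is 1
r = [(c * c.conjugate()).real for c in rho]
assert abs(sum(r) - 1.0) < 1e-12            # Val(S) = 1, the identity

ObX = [1.0, 3.0, 7.0]                       # invented eigenvalues of some X
# s(nu, X(nu)) computed directly from co-ordinates ...
direct = sum((rho[i].conjugate() * ObX[i] * rho[i]).real for i in range(3))
# ... matches Val(X) = sum(: r.Ob(X) :):
assert abs(direct - sum(r[i] * ObX[i] for i in range(3))) < 1e-12
```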

From S in Observed, as the identity, we get Ob(S) = (: 1←i |dim)
whence 1 = s(ν, ν) = sum(r), and we know r's outputs are non-negative,
so each output of r is at most 1 (at least when dim is discrete). This makes
r@sum a (candidate to be used as a) probability distribution: equally
one may think of r as indicating the proportions in which we are mixing
the base states represented by b when we build up ν (but this should not be
treated too literally, since r lacks the phase information of ρ).

For each i in dim we have b(i)×q(i) in {linear (S:|S)} and composing
it before s yields q(i)×(*&on;q(i)) which is Hermitian, so
b(i)×q(i) is an observable (in the technical sense, not necessarily a
physical one); it also commutes with each of X in Observed, since these are
diagonal. Its expected value in state ν is simply r(i), and quantum
mechanics regards it as the probability of observing a state (previously) in
state ν to be (when observed) in state b(i). In principle, at least, r is
consequently observable.

Now, r is defined in terms of b, our basis, but none of the definition of
either cares about any relationships among the labels (members of dim):
so we should likewise consider all members of dim independently. Our
constraints would complicate that, but
the optimisation technique I'll be
using allows us to ignore the complication until we've done the
optimisation.

We need a quantifiable notion of how random r is: information theory comes to
the rescue, with sum(: r.log(1/r) :), that is sum(: r(i).log(1/r(i)) ←i
:). This is the information-theoretic entropy of the distribution described by
r. The probability of getting distribution r at random depends on r
only via its entropy, and increases with it. [Aside: log(x)/x tends to zero
as x tends to infinity (because log grows slower than k.x for any positive k),
so t.log(1/t) does likewise as t tends to zero (from above), so treat
0.log(1/0) as 0 if r takes the value 0 anywhere.] So we can use f(r) =
sum(r.log(1/r)) as the function we aim to maximise, within our
constraints.
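
As a throw-away Python transcription of this measure (the only subtlety being that 0.log(1/0) is read as 0, as just discussed):

```python
import math

def entropy(r):
    # sum(: r(i).log(1/r(i)) <-i :), reading 0.log(1/0) as 0, as in the text
    return sum(0.0 if t == 0 else t * math.log(1 / t) for t in r)

# A distribution concentrated on one base state carries no randomness ...
assert entropy([1.0, 0.0, 0.0]) == 0.0
# ... while a uniform one over n states maximises the entropy, at log(n).
assert abs(entropy([0.25] * 4) - math.log(4)) < 1e-12
```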

Our constraints are (: sum(r.Ob(X)) ←X |Observed) = Val, which has
the form g(r) = given required to use a standard optimisation
technique. Introduce K = transpose(Ob) = (: (: Ob(X, i) ←X |Observed)
←i |dim), so K(i,X) = Ob(X,i) [thus K(i) is the value Val would have
if ν were b(i)], and obtain

- g(r)
- = (: s(ν, X(ν)) ←X |Observed)
- = (: sum(: r(i).K(i,X) ←i |dim) ←X |Observed)
- = sum(: r(i).K(i) ←i |dim)
- = sum(r.K) in {mappings ({reals}: |Observed)}

We'll be looking for some r for which f'(r) and h&on;g'(r) are
parallel for some linear ({reals}: h :{mappings ({reals}: :Observed)}), which
is a distribution on Observed. Given that Observed is finite (so that we can
sum over it, for instance), we have a measure called sum on it already, and we
can express any measure on Observed as a scalar function by which
to weight its members when summing over them. This leads us to some
({reals}: H |Observed) equivalent to h as follows:

- H = ({reals}: h({1}:|{X}) ←X :Observed), where ({1}:|{X}) maps 1←X and effectively maps any other U in Observed to 0.
- h = H@sum = (: sum(: H(X).v(X) ←X :Observed) ←v :)

Any sufficiently small change, d, in r produces a change in g of near enough g'(r)·d, which must be zero if g(r) and g(r+d) both have the correct value. At a stationary point (e.g. any maximal or minimal point), no allowed displacement changes the value of f, which requires that every d with g'(r)·d zero also has f'(r)·d likewise zero. This implies that f'(r) can be factorised as −h&on;g'(r) for some h, as above (I could equally have used f'(r)=h&on;g'(r), but we have a choice here and f'(r)+h&on;g'(r)=0 gives H positive when we work out the details).

Now, (: t.log(1/t) = −t.log(t) ←t :)' = (: −1 + log(1/t) ←t :) so differentiation with respect to r(i), for any given i in dim, turns f(r) = sum(r.log(1/r)) into −1+log(1/r(i)) and g(r) = sum(r.K) into K(i) = (: Ob(X,i) ←X :). We have h&on;g'(r) = h&on;K = (: h(K(i)) = sum(: H(X).Ob(X,i) ←X :) ←i :) = sum(H.Ob), so we obtain 1+log(r(i)) = sum(: H(X).Ob(X,i) ←X :) or

- log(r(i)) = −1 + H(S).Ob(S,i) + sum(: H(X).Ob(X,i) ←X :J).

Now, S is the identity, and each s(b(i), b(i)) is 1, so Ob(S, i) is always 1 and we have log(r(i)) = H(S)−1 + (sum(:H.Ob:J))(i), so the constraint given as S becomes 1 = sum(r) = exp(H(S)−1).sum(: (exp&on;sum)(:H.Ob:J) |dim), whence exp(1−H(S)) = sum(: exp&on;sum(:H.Ob:J) |dim), so we are able to express one of our constraints as a constraint on H(S), in terms of the remaining (:H:J) independent Observeds.

Pause to consider whether we got a maximum: differentiating f with respect to first r(i) then r(n), for n, i in dim, we get r(n)'s derivative of −(1+log(r(i))), which is zero unless i=n, when it is simply −1/r(i), which is negative; g's derivative is independent of r, so its second derivative (hence that of h&on;g) is zero. Thus f+h&on;g has negative definite second derivative with respect to r, implying that our stationary point was a maximum (when you move far enough away from r to get a change in f+h&on;g, it'll be negative no matter what direction you took), which is what we wanted.

Introduce

- ({mappings ({reals}: |dim)}: zed :{mappings ({reals}: :J)}) defined by log(zed(H,i)) = sum(: H(X).Ob(X,i) ←X :J) = sum(: H.K(i) :J), and
- ({reals}: Y :{distributions on J}) defined by Y(h) = log(sum(: zed(H) :dim)).

Our solution now becomes r(i) = zed(H,i) / exp(Y(h)), with both zed and Y ignoring H(S) so, in effect, H is only defined ({scalars}:|J) from here onwards; and h ignores the S = 1 component of Val. Pause, now, to consider the derivative of zed and, thereby, of Y.
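
In co-ordinates, then, the answer is an exponential-family (Gibbs) weighting: each zed(H,i) exponentiates a weighted sum of the observed quantities' eigenvalues, and exp(Y(h)) normalises. A Python sketch, with invented eigenvalue lists Ob(X) and invented multipliers H:

```python
import math

# Pure-Python sketch: r(i) = zed(H, i) / exp(Y(h)) over a three-member dim,
# with invented eigenvalue lists Ob(X, i) and invented multipliers H(X).

Ob = {"X1": [0.0, 1.0, 2.0], "X2": [1.0, -1.0, 0.0]}   # hypothetical Ob(X, i)
H = {"X1": -0.3, "X2": 0.7}                            # hypothetical H(X)

# log(zed(H, i)) = sum(: H(X).Ob(X, i) <-X :J)
zed = [math.exp(sum(H[X] * Ob[X][i] for X in Ob)) for i in range(3)]
Y = math.log(sum(zed))                                 # exp(Y(h)) = sum(zed)
r = [z / math.exp(Y) for z in zed]

assert abs(sum(r) - 1.0) < 1e-12   # the identity (S) constraint: sum(r) = 1
assert all(t > 0 for t in r)       # each weight is a positive real
```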

For any X in J, differentiation with respect to H(X) turns log&on;zed(H) = (: sum(:H.K(i):J) ←i |dim) into (: K(i,X) ←i |dim) = Ob(X), hence zed(H) into Ob(X).exp&on;sum(:H.Ob:J). Consequently, exp(Y(h)) = sum(: zed(H) |dim) has derivative, with respect to H(X), sum(: Ob(X).exp&on;sum(:H.Ob:J) |dim). Now, from g(r) = Val,

- Val(X)
- = g(r,X)
- = sum(: r(i).Ob(X,i) ←i |dim)
- = sum(: Ob(X,i).exp(sum(:H.K(i):J)) ←i |dim) / exp(Y(h))
- = sum(: Ob(X).exp&on;sum(:H.Ob:J) |dim) / exp(Y(h))

This just divides our derivative of exp&on;Y by exp&on;Y, which is thus the simple derivative of Y with respect to H(X), so (:Val:J) is Y's derivative. Thus

- Y factorises the degree of freedom taken out by (the piece of information expressed by) S as identity via the remaining degrees of freedom in h: thus, in effect, Y expresses the information that sum(r) = 1, i.e. s(ν, ν) = 1.
- How Y varies with the remaining degrees of freedom in h suffices to express all the remaining information.
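
This derivative property is easy to check numerically; the following Python sketch (with the same sort of invented Ob and H as before) compares a central finite difference of Y against Val(X) = sum(: r.Ob(X) :):

```python
import math

# Central-difference check (pure Python, invented data) that the derivative
# of Y with respect to H(X) reproduces Val(X) = sum(r.Ob(X)).

Ob = {"X1": [0.0, 1.0, 2.0], "X2": [1.0, -1.0, 0.0]}   # hypothetical Ob(X, i)

def Y(H):
    # Y(h) = log(sum(: zed(H) :dim)) with log(zed(H, i)) = sum(: H.K(i) :J)
    return math.log(sum(math.exp(sum(H[X] * Ob[X][i] for X in Ob))
                        for i in range(3)))

H = {"X1": -0.3, "X2": 0.7}                             # hypothetical H
zed = [math.exp(sum(H[X] * Ob[X][i] for X in Ob)) for i in range(3)]
r = [z / sum(zed) for z in zed]

eps = 1e-6
for X in Ob:
    Hp = dict(H); Hp[X] += eps
    Hm = dict(H); Hm[X] -= eps
    dY = (Y(Hp) - Y(Hm)) / (2 * eps)                    # numerical dY/dH(X)
    assert abs(dY - sum(r[i] * Ob[X][i] for i in range(3))) < 1e-6
```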

We have exp(Y(h)) = sum(: zed(H) |dim) and, for each X in J ⊂ Observed, X = sum(: Ob(X, i).b(i)×q(i) ←i |dim) so, reading J as the (linear) identity mapping on J,

- h(J) = sum(:H.J:)
- = sum(: H(X).X ←X :J)
- = sum(: H(X).sum(: Ob(X,i).b(i)×q(i) ←i |dim) ←X :J)
- = sum(: sum(: H(X).Ob(X,i) ←X :J).b(i)×q(i) ←i |dim)
- = sum(: log(zed(H,i)).b(i)×q(i) ←i |dim),

which is diagonal with respect to our basis, hence formally an
observable. In particular, we can define exp() on {linear (S:|S)}, and it
maps one diagonal mapping to another, mapping each diagonal entry to its own
exp() as a scalar: so exp(sum(:H.J:)) is just sum(: zed(H,i).b(i)×q(i)
←i |dim) = sum(: zed(H).b×q :dim) and its trace is exp(Y(h)). Let
Z = exp(sum(:H.J:)) = exp(h(J)) and observe that Z is defined independently of
our choice of basis, b (albeit we *compute* it using that basis) and is
an observable compatible with all those in Observed. We have Y =
log&on;trace&on;Z so, again, Y doesn't depend on our choice of basis (whereas
zed manifestly does – it's the co-ordinates of Z wrt b, q).
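
A minimal Python rendering of that last step, with invented diagonal entries for h(J): exp maps the diagonal of h(J) entry-wise to give Z, and the trace of the result is exp(Y(h)):

```python
import math

# Sketch: h(J) is diagonal wrt b, with entries log(zed(H, i)) (invented
# here); exp maps it entry-wise to Z, and trace(Z) = exp(Y(h)).

logzed = [0.4, -0.1, 1.2]                    # hypothetical diagonal of h(J)
Z = [[math.exp(logzed[i]) if i == j else 0.0 for j in range(3)]
     for i in range(3)]                      # exp of a diagonal linear map
traceZ = sum(Z[i][i] for i in range(3))
Y = math.log(traceZ)
assert abs(math.exp(Y) - sum(math.exp(t) for t in logzed)) < 1e-12
```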

It is interesting, now, to look at the final value of the function whose maximum we've found. That's

- f(r)
- = sum(r.log(1/r))
- = sum(: r(i).(log(1/zed(H,i)) + Y(h)) ←i |dim)
- = Y(h).sum(r) −sum(: r(i).log(zed(H,i)) ←i |dim)
- = Y(h) −sum(: r(i).sum(: H(X).Ob(X, i) ←X :J) ←i |dim)
- = Y(h) −sum(: H(X).sum(: r(i).Ob(X, i) ←i |dim) ←X :J)
- = Y(h) −sum(: H(X).Val(X) ←X :J)
- = Y(h) −h(Val)

which is manifestly independent (because Y and Z are) of our choice of simultaneous eigenbasis for our observables, even though f was defined in terms of that basis: the final value depends only on the observables and their values. Furthermore, h(Val) = h·Val = h·Y'(h), so Y(h) −h(Val) is just the estimate we would make, by linear extrapolation from Y's value and derivative at h, of the value of Y(zero).
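
The identity f(r) = Y(h) − h(Val) can likewise be checked mechanically; a Python sketch in the same style of invented data:

```python
import math

# Check, on invented data, that the maximised sum(r.log(1/r)) equals
# Y(h) - h(Val) = Y(h) - sum(: H(X).Val(X) <-X :J).

Ob = {"X1": [0.0, 1.0, 2.0], "X2": [1.0, -1.0, 0.0]}   # hypothetical
H = {"X1": -0.3, "X2": 0.7}                            # hypothetical

zed = [math.exp(sum(H[X] * Ob[X][i] for X in Ob)) for i in range(3)]
Y = math.log(sum(zed))
r = [z / sum(zed) for z in zed]
Val = {X: sum(r[i] * Ob[X][i] for i in range(3)) for X in Ob}

f = sum(t * math.log(1 / t) for t in r)                # the entropy at r
assert abs(f - (Y - sum(H[X] * Val[X] for X in Ob))) < 1e-12
```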

Thus far the only time I've needed many of anything is hidden inside how I
assess the randomness of the function r. The derivation of the measure of
randomness, as sum(r.log(1/r)), involved making up r of a large number of
dollops. Its validity may involve more measure theory than I know, but the
dollops of which it needs large numbers are purely a fiction of the
construction: they make no pretence at having anything at all to do with
anything physical. Any distribution which arises as a result given one number
of dollops also arises for at least some larger number of dollops (e.g. any
multiple of the original number); indeed the distributions made of many
dollops are taken to be dense among the possible distributions. Since, for
these distributions at least, we know the probability will depend on the
value of r only via sum(r.log(1/r)), we infer that any probability
distribution we manage to infer on the space of (: distributions :{mappings
({scalars}:|dim)}) will depend on position, r, in this space, only via the
same sum: which may be evaluated without reference to how many dollops were
involved. So the only large number needed thus far was used in an abstract
context to justify using a measure of randomness which can be evaluated
without reference to the large number in question.

Thus the conclusions derived here apply to any quantum system, even if it consists only of one molecule or, indeed, a single atom. In so far as its internal dynamics permit it various states with its observed properties, we can expect it to be in an entropy-maximising superposition, of these states, of the kind described here.

Suppose we add a new observable, M, to J, but that it's entirely determined by the other observables, i.e. it's expressible as a function of the observables already in J. Intuitively, this shouldn't make any difference; so let's see what the theory predicts. Let I be the set of independent X in J, excluding M, and let μ be the function which computes M from the observables in I; so K(i, M) = μ(:K(i):I) for each i in dim, and we can reasonably expect the matching result for its expectation, Val(M) = μ(:Val:I).

We have Val(X) = ∂Y/∂H(X) = ∂(log(sum(: exp(sum(H.K(i))) ←i :dim)))/∂H(X) and we're adding an extra term to the sum(H.K(i)), H(M).K(i,M). Our partial differentiation formally treats H(M) as being independent of (:H:I), so this won't change Val(X) for X in I; but shall imply a value for Val(M) which needs to agree with what μ(:Val:I) returns. Each K(i) gets a new component, K(i, M), which can be expressed as μ(:K(i):I); we get H(M) as multiplier for this new component. So, differentiating formally with respect to H(M) while formally keeping all other H(X) constant, we expect μ(:Val:I) to be equal to:

- Val(M)
- = ∂(log(trace(exp(sum(:H.J:)))))/∂H(M)
- = ∂(log(sum(: exp(sum(: H.K(i) :J)) ←i :dim)))/∂H(M)
- = ∂(log(sum(: exp(sum(: H.K(i) :I) + H(M).μ(: K(i) :I)) ←i :dim)))/∂H(M)
- = sum(: μ(: K(i) :I).exp(sum(H.K(i))) ←i :dim)/exp(Y)
- = sum(: Ob(M, i).q(i)·Z(h)·b(i) ←i :dim)/trace(Z(h))
- = trace(M·Z(h))/trace(Z(h))
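
Even unfinished, that final line is easy to sanity-check in Python with invented eigenvalues: for M and Z both diagonal wrt b, trace(M·Z)/trace(Z) is just the r-weighted sum of M's eigenvalues:

```python
import math

# Sketch of the last line above: with hypothetical eigenvalue lists for M and
# for h(J), form the diagonal matrices M and Z = exp(h(J)) and check that
# trace(M.Z)/trace(Z) matches sum(: r.Ob(M) :) with r = zed/sum(zed).

ObM = [2.0, 0.0, 5.0]                        # eigenvalues of M (made up)
logzed = [0.4, -0.1, 1.2]                    # diagonal entries of h(J)
zed = [math.exp(t) for t in logzed]

Z = [[zed[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
M = [[ObM[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
MZ = [[sum(M[i][k] * Z[k][j] for k in range(3)) for j in range(3)]
      for i in range(3)]

expect = sum(MZ[i][i] for i in range(3)) / sum(Z[i][i] for i in range(3))
r = [z / sum(zed) for z in zed]
assert abs(expect - sum(r[i] * ObM[i] for i in range(3))) < 1e-12
```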

*Unfinished*…

What's shown above is that:

- given a system, a family J of observables of that system and the values taken by those observables, ({reals}: Val :J), in some state ν of that system, we can
- define a mapping ({observables}: Z |{distributions on J}) by Z(h) = exp(h(J)) – where h(J) is the integral of the identity J = ({observables}: X←X :J) using the distribution h, as routinely extended to the linear space {observables} – and infer a mapping ({reals}: Y |{distributions on J}) given by Y(h) = log(trace(Z(h))), for which
- for any distribution, h, on J for which Val = Y'(h) and any basis b (with dual q) of eigenstates of all our observables J, the most likely state for our system to take, given Val as its values on J, is
- ν = sum(: ρ.b |dim) for some ({scalars}: ρ |dim) for which, for each i in dim,
- r(i) = ρ(i).*(ρ(i)) = q(i)·Z(h)·b(i) / trace(Z(h))

Notes:

- r is entirely determined by Z(h)/trace(Z(h)), which is an observable with trace 1.
- another way to encode the system's state is via the operator Ν = ν×s(ν) = sum(: ρ(i).b(i)×q(j).*ρ(j) ←[i,j] :), which isn't diagonal; but its diagonal entries are the co-ordinates of Z(h)/trace(Z(h)), with respect to our basis, b.
- for any observable Q, its expected value in state ν is s(ν, Q(ν)) = sum(: *(ρ(i)).q(i)·Q·b(j).ρ(j) ←[i,j] :) = trace(Ν·Q), and if Q is diagonal wrt b this is just trace(Z·Q)/trace(Z).
- the information-theoretic entropy is sum(r.log(1/r)) = Y(h) −h(Val) = Y(h) −h·Y'(h)
- If we add a new member to J which can be inferred from the existing observables in J, it shouldn't change ν, b or q; hence it shouldn't change r; hence nor should it change Z(h)/trace(Z(h)). To do: show that this actually holds true!
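
As a partial check on that to-do, here is a Python sketch of the special case where μ is linear, say M = 2.X1 + X2 for two hypothetical observables (everything invented): any multiplier t given to M can be absorbed by shifting H(X1) by −2t and H(X2) by −t, leaving every zed(H,i), hence r = Z/trace(Z), unchanged.

```python
import math

Ob = {"X1": [0.0, 1.0, 2.0], "X2": [1.0, -1.0, 0.0]}     # hypothetical Ob(X, i)
ObM = [2 * Ob["X1"][i] + Ob["X2"][i] for i in range(3)]  # M = 2.X1 + X2

def dist(h1, h2, hm):
    # r from multipliers h1, h2 for X1, X2 and hm for the redundant M
    zed = [math.exp(h1 * Ob["X1"][i] + h2 * Ob["X2"][i] + hm * ObM[i])
           for i in range(3)]
    return [z / sum(zed) for z in zed]

base = dist(-0.3, 0.7, 0.0)            # no use made of M at all
for t in (0.1, -0.5, 2.0):
    shifted = dist(-0.3 - 2 * t, 0.7 - t, t)
    assert all(abs(a - b) < 1e-12 for a, b in zip(base, shifted))
```

This only shows the linear special case; the general (nonlinear μ) claim remains the to-do above.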