Recording imprecise information in `python`

I maintain a package of python software for use in scientific study. I'm interested less in numerical precision than in tracking the known errors in experimental data. I also care greatly about tracking the units of measurement associated with a quantity (and like to be able to receive and supply data in arcane systems of units; the package includes information on several of these). The currently usable version of the package has its own page and contains its own documentation; this page exists to track my thoughts on how to do the job better.

My existing code uses a lazy attribute infrastructure I developed some time ago; python 2.2 (or thereabouts) introduced a nicer solution to that problem, which I now use for new code: but older code is unlikely to catch up without a major re-write. Then again, a major re-write is in order – the lessons I've learned during development of the present package give me plenty of other things I want to change. Other features in python, added since about 2.0, are likely to also be relevant to what I can do in the re-write; I'm a bit out of date as to the available features.

Separating specification from observation

Most objects constructed by the package describe physical objects and/or their properties. Their construction generally involves providing observed data for them. However, I have data from several sources and sporadically find new data from new sources. This necessitates adjusting the constructor or using the .observe() method of some values; it also complicates the business of keeping track of where I found which data. A better solution would be to create the object, attaching to it only the information which defines what the object is; and then to have source-specific methods that supply data about the object.
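A minimal sketch of that separation, under stated assumptions: the class `Body`, its `sources` method and the two loader functions are hypothetical names invented for illustration (only the idea of an `.observe()` method comes from the existing package), and the numbers are placeholders rather than real data.

```python
class Body:
    """Holds only what defines the object; observations are attached later."""

    def __init__(self, name):
        self.name = name
        self.observations = {}  # attribute name -> list of (source, value, error)

    def observe(self, attribute, value, error, source):
        """Record one observed value, remembering where it came from."""
        self.observations.setdefault(attribute, []).append((source, value, error))

    def sources(self, attribute):
        """List the sources that have supplied data for this attribute."""
        return [source for source, _, _ in self.observations.get(attribute, [])]


def load_source_a(body):
    # Placeholder numbers, purely for illustration.
    body.observe("mass", 10.0, 0.5, source="source A")


def load_source_b(body):
    # A second source contributes without touching the constructor.
    body.observe("mass", 10.2, 0.1, source="source B")


thing = Body("example object")
load_source_a(thing)
load_source_b(thing)
print(thing.sources("mass"))
```

Each new source then becomes one more loader function, and the record of where each datum came from is kept alongside the datum itself.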

At the same time, it would make sense for the fine structure constant to know that it is defined by a certain formula combining Dirac's and Millikan's quanta with some properties of the electromagnetic field in vacuum. Crucially, it would know this as a specification, rather than simply being computed by supplying values for the assorted quantities in the given formula and computing alpha. Knowing it as a specification would enable it to compute that value, if the need should arise; it might also enable an expression using it to determine a kindred specification for its own value, possibly eliminating some less precisely-known terms from the formulae the expression combines.
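One possible shape for such a specification, sketched under assumptions: the class `Spec` and its methods are hypothetical, and a real version would need to carry units and error bars as well. It records a product of named quantities raised to integer powers, so that combining two specifications can cancel terms before any number is computed. The formula used is the usual one, alpha = e² / (4 π ε₀ ħ c).

```python
from fractions import Fraction


class Spec:
    """A product of named quantities raised to integer powers, times an exact factor."""

    def __init__(self, factor=1, **powers):
        self.factor = Fraction(factor)
        # Drop any term whose power has cancelled to zero.
        self.powers = {name: p for name, p in powers.items() if p != 0}

    def _combine(self, other, sign):
        powers = dict(self.powers)
        for name, p in other.powers.items():
            powers[name] = powers.get(name, 0) + sign * p
        return Spec(self.factor * other.factor ** sign, **powers)

    def __mul__(self, other):
        return self._combine(other, +1)

    def __truediv__(self, other):
        return self._combine(other, -1)

    def evaluate(self, values):
        """Compute a number only when one is actually needed."""
        result = float(self.factor)
        for name, p in self.powers.items():
            result *= values[name] ** p
        return result


# alpha = e**2 / (4 * pi * eps0 * hbar * c), held as a specification.
alpha = Spec(Fraction(1, 4), e=2, pi=-1, eps0=-1, hbar=-1, c=-1)

# Combining specifications lets e, hbar and c cancel before any numbers appear.
combined = alpha * Spec(hbar=1, c=1) / Spec(e=2)
print(combined.factor, combined.powers)
```

Here `combined` no longer mentions e, ħ or c at all, so any imprecision in those quantities cannot leak into its value.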

This approach should also be applicable to quantities such as non-canonical units of measurement – the unit is what it is and has defined relationships with certain other units, but we should not convert it to some standard unit (e.g. SI) until we need to do so. A unit may even (e.g. the electron Volt) have an error bar on its value, when expressed in terms of some other unit (e.g. the Joule). A quantity's value may be known more precisely in electron Volts than in Joules; when it is, some uses of the quantity may indeed eliminate the electron Volt (e.g. by dividing by some other quantity measured in that unit), leading to highly precise answers if the computation is done using the electron Volt as unit. However, if everything is coerced down to some standard units (as, at present, everything is coerced to SI), the error bar on the electron charge (hence on the electron Volt as unit of energy) gets applied twice, adding an error bar due to the units to a quantity which is actually known more precisely than the unit.
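A small worked illustration of the double counting, under assumptions: the relative errors below are made-up round numbers, and the errors are combined in quadrature as if independent, which is a simplification and not the histogram model the package actually uses. The helper name `rel_err_of_product` is hypothetical.

```python
import math


def rel_err_of_product(*rel_errors):
    """Relative error of a product or ratio, adding independent relative errors in quadrature."""
    return math.sqrt(sum(r * r for r in rel_errors))


r_a = r_b = 1e-9  # illustrative: each energy known to one part in 1e9, in electron Volts
u = 1e-7          # illustrative: the electron Volt known to one part in 1e7, in Joules

# Ratio taken directly in electron Volts: the unit cancels, so its error never enters.
direct = rel_err_of_product(r_a, r_b)

# Each value coerced to Joules first: the unit's error bar is applied twice.
via_si = rel_err_of_product(r_a, u, r_b, u)

print(direct, via_si)
```

With these numbers the ratio computed via SI comes out roughly a hundred times less precise than the same ratio computed in the original unit.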

Units of measurement

At present, units are defined individually, in terms of SI, and quantities display themselves by reducing to SI's base units. I want to define systems of units; provide for run-time selection of which system of units quantities should use when displaying; and provide for the system to select a sensible unit for the quantity, based on what kind of quantity it is.

Configuration

At present, when a quantity displays itself, it does so in terms of a standard system of units. Given a framework for systems of units, it would be practical for the package to have a configuration attribute, specifying a system of units, that quantities consult when displaying themselves. It would also make sense for quantities to provide a method (via which support of the foregoing would be implemented) which takes a system and returns the correct representation for that system.
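A rough sketch of how the configuration attribute and the per-system method might fit together. Everything here is an assumption made for illustration: the names `System`, `Config`, `Quantity` and `represent` are hypothetical, and a system is reduced to a bare mapping from kind to one unit and its size in SI.

```python
class System:
    """A toy system of units: maps each kind of quantity to (unit name, size in SI)."""

    def __init__(self, name, units):
        self.name = name
        self.units = units


class Config:
    """Package-wide configuration; display names the system used by default."""
    display = None


class Quantity:
    def __init__(self, si_value, kind):
        self.si_value = si_value
        self.kind = kind

    def represent(self, system):
        """Return this quantity's representation in the given system."""
        unit_name, size = system.units[self.kind]
        return "%g %s" % (self.si_value / size, unit_name)

    def __str__(self):
        # Default display simply consults the configured system.
        return self.represent(Config.display)


SI = System("SI", {"length": ("metre", 1.0)})
CGS = System("cgs", {"length": ("centimetre", 0.01)})

q = Quantity(2.5, "length")
Config.display = SI
print(q)                 # 2.5 metre
print(q.represent(CGS))  # 250 centimetre
```

Default display is then just the explicit method applied to whatever system is currently configured, so changing `Config.display` at run time changes how every quantity shows itself.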

There would be some sense in having the configured system for display purposes be a bit more sophisticated than a typical system; for example, even when the UK imperial units are in use, astronomical distances should be expressed in terms of the AU, parsec and/or light year rather than stupidly big numbers of miles (or leagues). To get the answer actually in the UK system, it would be necessary to directly use a quantity's method for doing that, rather than relying on its default display semantics.

When adding quantities expressed in different units of the same kind, particularly when the conversion factor between them carries an error bar, it shall be necessary to coerce them into a common unit. To some degree it shall make sense to use whichever of their units yields an answer with the smaller error bar; however, particularly when the choice makes little difference (but equally when we're going to be adding the result to several more things, so that re-arranging the summation to group things using common units might change the choice of unit) it may be desirable to configure which units to use for such internal purposes. This, again, would call for a system of units, not necessarily the same one as is used for display, that can be configured at run-time.

Systems and families

A system shall specify a mapping from kind of quantity to family of units; the units in a family are related to one another by specific scaling factors. Some families shall be generic, e.g. the thumb, foot, stride, embrace family exists in several variants (three Scandic forms, UK/US (thumb is inch, stride is yard, embrace is fathom, from scandic favn) and French) but always with the same factors between members. Some families shall be compound, e.g. combining two other families, typically generic, and perhaps extending them using some specific units. Different uses of a family (particularly when generic) may anchor that family to others in different ways – e.g. the gallon, quart, pint family is anchored in the UK system by specifying the mass of a gallon of water under certain conditions; while the US variant defines the gallon to be 231 cubic inches – or by providing a specification which anchors a different member of the family, with the other members' values being implied by scaling from the anchored member.

Two systems may use the same family of units for some kind of quantity, but each might select a different member as standard unit (e.g. SI uses metre and kilogramme where cgs uses centimetre and gramme). A system may be specified in terms of a base set of independent units yet provide names for assorted derived units. A system must be able to work out a sensible unit to use for kinds which don't have their own particular names; it should use the provided names where these simplify the unit, rather than falling back on its base units just because the given kind has no named unit. If the Volt were not named, it'd be better to display it as Joule/Coulomb than as metre**2 * kilogramme/Ampere / second**3 (as the present implementation does).

A system would tell an object the family of units from which to select how to describe itself; the family would have to select which member to use for the quantity's size. In SI, this would usually just amount to selecting a quantifier, although the mass family would need to quantify gram for quantities below about 100g and tonne for quantities above about 100kg; archaic units like a thumb, foot, stride, embrace family would have to select the right member of that family for describing some particular quantity. Ideally (but I'm unlikely to stretch this far) archaic units could stipulate the format used for representing the numeric part of a quantity's representation – e.g. using fractions rather than decimals. When a system generates a derived unit, it may sensibly want to incorporate quantifiers into the members in more complex ways than quantifier times base unit for the kind; e.g. the price of hard disks today would be more sensibly expressed in dollars per gigabyte rather than in nano dollars per byte.
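As a sketch of a family choosing its member, here is the UK/US form of the thumb, foot, stride, embrace family (inch, foot, yard, fathom). The selection rule, namely taking the largest member that does not exceed the quantity, is an assumption chosen for simplicity, and the names `FAMILY` and `choose_member` are hypothetical.

```python
# Each member with its size as a multiple of the smallest member (the inch).
FAMILY = [("inch", 1), ("foot", 12), ("yard", 36), ("fathom", 72)]


def choose_member(size_in_inches, family=FAMILY):
    """Pick the largest member not exceeding the size; fall back on the smallest."""
    name, scale = family[0]
    for candidate, candidate_scale in family:
        if size_in_inches >= candidate_scale:
            name, scale = candidate, candidate_scale
    return size_in_inches / scale, name


print(choose_member(30))   # (2.5, 'foot')
print(choose_member(144))  # (2.0, 'fathom')
print(choose_member(0.5))  # (0.5, 'inch')
```

A generic family could share this logic across its variants, since only the anchoring of one member to other units differs between them.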

Some systems may choose to deem dimensionless some kinds of quantity for which other systems specify units; for example, in general relativity work, it makes sense to deem velocity to be dimensionless (and the speed of light to be one); and I treat angles as dimensioned (distinguishing between units turn and radian) while SI treats them as dimensionless (treating the radian as one). This complicates some parts of selecting the right derived unit for describing a kind of quantity, since the kind is only determined modulo dimensionless kinds; thus treating the speed of light as a dimensionless unit allows eV (the electron Volt) to be used as a unit of mass, even though it's formally a unit of energy. It also implies a need for coercion of data when receiving input from some contexts; if data is supplied by a source using SI, for example, it is apt to omit angular units (presuming radian) and thus supply values of the wrong kind, which the input mechanism should coerce to the right kind by exploiting quantities the source deems dimensionless.

Peculiarities

Money presents problems all its own (which'll complicate derived units involving it, of course; otherwise, I can't think of anything with the same complication): it comes in a plethora of units, but they aren't transparently interconvertible; the conversion factors vary from day to day and aren't (quite) symmetric; and there is no primitive unit, nor do most systems of units specify a unit for money.

It'll be desirable to have a particular kind of system of units, along the line of Planck's units, which takes some quantities, deems them to be base units for their respective kinds and infers units for all other kinds from them. This implies trivial families (each with only one unit in it, name defaulting to the name of the kind of quantity) and quantities specified by a simple number times the implied unit of the relevant kind. Such a system may also deem some kinds of quantity dimensionless, as usual.

Most obviously when observed values have been supplied by several sources, a quantity may have its value expressed in several distinct units; when adding it to others it may make sense to compute the resulting quantity as one with its value expressed as several variants, expressed relative to the given units. Likewise, when combining values in other ways, each of which might know several estimates of its values, it may make sense to compute several separate estimates of the combination. This may complicate arithmetic quite significantly!

Compound systems, using units from several families, require some care in the handling of quantifiers. For example, if display semantics use foot, pound, minute, byte and Coulomb (say) we've got SI quantifiers for charge, [12, 3, 22, 10, 8, 3] for length, […, 1000, 60, 60, 24, 7, 20871/400, 1000, …] for time and the powers of 1024 for bytes. Any quantifier can only sensibly appear to the same power as its unit is appearing in the answer, and we'll be trying to make the number out front fall in the range 1 to 10 or at worst a factor of 10 outside; but that still leaves us with lots of scope for choices.

When computing several values of the same kind, it would be nice if the system were to use a common unit for them, to make comparisons easier; this would mean having numeric parts of representations use a wider range than the 0.1 to 100 range I normally aim for. However, persuading the system to do this in a sensible way is not necessarily easy. It may make more sense to have the context, that wants to display several values, select its chosen units and directly request formatting relative to it.

Several kinds of quantity may be dimensionally equivalent, yet worth having separate objects to describe them – e.g. albedo is dimensionless, but deserves an object to document what it means (and incidentally express the fact that it's dimensionless). This can be implemented as an object with doc-string, borrowing from the dimensionless kind object so as to behave the same as it, aside from the bits it elects to over-ride.
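One way this borrowing might be sketched is by delegation: the named kind carries its own name and doc-string and hands every other attribute lookup to the kind it borrows from. The classes `Kind` and `Borrowed` are hypothetical, as is the representation of a kind by a mapping of base-unit powers.

```python
class Kind:
    """A kind of quantity, characterised here by its powers of base kinds."""

    def __init__(self, name, doc, **powers):
        self.name = name
        self.__doc__ = doc
        self.powers = powers

    def is_dimensionless(self):
        return not any(self.powers.values())


class Borrowed:
    """Behaves like its base kind, aside from whatever it over-rides."""

    def __init__(self, base, name, doc):
        self._base = base
        self.name = name
        self.__doc__ = doc

    def __getattr__(self, attr):
        # Only reached when normal lookup fails, so over-rides take precedence.
        return getattr(self._base, attr)


dimensionless = Kind("dimensionless", "Pure numbers.")
albedo = Borrowed(dimensionless, "albedo",
                  "Fraction of incident light that a surface reflects.")

print(albedo.name, albedo.is_dimensionless())
```

The albedo object thus documents what it means while still answering every dimensional question exactly as the dimensionless kind does.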

Quantifying imprecision

My approach to describing imprecision is to use a distribution to model the (probability or) likelihood distribution of a value. The present implementation uses piecewise uniform distributions (i.e. histograms) and a rather crude model for inferring such a distribution for the product, ratio, sum or difference of two values thus expressed. It may be worth improving on this. Ideally, the distribution would encode a Bayesian likelihood distribution and combining two would follow rules inferred from those semantics.
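To make the idea of combining histograms concrete, here is a deliberately crude sketch; it is not the model the package uses, and the representation (a list of `(low, high, weight)` bands) and function names are assumptions. Each pair of bands from the two inputs yields one uniform band in the sum, which ignores the fact that the true sum of two uniform bands is not itself uniform.

```python
def add_histograms(a, b):
    """Crude sum of two piecewise uniform distributions, each a list of (low, high, weight)."""
    return [(la + lb, ha + hb, wa * wb)
            for la, ha, wa in a
            for lb, hb, wb in b]


def mean(hist):
    """Mean of a piecewise uniform distribution whose weights sum to one."""
    return sum(w * (lo + hi) / 2 for lo, hi, w in hist)


x = [(0.0, 1.0, 0.5), (1.0, 2.0, 0.5)]  # evenly spread over 0 to 2
y = [(10.0, 11.0, 1.0)]                 # evenly spread over 10 to 11

total = add_histograms(x, y)
print(total)
print(mean(total))  # 11.5, the sum of the two means
```

The resulting bands overlap one another, so a usable version would need to re-bin them; the sketch only shows that the weights and the mean come out as they should.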

Handling of tensor quantities

I deem vectors and covectors to be particular cases of tensors, but care about distinguishing ranks of tensors even when they have equal dimension; you can't contract a vector (directly) with a vector, but you can contract it with a co-vector. There are also issues with the units of quantities; one can reasonably have a vector whose individual components are measured in distinct units (e.g. three space-like components of a 4-vector measured in metres, with a time-like component in seconds), yet it shall be more usual to want to deal with a dimensionless vector bound to units independent of component.

Dealing with error information in this case is significantly more complex: one may have a quantity whose direction is known exactly, but its scale only imprecisely; or vice versa; or one component may be known exactly, and the others imprecisely. Thus, converting between direction-and-magnitude form and cartesian form may damage our knowledge of error information.

Multiplication and division of tensors add their own complications, but these are mostly fairly well handled by orthodoxy, albeit I'll want to express that in terms of the permute-and-trace operator I use in my writings on linear algebra (methods tau and permutrace of study.maths.vector.Vector).

Description of tensor quantities shall implicate bases; any given expression of the value of the quantity shall be specified relative to some basis; several bases may be in play in parallel. As for units, some configuration (but specific to the tensor spaces involved, rather than global) shall be needed for specifying the choice of basis for display purposes; and may be needed for resolving decisions about which basis to use when evaluating the results of combining (particularly when adding) some tensor quantities.

Assorted syntactic sugar that might be nice to do, particularly in rank one (i.e. vectors and co-vectors), subject to overcoming various potential problems:

Have tensor quantities support attribute names, consequent on choice of bases, for components; e.g. v.x for the x-co-ordinate of a vector v. This could lead to conflicts if two bases share a common member, since the components parallel to that member shall differ between the bases; it is thus necessary that two bases use distinct names for any member they share.
Input of a vector whose components don't all have the same units, provided the spatial co-ordinates of the vector all have the same units and dividing these by the temporal co-ordinate's units yields either speed or its inverse; if this ratio is speed, the input quantity really is a vector, whose temporal co-ordinate is obtained from that of the input by multiplying it by the speed of light; if the ratio is a time/length, the input quantity is really a co-vector, whose temporal co-ordinate is obtained by dividing that of the input by the speed of light. However, I'm not convinced I want this (I may even allow a vector's components to have different units; e.g. polar co-ordinates with one length and two angles).
Addition of a scalar with units u*time, for some u, to a 3-vector with units u*length should produce a 4-vector whose spatial co-ordinates are the 3-vector and whose temporal co-ordinate is the scalar times the speed of light; likewise, for covectors, u/time + [,,]*u/length but dividing the scalar by the speed of light.

Miscellaneous

Given that numbers handled here are not the machine-implemented ones, it'll also make sense to use my study.value.bigfloat infrastructure more widely, to avoid overflow and underflow.

Where two numbers are exactly rationally commensurate, it would be nice to keep track of this. In particular, for (anti-) symmetrisation operators, a sum of permutation operators divided by an integer would benefit from remembering the integer rather than a floating-point approximation to its inverse, e.g. to avoid idiocies such as


>>> femto
1.0000000000000001e-15
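For comparison, the standard library's `fractions.Fraction` type keeps such values exact. This is only a sketch of the idea, not a claim about how the package should store its numbers; the variable names are illustrative.

```python
from fractions import Fraction

femto = Fraction(1, 10**15)  # exact, unlike a floating-point approximation
sixth = Fraction(1, 6)       # e.g. the divisor when averaging over the 3! permutations

assert femto * 10**15 == 1   # no rounding residue
assert sixth * 6 == 1
print(femto)                 # 1/1000000000000000
```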

In similar vein, it may be worth remembering that a value is really a solution of some polynomial equation (e.g. it's sqrt(5)) rather than only remembering a floating-point approximation to its value. One could even do similar for transcendental numbers, though it would be harder. On the other hand, the extent to which this is useful may be limited!

The correct semantics for comparison, with imprecise values, is an area requiring some thought. On the one hand, two values with equal best estimate and identical distributions aren't necessarily equal; but two computations of a value may well be construed as agreeing if their distributions overlap to any significant degree.
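One candidate semantics, sketched with simple intervals standing in for full distributions: two values "agree" when their intervals overlap by at least some fraction of the narrower one. The function names and the threshold of one half are arbitrary assumptions made for illustration.

```python
def overlap_fraction(a, b):
    """Fraction of the narrower interval that lies in the overlap of a and b."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    narrower = min(a[1] - a[0], b[1] - b[0])
    if narrower <= 0:
        raise ValueError("intervals must have positive width")
    return max(0.0, high - low) / narrower


def agree(a, b, threshold=0.5):
    """Deem two imprecise values compatible if they overlap significantly."""
    return overlap_fraction(a, b) >= threshold


print(agree((1.0, 2.0), (1.5, 3.0)))  # True: half of the narrower interval overlaps
print(agree((1.0, 2.0), (3.0, 4.0)))  # False: no overlap at all
```

Note that this notion of agreement is not transitive, which is one reason it cannot simply replace equality.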

When comparing angles, it may make sense in some contexts to deem angles equal if they differ by a whole multiple of the turn. More generally, some computations (e.g. .arcCos on a scalar between −1 and +1) may yield values with a discrete ambiguity; this is particularly common where the result is an angle (e.g. dividing an angle by an integer commonly produces a result ambiguous due to the original's equivalence to angles differing from it by whole numbers of turns), but also arises from square root and other polynomial solution. These ambiguities are relevant even when the values in question have zero-width error bars.
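Both points can be sketched with angles measured in turns and held as exact fractions. The function names are hypothetical, and the choice to return every candidate result is only one way to represent the ambiguity.

```python
from fractions import Fraction


def same_angle(a, b):
    """Angles in turns are deemed equal if they differ by a whole number of turns."""
    return (a - b) % 1 == 0


def divide_angle(angle, n):
    """All distinct results, modulo one turn, of dividing an angle (in turns) by n."""
    return [((angle + k) / n) % 1 for k in range(n)]


quarter = Fraction(1, 4)
print(same_angle(quarter, quarter + 2))  # True
print(divide_angle(Fraction(1, 2), 2))   # a quarter turn and three quarters of a turn
```

Halving a half turn thus yields two candidates, even though the input carries no error bar at all.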

The existing system deals poorly with single-point data: it is unclear whether such a datum is an exact value or a best estimate with no distribution. The fuzzy-number type needs to handle "I have a best estimate but no distribution" (which is arguably a bogus creature) as a separate matter from exact values (arguably handled using some other kind of number). Alternatively, I have to be more systematic about using some equivalent of study.value.quantity.tophat as an error bar.

Where it comes to describing particles, especially in decay processes, it is necessary to distinguish between a type of particle (electron, proton, etc.) and a particular particle – e.g. when describing the fragments escaping from a decay process, each fragment has a speed which contributes to its energy, over and above that native to its rest mass.

Written by Eddy.
