Wherever information is generated, stored, retrieved or consumed, data
flows, and one relevant metric is the amount of it that does so in any given
amount of time: that's a data-rate. It's worth remarking that information
theory defines an information content associated with a body of data,
which roughly corresponds to the smallest amount of data that could express the
same information (this is consequently context-dependent); this is typically
smaller (and often much smaller) than the actual amount of data
involved. (For a half-way decent idea of how much smaller, save the data to
file and compress it with your favourite compression program, whose name
probably ends in zip.) For the purposes of this page, I'm mainly
talking about the raw data; if I mention information, it'll be in
this compressed sense.
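As a rough illustration of the compression trick just described, here is a minimal sketch using Python's standard zlib module. The function name and the sample text are my own inventions for illustration; the compressed size is only an upper bound on the information content, not a measurement of it.

```python
import zlib

def compression_estimate(data: bytes) -> tuple[int, int, float]:
    """Return (raw size, compressed size, compressed/raw ratio) in bytes."""
    if not data:
        raise ValueError("need at least one byte of data")
    compressed = zlib.compress(data, 9)  # level 9: slowest, smallest output
    return len(data), len(compressed), len(compressed) / len(data)

# Highly repetitive text: plenty of raw data, very little information.
sample = b"the quick brown fox jumps over the lazy dog\n" * 1000
raw, packed, ratio = compression_estimate(sample)
print(f"raw: {raw} B, compressed: {packed} B, ratio: {ratio:.4f}")
```

Less repetitive data (ordinary prose, say) compresses far less dramatically than this contrived sample, and data that is already compressed barely shrinks at all.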
There are special modifiers for the scale of data, standardized by the IEC as
binary counterparts to the SI prefixes: 1024 happens
to be 2^10 and data-wranglers love powers of two (with good reason),
so they use 1024 in place of 1000 as the standard multiplier. Thus 1024 bytes used to
be colloquially referred to as a kilobyte; it's now properly called a
kibibyte; likewise, 1024 kibibytes is one mebibyte (previously megabyte) and
1024 mebibytes make a gibibyte (previously gigabyte); the series continues with tebi
(1024 gibi) for tera, pebi (1024 tebi) for peta and so on. The old colloquial and new official nomenclatures
co-exist and are mixed rather haphazardly; indeed, even before the new
terms were devised, it was not uncommon to encounter floppy disks that
held "1.44 megabytes", by which they meant (using a factor of 1000)
1440 kilobytes, where each kilobyte actually meant 1024
bytes. Confusion thus persists. Fortunately, I'm only interested in the
general scale of values, so a few factors of 1.024 aren't going to matter too
much for this page's purposes. I'll use standard SI nomenclature
(i.e. factors of the third power of ten, rather than the tenth power of two),
for consistency with the rest of my scale-of-value pages.
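To see how little those factors of 1.024 matter at each scale, here is a small sketch that compares each binary multiplier with its decimal namesake. The prefix pairing follows the IEC names; the function name is my own, purely for illustration.

```python
# Decimal (SI) and binary (IEC) prefixes, paired by rank.
PREFIXES = [("kilo", "kibi"), ("mega", "mebi"), ("giga", "gibi"),
            ("tera", "tebi"), ("peta", "pebi")]

def binary_to_decimal_ratio(rank: int) -> float:
    """Ratio of 1024**rank to 1000**rank, i.e. 1.024 raised to rank."""
    return 1024 ** rank / 1000 ** rank

for rank, (si, iec) in enumerate(PREFIXES, start=1):
    ratio = binary_to_decimal_ratio(rank)
    print(f"{iec} / {si}: {ratio:.4f}")
```

Even at the peta scale the discrepancy is under 13%, which is negligible when only the order of magnitude is of interest.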
One further complication is between the bit, the byte and the
word. Formally, the byte is whatever bundle of bits is the smallest that the
computer system under discussion knows how to handle as a whole; but, in
practice, all computers have now standardized on the 8-bit byte, a.k.a. the
octet, used by the internet as its standard unit of transfer. The word
is the standard-sized chunk of data that a computer operates on in a
single clock cycle of its processor; as technology progresses, systems
that deal with bigger words become prevalent. The number of bits in a word is
commonly used to characterize the type of a computer; when we speak
of 16-bit computers (whose use was phased out during the 1990s), we're
referring to the size of the word on such systems; they had two-byte
words. Those were replaced by the 32-bit generation, with four-byte words,
and these in turn are now (around 2010) being replaced by 64-bit systems, with
eight-byte words.
In principle the bit (properly abbreviated as a lower-case b) is the more primitive datum; in practice, the octet (always called byte) is the usual unit in use. I shall thus aim for consistency by using the byte (properly abbreviated as an upper-case B, but lower-case b is sometimes used in the wild) in preference to the bit. Consequently, my standard unit of data-rate is the byte/second, B/s, equal to eight bits/second, 8 b/s.
One could argue that b/s is the same thing as Hz (the unit
of frequency); but the two have distinct uses in
practice. Indeed, when modern computers list their processor speeds (in GHz),
I'm fairly sure the units of data moved around that many times per second are
actually words: at each clock cycle, each core processes
one word (or, more likely, a few words), handling all the bits involved at the
same time. Thus the 64-bit four-core 2.6 GHz processor of the computer on
which I'm typing this (in October 2011) performs operations 2.6 milliard
times per second; each core processes (conceptually) one word each time, so
the whole system processes (some small multiple of) four words of eight bytes,
for a total of 32 bytes = 256 bits, per cycle. The data-rate is thus (were my computer
fully loaded, so wasting none of its capacity) 32×2.6 GB/s = 83.2 GB/s, or about 670
Gb/s, where the processor speed is (despite having four cores) still just 2.6
GHz. Since no physical thing is actually doing what it does at 670 GHz
(rather, 256 things are doing what they do in parallel at 2.6 GHz),
I decline to conflate b/s with Hz.
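The arithmetic for this processor can be spelled out in a few lines. The figures for cores, word size and clock speed come from the paragraph above; the one-word-per-core-per-cycle figure is the conceptual simplification made there, not a measured property of any real chip.

```python
CORES = 4             # four-core processor
WORD_BYTES = 8        # 64-bit words
CLOCK_HZ = 2.6e9      # 2.6 GHz
WORDS_PER_CYCLE = 1   # assumption: one word per core per clock cycle

bytes_per_cycle = CORES * WORD_BYTES * WORDS_PER_CYCLE
byte_rate = bytes_per_cycle * CLOCK_HZ   # B/s
bit_rate = 8 * byte_rate                 # b/s

print(f"{bytes_per_cycle} B per cycle")
print(f"{byte_rate / 1e9:.1f} GB/s")
print(f"{bit_rate / 1e9:.1f} Gb/s")
```

This gives 32 bytes per cycle, 83.2 GB/s and 665.6 Gb/s, while the clock itself still ticks at only 2.6 GHz.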
The byte per second (B/s) is the standard unit of data-rate on this page. I type several characters per second on my keyboard, thus transmitting data to my computer at a rate of a few bytes per second. I likewise read only a few words of text per second, thereby consuming data at a dozen or two bytes per second.
The Large Hadron Collider's ATLAS data-collection apparatus outputs about a million gigabytes of data per second; that's roughly 1 PB/s.
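For converting between such figures, here is a minimal sketch of a formatter that expresses a rate in B/s with a decimal (factor of 1000) prefix. The function name is hypothetical, and the 5 B/s typing rate is merely my assumed stand-in for "a few bytes per second".

```python
SI_PREFIXES = ["", "k", "M", "G", "T", "P", "E"]

def format_rate(bytes_per_second: float) -> str:
    """Format a data-rate in B/s using decimal (factor-of-1000) prefixes."""
    value = float(bytes_per_second)
    index = 0
    # Step up one prefix for each whole factor of 1000.
    while value >= 1000 and index < len(SI_PREFIXES) - 1:
        value /= 1000
        index += 1
    return f"{value:g} {SI_PREFIXES[index]}B/s"

print(format_rate(5))           # typing, assumed at 5 B/s
print(format_rate(1e6 * 1e9))   # a million gigabytes per second
```

The second call confirms that a million gigabytes per second is 1 PB/s, some fourteen or fifteen orders of magnitude above the rate at which I type.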