This is a rant about something that routinely bugs me about the anglo-centrism of those who write for computer audiences. It's also a rant about how those who try to teach the benefits of XML fail to actually think about informatics – doing so would enable them to make a better case for XML. The rant is provoked by the example that seemingly every tutorial on XML feels obliged to give, which they always do badly. A relatively mild example goes like this:
<?xml version="1.0"?>
<person>
  <name>
    <firstname>Paul</firstname>
    <lastname>McCartney</lastname>
  </name>
  <job>Singer</job>
  <gender>Male</gender>
</person>
I've actually chosen one of the less severe transgressors here; they give this only in passing and don't belabour their error as if it were a good way to do things. The error is in using firstname and lastname as element types – I might quibble about using name the way they do, but the rest is good.
So what's my beef with using firstname and lastname?
Well, first, there's the simple fact that they're fatuous – that is, Paul appears first and McCartney appears last, so why throw in spurious mark-up to say that? For my purposes, that's a minor quibble, but actually it's a serious issue with a tutorial, at least when it's trying to get across that XML is a good thing. The reasonable reader is going to wonder how dumb the system must be, if it needs the blindingly obvious pointed out. Still, that's incidental: my real concern is the bad informatics implicit in the chosen schema.
What this example tells me, about the collection of information about people that's illustrated by one person's data above, is that those collecting it have not thought about the data they're collecting. They've taken a structure that works for most of the people they know and adopted it as a scheme into which to shoe-horn the data for all the people it doesn't match. I anticipate that the rest of the software using this collection will, for example, start mails to people in the list with "Dear Paul," using firstname, which will lead to stupid situations when you want to write to someone who is normally addressed by their middle name: the data will then have to tag the middle name as firstname for the mail to be sensibly addressed. The mark-up will then make it obvious that we have a misnomer on our hands: that firstname should really have been tagged as a personal name. Likewise, when you want to find all the members of clan McCartney in your collection, you'll filter on that last name; what you're really doing is using it as a family name. The case becomes even more obvious with names from China, among other places where the personal name appears after the family name.
Now you might object, to my complaint, that I'm just grumbling about nomenclature: I'll come back to why there's more to it than that, but let me first point out that nomenclature matters. If you name a data-field wrong, people are going to end up putting the wrong datum in it, because they'll supply the thing the name appears to ask for. If you take something that has a generic name and give it a name specific to one culture, you run the risk of offending folk of another culture, to whom the generic name would be appropriate but your chosen name is misguided. You will make stupid mistakes like failing to notice that two people with the same "first name" are part of the same family, or mistakenly thinking that two people with the same personal name are from the same family.
Now, in fact, naming is really rather complex: sufficiently so that I'd even go so far as to suggest that those writing XML tutorials steer clear of it unless they're really willing to do their homework. On the other hand, if they do bother to do that homework, and XML is as good as they say, then they should be able to make a really good illustration out of naming (and getting their hands dirty with some real informatics might do them no harm; if nothing else, it'd earn them some respect from any readers who've thought about the problem). It has the potential to make a good illustration for the simple reason that it is really quite complex; which means it can illustrate the power and flexibility of XML. I'll have a stab at doing this, below, but I should mention up front that I'm not an XML expert, so I probably won't do it as well as it could be done.
At the very least, I'd like to see those writing tutorials use personalname and familyname instead of firstname and lastname. It would at least set a good example to their students; the labels are then semantic rather than merely describing a layout fact that happens, in the examples the author bothers to use, to coincide with the semantic property they should really be tagging.
Human culture is diverse and complex, and names are terribly important to people: some can easily get highly offended if you get their name wrong or if you use the wrong parts of it for particular social interactions. Different cultures even attach different semantics to names, quite apart from word order. The norms governing the relevant customs, furthermore, change over time – my grandfather, for example, would have expected letters from his peers to address him as Welbourne while I, with exactly the same name, expect to be addressed as Edward. The software that deploys a datastore full of information about people needs, as a result, to encode a set of rules that conform to social norms; the datastore, meanwhile, must record the information that software will need.
Let's start with a sample of some names of real people that don't so neatly fit the simplistic firstname lastname pattern assumed by so many tutorials:
Aubrey de Grey: He might also have some middle names, but his friends call him Aubrey and he inherited the whole of de Grey from his parents: a family name may have more than one word (without being hyphenated) and fragments of a name aren't always capitalised.
Anne van Kesteren: Another two-part family name; and, while we're at it, notice that a personal name that some cultures only give to girls may, in another, be given to boys. (They're given to children, who then grow up into relevant adults, of course.)
Tollef Fog Heen: Another two-parter of a family name, this time with both halves capitalised, still with no hyphenation.
Tarquin Wilton-Jones: I'm not sure, but I doubt any official documents call him Tarquin – but that's how he prefers to be addressed, just as I prefer to be addressed as Eddy.
Margaret Louise Scot: Always addressed as Louise, even though it's not her first name.
Ho Su Lian: She inherited Ho from her parents and the right way to address her is as Su Lian; a personal name may be more than one word – and the order of parts depends on culture.
Jan Vidar Krey: Like many (but by no means all) Norwegians with two personal names, he's addressed using both, as Jan Vidar.
Haraldur Karlsson: He's Icelandic, so his family name is actually a patronymic: it's not a patrilineal family name (i.e. the same as his father's, and his before him), it literally says he's the son of Karl; if he has a sister, she'll be Karlsdottir.
… and that's not even going near Russian complications or the Arab practice of indicating the names of sons (if a name ends abu Rashid it means the person thus named has a son named Rashid). Notice that none of these is unusual within the relevant cultures; if we were to delve into the complications that arise from idiosyncratic naming, it would get more complex yet.
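To see how a purely positional schema misfires, here is a minimal sketch in Python. The helper naive_split is hypothetical: it stands in for what a firstname/lastname schema quietly assumes, namely that the first word is the personal name and the last word is the family name. The test names are Paul McCartney plus two names that appear in the markup examples later in this piece.

```python
def naive_split(full_name):
    """Label name parts purely by position, as a firstname/lastname schema does.

    Hypothetical helper for illustration only.
    """
    parts = full_name.split()
    return {"firstname": parts[0], "lastname": parts[-1]}


if __name__ == "__main__":
    for full_name in ["Paul McCartney", "Aubrey de Grey", "Ho Su Lian"]:
        print(full_name, "->", naive_split(full_name))
    # Paul McCartney -> {'firstname': 'Paul', 'lastname': 'McCartney'}
    # Aubrey de Grey -> {'firstname': 'Aubrey', 'lastname': 'Grey'}
    # Ho Su Lian -> {'firstname': 'Ho', 'lastname': 'Lian'}
```

Only the first result is usable: the family name de Grey loses half of itself, and Ho Su Lian ends up with her family name filed as firstname and half of her personal name filed as lastname.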
There are also plenty of women who, on marriage, retain their maiden name (family name at birth) as a last-but-one name and append their husband's family name; many others discard their maiden name, replacing it with their husband's family name; and some women don't change their names on marriage. There are also some men who, on marriage, discard their family name and adopt their wife's. Names that derive from family names are worth identifying as such in the mark-up. Sometimes an ancestral maiden name will be used as a penultimate name for a child; whether this counts as a personal name or a family name is a further complication.
So we have personal names, one or some of which may be used as a familiar name, and family names, some of which may be transient. We have various forms of address – formal and informal; social, professional (among peers) and commercial (e.g. when your bank or doctor addresses you); spoken and written; and, as to written, one may care about the distinction between proper forms for use on the outside of an envelope, in the salutation that begins the letter and when referring to the person in writing others shall read – and need to be able to indicate the fragments of name relevant to each. To do the job entirely fully, one would, I suspect, have to duplicate at least some fragments – or, at the very least, have the schema support doing so.
Name fragments may be derived from the father's personal name or may be the same as the father's; we may as well cater for the same possibilities via the mother, although I don't know of a culture that does so (but check Russian). The correct terms for those are patronymic (from father's personal name), patrilineal (same as father), matronymic and matrilineal.
There are plenty of other complications. When folk go by a personal name other than the first, some simply omit the first; others include its initial; I suspect some vary, depending on context. At least one author used his first and last names when writing one genre of fiction, but included his middle name as an initial when writing another genre. Folk have titles or qualifications that are conventionally included when naming them; the form this takes may vary with context, either as a prefix or a suffix (prefix Dr. or suffix PhD, for example). Names that a family has reused down several generations may be qualified to indicate which of the people with a given name is indicated, as for example the USAish convention of Senior, Junior and subsequent numbering with Roman numerals.
Of course, for any given application, only some of the details shall be relevant; it's important to identify which those are and not waste too much time and effort on capturing (much) more information than you actually need; but what you do choose to capture, you need to annotate correctly, so that your software can use it correctly.
So, clearly, we can have mark-up delineating personal and family parts of names; and we can delineate preferred name directly when it's present as part of the name, but we need some form of alternatives structure for dealing with the case of preferred forms of address. I'll use a mixture of elements and classes. Our original quoted example can become:
<?xml version="1.0"?>
<person>
  <name>
    <personal>Paul</personal>
    <patrilineal>McCartney</patrilineal>
  </name>
  <job>Singer</job>
  <gender>Male</gender>
</person>
with nothing more than some trivial renaming, as long as we introduce some simple rules like allowing that, absent any indication of preferred form of address, normative rules are to be applied. Now let's try the examples I listed above, plus a few more for illustrative purposes:
<?xml version="1.0"?>
<!-- Well-formed XML needs a single root element; the wrapper's name, "names", is arbitrary. -->
<names>
  <name>
    <personal>Aubrey</personal>
    <patrilineal>de Grey</patrilineal>
  </name>
  <name>
    <personal>Anne</personal>
    <patrilineal>van Kesteren</patrilineal>
  </name>
  <name>
    <personal>Tollef</personal>
    <patrilineal>Fog Heen</patrilineal>
  </name>
  <name>
    <personal>
      <alternative class="official">Mark</alternative>
      <alternative class="preferred nick">Tarquin</alternative>
    </personal>
    <patrilineal>Wilton-Jones</patrilineal>
  </name>
  <name>
    <personal>Margaret</personal>
    <personal class="preferred">Louise</personal>
    <patrilineal>Scot</patrilineal>
  </name>
  <name>
    <patrilineal>Ho</patrilineal>
    <personal>Su Lian</personal>
  </name>
  <name>
    <personal>Jan Vidar</personal>
    <patrilineal>Krey</patrilineal>
  </name>
  <name>
    <personal>Haraldur</personal>
    <patronymic>Karlsson</patronymic>
  </name>
  <name>
    <alternative class="official">
      <personal>Edward</personal>
      <patrilineal>Welbourne</patrilineal>
    </alternative>
    <alternative class="preferred nick">Eddy</alternative>
  </name>
</names>
Note that I've presumed that Tarquin is OK with being addressed as Tarquin Wilton-Jones, in contrast to my preference, which is for Eddy to be used only if you aren't using my surname. I'm also not sure Chinese names are inherited patrilineally; I just guessed.
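One way software might consume this markup is sketched below in Python, using the standard library's xml.etree.ElementTree. It is a sketch under stated assumptions, not a definitive implementation: the names wrapper element, the helper names has_class, text_of and preferred_name, and the fallback rule (use the first personal name when nothing is marked preferred) are all made up for illustration. The fallback stands in for the "normative rules" mentioned above, which a real application would have to work out per culture.

```python
import xml.etree.ElementTree as ET

# A subset of the names above, wrapped in a single root element
# (the wrapper is an assumption; XML requires exactly one root).
NAMES_XML = """<names>
  <name>
    <personal>
      <alternative class="official">Mark</alternative>
      <alternative class="preferred nick">Tarquin</alternative>
    </personal>
    <patrilineal>Wilton-Jones</patrilineal>
  </name>
  <name>
    <personal>Margaret</personal>
    <personal class="preferred">Louise</personal>
    <patrilineal>Scot</patrilineal>
  </name>
  <name>
    <patrilineal>Ho</patrilineal>
    <personal>Su Lian</personal>
  </name>
  <name>
    <alternative class="official">
      <personal>Edward</personal>
      <patrilineal>Welbourne</patrilineal>
    </alternative>
    <alternative class="preferred nick">Eddy</alternative>
  </name>
</names>"""


def has_class(element, wanted):
    """True if the space-separated class attribute contains `wanted`."""
    return wanted in element.get("class", "").split()


def text_of(element):
    """All text inside an element, with runs of whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def preferred_name(name):
    """Pick the name to address someone by (hypothetical rule).

    1. If any element inside <name> is marked class="preferred", use it.
    2. Otherwise fall back to the first <personal> element, wherever it
       sits in the name: position no longer matters, the tag does.
    """
    for element in name.iter():
        if has_class(element, "preferred"):
            return text_of(element)
    personal = name.find(".//personal")
    return text_of(personal) if personal is not None else None


if __name__ == "__main__":
    for name in ET.fromstring(NAMES_XML):
        print(f"Dear {preferred_name(name)},")
    # Dear Tarquin,
    # Dear Louise,
    # Dear Su Lian,
    # Dear Eddy,
```

The third case is the one a firstname/lastname schema can't get right: Su Lian is found by its tag even though it comes after the family name. The sketch ignores the finer preference mentioned above (Eddy only when the surname isn't used); capturing that would need more classes or attributes than the markup here defines.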
Written by Eddy.