This is a page about XML. I have been working with SGML/XML stuff for a few years now, and have seen a number of uses for it "as a technology". Many of these were theoretically good ideas, some of them even worked, and a huge, overwhelming majority were just bad jokes. I will describe the jokes here, and what went wrong. I want to make it clear from the outset that I do like XML in general, and use it all the time, and have no problem with its continued adoption. I am merely making an argument about what I think people are doing wrong with it, and suggesting an alternative.
The thing one needs to realize about computers when looking at the current
XMLification of everything is that everything digital can be made
to fit into everything else. Witness ken and justin once using the netnews
group names to transmit a message, by registering
alt.justin.what.are.you.doing.tonight. The medium is
completely plastic. If you want to store or transmit a message, you can do
cryptography, steganography, normalized database tables, Huffman codes,
elliptic curves, web pages, Morse code, semaphores, Java bytecode,
bit-mapped images, wavelet coefficients, s-expressions, basically anything
you can possibly dream up which codes for some bits. In all cases, if
you're coding bits and you are using a lossless system, the only
thing which matters is how convenient the encoding is. There's
nothing which makes one encoding "do it better" than any other, aside from
various external measurements of convenience such as size, speed of
encoding, speed of decoding, hand-editability, self-consistency,
commonality, etc.
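To make the "convenience" point concrete, here is a minimal sketch that stores the same list of numbers two ways, once as packed binary and once as XML text. Both encodings are lossless; only the size differs. The element names (`values`, `v`) and the binary layout are made up for illustration and do not come from any real format.

```python
import struct
import xml.etree.ElementTree as ET
from typing import List

# Sample data: 1000 floating point numbers.
values = [i / 7.0 for i in range(1000)]

def to_binary(nums: List[float]) -> bytes:
    """Hypothetical layout: 4-byte count, then 8-byte IEEE 754 doubles."""
    return struct.pack(f">I{len(nums)}d", len(nums), *nums)

def from_binary(blob: bytes) -> List[float]:
    (count,) = struct.unpack_from(">I", blob)
    return list(struct.unpack_from(f">{count}d", blob, 4))

def to_xml(nums: List[float]) -> bytes:
    """Hypothetical vocabulary: <values><v>...</v>...</values>."""
    root = ET.Element("values")
    for n in nums:
        # repr() produces a string that float() reads back exactly.
        ET.SubElement(root, "v").text = repr(n)
    return ET.tostring(root)

def from_xml(blob: bytes) -> List[float]:
    return [float(v.text) for v in ET.fromstring(blob).findall("v")]

binary_blob = to_binary(values)
xml_blob = to_xml(values)

# Both round-trip without loss; compare the sizes for yourself.
print(len(binary_blob), len(xml_blob))
```

Neither version carries more information than the other. The binary one is smaller and needs no number parsing; the XML one can be opened in a text editor. Which matters more depends entirely on who or what will be reading the file.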
I think this is something the XML hype artists of the world have not yet grasped. They see one convenience factor for XML, any factor really will do, and then decide that since it's socially acceptable to use XML (somehow!) for a project, it is OK to ignore all the other factors. Unfortunately this is a really bad idea.
XML (well, SGML) was invented to be a convenient way for humans to "mark up" text, interactively, so it could be later searched, formatted, indexed, referenced, and hyperlinked by a computer. This is a good goal, because plain text is conventionally quite difficult for a computer to process. Computers don't have any of the brains that we do, and can't extract "meaning" out of text. It's a hard problem: decades of research, little real progress. So for plain text, SGML was a good idea (badly implemented, though). XML fixed the bad implementation aspect, but otherwise its goal was mostly the same.
The XML standardization process, however, took place during a period of very rapid growth of the internet (still underway), and as a result it really had the web as a backdrop to all the discussion about text. And the web really highlighted something people had sort of known for a long time but not had the guts to mention, which is that proprietary storage and transmission formats really suck. You can't interoperate with someone serving up a proprietary standard unless you buy their product, which is the whole idea, but it sort of feels like blackmail when you know there are free standards kicking around. It feels like someone trying to put a tax on your communications, which is really crappy.
So against this backdrop of incompatible image formats, word processor formats, illustration formats, financial data formats, hell even incompatible email systems, the announcement of XML as this sort of general purpose description format put the wrong idea in a lot of people's heads. The idea is this: if XML is an extensible standard, and we're stuck with a very bad headache of incompatible non-standards, why not try to convert all these non-standards into extensions of XML? It almost sounds like a reasonable question, but I'm going to try to illustrate here why you might decide not to use XML for something.
On a lot of machines, "text" and "binary" files are the exact same thing, only the text files hold all their data in a form which a simple interactive editor like "vi" can manipulate. "Binary" usually just means that the data is packed in nice and tight, and nobody would even consider messing with it by hand. They'd use a library which has some functions built into it to manipulate the data in semantically self-consistent ways. For instance, many "binary" files have checksums in them. If you change a bit anywhere, you need to update the checksum. Many formats have ways of mangling them completely: if your file is block-coded with block headers describing the content and length of each block, one insert within a block will kill the file. Unless you update the block length, the file has become meaningless. If a file describes matrix-like data, the matrix dimensions must be consistent. You insert or delete an entry, and the matrix rows are skewed. The file is invalid. Many files have complex, built-in self consistency which must be respected by any tool which manipulates them.
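The block-length and checksum invariants described above can be sketched in a few lines. The block layout here (a 4-byte length, the payload, then a CRC32) is a hypothetical one invented for this example, not any particular file format.

```python
import struct
import zlib

def pack_block(payload: bytes) -> bytes:
    """Hypothetical layout: 4-byte big-endian length, payload, 4-byte CRC32."""
    header = struct.pack(">I", len(payload))
    trailer = struct.pack(">I", zlib.crc32(payload))
    return header + payload + trailer

def unpack_block(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if an invariant is broken."""
    if len(blob) < 8:
        raise ValueError("truncated block")
    (length,) = struct.unpack(">I", blob[:4])
    if len(blob) != 4 + length + 4:
        raise ValueError("block length header does not match the data")
    payload = blob[4:4 + length]
    (crc,) = struct.unpack(">I", blob[4 + length:])
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return payload

def is_valid(blob: bytes) -> bool:
    try:
        unpack_block(blob)
        return True
    except ValueError:
        return False

good = pack_block(b"hello world")

# A "hand edit": insert one byte into the payload and touch nothing else.
hand_edited = good[:6] + b"X" + good[6:]

# A library-style edit: rebuild the block so length and checksum are updated.
library_edited = pack_block(b"heXllo world")

print(is_valid(good), is_valid(hand_edited), is_valid(library_edited))
```

The hand edit and the library edit put the same bytes into the payload. Only the second leaves a file that the reader will accept, because only the second kept the header and checksum consistent with the data.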
I say all this to make a point, which is that text conventionally doesn't have much in the way of self-consistency checks. It's semantically vague from the get-go, so there are no invariants to talk about. SGML/XML add a simple set of invariants for the tagging structures: tags must balance, DTD nesting grammar must be valid (few tools check for this, even now), and tags themselves must lex properly (delimited quotes, delimited angle brackets, etc). The idea is that an XML editor or viewer can check these aspects of the markup, load the markup into a view/edit mode, and the user can work out the meaning of the rest of the text.
All this is fine, except that those are essentially all the checks on an XML file. There's no way for me to say "this XML file describes 3x5 matrices" or "this XML file describes a table of values wherein the first and second columns sum to the third" or "this XML file must contain numbers between 1 and 20000, plus phrases which are synonyms for the word 'money' in a locale which is the same as the locale number preceding the phrase". Now sure, you can say that it's possible to make constraint systems for XML, and many people have tried, but ultimately you're going to fall short because it's a noncomputable problem. You can use "self describing" files which contain code in a Turing-complete language for validating themselves, and then you need a checker to verify that nobody has tampered with the self-describing code. It's turtles all the way down. At some point, if you want a constraint which is more complex than the rules enforced by an XML DTD or schema, you're going to need to invoke an external library which has the logic built in. In much the same way as "a generic stream of bytes" only guarantees you that the file will have a multiple of 8 bits, "an XML file" only guarantees you that the file will lex (and possibly validate, though as I say few people do so). After that, you're on your own.
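As a rough illustration of the "columns must sum" case, the sketch below parses a small document with Python's standard `xml.etree.ElementTree`. The vocabulary (`table`, `row`, `a`, `b`, `sum`) is invented for the example. The parser accepts the document because it is well-formed; the arithmetic rule is only enforced by the hand-written function, which is exactly the "external library with the logic built in".

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical vocabulary: in each <row>, <a> plus <b> should equal <sum>.
# The second row breaks that rule, but the XML itself is perfectly well-formed.
DOC = """<table>
  <row><a>1</a><b>2</b><sum>3</sum></row>
  <row><a>10</a><b>5</b><sum>14</sum></row>
</table>"""

def rows_that_break_the_rule(xml_text: str) -> List[int]:
    """Return the 0-based indexes of rows where a + b != sum."""
    root = ET.fromstring(xml_text)  # succeeds: tags balance and lex properly
    bad = []
    for index, row in enumerate(root.findall("row")):
        a = int(row.findtext("a"))
        b = int(row.findtext("b"))
        total = int(row.findtext("sum"))
        if a + b != total:
            bad.append(index)
    return bad

bad_rows = rows_that_break_the_rule(DOC)
print(bad_rows)
```

Nothing in the parse step objects to the second row. The knowledge that `sum` is supposed to be a sum lives in the application code, not in the markup.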
I can see at this point the argument has begun sounding a little far fetched. Who in their right mind would require an XML file have all these complex checks built into it? XML is about openness and simplicity -- surely you can just make your tools flexible and able to recover from whatever minor errors they might detect! Well, it's a subtle issue, but I'd like to point to some existing DTDs, because I'm not making this stuff up:
Starting with the "mainstream" here, MathML is a markup for
describing mathematical expressions. Hidden within its body, there is
room for valid (or invalid) OpenMath expressions
as annotations. It is assumed that there will be a module which checks
or validates these expressions. Furthermore, since the semantics of each
individual element are more complex than XML can declare, the elements
are described using a higher-level language (MMLDefinitions)
built entirely separately from XML, and coded inside CDATA of an XML
file. This really has to be re-stated in order to fully appreciate the
insanity of it: you have an XML processor reading an MMLDefinition file
(XML), passing it to a higher layer of software which parses the
character data inside the XML stream and constructs
mathematical semantic checks out of that content, and then feeds back
those semantic checks into a different XML processor which
reads a MathML file itself, reporting back to the semantic checker to
verify that the XML it's reading in is actually valid under the
MMLDefinitions. What role does XML itself play in this?
None. There's no purpose in using it whatsoever. You need extensive
external libraries to parse even the definitions of MathML
semantics, much less implement those semantics. Here's an example of the semantics file:
|
Don't get me wrong, I'm thrilled that there might be a standard set
of definitions and type signatures for functors like this, and that I
might someday have desktop software which can handle them. But XML? It
hardly plays a passing role in the encoding here. Neither you nor I
could take a few hundred MMLDefinition entries, wave our
hands and magically know whether a MathML document makes sense. We're
going to need that extra layer of software.
tpaML (trading partner agreement markup language) is a format for describing business transactions, in a way which is roughly analogous to old EDI systems, though much easier to parse. The document boldly claims that there's sufficient data in the specification to automatically generate the software which parses tpaML and invokes the appropriate subprocesses, but that very claim concedes the obvious: this is not a human document format. It's a communication protocol which is to be handled by external applications. Don't have the application? There's no way to tell what one of these documents does, or whether it's valid. Furthermore, even their "sufficient data" claim is questionable, given that it relies on you having an XSchema processor around. Most people do not, at this point, even use validating XML processors, and XSchema is a schema layer on top of XML validation.
Spacecraft Markup Language is a format for describing (guess what) space ships. For the "space community". Here's an excerpt from an example SML file they put online:
|
I think there can be no question that this is indeed intended for a machine to read, write, enforce, and check. It means nothing to an unaided reader, and modifying some part of it without a special editor to check what you're doing would probably cost someone their life.
This is but a minute sampling. There are hundreds of DTDs put together, almost none of which can actually be read, written, modified or even understood by an unaided human. Here's a list: I would guess no more than 10% of them can realistically be hand-written or hand-modified without some likelihood of breaking some unwritten rule about what makes sense and what doesn't, even above the level of XSchema validation.
What all these DTDs have in common is that they need additional logic. They need supporting programs in the background to make sense of the XML. Indeed, many people "hail" XML as precisely that: a "common" way for programs to talk to one another, not for humans to use whatsoever! Such an argument has its appeal, but it's not correct: XML itself is not what we're talking about here; rather, what's under examination are extensions to XML, that is specific DTDs or XSchemas. There is no way on earth that a MathML browser, editor or proof checker is going to understand how to construct a space ship or negotiate the purchase of an airplane ticket in one of these other extensions, simply by virtue of being able to parse XML. They are entirely orthogonal concepts, artificially "unified" in the minds of their proponents by the fact that the underlying tree representations are still XML.
To put it differently, I will once again draw your attention to the
simple bit-level encoding question: if these DTDs were stored as packed
binary, the programs that read and write them would be no more or less
"compatible" with one another. Because they encode radically different
meanings, no nontrivial software tool will be able to work on all XML
DTDs simultaneously anyway. Think about it: at
best you could use an xml-grep on these files, or possibly
an entity resolver; beyond that, general purpose XML software will just
see them as meaningless trees of text and elements.
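Here is roughly what such an xml-grep amounts to, as a small sketch. The two documents and their element names are invented for the example and belong to no real DTD. The same generic function runs on both, and in neither case does it know what it has found.

```python
import xml.etree.ElementTree as ET
from typing import List

# Two made-up documents from unrelated, hypothetical vocabularies.
# Both happen to use an element called <n>, with entirely different meanings.
MATH_DOC = "<expr><apply><plus/><n>1</n><n>2</n></apply></expr>"
ORDER_DOC = "<order><item><n>widget</n><qty>3</qty></item></order>"

def xml_grep(xml_text: str, tag: str) -> List[str]:
    """Return the text of every element named `tag`, in document order."""
    root = ET.fromstring(xml_text)
    return [element.text or "" for element in root.iter(tag)]

math_hits = xml_grep(MATH_DOC, "n")
order_hits = xml_grep(ORDER_DOC, "n")
print(math_hits, order_hits)
```

The tool works on both files precisely because it ignores what they mean. Adding the two numbers, or shipping the three widgets, takes software written for that one vocabulary.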
This brings me to my central issue: many uses of XML right now benefit not at all from being encoded in XML. Frequently the encodings are of tabular, non-tree-structured data, and frequently it is numerical data for which there is a stunning speed and size penalty to be paid for stringification. Furthermore the encoding in XML encourages people to believe in 2 extremely dangerous falsehoods:
Neither of these sentiments is at all correct. For each concept introduced into an XML specification, corresponding software must exist or else the specification is nothing more than a wish list. The illusion that XML "heals all wounds and writes all software" is common amongst the trade press and amongst management, since it is easy to write up a syntactically valid "fanciful use case" of a document which you'd like a program to be able to interpret, yet by using English tags you can imbue it with semantics which are far more complex than any software currently is capable of. One need look no further than the RDF specification to see that we are heading down the same road conventional artificial intelligence experiments have taken us down in the past, only with slightly pointier brackets. Here's an XML document from 6 months from now, when the madness has reached its peak, written in a fictitious BPSML (business process semantics markup language); my point here being that writing a fictional document (and providing a DTD) doesn't generate the code, and certainly doesn't get you to your goal:
|
Why, if only it were that easy to get rich, all we'd need is to standardize and adopt BPSML, and the future would be paved with gold!
This is not to say that I have something against the standardization process. Quite the opposite: I see the standardization process, presently, being overtaken by the process of publishing DTDs to much fanfare, with no working implementations (or worse, proprietary implementations which read "magic values" in XML attributes). Standards are not strictly about encoding -- they are also about setting a level of acceptable interoperability between real programs. Right now, it doesn't matter how much you talk about your DTD or XSchema; talking is not going to get the program written, and so the talk is nonsense. Worse, the talk perpetuates the belief that you need to use a horribly inefficient encoding for data which has a narrow application range. Why bother? What we should be focusing our time on is not "converting the world to XML", but rather "coming up with standards whose feature set all programs can agree to support". The byte-level encoding is quite a secondary issue. The standardization bodies and software authors of the world should not put performance and simplicity in a back seat simply because a performant, simple solution might not use XML. TCP/IP isn't XML. PNG isn't XML. SMTP and HTTP aren't XML. RFC822 isn't XML. MIME isn't XML. Equating "the standard solution of a problem" with "finding an XML encoding for the problem" is just a mistake.