The Case of the Tolerant Case-Folding Web Server

By Robin Cover. Draft opinion 2005-05-23. See disclaimers and request for comments.


Introduction

"Web server, ignore character case in URIs when evaluating requests for resources"

Someone has suggested that we should accommodate the fallible memory and poor spelling habits of users who try to fetch online resources using any (case) spelling whatsoever. "Mixed case in URIs and case-sensitive matching rules are a royal pain for users, so we should configure the web servers to honor requests for all variations of upper- and lower-case, regardless of canonical spelling of the resource's mixed-case URI. If there's a case-insensitive match, that's close enough: just forget about spelling and ship the resource, without further ado." Hmmmmm...

Conceptually, instructing a server to use case-insensitive string matching on URIs is like implementing myriad URI aliases for each resource. Some use cases for URI aliases are quite valid. For example, a resource that's versioned but needs to be available at a predictable location may be referenced both by a version-specific URI and by a latest-version URI, where the latter might redirect to the former. Some people think it's both useful and harmless to hand over a resource requested as example.com/faq when the canonical URI is example.com/FAQ. However, occasional URI aliasing is vastly different from programming server support for an arbitrarily large number of case-insensitive spellings, matching all permutations of upper-case and lower-case spellings. Is that kind of URI aliasing a good thing? Is it benign?
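To get a feel for "arbitrarily large", here is a minimal Python sketch that counts and enumerates the case spellings a case-folding server would accept for one path. The function names are hypothetical, and the sketch assumes ASCII letters only; it illustrates the arithmetic and is not any server's actual matching code.

```python
from itertools import product


def count_case_variants(path: str) -> int:
    """Count the upper/lower-case spellings of `path` (ASCII letters only)."""
    letters = sum(1 for ch in path if ch.isascii() and ch.isalpha())
    return 2 ** letters


def case_variants(path: str) -> list[str]:
    """Enumerate every case spelling of `path`; practical only for short strings."""
    choices = [
        (ch.lower(), ch.upper()) if ch.isascii() and ch.isalpha() else (ch,)
        for ch in path
    ]
    return ["".join(combo) for combo in product(*choices)]


print(case_variants("faq"))  # the 8 spellings from 'faq' through 'FAQ'
print(count_case_variants("caseIgnorance.html"))  # 131072
```

Each letter doubles the count, so the three letters of "faq" already yield 8 aliases, and a 17-letter file name such as caseIgnorance.html yields 2^17 = 131072.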

A distinct set of concerns arises in connection with machine-readable formal specification components identified by URIs, where these (dereferenced) resources are fetched and processed from deep within XML applications: XML schema files, WSDL files, XML catalogs, etc. What about RDDL documents advertised by their owners as Namespace Documents, which live at the end of a dereferenced official namespace URI? In such cases, where URIs clearly function as authoritative, canonical names as well as locators, encouraging promiscuous spellings and supporting silent server resolution of mis-spelled URIs may result in name corruption — unintentional or intentional. Corrupt names supported by a server then proliferate as published name corruptions in a process very toxic to data integrity and application interoperability.


Why Case-Insensitive Matching on URIs is Not Such a Great Idea After All

Computers are Particular: Why Pretend Otherwise?

Server support for agent requests that ignore case-sensitive spelling encourages an attitude that case does not matter, or it shouldn't matter. Encouraging people to be sloppy about spelling when they are interacting with computers is arguably a bad idea, because a great many computing operations are (in fact) case-sensitive. Examples:

Web Server Common Practice

Although we see a growing tendency to use all lower-case characters in URLs and to instruct URL rewriting engines to perform case-folded (case-insensitive) matching on paths, many web servers are configured to treat URIs case-sensitively. Such server configurations respect, rather than disrespect, the URI owner's decision to use mixed-case in the spelling for resource identifiers. Here are some examples, for which you can try to guess the correct spelling, if you want, but these links fail [as of 2005-05-23] if you request a resource using a URI that disregards the case-sensitive spelling assigned the resource by the URI owner:

Web Architecture Good Practices

The Architecture of the World Wide Web, Volume One explains why "Avoiding URI aliases" and "Consistent URI usage" are both good practices. Web servers that capitulate to users' demands for an arbitrarily large number of (case-insensitive) aliases and inconsistent usage ("because case should not matter") arguably are not encouraging good practices. The rationale is presented in section 2.3.1, URI aliases, of the Architecture document. While the guidelines are articulated in terms of behavior by document designers/authors (URI owners) and agents as URI consumers, they are applicable to server behavior as well: "URI aliases are harmful when they divide the Web of related resources... The problem with aliases is that if half of the neighborhood points to one URI for a given resource, and the other half points to a second, different URI for that same resource, the neighborhood is divided. Not only is the aliased resource undervalued because of this split, the entire neighborhood of resources loses value because of the missing second-order relationships that should have existed among the referring resources by virtue of their references to the aliased resource..." [credits to Norm Walsh for citing the relevance of this passage]

Case Folding in Evaluation of IRIs

What happens when the URIs are IRIs? Hmmmm... I don't know (not completely understanding the significance of the Section 5.3.2.2 Note), but I would not count on the average web site administrator getting this right unless there are already publicly available resources (e.g., POSIX/Perl regex routines for IRI/URI rewriting engines). It's apparently tricky, as Section 5.1 declares: "Because IRIs exist to identify resources, presumably they should be considered equivalent when they identify the same resource. However, this definition of equivalence is not of much practical use, as there is no way for an implementation to compare two resources unless it has full knowledge or control of them... Even though it is possible to determine that two IRIs are equivalent, IRI comparison is not sufficient to determine whether two IRIs identify different resources..."

Elliotte Harold wrote: "Going beyond ASCII and English, case insensitivity is very tricky. For instance the lower case of I is not the same in Turkey as it is in the United States. Ditto that the upper case of i is not the same in Turkey as it is in the United States. The upper case of é is different in Quebec and France. IRIs all get encoded as ASCII URIs; but would such URIs be recognized and would percent encoded letters be upper cased or lower cased? Both percent encoded ASCII and percent encoded non-ASCII?"
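The Turkish example and the percent-encoding question can be made concrete with a few lines of Python. This is only a sketch of what one language's built-in, locale-independent case mappings do; the variable names are hypothetical, and it makes no claim about how any particular web server or rewriting engine folds case.

```python
from urllib.parse import quote, unquote

# Python applies Unicode's default case mappings, which ignore locale.
DOTTED_CAPITAL_I = "\u0130"  # the Turkish upper case of 'i'
DOTLESS_SMALL_I = "\u0131"   # the Turkish lower case of 'I'

lowered = DOTTED_CAPITAL_I.lower()  # 'i' followed by U+0307 COMBINING DOT ABOVE
uppered = DOTLESS_SMALL_I.upper()   # plain ASCII 'I'

print(len(lowered))     # 2: one character became two
print(uppered == "I")   # True: dotless and dotted i collide once upper-cased

# Lower-casing a percent-encoded URI changes hex digits, not the encoded letter.
encoded = quote("\u00c9")      # capital E with acute, UTF-8 percent-encoded
folded = encoded.lower()       # '%c3%89'
still_upper = unquote(folded)  # decodes to the capital letter again

print(encoded)                  # %C3%89
print(still_upper == "\u00c9")  # True: the letter was never case-folded
```

A server that lower-cases the raw request string therefore treats percent-encoded letters differently from literal ASCII ones, which is exactly the ambiguity Harold's questions point at.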

Contaminating Effects

The following scenario illustrates how server support for case-insensitive matching on URI references can lead to loss of interoperability, not to mention user confusion. Suppose your technical committee creates an XML specification which includes an XML schema, living canonically at http://www.example.com/QVML/2005/01/Proto/qv.xsd, with a declared namespace URI http://www.example.com/QVML/2005/01/Proto. But the host standards body for your TC has "helpfully" implemented server case-folding heuristics. Now, an influential book or web site incorrectly publicizes that the XML Schema lives at http://www.example.com/QVML/2005/01/proto/qv.xsd, and notes that a RDDL Namespace Document lives at http://www.example.com/QVML/2005/01/proto. Bogus versions of the XML schema emerge containing an incorrect namespace declaration: developers conclude that the namespace URI is http://www.example.com/QVML/2005/01/proto, because that's what these RDDL documents do. The error propagates silently but swiftly: since the web servers transparently resolve HTTP requests based upon this incorrect information, disinformation persists and spreads; you don't notice initially. Now: what breaks? XML catalogs, maybe? Which XML applications fail to interoperate? Document instances with corrupted namespace declarations proliferate. Which sets of applications interoperate with respect to processing malformed data instances, but are non-compliant with the TC's specification?
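One concrete thing that breaks is namespace matching. XML processors compare namespace names as exact strings, so an instance declaring the lower-case spelling is in a different namespace as far as a conforming parser is concerned. The sketch below uses Python's standard xml.etree.ElementTree; the element name qv is hypothetical, chosen only to match the scenario's file name.

```python
import xml.etree.ElementTree as ET

CANONICAL_NS = "http://www.example.com/QVML/2005/01/Proto"

# Two instances that differ only in the case of one namespace letter.
good = ET.fromstring('<qv xmlns="http://www.example.com/QVML/2005/01/Proto"/>')
bad = ET.fromstring('<qv xmlns="http://www.example.com/QVML/2005/01/proto"/>')

# ElementTree exposes qualified names as '{namespace}localname'.
EXPECTED_TAG = f"{{{CANONICAL_NS}}}qv"

print(good.tag == EXPECTED_TAG)  # True
print(bad.tag == EXPECTED_TAG)   # False: 'proto' is not 'Proto'
```

An application written against the canonical namespace will not recognize the second document's root element at all, even though a case-folding server happily delivered a schema for it.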

A case similar to that given above involves the spread of an error in a filename spelling, rather than in the upper parts of the path hierarchy. If a server silently resolves the URI given as http://example.com/Schemas/PLML.xsd, the user who fetched the schema under this URI will be invited by a web browser to "SaveAs" PLML.xsd. The user then creates XML instances which use this local XML schema filename, and they nominally validate. A different person using the draft example files discovers that the sample instances fail in an application, and thinks an error has crept into the namespace in the schema file, which does not match the schema filename (dang it!) or instance spellings — and so "corrects" it case-wise to 'PLML'. Only problem: the real namespace is lower-case 'plml'. This kind of incorrect correction is attested through scribal history in manuscript transmission: a scribe "corrects" an apparent error to an incorrect (corrupted) but plausibly correct exemplar. In this sample instance involving PLML/plml, the web server configured to transparently deliver the schema requested under an incorrect URI (case spelling) seeded the chain of corruptions.

Surrendering Control Over Your Name

Most people will take offense at reckless misspelling of their personal name, and will not tolerate confusion that would come from allowing an arbitrary number of variant spellings of their name in public documents. Why would you want to surrender control over a URI you own? Server support for case folding allows users (worldwide) to create and publish arbitrary variant spellings for (canonically) case-sensitive URIs, with impunity. Even deliberately, with malice. The URI owner, who cannot prohibit the publication of unauthorized and possibly undesirable variant spellings, then loses control over his/her ability to effect stability in the naming orthography. URI stability may be critical for a variety of reasons, some unanticipated.

Identity of an Identifier

From a philosophical perspective, the power of naming derives from the ability to discriminate in a manner sufficient to allow unique identification (identity), whether of a class or an instance in a class. According to this model, identifiers express identity not only for the (abstract/concrete) object signified, but recursively, within themselves, through unique naming: identity of the identifier. To forfeit the right to identity in the expression of an identifier (colloquial: "case does not matter") is to forfeit a core principle. One does not stand up and shout "What...??!" in a baseball stadium when a random idiot screams out "Hey there, buttface!" URI aliasing needs to consider the consequences of surrendering the identity of the identifier by saying "OK, yeah, I'm not buttface, but I think I know what you're asking for, so here, happy to oblige... go ahead and tell the world I answer to 'buttface' as one of many vulgar names, and hell, I don't even know if I have a real name or not, probably not..."

Conclusion: Millions of currently maintained resources use mixed case in the path and query portions of URIs, and in fragment components. We could argue that use of mixed case in URIs is bad practice, but many projects have made this choice and defended it on the basis of concern for usability and semantic clarity. In the end, whether we think mixed case is good or bad is a moot point; it's there in URIs. What should servers do?

This memo argues that servers should not be configured to use case-insensitive string matching on a URI request and then (if successful in finding an approximate match) transparently deliver the resource to the agent. If there is no exact case-sensitive match, one reasonable server response other than returning HTTP status code 404 might be to implement HTTP 1.1 code "300 Multiple Choices" in such a way as to prompt the user/agent with suggestions about "near matches", requiring (?) however that the agent not automatically GET (one of) the possible candidate URI(s). As mentioned in a note, W3C servers sometimes behave in this fashion when a case-munged URI is sent in a GET request. Requiring that a human intervene to initiate a successive fetch of the resource represents one minimal protection against the silent proliferation of erroneous URIs.
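The policy proposed above can be sketched as a small decision function. This is a hypothetical illustration, not the behavior of W3C servers or of any real server module: resolve and KNOWN_PATHS are invented names, the path list is made up from examples in this memo, and the near-match test uses plain lower-casing, which is only adequate for ASCII paths.

```python
def resolve(path: str, known_paths: set[str]) -> tuple[int, list[str]]:
    """Return an HTTP status code and the candidate paths for a request.

    200: exact, case-sensitive match; serve the resource.
    300: no exact match, but near matches exist; list them, never auto-serve.
    404: nothing matches, even ignoring case.
    """
    if path in known_paths:
        return 200, [path]
    near = sorted(p for p in known_paths if p.lower() == path.lower())
    if near:
        return 300, near
    return 404, []


# Hypothetical set of canonical paths on the server.
KNOWN_PATHS = {"/FAQ", "/QVML/2005/01/Proto/qv.xsd"}

print(resolve("/FAQ", KNOWN_PATHS))      # (200, ['/FAQ'])
print(resolve("/faq", KNOWN_PATHS))      # (300, ['/FAQ'])
print(resolve("/missing", KNOWN_PATHS))  # (404, [])
```

The point of the design is that the 300 branch returns suggestions without following them: the agent has to issue a second, correctly spelled request, so the misspelled URI never appears to work.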


Responses from Readers




Why Server Case Folding is a Great Idea

This section is file.html#Foo; not the same as section file.html#foo.

Warning: No need to waste time reading this section; the important arguments were made in the first section. Proceed at your own risk...

<sarcasm setTo="yes" /> "You know, spelling rules are a big pain in the butt, especially when it comes to remembering what to type in a browser address box for a URL. We need to change all web server behaviors so that, at a minimum, it never matters whether you type a capital or a lower-case letter. Wouldn't that be a lot simpler?

The requirements for perfect spelling are so fascist: why should it matter? Probably we should get rid of capital letters, or better yet, revert to computing practices of the paper-tape era, when EVERYTHING, INCLUDING SMALL FUNCTION WORDS LIKE "AND" AND "OR" WERE REPRESENTED ONLY BY UPPER CASE LETTERS, ALONG WITH PROPER NOUNS. JUST THINK HOW MUCH EASIER IT WOULD BE IF WE DIDN'T HAVE TO PAY ATTENTION TO DIFFERENCES BETWEEN UPPER CASE AND LOWER CASE LETTERS. THIS WOULD PROBABLY IRRITATE THE GERMANS, WHO TEND TO CAPITALIZE NOUNS (BECAUSE GERMAN USES ORTHOGRAPHY IN WRITTEN LANGUAGE FOR WORD DIFFERENTIATION), BUT MAYBE THAT'S A GOOD THING, JUST TO GET EVEN. ;-)

Come to think of it, getting even with the Wiki (that's WIKI) developers would be a good idea. Who can abide all this ugly CamelCaseWriting anyhow? We need to train people to believe that correct spelling in URIs does not matter, so that we can punish the WIKI-people who think it does matter. Here's how: We note that the canonical URI for the Atom syntax web site WIKI is: http://www.intertwingly.net/wiki/pie/FrontPage. We train users to believe that case-sensitivity in URIs is stupid, and to expect that enlightened web sites indeed will allow any case whatsoever. These users will then be infuriated at resources like the Atom WIKI! When they type in http://www.intertwingly.net/wiki/pie/frontpage, they will have one kind of bad experience: "Forbidden to you, you don't have permission to access /wiki/pie/frontpage on this server, according to Apache/2.0.46 (Red Hat) Server at intertwingly.net on Port 80; go away nasty person, Atom WIKI hates you!" Go away fool: "go and boil your bottom, son of a silly person. Your mother was a hamster and your father smelt of elderberries!" Not much better luck when they try http://www.intertwingly.net/wiki/pie/frontPage; just a different kind of bad experience. So: by this means, we will stamp out all stupid web sites that insist on fascist, throw-back exact spelling rules: enlightened users will just not put up with them.

These people who want to insist on correct spelling in URIs are the same bunch of anal types who think it's way, so wrong to make gratuitous use of the apostrophe to form plurals of English words. Why should it matter if we say two day's ago or two days ago, OR Three organization's are participating or Three organizations are participating? Everybody knows what you mean, so who cares? DTD's or DTDs; schema's or schemas; Nut's for sale! or Nuts for sale! — who cares about spelling perfection? These spelling freak's who rite about correct plural's and propur formashun's of currekt akronim's jest dont git it.


Notes


Disclaimers

This document is not an official part of the Cover Pages web site, and may not represent the interests of anyone other than the author. It is an experimental opinion piece, for which feedback is requested. Please send email with your critique, corrections, suggestions for improvement, and use cases for/against the practice of instructing servers to ignore case. Being completely neutral about the matter, I am especially interested in use cases illustrating the deleterious effects of case-insensitive matching on URIs.

Colophon

The canonical URI for this document is http://xml.coverpages.org/caseIgnorance.html, featuring one obligatory upper-case I. Content is brought to you by a Netscape-Enterprise/4.1 server configured to respect case in URIs. No URI aliases are provided, though you could create an arbitrary number of them using redirect hacks like those provided by tinyurl.com.









