Adding Embedded Language Identifiers to Plain Unicode Text

Daniel Wood, Mark W. Davis and Mark Leisher
dwood@crl.nmsu.edu  madavis@crl.nmsu.edu  mleisher@crl.nmsu.edu

Computing Research Lab
New Mexico State University

22 September, 1995

0. Problem Statement

The general sophistication of on-line text consumers and providers has increased dramatically over the last few years. Much more information "about" the text is being provided to make the processing and presentation of that text clearer.

There are many approaches to supplying that information "about" text, many of which consist of "markup" applied to that text. Differences between these many different "markup" techniques make interchange of text with information "about" text more difficult if the consumer of that text has no support for the "markup" technique used. Thus, there is still a need for what is commonly referred to as "plain text" or "simple text" (text with no information "about" that text) to support a minimal interchange capability.

Unfortunately, plain text is by itself insufficient. It lacks one piece of information "about" text that is strictly necessary for correct usage of text in a computing environment: language information.

Opinions expressed via the Unicode Consortium mailing list (unicode@unicode.org) seem to indicate that the majority viewpoint is that language information most properly lies in the "markup" domain. Until the day a single "markup" technique is identified as being used by some significant majority of on-line text consumers, an interim solution is needed that incorporates support for language information into plain text.

While language information is not properly associated with a character set standard (the Unicode Consortium's primary focus), the Unicode Consortium is in a unique position that allows it to recommend behavior or interpretation with regard to the Unicode Standard character set.
The ability to recommend behavior or interpretation is a motivating factor for the tendering of this proposal to the Unicode Consortium as opposed to other standardization bodies.

This proposal attempts to provide a relatively simple approach to incorporating support for language information into plain text encoded using the Unicode Standard character set.

In the remainder of this proposal, the phrase "language identifier" will be used to indicate something that provides language information.

1. Solution Considerations

No matter the proposed solutions to incorporating support for language information in plain Unicode text, the following considerations remain the same:

  A. The approach should have little or no impact on the current interpretation of a text stream to avoid incurring expense (economic or engineering) to those implementations that depend on the current interpretation.

  B. The approach should allow the interpretation of language information to be optional.

  C. The approach should allow sufficient space to represent some significant subset of the currently identified human languages, space to incorporate new languages as they are identified and space to allow private identifiers for special purpose work.

  D. The approach should not, in the general case, cause an unreasonable increase in the storage size of the text.

  E. The approach should allow easy mapping to and from language identifiers of a higher-level protocol.

  F. The approach should be technically feasible and easy to incorporate into existing implementations.

2. Structure and Interpretation

This approach consists of two parts:

  A. Allocation of sixteen codepoints from the Private Use area of the Unicode Standard character set and identification of their properties (in the Unicode Character Properties sense).

  B. A technique for constructing a language identifier when some contiguous subset of these sixteen codepoints is encountered in a Unicode text stream.

A. Allocation of codepoints

We propose that sixteen codepoints in the Private Use Area be allocated for this approach. Having the codepoints in this area will lessen the impact on existing implementations, which are free to ignore anything in this region. The codepoint positions for these sixteen codepoints were chosen strictly for demonstration purposes.

The allocation consists of a block of sixteen codepoints with proposed naming conventions in the spirit of the Unicode Standard publications:

Block Naming Convention:

    START  END   BLOCK NAME
    -----  ---   ----------
    E100   E10F  LANGUAGE ID BITS

Codepoint Naming Convention:

    CODE  NAME
    ----  ----
    E100  LANGUAGE ID BIT ZERO
    E101  LANGUAGE ID BIT ONE
    E102  LANGUAGE ID BIT TWO
    E103  LANGUAGE ID BIT THREE
    E104  LANGUAGE ID BIT FOUR
    E105  LANGUAGE ID BIT FIVE
    E106  LANGUAGE ID BIT SIX
    E107  LANGUAGE ID BIT SEVEN
    E108  LANGUAGE ID BIT EIGHT
    E109  LANGUAGE ID BIT NINE
    E10A  LANGUAGE ID BIT TEN
    E10B  LANGUAGE ID BIT ELEVEN
    E10C  LANGUAGE ID BIT TWELVE
    E10D  LANGUAGE ID BIT THIRTEEN
    E10E  LANGUAGE ID BIT FOURTEEN
    E10F  LANGUAGE ID BIT FIFTEEN

Additional Codepoint Information:

We propose that these codepoints have a general category of "Control or Format Character" and a bidirectional category of "Other Neutral". This could allow some (possibly useful) preservation of their positioning relative to the text stream following the application of a bidirectional reordering algorithm.

Each of these codepoints represents a particular bit in an atomic numeric type that can hold 16 or more bits. The assumption is that the computer language used to implement this has numeric type(s) which can be identified as having at least 16 bits.

B. Construction of a language identifier

The construction of a language identifier occurs during the interpretation of the text stream. It does not matter whether the text stream is being interpreted in a forward direction (beginning to end) or in a backward direction (end to beginning); the construction principle remains the same.
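
The direction-independence claim can be illustrated with a small sketch. It assumes only what is stated above: each codepoint in the block stands for one bit, and the identifier is the union of the bits found in a contiguous run. The text is assumed to be held in an array of 16-bit values, the codepoint positions are the demonstration positions E100 through E10F, and the names is_langid_bit, langid_forward and langid_backward are hypothetical helpers introduced for illustration, not part of the proposal.

```c
#include <stddef.h>

#define LANGID_FIRST 0xE100u  /* demonstration position, not a final allocation */
#define LANGID_LAST  0xE10Fu

/* Nonzero if c lies within the LANGUAGE ID BITS block. */
static int is_langid_bit(unsigned short c)
{
    return c >= LANGID_FIRST && c <= LANGID_LAST;
}

/* Scan forward from text[pos], collecting bits while the run continues. */
static unsigned short langid_forward(const unsigned short *text,
                                     size_t len, size_t pos)
{
    unsigned short id = 0;

    while (pos < len && is_langid_bit(text[pos])) {
        id |= (unsigned short)(1u << (text[pos] - LANGID_FIRST));
        pos++;
    }
    return id;
}

/* Scan backward from text[pos] (inclusive), collecting the same bits. */
static unsigned short langid_backward(const unsigned short *text, size_t pos)
{
    unsigned short id = 0;

    while (is_langid_bit(text[pos])) {
        id |= (unsigned short)(1u << (text[pos] - LANGID_FIRST));
        if (pos == 0)
            break;
        pos--;
    }
    return id;
}
```

For the sequence 0041 E100 E102 E105 0061, langid_forward started at index 1 and langid_backward started at index 3 both yield 0x0025, because setting bits with a bitwise OR does not depend on the order in which the codepoints are visited.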
While interpreting the text stream, for each contiguous group of codepoints that lie within the "LANGUAGE ID BITS" block, set the bit that each codepoint represents in a variable of an appropriate numeric type.

Demonstration code fragment in C:

    unsigned short c, mask;

    mask = 0;
    c = next_character();    /* current codepoint in the stream */
    while (c >= 0xe100 && c <= 0xe10f) {
        mask |= (unsigned short)(1u << (c - 0xe100));
        c = next_character();
    }
    /*
     * If the mask is 0, then no language id bit codepoints
     * were encountered.
     */
    if (mask != 0)
        change_to_language(mask);

If the result of this construction is a number between 1 and 65535, then it would act as the language identifier.

3. Introduction and Removal of Language Identifiers from the Text Stream

It is easy to visualize a scenario in which introduction and removal of language identifiers occurs. Consider some text processing system that uses a higher-level protocol to represent language identifiers.

When this system imports plain Unicode text containing language identifiers encoded using this approach, it can easily construct the language identifier from the text stream and map it directly to its analog in the higher-level protocol. This conceptually "removes" the language identifiers from the text stream.

When this system exports plain Unicode text, it can directly map the language identifier from the higher-level protocol to an integer, and sixteen iterations over that integer to determine which bits are set are all that is needed to "introduce" a language identifier into the text stream.

If the text processing system does not have language identifiers in some higher-level protocol, or has no higher-level protocol, it can leave the identifiers in the text stream.

If the codepoints in this approach are considered control or format codes, then they have the same implementation and interpretation ramifications as those control or format codes that already exist in the Unicode Standard character set. In short, a conforming implementation already has experience with control or format codes.

4. Possible Problems

This approach is susceptible to corruption of the text stream, as many multi-code approaches are. Reconstructing a correct language identifier after such corruption would be very difficult.

Programming languages that do not support or have little support for bit-level manipulations would require an arithmetic technique to construct a language identifier from interpretation of the text stream and to introduce a language identifier into a text stream.

5. Consideration for the Unicode Character Equivalence Algorithm

In this approach, if introduction of the language identifiers into the text stream is done in a consistent manner, no reordering would be necessary prior to application of the Unicode Character Equivalence algorithm. Consequently, we propose the following rule for introduction and interpretation of these codepoints:

    When looking through the text stream from the beginning to the end, any contiguous group of codepoints occurring in the text stream that lie within the "LANGUAGE ID BITS" block will *always* appear in numerically increasing order.

This implies that the process of introducing a language identifier into the text stream will be something like this C code fragment:

    unsigned short langid;   /* identifier to introduce */
    int i;

    /* Emit bit codepoints from lowest to highest. */
    for (i = 0; i < 16; i++)
        if ((langid >> i) & 1)
            insert_ucs2_langid_bit(output_stream, 0xe100 + i);

6. Technical Merits

  A. Only sixteen codepoints are used.

  B. Conventional semantics of the codepoint 0 as used in handling of text on computers is preserved.

  C. State information is unnecessary.

  D. Construction of a language identifier is consistent and unambiguous regardless of the direction in which the text stream is being interpreted.

  E. Introduction of language identifiers into the text stream is a matter of a short iteration over an integer in the order specified in part 5.

  F. The ordering of the language identifier codepoints for the Unicode Character Equivalence is implicit given a consistent order of appearance in the text stream.

7. Language Identifier Efficiency

The efficiency and simplicity of this approach from implementation and algorithmic standpoints is clear.

The storage efficiency of this approach is highly dependent on the assignment of numbers to languages. Each non-zero bit in an identifier adds one codepoint to the text stream, so languages that are assigned numbers with a high number of non-zero bits could cause a significant increase in storage requirements. This potential increased storage requirement could have significant impact on industries that need the text stream to be as compact as possible.

8. Languages and Number Assignment

If the Unicode Technical Committee agrees to discuss the issue of incorporating language identifiers into plain Unicode text, we would like to recommend the list of languages from the Ethnologue (World Genetic Tree) maintained by the Summer Institute of Linguistics as an initial language reference.

It has been noted that the incorporation of languages into this list has, from a standardization viewpoint, been inconsistent. We view this simply as an artifact of the SIL mission. However inconsistent from a standardization viewpoint, we feel that this list provides an excellent resource for a standardization process in one significant respect: adopting all or some subset of this list as substance for language identifiers incorporated in plain Unicode text would provide a body of experience that would allow consistent and measured proposals to supplement ISO 639.

It may be the case that research efforts initiated by the group working on ISO 639 would provide these supplements in a more timely manner, but having no experience in standardization processes, we have no real sense of what are considered appropriate measures.

Now, assuming the existence of a list of languages to work from, the next problem is to assign numbers to those languages in such a manner as to minimize possible storage size increases that this approach could allow.
We have unfortunately been unable to allocate time to research the possibilities as of this writing.

9. Conclusion

As some would have it, plain text is beginning to disappear, and if not, should disappear, but the reality is that consideration of plain text is critical to truly widespread adoption of any character set that is intended to replace the many national standards.

Plain text is essentially unusable in most text processing contexts and must have some other information "about" the text. The many different higher-level protocols that supply that other information "about" the text provide the on-line text consumer with varying and sometimes incompatible capabilities as well as inherent difficulties with interchange.

We feel this proposal introduces a reasonable support mechanism for the most fundamental and necessary piece of information "about" text, language information.
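
As a supplementary sketch for the storage discussion in section 7, the cost of an identifier can be computed directly, since each non-zero bit contributes exactly one codepoint. The code below assumes two bytes per codepoint (UCS-2), and the function names langid_codepoints, langid_bytes and identifiers_with_cost are hypothetical, introduced here only for illustration. It is not a number assignment scheme, only a way to measure candidate assignments.

```c
/* Number of LANGUAGE ID BIT codepoints needed for langid:
 * one per set bit, i.e. a population count. */
static unsigned langid_codepoints(unsigned short langid)
{
    unsigned n = 0;

    while (langid != 0) {
        n += langid & 1u;
        langid >>= 1;
    }
    return n;
}

/* Storage in bytes, assuming two bytes per codepoint (UCS-2). */
static unsigned langid_bytes(unsigned short langid)
{
    return 2u * langid_codepoints(langid);
}

/* How many of the identifiers 1..65535 need exactly k codepoints. */
static unsigned long identifiers_with_cost(unsigned k)
{
    unsigned long count = 0;
    unsigned long id;

    for (id = 1; id <= 0xFFFFul; id++)
        if (langid_codepoints((unsigned short)id) == k)
            count++;
    return count;
}
```

Under these assumptions, 16 identifiers cost a single codepoint, 120 cost two and 560 cost three, while the identifier 65535 costs sixteen codepoints (32 bytes). One possible reading is that numbers with few set bits might be reserved for the languages expected to appear most often, though that choice would need the research noted in section 8.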