Language Identification Issues 2002-02-04

Date:      Mon, 4 Feb 2002 11:59:41 -0500
From:      [email protected]
To:        [email protected]
Subject:   update on some recent activity

I thought I'd provide a brief update to people on this list of some recent activity regarding language identification issues. These are not in any particular order.

1. I presented a paper (co-authored by Gary Simons) at IUC 20 in Washington this past week, giving an analysis of the existing ISO 639 code sets, a mapping from the ISO 639 codes to entries in the Ethnologue, and the principles by which we made those mappings. Complete details of our proposed mapping and the analysis, as well as our paper, have been published on the Ethnologue site at http://www.ethnologue.com/iso639/.

2. The task force created by ISO/TC 37/SC 2 to look into the need for future work on ISO 639 also met this past week. The original task force members are Gerhard Budin (convener), Havard Hjulstad, John Clews, and Jennifer DeCamp; Sue Ellen Wright and I were recently added. Unfortunately, Gerhard fell sick and was not able to come from Vienna. Also, John Clews was not able to come, but David Dalby of the Linguasphere Observatory was sent by BSI in John's place. Monty George (US DOD) and Rebecca Guenther (LOC and JAC chair) also participated in our discussions. Havard chaired in Gerhard's absence.

We met on Tuesday afternoon, and sat together on a panel at the Unicode conference the next day. That panel discussion immediately followed my presentation. Most of us met again afterward to debrief (Rebecca had to leave before we had decided to meet again).

During our discussions, we moved toward consensus on several points. I had suggested that there was a need to come up with an ontological model of language-related categories for which we need to make distinctions for IT purposes; this model would serve to guide the solutions we create. It was decided that TC 37 should have a work project to develop a model and propose it as a standard. (I will be preparing a paper discussing an ontological model, to be presented at the Unicode conference in May.)

Another work project would be proposed to work on an overall framework for language and language-related-category identification. We didn't discuss this in detail, but I inferred that this would cover issues such as how codes from various sources (language codes, script codes, country or domain codes) would be combined to create category IDs, and how language codes would be maintained and documented. It also seems likely that an alpha-four code set will be created. (Havard felt that existing constraints would not make it feasible to have a comprehensive set of codes as part of ISO 639-2.)
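
As an illustration only (nothing like this was specified in the discussions), combining codes from several sources into one category ID might be sketched as follows. The class name, field names, component order, and the hyphen separator are all assumptions made for the example, and the code values shown are merely sample inputs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CategoryId:
    """Hypothetical composite identifier built from separately maintained codes."""

    language: str                  # a language code, e.g. from ISO 639
    script: Optional[str] = None   # a script code (illustrative value below)
    country: Optional[str] = None  # a country code, e.g. from ISO 3166

    def to_string(self, sep: str = "-") -> str:
        # Join only the components that are present, in a fixed order
        # (language, then script, then country). The order is an assumption.
        parts = [self.language, self.script, self.country]
        return sep.join(p for p in parts if p)


full = CategoryId("en", script="Latn", country="US")
partial = CategoryId("en", country="US")

print(full.to_string())     # en-Latn-US
print(partial.to_string())  # en-US
```

Keeping each component in its own field, rather than storing one opaque string, leaves room for each code set to be maintained and documented by its own source.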

There was also a little discussion of the respective contributions that might be made by Linguasphere and Ethnologue. David Dalby and I were in agreement that the two projects took different but complementary approaches: Linguasphere attempts to capture relationships between linguistic varieties at various levels, reflecting sociolinguistic reality in which language varies in continua rather than discrete entities. On the other hand, Ethnologue abstracts from those continua a set of discrete language entities that are inferred for practical purposes of language development. I think there was agreement among most of us that Linguasphere is useful for helping us understand the sociolinguistic phenomena to be observed in the world out there, and that Ethnologue is useful in that the categories it proposes correspond more or less to what are generally of most interest for IT purposes.

Havard will be working with Gerhard to prepare a report on proposed work projects. I think Jennifer and Sue Ellen are going to be looking into funding for such projects.

During the panel discussion and in some individual conversations over the next two days, I heard some concerns expressed by software vendors. These related to two general issues:

(A) Why hasn't there been more involvement from representatives of the software industry? On this point, I agree there has been a gap, and for a while have been suggesting some people I thought should be involved. Sue Ellen, who is chair of the ANSI TAG for TC 37, invited people from the software industry to participate.

(B) There was a concern regarding information overload -- an IT vendor being overwhelmed by long lists and not knowing how to make sense of it all. I think there are things we can do to address this. It will need to be looked at further, at any rate.

3. The W3C Internationalization Working Group held a workshop on Friday to explore needs for future work. Language and locale identification issues were concerns raised by many participants. There was a consensus that W3C should not get involved in proposing code sets, but that they do have an interest in understanding user needs and in what kind of syntax might be used for identificational codes. A group will likely be formed to look into this and some other somewhat related (but out-of-scope with respect to this list) issues.

4. A couple of months ago, a discussion arose on the www-international list regarding locales. That spawned a new Yahoo list focussed on locale-related issues. Mainly, this group is interested in seeing locales change from tightly bound bundles with fairly rigid relationships between languages and countries, in order to allow greater flexibility in mixing values for different cultural parameters. This entails a need for a somewhat different approach to locale identification than things like en_US. What happens here would, of course, be affected by whatever is arrived at for IDs for language-related categories. The interests of people on the locales list overlap with those of people involved in the W3C workshop, and there will probably be some interaction.
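
To make the idea of mixing values for different cultural parameters concrete, here is a minimal sketch. It is an assumption about what such a model could look like, not something proposed on the locales list: the class name, the choice of parameters, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LocaleSettings:
    """Hypothetical locale modelled as independent cultural parameters."""

    language: str           # language of the user interface and text
    date_format: str        # pattern used to display dates
    decimal_separator: str  # character separating integer and fraction
    currency: str           # default currency code


# A conventional bundle of the kind a name like en_US stands for
# (sample values, chosen for illustration).
EN_US = LocaleSettings(
    language="en",
    date_format="MM/DD/YYYY",
    decimal_separator=".",
    currency="USD",
)

# Start from the bundle, then override individual parameters
# without changing the language.
custom = replace(EN_US, date_format="YYYY-MM-DD", currency="EUR")

print(custom.language, custom.date_format, custom.currency)
```

In this sketch a bundle such as en_US is only a set of defaults, so a single identifier no longer has to fix every parameter at once.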

All for now.

- Peter

Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[email protected]>

Prepared by Robin Cover for The XML Cover Pages archive. See "Language Identifiers in the Markup Context."
