ISO/TC37/SC2/WG1 N77
Date of presentation: 2001-08-13
Proposer: BSI
Draft technical report: Development and Application of ISO 639 in the identification, classification and alphanumeric coding of the world's languages
Contents
1 Introductory Note
2 Clarification of Terms and Categories
3 The Global Context
4 The Proposal
5 Towards a Global Public Resource
Bibliography
Appendix 1 ISO 639 Codes correlated with the Linguasphere Referential Framework (extract A-H)
Appendix 2 Linguasphere Referential Framework of 10 Sectors and 100 Zones
Appendix 3 Chart of the World's Arterial Languages (printed as 2 pages in landscape view)
Appendix 4 Global Language Index (sample extract)
1 Introductory Note
There is an established need for a standardised system of codes for the tagging and identification of the world's languages. Variation still exists, however, in the form of language codes used by different organisations and in different countries. The ISO 639 codes provide the base for standardisation in this field, although they at present cover only a small proportion of the world's languages. These ISO language codes also exist in 3 different versions, the ISO 639-1 two-letter code, and the ISO 639-2/T and 639-2/B three-letter codes (as designed for terminological and bibliographical use, respectively).
A fully classified inventory of the world's languages and speech communities was published in 1999/2000, including a coded index of over 71,000 names (Linguasphere Register of the World's Languages and Speech Communities, see Bibliography).
The following proposal outlines how the 3 versions of the ISO 639 codes may be unified as a single standard, and how the formal linking of this standard with the Linguasphere zones of reference would create an alphanumeric Global Identification Code (GIC) with increased informational content and inbuilt protection from error.
2 Clarification of terms and categories
2.1 Classification codes, identification codes and referential codes
A clear distinction needs to be maintained among 3 forms of language code:
2.1.1 modifiable classification codes or "relationship scale", recording proximities of interrelationship among languages but subject to modification as research progresses;
2.1.2 fixed identification codes or "language tags", enabling individual languages to be identified without ambiguity; and
2.1.3 a stable "referential framework", providing a meeting-point for the correlation of classification codes and identification codes (as in the proposed Global Identification Code).
2.2 Language names and umbrella names
A clear distinction needs to be maintained between:
2.2.1 language and dialect names as applied to individual spoken and/or written varieties of language; and
2.2.2 umbrella names, often artificially created, covering groups or families of related languages (the treatment of which has not been presented in the following pages, for reasons of time and space).
2.3 Languages and dialects
The continuum which frequently exists among adjacent forms of speech means that it has always been difficult to define the boundary between usage of the terms "language" and "dialect".
The situation is eased by recognising that many languages are better analysed and distinguished in terms of three (rather than two) layers of immediate relationship. These layers are best explained by reference to specific examples:
2.3.1 outer language, as applied, for example, to the totality of the Welsh language in all its spoken and written forms;
2.3.2 inner language, as applied to the 3 major components of the modern Welsh (outer) language: literary Welsh (as written, and progressively standardised, in recent centuries); northern spoken Welsh (in north Wales); and southern spoken Welsh (in south Wales);
2.3.3 dialect, as applied to distinct varieties of written Welsh (e.g. Bible or "pulpit" Welsh) or to local varieties of northern or southern spoken Welsh (e.g. Anglesey Welsh in the north, or Pembrokeshire Welsh in the south).
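The three layers described in 2.3.1 to 2.3.3 may be pictured as a simple nested structure. The following is a minimal sketch only, using the Welsh examples given above; the class names (`OuterLanguage`, `InnerLanguage`) and their field names are illustrative assumptions and form no part of ISO 639 or of the Linguasphere Register.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class InnerLanguage:
    """One major component of an outer language (layer 2.3.2)."""
    name: str
    dialects: List[str] = field(default_factory=list)  # layer 2.3.3


@dataclass
class OuterLanguage:
    """The totality of a language in all its forms (layer 2.3.1)."""
    name: str
    inner_languages: List[InnerLanguage] = field(default_factory=list)

    def layers(self) -> Iterator[Tuple[str, str]]:
        """Yield (layer, name) pairs, walking from outer language to dialect."""
        yield ("outer language", self.name)
        for inner in self.inner_languages:
            yield ("inner language", inner.name)
            for dialect in inner.dialects:
                yield ("dialect", dialect)


# Example values taken from the Welsh illustration in the text.
welsh = OuterLanguage(
    name="Welsh",
    inner_languages=[
        InnerLanguage("literary Welsh", ['Bible or "pulpit" Welsh']),
        InnerLanguage("northern spoken Welsh", ["Anglesey Welsh"]),
        InnerLanguage("southern spoken Welsh", ["Pembrokeshire Welsh"]),
    ],
)

for layer, name in welsh.layers():
    print(f"{layer}: {name}")
```

A nested record of this kind keeps the dialect subordinate to its inner language, which is the point of distinguishing three layers rather than two.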
2.4 Spoken languages and standard written languages
It is of great importance that a clear distinction be maintained between:
2.4.1 spoken languages and their dialects, which may also be written (in dialect literature, or in phonetic transcriptions, for example).
2.4.2 standard(ised) written languages, which have acquired a status independent of the spoken word but which may themselves be spoken (in speech which is modelled on the written tradition of a language). Part of the present proposal is that the 2-letter codes of ISO 639-1 should be formally recognised as designating the relevant standard written languages (e.g. en for Standard English), in contrast to the general coverage of the 3-letter codes of ISO 639-2 (e.g. eng for English in any or all its forms).
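As a rough illustration of the distinction proposed in 2.4.2, the scope of a code could be read from its length alone. This is a sketch under that assumption only; the function name `code_scope` and the wording of the returned labels are hypothetical.

```python
def code_scope(code: str) -> str:
    """Return the scope the present proposal would attach to an ISO 639 code.

    Assumption: 2-letter (ISO 639-1) codes designate the standard written
    language, and 3-letter (ISO 639-2) codes cover the language in any or
    all of its forms, as proposed in 2.4.2.
    """
    if not (code.isascii() and code.isalpha() and code.islower()):
        raise ValueError(f"not a lowercase alphabetic code: {code!r}")
    if len(code) == 2:
        return "standard written language"
    if len(code) == 3:
        return "language in any or all of its forms"
    raise ValueError(f"unexpected code length: {code!r}")


print(code_scope("en"))   # the example given for Standard English
print(code_scope("eng"))  # the example given for English in any or all its forms
```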
3 The Global Context
The objective of clearly identifying all the languages and speech communities of humankind, regardless of their demographic size, is today clear, attainable and of global importance.
3.1 The Twenty-first century perspective
At the onset of the twenty-first century, humankind is aware of itself as a single planetary community, with means of instant global communication and of increasing global planning and coordination. Languages are the key to that communication and coordination.
For the first time, the languages of the world may be viewed as integral parts of humankind's greatest and most fundamental creation, the continuous global web or "linguasphere" of human speech and writing.
Languages no longer need to be listed and catalogued as a vast array of independent objects, belonging to rival and often warring communities. They can now be viewed and classified as integral parts of a collective human heritage.
The classification of languages has until now been the preserve of erudite specialists, often tracking down words of ancient languages in the pursuit of evidence about the human past.
Today, however, the classification of modern languages has a direct relevance to the way humankind perceives and organises itself as a single global and multilingual community. Individual languages can now be perceived, not as the individual creation and property of specific communities, but as mutable and interrelating subsystems within a vast global kaleidoscope of words, grammatical rules, speech sounds and elements of writing.
All languages have benefited, or may potentially benefit, from the modern communications revolution in two fundamental ways:
· The recording and global transmission of the spoken word allow any spoken language to share the advantages previously reserved to written languages, enabling even small speech communities to maintain worldwide spoken contact.
· The instant transmission and exchange of the written word allow any written language to share the advantages previously reserved to speech, encouraging even children to use writing (instant messages by phone and computer, and e-mail) as an integral part of their social life.
3.2 The Need for the Identification of Languages within a Referential Framework
Any system of linguistic classification needs to contain an element of fluidity, in order to deal not only with the fundamental nature of the linguasphere but also with a still expanding knowledge of its complexity.
At the same time, it is necessary that the identifiable written and spoken languages of human communities be clearly and unambiguously catalogued and identified, from the international use of English or French to the unique speech of an isolated village in central Africa.
It is important to be aware of this contrast between (a) the need for fluidity in establishing and updating a sliding scale of linguistic interrelationships, and (b) the need for stability in identifying the individual spoken and recorded languages of humankind.
The primary objective in this field is therefore to complete a standardised international system of identification codes for the unambiguous tagging of all known forms of spoken and written languages, alive or recorded from the past, and for the correlation of those fixed tags to a separate scale of linguistic interrelationships.
4 The Proposal
4.1 The Institutional Background
The first comprehensive coded and classified inventory of the languages and speech communities of humankind during the 20th century was completed in December 1999 and published in Wales in 2000.
This Linguasphere Register of the World's Languages and Speech Communities provides a referential framework for the location and classification of over 22,000 identifiable varieties of speech and writing. The Linguasphere Register is supported by a unique and expandable Index of over 71,000 linguistic and ethnolinguistic names, each classified and coded within the referential framework, using a comprehensive scale of linguistic interrelationships.
The agency responsible for compiling and maintaining the Register is the Linguasphere Observatory (www.linguasphere.org), a transnational research network devoted to the study and maintenance of multilingualism. Conceived in Canada in 1983, the Observatory was established in France during the 1980's. During the 1990's, it has worked in close collaboration with the University of London's School of Oriental and African Studies, and has been directed from bilingual Wales since 1995, with scientific support from Russia, India and the United States. See further details in section 5.3 of this paper.
In July 2001, the BSI (British Standards Institution) requested the Linguasphere Observatory to make a firm proposal for the establishment of a standardised alphanumeric coding system covering all the world's languages, based on existing and future codes of ISO 639 and correlated with the referential framework and relationship scale of the Linguasphere Register.
4.2 The Technical Background
ISO 639-2, originally devised for use in library systems, now exists in slightly divergent forms, known as ISO 639-2/T (terminology) and ISO 639-2/B (bibliographic). Although the 3-letter codes of ISO 639-2 could provide codes for 26x26x26 languages, limits specified in the standard currently restrict the creation of new codes to languages with a substantial body of literature. If rigorously applied, this restriction limits the more generalised use of the ISO 639 codes, particularly in ICT usage. As a result, some ICT users, including ministries and official agencies, have either made use of the SIL (Summer Institute of Linguistics) codes or developed their own coding systems, notably the OpenType specifications (OT) used in font and rendering technologies. Variant codes of this kind have been developed in certain countries, including the UK, Sweden and Germany, and have in some cases caused clashes in bibliographic information interchange.
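The capacity figure mentioned above is easily checked. The short calculation below is illustrative only; it assumes that every combination of the 26 letters a-z is available for assignment, which is an upper bound rather than a statement about the standard itself.

```python
ALPHABET_SIZE = 26  # letters a-z


def code_capacity(length: int) -> int:
    """Number of distinct codes of the given length over a 26-letter alphabet."""
    return ALPHABET_SIZE ** length


two_letter = code_capacity(2)    # codes of ISO 639-1 length
three_letter = code_capacity(3)  # codes of ISO 639-2 length, i.e. 26x26x26

print(two_letter, three_letter)  # prints: 676 17576
```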
The purpose of this proposal is NOT to create yet another method of coding, but to enable existing ISO (TC/37) standards to work more efficiently and accurately, and to be expanded systematically to cover all languages and speech communities. The following pages outline how the 3 versions of the ISO 639 codes may be unified within this single standard, and how the formal linking of the ISO codes with the Linguasphere zones of reference would create an alphanumeric Global Identification Code (GIC) with compact informational content and inbuilt protection from error.
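The error protection referred to above can be sketched as a cross-check between the two halves of a combined code. The example below is an illustration under stated assumptions, not the syntax of the proposed GIC, which this section does not define: the hyphenated form `zone-code` (e.g. `42-abk`), the function names `make_gic` and `check_gic`, and the lookup table `ZONE_OF` are all hypothetical. The zone and code values are those listed for Abkhazian, Achinese and Acoli in the correlation table of Appendix 1.

```python
# Hypothetical reference table: ISO 639-2 code -> two-digit Linguasphere zone.
ZONE_OF = {"abk": "42", "ace": "31", "ach": "04"}


def make_gic(iso_code: str) -> str:
    """Combine a zone and an ISO code into an assumed 'zone-code' form."""
    return f"{ZONE_OF[iso_code]}-{iso_code}"


def check_gic(gic: str) -> bool:
    """Return True only if the zone and the letters agree with the table.

    A mistyped letter or digit will usually break this agreement, which is
    the kind of inbuilt protection from error the proposal describes.
    """
    zone, sep, iso_code = gic.partition("-")
    if sep != "-" or len(zone) != 2 or not zone.isdigit():
        return False
    return ZONE_OF.get(iso_code) == zone


print(make_gic("abk"))      # 42-abk
print(check_gic("42-abk"))  # True
print(check_gic("42-ach"))  # False: 'ach' is listed under zone 04
```

Because the numeric and alphabetic parts carry overlapping information, a slip in either part tends to produce a combination that is absent from the reference table and can therefore be rejected.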
4.4 Practical considerations of the present proposal
Some of the problems hitherto associated with the codes of ISO 639, and with language identification in general, were discussed by Peter Constable and Gary Simons of SIL International in their paper Language Identification and IT: Addressing problems of linguistic diversity on a global scale, presented to the 17th International Unicode Conference (San Jose, California, September 2000).
5 Towards a Global Public Resource
5.1 Progress towards a tripartite global reference guide
The present proposal is designed to provide the key element in the production of a fully coded and interactive global reference guide to
· the languages and speech communities of the world,
· their established linguistic relationships,
· their global corpus of linguistic and ethnic names, and
· their geographic positions and demography.
This global reference guide would take the form of a freely available, independent and multilingual website, comprising 3 interdependent "panoramas". These would share a common alphanumeric coding system throughout (ISO 639 plus Linguasphere), and would be fully inter-referenced and interaccessible at every point:
5.1.1 the Global Index of the World's Languages and Speech Communities (or ISO 639/ Linguasphere Index), presenting an alphabetical key to the identification and location of all known written and recorded languages and dialects, and all varieties of linguistic, ethnic and communal names. This panorama, covering a total of over 71,000 names, is already available in a first printed edition (but without ISO codes), as the Index to the Linguasphere Register. This progressively updated and expanded edition will be opened to free public access and dialogue on the internet within the next year. An extract from this Index, covering names beginning G-, has been prepared and is now being extended as part of the current proposal. This will include existing and proposed additional ISO-639 codes.
5.1.2 the Global Register of the World's Languages and Speech Communities, presenting a comprehensive scale of linguistic relationships among the spoken and recorded languages and dialects of the world, and their relevant speech communities. This panorama is already available in a first printed edition as the Linguasphere Register of the World's Languages and Speech Communities, covering over 22,000 varieties of languages and dialects. This edition will be opened to free public access and dialogue on the internet within the next year, and will be progressively updated and expanded online. Extensive extracts are already freely available at www.linguasphere.org.
5.1.3 the Global Mapbase of the World's Languages and Speech Communities, presenting a cartographic survey of the location, distribution and interrelationships of the world's languages and speech communities. This panorama has already been developed by the Linguasphere Observatory for Africa (linguistically the most complex continent in the world), in collaboration with the London School of Oriental and African Studies (SOAS). It has been printed as the first sheet of the Linguasphere Mapbase of the World's Languages and Speech Communities and is currently being extended into southern Europe and western Asia, in collaboration with the Languages of the World unit of the Russian Academy of Sciences (Akademia Nauk). The first African layer of this map is viewable at http://www.soas.ac.uk/Geography/LanguageMapping/home.html. This same page on the SOAS website illustrates how subsequent layers of the Linguasphere Mapbase will be accessible by zooming, down to the layer of urban speech communities (as already surveyed and published for over 300 minority languages of London, see Bibliography below).
5.2 Applications of the tripartite reference guide
This three-part electronic reference guide will serve as
· a transnational reference system and educational resource for teaching covering
→ the global complexity of humankind, as represented by the overlying diversity of its languages and the divergent welfare and cultures of its individual speech communities,
→ the underlying unity of humankind, as represented by a worldwide continuum of multilingual communication and intercommunal identities (the "linguasphere"), and
→ the establishment of comprehensive links with - and annotated signposts towards - a vast range of other electronic sources on the languages, peoples and cultures of the world;
· a stimulus to innovative teaching and research, including
→ the active investigation and surveying of linguistic and ethnic realities and relationships, including the continuous updating and expansion of the global reference guide itself;
→ the transnational observation and documentation, regardless of frontiers, of
- the actual and relative welfare of all speech communities in the world,
- the movement and migration of speech communities and their members,
- the formation and distribution of minority urban speech communities,
- the incidence of all forms of genocide and other forms of discrimination among ethnolinguistic communities;
→ the awakening of public interest in questions of the transnational and multilingual heritage and origins of communities and individuals (the "languages of our ancestors"). Linked to the growing strength of public interest in genealogical research, this development may be of particular importance in encouraging the development of bilingualism among first language English-speaking communities (in danger of becoming the only communities deprived of the advantages of bilingualism, in an otherwise multilingual world).
5.3 The Linguasphere Observatory
The present proposals and products are the outcome of many years' research and development at the Linguasphere Observatory in Wales and at its previous location in France as the Observatoire Linguistique. Created in 1983, after planning and discussion in Quebec (at CIRB, the Centre International pour la Recherche en Bilinguisme at the Université Laval), the Observatory was set up in Normandy as a transnational research network devoted to the study and development of multilingualism (under the honorary presidency of Léopold Sédar Senghor of Senegal, and registered under the French law of association of 1901).
Among other linguistic activities, the Observatory was responsible for two bilingual exhibitions on languages at the Centre Georges Pompidou in Paris during the 1980's, with substantial support from the Government of Canada. (These exhibitions subsequently toured internationally, including London, Liège and Lagos, and around the world to Canberra.) Since 1995, the Observatory has been based in a bilingual area of west Wales, under the directorship of David Dalby, where the Linguasphere Register of the World's Languages and Speech Communities was first published at the turn of the millennium (1999/2000). Scientific support has been received from Russia, France, India and the United States.
It is appropriate that the present proposals and products should emanate from Wales, a country whose language has successfully resisted and survived the successive invasion of its territory by two of the most powerful languages in the history of the world, Latin and English. All speech communities now need to consider their relationship to English, as a global lingua franca, and in this respect the indigenous speech community of Wales has the longest experience in the world, having faced the growing strength of its English neighbour for more than one millennium. The cultural strength and linguistic survival of the Welsh-speaking community offer an important message of encouragement to small speech communities everywhere. English has a transnational role to play in the world, along with other "arterial" languages, but should be developed in the service of a multilingual global society, NOT as the medium of a monolingual culture.
That the British Standards Institution in London (BSI) and the University of London's School of Oriental and African Studies (SOAS) should have given their support to the proposals and products of the Linguasphere Observatory in Wales is also significant. At a time when countries around the world are devoting resources to the study of a language associated with England, it is appropriate that major public institutions in that country should devote resources to the study and development of multilingualism and of the languages of the world.
| Language Name (English) | Language Name (French) | LRF + | 639-1 | 639-2/T | 639-2/B |
| Abkhazian | abkhaze | 42 | ab | abk | abk |
| Achinese | aceh | 31 | | ace | ace |
| Acoli | acoli | 04 | | ach | ach |
| Adangme | adangme | 96 | | ada |
|