ISO/TC37/SC2/WG1 N77

Date of presentation 2001-08-13

Proposer BSI

Draft technical report:

Development and Application of ISO 639

in the identification, classification and

alphanumeric coding of the

world's languages

Contents

1 Introductory Note
2 Clarification of Terms and Categories
3 The Global Context
4 The Proposal
5 Towards a Global Public Resource

Bibliography

Appendix 1 ISO 639 Codes correlated with the Linguasphere Referential Framework (extract A-H)
Appendix 2 Linguasphere Referential Framework of 10 Sectors and 100 Zones
Appendix 3 Chart of the World's Arterial Languages (printed as 2 pages in landscape view)
Appendix 4 Global Language Index (sample extract)

1 Introductory Note

There is an established need for a standardised system of codes for the tagging and identification of the world's languages. Variation still exists, however, in the form of language codes used by different organisations and in different countries. The ISO 639 codes provide the base for standardisation in this field, although they currently cover only a small proportion of the world's languages. These ISO language codes also exist in 3 different versions: the ISO 639-1 two-letter code, and the ISO 639-2/T and 639-2/B three-letter codes (designed for terminological and bibliographical use, respectively).

A fully classified inventory of the world's languages and speech communities was published in 1999/2000, including a coded index of over 71,000 names (Linguasphere Register of the World's Languages and Speech Communities, see Bibliography).

The following proposal outlines how the 3 versions of the ISO 639 codes may be unified as a single standard, and how the formal linking of this standard with the Linguasphere zones of reference would create an alphanumeric Global Identification Code (GIC) with increased informational content and inbuilt protection from error.

2 Clarification of Terms and Categories

2.1 Classification codes, identification codes and referential codes

A clear distinction needs to be maintained among 3 forms of language code:

2.1.1 modifiable classification codes or "relationship scale", recording proximities of interrelationship among languages but subject to modification as research progresses;

2.1.2 fixed identification codes or "language tags", enabling individual languages to be identified without ambiguity; and

2.1.3 a stable "referential framework", providing a meeting-point for the correlation of classification codes and identification codes (as in the proposed Global Identification Code).

2.2 Language names and umbrella names

A clear distinction needs to be maintained between:

2.2.1 language and dialect names as applied to individual spoken and/or written varieties of language; and

2.2.2 umbrella names, often artificially created, covering groups or families of related languages (the treatment of which has not been presented in the following pages, for reasons of time and space).

2.3 Languages and dialects

The continuum which frequently exists among adjacent forms of speech means that it has always been difficult to define the boundary between usage of the terms "language" and "dialect". The situation is eased by recognising that many languages are better analysed and distinguished in terms of three (rather than two) layers of immediate relationship. These layers are best explained by reference to specific examples:

2.3.1 outer language, as applied, for example, to the totality of the Welsh language in all its spoken and written forms;

2.3.2 inner language, as applied to the 3 major components of the modern Welsh (outer) language: literary Welsh (as written, and progressively standardised, in recent centuries); northern spoken Welsh (in north Wales); and southern spoken Welsh (in south Wales);

2.3.3 dialect, as applied to distinct varieties of written Welsh (e.g. Bible or "pulpit" Welsh) or to local varieties of northern or southern spoken Welsh (e.g. Anglesey Welsh in the north, or Pembrokeshire Welsh in the south).

2.4 Spoken languages and standard written languages

It is of great importance that a clear distinction be maintained between:

2.4.1 spoken languages and their dialects, which may also be written (in dialect literature, or in phonetic transcriptions, for example); and

2.4.2 standard(ised) written languages, which have acquired a status independent of the spoken word but which may themselves be spoken (in speech which is modelled on the written tradition of a language). Part of the present proposal is that the 2-letter codes of ISO 639-1 should be formally recognised as designating the relevant standard written languages (e.g. en for Standard English), in contrast to the general coverage of the 3-letter codes of ISO 639-2 (e.g. eng for English in any or all its forms).

3 The Global Context

The objective of clearly identifying all the languages and speech communities of humankind, regardless of their demographic size, is today clear, attainable and of global importance.

3.1 The Twenty-first century perspective

At the onset of the twenty-first century, humankind is aware of itself as a single planetary community, with means of instant global communication and of increasing global planning and coordination. Languages are the key to that communication and coordination.

For the first time, the languages of the world may be viewed as integral parts of humankind's greatest and most fundamental creation, the continuous global web or "linguasphere" of human speech and writing.

Languages no longer need to be listed and catalogued as a vast array of independent objects, belonging to rival and often warring communities. They can now be viewed and classified as integral parts of a collective human heritage.

The classification of languages has until now been the preserve of erudite specialists, often tracking down words of ancient languages in the pursuit of evidence about the human past.

Today, however, the classification of modern languages has a direct relevance to the way humankind perceives and organises itself as a single global and multilingual community. Individual languages can now be perceived, not as the individual creation and property of specific communities, but as mutable and interrelating subsystems within a vast global kaleidoscope of words, grammatical rules, speech sounds and elements of writing.

All languages have benefited or may potentially benefit from the modern communications revolution in two fundamental ways:

· The recording and global transmission of the spoken word allow any spoken language to share the advantages previously reserved to written languages, enabling even small speech communities to maintain worldwide spoken contact.

· The instant transmission and exchange of the written word allow any written language to share the advantages previously reserved to speech, encouraging even children to use writing (instant messages by phone and computer, and e-mail) as an integral part of their social life.

3.2 The Need for the Identification of Languages within a Referential Framework

Any system of linguistic classification needs to contain an element of fluidity, in order to deal not only with the fundamental nature of the linguasphere but also with a still expanding knowledge of its complexity.

At the same time, it is necessary that the identifiable written and spoken languages of human communities be clearly and unambiguously catalogued and identified, from the international use of English or French to the unique speech of an isolated village in central Africa.

It is important to be aware of this contrast between (a) the need for fluidity in establishing and updating a sliding scale of linguistic interrelationships, and (b) the need for stability in identifying the individual spoken and recorded languages of humankind.

The primary objective in this field is therefore to complete a standardised international system of identification codes for the unambiguous tagging of all known forms of spoken and written languages, alive or recorded from the past, and for the correlation of those fixed tags to a separate scale of linguistic interrelationships.

4 The Proposal

4.1 The Institutional Background

The first comprehensive coded and classified inventory of the languages and speech communities of humankind during the 20th century was completed in December 1999 and published in Wales in 2000.

This Linguasphere Register of the World's Languages and Speech Communities provides a referential framework for the location and classification of over 22,000 identifiable varieties of speech and writing. The Linguasphere Register is supported by a unique and expandable Index of over 71,000 linguistic and ethnolinguistic names, each classified and coded within the referential framework, using a comprehensive scale of linguistic interrelationships.

The agency responsible for compiling and maintaining the Register is the Linguasphere Observatory (www.linguasphere.org), a transnational research network devoted to the study and maintenance of multilingualism. Conceived in Canada in 1983, the Observatory was established in France during the 1980s. During the 1990s it worked in close collaboration with the University of London's School of Oriental and African Studies, and it has been directed from bilingual Wales since 1995, with scientific support from Russia, India and the United States. Further details are given in section 5.3 of this paper.

In July 2001, the BSI (British Standards Institution) requested the Linguasphere Observatory to make a firm proposal for the establishment of a standardised alphanumeric coding system covering all the world's languages, based on existing and future codes of ISO 639 and correlated with the referential framework and relationship scale of the Linguasphere Register.

4.2 The Technical Background

The ISO Alpha-2 and Alpha-3 Codes for the Representation of Names of Languages (ISO 639) are complementary in purpose and form to the Numeric-2 Code employed for the Linguasphere Referential Framework (LRF) of the world's languages.

ISO 639 provides 2-letter or 3-letter tags (or "standardised abbreviations") for the identification of specific languages and groups of languages, whereas the LRF provides 2-digit tags for a referential inventory of the world's languages within 10 sectors (1st digit) and 100 zones (2nd digit).

4.2.1 ISO 639 Codes

The ISO 639 codes for a range of the most commonly encountered names of languages (and groups of languages) are presented in the International Organisation for Standardisation's Code for the representation of names of languages (1998), and are available online at http://lcweb.loc.gov/standards/iso639-2/langhome.html.

ISO 639-2, originally devised for use in library systems, now exists in slightly divergent forms, known as ISO 639-2/T (terminology) and ISO 639-2/B (bibliographic). Although the 3-letter codes of ISO 639-2 could in principle provide codes for 26 x 26 x 26 (17,576) languages, limits specified in the standard currently restrict the creation of new codes to languages with a substantial body of literature. If rigorously applied, this restriction limits the more generalised use of the ISO 639 codes, particularly in ICT usage.

As a result, some ICT users, including ministries and official agencies, have either made use of the SIL (Summer Institute of Linguistics) codes or developed their own coding systems, notably the OpenType specifications (OT) used in font and rendering technologies. Such variant codes have been developed in certain countries, including the UK, Sweden and Germany, and have in some cases caused clashes in bibliographic information interchange.

4.2.2 Linguasphere Codes

The Linguasphere 2-digit code is defined for over 22,000 modern languages and dialects, and for their historical forms where relevant, in the Linguasphere Register of the World's Languages and Speech Communities (published in 2 volumes by Linguasphere Press, Hebron, Wales 2000). The Linguasphere code (digital Reference Framework, plus alpha Relationship Scale) is discussed and exemplified on the Linguasphere Observatory website (http://www.linguasphere.org).

The fully coded Linguasphere classification and annotation of the world's languages is already available with limited access online (http://www.linguasphere.net) and is to be made freely accessible as a public resource within the next year (see section 5 below).

Appendix 1 to this paper displays the 2-digit tags of the Linguasphere Referential Framework as an additional column in the listing of ISO 639, as exemplified by the letters A-H in an alphabetical arrangement by English names of languages. Appendix 2 lists and explains the Linguasphere numerical tags, with their linguistic and/or geographical applications. Appendix 3 provides a table of the world's arterial languages (each reaching over 1% of the world's total population), with the relevant ISO 639 and Linguasphere codes. Appendix 4 presents an extract from the proposed Global Language Index, which is available as a starting point for the systematic extension of ISO 639 codes.

4.3 Formulation of the Proposal

It is proposed by the British TS/1 Committee and the British Standards Institution that a standard Global Identification Code for all known languages and speech communities be established, by prefixing the 2 digits of the Linguasphere Referential Framework (LRF) to the 2 or 3 letters of ISO 639 codes and by extending those codes to cover all spoken and recorded languages.

The purpose of this proposal is NOT to create yet another method of coding, but to enable existing ISO (TC/37) standards to work more efficiently and accurately, and to be expanded systematically to cover all languages and speech communities. The following pages outline how the 3 versions of the ISO 639 codes may be unified within this single standard, and how the formal linking of the ISO codes with the Linguasphere zones of reference would create an alphanumeric Global Identification Code (GIC) with compact informational content and inbuilt protection from error.

4.4 Practical considerations of the present proposal

Some of the problems hitherto associated with the codes of ISO 639, and with language identification in general, were discussed by Peter Constable and Gary Simons of SIL International in their paper Language Identification and IT: Addressing problems of linguistic diversity on a global scale, presented to the 17th International Unicode Conference (San Jose, California, September 2000).

Their paper proposes the extension of the present Alpha-3 system to cover all known varieties of written and spoken languages in the world, following the example established in the SIL's Ethnologue (14th edition, 2000). This proposal allows for the establishment of thousands of 3-letter codes to represent language names, not necessarily related in form to those names, but fails to address some of the fundamental problems of isolated 3-letter codes.

Constable and Simons (page 15) recognise the Linguasphere Register as "the only likely candidate" as an alternative to their proposed SIL system.

4.4.1 Advantages of the proposed Global Identification Code

The prefixing of the digits of the relevant LRF numeric code to a 2-letter or 3-letter form of an ISO 639 tag for a specific language would create a combined alphanumeric LRF/ISO tag or Global Identification Code. This combined tag would assist in solving several existing problems of identification and referential classification.

In comparison with the existing ISO or SIL tags, the combined LRF/ISO tag would be:

· more transparent, with the initial digit indicating one of five major affinities or one of five continental areas (e.g. 5 = Indo-European or 8 = native South American: see Appendix 2 to this paper);

· more easily located, with the 2 digits indicating a linguistic group or area (e.g. 53 = Slavic or 87 = Amazon: see Appendix 2 to this paper);

· more readily classifiable, together with the names of other related and/or adjacent languages, either within the same sector (first digit in common) or zone (both digits in common)

· better protected against typographical error in the citation of tags (each alpha component being tied to a specific numeric component).
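The classification benefit described in the bullets above can be illustrated with a minimal sketch. It assumes a handful of rows copied from the Appendix 1 extract; the list name SAMPLE and the function group_by_sector_and_zone are hypothetical names chosen for illustration and are not part of ISO 639 or the Linguasphere Register.

```python
from collections import defaultdict

# Sample rows (LRF zone, ISO 639-2/T code, English name) copied from Appendix 1.
SAMPLE = [
    ("52", "afr", "Afrikaans"),
    ("55", "sqi", "Albanian"),
    ("57", "hye", "Armenian"),
    ("59", "asm", "Assamese"),
    ("59", "ben", "Bengali"),
    ("12", "ara", "Arabic"),
    ("12", "amh", "Amharic"),
]


def group_by_sector_and_zone(rows):
    """Return {sector: {zone: [names]}}, taking the first LRF digit as the sector."""
    grouped = defaultdict(lambda: defaultdict(list))
    for zone, _code, name in rows:
        grouped[zone[0]][zone].append(name)
    return grouped


groups = group_by_sector_and_zone(SAMPLE)
for sector in sorted(groups):
    for zone in sorted(groups[sector]):
        print(sector, zone, ", ".join(groups[sector][zone]))
```

Languages sharing both digits (here Assamese and Bengali in zone 59) fall into the same zone, while those sharing only the first digit fall into the same sector.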

At the same time, the continued existence of a single series of unambiguous ISO 2-letter and 3-letter codes to identify the languages of the world would mean that the combined LRF/ISO tags could be abbreviated for practical purposes by the optional omission of the LRF numerical prefix.

· In such abbreviated usage, the invisible LRF code would still underlie any IT usage of the 2-letter or 3-letter components.

· The LRF numerical prefix would thus be available not only to classify language codes as required but, very importantly, to serve as a check against typographical error. A mistyped 2-letter or 3-letter code would have only a 1% chance of matching the correct numerical prefix.
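The error check described above might be sketched as follows. This is only an illustration under stated assumptions: REGISTRY is a hypothetical lookup table populated with a few alpha/zone pairs from the Appendix 1 extract, and check_tag is a hypothetical helper, not a defined part of the proposal.

```python
# Hypothetical registry mapping ISO alpha codes to their LRF zone (pairs from Appendix 1).
REGISTRY = {
    "sq": "55", "sqi": "55",  # Albanian
    "af": "52", "afr": "52",  # Afrikaans
    "ar": "12", "ara": "12",  # Arabic
    "as": "59", "asm": "59",  # Assamese
    "ay": "84", "aym": "84",  # Aymara
}


def check_tag(tag):
    """Return True only if the 2-digit prefix is the zone registered for the alpha part."""
    prefix, alpha = tag[:2], tag[2:]
    return prefix.isdigit() and REGISTRY.get(alpha) == prefix


print(check_tag("55sq"))  # correct tag for Standard Albanian
print(check_tag("55as"))  # "as" typed in place of "sq": a valid code, but in zone 59
```

A mistyped alpha component that happens to be another valid code is caught unless that code also belongs to the same zone. With 100 zones, and assuming codes are spread fairly evenly across them, this corresponds to the roughly 1% residual risk mentioned above.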

Use of combined LRF/ISO tags would also open the way to a more structured approach to the classification of "language names", which may be used to indicate a wide variety of different categories of language name with an identical form of 3 letters.

· The existing ISO Alpha-3 codes may indicate either the name of a specific standardised language or of a wider "language" composed of two or more closely related spoken and/or written languages (e.g. the Ashkharik and Arewmta varieties of Armenian, or the Gheg and Tosk varieties of Albanian), or an historical and/or liturgical language (e.g. Church Slavonic), or a grouping of languages of undetermined dimension or nature (e.g. Athapascan languages, or "other" Austronesian, or "other" Creoles and pidgins). See examples in Appendix 1.

In contrast, the use of combined LRF/ISO tags could be associated with the reconsolidation and extension of the 3 existing lists of ISO tags, to create a single, more coherent and explicit system.

· In an increasingly internationalised world, it is appropriate that alpha codes for specific languages should be based wherever possible on the autoglossonym (or indigenous form of the language name) rather than on the English name (where this is different).

· In this respect, where ISO 639-2/B diverges from 639-2/T, the 2/T code is generally to be preferred (e.g. /eus/ in preference to /baq/ for Basque, for which the autoglossonym is Euskara). It may be noted that the Linguasphere Register also gives precedence to autoglossonyms.
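As a minimal sketch of this preference, the divergent pairs visible in the Appendix 1 extract can be normalised to their 639-2/T form. The table B_TO_T and the function prefer_terminology_code are hypothetical names used for illustration, and the table covers only the three divergent pairs shown in that extract.

```python
# Divergent 639-2/B -> 639-2/T pairs taken from the Appendix 1 extract.
B_TO_T = {
    "alb": "sqi",  # Albanian
    "arm": "hye",  # Armenian
    "baq": "eus",  # Basque
}


def prefer_terminology_code(code):
    """Return the 639-2/T form of a 3-letter code, leaving other codes unchanged."""
    code = code.lower()
    return B_TO_T.get(code, code)


print(prefer_terminology_code("baq"))
print(prefer_terminology_code("afr"))
```

Codes that are identical in both lists, such as afr, pass through unchanged.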

Most importantly, a distinction would need to be drawn between the application of LRF/ISO tags to specific objects (i.e. standard written languages) and their application to "fuzzy" phenomena (i.e. non-standardised languages, or continua of closely related spoken languages or dialects) or to referential boxes created or identified for use in the classification of languages (i.e. language groups, families or categories or areas of languages, including the Linguasphere sectors and zones).

· The most frequent use of tags to represent language names in IT is for the identification of specific standard written languages.

→ For practical purposes, a standard written language may be defined as a language whose form is largely fixed by means of a system of graphic conventions, established and exemplified by the publication of a large corpus of texts (normally in thousands or more).

→ A standard written language may also have a spoken form, modelled largely on the use of the written form, in the same way that a spoken language may also have a written form, transcribing actual speech.

· Most such languages are already provided for under the ISO 639-1 Alpha-2 code, and it would be helpful if the use of combined LRF/ISO tags of the form Numeric-2 plus Alpha-2 could be specifically confined to the identification of standard written languages (including their standard spoken forms, wherever these are modelled on the written language).

· The potential number of Alpha-2 codes (26 x 26 = 676) is adequate to retain the existing ISO 639/1 codes for standard written languages, regardless of the prefixed digits.

· If the linguasphere were one day to include more than 676 such languages, then it would be possible to duplicate some alpha codes under different digits.

· A more stable solution, however, would be to limit the use of Alpha-2 codes to a closed list of all those written languages standardised before the end of the 20th century.

· This basic LRF/ISO tag "for the representation of the names of standard languages" would be only one character longer than the existing Alpha-3 tags, but would be considerably more systematic and rich in information, and more secure against typographical error. Cf. 79zh for Standard Chinese (rather than /zho/ or /chi/ for all forms of Chinese) or 55sq for Standard Albanian (rather than /sqi/ or /alb/ for all forms of Albanian).

· The initial 7 or 5 locates Chinese and Albanian within Sino-Tibetan or Indo-European, respectively, and it would be useful to produce a list of LRF/ISO tags classified numerically by the LRF digits (alongside the existing ISO lists arranged by alpha code or by names of languages in English or in French).

· The alphanumeric form of LRF/ISO tags would be readily identifiable as language codes within other text, as opposed to the potential confusion of some existing Alpha-3 tags with real words (e.g. /bug/ for Buginese or /got/ for Gothic).

In contrast, LRF/ISO tags applied to "other types" of language name (non-standardised languages, fuzzy continua or referential boxes) could be based on the Alpha-3 codes of ISO 639-2/T.

· This distinction between Alpha-2 and Alpha-3 codes would be useful in distinguishing standardised languages within fuzzy continua of spoken languages and dialects, e.g. between the varieties of standardised Norwegian (Bokmål = 52no or Norwegian Nynorsk = 52nn) and "wider" Norwegian in all its forms (= 52nor, which has fuzzy boundaries within the continuum of other spoken forms of Scandinavian languages).
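The structure of the proposed tags (2 digits followed by an Alpha-2 or Alpha-3 code) lends itself to simple mechanical parsing. The following is a minimal sketch, assuming lower-case alpha components and a full 2-digit zone prefix; the names TAG_PATTERN, ParsedTag and parse_tag are hypothetical, and sector-only entries such as those listed with a single digit in Appendix 1 are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional

# Two ASCII digits (sector digit, then zone digit) followed by 2 or 3 lower-case letters.
TAG_PATTERN = re.compile(r"([0-9])([0-9])([a-z]{2,3})")


class ParsedTag(NamedTuple):
    sector: str  # first digit
    zone: str    # both digits
    alpha: str   # ISO 639 component
    kind: str    # category implied by the length of the alpha component


def parse_tag(tag: str) -> Optional[ParsedTag]:
    """Split a combined LRF/ISO tag into its parts, or return None if malformed."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        return None
    sector, zone_digit, alpha = match.groups()
    kind = "standard written language" if len(alpha) == 2 else "other language name"
    return ParsedTag(sector, sector + zone_digit, alpha, kind)


for tag in ("52no", "52nn", "52nor"):
    print(tag, parse_tag(tag))
```

Under this sketch, 52no and 52nn are read as standard written languages in zone 52, while 52nor is read as the wider, non-standardised category within the same zone.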

There are a number of other more detailed points to be considered in the design of any improved codes for language names (including the treatment of historical languages, for example), which will be dealt with within the fully developed presentation of the proposed LRF/ISO system. The Linguasphere Observatory looks forward to productive discussions on all aspects of the development of ISO 639, with members of BSI and ISO, and beyond.

5 Towards a Global Public Resource

5.1 Progress towards a tripartite global reference guide

The present proposal is designed to provide the key element in the production of a fully coded and interactive global reference guide to

· the languages and speech communities of the world,

· their established linguistic relationships,

· their global corpus of linguistic and ethnic names, and

· their geographic positions and demography.

This global reference guide would take the form of a freely available, independent and multilingual website, comprising 3 interdependent "panoramas". These would share a common alphanumeric coding system throughout (ISO 639 plus Linguasphere), and would be fully inter-referenced and interaccessible at every point:

5.1.1 the Global Index of the World's Languages and Speech Communities (or ISO 639/ Linguasphere Index), presenting an alphabetical key to the identification and location of all known written and recorded languages and dialects, and all varieties of linguistic, ethnic and communal names. This panorama, covering a total of over 71,000 names, is already available in a first printed edition (but without ISO codes), as the Index to the Linguasphere Register. This progressively updated and expanded edition will be opened to free public access and dialogue on the internet within the next year. An extract from this Index, covering names beginning G-, has been prepared and is now being extended as part of the current proposal. This will include existing and proposed additional ISO-639 codes.

5.1.2 the Global Register of the World's Languages and Speech Communities, presenting a comprehensive scale of linguistic relationships among the spoken and recorded languages and dialects of the world, and their relevant speech communities. This panorama is already available in a first printed edition as the Linguasphere Register of the World's Languages and Speech Communities, covering over 22,000 varieties of languages and dialects. This edition will be opened to free public access and dialogue on the internet within the next year, and will be progressively updated and expanded online. Extensive extracts are already freely available at /www.linguasphere.org/.

5.1.3 the Global Mapbase of the World's Languages and Speech Communities, presenting a cartographic survey of the location, distribution and interrelationships of the world's languages and speech communities. This panorama has already been developed by the Linguasphere Observatory for Africa (linguistically the most complex continent in the world), in collaboration with the London School of Oriental and African Studies (SOAS). It has been printed as the first sheet of the Linguasphere Mapbase of the World's Languages and Speech Communities and is currently being extended into southern Europe and western Asia, in collaboration with the Languages of the World unit of the Russian Academy of Sciences (Akademia Nauk). The first African layer of this map is viewable at http://www.soas.ac.uk/Geography/LanguageMapping/home.html. This same page on the SOAS website illustrates how subsequent layers of the Linguasphere Mapbase will be accessible by zooming, down to the layer of urban speech communities (as already surveyed and published for over 300 minority languages of London, see Bibliography below).

5.2 Applications of the tripartite reference guide

This three-part electronic reference guide will serve as

· a transnational reference system and educational resource for teaching covering

→ the global complexity of humankind, as represented by the overlying diversity of its languages and the divergent welfare and cultures of its individual speech communities,

→ the underlying unity of humankind, as represented by a worldwide continuum of multilingual communication and intercommunal identities (the "linguasphere"), and

→ the establishment of comprehensive links with - and annotated signposts towards - a vast range of other electronic sources on the languages, peoples and cultures of the world;

· a stimulus to innovative teaching and research, including

→ the active investigation and surveying of linguistic and ethnic realities and relationships, including the continuous updating and expansion of the global reference guide itself;

→ the transnational observation and documentation, regardless of frontiers, of

- the actual and relative welfare of all speech communities in the world,

- the movement and migration of speech communities and their members,

- the formation and distribution of minority urban speech communities,

- the incidence of all forms of genocide and other forms of discrimination among ethnolinguistic communities;

→ the awakening of public interest in questions of the transnational and multilingual heritage and origins of communities and individuals (the "languages of our ancestors"). Linked to the growing strength of public interest in genealogical research, this development may be of particular importance in encouraging the development of bilingualism among first language English-speaking communities (in danger of becoming the only communities deprived of the advantages of bilingualism, in an otherwise multilingual world).

5.3 The Linguasphere Observatory

The present proposals and products are the outcome of many years of research and development at the Linguasphere Observatory in Wales and at its previous location in France as the Observatoire Linguistique. Created in 1983, after planning and discussion in Quebec (at CIRB, the Centre International pour la Recherche en Bilinguisme at the Université Laval), the Observatory was set up in Normandy as a transnational research network devoted to the study and development of multilingualism (under the honorary presidency of Léopold Sédar Senghor of Senegal, and registered under the French law of association of 1901).

Among other linguistic activities, the Observatory was responsible for two bilingual exhibitions on languages at the Centre Georges Pompidou in Paris during the 1980s, with substantial support from the Government of Canada. (These exhibitions subsequently toured internationally, including London, Liège and Lagos, and around the world to Canberra.) Since 1995, the Observatory has been based in a bilingual area of west Wales, under the directorship of David Dalby, where the Linguasphere Register of the World's Languages and Speech Communities was first published at the turn of the millennium (1999/2000). Scientific support has been received from Russia, France, India and the United States.

It is appropriate that the present proposals and products should emanate from Wales, a country whose language has successfully resisted and survived the successive invasion of its territory by two of the most powerful languages in the history of the world, Latin and English. All speech communities now need to consider their relationship to English, as a global lingua franca, and in this respect the indigenous speech community of Wales has the longest experience in the world, having faced the growing strength of its English neighbour for more than one millennium. The cultural strength and linguistic survival of the Welsh-speaking community offer an important message of encouragement to small speech communities everywhere. English has a transnational role to play in the world, along with other "arterial" languages, but should be developed in the service of a multilingual global society, NOT as the medium of a monolingual culture.

That the British Standards Institution in London (BSI) and the University of London's School of Oriental and African Studies (SOAS) should have given their support to the proposals and products of the Linguasphere Observatory in Wales is also significant. At a time when countries around the world are devoting resources to the study of a language associated with England, it is appropriate that major public institutions in that country should devote resources to the study and development of multilingualism and of the languages of the world.

Linguasphere Observatory and British Standards Institution August 2001

Comments on this paper, prepared at relatively short notice for the ISO TC/37 meeting in Toronto, will be greatly welcomed, by post or by e-mail to research@linguasphere.net.
A more detailed proposal will be prepared by the Linguasphere Observatory for the beginning of 2002, including the orderly extension of identification codes to all spoken and written languages, and the examination of procedures for combining language codes with codes for countries and for scripts.

Bibliography

Baker, Philip & Eversley, John, Multilingual Capital: the languages of London's schoolchildren, Battlebridge Press: London, 2000.

Constable, Peter & Simons, Gary (SIL), Language Identification and IT: Addressing problems of linguistic diversity on a global scale, presented to the 17th International Unicode Conference, San Jose (California), September 2000.

Grimes, Barbara F. (editor), Ethnologue: Languages of the World (14th ed.), SIL: Dallas, 2000.

ISO Code for the representation of names of languages, ISO, 1998.

Linguasphere Register of the World's Languages and Speech Communities (2 volumes), Linguasphere Press: Hebron (Wales), 2000.

Appendix 1: ISO 639 Codes for the Representation of Language Names
correlated with the Linguasphere Referential Framework of 100 Zones

ISO 639-1 is an Alpha-2 code
ISO 639-2/T & /B are Alpha-3 (/T= terminology code; /B = bibliographic code)
The Linguasphere Referential Framework (LRF) is a Numeric-2 code (see Appendix 2)

(extract) A-H

as arranged alphabetically by English name of language

The proposed Identification Code will comprise the LRF + ISO 639-1 or 639-2/T elements

Language Name (English)	Language Name (French)	LRF +	639-1	639-2/T	639-2/B
Abkhazian	abkhaze	42	ab	abk	abk
Achinese	aceh	31		ace	ace
Acoli	acoli	04		ach	ach
Adangme	adangme	96		ada	ada
Afar	afar	14	aa	aar	aar
Afrihili	afrihili	99		afh	afh
Afrikaans	afrikaans	52	af	afr	afr
Afro-Asiatic (Other)	afro-asiatiques, autres langues	1		afa	afa
Akan	akan	96		aka	aka
Akkadian	akkadien	12		akk	akk
Albanian	albanais	55	sq	sqi	alb
Aleut	aléoute	60		ale	ale
Algonquian languages	algonquines, langues	62		alg	alg
Altaic (Other)	altaïques, autres langues	4		tut	tut
Amharic	amharique	12	am	amh	amh
Apache languages	apache	61		apa	apa
Arabic	arabe	12	ar	ara	ara
Aramaic	araméen	12		arc	arc
Arapaho	arapaho	62		arp	arp
Araucanian	araucan	85		arn	arn
Arawak	arawak	82		arw	arw
Armenian	arménien	57	hy	hye	arm
Artificial (Other)	artificielles, autres langues			art	art
Assamese	assamais	59	as	asm	asm
Athapascan languages	athapascanes, langues	61		ath	ath
Australian languages	australiennes, langues	2		aus	aus
Austronesian (Other)	malayo-polynésiennes, autres langues	3		map	map
Avaric	avar	42		ava	ava
Avestan	avestique	58	ae	ave	ave
Awadhi	awadhi	59		awa	awa
Aymara	aymara	84	ay	aym	aym
Azerbaijani	azéri	44	az	aze	aze
Balinese	balinais	31		ban	ban
Baltic (Other)	baltiques, autres langues	54		bat	bat
Baluchi	baloutchi	58		bal	bal
Bambara	bambara	00		bam	bam
Bamileke languages	bamilékés, langues	99		bai	bai
Banda	banda	93		bad	bad
Bantu (Other)	bantoues, autres langues	99		bnt	bnt
Basa	basa	95 or 99		bas	bas
Bashkir	bachkir	44	ba	bak	bak
Basque	basque	40	eu	eus	baq
Batak (Indonesia)	batak (Indonésie)	31		btk	btk
Beja	bedja	13		bej	bej
Belarusian	biélorusse	53	be	bel	bel
Bemba	bemba	99		bem	bem
Bengali	bengali	59	bn	ben	ben
Berber (Other)	berbères, autres langues	10		ber	ber
Bhojpuri	bhojpuri	59		bho	bho
Bihari	bihari	59	bh	bih	bih
Bikol	bikol	31		bik	bik
Bini	bini	20 or 98		bin	bin
Bislama	bichlamar	52	bi	bis	bis
Bosnian	bosniaque	53	bs	bos	bos
Braj	braj	59		bra	bra
Breton	breton	50	br	bre	bre
Buginese	bugi	31		bug	bug
Bulgarian	bulgare	53	bg	bul	bul
Buriat	bouriate	44		bua	bua
Burmese	birman	77	my	mya	bur
Caddo	caddo	64		cad	cad
Carib	caribe	80		car	car
Catalan	catalan	51	ca	cat	cat
Caucasian (Other)	caucasiennes, autres langues	42		cau	cau
Cebuano	cebuano	31		ceb	ceb
Celtic (Other)	celtiques, autres langues	50		cel	cel
Central American Indian (Other)	indiennes d'Amérique centrale, autres langues	6		cai	cai
Chagatai	djaghataï	44		chg	chg
Chamic languages	chames, langues	31		cmc	cmc
Chamorro	chamorro	31	ch	cha	cha
Chechen	tchetchène	42	ce	che	che
Cherokee	cherokee	63		chr	chr
Cheyenne	cheyenne	62		chy	chy
Chibcha	chibcha	81		chb	chb
Chichewa; Nyanja	chichewa; nyanja	99	ny	nya	nya
Chinese	chinois	79	zh	zho	chi
Chinook jargon	chinook, jargon	66		chn	chn
Chipewyan	chipewyan	61		chp	chp
Choctaw	choctaw	68		cho	cho
Church Slavic	slavon d'église	53	cu	chu	chu
Chuukese	chuuk	38		chk	chk
Chuvash	tchouvache	44	cv	chv	chv
Coptic	copte	11		cop	cop
Cornish	cornique	50	kw	cor	cor
Corsican	corse	51	co	cos	cos
Cree	cree	62		cre	cre
Creek	muskogee	68		mus	mus
Creoles and pidgins (Other)	créoles et pidgins divers	–		crp	crp
Creoles and pidgins, English-based (Other)	créoles et pidgins anglais, autres	52		cpe	cpe
Creoles and pidgins, French-based (Other)	créoles et pidgins français, autres	51		cpf	cpf
Creoles and pidgins, Portuguese-based (Other)	créoles et pidgins portugais, autres	51		cpp	cpp
Croatian	croate	53	hr	hrv	scr
Cushitic (Other)	couchitiques, autres langues	1		cus	cus
Czech	tchèque	53	cs	ces	cze
Dakota	dakota	64		dak	dak
Danish	danois	52	da	dan	dan
Dayak	dayak	31		day	day
Delaware	delaware	62		del	del
Dinka	dinka	04		din	din
Divehi	maldivien	59		div	div
Dogri	dogri	59		doi	doi
Dogrib	dogrib	61		dgr	dgr
Dravidian (Other)	dravidiennes, autres langues	49		dra	dra
Duala	douala	99		dua	dua
Dutch	néerlandais	52	nl	nld	dut
Dutch, Middle (ca. 1050-1350)	néerlandais moyen (ca. 1050-1350)	52		dum	dum
Dyula	dioula	00		dyu	dyu
Dzongkha	dzongkha	70	dz	dzo	dzo
Efik	efik	98		efi	efi
Egyptian (Ancient)	égyptien	11		egy	egy
Ekajuk	ekajuk	99		eka	eka
Elamite	élamite	12		elx	elx
English	anglais	52	en	eng	eng
English, Middle (1100-1500)	anglais moyen (1100-1500)	52		enm	enm
English, Old (ca.450-1100)	anglo-saxon (ca.450-1100)	52		ang	ang
Esperanto	espéranto	51	eo	epo	epo
Estonian	estonien	41	et	est	est
Ewe	éwé	96		ewe	ewe
Ewondo	éwondo	99		ewo	ewo
Fang	fang	99		fan	fan
Fanti	fanti	96		fat	fat
Faroese	féroïen	52	fo	fao	fao
Fijian	fidjien	39	fj	fij	fij
Finnish	finnois	41	fi	fin	fin
Finno-Ugrian (Other)	finno-ougriennes, autres langues	41		fiu	fiu
Fon	fon	96		fon	fon
French	français	51	fr	fra	fre
French, Middle (1400-1600)	français moyen (1400-1600)	51		frm	frm
French, Old (842-1400)	français ancien (842-1400)	51		fro	fro
Frisian	frison	52	fy	fry	fry
Friulian	frioulan	51		fur	fur
Fulah	peul	90		ful	ful
Ga	ga	96		gaa	gaa
Gaelic (Scots)	gaélique d'Ecosse	50	gd	gla	gla
Gallegan	galicien	51	gl	glg	glg
Ganda	ganda	99		lug	lug
Gayo	gayo	31		gay	gay
Gbaya	gbaya	93		gba	gba
Geez	guèze	12		gez	gez
Georgian	géorgien	42	ka	kat	geo
German	allemand	52	de	deu	ger
German, Low; Saxon, Low; Low German; Low Saxon	allemand, bas; saxon, bas; bas allemand; bas saxon	52		nds	nds
German, Middle High (ca.1050-1500)	allemand, moyen haut (ca. 1050-1500)	52		gmh	gmh
German, Old High (ca.750-1050)	allemand, vieux haut (ca. 750-1050)	52		goh	goh
Germanic (Other)	germaniques, autres langues	52		gem	gem
Gilbertese	kiribati	38		gil	gil
Gondi	gond	49		gon	gon
Gorontalo	gorontalo	31		gor	gor
Gothic	gothique	52		got	got
Grebo	grebo	95		grb	grb
Greek, Ancient (to 1453)	grec ancien (jusqu'à 1453)	56		grc	grc
Greek, Modern (1453-)	grec moderne (après 1453)	56	el	ell	gre
Guarani	guarani	88	gn	grn	grn
Gujarati	goudjrati	59	gu	guj	guj
Gwich´in	gwich´in	61		gwi	gwi
Haida	haida	66		hai	hai
Hausa	haoussa	19	ha	hau	hau
Hawaiian	hawaïen	39		haw	haw
Hebrew	hébreu	12	he	heb	heb
Herero	herero	99	hz	her	her
Hiligaynon	hiligaynon	31		hil	hil
Himachali	himachali	59		him	him
Hindi	hindi	59	hi	hin	hin
Hiri Motu	hiri motu	34	ho	hmo	hmo
Hittite	hittite	5		hit	hit
Hmong	hmong	48		hmn	hmn
Hungarian	hongrois	41	hu	hun	hun
Hupa	hupa	61		hup	hup
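
The composition rule stated at the head of this appendix (LRF digits followed by the ISO 639-1 or 639-2/T element) can be illustrated with a minimal Python sketch. The function name and the preference for the Alpha-2 element whenever one exists are assumptions made for illustration only; the paper does not fix an order of preference here. Rows whose LRF entry is a single sector digit, or is ambiguous (e.g. "95 or 99"), are deliberately rejected.

```python
from typing import Optional


def compose_identification_code(lrf: str, alpha2: Optional[str], alpha3_t: str) -> str:
    """Prefix an ISO 639 element with the two LRF digits, with no hyphen.

    Assumption (not stated in the paper): the 639-1 element is used when it
    exists, otherwise the 639-2/T element.
    """
    if len(lrf) != 2 or not (lrf.isascii() and lrf.isdigit()):
        raise ValueError(f"expected a two-digit LRF zone, got {lrf!r}")
    alpha = alpha2 if alpha2 else alpha3_t
    return f"{lrf}{alpha}"


# Three rows copied from the table above: (name, LRF, 639-1, 639-2/T).
rows = [
    ("Abkhazian", "42", "ab", "abk"),
    ("Achinese", "31", None, "ace"),
    ("Basque", "40", "eu", "eus"),
]

codes = {name: compose_identification_code(lrf, a2, a3) for name, lrf, a2, a3 in rows}
print(codes)
```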

Appendix 2: Linguasphere Referential Framework of 10 Sectors & 100 Zones

Principles of Linguasphere Numeric Code: first digit = Sectors; second digit = Zones

5 geosectors (initial even digit), comprising 28 geozones + 22 phylozones

5 phylosectors (initial odd digit), comprising 50 phylozones

Sectors

The mother-tongues of the majority of humankind have been classified, with widespread agreement, within only five major linguistic families or affinities: Afro-Asiatic (or Hamitic-Semitic), Austronesian, Indo-European, Sino-Tibetan (or Sino-Indian) and Atlantic-Congo (or Transafrican, corresponding to 'old Niger-Congo' less Mande). A primary referential division which may be established within the linguasphere is the division between (i) languages classified outside the five major affinities, and (ii) all those languages which have been classified within them.

Languages in category (i) have been classified by cautious historical linguists into more than two hundred separate entities, and are classified initially in the Linguasphere Register within five geosectors (see first column of the table below), corresponding to each continent where they are spoken. Languages in category (ii) are classified within five linguistic phylosectors corresponding to the continental or intercontinental affinity to which each of them belongs (see second column). Each of the five geosectors bears the name of the relevant continental area (ending in English always in –a), while each of the phylosectors bears the name corresponding to the relevant affinity (ending in English always in –an). The ten sectors are ordered (both alphabetically and numerically) in such a way that geosectors are indicated by even digits and phylosectors by odd digits.

Zones

The second layer of classification, indicated by each pair of digits, is composed of the 100 zones listed in the table below, representing the most useful referential division of each of the above geosectors and phylosectors into ten parts. Within each phylosector, the component zones (phylozones) are based on the known linguistic subdivisions of each of the affinities concerned, selected subdivisions being either combined or further divided to arrive at a total of ten referential parts. 5=Indo-European, for example, divides readily into ten phylozones or "branches", whereas under 1=Afro-Asian a total of ten phylozones is arrived at by allocating three zones to the more complex Chadic "branch" of the Afro-Asiatic affinity.

Within the five geosectors, 22 of the 50 component zones are themselves phylozones, corresponding to wider or narrower affinities, as in the case of 00=Mandic in Africa or 41=Uralic in Eurasia. The remaining 28 zones are geozones, corresponding to geographic groupings of languages which may (but do not necessarily) share a geo-typological relationship, as in the case of 42=Caucasus or 43=Siberia. Languages within the same geozone should never be assumed to be linguistically related, although some of them may be (as clearly indicated in the Linguasphere Register). The 28 geozones together account for a total of 380 sets of related languages (in contrast to 314 sets included within 72 phylozones), although representing only a small minority of the world's current population. Names given to the 100 zones are harmonised by use of the suffix -ic, and the zones of each sector are numbered as far as possible from north to south and/or from west to east.

0=AFRICA geosector	1=AFRO-ASIAN phylosector
00=MANDIC	10=TAMAZIC
01=SONGHAIC	11=COPTIC
02=SAHARIC	12=SEMITIC
03=SUDANIC	13=BEJIC
04=NILOTIC	14=CUSHITIC
05=EAST-SAHEL geozone	15=EYASIC
06=KORDOFANIC	16=OMOTIC
07=RIFT-VALLEY geozone	17=CHARIC
08=KHOISANIC	18=MANDARIC
09=KALAHARI geozone	19=BAUCHIC

2=AUSTRALASIA geosector	3=AUSTRONESIAN phylosector
20=ARAFURA geozone	30=TAIWANIC
21=MAMBERAMO geozone	31=HESPERONESIC
22=MADANGIC	32=MESONESIC
23=OWALAMIC	33=HALMAYAPENIC
24=TRANSIRIANIC	34=NEOGUINEIC
25=CENDRAWASIH geozone	35=MANUSIC
26=SEPIK-VALLEY geozone	36=SOLOMONIC
27=BISMARCK-SEA geozone	37=KANAKIC
28=NORTH-AUSTRALIA geozone	38=WEST-PACIFIC
29=TRANSAUSTRALIA geozone	39=TRANSPACIFIC

4=EURASIA geosector	5=INDO-EUROPEAN phylosector
40=EUSKARIC	50=CELTIC
41=URALIC	51=ROMANIC
42=CAUCASUS geozone	52=GERMANIC
43=SIBERIA geozone	53=SLAVIC
44=TRANSASIA geozone	54=BALTIC
45=EAST-ASIA geozone	55=ALBANIC
46=SOUTH-ASIA geozone	56=HELLENIC
47=DAIC	57=ARMENIC
48=MIENIC	58=IRANIC
49=DRAVIDIC	59=INDIC

6=NORTH-AMERICA geosector	7=SINO-INDIAN phylosector
60=ARCTIC	70=TIBETIC
61=NADENIC	71=HIMALAYIC
62=ALGIC	72=GARIC
63=SAINT-LAWRENCE geozone	73=KUKIC
64=MISSISSIPPI geozone	74=MIRIC
65=AZTECIC	75=KACHINIC
66=FARWEST geozone	76=RUNGIC
67=DESERT geozone	77=IRRAWADDIC
68=GULF geozone	78=KARENIC
69=MESO-AMERICA geozone	79=SINITIC

8=SOUTH-AMERICA geosector	9=TRANSAFRICAN phylosector
80=CARIBIC	90=ATLANTIC
81=INTER-OCEAN geozone	91=VOLTAIC
82=ARAWAKIC	92=ADAMAWIC
83=PRE-ANDES geozone	93=UBANGIC
84=ANDES geozone	94=MELIC
85=CHACO-CONE geozone	95=KRUIC
86=MATO-GROSSO geozone	96=AFRAMIC
87=AMAZON geozone	97=DELTIC
88=TUPIC	98=BENUIC
89=BAHIA geozone	99=BANTUIC

The Linguasphere sectors and zones form a stable system of reference for the world's languages, providing a transnational framework
for linguistic study and a stable "workbench" on which the jigsaw of linguistic relationships may be assembled and re-ordered as necessary.
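
The reading of a two-digit zone code described in this appendix (first digit = sector, even for geosectors and odd for phylosectors) can be sketched in Python as follows. The function and dictionary names are illustrative assumptions; only the sector names and the even/odd rule are taken from the table above.

```python
# Sector names as listed in the table above, keyed by the first digit.
SECTORS = {
    0: "AFRICA", 1: "AFRO-ASIAN",
    2: "AUSTRALASIA", 3: "AUSTRONESIAN",
    4: "EURASIA", 5: "INDO-EUROPEAN",
    6: "NORTH-AMERICA", 7: "SINO-INDIAN",
    8: "SOUTH-AMERICA", 9: "TRANSAFRICAN",
}


def describe_zone(code: str) -> dict:
    """Split a two-digit LRF code into its sector and zone components."""
    if len(code) != 2 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected two digits, got {code!r}")
    sector = int(code[0])
    # Even first digit = geosector, odd first digit = phylosector.
    kind = "geosector" if sector % 2 == 0 else "phylosector"
    return {
        "zone": code,
        "sector": sector,
        "sector_name": SECTORS[sector],
        "sector_kind": kind,
    }


print(describe_zone("53"))  # Slavic zone within the Indo-European phylosector
print(describe_zone("42"))  # Caucasus geozone within the Eurasia geosector
```

Note that the even/odd rule identifies the kind of sector only. Whether an individual zone inside a geosector is a geozone or a phylozone (e.g. 41=URALIC) has to be looked up in the table.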

Source: Linguasphere Register of the World's Languages and Speech Communities (2 vols), Linguasphere Press, Hebron (Wales): 2000

Appendix 3:

CHART OF THE WORLD'S ARTERIAL LANGUAGES
each reaching over 1% of Humankind (above 60 million hearers)

The linguasphere is the mantle of multilingual communication woven around the planet by humankind since children first learned to talk.

A total of 28 arterial languages can each reach more than 1% of humankind – over 60 million hearers each – as either first or "second" languages. Several comprise a sequence of closely related "inner languages", some with different scripts (e.g. Hindi-Urdu, or Thai-Lao), while others may cover wide spoken variations with a single written standard (e.g. Arabic or German). Arrows → denote that the preceding language may be partially intelligible to speakers of the language(s) following. Hearers reached by two arterial languages are counted within the total "range" of both. Population ranges, internal percentages [%], & global percentages % are rounded estimates. The ISO 639 codes comprise the 2-letter and 3-letter tags adopted by the International Organization for Standardization for major languages (the ISO 639-1 "Alpha-2" and 639-2/T "Alpha-3" codes).

When the sun is over the western Pacific, the most spoken language is Chinese, followed by Hindi. 12 hours later, English and Spanish take the lead.

This chart, which may be reproduced freely for educational use, is adapted from The Linguasphere Register of the World's Languages and Speech Communities (2 volumes, 1043 pages) by David Dalby, obtainable from the Linguasphere Observatory, Hebron, SA34 0XT, Wales or from /www.linguasphere.net/. The Register is the first systematic roll-call of humankind, based on language rather than nation-state. Completed for the start of the new millennium, it will now be permanently updated and expanded as a free resource open to all on the worldwide web. This public service, undertaken initially in Wales, Russia, India and France, depends on new data from users of the Register worldwide, and on support and sponsorship of the Observatory's research by institutions, companies and individuals around the globe.

The Linguasphere Observatory is an independent transnational research network dedicated to the promotion of understanding in a multilingual world.

WIDER AFFINITIES = [single odd digit] see note below [two digits] = Linguasphere zones	ARTERIAL LANGUAGES reaching over 60 million hearers = over 1% of humankind (« over 2%; l over 8%; g over 16%)	RANGE in millions	MAJOR COUNTRIES OR REGIONS including official or co-official use and 8 principal diasporas	ISO 639 Codes	SCRIPTS including A Arabic L Latin	Literate transnational [%] F female M male	Online Share global %
[1] AFRO-ASIAN (Afro-Asiatic) [12] Semitic	ARABIC « (al-'Arabiyya, including Maghribi or Arabic "West" + Mashriqi or Arabic "East" + Badawi or Bedouin Arabic)	250m	Morocco; Algeria; Tunisia; Libya; Chad; Egypt; Israel; Palestine; Jordan; Saudi Arabia; Iraq; Lebanon; Syria; Iran; Gulf states; Oman; Yemen; Sudan; Mauritania 8 France…	ar/ara	A	[40%F ~ 65%M]	1%
[3] AUSTRONESIAN [31] Hesperonesic	MALAY-INDONESIAN « (including Malayu + Bahasa-Indonesia)	200m	Malaysia; Singapore; Indonesia 8 Netherlands...	ms/msa + id/ind	L; (A)	[80%F ~ 90%M]
	→ JAVANESE (Jawa)	100m	Indonesia 8 Surinam...	jw/jav	Javanese	[80%F ~ 90%M]
	TAGALOG (including Pilipino) → other Transphilippine languages	60m	Philippines 8 USA; Canada...	tl/tgl	L	[90%FM]
[4] "other" languages of Eurasia [44] Turkic	TURKISH-AZERBAIJANI (including Türkçe + Azeri + Turkmen) → other Turkic languages	100m	Turkey; Bulgaria; Greece; Cyprus; Iran; Azerbaijan; Turkmenistan & Central Asia 8 Russian Fed.; Germany…	tr/tur + az/aze	L; Cyrillic; (A)	[75%F ~ 90%M]
[45] isolated East Asia language	JAPANESE « (Nihongo)	130m	Japan 8 USA; Brazil; Peru...	ja/jpn	Sino-Japanese	[95%FM]	10%
[45] isolated East Asia language	KOREAN (Hankukmal)	75m	S.Korea; N.Korea 8 China; Japan; Russian Fed.; USA...	ko/kor	Korean; (Chinese)	[95%FM]	4%
[46] Mon-Khmer	VIETNAMESE (Việt)	75m	Vietnam; Cambodia 8 USA…	vi/vie	L	[90%F ~ 95%M]
[47] Daic / Tai	THAI-LAO (incl. Thai+ Isan+ Lao+ Huang+ Buyi)	90m	Thailand; Laos; Vietnam; China 8 Singapore…	th/tha + lo/lao	Thai; Lao	[55%F ~ 70%M]
[49] "Sanskritised" Dravidian	TAMIL → Malayalam	90m	India; Sri Lanka 8 Malaysia; Singapore; Mauritius; Germany...	ta/tam	Tamil	[65%F ~ 75%M]
	TELUGU	70m	India 8 Malaysia...	te/tel	Telugu	[35%F ~ 60%M]

[5] INDO-EUROPEAN [51] Romance / Latin-related	SPANISH l (Español)	500m	Spain; the Americas; Morocco; Western Sahara; Equatorial Guinea…	es/spa	L	[85%F ~ 90%M]	6%
	→ PORTUGUESE « (Português) → Portuguese-based creoles (Crioulo)	200m	Portugal; Brazil; Cape Verde; Guinea-Bissau; São Tomé; Mozambique; Angola; India (Goa); Macau 8 South Africa; France; Paraguay...	pt/por	L	[85%FM]	3%
	FRENCH « (Français) → French-based creoles (Créole)	135m	France; Belgium; Luxemburg; Switzerland; French Guiana, Antilles & Polynesia; New Caledonia; S.E.Asia; Canada; W. & Central Africa; Djibouti; Madagascar; Lebanon; Indian Ocean islands…	fr/fra	L	[90% FM]	4%
	ITALIAN (Italiano)	70m	Italy; Switzerland 8 USA; Canada; Argentina...	it/ita	L	[95% FM]	3%
[52] "Romanised" Germanic	ENGLISH g → English-based creoles	1000m	(countries in all continents)	en/eng	L	[90% FM]	47%
[52] Germanic	GERMAN « (Deutsch) → Nederlands (Dutch)	135m	Germany; Austria; Switzerland; Belgium; Lux.; France 8 Canada; USA; Romania; Russian Fed.; Kazakhstan; Brazil; Argentina; Namibia…	de/deu	L	[95% FM]	6%
[53] Slavonic	RUSSIAN-BELARUSIAN « (including Russkiy + Belarusskaya)	320m	Russia; Belarus; Ukraine; Moldova; Baltic states; Caucasus; Central Asia 8 Israel; Germany; USA	ru/rus + be/bel	Cyrillic	[95% FM]	3%
	→ UKRAINIAN (Ukrainska) & other Slavonic languages	45m	Ukraine; Belarus; Russian Fed.; Moldova; Poland; Hungary; Caucasus; Central Asia 8 Canada; USA...	uk/ukr	Cyrillic	[95% FM]
[58] Iranic / Iranian	PERSIAN-TAJIK (incl. Farsi + Dari + Tajiki)	60m	Iran; Turkey; Caucasus; Saudi Arabia; Iraq; Afghanistan; Tajikistan 8 USA; Germany…	fa/fas + tg/tgk	A	[45%F ~ 70%M]
[59] Indic / Sanskrit-related	HINDI-URDU l (incl. Urdu + Hindi + Braj + Awadhi + Bhojpuri + Maithili, etc, incl. former Hindustani)	900m	India; Pakistan; Bangladesh, Nepal 8 Fiji; Mauritius; S.Africa; Uganda; UK; Caribbean...	hi/hin + ur/urd	Devanagari; A	[35%F ~ 60%M]
	→ PANJABI (including Panjabi "East" + "West")	85m	India; Pakistan 8 UK...	pa/pan	Gurmukhi; A	[45%F ~ 65%M]
	BENGALI « (Bangla + Sylhetti) → Assamese (Axamiya) & Oriya	250m	Bangladesh; India 8 UK…	bn/ben	Bengali	[35%F ~ 60%M]
	MARATHI	80m	India	mr/mar	Devanagari	[45%F ~ 65%M]
[7] SINO-INDIAN (Sino-Tibetan) [79] Sinitic / "Wider" Chinese	CHINESE Putonghua g ("Mandarin")	1000m	China; Taiwan 8 Vietnam; Thailand; Singapore, Malaysia	zh/zho	Chinese	[75%F ~ 90%M]	8%
	→ WU & Xiang, Gan, Hakka, Min-nan, etc.	85m	China	Chinese languages (Han-yu) share a common written form
	→ CANTONESE (Yue)	70m	China, Vietnam 8 Malaysia; Singapore; Indonesia; USA...
[9] TRANSAFRICAN (Atlantic-Congo) [99] "Arabicised" Bantu	SWAHILI (Kiswahili) → other Bantu "Inner-East" languages	90m	Tanzania; Kenya; Uganda; Rwanda; Burundi; Congo Dem.Rep.; Somalia; Comoro Islands	sw/swa	L; (A)	[55%F ~ 75%M]

The Linguasphere Register classifies and annotates 13,840 "inner languages" (plus dialects) within 4,994 "outer languages" and 694 "sets" of languages.

Over half the world's modern languages are classified according to linguistic affinities into one of 5 major phylosectors or "families", numbered [1], [3], [5], [7], [9]. All other smaller groupings of languages, or isolated languages, are classified within 5 major geosectors or continental areas, numbered [0], [2], [4], [6], [8]. All five phylosectors and one geosector – [4] Eurasia – are represented by arterial languages in the above table. Each of the 10 phylosectors or geosectors is subdivided, on linguistic and/or geographical grounds, into 10 zones of reference, numbered [00] to [99]. A simple 2-digit tag, indicating a zone, thus serves to locate any name of any language or dialect or speech community within the linguasphere (over 71,000 such names being recorded, classified, coded and indexed in the Linguasphere Register).

© Linguasphere Observatory, Hebron SA34 0XT, Wales /observatory@linguasphere.net/ tel. [+44] 1994 419.660 (fax 419.300)

Appendix 4: Global Language Index (sample extract)

See table on following pages, alongside notes on columns A to J below.

The following table shows entries for language names beginning ga to gaf-. The Global Language Index will be based on the existing Linguasphere Index of over 71,000 linguistic and ethnolinguistic names (see Linguasphere Register Vol.1), which is still expanding. ISO 639 (columns D and E) will be an essential part of the Global Language Index, and it is proposed that the progressive expansion of ISO 639-2, to cover all languages and language names in the Index, should be undertaken with the guidance of TC 37 and of the established Maintenance Agencies for ISO 639 and 639-2.

It is NOT a proposal of the UK that the international Linguasphere Observatory should itself become a Maintenance Agency, but rather that it should confer and collaborate closely with the Registration Authorities already established for ISO 639-1 (Infoterm in Vienna) and for ISO 639-2 (Library of Congress).

Columns

A Names of languages and dialects, following the typographic conventions of the Linguasphere Register and Index: Reference Names (usually autonyms or "own names") are in bold type (in contrast to other Alternative Names); Names of Outer Languages have an initial capital; Names of Inner Languages and Dialects are in lower case throughout (Names of Dialects are indented). Umbrella names (groups and families of languages) are excluded from this sample.

B Status of languages, indicating where a language is in national or regional official use, where it is now extinct, or where it has been revived or partially revived.

C Proposed alphanumeric identification code, comprising the ISO 639-1 (column D) or 639-2 (column E) prefixed by the digits of the Linguasphere Referential Framework (first element in column H). Each 2-letter or 3-letter element in the identification code will remain unambiguous on a world scale. The (optionally omitted) 2-digit prefix will serve to provide (a) a wider informational content to the code, (b) a basis for the linguistic and/or geographical sorting of coded items, and (c) an automatic check on the accuracy of the following 3-letter element. No hyphen will be included between the component digits and letters (to maintain a clear distinction from the relationship scale in column H). In the present extract, only a few entries are already covered by ISO 639 and in all other cases the Linguasphere digits have been followed by *** (in anticipation of future ISO alpha codes). In the case of dialects (indented entries) it is proposed that special codes should not normally be allocated, but that the name of the dialect should follow the code of the relevant language (separated by a forward slash).

D ISO 639-1 codes, where these already exist. It is proposed that these be used specifically and exclusively to denote standardised written languages.

E ISO 639-2/T codes (and ISO 639-2/B codes in brackets), where these already exist.

F SIL (Summer Institute of Linguistics) codes, as used in the Ethnologue. It is not proposed that these codes be adopted automatically to fill gaps in the ISO 639 series, without first establishing principles for the selection of letters, including problems of conflict with established ISO 639 codes.

G OpenType language tags, developed for IT use by Adobe and Microsoft.

H Linguasphere codes, as used in the Linguasphere Register. These comprise the 2 digits of the Linguasphere Referential Framework, plus the 5 to 6 layer Relationship Scale. Items defined as languages are all classified in the Linguasphere Register as outer languages or inner languages (subdivided into dialects where appropriate), i.e. represented respectively by a first or second minuscule (subdivided by a third minuscule where appropriate) in the Relationship Scale.

I Demoscale or 10-point demographic scale, as used in the Linguasphere Register. All outer languages are ranked in terms of relative demographic importance by a single digit, representing the order of magnitude of speakers (first or second language in 1999/2000) on a scale ranging from 0 (extinct between 1900 & 1999) through 2 (100+), 3 (1000+), 4 (10,000+), 5 (100,000+), 6 (1,000,000+), 7 (10,000,000+), 8 (100,000,000+) to 9 (over one billion).

J Country or principal countries where spoken.
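
The conventions described for columns C and H can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the proposal: the function names and the regular expression are hypothetical, the pattern covers only single codes of the form seen in the extract (two digits, three majuscules, one to three minuscules), and ranges such as 50-AAA-ad~i are not handled. Following the extract, the placeholder *** stands in for a future ISO alpha code, and a trailing forward slash marks a dialect entry.

```python
import re
from typing import Optional

# Shape of a single column-H code as it appears in the extract, e.g. 96-LAA-a.
H_PATTERN = re.compile(r"([0-9]{2})-([A-Z]{3})-([a-z]{1,3})")

# One, two or three minuscules: outer language, inner language, dialect.
LEVELS = {1: "outer language", 2: "inner language", 3: "dialect"}


def parse_column_h(code: str) -> dict:
    """Split a Linguasphere code into zone digits and relationship-scale layers."""
    match = H_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"unrecognised Linguasphere code: {code!r}")
    zone, majuscules, minuscules = match.groups()
    return {
        "zone": zone,
        "majuscules": majuscules,
        "minuscules": minuscules,
        "level": LEVELS[len(minuscules)],
    }


def column_c(h_code: str, alpha: Optional[str] = None) -> str:
    """Build the column-C entry: zone digits + ISO alpha element (or ***)."""
    parsed = parse_column_h(h_code)
    code = parsed["zone"] + (alpha if alpha else "***")
    if parsed["level"] == "dialect":
        code += "/"  # the dialect name would follow the slash
    return code


print(column_c("96-LAA-a", "gaa"))  # outer language with an ISO 639-2 code
print(column_c("90-BAA-acb"))       # dialect entry without an ISO code
```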

A	B	C	D	E	F	G	H	I	J
Gã	Regional	96gaa		gaa	gac	gad	96-LAA-a	5	Ghana
gã		91***			gna		91-GEB-aa		Burkina Faso
ga / e-ga		96***					96-EAA-aa		Côte d'Ivoire
g//a		08***					08-AAB-cf		Botswana
gaabu		90***/					90-BAA-acb		Guinea-Bissau; Guinea
gaaduwa		02***/					02-BAA-abc		Chad; Niger
g//aakhwe		08***					08-AAB-cf		Botswana
ga'aliyyin		12***/					12-AAC-edd		Sudan
gaalpu		29***			gla		29-AAD-aa		Australia
Gaam (kor-e-gaam)		05***			tbi		05-MBA-a	4	Sudan
Ga'anda (Gaanda)		18***			gaa		18-HBA-a	4	Nigeria
gaandu		18***			gaa		18-HBA-aa		Nigeria
gaangala		99***/					99-AUR-gfb		Congo
gaba		14***			gay		14-DAA-ag		Ethiopia
gabadi		34***					34-GBE-aa		Papua New Guinea
g//abake-ntshori		08***					08-AAB-dh		Botswana
gabalbara	Extinct	29***					29-RAA-bg		Australia
gabalitain		51***/					51-AAA-gbh		France
gabbra		14***/					14-FBA-ahc		Kenya
Gabere (Gaberi)		17***					17-DGB-a	4	Chad
gabi		95***					95-ABA-wc		Côte d'Ivoire
Gabi+Badjala	Extinct	29***					29-QBA-a	0	Australia
Gabiano		26***					26-IAB-b	2	Papua New Guinea
gabi-gabi	Extinct	29***					29-QBA-aa		Australia
gabin		18***					18-HBA-ab		Nigeria
gablai		17***					17-DGC-aa		Chad
gablet		12***					12-ABA-ad		Oman
gabo		95***/					95-ABA-xbb		Côte d'Ivoire
Gabo-Bora		34***					34-FCB-a	2	Papua New Guinea
gabone		34***					34-GBB-ah		Papua New Guinea
gabou		93***					93-ABA-fg		CAR, Congo Dem.Rep.
gabra		14***/					14-FBA-ahc		Kenya
Gabri (Gabri proper)		17***			gab		17-DGB-a	4	Chad
Gabri (pseudo Gabri 1)		17***					17-DGA-a	4	Chad
gabri (pseudo gabri 2)		92***					92-CAA-db		Chad
gabri		58***			gbz		58-AAC-di		Iran
Gabrieleño	Extinct	65***					65-ADB-a	0	USA
gabri-kermani		58***/					58-AAC-dib		Iran
Gabri-Kimre		17***					17-DGA-a	4	Chad
gabu		98***					98-CAB-cd		Nigeria
gabu		93***					93-ABA-fg		CAR, Congo Dem.Rep.
Gabutamon		24***			gav		24-SCB-a	2	Papua New Guinea
gachikolo		59***/					59-AAF-tbe		India
gachitl		42***/					42-BBA-bga		India
Gadaba ("Dravidic" Gadaba)		49***			gau		49-CAB-b	3	India
gadaba ("mundic" gadaba)		46***			gbj		46-CBB-ba		India
gadaba (pseudo gadaba)		46***					46-CBB-ac		India
gadabursi		14***					14-GAG-ac		Somalia; Ethiopia
gadaigan		29***					29-RHA-ba		Australia
gadaiasu		34***					34-FIA-ja		Papua New Guinea
gadala		18***					18-FAB-ac		Cameroon
Ga'dang (Gadang)		31***			gdg		31-CCC-a	3	Philippines
Gadang		17***			gdk		17-DCA-a	3	Chad
Gadang	Extinct	29***					29-MGB-a	0	Australia
gadba		49***			gbj		49-CBB-ba		India
Gaddang		31***			gad		31-CCC-b	4	Philippines
gaddi (gaddi-chamba)		59***			gbk		59-AAF-ei		India
Gade		98***			ged		98-BBA-a	5	Nigeria
gade		19***					19-HAA-aa		Nigeria
gade-lohar		59***/			gda		59-AAF-gra		India
gadhang		29***					29-MGB-aa		Australia
gadhavali		59***					59-AAF-cb		India
gadhwali (gadhawala)		59***					59-AAF-cb		India
Gadhwali+Kumauni		59***					59-AAF-c	6	India
gadi		59***					59-AAF-ei		India
gadi \ churahi		59***/					59-AAF-eia	6	India
gadi \ bhateali		59***					59-AAF-ej		India
gadi		59***					59-ABC-aa		Sri Lanka
gadi (tsi-gadi)		98***/					98-HAC-bc		Nigeria
gadio		24***					24-LDA-ec		Papua New Guinea
gadiwa		02***/					02-BAA-abc		Chad; Niger
Gadjerawang (Gadjerong)		28***			gdh		28-DAB-b	1	Australia
gadjibamu		29***					29-SAA-aa		Australia
gadjnjamada		29***					29-BHA-ba		Australia
Ga-dre		96***					96-GAA-a	4	Ghana; Togo
gadsup		24***			gaj		24-PAC-aa		Papua New Guinea
Gadsup+Ontena		24***					24-PAC-aa	4	Papua New Guinea
gadu		75***					75-BAA-aa		Burma; China; Laos
gadua		02***/					02-BAA-abc		Chad; Niger
gaduliya-lohari		59***/					59-AAF-gra		India
gaduwa		02***/					02-BAA-abc		Chad; Niger
Gaduwa		18***			gdw		18-EAD-e	3	Cameroon
gadwahi		59***					59-AAF-cb		India
gadyaga		00***					00-BAA-aa		Mauritania; Senegal; Mali; France etc.
gadyali		59***					59-AAF-ei		India
gadyri		42***					42-BBA-bg		Russia
gae		83***					83-HAB-ac		Peru
gaeilge	National	50gle	ga	gle (iri)	gli	iri	50-AAA-ad~i		Ireland
gaeilge-C. (central irish gaelic)		50***					50-AAA-ah		Ireland
gaeilge-F. (formal irish gaelic)	National	50ga	ga	gle (iri)	gli	iri	50-AAA-ae		Ireland [Off.] ; UK (N.Ireland)
gaeilge-L. (old common gaelic)		50***					50-AAA-ad		Ireland
gaeilge-N. (north irish gaelic)		50***					50-AAA-ag		Ireland; UK (N.Ireland)
gaeilge-NE. (northeast irish g.)		50***					50-AAA-af		Ireland; UK (N.Ireland)
gaeilge-S. (south irish gaelic)		50***					50-AAA-ai		Ireland
Gaeilge+Gàidhlig (Gaelic)		50***					50-AAA-a	6	Ireland; UK (Scotland; Man)
gaejawa		19***					19-ECA-aa		Nigeria
gaelg (manx gaelic)	Revived	50***					50-AAA-aj	?	UK (Man)
gaeli	Extinct	32***					32-EBA-a	0	Indonesia (Maluku)
gaelic \ irish	National	50ga	ga	gle (iri)	gli	iri	50-AAA-ad~i		Ireland
gaelic \ manx	Extinct	50***			mjd		50-AAA-aj		UK (Man)
gaelic \ scottish	Regional	50gd	gd	gla (gae)	gls	gae	50-AAA-aa~c		UK (Scotland)
gaelic-shelta		50***					50-ACA-ab		UK (Scotland)
Gafat / Gafatinya	Extinct	12***					12-ACD-a	0	Ethiopia
Gafe		96***					96-MAA-a		Togo; Ghana
gafsa		12***/					12-AAC-ddc		Tunisia
gafuku		24***					24-OBA-ba		Papua New Guinea
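
The Demoscale digits shown in column I follow the order-of-magnitude rule given in the note on that column. A minimal Python sketch is given below. The function name is hypothetical, and two points are assumptions rather than statements of the paper: the value 1 for languages with fewer than 100 speakers (the note lists 0 and then 2 to 9 only), and the use of a speaker count of zero to stand for "extinct".

```python
def demoscale(speakers: int) -> int:
    """Return the single-digit order of magnitude of a speaker count."""
    if speakers < 0:
        raise ValueError("speaker count cannot be negative")
    if speakers == 0:
        return 0  # extinct
    if speakers < 100:
        return 1  # assumption: this step is not spelled out in the paper
    # 100+ -> 2, 1000+ -> 3, ... capped at 9 (one billion and above).
    return min(len(str(speakers)) - 1, 9)


for count in (0, 150, 600_000, 1_000_000, 2_000_000_000):
    print(count, demoscale(count))
```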

The ISO Alpha-2 and Alpha-3 Codes for the Representation of Names of Languages (ISO 639) are complementary in purpose and form to the Numeric-2 Code employed for the Linguasphere Referential Framework (LRF) of the world's languages.

ISO 639 provides 2-letter or 3-letter tags (or "standardised abbreviations") for the identification of specific languages and groups of languages, whereas the LRF provides 2-digit tags for a referential inventory of the world's languages within 10 sectors (1st digit) and 100 zones (2nd digit).

4.2.1 ISO 639 Codes

4.2.2 Linguasphere Codes

The fully coded Linguasphere classification and annotation of the world's languages is already available with limited access online (http://www.linguasphere.net) and is to be made freely accessible as a public resource within the next year (see section 5 below).

4.3 Formulation of the Proposal

Constable and Simons (page 15) recognise the Linguasphere Register as "the only likely candidate" as an alternative to their proposed SIL system.

4.4.1 Advantages of the proposed Global Identification Code

In comparison with the existing ISO or SIL tags, the combined LRF/ISO tag would be:

· more transparent, with the initial digit indicating one of five major affinities or one of five continental areas (e.g. 5 = Indo-European or 8 = native South American: see Appendix 2 to this paper);

· more easily located, with the 2 digits indicating a linguistic group or area (e.g. 53 = Slavic or 87 = Amazon: see Appendix 2 to this paper);

· more readily classifiable, together with the names of other related and/or adjacent languages, either within the same sector (first digit in common) or zone (both digits in common);

· better protected against typographical error in the citation of tags (each alpha component being tied to a specific numeric component).

At the same time, the continued existence of a single series of unambiguous ISO 2-letter and 3-letter codes to identify the languages of the world would mean that the combined LRF/ISO tags could be abbreviated for practical purposes by the optional omission of the LRF numerical prefix.

· In such abbreviated usage, the invisible LRF code would still underlie any IT usage of the 2-letter or 3-letter components.

· The LRF numerical prefix would thus be available not only to classify language codes as required but, very importantly, to serve as a check against typographical error. A mistyped 2-letter or 3-letter code would have only a 1% chance of matching the correct numerical prefix.
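
The error check described above can be sketched in Python as follows. The lookup table and function name are hypothetical, and the table holds only a few entries copied from Appendix 1; a real implementation would draw on the full register. The 1% figure quoted above would hold only if a mistyped element produced another valid code drawn evenly from all 100 zones, which is a simplifying assumption.

```python
# Zone registered for each alpha element (sample entries from Appendix 1).
ZONE_OF = {
    "ab": "42", "abk": "42",   # Abkhazian
    "eu": "40", "eus": "40",   # Basque
    "fr": "51", "fra": "51",   # French
    "bg": "53", "bul": "53",   # Bulgarian
    "de": "52", "deu": "52",   # German
}


def check_tag(tag: str) -> bool:
    """Accept a tag only if its alpha element is known and any digits match.

    Abbreviated tags (alpha element only) are accepted, since the prefix
    is optional; a wrong prefix exposes a likely typing error.
    """
    prefix = tag[:2] if tag[:2].isdigit() else ""
    alpha = tag[len(prefix):]
    zone = ZONE_OF.get(alpha)
    if zone is None:
        return False
    return prefix in ("", zone)


print(check_tag("51fra"))  # full tag, prefix agrees with the register
print(check_tag("fra"))    # abbreviated tag, prefix omitted
print(check_tag("51deu"))  # alpha element mistyped: zone 52 expected
```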

Use of combined LRF/ISO tags would also open the way to a more structured approach to the classification of "language names". At present, tags with an identical form of 3 letters may be used to indicate a wide variety of different categories of language name.

In contrast, the use of combined LRF/ISO tags could be associated with the reconsolidation and extension of the 3 existing lists of ISO tags, to create a single, more coherent and explicit system.

· In an increasingly internationalised world, it is appropriate that alpha codes for specific languages should be based wherever possible on the autoglossonym (or indigenous form of the language name) rather than on the English name (where this is different).

· In this respect, where ISO 639-2/B diverges from 639-2/T, the 2/T code is generally to be preferred (e.g. /eus/ in preference to /baq/ for Basque, for which the autoglossonym is Euskara). It may be noted that the Linguasphere Register also gives precedence to autoglossonyms.

· The most frequent use of tags to represent language names in IT is for the identification of specific standard written languages.

→ For practical purposes, a standard written language may be defined as a language whose form is largely fixed by means of a system of graphic conventions, established and exemplified by the publication of a large corpus of texts (normally in thousands or more).

→ A standard written language may also have a spoken form, modelled largely on the use of the written form, in the same way that a spoken language may also have a written form, transcribing actual speech.

· The potential number of Alpha-2 codes (26 x 26 = 676) is adequate to retain the existing ISO 639-1 codes for standard written languages, regardless of the prefixed digits.

· If the linguasphere were one day to include more than 676 such languages, then it would be possible to duplicate some alpha codes under different digits.

· A more stable solution, however, would be to limit the use of Alpha-2 codes to a closed list of all those written languages standardised before the end of the 20th century.

· The alphanumeric form of LRF/ISO tags would be readily identifiable as language codes within other text, as opposed to the potential confusion of some existing Alpha-3 tags with real words (e.g. /bug/ for Buginese or /got/ for Gothic).
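
The last point, together with the Alpha-2 capacity noted above, can be illustrated with a short sketch. The written form of the tag (two digits, a hyphen, then the alpha code) and the name find_tags are assumptions made for illustration only.

```python
import re
import string

# Number of possible Alpha-2 codes over the 26 basic Latin letters.
ALPHA2_CAPACITY = len(string.ascii_lowercase) ** 2

# Assumed surface form (hypothetical): two digits, a hyphen, 2 or 3 lowercase letters,
# not embedded in a longer run of letters, digits or hyphens.
TAG_IN_TEXT = re.compile(r"(?<![0-9A-Za-z-])[0-9]{2}-[a-z]{2,3}(?![0-9A-Za-z-])")


def find_tags(text: str) -> list:
    """Return every combined tag found in running text, in order of appearance."""
    return TAG_IN_TEXT.findall(text)
```

Because every match must begin with the two-digit prefix, ordinary words such as "bug" or "got" are never picked up, whereas a bare Alpha-3 scan could not tell them apart from language codes.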

In contrast, LRF/ISO tags applied to "other types" of language name (non-standardised languages, fuzzy continua or referential boxes) could be based on the Alpha-3 codes of ISO 639-2/T.

Linguasphere Observatory and British Standards Institution August 2001

Bibliography

Baker, Philip & Eversley, John, Multilingual Capital: the languages of London's schoolchildren, Battlebridge Press: London, 2000

Constable, Peter and Simons, Gary (SIL), Language Identification and IT: Addressing problems of linguistic diversity on a global scale, presented to the 17th International Unicode Conference, San Jose (California), September 2000.

Grimes, Barbara F. (editor), Ethnologue: Languages of the World (14th ed.), SIL: Dallas, 2000

ISO 639-2, Codes for the representation of names of languages: Part 2, Alpha-3 code, ISO, 1998

Linguasphere Register of the World's Languages and Speech Communities (2 volumes), Linguasphere Press: Hebron (Wales), 2000

Appendix 1: ISO 639 Codes for the Representation of Language Names correlated with the Linguasphere Referential Framework of 100 Zones

ISO 639-1 is an Alpha-2 code
ISO 639-2/T & /B are Alpha-3 (/T = terminology code; /B = bibliographic code)
The Linguasphere Referential Framework (LRF) is a Numeric-2 code (see Appendix 2)

(extract) A-H

as arranged alphabetically by English name of language

The proposed Identification Code will comprise the LRF + ISO 639-1 or 639-2/T elements

Appendix 2: Linguasphere Referential Framework of 10 Sectors & 100 Zones

WIDER AFFINITIES = [single odd digit] see note below
[two digits] = Linguasphere zones

ARTERIAL LANGUAGES reaching over 60 million hearers = over 1% of humankind (further chart symbols mark over 2%, over 8% and over 16%)

ISO 639 provides 2-letter or 3-letter tags (or "standardised abbreviations") for the identification of specific languages and groups of languages, whereas the LRF provides 2-digit tags for a referential inventory of the world's languages within 10 sectors (1st digit) and 100 zones (2nd digit).
