ISO/TC37/SC2/WG1 N77
 
Date of presentation 2001-08-13
Proposer BSI
 
 
Draft technical report:
 
Development and Application of ISO 639
in the identification, classification and
alphanumeric coding of the
world's languages
 
 

Contents

1  Introductory Note
2  Clarification of Terms and Categories
3  The Global Context
4  The Proposal
5  Towards a Global Public Resource

Bibliography

Appendix 1  ISO 639 Codes correlated with the Linguasphere Referential Framework (extract A-H)
Appendix 2  Linguasphere Referential Framework of 10 Sectors and 100 Zones
Appendix 3  Chart of the World's Arterial Languages (printed as 2 pages in landscape view)
Appendix 4  Global Language Index (sample extract)

 

1  Introductory Note

There is an established need for a standardised system of codes for the tagging and identification of the world's languages.  Variation still exists, however, in the form of language codes used by different organisations and in different countries.  The ISO 639 codes provide the base for standardisation in this field, although they at present cover only a small proportion of the world's languages.  These ISO language codes also exist in 3 different versions, the ISO 639-1 two-letter code, and the ISO 639-2/T and 639-2/B three-letter codes (as designed for terminological and bibliographical use, respectively).

A fully classified inventory of the world's languages and speech communities was published in 1999/2000, including a coded index of over 71,000 names (Linguasphere Register of the World's Languages and Speech Communities, see Bibliography).

The following proposal outlines how the 3 versions of the ISO 639 codes may be unified as a single standard, and how the formal linking of this standard with the Linguasphere zones of reference would create an alphanumeric Global Identification Code (GIC) with increased informational content and inbuilt protection from error.

2  Clarification of Terms and Categories

2.1 Classification codes, identification codes and referential codes

A clear distinction needs to be maintained among 3 forms of language code:

2.1.1  modifiable classification codes or "relationship scale", recording proximities of interrelationship among languages but subject to modification as research progresses;

2.1.2  fixed identification codes or "language tags", enabling individual languages to be identified without ambiguity; and

2.1.3  a stable "referential framework", providing a meeting-point for the correlation of classification codes and identification codes (as in the proposed Global Identification Code).

2.2  Language names and umbrella names

A clear distinction needs to be maintained between:

2.2.1  language and dialect names as applied to individual spoken and/or written varieties of language; and

2.2.2  umbrella names, often artificially created, covering groups or families of related languages (the treatment of which has not been presented in the following pages, for reasons of time and space).

2.3  Languages and dialects

The continuum which frequently exists among adjacent forms of speech means that it has always been difficult to define the boundary between usage of the terms "language" and "dialect".  The situation is eased by recognising that many languages are better analysed and distinguished in terms of three (rather than two) layers of immediate relationship.  These layers are best explained by reference to specific examples:

2.3.1  outer language, as applied, for example, to the totality of the Welsh language in all its spoken and written forms;

2.3.2  inner language, as applied to the 3 major components of the modern Welsh (outer) language: literary Welsh (as written, and progressively standardised, in recent centuries); northern spoken Welsh (in north Wales); and southern spoken Welsh (in south Wales);

2.3.3  dialect, as applied to distinct varieties of written Welsh (e.g. Bible or "pulpit" Welsh) or to local varieties of northern or southern spoken Welsh (e.g. Anglesey Welsh in the north, or Pembrokeshire Welsh in the south).

2.4  Spoken languages and standard written languages

It is of great importance that a clear distinction be maintained between:

2.4.1  spoken languages and their dialects, which may also be written (in dialect literature, or in phonetic transcriptions, for example); and

2.4.2  standard(ised) written languages, which have acquired a status independent of the spoken word but which may themselves be spoken (in speech which is modelled on the written tradition of a language).  Part of the present proposal is that the 2-letter codes of ISO 639-1 should be formally recognised as designating the relevant standard written languages (e.g. en for Standard English), in contrast to the general coverage of the 3-letter codes of ISO 639-2 (e.g. eng for English in any or all its forms).


3  The Global Context

 

The objective of clearly identifying all the languages and speech communities of humankind, regardless of their demographic size, is today clear, attainable and of global importance. 

3.1  The Twenty-first century perspective

At the onset of the twenty-first century, humankind is aware of itself as a single planetary community, with means of instant global communication and of increasing global planning and coordination.  Languages are the key to that communication and coordination. 

For the first time, the languages of the world may be viewed as integral parts of humankind's greatest and most fundamental creation, the continuous global web or "linguasphere" of human speech and writing. 

Languages no longer need to be listed and catalogued as a vast array of independent objects, belonging to rival and often warring communities.  They can now be viewed and classified as integral parts of a collective human heritage.

The classification of languages has until now been the preserve of erudite specialists, often tracking down words of ancient languages in the pursuit of evidence about the human past. 

Today, however, the classification of modern languages has a direct relevance to the way humankind perceives and organises itself as a single global and multilingual community.  Individual languages can now be perceived, not as the individual creation and property of specific communities, but as mutable and interrelating subsystems within a vast global kaleidoscope of words, grammatical rules, speech sounds and elements of writing.

All languages have benefited or may potentially benefit from the modern communications revolution in two fundamental ways:

·         The recording and global transmission of the spoken word allow any spoken language to share the advantages previously reserved to written languages, enabling even small speech communities to maintain worldwide spoken contact.

·         The instant transmission and exchange of the written word allow any written language to share the advantages previously reserved to speech, encouraging even children to use writing (instant messages by phone and computer, and e-mail) as an integral part of their social life.

3.2  The Need for the Identification of Languages within a Referential Framework

Any system of linguistic classification needs to contain an element of fluidity, in order to deal not only with the fundamental nature of the linguasphere but also with a still expanding knowledge of its complexity.

At the same time, it is necessary that the identifiable written and spoken languages of human communities be clearly and unambiguously catalogued and identified, from the international use of English or French to the unique speech of an isolated village in central Africa.

It is important to be aware of this contrast between (a) the need for fluidity in establishing and updating a sliding scale of linguistic interrelationships, and (b) the need for stability in identifying the individual spoken and recorded languages of humankind.

The primary objective in this field is therefore to complete a standardised international system of identification codes for the unambiguous tagging of all known forms of spoken and written languages, alive or recorded from the past, and for the correlation of those fixed tags to a separate scale of linguistic interrelationships.

 

4  The Proposal

 

4.1  The Institutional Background

The first comprehensive coded and classified inventory of the languages and speech communities of humankind during the 20th century was completed in December 1999 and published in Wales in 2000. 

This Linguasphere Register of the World's Languages and Speech Communities provides a referential framework for the location and classification of over 22,000 identifiable varieties of speech and writing.  The Linguasphere Register is supported by a unique and expandable Index of over 71,000 linguistic and ethnolinguistic names, each classified and coded within the referential framework, using a comprehensive scale of linguistic interrelationships.

The agency responsible for compiling and maintaining the Register is the Linguasphere Observatory (www.linguasphere.org), a transnational research network devoted to the study and maintenance of multilingualism.  Conceived in Canada in 1983, the Observatory was established in France during the 1980's.  During the 1990's, it worked in close collaboration with the University of London's School of Oriental and African Studies, and it has been directed from bilingual Wales since 1995, with scientific support from Russia, India and the United States.  See further details in section 5.3 of this paper.

In July 2001, the BSI (British Standards Institution) requested the Linguasphere Observatory to make a firm proposal for the establishment of a standardised alphanumeric coding system covering all the world's languages, based on existing and future codes of ISO 639 and correlated with the referential framework and relationship scale of the Linguasphere Register.

4.2  The Technical Background

The ISO Alpha-2 and Alpha-3 Codes for the Representation of Names of Languages (ISO 639) are complementary in purpose and form to the Numeric-2 Code employed for the Linguasphere Referential Framework (LRF) of the world's languages. 

ISO 639 provides 2-letter or 3-letter tags (or "standardised abbreviations") for the identification of specific languages and groups of languages, whereas the LRF provides 2-digit tags for a referential inventory of the world's languages within 10 sectors (1st digit) and 100 zones (2nd  digit).

4.2.1  ISO 639 Codes

The ISO 639 codes for a range of the most commonly encountered names of languages (and groups of languages) are presented in the International Organisation for Standardisation's Code for the representation of names of languages (1998), and are available online at /http://lcweb.loc.gov/standards/iso639-2/langhome.html/.

ISO 639-2, originally devised for use in library systems, now exists in slightly divergent forms, known as ISO 639-2/T (terminology) and ISO 639-2/B (bibliographic).  Although the 3-letter codes of ISO 639-2 could provide codes for 26x26x26 languages, limits specified in the standard currently restrict the creation of new codes to languages with a substantial body of literature.  If rigorously applied, this restriction limits the more generalised use of the ISO 639 codes, particularly in ICT usage.
As a result, some ICT users, including ministries and official agencies, have either made use of the SIL (Summer Institute of Linguistics) codes, or have developed their own coding systems, notably the OpenType specifications (OT) used in font and rendering technologies.  Such variant codes have been developed in certain countries, including the UK, Sweden and Germany, and in some cases have caused clashes in bibliographic information interchange.
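
The theoretical capacity of the alpha codes referred to above can be checked with simple arithmetic.  The following Python sketch is illustrative only: the variable names are hypothetical and form no part of ISO 639, and it assumes the 26 basic Latin letters a-z.

```python
# Illustrative arithmetic only; the names below are hypothetical.
LETTERS = 26  # basic Latin letters a-z (assumption)

alpha2_capacity = LETTERS ** 2  # possible 2-letter codes (26 x 26)
alpha3_capacity = LETTERS ** 3  # possible 3-letter codes (26 x 26 x 26)

print(alpha2_capacity)  # 676
print(alpha3_capacity)  # 17576
```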

4.2.2  Linguasphere Codes

The Linguasphere 2-digit code is defined for over 22,000 modern languages and dialects, and for their historical forms where relevant, in the Linguasphere Register of the World's Languages and Speech Communities (published in 2 volumes by Linguasphere Press, Hebron, Wales 2000).   The Linguasphere code (digital Reference Framework, plus alpha Relationship Scale) is discussed and exemplified on the Linguasphere Observatory website (http://www.linguasphere.org). 

The fully coded Linguasphere classification and annotation of the world's languages is already available with limited access online (http://www.linguasphere.net) and is to be made freely accessible as a public resource within the next year (see section 5 below). 

Appendix 1 to this paper displays the 2-digit tags of the Linguasphere Referential Framework as an additional column in the listing of ISO 639, as exemplified by the letters A-H in an alphabetical arrangement by English names of languages.  Appendix 2 lists and explains the Linguasphere numerical tags, with their linguistic and/or geographical applications.  Appendix 3 provides a table of the world's arterial languages (each reaching over 1% of the world's total population), with the relevant ISO 639 and Linguasphere codes.  Appendix 4 presents an extract from the proposed Global Language Index, which is available as a starting point for the systematic extension of ISO 639 codes.

4.3  Formulation of the Proposal

It is proposed by the British TS/1 Committee and the British Standards Institution that a standard Global Identification Code for all known languages and speech communities be established, by the prefixing of the 2 digits of the Linguasphere Referential Framework (LRF) to the 2 or 3 letters of ISO 639 codes and by the extension of those codes to cover all spoken and recorded languages.

The purpose of this proposal is NOT to create yet another method of coding, but to enable existing ISO (TC/37) standards to work more efficiently and accurately, and to be expanded systematically to cover all languages and speech communities.  The following pages outline how the 3 versions of the ISO 639 codes may be unified within this single standard, and how the formal linking of the ISO codes with the Linguasphere zones of reference would create an alphanumeric Global Identification Code (GIC) with compact informational content and inbuilt protection from error.

4.4  Practical considerations of the present proposal

Some of the problems hitherto associated with the codes of ISO 639, and with language identification in general, were discussed by Peter Constable and Gary Simons of SIL International in their paper Language Identification and IT: Addressing problems of linguistic diversity on a global scale, presented to the 17th International Unicode Conference (San Jose, California, September 2000).

Their paper proposes the extension of the present Alpha-3 system to cover all known varieties of written and spoken languages in the world, following the example established in the SIL's Ethnologue (14th edition, 2000).   This proposal allows for the establishment of thousands of 3-letter codes to represent language names, not necessarily related in form to those names, but fails to address some of the fundamental problems of isolated 3-letter codes.

Constable and Simons (page 15) recognise the Linguasphere Register as "the only likely candidate" as an alternative to their proposed SIL system.


4.4.1  Advantages of the proposed Global Identification Code

The prefixing of the digits of the relevant LRF numeric code to a 2-letter or 3-letter form of an ISO 639 tag for a specific language would create a combined alphanumeric LRF/ISO tag or Global Identification Code.  This combined tag would assist in solving several existing problems of identification and referential classification.

In comparison with the existing ISO or SIL tags, the combined LRF/ISO tag would be:

·         more transparent, with the initial digit indicating one of five major affinities or one of five continental areas (e.g. 5 = Indo-European or 8 = native South American: see Appendix 2 to this paper);

·         more easily located, with the 2 digits indicating a linguistic group or area (e.g. 53 = Slavic or 87 = Amazon: see Appendix 2 to this paper);

·         more readily classifiable, together with the names of other related and/or adjacent languages, either within the same sector (first digit in common) or zone (both digits in common); and

·         better protected against typographical error in the citation of tags (each alpha component being tied to a specific numeric component).
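
As an illustration of the transparency described above, the following Python sketch splits a combined tag into its sector, zone and alpha components.  It is a sketch under stated assumptions rather than a definitive implementation: the names GlobalIdCode, parse_gic, same_sector and same_zone are hypothetical, and the sample tags are built from the zone digits and ISO codes listed in Appendix 1.

```python
import re
from typing import NamedTuple


class GlobalIdCode(NamedTuple):
    """Hypothetical container for the parts of a combined LRF/ISO tag."""
    sector: int  # first LRF digit (0-9)
    zone: str    # both LRF digits, kept as text so that "04" keeps its zero
    alpha: str   # ISO 639 alpha-2 or alpha-3 component


# Two digits followed by two or three lower-case letters (assumed tag shape).
_TAG = re.compile(r"([0-9])([0-9])([a-z]{2,3})")


def parse_gic(tag: str) -> GlobalIdCode:
    """Split a tag such as '55sq' into sector, zone and alpha component."""
    match = _TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a combined LRF/ISO tag: {tag!r}")
    first, second, alpha = match.groups()
    return GlobalIdCode(int(first), first + second, alpha)


def same_sector(a: str, b: str) -> bool:
    """True if both tags share their first digit."""
    return parse_gic(a).sector == parse_gic(b).sector


def same_zone(a: str, b: str) -> bool:
    """True if both tags share both digits."""
    return parse_gic(a).zone == parse_gic(b).zone


print(parse_gic("55sq"))             # GlobalIdCode(sector=5, zone='55', alpha='sq')
print(same_sector("55sq", "52afr"))  # True: Albanian and Afrikaans share sector 5
print(same_zone("55sq", "52afr"))    # False: zones 55 and 52 differ
```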

At the same time, the continued existence of a single series of unambiguous ISO 2-letter and 3-letter codes to identify the languages of the world would mean that the combined LRF/ISO tags could be abbreviated for practical purposes by the optional omission of the LRF numerical prefix. 

·         In such abbreviated usage, the invisible LRF code would still underlie any IT usage of the 2-letter or 3-letter components. 

·         The LRF numerical prefix would thus be available not only to classify language codes as required but, very importantly, to serve as a check against typographical error.  A mistyped 2-letter or 3-letter code would have only a 1% chance of matching the correct numerical prefix.
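
The error check and the optional abbreviation described above may be sketched as follows in Python.  The lookup table and the function names (ZONE_OF, is_consistent, expand) are hypothetical and serve only as an illustration; the zone digits are those given in Appendix 1.  The 1% figure corresponds to 1 zone in 100, on the simplifying assumption that a mistyped code is equally likely to belong to any zone.

```python
# Hypothetical lookup table: ISO 639 alpha code -> LRF zone digits.
# Values are taken from Appendix 1; the table itself is an illustration,
# not part of any published standard.
ZONE_OF = {
    "aa": "14", "aar": "14",  # Afar
    "af": "52", "afr": "52",  # Afrikaans
    "am": "12", "amh": "12",  # Amharic
    "sq": "55", "sqi": "55",  # Albanian
}


def is_consistent(tag: str) -> bool:
    """True if the alpha component belongs to the zone given by the prefix."""
    zone, alpha = tag[:2], tag[2:]
    return ZONE_OF.get(alpha) == zone


def expand(alpha: str) -> str:
    """Restore the optional LRF prefix to an abbreviated (alpha-only) tag."""
    return ZONE_OF[alpha] + alpha


print(is_consistent("52afr"))  # True
print(is_consistent("52aar"))  # False: "aar" exists, but belongs to zone 14
print(expand("sq"))            # 55sq
```

In this sketch a slip from /afr/ to /aar/ produces a code that is valid in isolation, but the mismatch with the prefix 52 exposes the error.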

Use of combined LRF/ISO tags would also open the way to a more structured approach to the classification of "language names".  At present, an identical form of 3 letters may be used to indicate a wide variety of different categories of language name.

·         The existing ISO Alpha-3 codes may indicate either the name of a specific standardised language or of a wider "language" composed of two or more closely related spoken and/or written languages (e.g. the Ashkharik and Arewmta varieties of Armenian, or the Gheg and Tosk varieties of Albanian), or an historical and/or liturgical language (e.g. Church Slavonic), or a grouping of languages of undetermined dimension or nature (e.g. Athapascan languages, or "other" Austronesian, or "other" Creoles and pidgins).  See examples in Appendix 1.

In contrast, the use of combined LRF/ISO tags could be associated with the reconsolidation and extension of the 3 existing lists of ISO tags, to create a single, more coherent and explicit system.

·         In an increasingly internationalised world, it is appropriate that alpha codes for specific languages should be based wherever possible on the autoglossonym (or indigenous form of the language name) rather than on the English name (where this is different). 

·         In this respect, where ISO 639-2/B diverges from 639-2/T, the 2/T code is generally to be preferred (e.g. /eus/ in preference to /baq/ for Basque, for which the autoglossonym is Euskara).  It may be noted that the Linguasphere Register also gives precedence to autoglossonyms.
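
The preference for the 2/T code could be applied mechanically when consolidating the existing lists.  The following Python sketch assumes a hypothetical mapping (B_TO_T) containing only the divergent pairs cited in this paper; the function name prefer_terminology_code is likewise an assumption made for illustration.

```python
# Divergent pairs cited in this paper: ISO 639-2/B code -> ISO 639-2/T code.
# The mapping is deliberately incomplete and purely illustrative.
B_TO_T = {
    "baq": "eus",  # Basque
    "alb": "sqi",  # Albanian
    "arm": "hye",  # Armenian
    "chi": "zho",  # Chinese
}


def prefer_terminology_code(code: str) -> str:
    """Return the 2/T form of a code, leaving non-divergent codes unchanged."""
    return B_TO_T.get(code, code)


print(prefer_terminology_code("baq"))  # eus
print(prefer_terminology_code("abk"))  # abk (identical in 2/T and 2/B)
```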

Most importantly, a distinction should be drawn between the application of LRF/ISO tags to specific objects (i.e. standard written languages) as opposed to "fuzzy" phenomena (i.e. non-standardised languages, or continua of closely related spoken languages or dialects) or to referential boxes created or identified for use in the classification of languages (i.e. language groups, families or categories or areas of languages, including the Linguasphere sectors and zones).

·         The most frequent use of tags to represent language names in IT is for the identification of specific standard written languages.  

→     For practical purposes, a standard written language may be defined as a language whose form is largely fixed by means of a system of graphic conventions, established and exemplified by the publication of a large corpus of texts (normally in thousands or more).

→     A standard written language may also have a spoken form, modelled largely on the use of the written form, in the same way that a spoken language may also have a written form, transcribing actual speech. 

·         Most such languages are already provided for under the ISO 639-1 Alpha-2 code, and it would be helpful if the use of combined LRF/ISO tags of the form Numeric-2 plus Alpha-2 could be specifically confined to the identification of standard written languages (including their standard spoken forms, wherever these are modelled on the written language).

·         The potential number of Alpha-2 codes (26 x 26 = 676) is adequate to retain the existing ISO 639-1 codes for standard written languages, regardless of the prefixed digits. 

·         If the linguasphere were one day to include more than 676 such languages, then it would be possible to duplicate some alpha codes under different digits. 

·         A more stable solution, however, would be to limit the use of Alpha-2 codes to a closed list of all those written languages standardised before the end of the 20th century.

·         This basic LRF/ISO tag "for the representation of the names of standard languages" would be only one character longer than the existing Alpha-3 tags, but would be considerably more systematic and rich in information, and more secure against typographical error.  Cf. 79zh for Standard Chinese (rather than /zho/ or /chi/ for all forms of Chinese) or 55sq for Standard Albanian (rather than /sqi/ or /alb/ for all forms of Albanian). 

·         The initial 7 or 5 locates Chinese and Albanian within Sino-Tibetan or Indo-European, respectively, and it would be useful to produce a list of LRF/ISO tags classified numerically by the LRF digits (alongside the existing ISO lists arranged by alpha code or by names of languages in English or in French).

·         The alphanumeric form of LRF/ISO tags would be readily identifiable as language codes within other text, as opposed to the potential confusion of some existing Alpha-3 tags with real words (e.g. /bug/ for Buginese or /got/ for Gothic).
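
The ease with which alphanumeric tags can be picked out of running text may be illustrated with a simple pattern.  The following Python sketch is an assumption about how such a search might be written (the names TAG_PATTERN and find_tags are hypothetical); it finds candidate tags by shape only, and a real system would also check each candidate against the register.

```python
import re

# Two digits followed by two or three lower-case letters, as a whole word.
TAG_PATTERN = re.compile(r"\b[0-9]{2}[a-z]{2,3}\b")


def find_tags(text: str) -> list[str]:
    """Return all substrings shaped like a combined LRF/ISO tag."""
    return TAG_PATTERN.findall(text)


sample = "The bug report cites 79zh and 55sq but got no reply."
print(find_tags(sample))  # ['79zh', '55sq']
```

The ordinary words "bug" and "got" are ignored because they lack the numeric prefix, whereas a search for bare Alpha-3 codes could not distinguish them from /bug/ or /got/.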

In contrast, LRF/ISO tags applied to "other types" of language name (non-standardised languages, fuzzy continua or referential boxes) could be based on the Alpha-3 codes of ISO 639-2/T. 

·         This distinction between Alpha-2 and Alpha-3 codes would be useful in distinguishing standardised languages within fuzzy continua of spoken languages and dialects, e.g. between the varieties of standardised Norwegian (Bokmål = 52no or Norwegian Nynorsk = 52nn) and "wider" Norwegian in all its forms (= 52nor, which has fuzzy boundaries within the continuum of other spoken forms of Scandinavian languages).
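
Under the proposed convention, the length of the alpha component alone indicates the category of a tag.  The following Python sketch shows one way this might be applied; the function name tag_kind and the wording of the returned labels are hypothetical, and the sketch assumes tags of the shape described in this section (2 digits plus 2 or 3 lower-case letters).

```python
def tag_kind(tag: str) -> str:
    """Classify a combined LRF/ISO tag by the length of its alpha component."""
    prefix, alpha = tag[:2], tag[2:]
    if not (prefix.isdigit() and alpha.isalpha() and alpha.islower()):
        raise ValueError(f"not a combined LRF/ISO tag: {tag!r}")
    if len(alpha) == 2:
        # Numeric-2 plus Alpha-2: reserved for standard written languages.
        return "standard written language"
    if len(alpha) == 3:
        # Numeric-2 plus Alpha-3: all other types of language name.
        return "wider language, continuum or group"
    raise ValueError(f"alpha component must have 2 or 3 letters: {tag!r}")


for tag in ("52no", "52nn", "52nor"):
    print(tag, "->", tag_kind(tag))
```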

There are a number of other more detailed points to be considered in the design of any improved codes for language names (including the treatment of historical languages, for example), which will be dealt with in the fully developed presentation of the proposed LRF/ISO system.   The Linguasphere Observatory looks forward to productive discussions on all aspects of the development of ISO 639, with members of BSI and ISO, and beyond.

 

5  Towards a Global Public Resource

 

5.1  Progress towards a tripartite global reference guide

The present proposal is designed to provide the key element in the production of a fully coded and interactive global reference guide to

·         the languages and speech communities of the world,

·         their established linguistic relationships,

·         their global corpus of linguistic and ethnic names, and

·         their geographic positions and demography.

This global reference guide would take the form of a freely available, independent and multilingual website, comprising 3 interdependent "panoramas".  These would share a common alphanumeric coding system throughout (ISO 639 plus Linguasphere), and would be fully inter-referenced and interaccessible at every point:

5.1.1  the Global Index of the World's Languages and Speech Communities (or ISO 639/ Linguasphere Index), presenting an alphabetical key to the identification and location of all known written and recorded languages and dialects, and all varieties of linguistic, ethnic and communal names.  This panorama, covering a total of over 71,000 names, is already available in a first printed edition (but without ISO codes), as the Index to the Linguasphere Register.   This progressively updated and expanded edition will be opened to free public access and dialogue on the internet within the next year.   An extract from this Index, covering names beginning G-, has been prepared and is now being extended as part of the current proposal.   This will include existing and proposed additional ISO-639 codes.

5.1.2  the Global Register of the World's Languages and Speech Communities, presenting a comprehensive scale of linguistic relationships among the spoken and recorded languages and dialects of the world, and their relevant speech communities.  This panorama is already available in a first printed edition as the Linguasphere Register of the World's Languages and Speech Communities, covering over 22,000 varieties of languages and dialects.  This edition will be opened to free public access and dialogue on the internet within the next year, and will be progressively updated and expanded online.  Extensive extracts are already freely available at /www.linguasphere.org/.


5.1.3  the Global Mapbase of the World's Languages and Speech Communities, presenting a cartographic survey of the location, distribution and interrelationships of the world's languages and speech communities.  This panorama has already been developed by the Linguasphere Observatory for Africa (linguistically the most complex continent in the world), in collaboration with the London School of Oriental and African Studies (SOAS).  It has been printed as the first sheet of the Linguasphere Mapbase of the World's Languages and Speech Communities and is currently being extended into southern Europe and western Asia, in collaboration with the Languages of the World unit of the Russian Academy of Sciences (Akademia Nauk).  The first African layer of this map is viewable at /http://www.soas.ac.uk/Geography/LanguageMapping/home.html/.   This same page on the SOAS website illustrates how subsequent layers of the Linguasphere Mapbase will be accessible by zooming, down to the layer of urban speech communities (as already surveyed and published for over 300 minority languages of London, see Bibliography below).

5.2  Applications of the tripartite reference guide

This three-part electronic reference guide will serve as

·         a transnational reference system and educational resource for teaching covering

→     the global complexity of humankind, as represented by the overlying diversity of its languages and the divergent welfare and cultures of its individual speech communities,

→     the underlying unity of humankind, as represented by a worldwide continuum of multilingual communication and intercommunal identities (the "linguasphere"), and

→     the establishment of comprehensive links with - and annotated signposts towards - a vast range of other electronic sources on the languages, peoples and cultures of the world;

·         a stimulus to innovative teaching and research, including

→     the active investigation and surveying of linguistic and ethnic realities and relationships, including the continuous updating and expansion of the global reference guide itself;

→     the transnational observation and documentation, regardless of frontiers, of

- the actual and relative welfare of all speech communities in the world,

- the movement and migration of speech communities and their members,

- the formation and distribution of minority urban speech communities,

- the incidence of all forms of genocide and other forms of discrimination among ethnolinguistic communities;

→     the awakening of public interest in questions of the transnational and multilingual heritage and origins of communities and individuals (the "languages of our ancestors").  Linked to the growing strength of public interest in genealogical research, this development may be of particular importance in encouraging the development of bilingualism among first-language English-speaking communities (in danger of becoming the only communities deprived of the advantages of bilingualism, in an otherwise multilingual world).


5.3  The Linguasphere Observatory

The present proposals and products are the outcome of many years' research and development at the Linguasphere Observatory in Wales and at its previous location in France as the Observatoire Linguistique.   Created in 1983, after planning and discussion in Quebec (at CIRB, the Centre International pour la Recherche en Bilinguisme at the Université Laval), the Observatory was set up in Normandy as a transnational research network devoted to the study and development of multilingualism (under the honorary presidency of Léopold Sédar Senghor of Senegal, and registered under the French law of association of 1901). 

Among other linguistic activities, the Observatory was responsible for two bilingual exhibitions on languages at the Centre Georges Pompidou in Paris during the 1980's, with substantial support from the Government of Canada.  (These exhibitions subsequently toured internationally, including London, Liège and Lagos, and around the world to Canberra.)  Since 1995, the Observatory has been based in a bilingual area of west Wales, under the directorship of David Dalby, where the Linguasphere Register of the World's Languages and Speech Communities was first published at the turn of the millennium (1999/2000).   Scientific support has been received from Russia, France, India and the United States.

It is appropriate that the present proposals and products should emanate from Wales, a country whose language has successfully resisted and survived the successive invasion of its territory by two of the most powerful languages in the history of the world, Latin and English.  All speech communities now need to consider their relationship to English, as a global lingua franca, and in this respect the indigenous speech community of Wales has the longest experience in the world, having faced the growing strength of its English neighbour for more than one millennium.  The cultural strength and linguistic survival of the Welsh-speaking community offer an important message of encouragement to small speech communities everywhere.  English has a transnational role to play in the world, along with other "arterial" languages, but should be developed in the service of a multilingual global society, NOT as the medium of a monolingual culture.

That the British Standards Institution in London (BSI) and the University of London's School of Oriental and African Studies (SOAS) should have given their support to the proposals and products of the Linguasphere Observatory in Wales is also significant.  At a time when countries around the world are devoting resources to the study of a language associated with England, it is appropriate that major public institutions in that country should devote resources to the study and development of multilingualism and of the languages of the world.

 

Linguasphere Observatory  and  British Standards Institution                                   August 2001

 

 

Comments on this paper, prepared at relatively short notice for the ISO TC/37 meeting in Toronto, will be greatly welcomed, by post or by e-mail to /research@linguasphere.net/. 
A more detailed proposal will be prepared by the Linguasphere Observatory for the beginning of 2002, including the orderly extension of identification codes to all spoken and written languages, and the examination of procedures for combining language codes with codes for countries and for scripts.


Bibliography

 

Baker, Philip & Eversley, John, Multilingual Capital: the languages of London's schoolchildren, Battlebridge Press: London, 2000

Constable, Peter and Simons, Gary (SIL), Language Identification and IT: Addressing problems of linguistic diversity on a global scale, presented to the 17th International Unicode Conference, San Jose (California), September 2000.

Grimes, Barbara F. (editor), Ethnologue: Languages of the World (14th ed.), SIL: Dallas, 2000

ISO Code for the representation of names of languages, ISO, 1998

Linguasphere Register of the World's Languages and Speech Communities (2 volumes), Linguasphere Press: Hebron (Wales), 2000

 

 


Appendix 1:  ISO 639 Codes for the Representation of Language Names
 correlated with the Linguasphere Referential Framework of 100 Zones

ISO 639-1 is an Alpha-2 code
 ISO 639-2/T & /B are Alpha-3 (/T= terminology code; /B = bibliographic code)
 The Linguasphere Referential Framework (LRF) is a Numeric-2 code (see Appendix 2)

(extract)  A-H

as arranged alphabetically by English name of language

The proposed Identification Code will comprise the LRF + ISO 639-1 or 639-2/T elements

Language Name (English) | Language Name (French) | LRF + | 639-1 | 639-2/T | 639-2/B
Abkhazian | abkhaze | 42 | ab | abk | abk
Achinese | aceh | 31 | | ace | ace
Acoli | acoli | 04 | | ach | ach
Adangme | adangme | 96 | | ada | ada
Afar | afar | 14 | aa | aar | aar
Afrihili | afrihili | 99 | | afh | afh
Afrikaans | afrikaans | 52 | af | afr | afr
Afro-Asiatic (Other) | afro-asiatiques, autres langues | 1 | | afa | afa
Akan | akan | 96 | | aka | aka
Akkadian | akkadien | 12 | | akk | akk
Albanian | albanais | 55 | sq | sqi | alb
Aleut | aléoute | 60 | | ale | ale
Algonquian languages | algonquines, langues | 62 | | alg | alg
Altaic (Other) | altaïques, autres langues | 4 | | tut | tut
Amharic | amharique | 12 | am | amh | amh
Apache languages | apache | 61 | | apa | apa
Arabic | arabe | 12 | ar | ara | ara
Aramaic | araméen | 12 | | arc | arc
Arapaho | arapaho | 62 | | arp | arp
Araucanian | araucan | 85 | | arn | arn
Arawak | arawak | 82 | | arw | arw
Armenian | arménien | 57 | hy | hye | arm
Artificial (Other) | artificielles, autres langues | | | art | art
Assamese | assamais | 59 | as | asm | asm
Athapascan languages | athapascanes, langues | 61 | | ath | ath
Australian languages | australiennes, langues | 2 | | aus | aus
Austronesian (Other) | malayo-polynésiennes, autres langues | 3 | | map | map
Avaric | avar | 42 | | ava | ava
Avestan | avestique | 58 | ae | ave | ave
Awadhi | awadhi | 59 | | awa | awa
Aymara | aymara | 84 | ay | aym | aym
Azerbaijani | azéri | 44 | az | aze | aze

Balinese | balinais | 31 | | ban | ban
Baltic (Other) | baltiques, autres langues | 54 | | bat | bat
Baluchi | baloutchi | 58 | | bal | bal
Bambara | bambara | 00 | | bam | bam
Bamileke languages | bamilékés, langues | 99 | | bai | bai
Banda | banda | 93 | | bad | bad
Bantu (Other) | bantoues, autres langues | 99 | | bnt | bnt
Basa | basa | 95 or 99 | | bas | bas
Bashkir | bachkir | 44 | ba | bak | bak
Basque | basque | 40 | eu | eus | baq
Batak (Indonesia) | batak (Indonésie) | 31 | | btk | btk
Beja | bedja | 13 | | bej | bej
Belarusian | biélorusse | 53 | be | bel | bel
Bemba | bemba | 99 | | bem | bem
Bengali | bengali | 59 | bn | ben | ben
Berber (Other) | berbères, autres langues | 10 | | ber | ber
Bhojpuri | bhojpuri | 59 | | bho | bho
Bihari | bihari | 59 | bh | bih | bih
Bikol | bikol | 31 | | bik | bik
Bini | bini | 20 or 98 | | bin | bin
Bislama | bichlamar | 52 | bi | bis | bis
Bosnian | bosniaque | 53 | bs | bos | bos
Braj | braj | 59 | | bra | bra
Breton | breton | 50 | br | bre | bre
Buginese | bugi | 31 | | bug | bug
Bulgarian | bulgare | 53 | bg | bul | bul
Buriat | bouriate | 44 | | bua | bua
Burmese | birman | 77 | my | mya | bur

Caddo | caddo | 64 | | cad | cad
Carib | caribe | 80 | | car | car
Catalan | catalan | 51 | ca | cat | cat
Caucasian (Other) | caucasiennes, autres langues | 42 | | cau | cau
Cebuano | cebuano | 31 | | ceb | ceb
Celtic (Other) | celtiques, autres langues | 50 | | cel | cel
Central American Indian (Other) | indiennes d'Amérique centrale, autres langues | 6 | | cai | cai
Chagatai | djaghataï | 44 | | chg | chg
Chamic languages | chames, langues | 31 | | cmc | cmc
Chamorro | chamorro | 31 | ch | cha | cha
Chechen | tchetchène | 42 | ce | che | che
Cherokee | cherokee | 63 | | chr | chr
Cheyenne | cheyenne | 82 | | chy | chy
Chibcha | chibcha | 81 | | chb | chb
Chichewa; Nyanja | chichewa; nyanja | 99 | ny | nya | nya
Chinese | chinois | 79 | zh | zho | chi
Chinook jargon | chinook, jargon | 66 | | chn | chn
Chipewyan | chipewyan | 61 | | chp | chp
Choctaw | choctaw | 68 | | cho | cho
Church Slavic | slavon d'église | 53 | cu | chu | chu
Chuukese | chuuk | 38 | | chk | chk
Chuvash | tchouvache | 44 | cv | chv | chv
Coptic | copte | 11 | | cop | cop
Cornish | cornique | 50 | kw | cor | cor
Corsican | corse | 51 | co | cos | cos
Cree | cree | 62 | | cre | cre
Creek | muskogee | 68 | | mus | mus
Creoles and pidgins (Other) | créoles et pidgins divers | | | crp | crp
Creoles and pidgins, English-based (Other) | créoles et pidgins anglais, autres | 52 | | cpe | cpe
Creoles and pidgins, French-based (Other) | créoles et pidgins français, autres | 51 | | cpf | cpf
Creoles and pidgins, Portuguese-based (Other) | créoles et pidgins portugais, autres | 51 | | cpp | cpp
Croatian | croate | 53 | hr | hrv | scr
Cushitic (Other) | couchitiques, autres langues | 1 | | cus | cus
Czech | tchèque | 53 | cs | ces | cze

Dakota | dakota | 64 | | dak | dak
Danish | danois | 52 | da | dan | dan
Dayak | dayak | 31 | | day | day
Delaware | delaware | 62 | | del | del
Dinka | dinka | 04 | | din | din
Divehi | maldivien | 59 | | div | div
Dogri | dogri | 59 | | doi | doi
Dogrib | dogrib | 61 | | dgr | dgr
Dravidian (Other) | dravidiennes, autres langues | 49 | | dra | dra
Duala | douala | 99 | | dua | dua
Dutch | néerlandais | 52 | nl | nld | dut
Dutch, Middle (ca. 1050-1350) | néerlandais moyen (ca. 1050-1350) | 52 | | dum | dum
Dyula | dioula | 00 | | dyu | dyu
Dzongkha | dzongkha | 70 | dz | dzo | dzo

Efik | efik | 98 | | efi | efi
Egyptian (Ancient) | égyptien | 11 | | egy | egy
Ekajuk | ekajuk | 99 | | eka | eka
Elamite | élamite | 12 | | elx | elx
English | anglais | 52 | en | eng | eng
English, Middle (1100-1500) | anglais moyen (1100-1500) | 52 | | enm | enm
English, Old (ca.450-1100) | anglo-saxon (ca.450-1100) | 52 | | ang | ang
Esperanto | espéranto | 51 | eo | epo | epo
Estonian | estonien | 41 | et | est | est
Ewe | éwé | 96 | | ewe | ewe
Ewondo | éwondo | 99 | | ewo | ewo

Fang | fang | 99 | | fan | fan
Fanti | fanti | 96 | | fat | fat
Faroese | féroïen | 52 | fo | fao | fao
Fijian | fidjien | 39 | fj | fij | fij
Finnish | finnois | 41 | fi | fin | fin
Finno-Ugrian (Other) | finno-ougriennes, autres langues | 41 | | fiu | fiu
Fon | fon | 96 | | fon | fon
French | français | 51 | fr | fra | fre
French, Middle (1400-1600) | français moyen (1400-1600) | 51 | | frm | frm
French, Old (842-1400) | français ancien (842-1400) | 51 | | fro | fro
Frisian | frison | 52 | fy | fry | fry
Friulian | frioulan | 51 | | fur | fur
Fulah | peul | 90 | | ful | ful

Ga | ga | 96 | | gaa | gaa
Gaelic (Scots) | gaélique d'Ecosse | 50 | gd | gla | gla
Gallegan | galicien | 51 | gl | glg | glg
Ganda | ganda | 99 | | lug | lug
Gayo | gayo | 31 | | gay | gay
Gbaya | gbaya | 93 | | gba | gba
Geez | guèze | 12 | | gez | gez
Georgian | géorgien | 42 | ka | kat | geo
German | allemand | 52 | de | deu | ger
German, Low; Saxon, Low; Low German; Low Saxon | allemand, bas; saxon, bas; bas allemand; bas saxon | 52 | | nds | nds
German, Middle High (ca.1050-1500) | allemand, moyen haut (ca. 1050-1500) | 52 | | gmh | gmh
German, Old High (ca.750-1050) | allemand, vieux haut (ca. 750-1050) | 52 | | goh | goh
Germanic (Other) | germaniques, autres langues | 52 | | gem | gem
Gilbertese | kiribati | 38 | | gil | gil
Gondi | gond | 49 | | gon | gon
Gorontalo | gorontalo | 31 | | gor | gor
Gothic | gothique | 52 | | got | got
Grebo | grebo | 95 | | grb | grb
Greek, Ancient (to 1453) | grec ancien (jusqu'à 1453) | 56 | | grc | grc
Greek, Modern (1453-) | grec moderne (après 1453) | 56 | el | ell | gre
Guarani | guarani | 88 | gn | grn | grn
Gujarati | goudjrati | 59 | gu | guj | guj
Gwich´in | gwich´in | 61 | | gwi | gwi

Haida | haida | 66 | | hai | hai
Hausa | haoussa | 19 | ha | hau | hau
Hawaiian | hawaïen | 39 | | haw | haw
Hebrew | hébreu | 12 | he | heb | heb
Herero | herero | 99 | hz | her | her
Hiligaynon | hiligaynon | 31 | | hil | hil
Himachali | himachali | 59 | | him | him
Hindi | hindi | 59 | hi | hin | hin
Hiri Motu | hiri motu | 34 | ho | hmo | hmo
Hittite | hittite | 5 | | hit | hit
Hmong | hmong | 48 | | hmn | hmn
Hungarian | hongrois | 41 | hu | hun | hun
Hupa | hupa | 61 | | hup | hup

 

 

Appendix 2: Linguasphere Referential Framework of 10 Sectors & 100 Zones

 

Principles of the Linguasphere Numeric Code: first digit = Sectors; second digit = Zones

5 geosectors (initial even digit), comprising 28 geozones + 22 phylozones
5 phylosectors (initial odd digit), comprising 50 phylozones

 

                                                                 

Sectors  The mother-tongues of the majority of humankind have been classified, with widespread agreement, within only five major linguistic families or affinities: Afro-Asiatic (or Hamitic-Semitic), Austronesian, Indo-European, Sino-Tibetan (or Sino-Indian) and Atlantic-Congo (or Transafrican, corresponding to 'old Niger-Congo' less Mande). A primary referential division which may be established within the linguasphere is the division between (i) languages classified outside the five major affinities, and (ii) all those languages which have been classified within them. Languages in category (i) have been classified by cautious historical linguists into more than two hundred separate entities, and are classified initially in the Linguasphere Register within five geosectors (see first column on this page), corresponding to each continent where they are spoken. Languages in category (ii) are classified within five linguistic phylosectors corresponding to the continental or intercontinental affinity to which each of them belongs (see second column). Each of the five geosectors bears the name of the relevant continental area (ending in English always in –a), while each of the phylosectors bears the name corresponding to the relevant affinity (ending in English always in –an).  The ten sectors are ordered (both alphabetically and numerically) in such a way that geosectors are indicated by even digits and phylosectors by odd digits.

Zones  The second layer of classification, indicated by each pair of digits, is composed of the 100 zones listed on this page, representing the most useful referential division of each of the above geosectors and phylosectors into ten parts.  Within each phylosector, the component zones (phylozones) are based on the known linguistic subdivisions of each of the affinities concerned, selected subdivisions being either combined or further divided to arrive at a total of ten referential parts.  5=Indo-European, for example, divides readily into ten phylozones or "branches", whereas under 1=Afro-Asian a total of ten phylozones is arrived at by allocating three zones to the more complex Chadic "branch" of the Afro-Asiatic affinity.  Within the five geosectors, 22 of the 50 component zones are themselves phylozones, corresponding to wider or narrower affinities, as in the case of 00=Mandic in Africa or 41=Uralic in Eurasia.  The remaining 28 zones are geozones, corresponding to geographic groupings of languages which may (but do not necessarily) share a geo-typological relationship, as in the case of 42=Caucasus or 43=Siberia. Languages within the same geozone should never be assumed to be linguistically related, although some of them may be (as clearly indicated in the Linguasphere Register).  The 28 geozones together account for a total of 380 sets of related languages (in contrast to 314 sets included within 72 phylozones), although representing only a small minority of the world's current population. Names given to the 100 zones are harmonised by use of the suffix -ic, and the zones of each sector are numbered as far as possible from north to south and/or from west to east.

0=AFRICA geosector | 1=AFRO-ASIAN phylosector
00=MANDIC | 10=TAMAZIC
01=SONGHAIC | 11=COPTIC
02=SAHARIC | 12=SEMITIC
03=SUDANIC | 13=BEJIC
04=NILOTIC | 14=CUSHITIC
05=EAST-SAHEL geozone | 15=EYASIC
06=KORDOFANIC | 16=OMOTIC
07=RIFT-VALLEY geozone | 17=CHARIC
08=KHOISANIC | 18=MANDARIC
09=KALAHARI geozone | 19=BAUCHIC

 

 

2=AUSTRALASIA geosector | 3=AUSTRONESIAN phylosector
20=ARAFURA geozone | 30=TAIWANIC
21=MAMBERAMO geozone | 31=HESPERONESIC
22=MADANGIC | 32=MESONESIC
23=OWALAMIC | 33=HALMAYAPENIC
24=TRANSIRIANIC | 34=NEOGUINEIC
25=CENDRAWASIH geozone | 35=MANUSIC
26=SEPIK-VALLEY geozone | 36=SOLOMONIC
27=BISMARCK-SEA geozone | 37=KANAKIC
28=NORTH-AUSTRALIA geozone | 38=WEST-PACIFIC
29=TRANSAUSTRALIA geozone | 39=TRANSPACIFIC

 

 

4=EURASIA geosector | 5=INDO-EUROPEAN phylosector
40=EUSKARIC | 50=CELTIC
41=URALIC | 51=ROMANIC
42=CAUCASUS geozone | 52=GERMANIC
43=SIBERIA geozone | 53=SLAVIC
44=TRANSASIA geozone | 54=BALTIC
45=EAST-ASIA geozone | 55=ALBANIC
46=SOUTH-ASIA geozone | 56=HELLENIC
47=DAIC | 57=ARMENIC
48=MIENIC | 58=IRANIC
49=DRAVIDIC | 59=INDIC

 

 

6=NORTH-AMERICA geosector | 7=SINO-INDIAN phylosector
60=ARCTIC | 70=TIBETIC
61=NADENIC | 71=HIMALAYIC
62=ALGIC | 72=GARIC
63=SAINT-LAWRENCE geozone | 73=KUKIC
64=MISSISSIPPI geozone | 74=MIRIC
65=AZTECIC | 75=KACHINIC
66=FARWEST geozone | 76=RUNGIC
67=DESERT geozone | 77=IRRAWADDIC
68=GULF geozone | 78=KARENIC
69=MESO-AMERICA geozone | 79=SINITIC

 

 

8=SOUTH-AMERICA geosector | 9=TRANSAFRICAN phylosector
80=CARIBIC | 90=ATLANTIC
81=INTER-OCEAN geozone | 91=VOLTAIC
82=ARAWAKIC | 92=ADAMAWIC
83=PRE-ANDES geozone | 93=UBANGIC
84=ANDES geozone | 94=MELIC
85=CHACO-CONE geozone | 95=KRUIC
86=MATO-GROSSO geozone | 96=AFRAMIC
87=AMAZON geozone | 97=DELTIC
88=TUPIC | 98=BENUIC
89=BAHIA geozone | 99=BANTUIC

 

 


The Linguasphere sectors and zones form a stable system of reference for the world's languages, providing a transnational framework
 for linguistic study and a stable "workbench" on which the jigsaw of linguistic relationships may be assembled and re-ordered as necessary.
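
The parity rule for sectors and the geozone/phylozone distinction can be illustrated with a minimal sketch. The sector names and the set of 28 geozones are transcribed from the listing above; the function name `describe_lrf` and the dictionary layout are hypothetical, chosen only for this illustration, and are not part of the Linguasphere Register or of ISO 639.

```python
# Sector names as listed in this appendix (first digit -> name).
SECTOR_NAMES = {
    0: "AFRICA", 1: "AFRO-ASIAN", 2: "AUSTRALASIA", 3: "AUSTRONESIAN",
    4: "EURASIA", 5: "INDO-EUROPEAN", 6: "NORTH-AMERICA", 7: "SINO-INDIAN",
    8: "SOUTH-AMERICA", 9: "TRANSAFRICAN",
}

# The 28 zones labelled "geozone" in this appendix; every other zone is a phylozone.
GEOZONES = {
    "05", "07", "09",
    "20", "21", "25", "26", "27", "28", "29",
    "42", "43", "44", "45", "46",
    "63", "64", "66", "67", "68", "69",
    "81", "83", "84", "85", "86", "87", "89",
}


def describe_lrf(code: str) -> dict:
    """Decode a one-digit sector code or a two-digit zone code (hypothetical helper)."""
    if not (code.isascii() and code.isdigit() and len(code) in (1, 2)):
        raise ValueError(f"expected one or two digits, got {code!r}")
    sector = int(code[0])
    info = {
        "sector": sector,
        "sector_name": SECTOR_NAMES[sector],
        # Even first digit = geosector, odd first digit = phylosector.
        "sector_kind": "geosector" if sector % 2 == 0 else "phylosector",
    }
    if len(code) == 2:
        info["zone"] = code
        info["zone_kind"] = "geozone" if code in GEOZONES else "phylozone"
    return info
```

For example, `describe_lrf("52")` reports the INDO-EUROPEAN phylosector and a phylozone, while `describe_lrf("42")` reports the EURASIA geosector and a geozone.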

Source: Linguasphere Register of the World's Languages and Speech Communities (2 vols), Linguasphere Press, Hebron (Wales): 2000

 

 


Appendix 3:

CHART OF THE WORLD'S ARTERIAL LANGUAGES
 each reaching over 1% of Humankind (above 60 million hearers)
The linguasphere is the mantle of multilingual communication woven around the planet by humankind since children first learned to talk.
A total of 28 arterial languages can each reach more than 1% of humankind – over 60 million hearers each – as either first or "second" languages.  Several comprise a sequence of closely related "inner languages", some with different scripts (e.g. Hindi-Urdu, or Thai-Lao), while others may cover wide spoken variations with a single written standard (e.g. Arabic or German). Arrows → denote that the preceding language may be partially intelligible to speakers of the language(s) following.  Hearers reached by two arterial languages are counted within the total "range" of both.  Population ranges, internal percentages [%], & global percentages % are rounded estimates.  The ISO 639 codes comprise the 2-letter and 3-letter tags adopted by the International Organization for Standardization for major languages (the ISO 639-1 "Alpha-2" and 639-2/T "Alpha-3" codes).
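
The arithmetic behind the chart can be sketched as follows. The world population of 6,000 million is an assumption inferred from the statement that 60 million hearers correspond to 1% of humankind; the tier thresholds (2%, 8%, 16%) are those of the chart legend. The names `WORLD_POPULATION_M`, `global_share` and `tier` are hypothetical, and since the chart's ranges are rounded estimates the boundaries are treated as inclusive.

```python
# Assumption: 60 million hearers = 1% of humankind implies about 6,000 million people.
WORLD_POPULATION_M = 6000


def global_share(range_millions: float) -> float:
    """Return a language's range as a percentage of the assumed world population."""
    return 100.0 * range_millions / WORLD_POPULATION_M


def tier(range_millions: float) -> str:
    """Classify a range (in millions of hearers) against the chart's thresholds."""
    share = global_share(range_millions)
    for threshold, label in ((16, "over 16%"), (8, "over 8%"), (2, "over 2%"), (1, "over 1%")):
        if share >= threshold:  # inclusive, because chart figures are rounded
            return label
    return "below 1%"
```

With these assumptions a range of 1000m falls in the highest tier, 500m in the "over 8%" tier and 250m in the "over 2%" tier.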

When the sun is over the western Pacific, the most spoken language is Chinese, followed by Hindi.   12 hours later, English and Spanish take the lead.
This chart, which may be reproduced freely for educational use, is adapted from The Linguasphere Register of the World's Languages and Speech Communities (2 volumes, 1043 pages) by David Dalby, obtainable from the Linguasphere Observatory, Hebron, SA34 0XT, Wales or from www.linguasphere.net.  The Register is the first systematic roll-call of humankind, based on language rather than nation-state.  Completed for the start of the new millennium, it will now be permanently updated and expanded as a free resource open to all on the worldwide web.  This public service, undertaken initially in Wales, Russia, India and France, depends on new data from users of the Register worldwide, and on support and sponsorship of the Observatory's research by institutions, companies and individuals around the globe.

The Linguasphere Observatory is an independent transnational research network dedicated to the promotion of understanding in a multilingual world.

Key to the chart

WIDER AFFINITIES: [single odd digit] = wider affinity (see note below); [two digits] = Linguasphere zone
ARTERIAL LANGUAGES: each reaching over 60 million hearers = over 1% of humankind (* over 2%; ** over 8%; *** over 16%)
RANGE: in millions
MAJOR COUNTRIES OR REGIONS: including official or co-official use, followed by principal diasporas (marked "diasporas:")
SCRIPTS: including A = Arabic, L = Latin
LITERATE: transnational [%], F = female, M = male
ONLINE SHARE: global %

Arterial language | Range | Major countries or regions | ISO 639 codes | Scripts | Literate [%] | Online share

[1] AFRO-ASIAN (Afro-Asiatic)
    [12] Semitic
ARABIC * (al-'Arabiyya, including Maghribi or Arabic "West" + Mashriqi or Arabic "East" + Badawi or Bedouin Arabic) | 250m | Morocco; Algeria; Tunisia; Libya; Chad; Egypt; Israel; Palestine; Jordan; Saudi Arabia; Iraq; Lebanon; Syria; Iran; Gulf states; Oman; Yemen; Sudan; Mauritania; diasporas: France... | ar/ara | A | [40%F ~ 65%M] | 1%

[3] AUSTRONESIAN
    [31] Hesperonesic
MALAY-INDONESIAN * (including Malayu + Bahasa-Indonesia) | 200m | Malaysia; Singapore; Indonesia; diasporas: Netherlands... | ms/msa + id/ind | L; (A) | [80%F ~ 90%M] |
→ JAVANESE (Jawa) | 100m | Indonesia; diasporas: Surinam... | jw/jaw | Javanese | [80%F ~ 90%M] |
TAGALOG (including Pilipino) → other Transphilippine languages | 60m | Philippines; diasporas: USA; Canada... | tl/tgl | L | [90%FM] |

[4] "other" languages of Eurasia
    [44] Turkic
TURKISH-AZERBAIJANI (including Türkçe + Azeri + Turkmen) → other Turkic languages | 100m | Turkey; Bulgaria; Greece; Cyprus; Iran; Azerbaijan; Turkmenistan & Central Asia; diasporas: Russian Fed.; Germany... | tr/tur + az/aze | L; Cyrillic; (A) | [75%F ~ 90%M] |
    [45] isolated East Asia language
JAPANESE * (Nihongo) | 130m | Japan; diasporas: USA; Brazil; Peru... | ja/jpn | Sino-Japanese | [95%FM] | 10%
    [45] isolated East Asia language
KOREAN (Hankukmal) | 75m | S.Korea; N.Korea; diasporas: China; Japan; Russian Fed.; USA... | ko/kor | Korean; (Chinese) | [95%FM] | 4%
    [46] Mon-Khmer
VIETNAMESE (Việt) | 75m | Vietnam; Cambodia; diasporas: USA... | vi/vie | L | [90%F ~ 95%M] |
    [47] Daic / Tai
THAI-LAO (incl. Thai + Isan + Lao + Huang + Buyi) | 90m | Thailand; Laos; Vietnam; China; diasporas: Singapore... | th/tha + lo/lao | Thai; Lao | [55%F ~ 70%M] |
    [49] "Sanskritised" Dravidian
TAMIL → Malayalam | 90m | India; Sri Lanka; diasporas: Malaysia; Singapore; Mauritius; Germany... | ta/tam | Tamil | [65%F ~ 75%M] |
TELUGU | 70m | India; diasporas: Malaysia... | te/tel | Telugu | [35%F ~ 60%M] |

[5] INDO-EUROPEAN
    [51] Romance / Latin-related
SPANISH ** (Español) | 500m | Spain; the Americas; Morocco; Western Sahara; Equatorial Guinea | es/spa | L | [85%F ~ 90%M] | 6%
→ PORTUGUESE * (Português) → Portuguese-based creoles (Crioulo) | 200m | Portugal; Brazil; Cape Verde; Guinea-Bissau; São Tomé; Mozambique; Angola; India (Goa); Macau; diasporas: South Africa; France; Paraguay... | pt/por | L | [85%FM] | 3%
FRENCH * (Français) → French-based creoles (Créole) | 135m | France; Belgium; Luxemburg; Switzerland; French Guiana, Antilles & Polynesia; New Caledonia; S.E.Asia; Canada; W. & Central Africa; Djibouti; Madagascar; Lebanon; Indian Ocean islands... | fr/fra | L | [90%FM] | 4%
ITALIAN (Italiano) | 70m | Italy; Switzerland; diasporas: USA; Canada; Argentina... | it/ita | L | [95%FM] | 3%
    [52] "Romanised" Germanic
ENGLISH *** → English-based creoles | 1000m | (countries in all continents) | en/eng | L | [90%FM] | 47%
    [52] Germanic
GERMAN * (Deutsch) → Nederlands (Dutch) | 135m | Germany; Austria; Switzerland; Belgium; Lux.; France; diasporas: Canada; USA; Romania; Russian Fed.; Kazakhstan; Brazil; Argentina; Namibia... | de/deu | L | [95%FM] | 6%
    [53] Slavonic
RUSSIAN-BELARUSSIAN * (including Russkiy + Belarusskaya) | 320m | Russia; Belarus; Ukraine; Moldova; Baltic states; Caucasus; Central Asia; diasporas: Israel; Germany; USA | ru/rus + be/bel | Cyrillic | [95%FM] | 3%
→ UKRAINIAN (Ukrainska) & other Slavonic languages | 45m | Ukraine; Belarus; Russian Fed.; Moldova; Poland; Hungary; Caucasus; Central Asia; diasporas: Canada; USA... | uk/ukr | Cyrillic | [95%FM] |
    [58] Iranic / Iranian
PERSIAN-TAJIK (incl. Farsi + Dari + Tajiki) | 60m | Iran; Turkey; Caucasus; Saudi Arabia; Iraq; Afghanistan; Tajikistan; diasporas: USA; Germany... | fa/fas + tg/tgk | A | [45%F ~ 70%M] |
    [59] Indic / Sanskrit-related
HINDI-URDU ** (incl. Urdu + Hindi + Braj + Awadhi + Bhojpuri + Maithili, etc, incl. former Hindustani) | 900m | India; Pakistan; Bangladesh; Nepal; diasporas: Fiji; Mauritius; S.Africa; Uganda; UK; Caribbean... | hi/hin + ur/urd | Devanagari; A | [35%F ~ 60%M] |
→ PANJABI (including Panjabi "East" + "West") | 85m | India; Pakistan; diasporas: UK... | pa/pan | Gurmukhi; A | [45%F ~ 65%M] |
BENGALI * (Bangla + Sylhetti) → Assamese (Axamiya) & Oriya | 250m | Bangladesh; India; diasporas: UK... | bn/ben | Bengali | [35%F ~ 60%M] |
MARATHI | 80m | India | mr/mar | Devanagari | [45%F ~ 65%M] |

[7] SINO-INDIAN (Sino-Tibetan)
    [79] Sinitic / "Wider" Chinese
CHINESE Putonghua *** ("Mandarin") | 1000m | China; Taiwan; diasporas: Vietnam; Thailand; Singapore; Malaysia | zh/zho | Chinese | [75%F ~ 90%M] | 8%
→ WU & Xiang, Gan, Hakka, Min-nan, etc. | 85m | China | | | |
→ CANTONESE (Yue) | 70m | China; Vietnam; diasporas: Malaysia; Singapore; Indonesia; USA... | | | |
(Chinese languages (Han-yu) share a common written form.)

[9] TRANSAFRICAN (Atlantic-Congo)
    [99] "Arabicised" Bantu
SWAHILI (Kiswahili) → other Bantu "Inner-East" languages | 90m | Tanzania; Kenya; Uganda; Rwanda; Burundi; Congo Dem.Rep.; Somalia; Comoro Islands | swa | L; (A) | [55%F ~ 75%M] |

The Linguasphere Register classifies and annotates 13,840 "inner languages" (plus dialects) within 4,994 "outer languages" and 694 "sets" of languages.
Over half the world's modern languages are classified according to linguistic affinities into one of 5 major phylosectors or "families", numbered [1], [3], [5], [7], [9].  All other smaller groupings of languages, or isolated languages, are classified within 5 major geosectors or continental areas, numbered [0], [2], [4], [6], [8].  All five phylosectors and one geosector – [4] Eurasia – are represented by arterial languages in the above table.  Each of the 10 phylosectors or geosectors is subdivided, on linguistic and/or geographical grounds, into 10 zones of reference, numbered [00] to [99].  A simple 2-digit tag, indicating a zone, thus serves to locate any name of any language or dialect or speech community within the linguasphere (over 71,000 such names being recorded, classified, coded and indexed in the Linguasphere Register).

© Linguasphere Observatory, Hebron SA34 0XT, Wales      observatory@linguasphere.net      tel. [+44] 1994 419.660 (fax 419.300)


 

Appendix 4: Global Language Index (sample extract)

See table on following pages, alongside notes on columns A to J below.

 

The following table shows entries for language names beginning ga to gaf-.  The Global Language Index will be based on the existing Linguasphere Index of over 71,000 linguistic and ethnolinguistic names (see Linguasphere Register Vol.1), which is still expanding.  ISO 639 (columns D and E) will be an essential part of the Global Language Index, and it is proposed that the progressive expansion of ISO 639-2, to cover all languages and language names in the Index, should be undertaken with the guidance of TC 37 and of the established Maintenance Agencies for ISO 639 and 639-2.

It is NOT a proposal of the UK that the international Linguasphere Observatory should itself become a Maintenance Agency, but rather that it should confer and collaborate closely with the Registration Authorities already established for ISO 639-1 (Infoterm in Vienna) and for ISO 639-2 (Library of Congress).

Columns

A          Names of languages and dialects, following the typographic conventions of the Linguasphere Register and Index: Reference Names (usually autonyms or "own names") are in bold type (in contrast to other Alternative Names); Names of Outer Languages have an initial capital; Names of Inner Languages and Dialects are in lower case throughout (Names of Dialects are indented).  Umbrella names (groups and families of languages) are excluded from this sample.

B          Status of languages, indicating where a language is in national or regional official use, where it is now extinct, or where it has been revived or partially revived.

C          Proposed alphanumeric identification code, comprising the ISO 639-1 (column D) or 639-2 (column E) prefixed by the digits of the Linguasphere Referential Framework (first element in column H).  Each 2-letter or 3-letter element in the identification code will remain unambiguous on a world scale.  The (optionally omitted) 2-digit prefix will serve to provide (a) a wider informational content to the code, (b) a basis for the linguistic and/or geographical sorting of coded items, and (c) an automatic check on the accuracy of the following 3-letter element. No hyphen will be included between the component digits and letters (to maintain a clear distinction from the relationship scale in column H).  In the present extract, only a few entries are already covered by ISO 639 and in all other cases the Linguasphere digits have been followed by *** (in anticipation of future ISO alpha codes).  In the case of dialects (indented entries) it is proposed that special codes should not normally be allocated, but that the name of the dialect should follow the code of the relevant language (separated by a forward slash). 

D          ISO 639-1 codes, where these already exist.  It is proposed that these be used specifically and exclusively to denote standardised written languages.

E          ISO 639-2/T codes (and ISO 639-2/B codes in brackets), where these already exist.

F          SIL (Summer Institute of Linguistics) codes, as used in the Ethnologue.  It is not proposed that these codes be adopted automatically to fill gaps in the ISO 639 series, without first establishing principles for the selection of letters, including problems of conflict with established ISO 639 codes.

G         OpenType language tags, developed for IT use by Adobe and Microsoft.

H          Linguasphere codes, as used in the Linguasphere Register. These comprise the 2 digits of the Linguasphere Referential Framework, plus the 5 to 6 layer Relationship Scale.  Items defined as languages are all classified in the Linguasphere Register as outer languages or inner languages (subdivided into dialects where appropriate), i.e. represented respectively by a first or second minuscule (subdivided by a third minuscule where appropriate) in the Relationship Scale.

I           Demoscale or 10-point demographic scale, as used in the Linguasphere Register.  All outer languages are ranked in terms of relative demographic importance by a single digit, representing the order of magnitude of speakers (first or second language in 1999/2000) on a scale ranging from 0 (extinct between 1900 & 1999) through 2 (100+), 3 (1000+), 4 (10,000+), 5 (100,000+), 6 (1,000,000+), 7 (10,000,000+), 8 (100,000,000+) to 9 (over one billion).

J          Country or principal countries where spoken.
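
The conventions described for columns C, H and I can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, the regular expression reflects only the shape of the codes shown in this sample (two digits, three capitals, one to three minuscules), and the value 1 on the demoscale, which the note on column I does not define, is assumed here to mean fewer than 100 speakers.

```python
import re
from typing import Optional, Tuple

# Shape of the column H codes in this sample, e.g. "96-LAA-a" or "90-BAA-acb".
LINGUASPHERE_RE = re.compile(r"^(\d{2})-([A-Z]{3})-([a-z]{1,3})$")

# Column H note: first minuscule = outer language, second = inner language,
# third = dialect.
LEVELS = {1: "outer language", 2: "inner language", 3: "dialect"}


def parse_linguasphere(code: str) -> Tuple[str, str, str]:
    """Split a column H code into (zone digits, capitals, level)."""
    match = LINGUASPHERE_RE.match(code.strip())
    if match is None:
        raise ValueError(f"not a recognised Linguasphere code: {code!r}")
    zone, capitals, minuscules = match.groups()
    return zone, capitals, LEVELS[len(minuscules)]


def identification_code(linguasphere_code: str,
                        iso_alpha: Optional[str] = None,
                        dialect_name: Optional[str] = None) -> str:
    """Build the proposed column C code: zone digits + ISO letters, no hyphen.

    '***' stands in where no ISO 639 code exists yet; for a dialect, the
    dialect name follows the code, separated by a forward slash.
    """
    zone, _, level = parse_linguasphere(linguasphere_code)
    code = zone + (iso_alpha or "***")
    if level == "dialect":
        code += "/" + (dialect_name or "")
    return code


def prefix_matches(ident: str, linguasphere_code: str) -> bool:
    """Check that the 2-digit prefix of a column C code agrees with column H."""
    return ident[:2] == parse_linguasphere(linguasphere_code)[0]


def demoscale(speakers: int) -> int:
    """Order of magnitude of speakers on the 0-9 scale of column I.

    Zero speakers are mapped to 0 (the note reserves 0 for languages extinct
    between 1900 and 1999); 1 for fewer than 100 speakers is an assumption.
    """
    if speakers <= 0:
        return 0
    return min(9, max(1, len(str(speakers)) - 1))
```

For instance, `identification_code("96-LAA-a", "gaa")` gives `96gaa`, and `identification_code("90-BAA-acb", dialect_name="gaabu")` gives `90***/gaabu`; `prefix_matches` is one way the digits could act as the "automatic check" mentioned under column C.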

A | B | C | D | E | F | G | H | I | J
 | Regional | 96gaa | | gaa | gac | gad | 96-LAA-a | 5 | Ghana
 | | 91*** | | | gna | | 91-GEB-aa | | Burkina Faso
ga / e-ga | | 96*** | | | | | 96-EAA-aa | | Côte d'Ivoire
g//a | | 08*** | | | | | 08-AAB-cf | | Botswana
   gaabu | | 90***/ | | | | | 90-BAA-acb | | Guinea-Bissau; Guinea
   gaaduwa | | 02***/ | | | | | 02-BAA-abc | | Chad; Niger
g//aakhwe | | 08*** | | | | | 08-AAB-cf | | Botswana
   ga'aliyyin | | 12***/ | | | | | 12-AAC-edd | | Sudan
gaalpu | | 29*** | | | gla | | 29-AAD-aa | | Australia
Gaam (kor-e-gaam) | | 05*** | | | tbi | | 05-MBA-a | 4 | Sudan
Ga'anda (Gaanda) | | 18*** | | | gaa | | 18-HBA-a | 4 | Nigeria
gaandu | | 18*** | | | gaa | | 18-HBA-aa | | Nigeria
   gaangala | | 99***/ | | | | | 99-AUR-gfb | | Congo
gaba | | 14*** | | | gay | | 14-DAA-ag | | Ethiopia
gabadi | | 34*** | | | | | 34-GBE-aa | | Papua New Guinea
g//abake-ntshori | | 08*** | | | | | 08-AAB-dh | | Botswana
gabalbara | Extinct | 29*** | | | | | 29-RAA-bg | | Australia
   gabalitain | | 51***/ | | | | | 51-AAA-gbh | | France
   gabbra | | 14***/ | | | | | 14-FBA-ahc | | Kenya
Gabere (Gaberi) | | 17*** | | | | | 17-DGB-a | 4 | Chad
gabi | | 95*** | | | | | 95-ABA-wc | | Côte d'Ivoire
Gabi+Badjala | Extinct | 29*** | | | | | 29-QBA-a | 0 | Australia
Gabiano | | 26*** | | | | | 26-IAB-b | 2 | Papua New Guinea
gabi-gabi | Extinct | 29*** | | | | | 29-QBA-aa | | Australia
gabin | | 18*** | | | | | 18-HBA-ab | | Nigeria
gablai | | 17*** | | | | | 17-DGC-aa | | Chad
gablet | | 12*** | | | | | 12-ABA-ad | | Oman
   gabo | | 95***/ | | | | | 95-ABA-xbb | | Côte d'Ivoire
Gabo-Bora | | 34*** | | | | | 34-FCB-a | 2 | Papua New Guinea
gabone | | 34*** | | | | | 34-GBB-ah | | Papua New Guinea
gabou | | 93*** | | | | | 93-ABA-fg | | CAR, Congo Dem.Rep.
   gabra | | 14***/ | | | | | 14-FBA-ahc | | Kenya
Gabri (Gabri proper) | | 17*** | | | gab | | 17-DGB-a | 4 | Chad
Gabri (pseudo Gabri 1) | | 17*** | | | | | 17-DGA-a | 4 | Chad
gabri (pseudo gabri 2) | | 92*** | | | | | 92-CAA-db | | Chad
gabri | | 58*** | | | gbz | | 58-AAC-di | | Iran
Gabrieleño | Extinct | 65*** | | | | | 65-ADB-a | 0 | USA
   gabri-kermani | | 58***/ | | | | | 58-AAC-dib | | Iran
Gabri-Kimre | | 17*** | | | | | 17-DGA-a | 4 | Chad
gabu | | 98*** | | | | | 98-CAB-cd | | Nigeria
gabu | | 93*** | | | | | 93-ABA-fg | | CAR, Congo Dem.Rep.
Gabutamon | | 24*** | | | gav | | 24-SCB-a | 2 | Papua New Guinea
   gachikolo | | 59***/ | | | | | 59-AAF-tbe | | India
   gachitl | | 42***/ | | | | | 42-BBA-bga | | India
Gadaba ("Dravidic" Gadaba) | | 49*** | | | gau | | 49-CAB-b | 3 | India
gadaba ("mundic" gadaba) | | 46*** | | | gbj | | 46-CBB-ba | | India

gadaba (pseudo gadaba)       

 

46***

 

 

   

 

46-CBB-ac