Toward a Standardized Format for ASCII Text Documents

A Working Paper of The ICADD Subcommittee on Standardization of ASCII Text Documents

Prepared at the Trace Research and Development Center

Gregg C. Vanderheiden, Ph.D.
Neal Ewers

Keywords: Document access, ASCII, text documents, standard, print disabilities, alternate formats, braille

DRAFT

Table of Contents

1. The Need for A Standard Electronic Format for Electronic ASCII Text Files
1.a. The Need
1.b. Current ASCII Format
1.c. Requirements of the New ASCII Format
2. Overall Goal
3. Formal versus Informal Documents
3.a. Type 1 -- Informal Documents (ICADD-0 Format)
3.b. Type 2 -- Informal Documents (ICADD-8 Format)
3.c. Type 3 -- Formal Documents (ICADD-22)
4. Specific Goals
5. Constraints
6. Proposed Format for Type 1 Documents
7. Proposed Format for Type 2 Documents (ICADD-8) Format
7.a. Tag Rationale
8. Request for Input


1. The Need for A Standard Electronic Format for Electronic ASCII Text Files

1.a. The Need

Individuals who are blind or who have other print disabilities have difficulty in accessing and effectively using documents in print form. One approach to addressing this is to provide the documents in electronic form. Individuals using microcomputers and other electronic reading aids can then access and have the information presented to them in speech, braille, large text, or other suitable form.

Because of the large number of different formats in which electronic text can be stored, specifying that a document must be in "electronic form" will not necessarily result in an electronic document which can in fact be accessed or read. Some standard format which can be read by all software is therefore necessary.
Unless a standard definition of an "ASCII text document" is created, it will not be possible to create tools which can easily work with these documents. Further, it is difficult to specify that people must provide their information as an ASCII text file if no definition as to exactly what that means is provided.

1.b. Current ASCII Format

Currently, the most common format available is what might be called an ASCII text file. This is a file which contains only standard ASCII text characters (Table 1). To accommodate foreign languages, this standard has been revised by the International Organization for Standardization (ISO) as shown in Table 2. In either case, ASCII or ISO, the text file does not include any formatting information. Thus, any information that was encoded in an original document by using boldface, underlining, italics, footnote designations, etc., is lost in a document that is changed into ASCII text form. Since the boldface, underlining, etc., may convey important information, converting a document into a straight ASCII file may in fact cause some important information to be lost and therefore unavailable to the individual using the ASCII text file.

1.c. Requirements of the New ASCII Format

One requirement of a standard ASCII text file format would therefore be that it provide some mechanism for preserving essential formatting information that might otherwise be lost.

A second requirement is that the standard must clearly define how the ASCII text file would be formatted. For example, is there a carriage return at the end of each line, or only at the end of paragraphs? (Documents with carriage returns only at the end of paragraphs cause a problem for some screen reading programs.) If there is a carriage return at the end of each line, how does one identify the end of a paragraph, so that screen readers can read smoothly across lines, but stop at the end of a paragraph?
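
To make the second requirement concrete, the following is a minimal Python sketch, not part of the ICADD draft, of how a reading program might read smoothly across hard line ends yet stop at paragraph ends. It assumes the convention proposed in section 6: every line ends in a carriage return / line feed pair, and a paragraph ends with two such pairs in a row with nothing between them. The function name split_paragraphs and the sample text are hypothetical.

```python
CRLF = "\r\n"  # carriage return followed by line feed


def split_paragraphs(text: str) -> list[str]:
    """Return each paragraph as one string, joining its hard-wrapped lines.

    Assumes paragraphs are separated by two consecutive CR/LF pairs.
    """
    paragraphs = []
    for block in text.split(CRLF + CRLF):
        # Within a paragraph, a single CR/LF is only a line end.
        lines = [line.strip() for line in block.split(CRLF)]
        joined = " ".join(line for line in lines if line)
        if joined:
            paragraphs.append(joined)
    return paragraphs


sample = (
    "Individuals who are blind may have" + CRLF
    + "difficulty using print documents." + CRLF + CRLF
    + "Electronic text is one answer." + CRLF + CRLF
)

for paragraph in split_paragraphs(sample):
    print(paragraph)
```

Without an agreed paragraph marker such as the double carriage return, the same program could not tell a line end from a paragraph end, which is the ambiguity the requirement is meant to remove.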

Table 1: ASCII Characters

The ASCII value is listed to the left, and its corresponding character to the right.

 33 !     34 "     35 #     36 $     37 %     38 &     39 '     40 (
 41 )     42 *     43 +     44 ,     45 -     46 .     47 /     48 0
 49 1     50 2     51 3     52 4     53 5     54 6     55 7     56 8
 57 9     58 :     59 ;     60 <     61 =     62 >     63 ?     64 @
 65 A     66 B     67 C     68 D     69 E     70 F     71 G     72 H
 73 I     74 J     75 K     76 L     77 M     78 N     79 O     80 P
 81 Q     82 R     83 S     84 T     85 U     86 V     87 W     88 X
 89 Y     90 Z     91 [     92 \     93 ]     94 ^     95 _     96 `
 97 a     98 b     99 c    100 d    101 e    102 f    103 g    104 h
105 i    106 j    107 k    108 l    109 m    110 n    111 o    112 p
113 q    114 r    115 s    116 t    117 u    118 v    119 w    120 x
121 y    122 z    123 {    124 |    125 }    126 ~    127 DEL

Table 2: ISO Characters

Table 2 will go here


2. Overall Goal

The purpose of the ICADD ASCII Text Format Standard is to provide a standard format for ASCII text documents. This effort to define a standard ASCII text format is a subset of the overall goals of the International Committee for Accessible Document Design (ICADD). This group, which was formed in 1992, has an overall scope of work which includes both the development of a format for simple ASCII text documents and the development of a standard for more formal publications. The standard for more formal publications is not covered in this subcommittee report.


3. Formal versus Informal Documents

Currently, the ICADD efforts cover three types of documents: two informal and one formal.

3.a. Type 1 -- Informal Documents (ICADD-0 Format)

With the proliferation of computers, there has been a corresponding increase in the number of letters, memos, and other informal written communications which are prepared using word processors rather than typewriters. This makes it possible for a large amount of this material to be sent to people as an ASCII text file when this is their preference.

Type 1 documents include all of those informal documents where there is no formatting (boldface, italics, footnotes, etc.)
which is necessary to understand the documents (or where the loss of boldface, italics, etc., would not alter the reader's ability to understand the document). For this type of information, a very simple ASCII Text Standard has been defined, and is described below. It includes no formatting information, and does not support the use of boldface, underlining, etc., in a document.

3.b. Type 2 -- Informal Documents (ICADD-8 Format)

In addition to informal correspondence and documents, there are also a number of other informal or semi-formal documents and reports which are prepared using standard word processors. In these documents, however, formatting (such as boldface, italic, underline, etc.) is often used to convey important information in the document. In addition, these documents often contain footnotes, side-bars, or boxed text which is interspersed with the running text of the document.

Converting these documents into simple ASCII text files (without preserving the formatting information) can cause both confusion and loss of information. Where text formatting conveyed information, the information would be lost. When footnotes, boxed text, or side-bars suddenly appear intermixed with running text (without any type of marker), the resulting text file can be very confusing and even misleading.

For these types of documents, a set of eight tags is defined which allows users to mark common attributes. Specifically, these tags allow the user to mark boldface, italicized, or other emphasized text, as well as to mark list items, footnotes, picture captions, side-bars or boxed text, and page numbers. This Type 2 document format is referred to as ICADD-8 and is described below.

3.c. Type 3 -- Formal Documents (ICADD-22)

The third type of document defined by the ICADD effort is formal documents, including books, journals, and other formal publications. Such documents can often contain multiple sections or chapters as well as specially formatted text.
In addition, these documents may also include equations, tables, columns, and other specially formatted information. A set of 22 tags has been defined by ICADD to allow these documents to be more effectively accessed and read. In addition, further specialized tag sets are being explored to handle scientific, mathematical, and other types of specially formatted text.

The purpose of these tags is to allow special commercial document readers to translate documents which are in the standard ICADD format into documents that are structured for use in the document reader. The result is a document which can be accessed and used by a person with a print disability in a manner which is both complete (contains all of the information in the original) and efficient (allows rapid movement about and within the text). Specifications for Type 3 documents are provided in a separate document.


4. Specific Goals

This document outlines the current draft of the ICADD specifications for Type 1 and Type 2 informal documents. It is a first draft, and is being released so that persons with print disabilities and others interested in this problem can review it and offer input concerning the proposed specifications.

This document was prepared on the basis of questionnaires answered by, and conversations with, members of the ICADD subcommittee charged with designing the ASCII formats. Members of this committee include:

Jim Allan
Texas School for the Blind
1100 W.
45th Street
Austin, TX 78756
512/454-8631
Internet: jallan@tenet.edu

Charles Crawford, Commissioner
Executive Office of Human Services
Commission for the Blind
Boston, MA 02111-2227
617/727-5550

Judith Dixon
Consumer Relations Office
National Library Service for the Blind and Physically Handicapped
Library of Congress
Washington, DC 20542
202/707-5100
Internet: 74036.2101@Compuserve.com

Neal Ewers
Trace Research and Development Center
Room S-153, Waisman Center
1500 Highland Avenue
Madison, WI 53705
608/263-5485
fax 608/262-8848

John Hernandez
New York Institute for Special Education
9999 Pelham Parkway
Bronx, NY 10469
718/519-7000, extension 348
fax 718/231-9314

David Holladay
Raised Dot Computing
408 S. Baldwin Street
Madison, WI 53703
608/257-9595

Gregg Vanderheiden
Trace Research and Development Center
Room S-151 Waisman Center
1500 Highland Avenue
Madison, WI 53705
608/262-6966
Internet: vanderhe@macc.wisc.edu
fax 608/262-8848


5. Constraints

This section contains a listing of the constraints which a standard format in this area must meet.

a) Any proposed guidelines must work easily on a wide variety of computer platforms.
b) The guidelines must be easy to implement, even on the most rudimentary word processor.
c) The guidelines should use terminology and strategies which can be understood by any person responsible for preparing documents in this format (secretaries, students, etc.).
d) Each level of format should be internally consistent with the higher level formats (e.g., Type 2 must be consistent with Type 3).


6. Proposed Format for Type 1 Documents

This section presents the proposed format for Type 1 documents (ICADD-0 Format). A summary of the format rules is presented, followed by a rationale for each of the rules.

1) Text should be broken up into lines with hard carriage returns at the end of each line.
2) Each line should be no longer than 78 characters.
(65 characters is preferable for documents which are short and where short lines do not cause layout problems.)
3) There should be two carriage returns at the end of each paragraph.
4) All titles within the document text should be preceded by an extra carriage return (for a total of three carriage returns), unless they are at the top of a page or of the document.
5) All carriage returns should be followed by a line feed character.
6) Text in an ICADD-ASCII formatted document is limited to printable ASCII characters with codes between 33 and 127, plus Space (32), Tab (9), Carriage Return / Line Feed (13, 10) and Form Feed or New Page (12). The basic characters for 33 to 127 include (in order):

! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ DEL

The rationale for each rule follows.

1) Text should be broken up into lines with hard carriage returns at the end of each line.
Rationale: Some text readers are not able to scroll past the end of a screen line. Thus, hard carriage returns at the end of each line are necessary in order to keep these programs from crashing.

Comment to Reviewers: The number of programs which cannot handle text without carriage returns at the end of each line is decreasing. Some people felt that we might lean into the future on this, and not specify carriage returns at the end of every line. This would simplify some other aspects of document interpretation. Most of the people we talked to, however, felt that many individuals trying to access these ASCII text files are not yet using the more sophisticated tools, and that, at least for the foreseeable future, it was better to stick with the format that has a hard carriage return on each line. This is therefore included in the current version of the format. Additional comments, pro and con, are invited.

2) Each line should be no longer than 78 characters.
Rationale: Using an 80-character line can cause some computer displays to automatically word-wrap after the 80th character. If the wrapped line is then followed by a carriage return, all of the lines appear double-spaced. A 78-character limit eliminates this problem. All modern computers support an 80-character display. Thus, adhering to this format would result in documents which display without distortion on any standard screen. For printouts, a 78-character line also fits in 6.5" at 10-point Courier (12 characters per inch, so 78 / 12 = 6.5), and thus prints on standard 8 1/2" x 11" paper with 1" margins. For documents which are short, and where short lines will not create layout problems, a 65-character line is more convenient for some users.

3) There should be two carriage returns, with no spaces (or other characters) between them, at the end of each paragraph.
Rationale: More than one carriage return is needed in order to differentiate the carriage return at the end of a paragraph from the carriage return at the end of each line. It is important that there be no characters between the two carriage returns in order to facilitate machine identification of the dual carriage return.

4) All titles within the document text should be preceded by an extra carriage return (for a total of three carriage returns).
Rationale: Providing the third carriage return after paragraphs which precede titles makes it easy to identify titles automatically in a document.

5) All carriage returns should be followed by a line feed character.
Rationale: MS-DOS and other environments provide a line feed following each carriage return in the document. Documents in the Apple Macintosh environment, however, do not provide any line feed following the carriage return. A document with line feeds in either environment is quite readable, although in the Apple Macintosh environment each line is preceded by a square bracket on the screen.
If the line feeds are left out in MS-DOS documents, however, some software will have difficulty with the document. The recommendation is therefore to provide a line feed with every carriage return. For any environments in which the line feed is superfluous, it can be very easily removed using a search-and-replace command. It is expected that translation programs will also be developed that will remove all ICADD format tags from a document and change them directly into format commands for popular word processors (WordPerfect, Microsoft Word, MacWrite, etc.). When this is done, the line feeds can also be removed if appropriate.

6) Text in an ICADD-ASCII formatted document is limited to the ASCII characters with codes between 33 and 127, plus SPACE, TAB, CARRIAGE RETURN (and LINE FEED), and FORM FEED (new page).
Rationale: Characters above ASCII 127 are not standardized. They are also not supported by many programs and readers.


7. Proposed Format for Type 2 Documents (ICADD-8) Format

The ICADD-8 format includes the six guidelines listed above, plus eight additional tags that cover bold, italic, and other emphasized text, as well as lists, footnotes, figure descriptions, side-bars, and page numbers. These tags are:

1. BOLD: text to be bolded

2. ITALICS: text to be in italics

3. OTHER: Other emphasized text

"Other" includes all emphasized text that is not bold, italic, or bold & italic; for example, underlined text.

4. LIST ITEM:
item in list
item in list
item in list

The principal reason for tagging items in a list is to differentiate a list of single-spaced items (with a carriage return at the end of each line) from a paragraph of running text (which would also have a carriage return at the end of each line). Without some way of easily distinguishing a list, screen reading and other automatic processing software may strip out the carriage returns and change a list into a stream of running text.
This would be devastating to most lists, and particularly to lists such as a Table of Contents.

Two options for handling lists are supported. The first option is to place standard SGML list item tags before and after each item in a list.

Option 2:
Item 1
Item 2
Item 3
Item 4

The second option is a special ICADD-8 tag to be placed at the beginning and end of a list. With this option, instead of putting a tag before and after each item in the list, a tag is placed before and after the entire list. This option is provided to make lists easier to read if a person is not using a program that removes the tags. It also makes it easier to create hand-tagged text. This second option is particularly handy when dealing with Tables of Contents and other similar lists, where each item in the list occupies its own line, and the list items can occupy an entire line. (Adding tags before and after each item would cause all of the lines to wrap and break up or be longer than 78 characters.)

Note to Reviewers: This is not a standard SGML tag. It also violates one of the constraints stated above, which says that all of the ICADD specifications should be subsets of each other. It does appear, however, to be a very useful option. Comments pro and con are invited.

5. FOOTNOTE: footnoted text

Footnoted text should be placed in the text and not at the bottom of the page so that it is close to the item it refers to.

6. FIGURE DESCRIPTION:
Text in a figure description
Figure caption
This tag is used both for figure captions and for descriptions of figures. Descriptions should be provided for all figures, pictures, or other illustrations which are not completely redundant with the text of the document.

7. BOXES AND BLOCKED TEXT: Text in a box, side-bars, etc.

Tag all boxed text (e.g., Sidebars, Historical Notes and other miscellaneous inserted text), and place it within the running text of the document at a location similar to its location in the printed document.

8. PRINT PAGE REFERENCE: print page reference

When a document is converted to ASCII text, its text almost always falls on different pages than in the original, or it appears as a continuous text file with no page delimiters. In both cases, it is not possible to make any sense out of page references in the original text document (e.g., "See page 5"), the index, or a Table of Contents. It is also difficult to discuss the document with people using a print copy. Preserving the page boundaries of the original printed document is therefore often important.

7.a. Tag Rationale

These tags (with the exception of the second list option) are all taken directly from the standard SGML tags that are used in the formal Type 3 documents. The purpose of this minimal set of eight tags is to allow tagging of very common formatting information in the informal documents, in order either to preserve formatting information important to understanding the text or to make it easier for automatic text readers to deal with these documents.


8. Request for Input

This is a working document, and input of all types is solicited. Because of the pressure to put out a first release of this standard, however, please send comments sooner rather than later. Also, in order to get the widest possible review and input to the document, please copy and redistribute it to any people or forums you think would be interested.
You can send comments directly to the subcommittee chair via e-mail, regular mail, or fax:

ICADD ASCII Subcommittee
c/o Gregg Vanderheiden (chair)
S-151 Waisman Center
1500 Highland Avenue
Madison, WI 53705
608/262-6966 voice
608/263-5408 TT/TDD
608/262-8848 fax
vanderhe@macc.wisc.edu
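
Reviewers who want to try the Type 1 rules from section 6 on their own files may find a mechanical check useful. The following is a minimal Python sketch, offered as an illustration only and not as part of the proposed standard. It checks rule 2 (78-character limit), rule 3 (no spaces or other characters between the two carriage returns), rule 5 (line feed after every carriage return), and rule 6 (allowed character codes). The function name check_icadd0 and the message wording are hypothetical. Counting a tab as one character is an assumption, and rule 4 (extra carriage return before titles) is not checked because a program cannot know which lines are titles.

```python
# Rule 6: codes 33-127 plus Space, Tab, Line Feed, Form Feed, Carriage Return.
ALLOWED = set(range(33, 128)) | {32, 9, 10, 12, 13}
MAX_LINE = 78  # rule 2


def check_icadd0(data: bytes) -> list[str]:
    """Return a list of human-readable problems found in a Type 1 file."""
    problems = []

    for offset, code in enumerate(data):
        # Rule 6: only the listed character codes may appear.
        if code not in ALLOWED:
            problems.append(f"offset {offset}: character code {code} not allowed")
        # Rule 5: every carriage return must be followed by a line feed.
        if code == 13 and data[offset + 1:offset + 2] != b"\n":
            problems.append(f"offset {offset}: carriage return without line feed")

    # Treat CR/LF, a lone CR, or a lone LF as a line end for the checks below.
    lines = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n").split(b"\n")
    for number, line in enumerate(lines, start=1):
        # Rule 2: no line longer than 78 characters (a tab counts as one).
        if len(line) > MAX_LINE:
            problems.append(f"line {number}: {len(line)} characters (limit {MAX_LINE})")
        # Rule 3: nothing, not even spaces, between paragraph-ending returns.
        if line and not line.strip(b" \t"):
            problems.append(f"line {number}: only spaces or tabs on the line")
    return problems


good = b"Dear reader,\r\n\r\nThis is a short note.\r\n\r\n"
bad = b"caf\xe9\r" + b"x" * 80 + b"\r\n \r\n"

print(check_icadd0(good))
for problem in check_icadd0(bad):
    print(problem)
```

The first sample follows the rules and yields an empty list. The second contains a character above 127, a carriage return with no line feed, an 80-character line, and a line holding only a space, so each of the four checks reports one problem.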