
UKERNA Technology Programme
Call for Proposals: Electronic Document Interchange

Proposer: De Montfort University

Proposal: Standardised electronic document interchange using SGML and DSSSL Lite

Summary

The project will analyse the current international and industry de facto standards in use for electronic document creation, transfer and presentation. It will identify the set of common elements that allow both the logical and the layout aspects of a document to be converted. It will also include the design and implementation of a conversion tool that converts multiple common formats into SGML and DSSSL Lite formats. The SGML and DSSSL Lite documents can then be viewed using a WWW-type browser that will be available for common computer platforms.

The project will be a joint effort between the School of Computing Sciences and the Division of Learning Development at De Montfort University. Estimated project costs for a twelve-month period are £56,280.

Background

Both users and providers of electronic documents have become aware of the difficulties in interchanging this information across networked environments. These problems have arisen from the number of formats available from manufacturers of text processing systems, and they are compounded by the further text and data formats that exist for closely related products such as spreadsheets, electronic mail and databases. Moreover, the process of authoring, typesetting and publishing could be made simpler and more efficient if a recognised standard for document markup and page formatting were adopted.

Word/Text Processing

There is a plethora of word processing packages on the market, which leads inevitably to an increase in proprietary solutions for handling text formatting. Popular formats have been generated by packages such as WordPerfect, Microsoft Word and WordStar. Attempts have been made by leading software houses to provide a method of interchange, and this has led to RTF (Rich Text Format). Unfortunately, this is another proprietary solution, and independent reports have highlighted a number of deficiencies with it.

Much work has been done in the public domain to provide a free and easy-to-use text processing system. This has led to the emergence of TeX, a system widely adopted in the academic world. It has yet to find acceptance outside universities, perhaps because of its size and its need for a separate viewing program.

There appears to be no movement toward a recognised text processing and markup standard, owing to the wide acceptance of many packages for particular platforms and environments. There is also little agreement on character sets. Many English-speaking countries adopt ASCII and appear to reject other alphabets. Until an all-encompassing text processing language that uses an international character set such as ISO 8859 is adopted, there will continue to be word processing packages with vendor-specific solutions to character and document formation.
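The character set problem can be illustrated with a minimal sketch. The sample string and the helper name can_encode are assumptions made for illustration only; the sketch simply checks whether a piece of text survives encoding in ASCII and in ISO 8859-1 (one part of the ISO 8859 family).

```python
# Illustrative only: the sample text and the helper name are assumptions,
# not part of the proposal.
SAMPLE = "Universit\u00e9 de Gen\u00e8ve"  # contains e-acute and e-grave


def can_encode(text, charset):
    """Return True if every character of text is representable in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


ascii_ok = can_encode(SAMPLE, "ascii")        # accented letters are not in ASCII
latin1_ok = can_encode(SAMPLE, "iso-8859-1")  # ISO 8859-1 covers Western European letters
latin1_bytes = SAMPLE.encode("iso-8859-1")    # one byte per character in this charset
```

A document restricted to ASCII would have to drop or transliterate the accented letters, whereas the ISO 8859-1 encoding keeps them at one byte per character.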

Platform dependencies

One of the key elements in the evolution of word/text processing systems has been the platform for which they have been written. WordPerfect is a PC package and as such uses features such as function keys. Sadly, an Apple Mac or a UNIX workstation has a different keyboard, and so WordPerfect must be rewritten and re-learnt for these new environments.

The emergence of UNIX as a de facto industry standard operating system has meant that text processing systems that have been associated with UNIX such as nroff, troff, groff, TeX and LaTeX have become more widely used. Many documents such as the manual pages are formatted using roff. Much public domain software documentation is written in LaTeX.

Designing a processing system that will only work on a single platform increases the complexity of learning and results in ever more systems for converting from one de facto standard to another.

Markup Languages

A number of markup languages exist that have been designed to overcome the platform dependencies of text processing packages such as WordPerfect. Any common text editor can be used to generate marked-up documents, which are then translated into a suitable format for either viewing or printing. De facto standards include TeX and LaTeX, which are widely used in the academic community. SGML (Standard Generalised Markup Language) is an international standard, defined in ISO 8879:1986. Whereas TeX attempts to solve the problems of both logical layout and page formatting, SGML concentrates on the logical content and for that reason has become very important for text retrieval and authorship.
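To make the idea of translating one logical markup into another concrete, the following is a minimal sketch that rewrites a few LaTeX commands as SGML-style elements. The element names, the mapping table and the function latex_to_sgml are all hypothetical; they are not taken from any DTD or from the proposed converter, and the sketch ignores environments, optional arguments and the escaping of characters such as < and &.

```python
import re

# Hypothetical mapping from a few LaTeX commands to SGML-style element names.
# Neither the element names nor the mapping come from the proposal.
COMMAND_TO_ELEMENT = {
    "section": "title",
    "emph": "emph",
    "textbf": "bold",
}

# Matches a backslash command with one braced argument that contains no
# nested braces, e.g. \emph{word}.
_COMMAND = re.compile(r"\\([A-Za-z]+)\{([^{}]*)\}")


def latex_to_sgml(source):
    """Rewrite known single-argument commands as <element>text</element>."""

    def replace(match):
        command, argument = match.group(1), match.group(2)
        element = COMMAND_TO_ELEMENT.get(command)
        if element is None:
            return match.group(0)  # leave unknown commands untouched
        return "<{0}>{1}</{0}>".format(element, argument)

    previous = None
    # Repeat until nothing changes, so nested commands are converted inside-out.
    while previous != source:
        previous = source
        source = _COMMAND.sub(replace, source)
    return source
```

For example, latex_to_sgml(r"\textbf{a \emph{b}}") first converts the inner \emph and then the enclosing \textbf. A real converter would need a proper parser rather than a regular expression, but the table-driven mapping shows why a set of common logical elements makes conversion tractable.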

Another important internationally recognised standard is ODA (Office [Open] Document Architecture), which takes a similar approach to SGML but stresses blind interchange. ODA is defined in the multi-part standard ISO 8613:1988, in the CCITT T.410 series recommendations, and in ECMA standard 101.

Although markup languages are platform independent, they all differ in their approach to page layout, leaving this function to other packages such as dvips.

Page Layout Formats

The principal ISO standard for page formatting is DSSSL. Recent reviews of this standard have indicated that a subset of DSSSL could be used with existing viewers to provide a quick entrance into the world of WWW and browsers [1].

The traditional page layout mechanism has been PostScript, which is a de facto standard for laser printers. PostScript viewers have been created to provide a means of seeing the document as it would look on paper. There are obvious advantages to this method, but the major disadvantage is that it is not possible to interact with the text. Extensions made to PostScript by its originators, Adobe, have resulted in PDF (Portable Document Format). However, there are no public domain viewers for this format and hence its popularity is limited.

World Wide Web

HTML is just one application of SGML, defined by its Document Type Definition (DTD). It is quite conceivable that other DTDs could be used and 'translated' for viewing. The viewing transformation could be done by the browser at initialisation. Arguments put forward by Sperberg-McQueen and Goldstein [1] and Freese [2] would indicate that the future Web browser will be an SGML browser and will use a simplified version of DSSSL to handle the page formatting. Indeed, recent announcements of the availability of public domain SGML browsers that use this approach suggest that the next age in Web browsing is upon us.
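The notion of translating a document marked up under another DTD for viewing can be sketched as follows. This is an illustration under stated assumptions, not DSSSL Lite and not the proposed browser: the "memo" element names and the STYLE table are invented, and Python's lenient html.parser tokenizer stands in for a real SGML parser (it does not validate against a DTD).

```python
import html
from html.parser import HTMLParser

# Hypothetical style table: element of an imagined "memo" DTD -> HTML element.
# It plays the role a style sheet would play; it is not DSSSL syntax.
STYLE = {"memo": "div", "subject": "h1", "para": "p", "emph": "em"}


class MemoToHtml(HTMLParser):
    """Re-emit each element under the HTML name given by STYLE."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Elements missing from the table fall back to a neutral <span>.
        self.out.append("<%s>" % STYLE.get(tag, "span"))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % STYLE.get(tag, "span"))

    def handle_data(self, data):
        # Escape text so that it stays text in the HTML output.
        self.out.append(html.escape(data, quote=False))


def translate(document):
    """Translate a small 'memo' document into HTML for viewing."""
    parser = MemoToHtml()
    parser.feed(document)
    parser.close()
    return "".join(parser.out)
```

Calling translate("<memo><subject>Budget</subject><para>See <emph>item 3</emph>.</para></memo>") yields the same content under HTML element names. Because the mapping lives in a table rather than in the code, a different DTD could in principle be viewed by supplying a different table.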

Multimedia

Traditional methods of text processing and formatting have had to adapt to meet the demands of modern technologies and to handle not only text but also multimedia extensions. WWW technology now means that text, image, video and audio can be linked together in a single document. This means that suitable formats for handling media such as images are required. Standards such as TIFF, GIF, JPEG and MPEG have been introduced to meet this need.

File Transfer

The transfer of large documents can be problematic. Various compression techniques exist that can dramatically reduce the size of files to allow speedier and more efficient document transfer. These include PKZIP, ZIP, GNU ZIP and COMPRESS. There is no standard, and many platforms use their own proprietary solutions (e.g. Apple Mac HQX).
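As a small illustration of the effect of compression on marked-up text, the sketch below uses Python's standard gzip module, which reads and writes the format used by GNU ZIP. The sample document is invented and deliberately repetitive, so the saving it shows should not be taken as typical of real documents.

```python
import gzip

# Invented sample: a repetitive marked-up document, encoded as bytes.
document = ("<para>Electronic document interchange.</para>\n" * 200).encode("ascii")

compressed = gzip.compress(document)    # bytes suitable for transfer
restored = gzip.decompress(compressed)  # lossless: the original is recovered

# Fraction of the original size that actually has to be transferred.
ratio = len(compressed) / len(document)
```

The receiving platform must of course support the same format to restore the document, which is the interchange problem described above.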

File transfer is closely related to electronic mail, and many solutions have been put forward for improving the ease and efficiency of file transfer via email.

Attempts to standardise the protocols for message interchange have resulted in SMTP and X.400 being put forward. Whilst the protocol is essential for providing service access, the actual user interfaces to email systems (WordPerfect Office, Pine, etc.) differ, with the result that there are no standard methods of dealing with such problems as file compression and image handling.

Aims and Objectives

The project will concentrate on the conversion aspect of electronic document interchange and not on document transfer.

In order to test the converter a suitable set of test documents will be identified. These will include types that will allow the construction of suitable SGML DTDs and DSSSL Lite style sheets. Multimedia documents may also be defined.

Although the testing of the converter will involve users, it is proposed that a pilot scheme be set up upon completion of the project in order to expose the system fully to users. It is felt that this type of testing is not possible within the limited 12-month project lifetime.

A detailed breakdown of the required work is provided.





Dave Houghton