Practical Issues in SGML Publishing

[Mirrored from: http://www.sgmlbelux.be/96/gower.htm]

Practical Issues in SGML Publishing

Elizabeth Gower
Adobe Systems Europe
Uxbridge, UK

E-mail : egower@Adobe.com

Keywords: SGML, Publishing, Practical SGML

Getting Started in SGML Publishing

Once a company has an SGML publishing requirement, the question no longer is "What is SGML and how can it help me?", but "Where and how do I get started?". Frequently, this question is followed by the statement, "Help me--I have no idea where to start!".

When an SGML implementation team plans its project, there are some basic areas that need to be considered:

Legacy Data Conversion
Hardware and Network Capacity and Scalability
Software Features, Perfomance, and Customization
SGML-specific issues: Tables, IDs, Entities

Many SGML specialists focus on how SGML markup should be implemented in their publishing environment, but the success of your organization's SGML publishing system can be significantly impacted by issues related to legacy data conversion, system architecture and network performance.

For example, the structural inconsistencies in legacy data may severely impact your ability to apply SGML markup according to the requirements of an industry interchange DTD.

Or, the SGML editor that performed well in tests on a stand-alone Pentium with a 10 Mb file starts crashing during file saves on a 50 Mb file on an NFS-mounted network drive.

The trick in designing and implementing an SGML publishing system is to anticipate and make contingency plans for possible show-stoppers during each project phase.

Speaking of which, what ARE those possible show-stoppers?

Legacy Data Analysis and Conversion

Legacy data conversion problems are a major cause of project delay, or at worst, complete failure. It is impossible to plan even halfway accurately for your SGML publishing project unless a thorough study has been made of the legacy data and SGML conversion requirements involved. The starting point for developing SGML conversion requirements is usually an industry interchange DTD and your existing documents. Here is a brief checklist for legacy data conversion:

Document conversion rules for your documents
Prior to accepting bids, make site visits to potential conversion vendors
Tour production facility and meet staff
Understand procedure for error recycling
If vendor facility meets approval, submit full-size samples of all candidate documents for analysis

Many companies use an authoring DTD that is different from the industry interchange DTD they must deliver to. In many cases, the authoring DTD is adapted from the interchange DTD. In other cases, the existing documents look radically different from the interchange DTD, and some type of transformation process must be used to convert the internal SGML structure to an instance that complies with the interchange DTD.

It is very important to have a stable DTD when developing requirements for legacy data conversion. Each change in your DTD will probably result in a corresponding change to the legacy data conversion program.

If your data is in very bad condition (e.g. structurally inconsistent and missing required content), considerable manual clean-up and re-cycling of the data will be necessary (unless everything in your DTD is optional).

If the legacy data is in a proprietary or no-longer-supported format, you must pay a 3rd party to develop one-off tools for converting your data to SGML, or in some cases, companies decide to perform the conversion themselves. Doing an in-house legacy data conversion is a traumatic process, and should be undertaken only when 3rd party conversion possibilities have been exhausted.

If your company is considering in-house legacy data conversion, there are several factors to consider before chosing to do so:

Does your company have the required software tools development experience?
Are your DTDs and data so complicated that a 3rd party would require too much supervision by your technical staff?
Are your developers already SGML-literate?

The advantage in performing in-house legacy data conversion is that the subject matter experts are more readily accessible to the conversion tools developers and authoring corrections can be turned around more rapidly. The downside is that many companies simply do not have the technical skill nor the luxury of time to have in-house staff perform the conversion. If your programmers are not already SGML experts, your project may not be able to accomodate their steep learning curve.

In any case, regardless of who performs the conversion, the resulting SGML output must be checked by the data owners and content experts. Even though an SGML instance parses correctly, the mark-up may not be semantically correct. For example, warnings may have been tagged as paragraphs, and not with warning tags. Only a content expert would be able to catch such a mistake; the service bureau probably doesn't have tools to ensure that the SGML output is semantically as well as syntactically correct.

The content expert will probably need to render the SGML instance in some type of SGML viewer or editor to assist him in the visual inspection and verification of the content markup. Rendering the SGML instance in a real SGML tool is an important step. For example, even though the CALS table output is syntactically correct, it may be missing some attribute data required by your particular table editor to actually render and display a formatted table. The only way to find out whether you can render SGML markup into a final, finished product is to actually output paper, HTML, or other delivery formats in a test environment.

System Capacity Planning for SGML Publishing Systems

Ensure that your hardware and network architecture can take on the load imposed by the proposed SGML publishing system. Benchmark tests using production volumes of SGML data with the proposed hardware and network architecture is critical. Very often, software benchmark tests are aimed at testing product features rather than performance, and the system architecture fails to keep pace with CPU processing loads and production data growth. Here are some hardware sizing tips:

Determine available disk and memory capacity on existing equipment--is it enough?
Establish a standard hardware benchmark test that can be used on any platform
Test each configuration with the same processes and data
Test your proposed hardware configuration with real software and real data
Test volume performance by using multiple, progressively larger test files
Remember to test network (NFS) performance

The candidate software may run very well using small files on a stand-alone machine, but may come to a grinding halt when large amounts of SGML data are being processed on a network drive. Sometimes, the solution is to sub-net the SGML publishing system rather than run it off the corporate backbone. The only way to identify these kinds of system architecture adjustments is to perform a volume performance benchmark test using an architecture that resembles production system configuration as close as possible. To ensure that your system can accomodate future growth, be sure to consider the following:

Plan for new software, e.g. SGML parser
Forecast data growth over 36 month period
Determine scalability/upgrade requirements based on growth projections
Don't use under-powered hardware because it costs less!
NT is not necessarily a UNIX replacement

SGML Software Evaluation Considerations

Although features and functions are important considerations in SGML software selection, other factors will have a significant impact on how well the tool performs in a production publishing environment. Important points to consider are:

Customization--does the product have an API? SGML Editors are rarely used as stand-alone tools. Typically, they will be integrated with a document management system or database.
Does the product for your platform contain ALL the demonstrated features?
Use application prototypes for proof-of-concept
Use production file sizes when benchmarking hardware performance

Publishing with SGML

SGML documents are usually published in multiple delivery formats:
Paper, HTML, Postscript, other (e.g. PDF)
Page composition and printing are major considerations in SGML publishing
Separate page layout/composition products may be a requirement
Management of SGML-unique constructs (CALS Tables) and data (IDs) present special problems

SGML Data Management Issues

Using SGML in your publications system poses some new and unique data management issues. The following sitations usually pose a significant data conversion and management challenge:

Conversion of non-CALS tables to the CALS table model
ID/IDREF generation, resolution and management
Entity management

Table Conversion Issues

Conversion of legacy table data is one of the most difficult SGML implementation tasks. Although the CALS table model is widely supported in SGML authoring and publishing tools, your company's legacy tables are probably non-CALS compliant, and considerable authoring clean-up might be required to make the tables structurally consistant and amenable to SGML tagging.

Editor support for non-CALS tables (e.g. those found in legacy data) is much more limited; that is, you will probably be compelled to use a software vendor's internal table model, which does not always map cleanly or automatically to SGML.

It can also be difficult to convert tables into SGML from other file formats (e.g. MS Word or WordPerfect) that are not structurally rigorous or contain inconsistent formatting codes.

Considerable custom programming is frequently required to convert legacy tables into CALS, since large volumes of data may preclude manual clean-up and conversion.

Bottom line: study your tables closely and be realistic about the effort involved in clean-up and conversiont to SGML.

ID/IDREF Data Management Considerations

Unique ID values must be generated and assigned to Cross-References, Hot Links, etc. Each ID value must be unique and have a corresponding IDREF value, which is resolved during the SGML parsing process.

Most SGML tools do not automatically generate nor maintain unique ID and IDREF values, so special software may need to be written in order to generate and maintain unique ID values for each document and to maintain a record of each ID. Since IDs must be unique and (should be) consistent, allowing manual application by authors is not desireable nor managable in production.

Large numbers of IDs and ID-IDREF pairs are a major production data management problem; in many cases, IDs and IDREFs are applied and resolved in batch mode after the authoring process is finished.

SGML Entity Usage and Management

SGML Entities are commonly used to refer to graphics, special characters, and external files. For example, instead of referring to an actual graphic file name, an SGML graphic entity is referenced in the instance, and the entity declaration points to the physical path name and file name.

Be sure to verify that the generation and resolution of Graphic Entity declarations and references are supported in your candidate SGML editor/publisher tools. If entity generation and management are not supported, your programmers will be in for a lot of work.

Summary and Conclusions

In order to keep the pain and expense of an SGML publishing implementation to a bearable level, please keep the following key points in mind:

Plan and budget adequately for legacy data conversion--it's always a bigger job than you anticipated.
Ensure adequate hardware and network capacity to accommodate data growth.
Test your SGML software with real data and real data volumes.
Identify and plan for major SGML data management issues early in the project, such as:
- Table conversion and layout
- ID and Entity generation and management

A successful SGML publishing implementation is all about technical and schedule risk management. Know your pitfalls and risks right aways, and manage your problem areas realistically.