[Mirrored from: http://www.passage.com/integrat/pubweb/intake.htm]

Passage Systems Inc.


Arofan Gregory
SGML Consultant and Practical Publishing Specialist
Passage Systems, Inc.
August, 1996


In most industries where SGML has been widely adopted, the ability to create and enforce use of an authoring technology is a prerequisite for success; the absence of such control is seen as a generally fatal problem. In the commercial book publishing industry, the lack of author control is a given. Despite this, there are strategies that will allow publishers to successfully implement SGML. This is a difficult problem, however, and the solution is not a simple one.

Understanding the Problem

A publisher is faced with several problems when it comes to getting manuscripts that can be turned into successful books. The first of these is convincing the appropriate author to write the manuscript in the first place; very often, a desirable writer is someone who does not need to write the book you think you can sell (because they have a full-time occupation in their field of specialty, or because they could easily go to another publisher and write something they would have more fun with, or for a host of other reasons). Once contracted to write the book, the problem of actually making them do it on a reasonable schedule remains, and this, too, can be problematic. Anyone who has been faced with the task of eliciting manuscript from an unwilling author knows just how difficult this can be.

There is very little room left for specifying to an author that they even supply the manuscript in an electronic form, much less asking them to learn to use a specific (and complicated) software package, just so they can provide you with a manuscript you had to cajole them into writing in the first place. The idea of forcing authors to use a native SGML authoring tool by threatening not to publish them otherwise is absurd. This might work for first-time novelists, but it certainly would not for the kind of authors whose books pay a publisher's bills.

The Variety of Formats

The result of this situation is that publishers are faced with manuscript that arrives - usually behind schedule - in a bewildering variety of formats. Word processing packages from decades past may have been used (e.g., WordStar); all of the common word processing packages of the day will be represented (WordPerfect for DOS or Windows, Word for Windows, Word for Macintosh, Ami Pro, Frame, etc.); and then there will be those manuscripts that do not exist in any electronic form at all. The manuscript is styled inconsistently, and often a single chapter arrives in separate pieces - text, tables, figures, figure legends, appendixes, etc. Second and later editions will often be made of tearsheet - pages taken from the last edition and marked up in ink, with the overflow scrawled on stickies or pages torn from a notebook.

The worst of it is, many authors seem to be convinced that the way they work is the only way to work - that it is somehow dictated to them by their muse (or, perhaps, by the fact that they are basically afraid of software that they haven't been using for the last twelve years . . . ). And if you are dealing with a book that has chapters contributed by several authors, you have the same problem compounded by the number of contributors!

Once in hand, the manuscript needs to be cleaned up, and will generally go through the process of development. Inconsistencies are cleared up, the dross is edited out, and the author is asked to write new sections or to re-write others. Missing figures are begged, borrowed, or stolen, permissions obtained, and a host of other minor details are taken care of. Eventually, the manuscript is complete enough and in good enough shape to be handed over to production. All of this takes place, too, on a deadline (this goes without saying).

In most commercial publishing SGML implementations today, the creation of SGML is left up to the production end of the business. After all, production people are used to dealing with funny mark-up embedded in electronic formats, aren't they? If you are an acquisitions or development editor, you probably are happy to leave it at that - the problem leaves your office when your editorial assistant carries it down the hall to Production. But the problem remains, even if you don't see it. The production staff are now faced not only with getting the manuscript copyedited, typeset, proofed, printed, and bound on a schedule that is often optimistic in the extreme, but must also find time to rekey it into SGML. The kind of money that is required to outsource this work is often sufficient to put a real dent in a production budget, too.

By this point, you are probably asking yourself why anyone bothers - what the other industries say is true: SGML demands that authors be controlled, and our authors can't be controlled. End of story, right? It might be awfully nice to use SGML, but it's impossible to do effectively in the real world . . .

Oh ye of little faith!

Determining the Nature of the Solution

In fact, there are a number of workable strategies for dealing with this problem, and, as with most aspects of SGML, the solution is dictated by what you hope to gain from the technology. The promises of SGML are many: you can update the next edition painlessly; you can reuse the material in ancillaries and compilations; you can create CD-ROM and World Wide Web versions without difficulty; you can reduce time-to-market; you can cut down on production costs . . . the list goes on and on, and as it does, the benefits seem to be further and further from reality. What's more, they all seem to be based on the assumption that you've already got the manuscript in an SGML form.

How Is SGML Used?

These benefits fall into three basic categories: the ones that require SGML to exist as the basic form of the manuscript for editorial purposes; the slightly less powerful (but still valid) ones that require SGML at the time the print version is released; and the ones that require SGML only at some later point. If you eventually want to put out a CD-ROM version of the title using an SGML-related technology such as DynaText, all you really need is for the SGML to follow the release of the print version; this is also true if you plan to re-use the material later for other titles or ancillaries. For updating a future edition, you don't need the SGML until the life of the current edition is over! If an electronic release of your title needs to coincide with the release of the print version, you can't wait for your SGML, but you don't need it while the manuscript is in production, either. If you want to improve your time-to-market, however, or to put sample chapters of upcoming releases on your Web-site painlessly, or to leverage the power of structured information to help speed production - or even development - of your manuscripts, you need to render the document into SGML as soon as possible.
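The three categories can be written down as a small lookup table. The sketch below is only an illustration of the reasoning in this section: the goal names, the NeededBy labels, and the function name are invented for the example, and a real list of goals would be specific to each publisher.

```python
from enum import IntEnum


class NeededBy(IntEnum):
    """Latest point at which the SGML must exist (lower value = earlier)."""
    DURING_DEVELOPMENT = 1
    AT_PRINT_RELEASE = 2
    AFTER_PRINT_RELEASE = 3


# Hypothetical goal names, paraphrased from the categories described above.
GOALS = {
    "faster time-to-market": NeededBy.DURING_DEVELOPMENT,
    "sample chapters on the Web site": NeededBy.DURING_DEVELOPMENT,
    "structure-aided development or production": NeededBy.DURING_DEVELOPMENT,
    "simultaneous electronic release": NeededBy.AT_PRINT_RELEASE,
    "later CD-ROM version": NeededBy.AFTER_PRINT_RELEASE,
    "re-use in other titles or ancillaries": NeededBy.AFTER_PRINT_RELEASE,
    "updating a future edition": NeededBy.AFTER_PRINT_RELEASE,
}


def sgml_deadline(goals):
    """Return the earliest deadline imposed by any of the chosen goals."""
    return min(GOALS[goal] for goal in goals)


if __name__ == "__main__":
    chosen = ["later CD-ROM version", "simultaneous electronic release"]
    print(sgml_deadline(chosen).name)
```

The most demanding goal sets the schedule: adding a single "early" goal to a list of "late" ones moves the whole title into the earlier category.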

SGML as By-Product

Let's address the easiest category first: if you need SGML at some point after the release of the print version, you can often use the typesetting tapes as the source for a conversion. Some compositors will even give you a package deal, agreeing to do the typesetting, and then perform a conversion to SGML after the fact. They deliver camera-ready copy or film, accompanied by (after a reasonable interlude) the same material tagged according to the DTD you provided them with (or - and beware of this one! - which they wrote for you). In any event, you have a complete and richly-tagged electronic form of your document which can be used to produce SGML, not without pain, but generally at least out of sight, and without impacting the release date.

It is a good idea to take a close look at what you are really paying for this service, and to see what it buys you, however. If you will not make a profit on the CD-ROM sufficient to make up for what it cost to turn the typesetting tapes into SGML, then there's no point in doing it, is there? You may choose to eat the expense now, based on projected savings that will accrue from re-use of the material in other titles, or future editions, and this may be a valid reason to make the expenditure. One thing is for certain, however: if you are not planning to use the SGML in the advantageous ways you are basing your projected savings on, you will never realize any benefit from your extra expense. The key here is to plan ahead - try to determine what you will have to do to make this a sound investment, instead of a foolish one, and then do it (and remember, once you start work on the next edition, the SGML will need to be updated to incorporate author's changes, so you'd better have a process in place for doing it when the time comes . . .).

SGML for Simultaneous Release

If you want to simultaneously release print and electronic versions of a given title, getting a conversion from the typesetter's tapes will probably impact your release date. This may be acceptable, but most likely will not. You may be able to get an SGML conversion from your typesetting tapes and prepare the CD-ROM in the time it takes to print and bind the books themselves, but this is also improbable. Realistically, you will want to consider typesetting from the SGML itself, merging this category with the most difficult category described above - getting SGML for use as the editorial format. (And if you're going to solve this problem, maybe you should consider reaping the additional benefits...) This is the point at which the answer to the question "Where does the SGML come from?" starts to get complicated.

SGML Up-Front: Native Authoring and Structured Word-Processing

There are a number of ways to get SGML early in a book's life-cycle, and, because of the variety of authors, a given publisher will have to use a variety of different techniques. Some authors actually like computers, and will feel that writing a book using a native SGML authoring tool is a good idea. (This category is very small, but I personally know at least two extant cases. In both, however, the authors are SGML professionals!) More likely, you will find that a computer-literate author can be trained or enticed to use native authoring software you provide them with to write their book. It may help to provide them with a spiffy new laptop pre-loaded with the requisite software, or to promise them larger royalties if they go along with the scheme, of course, but this is a realistic alternative in some situations.

If this is your chosen tactic, however, be prepared to have someone available for the authors to contact with questions - and this means someone who knows your DTD and the authoring tool as well. (If you have an in-house SGML expert, that is probably the person to use.) This kind of author support can be very time-consuming, however, so make sure that you are ready to dedicate many hours of your support-person's work life to the task.

A more practicable way to implement this is to ask your authors to use a word-processing template to do their authoring in, and to have tools in-house that can perform the conversion to SGML. (Personally, I think that providing a Word template to authors, and asking them to use named styles, is preferable to cutting them loose with the word-processing SGML add-ons. The latter category of tools generally produce both bad SGML and a great deal of confusion for authors, so you are getting the worst of two worlds.) It may be helpful to make some interface modifications to the word-processor that you ask your author to use - one standard trick is to remove the bold, italic, and underline buttons from the toolbar, forcing authors to use the appropriate character styles.
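To make the template approach concrete, here is a minimal sketch of the in-house conversion step. It assumes the word-processing file has already been reduced to a list of (style name, text) pairs, one per paragraph; the style names, the element names, and the function name are all hypothetical, and a real filter would take its element names from your own DTD.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from template style names to DTD element names.
STYLE_TO_ELEMENT = {
    "Chapter Title": "chaptitle",
    "Heading 1": "sect1title",
    "Body Text": "para",
    "Figure Legend": "legend",
}


def styled_paragraphs_to_sgml(paragraphs):
    """Tag each (style, text) pair; report paragraphs whose style is unmapped."""
    lines = []
    problems = []
    for number, (style, text) in enumerate(paragraphs, start=1):
        element = STYLE_TO_ELEMENT.get(style)
        if element is None:
            # The author strayed from the template: flag it for clean-up.
            problems.append((number, style))
            element = "unmapped"
        # escape() protects literal &, < and > in the author's text.
        lines.append(f"<{element}>{escape(text)}</{element}>")
    return "\n".join(lines), problems


if __name__ == "__main__":
    manuscript = [
        ("Chapter Title", "Intake"),
        ("Body Text", "Authors & editors rarely agree on tools."),
        ("Normal", "This paragraph was typed without a named style."),
    ]
    sgml, problems = styled_paragraphs_to_sgml(manuscript)
    print(sgml)
    for number, style in problems:
        print(f"paragraph {number}: no mapping for style {style!r}")
```

The list of problems is the useful part: every paragraph in an unexpected style is a place where someone in-house has to decide what the author meant.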

In this case, too, you may want to provide incentives to your authors; loaning them a good computer with pre-loaded software is not a bad idea, especially if they have any techno-vanity at all. (For some reason, laptops make people feel more important, even though peg-mice are incredibly annoying and hard to use!) This also makes it easier to make interface modifications, and to ensure that software is installed and configured properly. Buying a few "loaner" laptops and loading them with appropriate software may be an investment that can pay off over time. For really important authors, it may even make sense to purchase them a computer outright.

Training is another expense that you will incur, whether your authors are working with native authoring tools or word-processing templates. It helps to have people on staff who can train authors to use whatever software is required, rather than paying to send authors to classes that aren't specific to the task at hand. (Personalized training can be a real boost for author morale as well, because it makes authors feel more important to the company, and more involved in the actual creation of their book.)

Now That I've Got It, What Do I Do With It?

Getting SGML from your authors obviously requires that you have the ability to deal with it once it arrives in-house. If your development staff are unable to work with your templated word-processing files, or with native SGML, it may be desirable to let them work on paper, and simply print out the word-processing documents or have a filter to down-translate the SGML to .RTF or some other easily printed format. The source files would then go to Production, to be updated when the corrected paper version comes in from the editors. You may wish to develop a separate SGML team that works for the development staff, unrelated to Production. This team could serve both as support for developmental work with the word-processing or SGML files, and as a "clean-up" team to correct whatever mistakes authors have made in their manuscripts in using SGML tags or named styles. (Such a team would also be ideal for providing author support.)

SGML Up-Front: Unstructured Formats

What we have so far discussed are the easy cases - an author has consented to use some special tool to create their manuscript. In the majority of cases, however, this will not be true: you may have signed a manuscript that already exists in some format, or you may simply be dealing with an author who is unable or unwilling to oblige you. This means that you will either be faced with some unstructured word-processing format, or you will have to rekey the manuscript. These two situations have more in common than you might think. If you are rekeying the manuscript, you can either have it keyed in SGML or a word-processing template (for conversion), or you can have it keyed as an unstructured word-processing document and then put it through some other conversion process to get to SGML.

If you pay to have it turned into SGML directly from the paper, then you have gotten yourself to the same place as if the author had submitted it that way. If you paid a freelancer to do this, chances are you will need to do some clean-up (just as with an author). If the keyboarder is on staff, then you will have more control over the quality. Again, if there is a tech support pool for your development editors (or if they are SGML-savvy), you have a fine set of candidates for this work. This may also be a task that you wish to leave to Production, only paying to key the manuscript after it has gone through development. Be aware, however, that in this situation you are sacrificing whatever could have been gained from an automated development process. (The advantages to this are outlined below.)

If you have your paper keyed into an unstructured word-processing format, you have just created another manuscript of the type that will probably be the most common submission anyway. The technology for turning unstructured word-processing documents into SGML is not very effective, and for a good reason: you don't usually get something for nothing. SGML requires that all of the knowledge about your document that will be stored in the SGML be added to it. Some of this information can be deduced from formatting cues using tools like FastTag; most word-processing documents will tell you something about their structure in a way that can be automatically derived from them. This is a complex technical matter, and there are tricks that can be performed using a series of DTDs and a variety of filtering techniques that build on this limited knowledge to create richer SGML, but this is never a fully automated process.
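The kind of deduction described here can be sketched in a few lines. This is not how FastTag or any other product works; it is a toy set of rules, with invented names and thresholds, that guesses an element from formatting cues and says whether a person ought to look at the result.

```python
from dataclasses import dataclass


@dataclass
class Para:
    """One paragraph plus the formatting cues that survived extraction."""
    text: str
    bold: bool = False
    point_size: float = 12.0


def guess_element(para, body_size=12.0):
    """Return (element_name, confident) using formatting cues alone."""
    text = para.text.strip()
    if para.bold and para.point_size >= body_size + 4:
        return "chaptitle", True
    if para.bold and len(text) < 80 and not text.endswith("."):
        return "sect1title", True
    if text.lower().startswith(("figure ", "fig. ")):
        # Could be a legend, or just a sentence that begins with "Figure".
        return "legend", False
    # Long bold body text is suspicious, so it is marked for review.
    return "para", not para.bold


if __name__ == "__main__":
    samples = [
        Para("Intake", bold=True, point_size=18),
        Para("The Variety of Formats", bold=True),
        Para("Figure 3 shows the workflow."),
        Para("Plain body text."),
    ]
    for para in samples:
        element, confident = guess_element(para)
        note = "" if confident else "  <- needs review"
        print(f"{element:12} {para.text}{note}")
```

Every paragraph that comes back without confidence is work for a person, which is why the process is never fully automated.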

The Human Factor: A Requirement

What it comes down to in the end, however, is that you will need human beings with a knowledge of the structure and content of your documents to go over every word, and make sure that the translation was successful (which really means, to correct it where it was not). This may take the form of people using native authoring tools to correct the invalid or incorrect SGML that has resulted from your filtering efforts; it may manifest as people converting whatever the input format was into .RTF, and then structuring it with named styles in Word, which will then be translated into valid and correct SGML with more exacting filters. The point is, someone needs to go through your documents, and make sure that the structural and content-specific information you wanted to add to them is there.

Production or Editorial?

Whether this job is performed in Production or in Editorial is a matter for debate, and the answer lies mainly in the specifics of your situation. Generally speaking, it is a good idea to get your development editors to work in a structured way, or to have staff who can do this for them. A raw manuscript differs from a second or third edition only in extent, not in kind. (If the first edition was in SGML then all you need to do is key in the changes, but it is helpful if this process is basically the same as the one used for dealing with a raw manuscript.) Your development editors are the ones who deal directly with the content and structure of your manuscripts. They have the best knowledge of exactly that information that you wish to include in your SGML. It is easier to have them incorporate this knowledge directly than to make your Production editors elicit it from them and incorporate it at a later stage. On the other hand, experienced Production editors are pretty good at eliciting this information from their Editorial brethren, and they are usually much better at dealing with mark-up embedded in electronic formats.

Before you decide to leave it to Production, though, I will touch very briefly on the benefits of using SGML during the development of a manuscript:

  1. You can get promotional drafts of early chapters to put up on your Web-site with only the cost of a simple SGML-to-HTML conversion filter.
  2. SGML parsing tools are excellent at detecting things like figures without legends, blind cross-references, and contributed chapters where the author failed to supply their affiliation or degrees. Intelligent use of a well-constructed DTD can materially help development and production editors get their jobs done faster and more effectively.
  3. Production will be able to cut down on the time it takes them to turn your manuscript into bound books - they will spend less time correcting editorial omissions, and tasks like type-coding will already have been done for them. (This means that those over-optimistic production schedules may become possible after all!)
  4. The use of SGML encourages editors to think about ways that material can be re-used. If you are working hands-on with tagged material, you will be personally aware of the effort required to turn it into rich SGML. This will stimulate the part of your editors' brains that wants to get the most for their effort, which may mean coming up with ideas for anything from compilations of several existing titles re-packaged on CD-ROM to ancillary diskettes of existing, tagged material related to the subject of a given title. The topic of re-use is very dependent on the type of material published, but personal experience with SGML definitely helps Editorial staff understand its possibilities and limitations.
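The first two benefits in the list above can be illustrated with a short sketch. It assumes a fully tagged sample (no omitted tags) and uses invented element names; Python's html.parser stands in for a real SGML parser, which would read the DTD and do the structural checking itself.

```python
import re
from html.parser import HTMLParser

# Hypothetical element names; a real filter would follow your own DTD.
SAMPLE = """<chapter>
<chaptitle>Intake</chaptitle>
<para>See the workflow.</para>
<figure id="f1" file="flow.eps"><legend>Manuscript workflow</legend></figure>
<figure id="f2" file="tools.eps"></figure>
</chapter>"""

HTML_TAGS = {"chaptitle": "h1", "sect1title": "h2", "para": "p", "legend": "p"}


def to_html(sgml):
    """Rename mapped elements to HTML tags and drop every other tag."""
    def swap(match):
        slash, name = match.group(1), match.group(2).lower()
        html_name = HTML_TAGS.get(name)
        return f"<{slash}{html_name}>" if html_name else ""
    return re.sub(r"<(/?)([A-Za-z][\w.-]*)[^>]*>", swap, sgml)


class FigureChecker(HTMLParser):
    """Collect the ids of <figure> elements that contain no <legend>."""

    def __init__(self):
        super().__init__()
        self.open_figures = []  # stack of [figure_id, has_legend]
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            figure_id = dict(attrs).get("id", "(no id)")
            self.open_figures.append([figure_id, False])
        elif tag == "legend" and self.open_figures:
            self.open_figures[-1][1] = True

    def handle_endtag(self, tag):
        if tag == "figure" and self.open_figures:
            figure_id, has_legend = self.open_figures.pop()
            if not has_legend:
                self.missing.append(figure_id)


if __name__ == "__main__":
    print(to_html(SAMPLE))
    checker = FigureChecker()
    checker.feed(SAMPLE)
    checker.close()
    print("figures without legends:", checker.missing)
```

A filter this crude is only good enough for a promotional draft, but that is the point of the first item: once the tagging exists, the draft costs very little.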

Letting the Compositor Handle It

If you decide to leave your developmental editors in peace, however, there is another way that your manuscripts can be turned into SGML: have your compositors do it as part of the typesetting process. Compositors have been keying paper manuscripts for a long time now, and they're good at it (meaning, they charge reasonable rates). There are some dangers involved here, although it may seem like the best solution on the surface. As discussed above, SGML involves adding knowledge about the structure and content of your manuscripts to their electronic formats. Compositors generally do not know very much about the material they typeset - content is the publisher's responsibility. Further, SGML is only as good as the accuracy of the mark-up. If you get bad mark-up from your compositors, every other part of your processing system that relies on SGML will be that much less effective. More so than ever before, quality is at least as important as the bottom line when it comes to selecting outside suppliers.

You are not necessarily freed from the responsibility of having Production staff who can work comfortably with SGML, either. Although you might be able to have your compositor key all of your proofreader's and author's corrections into an SGML format, you sacrifice a great deal of value here.

Production and SGML: What Is to Be Gained?

Much as for development work, SGML can substantially improve the pace and quality of your Production editors' efforts. In addition to the kinds of structure checking described above, any kind of consistency or terminology verification can be performed more quickly and thoroughly with the help of context-sensitive searches. If you have linked your citations and references together with ID-IDREF constructions, you can automatically find where the missing references are (not to mention the fact that you can determine which ones are structured incorrectly). You can automatically generate TOCs and rough indexes, using simple scripts that select the contents of the desired elements. If marketing wants outlines for the catalog or Web-site, this too becomes possible at the push of a button, using simple filters. Pre-edited disclaimers and legalese can appear in your frontmatter with the simple inclusion of an entity reference. The list goes on and on, spinning off into the realms of fantasy . . .
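Two of these checks, finding citations that point at nothing and pulling out a rough TOC, might look like the sketch below. The element names, the rid attribute, and the sample text (including the placeholder reference) are assumptions made for the example, and Python's html.parser again stands in for a validating SGML parser, which would report unresolved IDREFs on its own.

```python
from html.parser import HTMLParser

# Invented sample: one citation resolves (r1), one does not (r9).
SAMPLE = """<chapter>
<sect1title>Understanding the Problem</sect1title>
<para>Authors resist new tools <cite rid="r1"></cite> and
deadlines <cite rid="r9"></cite>.</para>
<sect1title>The Variety of Formats</sect1title>
<reflist><ref id="r1">Placeholder reference entry.</ref></reflist>
</chapter>"""


class ProductionReport(HTMLParser):
    """Gather declared ids, citation targets, and section titles."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.refs = []
        self.toc = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "id" in attrs:
            self.ids.add(attrs["id"])
        if tag == "cite" and "rid" in attrs:
            self.refs.append(attrs["rid"])
        if tag == "sect1title":
            self._in_title = True
            self.toc.append("")

    def handle_data(self, data):
        if self._in_title:
            self.toc[-1] += data

    def handle_endtag(self, tag):
        if tag == "sect1title":
            self._in_title = False

    def blind_references(self):
        """Citation targets that match no id anywhere in the document."""
        return [rid for rid in self.refs if rid not in self.ids]


if __name__ == "__main__":
    report = ProductionReport()
    report.feed(SAMPLE)
    report.close()
    print("rough TOC:", report.toc)
    print("blind references:", report.blind_references())
```

Targets are compared with ids only after the whole document has been read, so a citation may appear before the reference list it points to.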

In reality, the amount of effort you save is directly related to the sophistication of your editorial system. You cannot use filters that haven't been written, and you can't check pieces of structure that you didn't make the effort to tag in the first place. Every advantage has its price, and sometimes it's better not to get too complicated until you know what you are doing. Production editors do spend a lot of time cleaning up bad SGML, especially if it was created by a compositor who didn't bother to ask the necessary questions. There is a trade-off here, and it is a good idea to be careful in designing a production system based on SGML, even to conduct a pilot project or two to give you a better sense of what you're dealing with.


There are many different ways to get SGML from your author's submissions, and you have a number of choices available, depending on what it is you want from your SGML-based system. Commercial publishing represents the high end of the information spectrum in terms of the effort put into creation, verification, and presentation of material. Consequently, the systems required to automate this process are themselves "high-end": they demand the same sophistication as the process that they are automating. This means that they are complicated and expensive.

As with any business proposition, it is important to make sure that the expenditure required for automation produces a commensurate return. The promise of SGML will not be realized unless publishers think ahead, making the kinds of changes that allow them to re-use their material easily, and to get their books to market faster. A Web-site can be an excellent place to get your titles into the public eye, but it needs to remain as fresh and exciting as the Web itself, or the public will only visit once. SGML can help you reach these goals, but only if these are your goals. The new information technology represents not just a better way of doing the same old thing; it represents a way of doing new things (and you can bet that if you aren't doing them, the competition is!). To realize this potential, Editorial techniques and thinking need to change, as well as those of Production departments.

Getting to SGML is not easy for publishers, nor is it cheap. It is important to understand that the expenditure of money and effort can ultimately pay off, and to understand how this can be made to happen. It may be more difficult to use SGML in an industry where author control is conspicuously lacking than in most "SGML" industries, but that same industry is a place where using SGML can pay unusually large dividends. A complex problem requires a complex solution, but the problem here is one that is well worth solving.

This page last updated 9/1/96.
Copyright © 1996 Passage Systems Inc.