[Mirrored from: http://www.dclab.com/nlm.htm]
ABSTRACT: In the medical industry, the accuracy of data is a matter of life and death. Therefore, accurate data conversion is a necessity whenever a new system is implemented. This article describes the conversion process of the National Library of Medicine as they moved their materials into an on-line system, and will be of interest to anyone who is concerned with the accuracy and quality of data.
There's a maxim in the conversion business: "Information is an asset." Nowhere is this more true than in the medical industry, where information can mean the difference between life and death. In such a milieu, the importance of the National Library of Medicine (NLM) can hardly be exaggerated. It is, after all, the largest medical research library in the world. In fact, it's the world's largest research library in a single scientific and professional field, with a collection of 5 million items: books, journals, technical reports, manuscripts, microfilms, and pictorial materials.
But what does it mean to be a library today? New information technology is expanding the possibilities of how much information can be stored and how it can be disseminated. In the field of health care, new opportunities mean new obligations. To quote the Hippocratic Oath, "into whatsoever house you shall enter, it shall be for the good of the sick to the utmost of your power." High technology has expanded "utmost" to new levels and has made it possible to enter more houses than ever before without even getting out of one's chair.
The NLM's latest response to this challenge is HSTAT (Health Services/Technology Assessment Text), an electronic resource that includes the full text of clinical practice guidelines, quick-reference guides for clinicians, and consumer brochures. The materials come from the Agency for Health Care Policy and Research (AHCPR), the National Institutes of Health (NIH) consensus development conference and technology assessment reports, and the U.S. Preventive Services Task Force Guide to Clinical Preventive Services (1989 edition).
HSTAT is part of an initiative called the Health Services Research Information Program coordinated by NLM's National Information Center on Health Services Research and Health Care Technology (NICHSR). The actual development of HSTAT was left to the Information Technology Branch of the Lister Hill Center, also part of NLM. It was the Lister Hill Center that called DCL.
"HSTAT can be accessed several different ways," explains Maureen Prettyman at the Center. "In fact, it's currently in three different databases. Users can do full-text search and retrieval on character-based terminals, they can download over the Internet with gopher or ftp, and then there's the World Wide Web. But when we started, all we had were WordPerfect documents, ASCII, and the books themselves. We knew we needed to go to SGML."
SGML tagging would provide the cues needed for search engines and could readily be converted to the SGML-based HTML, the accepted format for World Wide Web access. But once a Document Type Definition (DTD) had been developed to define the structural rules for the SGML documents, hundreds of pages of material from the first set of books still had to be converted, with more sets to follow.
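To illustrate why structurally tagged text converts readily to HTML, here is a minimal sketch in Python. The element names in `TAG_MAP` (`chapter`, `title`, `para`, and so on) are hypothetical and are not taken from NLM's actual DTD; the sketch also assumes simple tags without attributes.

```python
import re

# Hypothetical mapping from structural element names to HTML tags.
# These names are illustrative only; they do not come from NLM's DTD.
TAG_MAP = {
    "chapter": "div",
    "title": "h2",
    "para": "p",
    "list": "ul",
    "item": "li",
}

# Matches simple start and end tags without attributes, e.g. <para> or </para>.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def sgml_to_html(text: str) -> str:
    """Rename known structural tags to HTML tags; leave unknown tags unchanged."""
    def rename(match: re.Match) -> str:
        slash, name = match.group(1), match.group(2).lower()
        html_name = TAG_MAP.get(name)
        if html_name is None:
            return match.group(0)  # unknown element: pass through untouched
        return f"<{slash}{html_name}>"

    return TAG_RE.sub(rename, text)


sample = "<chapter><title>Screening</title><para>Text.</para></chapter>"
print(sgml_to_html(sample))
```

Because the structure is already explicit in the tags, the conversion reduces to a lookup. A production conversion would also have to handle attributes, entities, and elements with no direct HTML equivalent.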
Norman Barth, DCL Project Manager for this conversion, talked about the difficulty of such a conversion. "Some companies think they can save money by converting files in-house, but with a complex conversion like this one, a company will find itself draining more and more of its resources as the project continues. NLM came to us right away and we were able to give them a cost estimate that allowed them to make a realistic budget."
But why is the conversion so difficult? Norman continues, "In this conversion, we are adding information. There are no tags in the original material. Where does this information come from? Three places: appearance, context, and content. If all chapter titles are the same font size, then we can use appearance cues to tag chapters.
"But because SGML is so concerned with how a document is structured, context becomes important, too. A blank line in a sample form, for example, might be considered different from a blank line after a study question at the back of a chapter. The only information source we didn't use for the NLM job was content. In this case, it was more cost-effective to have their own people do that tagging, since they were subject-matter experts. Still, most of the tagging was accomplished by appearance and context clues only.
"Your original question was about difficulty. Let me just say that we've had to expand our development and editorial departments twice as we've increased the number of SGML conversions we do. Most of our editors are trained specifically for SGML. My advice: Don't try this at home!"
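The appearance and context cues Barth describes can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the `Line` record, the 18-point chapter-title size, the section labels, and the element names (`title`, `para`, `fillin`, `answerspace`) are hypothetical and do not reflect DCL's actual rules or NLM's DTD.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    font_size: int  # appearance cue
    section: str    # context cue: "body", "form", or "questions"


# Assumed for illustration: every chapter title is set in 18-point type.
CHAPTER_TITLE_SIZE = 18


def tag_line(line: Line) -> str:
    """Choose a tag from appearance (font size) and context (enclosing section)."""
    if line.text.strip() == "":
        # The same blank line means different things in different contexts.
        if line.section == "form":
            return "<fillin>"       # hypothetical empty element for a form field
        if line.section == "questions":
            return "<answerspace>"  # hypothetical empty element after a study question
        return ""                   # ordinary blank line: no structural meaning
    if line.font_size == CHAPTER_TITLE_SIZE:
        return f"<title>{line.text}</title>"
    return f"<para>{line.text}</para>"


for line in [
    Line("Screening for Hypertension", 18, "body"),
    Line("Blood pressure should be measured regularly.", 10, "body"),
    Line("", 10, "form"),
    Line("", 10, "questions"),
]:
    print(repr(tag_line(line)))
```

Note that nothing in the sketch looks at what the text says. Content-based tagging, the third source Barth mentions, would require subject knowledge that rules like these cannot supply.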
DCL can combine manual and automated processes in whatever mix converts legacy documents most cost-effectively. In this case, a manual approach was chosen. Jennifer Ruckdeschel was put in charge of the editing process.
"Maureen [Prettyman] sent the DTD and narrowed down what she wanted us to do. From that information, I came up with keying specs for outside editors, who tagged the WordPerfect files. When they came back, we parsed them and had in-house editors do the final cleanup.
"Communication with the client was very good on this project. Whenever Maureen had a question or concern about our tagging, she didn't hesitate to call. Our priority is always to create a good feedback cycle with our clients. By sending them materials early and often, and then making them feel comfortable when they call, we stay in touch with what the clients want and the clients don't get any surprises."
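The parsing step Ruckdeschel mentions, run on files as they come back from the outside editors, would in practice use an SGML parser that validates against the DTD. As a much simpler stand-in, the Python sketch below checks only that tags are properly nested. It assumes every element has an explicit end tag (SGML itself allows some end tags to be omitted) and that tags carry no attributes; the function name is hypothetical.

```python
import re

# Matches simple start and end tags without attributes.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def find_nesting_errors(text: str) -> list[str]:
    """Report end tags that do not match the innermost open element, and unclosed elements."""
    errors = []
    stack = []  # names of currently open elements, innermost last
    for match in TAG_RE.finditer(text):
        is_end_tag = match.group(1) == "/"
        name = match.group(2).lower()
        if not is_end_tag:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            errors.append(f"unexpected </{name}> at offset {match.start()}")
    errors.extend(f"unclosed <{name}>" for name in stack)
    return errors


print(find_nesting_errors("<chapter><title>A</title></chapter>"))
print(find_nesting_errors("<chapter><title>A</chapter>"))
```

A check like this catches mechanical keying mistakes early, leaving the in-house editors to concentrate on tagging decisions that need judgment.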
Though few, if any, of DCL's employees have taken the Hippocratic Oath, they worked to the utmost of their power for the NLM, which has already begun to make HSTAT information available. For more information on how to access HSTAT, please call the NICHSR at (301) 496-0176 or E-mail them at NICHSR@NLM.NIH.GOV.
Data Conversion Laboratory, 184-13 Horace Harding Expressway, Fresh Meadows, NY 11365. Tel (718) 357-8700. Fax (718) 357-8776. Email: convert@dclab.com
Copyright © 1996, Data Conversion Laboratory
Last modified September 25, 1996.