[Mirrored from: http://www.dtc.rankxerox.co.uk/Srw_pape.html]
Business Development Manager, Document Imaging
Rank Xerox Business Services
Document Technology Centre
Beech House, Building 9
This case study describes points that need to be considered when setting up a high quality (99.9%), high volume (over 1 million pages per annum), long term (5-10 years) document capture operation.
Rank Xerox Business Services captures documents for the EPO (European Patent Office). Around 25,000 pages are captured each week, 52 weeks a year. From paper to SGML-encoded text with embedded images, all steps and related issues are discussed.
RXBS have implemented an in-house high-volume data capture operation enabling 100% accurate capture of patent documents as SGML-encoded text plus embedded images. We describe our experiences with setting up and running this operation over the last 5 years.
Xerox Business Services is the world-wide leader in document outsourcing, providing services to more than 4,000 client companies in 36 countries. XBS is the fastest growing business division of The Document Company, Xerox.
Having pioneered document outsourcing 40 years ago, XBS today offers an expanded portfolio of advanced, digital and network-based solutions, as well as strategic consulting services.
Through its world-wide network of Document Technology Centres XBS offers custom applications; strategic print-on-demand; Internet Document Services; high-volume, high accuracy document capture and conversion services; high-volume printing and duplication; short-run, customised digital offset colour printing and publishing; digital file enhancement; electronic storage, short-run books, CD-ROM creation and replication and high-volume disk duplication.
In July 1990 RXBS won a contract with the EPO to capture and publish European patents. (RXBS share the contract with the French company Jouve.)
The contract involves the processing of the two main classes of patent documents, patent applications and granted patents, which are broadly similar in terms of the processing required.
The original patent applications as submitted by the applicant consist of an abstract, a description and a set of claims.
Granted patents are the original applications as amended by the examiner (in the form of hand-written amendments which must be accurately captured). Each granted patent has a description plus a set of claims translated into English, German and French.
Although there are various rules laid down by the EPO governing the presentation of patent applications, the applicant is still free to use a very wide range of type faces, font sizes, paper quality, etc.
Patent documents may be submitted to the EPO in English, German or French. The actual language distribution is roughly 60% English, 30% German and 10% French. In terms of subject matter, patent applications can be split into three broad technical fields of similar size; chemistry/metallurgy, physics/electricity, and mechanical.
The average size of a document is around 27 pages, and our share of each weekly publication batch amounts to about 25,000 pages per week, or 1 million pages per year.
The work involved in processing an individual patent document can be divided into the digital capture of both the text and images of the document, and the subsequent printing and distribution of the captured electronic data. We will concern ourselves here with the capture element of the process.
The EPO have been using generalised mark-up for capturing the structure of patent documents since 1985, so SGML was specified as the means of encoding the text parts of each document. The images, such as drawings and unencodable tables and formulae, must be captured as CCITT Group IV bitmaps, and SGML tags must be inserted into the text to indicate the logical position of the image in relation to the text.
The SGML mark-up required by the EPO is fairly complex, particularly considering that in early 1990 when we won the contract SGML was still in its infancy, and there were few quality SGML tools available.
Challenging aspects include the following:
A patent document has on average 4 tables to be captured as SGML, each table containing on average 204 characters.
Features to be encoded include identifiers, titles, column headers and sub-headers, table footnotes, cells spanning vertically and horizontally, various cell alignments, and arbitrary horizontal and vertical ruling.
Simple and complex mathematical formulae
These include summations, limits, products, integrals, radicals and arrays; in other words the usual gamut of mathematical constructs.
A number of list types must be recognised and correctly encoded, the list type depending on the form of the individual list items.
A number of 'accents' such as a large circle surrounding a character and a small circle over a character are defined; these can be combined with any character to form a floating accent construct.
Character fractions are used for two strings of characters which appear one above the other within one line of the document. They can occur with or without a separating horizontal bar.
Any parts of the document which cannot be faithfully captured as text marked up with the available SGML tags must be captured as a bitmap. As well as drawings in the usual sense, this approach embraces the small proportion of tables which do not fit into the EPO's table encoding scheme, complex mathematical formulae, and chemical formulae involving symbols such as benzene rings.
A call-out SGML tag must also be encoded within the text at the position where the image logically occurs in the flow of the text. This often cannot be deduced automatically, for example where a column of text has associated images to its right.
A special case of image is the occurrence of characters not included in the extended character set defined by the EPO (so-called 'undefined' or 'FF' characters). About 1.5% of patent documents contain undefined characters; those that do contain on average 3 different undefined characters. Such characters must be captured as bitmaps when they first appear, and added to a font specific to that document for re-use throughout the document.
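The capture-once-and-reuse behaviour for FF characters can be sketched as follows. This is a minimal Python illustration of the idea only; the class, the hashing scheme and the code format are our assumptions, not the production system's.

```python
import hashlib

class DocumentFont:
    """Document-specific font for 'FF' (undefined) characters.

    Each previously unseen character bitmap is captured once and assigned
    a private code; later occurrences of the same bitmap reuse that code.
    """
    def __init__(self):
        self._codes = {}   # bitmap hash -> assigned code
        self.bitmaps = []  # captured bitmaps, in order of first use

    def code_for(self, bitmap: bytes) -> str:
        key = hashlib.sha1(bitmap).hexdigest()
        if key not in self._codes:
            # First occurrence: capture the bitmap and assign a new code.
            self._codes[key] = "FF%03d" % len(self.bitmaps)
            self.bitmaps.append(bitmap)
        return self._codes[key]

font = DocumentFont()
a = font.code_for(b"bitmap-of-strange-glyph")
b = font.code_for(b"bitmap-of-other-glyph")
c = font.code_for(b"bitmap-of-strange-glyph")  # reuse: same code as a
```

In the real operation the operator identifies the character interactively and it is added to a virtual keyboard; the sketch only captures the reuse logic.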
The EPO use a proprietary character set consisting of around 500 characters which commonly occur in patent literature. This character set includes the upper and lower case Greek alphabet and a full set of mathematical and logical operators, plus an assortment of less common characters (see Figure 1 below for some examples). It is necessary for us to accurately capture each of these characters whenever they occur in a patent document, and to subsequently print them in a variety of styles and point sizes. (Figure 1: Selection of characters from the EPO character set.)
The EPO has a legal obligation to publish patent documents within a certain time-frame. Since publication and preparation for distribution are undertaken by RXBS we inherit their obligation. The EPO notify us of the publication dates of each document, normally about 6 weeks beforehand; these deadlines must be met at whatever cost.
Weekly volumes can fluctuate by as much as 50% which results in serious workflow challenges.
In order to meet the challenges set, processes must be designed which incorporate all worksteps i.e. from receipt to despatch.
We decided on Sun workstations as the primary platform, due to their good performance to cost ratio, the high-resolution monitors and networking capabilities supplied as standard, and Sun's commitment to open systems.
A server would be needed at the centre of the operation to store all completed patent documents, but we wanted to make the system resilient to the failure of any single component, including the server. We therefore specified that each data capture workstation would have a local disk on which all completed documents would be stored until the server was ready to receive them. The server has a big disk to store completed documents until their publication date arrives.
Operators can select any workstation on which to work. When they specify the document number they wish to work on, the workstation will poll all the other workstations plus the server to determine the location of the most up-to-date version of the document. This will then be transferred to the local disk. The operator need only wait until the data pertaining to the first page has been transferred before starting work (typically a delay of less than a second); the remainder of the document will be transferred behind the scenes as the first page is being worked on.
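The locate-the-latest-version step described above can be sketched as follows. This is a minimal Python illustration under our own assumptions: the `query_version` interface stands in for the network poll, and the real system additionally streams page data in the background while the operator works on the first page.

```python
def locate_latest(doc_id, hosts, query_version):
    """Poll every workstation plus the server for a document and return
    the host holding the most up-to-date version.

    `query_version(host, doc_id)` returns a version number, or None if
    that host has no copy.  (Hypothetical interface for illustration.)
    """
    best_host, best_version = None, -1
    for host in hosts:
        version = query_version(host, doc_id)
        if version is not None and version > best_version:
            best_host, best_version = host, version
    return best_host

# Example: an in-memory stand-in for the network query.
copies = {"ws3": 2, "server": 5, "ws7": 4}
host = locate_latest("EP0123456", ["ws3", "server", "ws7"],
                     lambda h, d: copies.get(h))
```

The document number "EP0123456" is a made-up example, not a real patent reference.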
Initially we had the workstations arranged in groups of six, where a group of operators would process any given document from start to finish, with each operator performing any task necessary. Further trials revealed however that both productivity and operator satisfaction were increased by dedicating operators to one specific task, such as scanning or proof-checking. The operators can now choose their specialisation according to their talents and preference.
We decided to use 300 dpi scanners, hence this is the resolution of bitmaps used throughout the system. We experimented with higher resolutions, but found the increase in ICR accuracy to be minimal, and since the customer only required 300 dpi images to be returned, we could not justify the increased disk space required to process higher resolutions.
We decided to divide the processing for a document into a sequence of separate steps. Each step must be completed for all pages of a document before the system will allow the operator to commence the next step. We will discuss the process steps in the order that an operator performs them.
Experience shows that companies often underestimate the importance of document preparation.
Preparation includes the counting of pages, removal of binding and staples, attaching of worksheets and the general checking of the pages for anything unusual e.g. poor quality pages, hand-written amendments, complex maths or tables.
It is important to choose the right type of scanner. The decision made should be dependent on document size, document quality, resolution and volume.
It is advisable to invest in a backup scanner in case of breakdowns.
A word of warning: do not take the throughput specified by manufacturers as the daily achievable target. Processes must be in place to ensure that as little time as possible is wasted on, for example, stacking auto-feeders or writing data to disk. Allow only approximately 65% of the rated throughput (this assumes good processes). We use two types of scanner: a bulk scanner for high throughput, and high-quality scanners for rescanning problem pages. Rescanning may be required to achieve higher quality for images.
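The 65% derating rule above translates into a simple capacity calculation. The sketch below is illustrative only; the rated speed and shift length are invented figures, not those of our scanners.

```python
def realistic_daily_pages(rated_pages_per_hour, hours_per_day,
                          efficiency=0.65):
    """Derate a manufacturer's throughput figure for capacity planning.

    Handling time (stacking feeders, writing to disk, rescans) means only
    a fraction of the rated speed is achieved; the 0.65 default assumes
    well-designed processes.
    """
    return int(rated_pages_per_hour * hours_per_day * efficiency)

# e.g. a scanner rated at 1,000 pages/hour over a 7.5-hour shift
pages = realistic_daily_pages(1000, 7.5)  # -> 4875, not 7500
```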
At this process stage, as well as physically scanning in the pages of the document, pages are segmented into text and image areas. For image areas the type of image must also be specified by the operator (maths, table, chemical etc.).
Not surprisingly, the key element in accurately converting paper documents to electronic data at a competitive cost is the ICR and the subsequent identification and correction of any conversion errors.
We chose to use a beta test version of the Xerox Imaging System ScanWorx API. General opinion held this to be the leading ICR at the time, and our own investigations bore this out. It also embodied all the functionality we required, such as omnifont recognition, training capability, positional information on a word-by-word basis and feedback on recognition assurance levels for individual characters.
For the capture of patent documentation the ICR is run with no interactive verification by the operator. No training data is made available to the ICR prior to starting each document, since each document is in general singular in terms of fonts used and page quality. Training data is however accumulated for use with subsequent pages as the document is processed by the ICR.
It is advisable to train the ICR if documents contain the same font.
As well as the in-built dictionary, we supply to the ICR a custom dictionary which is continually updated. During the spell-checking stage, the operator can elect to add an unrecognised word to the dictionary. When each document is sent to the server on completion, its dictionary is merged with a pool of all such collected words. At two-weekly intervals a senior data capture operator will 'harvest' this crop of words, selecting those which occurred in three or more patent documents. In this way we have accrued a dictionary of words which commonly occur in patent literature.
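The harvesting rule (keep words that occurred in three or more patent documents) can be sketched as follows; the function name and the data layout are ours, and the example words are invented.

```python
from collections import Counter

def harvest(document_dictionaries, min_documents=3):
    """Merge per-document word pools and keep words that occurred in at
    least `min_documents` distinct patent documents."""
    counts = Counter()
    for words in document_dictionaries:
        counts.update(set(words))  # count each document at most once
    return sorted(w for w, n in counts.items() if n >= min_documents)

pools = [
    {"polymerisation", "benzothiazole", "widget"},
    {"polymerisation", "benzothiazole"},
    {"polymerisation", "flange"},
]
common = harvest(pools)  # -> ['polymerisation']
```

Counting each document once (rather than each occurrence) matches the stated rule: it is the number of documents a word appears in that qualifies it for the global dictionary.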
The output from the ICR contains the recognised text along with supplementary information such as the position of each word. We have elected to convert the proprietary format as output by the ICR into a simplified form of SGML that we call 'internal mark-up language', or IML. IML is used to represent the text of the document throughout the rest of the system, until it is finally converted to SGML as required by the EPO.
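The conversion from positional ICR output to a line-oriented internal representation can be sketched as follows. This is purely illustrative: the record fields and the `<l>` tag are our inventions, not the actual IML defined by the system.

```python
def to_iml(icr_words):
    """Convert positional ICR output into a simplified line-oriented
    mark-up: words grouped by line and joined in reading order.

    Each input record is assumed (for this sketch) to carry the
    recognised text, a line number and a horizontal position.
    """
    lines = {}
    for word in icr_words:  # {'text', 'line', 'x'}
        lines.setdefault(word["line"], []).append(word)
    out = []
    for n in sorted(lines):
        words = sorted(lines[n], key=lambda w: w["x"])
        out.append("<l>%s</l>" % " ".join(w["text"] for w in words))
    return "\n".join(out)

icr = [
    {"text": "claims", "line": 2, "x": 120},
    {"text": "The", "line": 1, "x": 10},
    {"text": "The", "line": 2, "x": 10},
    {"text": "invention", "line": 1, "x": 60},
]
iml = to_iml(icr)
```

Retaining per-word position is what later allows the proof-checker to pair each line of recognised text with the corresponding strip of the bitmap.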
As mentioned before, the EPO requires high-quality output. The choice of tools for the correction of ICR output is of vital importance. This process step requires the most resources: in the case of RXBS, 70% of staff are employed in correcting the ICR's text. If the EPO's requirement were 98% correct text, only a small part of the currently employed staff would be needed. Although it can multiply the cost, specifying high quality ensures that value is added to the future usage of the data. It is not advisable to save on quality.
To check and correct the output of the ICR in order to obtain 100% conversion accuracy, it was necessary to devise a system whereby every character could be checked. Patents are legal documents, so spelling mistakes introduced by the inventor cannot be amended. This limits the choice of correction tools: RXBS cannot use auto-correction tools, as mistakes could be introduced.
We decided to implement a manual line-by-line proof-checking system. For this process the operator is presented with a line of the bitmap and the related line of ICR output beneath it.
It is important for the operators to learn to check and correct on the basis of the visual appearance of the two lines, as opposed to trying to 'read' the text, as it is all too easy to see what you expect to see. For example, a surprising number of people will see no flaw in the following text at first glance:
As well as correcting any mistakes made by the ICR, the operator will at this stage apply any missing styles such as bold, italic, underline, overscore, superscript and subscript. They will also encode any character fractions or floating accents.
'FF' characters can at this stage be identified to the system, which has the effect of adding them to a virtual keyboard which can be used to encode subsequent occurrences of the same character.
There are in fact six virtual keyboards; the other five contain the non-ASCII characters from the EPO's extended character set, grouped according to usage (Greek, mathematical, chemical etc.). Any of these keyboards can be displayed on the screen while the operator is working to allow rapid access (see Figure 3 for an example virtual keyboard).
Most of the proof-checking functions can be accessed via menus or keyboard short-cuts, depending on the individual operator's experience and preference.
The spell-checker is a variation on the proof-checking application. Instead of working through each page line-by-line however, the operator is taken straight to the first spelling mistake on the page. They can then elect to accept the word (and optionally add it to the document dictionary), leave the word for later attention, or correct the word. The interface provides icons to move backwards or forwards to the next unresolved spelling mistake.
Two dictionaries are used to validate each document; a global dictionary and a document-specific dictionary. The global dictionary on all workstations is updated periodically with the words added to the document-specific dictionaries.
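The two-dictionary lookup can be sketched as follows; the function and the example words are illustrative only.

```python
def is_known(word, global_dict, document_dict):
    """A word passes the spell-check if it appears in either the global
    dictionary or the document-specific dictionary."""
    w = word.lower()
    return w in global_dict or w in document_dict

global_dict = {"the", "invention", "comprises"}
doc_dict = {"benzothiazole"}  # added by the operator for this document

ok = is_known("benzothiazole", global_dict, doc_dict)   # True
bad = is_known("bnzothiazole", global_dict, doc_dict)   # False
```

Keeping the document-specific dictionary separate means a rare technical term accepted for one patent does not pollute the global dictionary until it has been harvested and vetted.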
As well as highlighting words not found in the dictionary, the system lets the operator opt to see all terms or characters which could significantly alter the meaning of a patent, e.g. a number referencing a drawing.
At the end of each page, the operator can enter a 'view' mode, where the full screen is given over to the text of the page displayed at 150 dpi. About half the page can be viewed at one time, the display being freely scrollable.
We have found that for certain types of document (consisting almost exclusively of normal text, free of formulae or numeric data, and in the data capture operator's first language) the proof-checking stage can be by-passed, the spell-checker plus view-mode being sufficient to achieve perfect quality.
This is where the structure of the document is encoded, by inserting SGML tags into the text to identify the start and end of constructs such as headings, paragraphs, lists, tables and footnotes.
We considered but rejected the possibility of automatically encoding the structure of the document based on the position and content of the text. Only fairly straight-forward cases such as paragraphs can be reliably encoded automatically, and the encoding of each page would still need to be checked by an operator. Since the simple cases are quickly encoded anyway by a skilled operator, we felt that little or no time would be saved by this approach.
The third-party SGML editors available at the time allowed very little customisation of their user interfaces, and would not have permitted us to take advantage of the constraints and interrelationships which are inherent in the structure of patent documents but not made explicit in the DTD. Examples include:
Footnotes and their associated references are connected via their 'ID' attribute
Images can be repositioned by the operator between any two lines of text, and a call-out tag must be inserted there to tie in the correct bitmap
The numbering of ordered list items must be consecutive within a list
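Two of the constraints listed above can be sketched as simple validations. This is illustrative Python under our own naming; the real encoder enforces these rules interactively as the operator works.

```python
def check_ordered_list(item_numbers):
    """Ordered list items must be numbered consecutively, starting from
    the first item's number."""
    if not item_numbers:
        return True
    start = item_numbers[0]
    return item_numbers == list(range(start, start + len(item_numbers)))

def check_footnotes(footnote_ids, reference_ids):
    """Every footnote reference must resolve to a footnote with a
    matching 'ID' attribute, and every footnote must be referenced."""
    return set(footnote_ids) == set(reference_ids)

ok_list = check_ordered_list([1, 2, 3])               # True
bad_list = check_ordered_list([1, 2, 4])              # False: gap
ok_fn = check_footnotes(["f1", "f2"], ["f2", "f1"])   # True
```

Checks like these are exactly what a generic DTD-driven editor could not express, since the DTD alone does not state that list numbering is consecutive or that IDs must pair up.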
In addition, it would have been cumbersome and time-consuming for an operator to use these tools to encode such structures as character fractions and floating accents. We therefore chose to design and implement our own tool to insert SGML mark-up as efficiently as possible, and with minimal possibility of operator error.
As in the proof-checker, the operator can in most cases choose to use menus or keyboard short-cuts. In addition the left mouse button has a special function, in that what it does depends on the context of the line beneath the cursor. For example, at the top level of the document (not within any structure) it will encode a paragraph, whereas within a list it will encode a list item, and within a table it will encode a table end.
The encoder application enforces the rules embodied in the DTD specified by the EPO, plus additional guidelines defined by the EPO. If the current document becomes invalid because of the encoding applied by the operator, a warning is displayed, and the operator cannot exit the application until the error is corrected.
Images are displayed in the position they occurred on the original page. In addition a small marker is placed within the text to indicate where the system considers the image to logically occur. This marker can be moved to another position within the text by the operator.
Encoding the structure of tables within patent documents using the SGML tags defined by the EPO was a sizeable problem in its own right. At that time we were not aware of any third-party products on the market capable of marking up tables in the way that we required, so we designed and wrote our own software to do this. We implemented table encoding as a sub-system of the main SGML encoding application. The operator is presented with the text content of the table, as captured by the ICR and proof-checked by data capture operators, with each word positioned as it was on the original page.
By designing a DTD specific encoding tool an operator can encode in excess of 1,400 pages per day (this includes complex tables). Again it must be emphasised how valuable it is to customise the tools for production.
The final stage of the capture process is a QA check. The document is converted to SGML, and is then reformatted and displayed on the screen beside the original bitmap.
It is easy to spot missing paragraph tags and other encoding errors, since the difference in the document structure will be immediately apparent.
The most challenging aspect of the contract is the fact that the EPO require 100% data capture accuracy. Under the terms of the contract, incorrectly captured characters can lead to financial penalties and the document being rejected by the EPO; for example, a single Greek or other special character incorrectly encoded in a document is sufficient cause for rejection.
RXBS has implemented stringent quality control processes throughout the capture process. All pages are checked at least twice. Sampling to ISO standard is also carried out.
Operator pay is partially linked to performance (productivity and quality). The performance of every operator is closely monitored and any errors found will trigger a course of counselling or retraining.
Processes are constantly revised to further improve the performance.
As previously mentioned, most of the cost during capture lies in quality assurance. Low-quality output would have little value. It is important to 'get it right the first time', and customised processes help to achieve this aim.
Finding the right calibre of data capture operator is important. Operators must be able to concentrate for long periods to achieve the high level of accuracy required.
Different skills are required for different process steps. RXBS have developed highly effective recruitment and training methods. Recruitment tests are the first part of the recruitment procedure. These tests have been specially designed for the tasks. Test results can be closely linked to operator performance after training. Operators are also interviewed to ensure that their commitment and attitude fits the team they will work in.
Training is split into two main parts. The first part is training in a group of four for one week. This training is carried out in a classroom environment.
Once successfully through this training, operators are assigned to a team. The team leader takes on responsibility for the training of the individual. Over a 12-week period the performance of the operator is closely monitored, and the operator's work is reworked until its quality can be assured. After the 12-week period the operator is taken on to the team if their performance matches the expectations set.
A vital part of the training is also the quality training each Rank Xerox employee receives. It emphasises to the individual the importance of their task. Each operator is responsible for the part of the process they carry out. They are aware that any errors passed on can affect the customer and will certainly affect them.
Maintaining a 'library' atmosphere where in theory at least only functional talking is permitted is an important aid to sustaining an acceptable accuracy level, especially where operators are involved in proof-checking. Interestingly enough, experience has shown that allowing the use of personal stereos actually increases operator productivity.
A process is in place whereby data capture operators may submit suggestions for improvements to the software. These suggestions are prioritised by a committee based on approximate software development times and expected cost savings, and the leading suggestions are implemented as soon as possible. As well as improving the quality of the software, the operators gain satisfaction from knowing that their opinions are valued.
It is not possible to recruit solely staff with SGML knowledge. In 1990 SGML was still in its infancy, and no prior knowledge of the field could be expected from the operators. The production software was therefore designed to hide the SGML tags from the operator, replacing them with codes which are meaningful to the operator, e.g. 'P' for paragraph. Styles and references are emphasised visually, e.g. underlining with 'm's for mathematical formulae. The capture operators do not have to know the rules of the DTD used; for example, they are not aware whether end tags are required.
This approach ensures that high productivity can be achieved as well as relatively low employment cost.
The main criteria are the ability to concentrate, basic computer knowledge, typing skills and ability to understand complex rules and tasks.
We have designed and implemented a distributed system for accurately capturing and marking-up patent documents, and it has met our quality objectives whilst permitting high operator productivity.
The correction of the text itself occupies most of the operator time, as opposed to the SGML encoding, but it must be borne in mind that the subject matter is in general highly complex, containing many unusual characters, formulae and technical terms.
Line-by-line proof-checking where each line of converted text is visually compared to the original bitmap seems to be a very effective way of achieving 100% accuracy at high productivity levels, although for simple documents the use of a spell-checker plus a side-by-side visual check of original document versus converted text suffices.
We have found it worthwhile to write a custom tool to facilitate efficient and error-free SGML mark-up. Having a separate table mark-up mode based on the layout of the original page has proved effective.
Splitting the capture process into separate tasks and allocating each operator to only one task has improved both productivity and employee satisfaction.
In the case of the EPO an outsourcing solution was chosen: the EPO has entrusted two contractors with the capture and conversion of its product.
This approach brings many benefits:
Susanne Richter-Wills is the European Business Development Manager for the Document Imaging Services offered by Rank Xerox Business Services.
Over the last 6 years she has been involved in setting up several high volume, high accuracy document capture operations.
Her responsibility today is to ensure that new Xerox research results are translated into a customer solution.