
EXPERIENCES WITH HIGH-VOLUME, HIGH-ACCURACY DOCUMENT CAPTURE

H R Stabler
Rank Xerox
Document Technology Centre
Mitcheldean, United Kingdom
Hugh@dtc.rankxerox.co.uk

ABSTRACT

Rank Xerox have implemented an in-house high-volume data capture operation enabling 100% accurate capture of patent documents as SGML-encoded text plus embedded images. We describe our experiences with setting up and running this operation over the last 4 years.

Background

In July 1990 Rank Xerox won a contract with the European Patent Office (EPO) to capture and publish 65% of all European patents (the remaining 35% to be processed by a French company, Jouve).

The contract involved the processing of the two main classes of patent documents, A-documents and B-documents, which are broadly similar in terms of the processing required. A-documents are the original patent applications as submitted by the applicant. An A-document consists of an abstract, a description and a set of claims.

B-documents are the original applications as amended by the examiner (in the form of hand-written amendments which must be accurately captured). It is this patent specification which is actually granted to the applicant and is a legally binding document. A B-document has a description plus a set of claims translated into English, German and French. Although there are various rules laid down by the EPO governing the presentation of patent applications, the applicant is still free to use a very wide range of type faces, font sizes, paper quality, etc.

Patent documents may be submitted to the EPO in English, German or French. The actual language distribution is roughly 60% English, 30% German and 10% French. In terms of subject matter, patent applications can be split into three broad technical fields of similar size; chemistry/metallurgy, physics/electricity, and mechanical. The average size of a document is around 20 pages, and our share of each weekly publication batch would amount to about 1500 documents, so we would have to capture in the region of 30,000 pages per week, or 1.5 million pages per year.

The work involved in processing an individual patent document can be divided into the digital capture of both the text and images of the document, and the subsequent printing and distribution of the captured electronic data. We will concern ourselves here with the capture element of the process.

The Task

The EPO have been using generalised mark-up for capturing the structure of patent documents since 1985, so SGML [1] was specified as the means of encoding the text parts of each document (see Goldfarb [2], the inventor of SGML, for a more readable guide than the ISO standard itself). Images, such as drawings and unencodable tables and formulae, must be captured as CCITT Group IV bitmaps, with SGML tags inserted into the text to indicate the logical position of each image in relation to the text.
  1. SGML Mark-up

    The SGML mark-up required by the EPO is fairly complex, particularly considering that in early 1990 when we won the contract SGML was still in its infancy, and there were few quality SGML tools available. Challenging aspects include the following:

    1. Table encoding
      A patent document has on average 4 tables to be captured as SGML, each table containing on average 204 characters. Features to be encoded include identifiers, titles, column headers and sub-headers, table footnotes, cells spanning vertically and horizontally, various cell alignments, and arbitrary horizontal and vertical ruling.
    2. Simple and complex mathematical formulae
      These include summations, limits, products, integrals, radicals and arrays; in other words, the usual gamut of mathematical constructs.
    3. Lists
      A number of list types must be recognised and correctly encoded, the list type depending on the form of the individual list items:
      • a term and a definition or explanation of the term, called a definition list,
      • a sequence number preceding each item, called an ordered list,
      • an indicator such as a bullet or hyphen preceding list items in no specific order, called an unordered list,
      • any list not belonging to the set enumerated above, called a simple list.
    4. Floating accents
      A number of "accents" such as a large circle surrounding a character and a small circle over a character are defined; these can be combined with any character to form a floating accent construct.
    5. Character fractions
      These are used for two strings of characters which appear one above the other within one line of the document. They can occur with or without a separating horizontal bar.

  2. Images

    Any parts of the document which cannot be faithfully captured as text marked up with the available SGML tags must be captured as a bitmap. As well as drawings in the usual sense, this approach embraces the small proportion of tables which do not fit into the EPO's table encoding scheme, complex mathematical formulae, and chemical formulae involving symbols such as benzene rings.

    A call-out SGML tag must also be encoded within the text at the position where the image logically occurs. This often cannot be deduced automatically, for example where a column of text has associated images to its right.

    A special case of image is the occurrence of characters not included in the extended character set defined by the EPO (so-called "undefined" or "FF" characters). About 1.5% of patent documents have undefined characters; those that do contain on average 3 different undefined characters. Such characters must be captured as bitmaps when they first appear, and added to a font specific to that document for re-use throughout the document.

  3. Character sets

    The EPO use a proprietary character set consisting of around 500 characters which commonly occur in patent literature. This character set includes the upper and lower case Greek alphabet and a full set of mathematical and logical operators, plus an assortment of less common characters (see Figure 1 below for some examples). We must accurately capture each of these characters whenever they occur in a patent document, and subsequently print them in a variety of styles and point sizes.

    Figure 1: Selection of characters from the EPO character set

  4. Accuracy

    The most challenging aspect of the contract is the fact that the EPO require 100% data capture accuracy. Under the terms of the contract incorrectly captured characters can lead to financial penalties and the document being rejected by the EPO. For example one Greek or other special character incorrectly encoded in a document is sufficient cause for rejection.

  5. Production schedules

    The EPO has a legal obligation to publish patent documents within a certain time-frame. Since publication and distribution are undertaken by us, we inherit their obligation. The EPO notify us of the publication dates of each document, normally about 6 weeks beforehand; these deadlines must be met at whatever cost.

Design Decisions

The first stage in implementing a solution required making some basic design decisions.

  1. Platform

    We decided on Sun workstations as the primary platform, due to their good performance to cost ratio, the high-resolution monitors and networking capabilities supplied as standard, and Sun's commitment to open systems.

  2. Disk storage

    A server would be needed at the centre of the operation to store all completed patent documents, but we wanted to make the system resilient to the failure of any single component, including the server. We therefore specified that each data capture workstation would have a 350 MB local disk on which all completed documents would be stored until the server was ready to receive them. The server has 15 GB of disk to store completed documents until their publication date arrives.

  3. Document transfer

    Operators can select any workstation on which to work. When they specify the document number they wish to work on, the workstation will poll all the other workstations plus the server to determine the location of the most up-to-date version of the document. This will then be transferred to the local disk. The operator need only wait until the data pertaining to the first page has been transferred before starting work (typically a delay of less than a second); the remainder of the document will be transferred behind the scenes as the first page is being worked on.
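
    As a minimal, self-contained sketch of this lookup (the catalogue dictionary stands in for the network queries actually made to each workstation and to the server, and the host names and document numbers are invented for illustration):

        # Find the host holding the most up-to-date copy of a document.
        # In the real system the version information would be obtained by
        # querying each machine over the network; here it is an in-memory map.

        def locate_latest(document_id, catalogue):
            """Return (host, mtime) of the newest known copy, or (None, None)."""
            best = (None, None)
            for host, documents in catalogue.items():
                mtime = documents.get(document_id)
                if mtime is not None and (best[1] is None or mtime > best[1]):
                    best = (host, mtime)
            return best

        if __name__ == "__main__":
            catalogue = {
                "server":        {"EP0000001": 100, "EP0000002": 500},
                "workstation-3": {"EP0000002": 750},   # newer copy not yet sent to the server
                "workstation-7": {},
            }
            print(locate_latest("EP0000002", catalogue))   # ('workstation-3', 750)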

  4. Workgroups

    Initially we had the workstations arranged in groups of six, where a group of operators would process any given document from start to finish, with each operator performing any task necessary. Further trials revealed however that both productivity and operator satisfaction were increased by dedicating operators to one specific task, such as scanning or proof-checking. The operators can now choose their specialisation according to their talents and preference.

  5. Bitmap resolution

    We decided to use 300 dpi scanners, hence this is the resolution of bitmaps used throughout the system. We experimented with higher resolutions, but found the increase in ICR accuracy to be minimal, and since the customer only required 300 dpi images to be returned, we could not justify the increased disk space required to process higher resolutions.

Process Steps

We decided to divide the processing for a document into a sequence of separate steps. Each step must be completed for all pages of a document before the system will allow the operator to commence the next step. We will discuss the process steps in the order that an operator performs them.

  1. Scanning

    We use two types of scanner: the Fujitsu 3096G with ADF to bulk-scan documents, and the Xerox 7650 to rescan pages which require scaling (the customer specified maximum sizes for drawings, which are frequently exceeded by applicants). Around 10% of pages require re-scanning.

    As well as physically scanning in the pages of the document, the operator will segment the page into text and image areas by stretching a bounding box around each segment. For image areas the type of image must also be specified by the operator (maths, table, chemical etc.).

    In some cases the smallest rectangle enclosing an image overlaps with the rectangle enclosing another image or text region. For this reason the operator can indent any of the four corners of a bounding box as required.

    Automatically segmenting the pages is something we are currently working on, but we expect that it will still be necessary for an operator to check each page, and identify the type of each image segment.

  2. ICR

    Not surprisingly, the key element in accurately converting paper documents to electronic data at a competitive cost is the ICR and the subsequent identification and correction of any conversion errors.

    We chose to use version 6 of the Kurzweil ICR toolkit (the latest version of this is now marketed by Xerox Imaging Systems as the ScanWorx API). General opinion held this to be the leading ICR at the time, and our own investigations bore this out. It also embodied all the functionality we required, such as omnifont recognition, training capability, positional information on a word-by-word basis and feedback on recognition assurance levels for individual characters.

    Rather than being a separate step, the ICR is run by the next process step, the proof-checking application. As each page is being proof-checked by the operator, the following page is being processed by the ICR in the background.

    The ICR is run with no interactive verification by the operator. No training data is made available to the ICR prior to starting each document, since each document is in general singular in terms of fonts used and page quality. Training data is however accumulated for use with subsequent pages as the document is processed by the ICR.

    As well as the in-built dictionary, we supply to the ICR a custom dictionary which is continually updated. During the spell-checking stage, the operator can elect to add an unrecognised word to the document dictionary. When each document is sent to the server on completion, the dictionary is merged with a pool of all such collected words. At two-weekly intervals a senior data capture operator will "harvest" this crop of words, selecting those words which occurred in three or more patent documents. In this way we have accrued a dictionary of words which commonly occur in patent literature.
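
    A minimal sketch of the harvesting rule, assuming each completed document contributes a simple set of words (the word lists and function name are illustrative):

        # Pool the document-specific dictionaries and promote words which occurred
        # in three or more patent documents to the shared dictionary.

        from collections import Counter

        def harvest(document_dictionaries, threshold=3):
            """Return the set of words occurring in at least `threshold` documents."""
            counts = Counter()
            for words in document_dictionaries:
                counts.update(set(words))      # count each word at most once per document
            return {word for word, n in counts.items() if n >= threshold}

        if __name__ == "__main__":
            pools = [
                {"polyolefin", "benzene", "thermoplastic"},
                {"polyolefin", "servo", "thermoplastic"},
                {"polyolefin", "thermoplastic", "dielectric"},
            ]
            print(sorted(harvest(pools)))      # ['polyolefin', 'thermoplastic']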

    The output from the ICR contains the recognised text along with supplementary information such as the position of each word on the original page and the font variations throughout the page. We elected to convert the proprietary format as output by the ICR into a simplified form of SGML that we call "internal markup language", or IML. IML is used to represent the text of the document throughout the rest of the system, until it is finally converted to SGML as required by the EPO.
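
    Neither the ICR's proprietary output format nor IML is described here in detail, so the following sketch simply illustrates the kind of conversion involved, using an invented word record (text, bounding box, style flag) and an invented internal style tag:

        # Illustrative only: convert a list of hypothetical ICR word records into a
        # single line of lightly tagged internal text. Positional information would
        # be carried along for use by the proof-checker, not discarded.

        def words_to_text(words):
            parts = []
            for word in words:
                text = word["text"]
                if word.get("italic"):
                    text = "<i>" + text + "</i>"   # hypothetical internal style tag
                parts.append(text)
            return " ".join(parts)

        if __name__ == "__main__":
            line = [
                {"text": "The", "bbox": (120, 300, 155, 322)},
                {"text": "catalyst", "bbox": (162, 300, 240, 322), "italic": True},
            ]
            print(words_to_text(line))             # The <i>catalyst</i>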

  3. Proof-checking

    For checking and correcting the output of the ICR in order to obtain 100% conversion accuracy it was necessary to devise a system whereby every character would be compared against the original bitmap.

    We decided to implement a line-by-line proof-checking system. In this system, the operator is presented with a single line of text from the bitmap shown at 150 dots per inch. We can identify the relevant portion of the bitmap from the positional feedback obtained from the ICR. Immediately below this is the same line of text, as converted by the ICR. The converted text is rendered so as to align as closely as possible with the original bitmap, in terms of both position and font. These two lines are displayed in a wide, shallow window the width of the screen. Above are two taller windows side by side: the left contains the bitmap displayed at 75 dpi, and the right contains the converted text, also at 75 dpi and again positioned to match the bitmap. These are automatically scrolled so that the line of text displayed at 150 dpi is positioned as closely as possible to the centre of these two windows, so as to provide a degree of context. Below the 150 dpi window are various icons and informational displays. The operator scrolls down line by line by clicking the mouse. Figure 2 below portrays the proof-checking screen.
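
    A small sketch of how the proofing line can be located from the ICR's positional feedback: take the union of the word bounding boxes on the line (reported at the 300 dpi scan resolution) and halve the coordinates for display at 150 dpi. The coordinates here are illustrative:

        def line_region(word_boxes, scan_dpi=300, display_dpi=150):
            """Union of (left, top, right, bottom) word boxes, rescaled for display."""
            left   = min(b[0] for b in word_boxes)
            top    = min(b[1] for b in word_boxes)
            right  = max(b[2] for b in word_boxes)
            bottom = max(b[3] for b in word_boxes)
            scale = display_dpi / scan_dpi
            return tuple(round(v * scale) for v in (left, top, right, bottom))

        if __name__ == "__main__":
            boxes = [(120, 300, 155, 322), (162, 300, 240, 322), (248, 298, 310, 322)]
            print(line_region(boxes))              # (60, 149, 155, 161)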

    It is important for the operators to learn to check and correct on the basis of the visual appearance of the two lines, as opposed to trying to "read" the text, as it is all too easy to see what you expect to see. For example, a surprising number of people will see no flaw in the following text at first glance:

    As well as correcting any mistakes made by the ICR, the operator will at this stage apply any styles such as bold, italic, underline, overscore, superscript and subscript. They will also encode any character fractions or floating accents.

    "FF" characters can at this stage be identified to the system, which has the effect of adding them to a virtual keyboard which can be used to encode subsequent occurrences of the same character. There are in fact six virtual keyboards; the other five contain the non-ASCII characters from the EPO's extended character set, grouped according to usage (Greek, mathematical, chemical etc.). Any of these keyboards can be displayed on the screen while the operator is working to allow rapid access (see Figure 3 for an example virtual keyboard).

    Most of the proof-checking functions can be accessed via menus or by keyboard short-cuts, depending on the individual operator's experience and preference.

  4. Spell-checking

    The spell-checker is a variation on the proof-checking application. It has the same proofing line window and two context windows. Instead of working through each page line-by-line however, the operator is taken straight to the first spelling mistake on the page. They can then elect to accept the word (and optionally add it to the document dictionary), leave the word for later attention, or correct the word. The interface provides icons to move backwards or forwards to the next unresolved spelling mistake.

    Two dictionaries are used to validate each document; a global dictionary and a document-specific dictionary. The global dictionary on all workstations is updated periodically with the words added to the document-specific dictionaries.

    As well as highlighting words not found in the dictionary, all words containing special characters (such as Greek and mathematical symbols) and all numbers are highlighted for checking, since errors in capturing these could significantly alter the meaning of a patent.
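
    A minimal sketch of this rule (the tiny dictionary is illustrative, and the real system works from the EPO character set rather than simply testing for non-ASCII characters):

        def needs_checking(word, dictionary):
            """Flag words containing special characters or digits, or unknown to the dictionary."""
            if any(ord(ch) > 127 for ch in word):   # Greek letters, mathematical symbols, etc.
                return True
            if any(ch.isdigit() for ch in word):    # numbers can alter the meaning of a claim
                return True
            return word.lower() not in dictionary

        if __name__ == "__main__":
            dictionary = {"the", "catalyst", "comprises"}
            for word in ["catalyst", "25%", "α-olefin", "comprlses"]:
                print(word, needs_checking(word, dictionary))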

    At the end of each page, the operator can enter a "view" mode, where the full screen is given over to the text of the page displayed at 150 dpi. About half the page can be viewed at one time, the display being freely scrollable.

    We have found that for certain types of document (consisting almost exclusively of normal text, free of formulae or numeric data, and in the data capture operator's first language) the proof-checking stage can be by-passed, the spell-checker plus view-mode being sufficient to achieve perfect quality.

  5. Encoding

    This is where the structure of the document is encoded, by inserting SGML tags into the text to identify the start and end of constructs such as headings, paragraphs, lists, tables and footnotes.

    We considered but rejected the possibility of automatically encoding the structure of the document based on the position and content of the text. Only fairly straight-forward cases such as paragraphs can be reliably encoded automatically, and the encoding of each page would still need to be checked by a human. Since the simple cases are quickly encoded anyway by a skilled operator, we felt that little or no time would be saved by this approach.

    The third-party SGML editors available at the time allowed very little customisation of their user interfaces, and would not have permitted us to take advantage of the constraints and interrelationships which are inherent in the structure of patent documents but not made explicit in the DTD. Examples include:

    In addition, it would have been cumbersome and time-consuming for an operator to use these tools to encode such structures as character fractions and floating accents. We therefore chose to design and implement our own tool to insert SGML mark-up as efficiently as possible, and with minimal possibility of operator error.

    As in the proof-checker, the operator can in most cases choose to use menus or keyboard short-cuts. In addition the left mouse button has a special function, in that what it does depends on the context of the line beneath the cursor. For example, at the top level of the document (not within any structure) it will encode a paragraph, whereas within a list it will encode a list item, and within a table it will encode a table end.
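
    A minimal sketch of this context-sensitive dispatch (the structure names and actions are simplified stand-ins; the real encoder follows the EPO DTD):

        def left_button_action(open_structures):
            """Choose the encoding action from the innermost open structure at the cursor."""
            context = open_structures[-1] if open_structures else None
            if context == "list":
                return "start list item"
            if context == "table":
                return "end table"
            return "start paragraph"           # top level of the document

        if __name__ == "__main__":
            print(left_button_action([]))                  # start paragraph
            print(left_button_action(["list"]))            # start list item
            print(left_button_action(["list", "table"]))   # end table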

    The encoder application enforces the rules embodied in the DTD specified by the EPO, plus additional guidelines defined by the EPO. If the current document becomes invalid because of the encoding applied by the operator, a warning will be displayed, and the operator cannot exit the application until it is corrected.

    Images are displayed in the position they occurred on the original page. In addition a small marker is placed within the text to indicate where the system considers the image to logically occur. This marker can be moved to another position within the text by the operator.

    Encoding the structure of tables within patent documents using the SGML tags defined by the EPO was a sizable problem in its own right. At that time we were not aware of any third-party products on the market capable of marking up tables in the way that we required, so we designed and wrote our own software to do this. We implemented table encoding as a sub-system of the main SGML encoding application. Figure 4 below shows a typical table encoding screen. The operator is presented with the text content of the table, as captured by the ICR and proof-checked by data capture operators, with each word positioned as it was on the original page.

    The table encoding process is logically split into five stages, which the operators normally (but not necessarily) follow in order:

    There are additionally a number of checking modes available while encoding tables, whereby the operator can view the table as it will be printed, or view the SGML-encoded text of the table.
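
    As one small illustration of the kind of layout handling involved in working with the positioned table text described above, the sketch below groups the words of a table row into columns by looking for large horizontal gaps between word boxes. The gap threshold and coordinates are invented, and in practice the operator confirms or corrects the column structure:

        def group_into_columns(word_boxes, min_gap=40):
            """Split (left, right, text) word boxes into columns at large horizontal gaps."""
            boxes = sorted(word_boxes)                     # sort by left edge
            columns, current = [], [boxes[0]]
            for box in boxes[1:]:
                if box[0] - current[-1][1] >= min_gap:     # gap since previous word's right edge
                    columns.append(current)
                    current = [box]
                else:
                    current.append(box)
            columns.append(current)
            return [[text for _, _, text in column] for column in columns]

        if __name__ == "__main__":
            row = [(100, 180, "Sample"), (190, 230, "A"), (400, 470, "12.5"), (700, 760, "kg/m3")]
            print(group_into_columns(row))                 # [['Sample', 'A'], ['12.5'], ['kg/m3']]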

  6. SGML check

    The final stage of the capture process is a QA check. The document is converted to SGML, and is then reformatted and displayed on the screen beside the original bitmap.

    It is easy to spot missing paragraph tags and other encoding errors, since the difference in the document structure will be immediately apparent.

Performance

  1. Time

    The scanning/segmenting process step takes around 50 seconds per page, the encoding and SGML-checking process steps combined take on average less than 35 seconds of operator time per page, whereas the proof-checking step takes on average about 130 seconds, and spell-checking 90 seconds (the ICR process itself, since it is performed concurrently in background, takes no operator time).

    From this it is clear that the cost of marking up the text with SGML tags, including tables, is negligible compared to the cost of ensuring the accuracy of the text itself, which consumes over 70% of total operator time per page. The nature of the text plays a large part in this; a complex chemical patent is much more difficult to correctly capture, both for an ICR and a human operator, than "normal" text where most of the words are in common usage.
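
    As a rough check on that figure, taking the per-page averages quoted above at face value and simply summing them gives 50 + 35 + 130 + 90 = 305 seconds of operator time per page, of which proof-checking and spell-checking account for 220 / 305, or roughly 72%.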

  2. Accuracy

    The quality of the resulting electronic document is excellent, consistently exceeding 99.95% accuracy.

Future Improvements

Our existing contract with the EPO expires next year, and we have started work on the next major release of the system in preparation for the next tender request. The quality of the output produced by the existing system leaves little room for improvement, therefore our primary aim is to improve productivity so that the costs to the customer may be reduced. Most of the per-page cost goes to pay the wages of the data capture operators, therefore we must look at ways to reduce the operator time spent processing each page.

There are two major areas where we intend to automatically assist processes currently performed by human operators:

  1. Automatic Correction of ICR Errors

    Any major improvement in productivity or accuracy will come about by improving the accuracy of the ICR.

    Recent research at the University of Nevada [3] has indicated to us that it may be possible to eliminate a large proportion of ICR errors automatically. The approach uses a confusion matrix for the ICR, together with a very large dictionary of all 100% correct words encountered to date and their occurrence counts, to identify and correct wrongly converted characters in the raw output of the ICR. We plan to implement a feedback system to monitor the success rate of each substitution made by the automatic correction mechanism and adjust its future behaviour accordingly.
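
    A minimal sketch of the idea (not the planned implementation): for a given word, generate candidate corrections by applying known single-character confusions, and keep the candidate with the highest occurrence count among previously verified words. The confusion table and counts below are illustrative:

        def correct(word, confusions, word_counts):
            """Return the most frequently seen candidate reachable by one substitution."""
            best_word, best_count = word, word_counts.get(word, 0)
            for i, ch in enumerate(word):
                for replacement in confusions.get(ch, ()):
                    candidate = word[:i] + replacement + word[i + 1:]
                    count = word_counts.get(candidate, 0)
                    if count > best_count:
                        best_word, best_count = candidate, count
            return best_word

        if __name__ == "__main__":
            confusions = {"1": ["l", "i"], "l": ["1", "i"]}       # single-character confusions only
            word_counts = {"oil": 420, "claim": 9600, "polymer": 980}
            print(correct("o1l", confusions, word_counts))        # oil
            print(correct("cla1m", confusions, word_counts))      # claim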

  2. Automatic Segmentation

    We are currently working on automatically analysing the layout of each page bitmap and segmenting it into text and image areas. Patent documents are unusual in a number of respects; they typically have line numbers every five lines, and can have a number of extraneous features on the page such as rubber stamps and document numbers.

    The approach we are taking is to transform the bitmap into connected components, and to heuristically analyse the relationships of the positions of the components to differentiate normal text from features such as line numbers, page numbers and images.
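
    A self-contained sketch of the first part of this analysis: connected component extraction (here by a simple breadth-first fill on a binary bitmap) followed by a crude size-based classification. The positional heuristics for line numbers, stamps and so on are omitted, and the thresholds are illustrative:

        from collections import deque

        def components(bitmap):
            """Yield lists of (row, col) pixels for each 4-connected black component."""
            rows, cols = len(bitmap), len(bitmap[0])
            seen = [[False] * cols for _ in range(rows)]

            def flood(r, c):
                queue, pixels = deque([(r, c)]), []
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < rows and 0 <= nx < cols and bitmap[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                return pixels

            for r in range(rows):
                for c in range(cols):
                    if bitmap[r][c] and not seen[r][c]:
                        yield flood(r, c)

        def classify(pixels, char_height=12):
            """Very rough size-based classification of one component."""
            ys = [y for y, _ in pixels]
            xs = [x for _, x in pixels]
            if max(ys) - min(ys) + 1 > 3 * char_height or max(xs) - min(xs) + 1 > 3 * char_height:
                return "image"
            return "text"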

    As a by-product of the analysis of the page bitmap performed to automatically segment each page, the positions of individual words on the page can easily be determined, which in turn gives an indication of the skew of the page. Rotating the page by this amount should eliminate skew and improve both ICR performance and the quality of embedded images.
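
    A small sketch of the skew estimate, assuming the bottom-left corners of the words on one text line are available as (x, y) pixel coordinates (the coordinates below are invented):

        import math

        def skew_degrees(points):
            """Least-squares slope of baseline y over x, as an angle in degrees."""
            n = len(points)
            mean_x = sum(x for x, _ in points) / n
            mean_y = sum(y for _, y in points) / n
            num = sum((x - mean_x) * (y - mean_y) for x, y in points)
            den = sum((x - mean_x) ** 2 for x, _ in points)
            return math.degrees(math.atan2(num, den))

        if __name__ == "__main__":
            baselines = [(100, 500), (400, 503), (700, 506), (1000, 509)]
            print(round(skew_degrees(baselines), 2))   # about 0.57 degrees of skew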

    In addition, as a by-product of the clustering operation, we can discard clusters smaller than a certain threshold value. For a 10-point font scanned at 300 dpi, discarding all clusters smaller than 6 by 6 pixels reduces the noise on a poor-quality bitmap significantly, with an acceptably small risk of losing a genuine punctuation mark.
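
    The corresponding filter is trivial once the components are available; a min_size of 6 matches the 6 by 6 pixel threshold quoted above:

        def remove_specks(component_list, min_size=6):
            """Drop components whose bounding box is smaller than min_size in both directions."""
            kept = []
            for pixels in component_list:
                ys = [y for y, _ in pixels]
                xs = [x for _, x in pixels]
                if max(ys) - min(ys) + 1 >= min_size or max(xs) - min(xs) + 1 >= min_size:
                    kept.append(pixels)
            return kept

        if __name__ == "__main__":
            specks_and_marks = [
                [(0, 0), (0, 1), (1, 0), (1, 1)],      # 2 x 2 speck: discarded
                [(0, c) for c in range(10)],           # 1 x 10 ruling fragment: kept
            ]
            print(len(remove_specks(specks_and_marks)))   # 1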

Personnel Issues

Finding the right calibre of data capture operator proved difficult. The level of accuracy required, and the skills needed to proof-check in multiple languages, to correctly mark up complex tables and maths, to identify different types of embedded image, and to correctly identify and enter more than 500 characters, mean that a fairly high level of intelligence and education is required. At the same time, this intelligence tends to induce boredom, and hence inaccuracy and low productivity. It also means that data capture operators typically earn less than they would if employed in the field of their choice, especially since the recession in the U.K. has led to a fair number of graduates among recruits.

Staff turnover has been fairly high, but we have gradually built up a core of consistently productive and accurate data capture operators who seem ideally suited to the task.

Maintaining a "library" atmosphere where in theory at least only functional talking is permitted is an important aid to sustaining an acceptable accuracy level, especially where operators are involved in proof-checking. Interestingly enough, experience has shown that allowing the use of personal stereos actually increases operator productivity.

A process is in place whereby data capture operators may submit suggestions for improvements to the software. These suggestions are prioritised by a committee based on approximate software development times and expected cost savings, and the leading suggestions are implemented as soon as possible. As well as improving the quality of the software, the operators gain satisfaction from knowing that their opinions are valued.

Conclusion

We have designed and implemented a distributed system for accurately capturing and marking-up patent documents, and it has met our quality objectives whilst permitting high operator productivity.

The correction of the text itself occupies most of the operator time, as opposed to the SGML encoding, but it must be borne in mind that the subject matter is in general highly complex, containing many unusual characters, formulae and technical terms.

Line-by-line proof-checking where each line of converted text is visually compared to the original bitmap seems to be a very effective way of achieving 100% accuracy at high productivity levels, although for simple documents the use of a spell-checker plus a side-by-side visual check of original document versus converted text suffices.

We have found it worthwhile to write a custom tool to facilitate efficient and error-free SGML mark-up. Having a separate table mark-up mode based on the layout of the original page has proved effective.

Splitting the capture process into separate tasks and allocating each operator to only one task has improved both productivity and employee satisfaction.

Acknowledgments

I would like to thank Larry Spitz for his support and helpful comments and suggestions during the writing of this paper.

References

[1] ISO 8879: Information processing - Text and office systems - Standard Generalized Markup Language (SGML). 1986-10-15.

[2] Charles F. Goldfarb, The SGML Handbook, Oxford University Press 1990, ISBN 0-19-853737-9.

[3] Kazem Taghva, Julie Borsack, Bryan Bullard, and Allen Condit, Post-Editing through Approximation and Global Correction, International Journal of Pattern Recognition and Artificial Intelligence, 1994.