The detection of headlines achieved a low error rate of 2.85%, as against 6.52% for the previously used methods. During the evaluation of segmentation algorithms, XYcut was found to gain a lot from noise cleanup. This is an interesting result, as it strengthens the case for the XYcut segmentation algorithm as a suitable method for OCRopus. The reengineering and porting of the zone-classification module to OCRopus makes it possible for OCRopus to offer text/image segmentation if it is required in the future.

OCRopus: Introduction
Though the field of optical character recognition (OCR) is considered to be widely explored, the development of an efficient system for use in real-world situations still remains a challenge for developers. OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multilingual capabilities; it is being developed at IUPR. As this is a very large project, I was assigned the tasks of developing tools for layout analysis and evaluation.

The Goals:
The following goals were set as I proceeded in my work:
1. Conversion of ground-truth data in the MARG database from XML format to the hOCR microformat.
2. Development of a rule-based headline detection method using the median black run-length of the lines.
3. Development of a segmentation-classification module and evaluation of the performance of different segmentation algorithms in the presence of noise.

1. XML to hOCR:
hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML.
By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and the OCR-related information coexist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation. Because of these qualities, it is highly desirable to have ground truth in hOCR format. I was assigned the task of converting the MARG database ground truth into hOCR format.
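To illustrate how hOCR carries layout information inside ordinary HTML, the following is a minimal sketch in Python. It is not the xmltohocr script itself. The input structure (a list of bounding-box and text tuples) and the function name `lines_to_hocr` are assumptions made for illustration, since the MARG XML schema is not described here. Only the `ocr_page` and `ocr_line` classes and the `bbox` title property are taken from the hOCR format.

```python
from html import escape


def lines_to_hocr(lines, page_width, page_height):
    """Render text lines as a minimal hOCR page.

    `lines` is a hypothetical list of (x0, y0, x1, y1, text) tuples,
    one per text line, with coordinates in pixels.
    """
    parts = [
        "<html><head><meta charset='utf-8'></head><body>",
        # The page element records the full page bounding box.
        "<div class='ocr_page' title='bbox 0 0 %d %d'>" % (page_width, page_height),
    ]
    for x0, y0, x1, y1, text in lines:
        # Layout data lives in the title attribute, so a browser shows only the text.
        parts.append(
            "<span class='ocr_line' title='bbox %d %d %d %d'>%s</span>"
            % (x0, y0, x1, y1, escape(text))
        )
    parts.append("</div></body></html>")
    return "\n".join(parts)
```

Because the bounding boxes sit in `title` attributes, the file stays valid HTML and the text remains editable alongside its layout data.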
For this purpose I wrote the following script.
Script name: xmltohocr
Language used: Python
Command-line form: xmltohocr FILE.XML
FILE.XML: the file in XML format to be converted into the hOCR microformat.
Note: The script does not yet handle LaTeX characters. Incorporating this feature would be an improvement.

2. Headline detection based on black run-length and its integration into OCRopus:
Detection of headlines in document images is an issue that is mostly overlooked, yet it is highly desirable for properly formatting the output of OCR.
Until now, OCRopus used a rule-based method that took the space between lines as the criterion for detecting headlines. Though this method worked for many images, it also failed many times. We observed that the black run-lengths of headlines are longer than those of normal text lines, and we built upon this idea. We used the median black run-length of a line as the deciding criterion. The median was used instead of the mean because the mean run-length could easily have been affected by noise merging with the text, which would have produced errors.
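The robustness argument can be illustrated with a small Python sketch. It is an illustration only, not the OCRopus implementation (which was written in C++). It assumes a binarized line image given as rows of 0/1 values with 1 meaning black, and takes the median as the smallest run-length at which the cumulative count reaches half of all runs. The function names are hypothetical.

```python
def black_run_lengths(rows):
    """Collect the lengths of horizontal runs of black pixels.

    `rows` is an iterable of sequences of 0/1 values (1 = black, an assumption).
    """
    runs = []
    for row in rows:
        current = 0
        for pixel in row:
            if pixel:
                current += 1
            elif current:
                runs.append(current)
                current = 0
        if current:  # a run that touches the right edge of the row
            runs.append(current)
    return runs


def median_run_length(runs):
    """Median run-length from a histogram and its cumulative count (0 if empty)."""
    if not runs:
        return 0
    hist = [0] * (max(runs) + 1)
    for r in runs:
        hist[r] += 1
    half = len(runs) / 2.0
    cumulative = 0
    for length, count in enumerate(hist):
        cumulative += count
        if cumulative >= half:
            return length
    return len(hist) - 1


def mean_run_length(runs):
    """Arithmetic mean run-length, shown only for comparison."""
    return sum(runs) / float(len(runs)) if runs else 0.0
```

For the runs `[2, 2, 3, 2, 3, 40]`, where 40 stands for a noise blob merged with the text, the median stays at 2 while the mean rises above 8.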
The whole approach is simple, as described below:
1. Calculate the median black run-length for each line on the page.
2. Compare this run-length for each line with those of the lines below and above it.
3. If the median black run-length of a line is at least K1 (a parameter) times the median run-length of the line below it, and at least K2 (another parameter) times the median run-length of the line above it, mark it as a headline.
The values of the parameters K1 and K2 had to be found experimentally. After evaluating the performance of the program many times, K1 and K2 were set to 1.5 and 1.1 respectively.
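The decision rule in the steps above can be sketched in Python as follows. This is a hedged illustration rather than the C++ code integrated into OCRopus. The function name `detect_headlines` is hypothetical, the comparison is read as "at least K1 (or K2) times", and the handling of the first and last lines of a page is an assumption, since the text does not say what happens when a neighbouring line is missing.

```python
K1 = 1.5  # required ratio to the line below (value reported in the text)
K2 = 1.1  # required ratio to the line above (value reported in the text)


def detect_headlines(medians, k1=K1, k2=K2):
    """Return the indices of lines flagged as headlines.

    `medians` holds the median black run-length of each text line,
    ordered from the top of the page to the bottom.
    """
    headlines = []
    for i, m in enumerate(medians):
        # Assumption: a headline must be followed by a line, so the last line never qualifies.
        below_ok = i + 1 < len(medians) and m >= k1 * medians[i + 1]
        # Assumption: the first line has no line above, so that test is treated as passed.
        above_ok = i == 0 or m >= k2 * medians[i - 1]
        if below_ok and above_ok:
            headlines.append(i)
    return headlines
```

With per-line medians `[6, 3, 3, 5, 3, 3]`, the lines at indices 0 and 3 are flagged: each is at least 1.5 times the line below it, and line 3 is also at least 1.1 times the line above it.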
We used a histogram-based method to find the median run-length. A histogram of the number of occurrences versus run-length was calculated and then normalized by the largest occurrence count. We then calculated the cumulative distribution function of this normalized histogram. The point at which the cumulative distribution function reaches a value of 0.5 corresponds to the median run-length. The program for detecting headlines was written in C++ and uses standard OCRopus classes. The program has been successfully integrated into OCRopus.

Evaluation:
We also designed a tool that evaluates the performance of OCRopus in detecting headlines. In accordance with OCRopus standards, this tool was developed to work with files in the hOCR microformat. The tool consists of two programs:
1. The first program takes the OCRopus output and the corresponding ground-truth file in hOCR format, and outputs the total number of false positives and false negatives that occurred in detection. It also outputs the total number of true headlines present in the ground truth. The command-line form of this program is: