git clone https://github.com/mariogarcia/docker.git
Tesseract OCR
Tesseract is an open source Optical Character Recognition (OCR) Engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) using an API to extract typed, handwritten or printed text from images. It supports a wide variety of languages.
Docker
I’m keeping a docker repository with a Debian system with tesseract and some other handy utilities installed. Unfortunately is not available at DockerHub so if you want to use it to follow this entry just follow these steps:
Clone repository.
Go to the tesseract directory
cd docker/tesseract
Build the docker image
./bin/build.sh
Run image
docker run -it mgg/tesseract
That will open a tmux
session where you can execute all the commands
I’m using for this entry.
Basics
To practice a little bit I’m using spanish president Mariano Rajoy public records found at http://www.congreso.es
All records related to congressmen official incomes, properties and taxes are available from http://www.congreso.es in pdf documents. Unfortunately the way these documents are fulfilled are different from one another. |
Tesseract doesn’t read PDFs
In order to enable Tesseract to read PDF documents you have to convert
them to images. The easiest way to do so is by using imagemagick
. You
should make sure to convert pdf to image with the best quality you could:
convert -density 290 image.pdf image.png
Where XX is >72 such as 288 (which is 4x). If the resulting image is too big, then you can do:
convert -density 288 image.pdf -resize 25% image.png
Where resize
=25% or larger when density
=288
ls -l
total 288
-rw-r--r-- 1 dev dev 196746 Nov 13 21:49 document.pdf
-rw-r--r-- 1 dev dev 17657 Nov 13 22:01 output-0.png
-rw-r--r-- 1 dev dev 17003 Nov 13 22:01 output-1.png
-rw-r--r-- 1 dev dev 16590 Nov 13 22:01 output-2.png
-rw-r--r-- 1 dev dev 9819 Nov 13 22:01 output-3.png
-rw-r--r-- 1 dev dev 10569 Nov 13 22:01 output-4.png
There is a very wellknown script to improve converted text image called TEXTCLEANER |
First attempt
I’m getting the first output to get document headers:
tesseract output-0.png header
That would create a header.txt
output with all the recognized
data. Unless you’re processing a book, most of the form-like documents
could end up as a non-sense result.
cat header.txt
51
C.DTP 319 07/07/2016 10
CORTES GENERALES XII LEGISLATURA
DECLARACIDN1 DE BIENES Y RENTAS DE DIPUTADOS Y SENADORES2
Nombre y apellidos
MARIANO RAJOY BREY
Estado civil Régimen econémico matrimonial
CASADO GANANCIALES
...
Still you can use this processing for indexing purposes. However if
you happen to need to process a set of documents with certain
structure you can make use of a uzn
file.
UZN
uzn is a simple text file format for describing sections of a scanned image. The migneuzn tool outputs in this format for its segmentation.
For this example I’m using a UZN file to get only the first output
headers in a more guided way. I will tell tesseract
where to get
every field I’m interested in.
The uzn
file is like a csv
file with the following fields
separated by tab/spaces:x, y, width, height, tag.
140 302 600 66 name
143 416 454 66 civil state
776 614 824 52 city
In order to make tesseract to be aware of the uzn file you have to make the uzn file to have the same name of the processed file and also use a certain segmentation mode (-psm 4):
tesseract output-0.png result_with_uzn -psm 4
And the content of result_with_uzn.txt
is:
MARIANO RAJOY BREY
CASADO
MADRID ""n"—Tv…- r—- … ñ" “ “
Not bad, but I still need a little bit of tuning to get Tesseract to recognize these fields without the current noise, but hey this is just a preliminary research.
My overall impression is that Tesseract is a great project but requires from you a deeper knowledge before getting acceptable results. Also much of the processing has to do with image processing before even using Tesseract: contrast, bluring, quality… I will problably continuing looking into it to know more about this interesting tool.
References
-
Tesseract wiki: https://github.com/tesseract-ocr/tesseract/wiki