Monday, August 18, 2014

Scanning to OCRed PDF in Ubuntu

When I want to scan a document to PDF with OCRed text in Ubuntu, here is how I do it.

These steps are a modification of what I found here (which refers to [1]).
  1. Scanning
    1. I already owned a good scanner, namely a Canon LiDE 210.
    2. I recommend cleaning the scanner's glass bed before beginning. Any small "spots" that are removed now don't have to be removed later in post processing.
    3. I scan using the GUI XSane. The settings I used, as recommended by [1], are:
      1. Color: Gray
      2. DPI: 300 or 600 (low enough so your scanner doesn't pause while scanning)
      3. Gamma, brightness, and contrast: default enhancement values of (1,0,0)
      4. Save to TIFF: Type = TIFF
      5. 8-bit images: Preferences ==> Filetype ==> Reduce 16 bit image to 8 bit
      6. No compression: Preferences ==> Filetype ==> TIFF 8 bit image compression = no compression
  2. Post processing
    1. I use ScanTailor.  I recommend applying the same "Select Content" and "Margins" settings to all pages so that the content on each page is in the correct location (modulo noise from scanning) and all pages have the same dimensions.  NOTE: The dimensions of these TIFF files will be the dimensions of the final PDF file.
  3. OCR
    1. I use tesseract.  Everyone says that tesseract dies a fantastic death if you are so bold as to pass it a file with the extension "tiff" instead of "tif".  Each page is OCRed individually (see the next step for the reason).  The OCR can take a long time, so the following command echoes the name of the file currently being OCRed to show progress.  Tesseract sometimes fails on individual files for no apparent reason, and rerunning the command helps: for me, the third or fourth run finally succeeded.  The command for this step is: for f in *.tif; do echo "$f"; tesseract "$f" "$f" -l eng hocr; done
    2. I use hocr2pdf to pair the OCRed data with the text image and create a PDF, all at the same time.  It seems that hocr2pdf cannot handle multi-page TIFF files, so each page is handled individually using a for loop (like above).
      1. Installation of hocr2pdf is achieved by installing exactimage: sudo apt-get install exactimage.
      2. The command for this step is: for f in *.tif; do hocr2pdf -i "$f" -o "$f.pdf" < "$f.html"; done
  4. Merge
    1. I use pdftk.  The command for this step is: pdftk *.pdf cat output merged.pdf
  5. Edit metadata
    1. The command to get the current metadata is: pdftk merged.pdf dump_data output metadata.txt
    2. Some of the valid keys can be found here.
    3. The command to save the modified metadata is: pdftk merged.pdf update_info metadata.txt output merged2.pdf
    4. Lastly, one might want to edit the page labels.  However, I was not happy with any of the Ubuntu solutions for this, so I used Adobe Acrobat Pro 9 to do this instead.
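
Steps 3 and 4 above could be wrapped in a few shell functions. The following is a minimal sketch, not a polished tool. The helper names (retry, ocr_pages, make_pdfs, merge_pdfs) and the retry limit are my own assumptions, not part of any of the tools. It assumes tesseract writes its hOCR output to "$f.html", as in the commands above; newer tesseract releases may write "$f.hocr" instead, in which case the file names need adjusting. Unlike the merge command above, it merges only *.tif.pdf so that an existing merged.pdf is not picked up on a second run.

```shell
# Sketch of steps 3 and 4. All function names here are hypothetical helpers.
MAX_TRIES=4   # assumption: a handful of attempts is enough

# Run a command up to MAX_TRIES times; succeed as soon as one attempt does.
retry() {
  local attempt=1
  while [ "$attempt" -le "$MAX_TRIES" ]; do
    "$@" && return 0
    echo "attempt $attempt failed: $*" >&2
    attempt=$((attempt + 1))
  done
  return 1
}

# OCR every page, skipping pages that already have hOCR output.
ocr_pages() {
  local f
  for f in *.tif; do
    [ -e "$f" ] || continue        # the glob matched nothing
    [ -s "$f.html" ] && continue   # already done in an earlier run
    echo "$f"                      # show progress
    retry tesseract "$f" "$f" -l eng hocr || echo "giving up on $f" >&2
  done
}

# Pair each page image with its hOCR text to make a one-page PDF.
make_pdfs() {
  local f
  for f in *.tif; do
    [ -s "$f.html" ] || continue   # no OCR output for this page
    hocr2pdf -i "$f" -o "$f.pdf" < "$f.html"
  done
}

# Merge the per-page PDFs; the shell expands the glob in sorted order,
# so the page files need names that sort in page order.
merge_pdfs() {
  pdftk *.tif.pdf cat output merged.pdf
}
```

After pasting these definitions into a shell (or sourcing them from a file) in the directory that holds the TIFF files, run ocr_pages, then make_pdfs, then merge_pdfs. Because pages with existing hOCR output are skipped, rerunning ocr_pages after a failure only repeats the pages that did not succeed.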
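
For step 5, the metadata file can also be edited from the command line instead of a text editor. The sketch below assumes the layout that recent pdftk versions print with dump_data, where each entry is an "InfoBegin" line followed by an "InfoKey: ..." line and an "InfoValue: ..." line; older versions may omit "InfoBegin". The function name set_info is a hypothetical helper, and I am assuming that pdftk accepts a new entry appended at the end of the file.

```shell
# Set (or add) one Info entry in a file produced by pdftk dump_data.
# Usage: set_info metadata.txt Title "My scanned document"
# Note: awk interprets backslash escapes in the value passed with -v.
set_info() {
  local file=$1 key=$2 value=$3
  awk -v key="$key" -v value="$value" '
    # The previous line named our key, so this line is its value: replace it.
    hit { print "InfoValue: " value; hit = 0; found = 1; next }
    $0 == "InfoKey: " key { hit = 1 }
    { print }
    END {
      # Key was not present: append a new entry.
      if (!found) {
        print "InfoBegin"
        print "InfoKey: " key
        print "InfoValue: " value
      }
    }
  ' "$file" > "$file.new" && mv "$file.new" "$file"
}
```

For example, set_info metadata.txt Title "Scanned notes" changes the title in metadata.txt, after which the update_info command from step 5 writes it into merged2.pdf.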