How To Make scanned PDFs searchable (OCR) using pdfocr

This is a very useful tip, especially if you store scanned PDFs in Google Drive: it lets you search across all your Google Drive folders for a particular phrase within a PDF.

Usually, a scanned PDF contains an image of each page, not the text (unlike a PDF generated from, say, OpenOffice). This tip performs Optical Character Recognition on the images inside the PDF and adds a searchable layer of text. The embedded text is not visible, so when printed, you still get the original page layout.

From my few tests it works well, regardless of language. There are some layout issues with where it places the searchable text, but since that text is invisible anyway, it’s not a problem.

To install (on Ubuntu 13.10):
sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr

To convert a document:
pdfocr -i infile.pdf -o outfile.pdf
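
If you have a whole folder of scans, the same command can be wrapped in a small loop. This is only a sketch: the function names ocr_name and ocr_dir and the "-ocr" suffix for output files are my own conventions for illustration, not part of pdfocr. The only pdfocr options it relies on are the -i and -o shown above.

```shell
# Derive the output name for a searchable copy: scan.pdf -> scan-ocr.pdf
# (the "-ocr" suffix is an arbitrary convention chosen for this sketch).
ocr_name() {
    printf '%s-ocr.pdf\n' "${1%.pdf}"
}

# Run pdfocr over every PDF in a directory (default: current directory),
# skipping earlier outputs and files that were already converted.
ocr_dir() {
    dir="${1:-.}"
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                   # glob matched nothing
        case "$f" in *-ocr.pdf) continue ;; esac  # an output of a previous run
        out="$(ocr_name "$f")"
        [ -e "$out" ] && continue                 # searchable copy already exists
        pdfocr -i "$f" -o "$out"
    done
}
```

Paste the functions into a terminal and run `ocr_dir ~/scans` (or just `ocr_dir` for the current folder). The originals are left untouched, so you can compare the results before deleting anything.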

Bonus Tip: Creating a PDF from a bunch of scanned images

It happens … some idiot who doesn’t know how to use a scanner sends you a series of JPEGs or you were desperate and had to use a cameraphone to scan some documents. Here’s what you do:

Install ImageMagick
sudo apt-get install imagemagick

Convert images to PDF
convert *.jpg out.pdf

You will need to make sure the image file names sort in page order, because the shell expands *.jpg alphabetically; usually they are numbered by the camera or scanner anyway.
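
One catch with the ordering: if the numbers are not zero-padded, a plain *.jpg puts page10.jpg before page2.jpg. A rough workaround, assuming GNU sort (for -V) and file names without newlines, is to sort the names numerically before handing them to convert. The helper names sorted_jpegs and images_to_pdf are made up for this example.

```shell
# Print the given file names in natural (version) order, one per line,
# so that page2.jpg comes before page10.jpg. Requires GNU sort.
sorted_jpegs() {
    printf '%s\n' "$@" | sort -V
}

# Build a PDF from images in natural order.
# Usage: images_to_pdf out.pdf *.jpg
# Assumes no file name contains a newline.
images_to_pdf() {
    out="$1"
    shift
    # xargs appends the sorted names after "$out"; inside sh -c,
    # "$out" becomes $0 and the image names become "$@".
    sorted_jpegs "$@" | tr '\n' '\0' | xargs -0 sh -c 'convert "$@" "$0"' "$out"
}
```

To preview the order without creating anything, run `sorted_jpegs *.jpg` first; if the list looks right, `images_to_pdf out.pdf *.jpg` does the conversion.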



This post was originally published publicly on Google+ at 2014-01-08 13:53:34+0800
