How to make scanned PDFs searchable (OCR) using pdfocr
Usually, a scanned PDF contains only an image of each page, not the text itself (unlike a PDF generated from, say, OpenOffice). This tip performs Optical Character Recognition on the images inside the PDF and adds a searchable layer of text. The embedded text is invisible, so the document still looks and prints exactly like the original scan.
In my few tests it worked well, regardless of the document’s language. The searchable text is not always positioned exactly over the words it matches, but since it is invisible anyway, that’s not a problem.
To install (on Ubuntu 13.10):
sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr
To convert a document:
pdfocr -i infile.pdf -o outfile.pdf
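If you have a whole folder of scans, the same command can be wrapped in a loop. This is only a rough sketch: the function name ocr_dir, the "-ocr" output suffix and the PDFOCR_CMD override variable are my own inventions for illustration, and the only pdfocr options used are the -i and -o shown above.

```shell
#!/bin/sh
# Hypothetical helper: run "pdfocr -i IN -o OUT" on every PDF in a directory.
# PDFOCR_CMD is an assumption of this sketch, not a pdfocr feature; it only
# lets you substitute another command name (defaults to pdfocr).
ocr_dir() {
    dir=${1:-.}
    for infile in "$dir"/*.pdf; do
        # Skip the literal pattern when the directory has no PDFs.
        [ -e "$infile" ] || continue
        # Skip files produced by an earlier run of this function.
        case "$infile" in *-ocr.pdf) continue ;; esac
        outfile="${infile%.pdf}-ocr.pdf"
        "${PDFOCR_CMD:-pdfocr}" -i "$infile" -o "$outfile" ||
            echo "failed: $infile" >&2
    done
}
```

After sourcing the snippet, something like ocr_dir ~/scans would leave report.pdf untouched and write report-ocr.pdf next to it.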
Bonus Tip: Creating a PDF from a bunch of scanned images
It happens … some idiot who doesn’t know how to use a scanner sends you a series of JPEGs, or you were desperate and had to use a camera phone to scan some documents. Here’s what you do. Install ImageMagick:
sudo apt-get install imagemagick
Convert the images to a PDF:
convert *.jpg out.pdf
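The shell expands *.jpg in alphabetical order, so you will need to make sure the file names sort in page order; usually they are numbered by the camera or scanner anyway. Watch out for numbers without zero padding: scan10.jpg sorts before scan2.jpg.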
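When the numbering is not zero-padded, one way around the ordering problem is to sort the names numerically before handing them to convert. The sketch below assumes GNU sort (its -V "version sort" option, which Ubuntu ships) and file names without newlines; list_pages, images_to_pdf and the CONVERT_CMD override are hypothetical names made up for this example.

```shell
#!/bin/sh
# Print the .jpg files of a directory in numeric-aware order, one per line.
# Assumes GNU sort (-V); file names containing newlines are not supported.
list_pages() {
    dir=${1:-.}
    for f in "$dir"/*.jpg; do
        [ -e "$f" ] && printf '%s\n' "$f"
    done | sort -V
}

# Hypothetical wrapper: pass the pages to convert in that order.
# CONVERT_CMD is an assumption of this sketch (defaults to convert).
images_to_pdf() {
    dir=${1:-.}
    out=${2:-out.pdf}
    old_ifs=$IFS
    IFS='
'
    set -f                        # no globbing while splitting the list
    set -- $(list_pages "$dir")   # one positional parameter per file name
    set +f
    IFS=$old_ifs
    if [ "$#" -eq 0 ]; then
        echo "no .jpg files in $dir" >&2
        return 1
    fi
    "${CONVERT_CMD:-convert}" "$@" "$out"
}
```

For example, images_to_pdf ~/scans scans.pdf would place scan2.jpg before scan10.jpg. It is worth running list_pages on its own first to check that the order looks right.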