How To Make scanned PDFs searchable (OCR) using pdfocr

by dotsha747 · Published 7 January 2014 · Updated 4 December 2021

This is a very useful tip, especially if you store scanned PDFs in Google Drive: it lets you search across all your Google Drive folders for a particular phrase within a PDF.

A scanned PDF usually contains only an image of each page, not the text itself (unlike a PDF generated from, say, OpenOffice). This tip runs Optical Character Recognition on the images inside the PDF and adds a searchable layer of text. The embedded text is not visible, so the pages still look and print exactly like the original scan.
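If you are not sure whether a PDF already has a text layer, one rough check is to try extracting text from it. The sketch below assumes pdftotext (from the poppler-utils package) is installed. That tool and the has_text_layer name are assumptions for illustration, not part of pdfocr.

```shell
# Hypothetical helper: succeeds if the PDF yields any non-whitespace text.
# Assumes pdftotext (poppler-utils) is on the PATH; it is not part of pdfocr.
has_text_layer() {
    # "-" sends the extracted text to stdout; tr strips whitespace,
    # including the form feeds pdftotext emits between pages.
    text=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')
    [ -n "$text" ]
}
```

For example: has_text_layer scan.pdf && echo "already searchable". A scan with no text layer should come back empty, which is the case this tip is for.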

In my few tests it worked well, regardless of the document's language. The searchable text is not always positioned exactly over the words it matches, but since that layer is invisible anyway, it's not a problem.

To install (on Ubuntu 13.10):
sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr

To convert a document:
pdfocr -i infile.pdf -o outfile.pdf
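If you have a whole folder of scans, the same command can be wrapped in a loop. This is only a sketch: the ocr_all name and the NAME-ocr.pdf output naming are assumptions for illustration, and the only pdfocr options used are the -i and -o shown above.

```shell
# Hypothetical helper: run pdfocr on every PDF in a directory.
# Results are written next to the originals as NAME-ocr.pdf (assumed naming).
ocr_all() {
    dir=${1:-.}
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                    # glob matched nothing
        case "$f" in *-ocr.pdf) continue ;; esac   # skip our own output
        out="${f%.pdf}-ocr.pdf"
        [ -e "$out" ] && continue                  # already converted
        pdfocr -i "$f" -o "$out" || echo "failed: $f" >&2
    done
}
```

Call it as ocr_all ~/Scans (or with no argument for the current directory). The originals are left untouched, so you can compare before deleting anything.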

Bonus Tip: Creating a PDF from a bunch of scanned images

It happens … some idiot who doesn’t know how to use a scanner sends you a series of JPEGs or you were desperate and had to use a cameraphone to scan some documents. Here’s what you do:

Install ImageMagick
sudo apt-get install imagemagick

Convert images to PDF
convert *.jpg out.pdf

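You will need to make sure the images are in page order. The shell expands *.jpg alphabetically, so the zero-padded numbers a camera or scanner usually assigns (IMG_0001.jpg, IMG_0002.jpg) come out right, but unpadded names do not: page10.jpg sorts before page2.jpg.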
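If the file names are not zero-padded, one way around the ordering problem is to sort them numerically before handing them to convert. A minimal sketch, assuming GNU sort and xargs (the defaults on Ubuntu); the list_scans and scans_to_pdf names are made up for illustration.

```shell
# Print the JPEGs in a directory in natural (number-aware) order, one per line.
# sort -V is GNU "version sort": scan2.jpg comes before scan10.jpg.
list_scans() {
    dir=${1:-.}
    printf '%s\n' "$dir"/*.jpg | sort -V
}

# Hypothetical wrapper: pass the ordered list to ImageMagick's convert.
# xargs appends the file names after "$out", so inside sh -c the output
# file is $0 and the images are "$@". Assumes at least one .jpg exists
# and that no file name contains a newline.
scans_to_pdf() {
    dir=$1
    out=$2
    list_scans "$dir" | xargs -d '\n' sh -c 'convert "$@" "$0"' "$out"
}
```

Run list_scans . first to eyeball the order, then scans_to_pdf . out.pdf to build the PDF.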

This post was originally published publicly on Google+ at 2014-01-08 13:53:34+0800
