{"id":3435,"date":"2014-01-07T16:00:00","date_gmt":"2014-01-07T16:00:00","guid":{"rendered":"http:\/\/localhost:8105\/?p=3435"},"modified":"2021-12-04T21:11:19","modified_gmt":"2021-12-04T21:11:19","slug":"how-to-make-scanned-pdfs-searchable-ocr-using-pdfocr","status":"publish","type":"post","link":"https:\/\/blog.shahada.abubakar.net\/?p=3435","title":{"rendered":"How To Make scanned PDFs searchable (OCR) using pdfocr"},"content":{"rendered":"<div id=\"content\"><b>How To: Make scanned PDFs searchable (OCR) using pdfocr<\/b><p>This is a very useful tip, especially if you store <i>scanned<\/i> PDFs in Google Drive: it lets you search across all your Google Drive folders for a particular phrase within a PDF.<\/p>\n<p>A scanned PDF usually contains only an image of each page, not the text (unlike a PDF generated from, say, OpenOffice). This tip performs Optical Character Recognition (OCR) on the images inside the PDF and adds a searchable layer of text. The embedded text is not visible, so the document still looks and prints exactly like the original scan.<\/p>\n<p>In my few tests it worked well, regardless of language. 
The searchable text is not always positioned exactly over the matching words, but since that layer is invisible anyway, it&#8217;s not a problem.<\/p>\n<p><i>To install (on Ubuntu 13.10)<\/i>:<br \/>\nsudo add-apt-repository ppa:gezakovacs\/pdfocr<br \/>\nsudo apt-get update<br \/>\nsudo apt-get install pdfocr<\/p>\n<p><i>To convert a document<\/i>:<br \/>\npdfocr -i infile.pdf -o outfile.pdf<\/p>\n<p><b>Bonus Tip: Creating a PDF from a bunch of scanned images<\/b><\/p>\n<p>It happens &#8230; some idiot who doesn&#8217;t know how to use a scanner sends you a series of JPEGs, or you were desperate and had to use a camera phone to scan some documents. Here&#8217;s what you do:<\/p>\n<p><i>Install ImageMagick<\/i><br \/>\nsudo apt-get install imagemagick<\/p>\n<p><i>Convert images to PDF<\/i><br \/>\nconvert *.jpg out.pdf<\/p>\n<p>The shell expands *.jpg in alphabetical order, so make sure the file names sort in page order; usually they are numbered by the camera or scanner anyway.<\/p>\n<hr \/>\n<p>LINK:<br \/>\n<a href=\"http:\/\/ubuntuforums.org\/showthread.php?t=1456756\"><img decoding=\"async\" src=\"https:\/\/blog.shahada.abubakar.net\/wp-content\/uploads\/2014\/01\/1.gif\" \/><br \/>\nHowto: Make scanned PDFs searchable (OCR) using pdfocr<\/a><\/p>\n<p><i>This post was originally <a href=\"https:\/\/plus.google.com\/+shahadaabubakar\/posts\/MRiq6Ko475A\">published<\/a> publicly on <a href=\"http:\/\/plus.google.com\">Google+<\/a> at 2014-01-08 13:53:34+0800<\/i><\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>How To: Make scanned PDFs searchable (OCR) using pdfocr. This is a very useful tip, especially if you are storing scanned PDFs inside Google Drive: it allows you to search across all your 
Google&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":3953,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[116],"tags":[],"class_list":["post-3435","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-linux"],"_links":{"self":[{"href":"https:\/\/blog.shahada.abubakar.net\/index.php?rest_route=\/wp\/v2\/posts\/3435","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.shahada.abubakar.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.shahada.abubakar.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.shahada.abubakar.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.shahada.abubakar.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3435"}],"version-history":[{"count":2,"href":"https:\/\/blog.shahada.abubakar.net\/index.php?rest_route=\/wp\/v2\/posts\/3435\/revisions"}],"predecessor-version":[{"id":6543,"href":"https:\/\/blog.shahada.abubakar.net\/index.php?rest_route=\/wp\/v2\/posts\/3435\/revisions\/6543"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.shahada.abubakar.net\/index.php?rest_route=\/wp\/v2\/media\/3953"}],"wp:attachment":[{"href":"https:\/\/blog.shahada.abubakar.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3435"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.shahada.abubakar.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3435"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.shahada.abubakar.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3435"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}