OT: Help getting PDF to OCR or searchable form

Bob Rasmussen ras at anzio.com
Mon Sep 9 20:26:48 PDT 2019


It might help to understand what's inside the PDF and how that does or 
doesn't make it searchable.

If you scan a document and save it as an "image PDF" or similar language, 
the PDF contains one image per page (maybe color, maybe gray, maybe 
black-and-white; various densities). It is not searchable.

OTOH, if you say you want to save a "searchable PDF" or a "text document" 
(or similar), the software will ALSO do OCR on each page, and lay out text 
over the graphic image of each page, but such text will be INVISIBLE. 
Because it's spatially equivalent to the image, you can highlight and then 
copy-to-clipboard the text, based on position, interactively, using a 
variety of PDF viewers. You're actually copying the invisible text, as 
text, to the clipboard (in Windows, say).
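
To make the image-only vs. searchable distinction concrete, here is a rough Python sketch, not a definitive test. It assumes you have already run some text extractor over the PDF and that the extractor ends each page with a form-feed character (Poppler's pdftotext does this, as far as I know). The function names are made up for illustration. A page that yields nothing but whitespace is most likely just an image with no text layer.

```python
def split_pages(extracted: str) -> list[str]:
    """Split extractor output into pages, assuming a form feed ends each page."""
    pages = extracted.split("\f")
    # A trailing form feed leaves one empty string at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def has_text_layer(extracted: str) -> list[bool]:
    """Return one flag per page: True if any non-whitespace text was extracted."""
    return [bool(page.strip()) for page in split_pages(extracted)]
```

A file whose flags are all False is a candidate for OCR; a mix of True and False suggests some pages were typed and others scanned.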

On the third hand, if you run software that creates a PDF from some kind 
of a text file, it contains the text but not the images. The file MAY also 
contain some metadata that conveys more structure to the text, such as 
cells in a table, for instance.

I don't believe there's a built-in keyword-search capability in the file 
itself, per se; searching depends on whatever text the file contains.

Any PDF-to-text extraction program can pull out the text on each page. 
Probably it will localize it to a particular page number; probably not 
more localized or structured than that.
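
As a sketch of that page-level localization, the Python below shells out to an extractor and reports which pages mention a keyword. It assumes Poppler's pdftotext is installed, that "-" as the output name sends text to stdout, and that pages are separated by form feeds. The function names are hypothetical, and none of this comes from the original thread.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run pdftotext (assumed installed) and return the text it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


def find_keyword(extracted: str, keyword: str) -> list[int]:
    """Return 1-based page numbers whose text contains keyword (case-insensitive)."""
    pages = extracted.split("\f")  # assumption: a form feed separates pages
    needle = keyword.lower()
    return [n for n, page in enumerate(pages, start=1) if needle in page.lower()]
```

Something like find_keyword(extract_text("contract.pdf"), "lease") would then give the page numbers to look at, where contract.pdf is only a placeholder name. On an image-only PDF it returns an empty list, because there is no text to match.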

It sounds like what you are looking for is a way to OCR an existing 
image-only PDF, or more to the point, a large batch of such PDF files. I 
can't comment on whether you've found such an animal. I will say that OCR 
is typically not built into free PDF viewers (the Tesseract engine 
mentioned below is a free exception), but it *might* be available at the 
point the paper documents are originally scanned.
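
For the batch case, a minimal Python sketch might look like the following. It assumes the pdfsandwich tool mentioned further down in this thread is installed (along with Tesseract), that it takes the input PDF as its argument, and that it writes its result next to the input with an "_ocr" suffix, which is my understanding of its default behavior. The function names are invented for illustration.

```python
import subprocess
from pathlib import Path


def ocr_commands(directory: str) -> list[list[str]]:
    """Build one pdfsandwich command per PDF, skipping earlier OCR outputs."""
    pdfs = sorted(Path(directory).glob("*.pdf"))
    # Assumption: files ending in "_ocr" were produced by a previous run.
    return [["pdfsandwich", str(p)] for p in pdfs if not p.stem.endswith("_ocr")]


def run_batch(directory: str) -> None:
    """Run each command in turn, stopping at the first failure."""
    for command in ocr_commands(directory):
        subprocess.run(command, check=True)
```

Building the command list separately from running it makes it easy to print and inspect the list before committing a slow machine to hours of OCR.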

I hope that helps on background...

On Mon, 9 Sep 2019, Laura Brody via Filepro-list wrote:

> Yes, I see that. Now that I know that pdfsandwich and tesseract will run
> on the Raspberry Pi and do what I need, I have a clear idea what I need to
> do to get searchable PDFs out of the files that I have. Thank you for
> pointing me in the right direction. You saved me a boatload of time and
> aggravation.
>
> Laura Brody
>
> On Mon, Sep 9, 2019 at 10:38 PM Cesar Baquerizo <ces at cescom.com> wrote:
>
>> Yw. You’ll also need tesseract. They are two different Sw. Let me know how
>> it goes.
>>
>> ------------------------------
>> *From:* Laura Brody <laura.k.brody at gmail.com>
>> *Sent:* Monday, September 9, 2019 10:35 PM
>> *To:* Cesar Baquerizo; Filepro_List
>> *Subject:* Re: OT: Help getting PDF to OCR or searchable form
>>
>> I found a list of Linux flavors that pdfsandwich has been ported to and
>> Raspbian Linux was on the list!
>>
>> I will be working on this project tomorrow. Thank you so much for this
>> lead. I don't think I would have found it by myself.
>>
>> Laura Brody
>>
>> On Mon, Sep 9, 2019 at 10:27 PM Laura Brody <laura.k.brody at gmail.com>
>> wrote:
>>
>>> This is very interesting.
>>>
>>> The only Linux box I have running at the moment is a Raspberry Pi 3 B+. I
>>> have a 64GB SD card available, so space isn't an issue. Any idea if it
>>> will work on it?
>>>
>>> Laura Brody
>>>
>>> On Mon, Sep 9, 2019 at 9:54 PM Cesar Baquerizo <ces at cescom.com> wrote:
>>>
>>>> Look up Tesseract and pdfsandwich. They may help you.
>>>>
>>>> ------------------------------
>>>> *From:* Filepro-list <filepro-list-bounces+ces=
>>>> cescom.com at lists.celestial.com> on behalf of Laura Brody via
>>>> Filepro-list <filepro-list at lists.celestial.com>
>>>> *Sent:* Monday, September 9, 2019 9:50 PM
>>>> *To:* Filepro_List
>>>> *Cc:* Laura Brody
>>>> *Subject:* Re: OT: Help getting PDF to OCR or searchable form
>>>>
>>>> Additional information....
>>>>
>>>> I talked to the user and got some history...
>>>>
>>>> The user scanned in legal documents. Saved the images as pages in a PDF.
>>>> That is why I can't search on keywords for most of the files. A few files
>>>> were typed up and then exported as PDF. Most are images of the pages. That
>>>> means that OCR has to be part of the solution.
>>>>
>>>> I discovered that Adobe Acrobat Reader has a setting to search all PDFs in
>>>> a directory for keywords. The problem is that these files don't contain
>>>> text. They contain images of text. Adobe can't search images and find
>>>> keywords.
>>>>
>>>> Laura Brody
>>>>
>>>> On Mon, Sep 9, 2019 at 8:03 PM Laura Brody <laura.k.brody at gmail.com>
>>>> wrote:
>>>>
>>>>> I am hoping that one of you has solved this problem before.....
>>>>>
>>>>> I have over a thousand pages of text in a dozen or so PDF files. Most
>>>>> files are "read-only" and I can not do Ctrl-F to search for keywords. I
>>>>> would like to be able to OCR the files and put everything into one file
>>>>> that is searchable. Or is there a utility that will search all of the
>>>>> PDFs in a directory for a keyword?
>>>>>
>>>>> Suggestions anyone?
>>>>>
>>>>> Laura Brody
>>>>>
>>>> _______________________________________________
>>>> Filepro-list mailing list
>>>> Filepro-list at lists.celestial.com
>>>> Subscribe/Unsubscribe/Subscription Changes
>>>> http://mailman.celestial.com/mailman/listinfo/filepro-list
>>>>
>>>

Regards,
....Bob Rasmussen,   President,   Rasmussen Software, Inc.

personal e-mail: ras at anzio.com
  company e-mail: rsi at anzio.com
           voice: (US) 503-624-0360 (9:00-6:00 Pacific Time)
             fax: (US) 503-624-0760
             web: http://www.anzio.com
  street address: Rasmussen Software, Inc.
                  10240 SW Nimbus, Suite L9
                  Portland, OR  97223  USA

