OT: Help getting PDF to OCR or searchable form
Bob Rasmussen
ras at anzio.com
Mon Sep 9 20:26:48 PDT 2019
It might help to understand what's inside the PDF and how that does or
doesn't make it searchable.
If you scan a document and save it as an "image PDF" or similar language,
the PDF contains one image per page (maybe color, maybe gray, maybe
black-and-white; various densities). It is not searchable.
OTOH, if you say you want to save a "searchable PDF" or a "text document"
(or similar), the software will ALSO do OCR on each page, and lay out text
over the graphic image of each page, but such text will be INVISIBLE.
Because it's spatially equivalent to the image, you can highlight and then
copy-to-clipboard the text, based on position, interactively, using a
variety of PDF viewers. You're actually copying the invisible text, as
text, to the clipboard (in Windows, say).
On the third hand, if you run software that creates a PDF from some kind
of a text file, it contains the text but not the images. The file MAY also
contain some metadata that conveys more structure to the text, such as
cells in a table, for instance.
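One rough way to tell these cases apart, once some tool has pulled the text out page by page: a page that yields no text at all is almost certainly image-only. The sketch below assumes you already have one string per page from whatever extraction tool you use; the function names are made up for illustration. Note that it cannot distinguish a scanned PDF with invisible OCR text from a PDF generated directly from text, since both yield text.

```python
def classify_pages(page_texts):
    """Label each page by whether any text could be extracted from it."""
    return ["text" if text.strip() else "image-only" for text in page_texts]


def summarize_pdf(page_texts):
    """Summarize a whole PDF from its per-page extracted text."""
    labels = classify_pages(page_texts)
    if not labels:
        return "empty"
    if all(label == "image-only" for label in labels):
        return "image-only"      # nothing to search; OCR would be needed
    if all(label == "text" for label in labels):
        return "has text"        # searchable, visible or invisible text
    return "mixed"               # some pages scanned, some with text
```

A file that comes back "image-only" or "mixed" is a candidate for OCR; one that comes back "has text" should already respond to a search in a PDF viewer.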
I don't believe there's a keyword capability, per se.
Any PDF-to-text extraction program can pull out the text on each page.
Probably it will localize it to a particular page number; probably not
more localized or structured than that.
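As an illustration of page-level localization, here is a minimal sketch that assumes the pdftotext utility (from Poppler or Xpdf) is installed. pdftotext separates pages with a form-feed character, so splitting on it gives one chunk per page. The function names are my own, and this only finds text that is actually in the PDF; it does nothing for image-only pages.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return all text; "-" sends the output to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def pages_containing(text, keyword):
    """Return 1-based page numbers whose text contains the keyword."""
    pages = text.split("\f")          # form feed marks each page boundary
    needle = keyword.lower()          # case-insensitive match
    return [number for number, page in enumerate(pages, start=1)
            if needle in page.lower()]
```

Usage would be along the lines of pages_containing(extract_text("lease.pdf"), "easement"), where the file name is hypothetical. Looping that over every PDF in a directory gives a crude keyword search across files.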
It sounds like what you are looking for is to OCR an existing image-only
PDF. More to the point, a large batch of PDF files. I can't comment on
whether you've found such an animal. I will say that OCR capabilities
are typically not available in free software, but they *might* be
available at the point the paper documents are originally scanned.
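For the batch case, the quoted thread below names pdfsandwich (which drives tesseract) as a free option. A minimal sketch of running it over a directory follows, assuming both tools are installed and that pdfsandwich writes its result next to the input as NAME_ocr.pdf, which I understand to be its default. The function names and the skip rule are my own; treat the whole thing as untested against your files.

```python
import subprocess
from pathlib import Path


def build_ocr_commands(pdf_paths, lang="eng"):
    """Build one pdfsandwich command per PDF, skipping earlier OCR output."""
    commands = []
    for path in pdf_paths:
        path = Path(path)
        # Skip non-PDFs and files that look like a previous run's output.
        if path.suffix.lower() != ".pdf" or path.stem.endswith("_ocr"):
            continue
        commands.append(["pdfsandwich", "-lang", lang, str(path)])
    return commands


def ocr_directory(directory, lang="eng"):
    """OCR every PDF in a directory, one at a time; stop on the first error."""
    pdfs = sorted(Path(directory).glob("*.pdf"))
    for command in build_ocr_commands(pdfs, lang):
        subprocess.run(command, check=True)
```

Building the command list separately from running it makes it easy to print the commands first and check them before committing to a long OCR run.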
I hope that helps on background...
On Mon, 9 Sep 2019, Laura Brody via Filepro-list wrote:
> Yes, I see that. Now that I know that pdfsandwich and tesseract will run
> on the Raspberry Pi and do what I need, I have a clear idea what I need to
> do to get searchable PDFs out of the files that I have. Thank you for
> pointing me in the right direction. You saved me a boatload of time and
> aggravation.
>
> Laura Brody
>
> On Mon, Sep 9, 2019 at 10:38 PM Cesar Baquerizo <ces at cescom.com> wrote:
>
>> You're welcome. You’ll also need tesseract. They are two different pieces
>> of software. Let me know how it goes.
>>
>> ------------------------------
>> *From:* Laura Brody <laura.k.brody at gmail.com>
>> *Sent:* Monday, September 9, 2019 10:35 PM
>> *To:* Cesar Baquerizo; Filepro_List
>> *Subject:* Re: OT: Help getting PDF to OCR or searchable form
>>
>> I found a list of Linux flavors that pdfsandwich has been ported to and
>> Raspbian Linux was on the list!
>>
>> I will be working on this project tomorrow. Thank you so much for this
>> lead. I don't think I would have found it by myself.
>>
>> Laura Brody
>>
>> On Mon, Sep 9, 2019 at 10:27 PM Laura Brody <laura.k.brody at gmail.com>
>> wrote:
>>
>>> This is very interesting.
>>>
>>> The only Linux box I have running at the moment is Raspberry Pi 3 B+. I
>>> have 64GB SD card available, so space isn't an issue. Any idea if it will
>>> work on it?
>>>
>>> Laura Brody
>>>
>>> On Mon, Sep 9, 2019 at 9:54 PM Cesar Baquerizo <ces at cescom.com> wrote:
>>>
>>>> Look up Tesseract and pdfsandwich. It may help you.
>>>>
>>>> ------------------------------
>>>> *From:* Filepro-list <filepro-list-bounces+ces=
>>>> cescom.com at lists.celestial.com> on behalf of Laura Brody via
>>>> Filepro-list <filepro-list at lists.celestial.com>
>>>> *Sent:* Monday, September 9, 2019 9:50 PM
>>>> *To:* Filepro_List
>>>> *Cc:* Laura Brody
>>>> *Subject:* Re: OT: Help getting PDF to OCR or searchable form
>>>>
>>>> Additional information....
>>>>
>>>> I talked to the user and got some history...
>>>>
>>>> The user scanned in legal documents and saved the images as pages in a
>>>> PDF. That is why I can't search on keywords for most of the files. A few
>>>> files were typed up and then exported as PDF; most are images of the
>>>> pages. That means that OCR has to be part of the solution.
>>>>
>>>> I discovered that Adobe Acrobat Reader has a setting to search all PDFs
>>>> in a directory for keywords. The problem is that these files don't
>>>> contain text. They contain images of text. Adobe can't search images and
>>>> find keywords.
>>>>
>>>> Laura Brody
>>>>
>>>> On Mon, Sep 9, 2019 at 8:03 PM Laura Brody <laura.k.brody at gmail.com>
>>>> wrote:
>>>>
>>>>> I am hoping that one of you has solved this problem before.....
>>>>>
>>>>> I have over a thousand pages of text in a dozen or so PDF files. Most
>>>>> files are "read-only" and I can not do Ctrl-F to search for keywords. I
>>>>> would like to be able to OCR the files and put everything into one file
>>>>> that is searchable. Or is there a utility that will search all of the
>>>>> PDFs in a directory for a keyword?
>>>>>
>>>>> Suggestions anyone?
>>>>>
>>>>> Laura Brody
>>>>>
>>>> _______________________________________________
>>>> Filepro-list mailing list
>>>> Filepro-list at lists.celestial.com
>>>> Subscribe/Unsubscribe/Subscription Changes
>>>> http://mailman.celestial.com/mailman/listinfo/filepro-list
>>>>
>>>
Regards,
....Bob Rasmussen, President, Rasmussen Software, Inc.
personal e-mail: ras at anzio.com
company e-mail: rsi at anzio.com
voice: (US) 503-624-0360 (9:00-6:00 Pacific Time)
fax: (US) 503-624-0760
web: http://www.anzio.com
street address: Rasmussen Software, Inc.
10240 SW Nimbus, Suite L9
Portland, OR 97223 USA