OT: OCR / PDF Parsing

Bob Rasmussen ras at anzio.com
Wed Jan 12 17:50:04 PST 2022


Your results will depend on the quality of your OCR library and the 
quality (mainly resolution, i.e. dot density) of the scan-to-PDF conversion.

Try Adobe Acrobat (not Reader). It can take an image-only PDF and OCR 
it with good accuracy. Then you can use Save As to write a new PDF, and/or 
export it to plain text (or other formats).

I don't know about programmability.
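
On the dot-density point, here is a minimal sketch of a scriptable route 
outside Acrobat. It assumes the Poppler tool pdftoppm and the Tesseract 
command-line program are installed; neither is named in this thread, so 
treat both as my assumption. The file name scan.pdf, the function names, and 
the "page" output prefix are hypothetical. The idea is simply to rasterize 
each page at a higher resolution (300 dpi here, a common starting point 
rather than a guaranteed fix) before handing the image to the OCR engine.

```python
import shutil
import subprocess


def page_pixels(width_in, height_in, dpi):
    """Pixel size of a page rasterized at the given dots per inch."""
    return round(width_in * dpi), round(height_in * dpi)


def build_ocr_commands(pdf_path, page=1, dpi=300, lang="eng", prefix="page"):
    """Build (but do not run) the rasterize and OCR commands for one page.

    The tool choice (pdftoppm + tesseract) is an assumption, not something
    the thread prescribes.
    """
    base = f"{prefix}-{page}"
    # pdftoppm: -r sets the resolution, -f/-l select the page range,
    # -singlefile writes exactly <base>.png with no page-number suffix.
    rasterize = [
        "pdftoppm", "-r", str(dpi), "-png",
        "-f", str(page), "-l", str(page), "-singlefile",
        pdf_path, base,
    ]
    # tesseract <image> <outputbase> -l <lang> writes <outputbase>.txt
    recognize = ["tesseract", f"{base}.png", base, "-l", lang]
    return rasterize, recognize


def ocr_page(pdf_path, **kwargs):
    """Run both commands and return the recognized text for one page."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} is not installed or not on PATH")
    rasterize, recognize = build_ocr_commands(pdf_path, **kwargs)
    subprocess.run(rasterize, check=True)
    subprocess.run(recognize, check=True)
    with open(recognize[2] + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

For scale, page_pixels(8.5, 11, 300) gives 2550 x 3300 pixels for a US 
Letter page, four times the pixel count of the same page at 150 dpi. 
Calling ocr_page("scan.pdf", page=2, dpi=400) would be the way to retry a 
page that reads badly, though rasterizing cannot add detail beyond what the 
original scan captured.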

On Wed, 12 Jan 2022, Jose Lerebours via Filepro-list wrote:

> I have a GSA that wants data extracted from PDF documents, most of which
> are scanned documents saved as PDF, which in essence makes them images
> saved as PDF.
>
> I have written code in PHP to save the PDF to PNG and extract text from
> the PNG, but this is not proving to be reliable, since lots of characters
> are read wrong or not read at all.
>
> It is like pulling teeth: an "I want this done, but do not ask me to get
> you 'true' PDFs, the scanned documents are all I can get" type of scenario.
>
> So, my question is: is anyone here successfully extracting data from
> scanned documents and, if so, what are you using?
>
> Regards,
>
>
> -- 
> Jose Lerebours
> 954-559-7186
> https://www.asisuites.com
> Accounting - Retail - Wholesale - Distribution
> Manufacturing - Warehousing - Transportation - eCommerce - Web Development

Regards,
....Bob Rasmussen,   President,   Rasmussen Software, Inc.

personal e-mail: ras at anzio.com
  company e-mail: rsi at anzio.com
           voice: (US) 503-624-0360 (9:00-6:00 Pacific Time)
             fax: (US) 503-624-0760
             web: http://www.anzio.com
  street address: Rasmussen Software, Inc.         NEW ADDRESS AS OF AUGUST 1, 2020
                  8835 SW Canyon Lane, Suite 401
                  Portland, OR  97225  USA

