[PLUG] PDF-1.5 docs not searchable

Tomas Kuchta tomas.kuchta.lists at gmail.com
Sat Jul 24 18:08:31 UTC 2021


They are not searchable because they do not contain any text to search.
Typically, they contain only images, e.g. scanned pages.
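
A minimal sketch of one way to guess whether a PDF is image-only, using
only the Python standard library. It is a heuristic, not a real PDF parser:
it assumes that a file with image XObjects but no /Font entries carries no
text. Because PDF-1.5 may pack its dictionaries into compressed object
streams, it also inflates any Flate streams it can. The function names are
made up for illustration.

```python
import re
import zlib

# Matches the body of every "stream ... endstream" section in the raw file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def pdf_bytes_expanded(data: bytes) -> bytes:
    """Return the raw PDF bytes plus every stream that inflates cleanly."""
    parts = [data]
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates trailing bytes after the Flate data.
            parts.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # not Flate data (e.g. a JPEG image); skip it
    return b"\n".join(parts)


def looks_image_only(data: bytes) -> bool:
    """Heuristic: images are present but no font is referenced anywhere."""
    expanded = pdf_bytes_expanded(data)
    has_font = re.search(rb"/Font\b", expanded) is not None
    has_image = re.search(rb"/Subtype\s*/Image\b", expanded) is not None
    return has_image and not has_font
```

Usage would be along the lines of
looks_image_only(open("some.pdf", "rb").read()). Encrypted files or
unusual stream filters will defeat it, so treat the answer as a hint only.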

The way I deal with it: I OCR the image, generate a text document, and
place that text in a layer under the image in the output PDF.

Having the text under the image layer preserves the original look of the
PDF while allowing for search and select.
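
To illustrate the layering idea, here is a rough sketch of a page content
stream that draws the OCR text first and paints the page image on top of
it. It is not the commercial tool's method, only an example of the
principle. The resource names /F1 and /Im0, the word tuple layout, and the
function names are assumptions, the words and coordinates would have to
come from an OCR engine, and only ASCII text is handled.

```python
from typing import Iterable, Tuple

# (text, x, y, font size); x and y in PDF points, origin at bottom-left.
Word = Tuple[str, float, float, float]


def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def page_content(words: Iterable[Word], width: float, height: float) -> str:
    """Build a content stream: text objects first, full-page image last."""
    ops = ["BT"]  # begin text object
    for text, x, y, size in words:
        ops.append(f"/F1 {size:g} Tf")         # select font and size
        ops.append(f"1 0 0 1 {x:g} {y:g} Tm")  # move to the word's position
        ops.append(f"({escape_pdf_string(text)}) Tj")
    ops.append("ET")  # end text object
    # Painted last, so the image covers the text drawn before it.
    ops.append(f"q {width:g} 0 0 {height:g} 0 0 cm /Im0 Do Q")
    return "\n".join(ops)


if __name__ == "__main__":
    print(page_content([("Hello (world)", 72, 700, 11)], 612, 792))
```

Since the image is painted after the text, a viewer shows only the scan,
while the text is still present in the page for search and selection.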

I have seen PDFs with the text over the image, obscuring it, as well as
various attempts at making the text over the image invisible.

Of course, OCR is not perfect, and neither is keeping the text in the exact
position under the image. It mostly works for plain text, not so much for
data extraction from tables, etc.

I do not believe that there is an OK-ish free SW solution to this. I use
commercial SW to do that. It works, but I cannot publicly recommend it due
to the vendor's nasty commercial behavior: no respect for privacy, and no
sale, just licensing with built-in obsolescence.

Tomas

On Fri, Jul 23, 2021, 16:18 Rich Shepard <rshepard at appl-ecosys.com> wrote:

> I've encountered a few PDF-1.5 docs that are not searchable using xpdf,
> mupdf, okular, or MasterPDFEditor. Perhaps they're scanned and I don't know
> how to determine if they are.
>
> My web searches found nothing relevant; my search terms might be
> inefficient.
>
> Has anyone else experienced this?
>
> Rich