How can I grep in PDF files?

  • Is there a way to search PDF files using the power of grep, without converting to text first in Ubuntu?

    I think you need to run it through pdftotext first in order to get usable results back...

    For people coming here via search: if you are willing to convert to text files first, have a look at How to search contents of multiple pdf files?

  • enzotib (correct answer)

    9 years ago

    Install the package pdfgrep, then use the command:

    find /path -iname '*.pdf' -exec pdfgrep pattern {} +
    

    ——————

    Simplest way to do that:

    pdfgrep 'pattern' *.pdf
    pdfgrep 'pattern' file.pdf 
    

    This works on Mac OS X (Mavericks) as well. Install it using brew. Simple. Thanks.

    Out of curiosity I checked the source of pdfgrep, and it uses poppler to extract strings from the PDF. That is almost exactly what @wag's answer does, only page by page rather than, presumably, over the entire document.

    `pdfgrep` also has a recursive flag. So this answer could perhaps be reduced to: `pdfgrep -R pattern /path/`. Though it might be less effective if it goes through every file even if it isn't a PDF. And I notice that it has issues with international characters such as å, ä and ö.

    Actually, the `-n` option is a plus for pdfgrep, as it lets you include the page number in the output (which might be helpful for further processing).
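
    As a rough sketch of the "further processing" mentioned above, the snippet below tallies matches per file and page. It does not call pdfgrep: the sample lines are made up, and the `file:page:text` layout is an assumption about what `pdfgrep -H -n` prints, so check it against your own output first.

```shell
#!/bin/sh
# Hypothetical sample of `pdfgrep -H -n pattern *.pdf` output, assumed to
# have the form file:page:matched text. Not produced by a real run.
sample='report.pdf:3:the pattern appears here
report.pdf:3:pattern again on the same page
report.pdf:17:one more pattern
thesis.pdf:2:pattern in another file'

# Count matches per file and page: split on ":" and key on fields 1 and 2.
summary=$(printf '%s\n' "$sample" |
  awk -F: '{ n[$1 ":" $2]++ } END { for (k in n) print k, n[k] }' |
  sort)

printf '%s\n' "$summary"
```

    Splitting on `:` breaks for file names that contain a colon, so treat this as a starting point only.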

    This answer would be easier to use if it explained which bits of the command are meant to be copied literally and which are placeholders. What's `pattern`? What's `{}`? What's up with the ` +`? I have no idea upon first reading... so off to the manpage I go, I suppose.

    @MarkAmery This answer is unnecessarily complex because it uses `find`. The usage is simply `pdfgrep 'pattern' file.pdf`. The `{}` is just a way to drop the file name in from `find`.
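
    To make the placeholders in the `find` command concrete, here is a minimal sketch that swaps `pdfgrep` for `echo`, so it runs without pdfgrep installed and only shows how often `find` would start the command. The directory and file names are made up for illustration; `pattern` stands for whatever regular expression you want to search for, and `/path` in the answer is the directory to search.

```shell
#!/bin/sh
# Scratch directory with hypothetical, empty stand-in files.
dir=$(mktemp -d)
mkdir "$dir/sub"
touch "$dir/a.pdf" "$dir/sub/B.PDF" "$dir/notes.txt"

# {} is replaced by the file names find matched; the trailing + packs as
# many names as possible into one invocation. echo prints the command
# line that would be run instead of running pdfgrep.
batched=$(find "$dir" -iname '*.pdf' -exec echo pdfgrep pattern {} + |
  wc -l | tr -d ' ')

# With \; in place of +, the command runs once per matching file.
per_file=$(find "$dir" -iname '*.pdf' -exec echo pdfgrep pattern {} \; |
  wc -l | tr -d ' ')

printf 'batched runs: %s, per-file runs: %s\n' "$batched" "$per_file"
rm -rf "$dir"
```

    Here `-iname` matches both `a.pdf` and `B.PDF` regardless of case, and `notes.txt` is skipped.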

    As pdfgrep is quite slow, you can speed things up by running several instances in parallel via `find` and `xargs`: `find . -type f -iname \*.pdf -print0 | xargs -0 -P 4 -L 1 pdfgrep -H -n pattern`. How much this helps obviously depends on the number of CPUs and the available I/O.
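
    The parallel pipeline can be tried out safely with `echo` standing in for `pdfgrep`, as in the sketch below. The file names are hypothetical, and it uses `-n 1` (one argument per invocation) where the comment uses `-L 1`; that substitution is my own choice for the illustration, not something the comment prescribes.

```shell
#!/bin/sh
# Scratch directory with hypothetical, empty stand-in files.
dir=$(mktemp -d)
touch "$dir/one.pdf" "$dir/two.pdf" "$dir/with space.pdf" "$dir/skip.txt"

# -print0 and -0 pass names NUL-separated, so spaces in names survive.
# -P 4 allows up to four processes at once; -n 1 hands each one a single file.
# echo prints each command line instead of running pdfgrep.
runs=$(find "$dir" -type f -iname '*.pdf' -print0 |
  xargs -0 -P 4 -n 1 echo pdfgrep -H -n pattern | wc -l | tr -d ' ')

printf 'pdfgrep would be started %s times\n' "$runs"
rm -rf "$dir"
```

    Because the processes run concurrently, the order of their output lines is not guaranteed.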

Licensed under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM