June 29, 2018

Searching for text in PDF files with pypdf2

Portable Document Format (PDF) is wonderful as long as you only have to read it, not work with it. The format was not really meant to be edited, which is why editing PDFs is normally hard. It is a de facto worldwide standard, so you will most likely come across it when coding. Read along to see how to tackle PDF files and how to search for the information contained within them.

The code below is taken from Al Sweigart's book Automate the Boring Stuff with Python (no affiliation; it is a great book that you can read for free, and I also have the hard copy at home). I have added some error handling to his code: utf-8 encoding to avoid the UnicodeEncodeError, and strict=False to avoid the PdfReadError.

In an earlier post, we covered how to search for files on your hard drive. We are now going to search inside PDF files instead. For this we need the pypdf2 package, which you can install from your command line: py -m pip install pypdf2

I used the PDF document SHIP-ICE INTERACTION IN A CHANNEL, found on trafi.fi, as an example. According to my PDF reader, the word "ship" appears 83 times. Let's see if we can arrive at the same number with pypdf2. The code works as follows: first, we open the PDF file and read it with the PdfFileReader class.

We loop through the pages and get each page with the getPage method. For each page we extract the text with extractText, split it into lowercase words, and count the words that contain the search term. The count for the word "ship" is 82, so we do not find every occurrence. The word "ice" should appear 158 times, but pypdf2 only finds "ice" 153 times. This is not surprising, since there may be tables and similar layouts whose text pypdf2 does not extract.
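Missed text is one possible reason for the gap. Another may be the way of counting: the loop in this post adds at most one per whitespace-separated chunk, while I assume a PDF reader's search counts every occurrence. Here is a minimal sketch on a made-up sentence that compares the two. The helper names are my own and are not part of pypdf2.

```python
import re


def count_chunks_containing(text, search_word):
    """Count whitespace-separated chunks that contain the word.

    This mirrors the loop in this post: a chunk counts once,
    no matter how many times the word appears inside it.
    """
    search_word = search_word.lower()
    return sum(1 for chunk in text.lower().split() if search_word in chunk)


def count_all_occurrences(text, search_word):
    """Count every non-overlapping occurrence of the word in the text."""
    pattern = re.escape(search_word.lower())
    return len(re.findall(pattern, text.lower()))


# Made-up sample text, not taken from the report.
sample = "Ship-to-ship contact: the ship met the icebreaker in ice."

print(count_chunks_containing(sample, "ship"))  # "ship-to-ship" counts once
print(count_all_occurrences(sample, "ship"))    # "ship-to-ship" counts twice
```

If the two numbers differ for your own document, hyphenated or run-together words are a likely cause.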

We can conclude that the search works well enough. Searching through a couple of hundred PDFs would yield good enough results if you are searching for something specific. Do you have a better way to search? Please let me know in the comments. Happy coding!

The Code:

# If you get the PdfReadError: Multiple definitions in dictionary at byte, add strict=False

pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)

# If you get the UnicodeEncodeError: 'charmap' codec can't encode characters, add .encode("utf-8") to your text

text = pageObj.extractText().encode('utf-8')
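To show what the second note is about, here is a small sketch that does not need a PDF. I am assuming a Windows console that uses the cp1252 code page, which is where the 'charmap' message usually comes from. The sample string is made up.

```python
# Made-up extracted text containing an arrow, a character cp1252 cannot represent.
text = "ice \u2192 ship"

try:
    # Roughly what print() has to do on a cp1252 console.
    text.encode("cp1252")
except UnicodeEncodeError as err:
    print("Encoding failed:", err.reason)

# UTF-8 can represent every Unicode character, so this encode succeeds.
as_bytes = text.encode("utf-8")
print(as_bytes)

# Alternative: stay with str and replace whatever the console cannot show.
safe_text = text.encode("cp1252", errors="replace").decode("cp1252")
print(safe_text)
```

The utf-8 route gives you bytes, so you have to decode before comparing with a str. The replace route keeps a str but loses the characters it could not encode.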

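import PyPDF2

search_word = "ship"
search_word_count = 0

# Open in binary mode; the with block closes the file when we are done
with open('22897-WNRB_research_report_93.pdf', 'rb') as pdfFileObj:
    # strict=False avoids the PdfReadError on slightly malformed files
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
    for pageNum in range(0, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        # Encoding to utf-8 avoids the UnicodeEncodeError
        text = pageObj.extractText().encode('utf-8')
        search_text = text.lower().split()
        # Count every whitespace-separated word that contains the search word
        for word in search_text:
            if search_word in word.decode("utf-8"):
                search_word_count += 1

print("The word {} was found {} times".format(search_word, search_word_count))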
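Since the goal is often to search many PDFs, the loop above can be wrapped in a function and run over a folder. This is a minimal sketch that I have not run against the report used in this post. The function names, the extract_pages parameter, and the "reports" folder are my own assumptions. It uses the newer names PdfReader, reader.pages, and extract_text(), which replaced PdfFileReader, getPage, and extractText in later PyPDF2 releases; with an older install, swap the old names back in.

```python
from pathlib import Path


def pdf_page_texts(path):
    """Yield the extracted text of each page.

    Assumes a PyPDF2 release that provides PdfReader.
    """
    # Imported here so the counting helpers below work without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(path), strict=False)
    for page in reader.pages:
        # Fall back to an empty string if nothing was extracted.
        yield page.extract_text() or ""


def count_word(page_texts, search_word):
    """Count whitespace-separated chunks containing search_word."""
    search_word = search_word.lower()
    return sum(
        1
        for text in page_texts
        for chunk in text.lower().split()
        if search_word in chunk
    )


def search_folder(folder, search_word, extract_pages=pdf_page_texts):
    """Return {file name: count} for every .pdf file directly inside folder."""
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            results[pdf_path.name] = count_word(extract_pages(pdf_path), search_word)
        except Exception as err:
            # A damaged PDF should not stop the whole run.
            print("Skipping {}: {}".format(pdf_path.name, err))
    return results


# Hypothetical usage, assuming a folder named "reports" next to the script:
# for name, count in search_folder("reports", "ship").items():
#     print("{}: {}".format(name, count))
```

Passing the extractor in as a parameter keeps the counting separate from PyPDF2, so you can try another extraction library without touching the rest.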

