Working with PDF Documents in Python



First of all, we need to understand what a PDF document is. The term is probably familiar to you, but it may be new to some readers. PDF stands for Portable Document Format. It is simply a format for sharing content digitally. So how can we work with PDF files in Python?

Python is a very versatile programming language, and for working with PDF documents there is a third-party module called PyPDF2. The module is written purely in Python.



Installing PyPDF2 Module

C:\Users\Your Name>pip install pypdf2
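
To confirm that the install worked, a quick check like the one below can help. This is only a minimal sketch: `is_installed` is a helper name made up for this post, not something PyPDF2 provides, and it relies only on the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    # The import name is 'PyPDF2', even though pip also accepts 'pypdf2'.
    if is_installed('PyPDF2'):
        print("PyPDF2 is installed")
    else:
        print("PyPDF2 is missing, run: pip install pypdf2")
```

If the check reports the module as missing, make sure pip belongs to the same Python interpreter that runs your scripts.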


Extracting Metadata

PDF metadata is data that describes a PDF file itself rather than its contents. It often includes the creation date, the author, and the application that created the file.
EXAMPLE

from PyPDF2 import PdfReader  # PyPDF2 1.x called this class PdfFileReader


def pdfMeta(name):
    # Open in binary mode; a PDF is not a plain-text file.
    with open(name, 'rb') as file:
        pdf = PdfReader(file)
        info = pdf.metadata               # getDocumentInfo() in PyPDF2 1.x
        number_of_pages = len(pdf.pages)  # getNumPages() in PyPDF2 1.x
        print("Written By: \t", info.author)
        print()
        print("Created By: \t", info.creator)
        print()
        print("Produced By: \t", info.producer)
        print()
        print("Subject: \t", info.subject)
        print()
        print("Title of the PDF: \t", info.title)
        print()
        print("Number of Pages in the pdf: \t", number_of_pages)


if __name__ == '__main__':
    n = 'ionic_tutorial_complete.pdf'
    pdfMeta(n)
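
The function above assumes every metadata field is present. In practice a PDF may leave some fields out, and they may then come back as None, or the file may carry no metadata at all. Below is a minimal sketch of one way to handle that. The helper `summarize_metadata`, the `FIELDS` tuple, and the sample values are all made up for illustration; a `SimpleNamespace` stands in for the real metadata object so the sketch runs without a PDF.

```python
from types import SimpleNamespace

# The metadata attributes printed in the example above.
FIELDS = ("author", "creator", "producer", "subject", "title")


def summarize_metadata(info, missing="(not set)"):
    """Build a plain dict from a metadata object, filling in gaps.

    `info` may be None (no metadata at all) or any object that exposes
    the attributes listed in FIELDS.
    """
    summary = {}
    for field in FIELDS:
        value = getattr(info, field, None) if info is not None else None
        # Treat None and empty strings the same way.
        summary[field] = value if value else missing
    return summary


# Stand-in for a real metadata object; the values are invented sample data.
sample = SimpleNamespace(author="Jane Doe", title="Demo", creator=None,
                         producer="", subject=None)
for key, value in summarize_metadata(sample).items():
    print(f"{key}:\t{value}")
```

With a real file you would pass the metadata object from the reader instead of `sample`.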

Extracting Text
Now we will see how to extract text from a PDF file using the PyPDF2 module.
from PyPDF2 import PdfReader  # PyPDF2 1.x called this class PdfFileReader


def pdfText(name):
    with open(name, 'rb') as file:
        pdf = PdfReader(file)
        # Pages are zero-indexed, so index 10 is the 11th page.
        page = pdf.pages[10]        # getPage(10) in PyPDF2 1.x
        print(page)
        text = page.extract_text()  # extractText() in PyPDF2 1.x
        print(text)


if __name__ == '__main__':
    path = 'ionic_tutorial_complete.pdf'
    pdfText(path)
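
Asking for a page that does not exist raises an error, so it can be worth checking the index first. The sketch below shows one possible guard. `extract_page_text`, `FakePage`, and `FakeReader` are hypothetical names invented here; the fake classes only mimic the `pages` list and `extract_text()` method of a reader from PyPDF2 2.0 or later, so the example runs without a PDF file.

```python
def extract_page_text(reader, index):
    """Return the text of page `index` (zero-based), or None if out of range."""
    pages = reader.pages
    if not 0 <= index < len(pages):
        return None
    # A page with no extractable text yields an empty string here.
    return pages[index].extract_text() or ""


class FakePage:
    """Stand-in for a PDF page; returns a fixed string as its text."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class FakeReader:
    """Stand-in for a PDF reader; exposes a list of pages."""

    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]


reader = FakeReader(["first page", "second page"])
print(extract_page_text(reader, 1))   # text of the second page
print(extract_page_text(reader, 10))  # None, the document has only 2 pages
```

With a real file, a `PdfReader` object opened inside a `with open(path, 'rb')` block can be passed in place of `FakeReader`.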

Thanks for Reading
