How to Extract PDF Tables in Python?
This article explains how to extract tables from a PDF file in Python. First, what is a PDF file?
PDF (Portable Document Format) is a file format that preserves the layout and elements of a printed document so that you can view, navigate, print, or forward it to someone else. PDF files can be created using Adobe Acrobat or similar tools.
Example:
Suppose a PDF file contains the following table:
User_ID | Name | Occupation
1 | David | Product Manage
2 | Leo | IT Administrator
3 | John | Lawyer
We want to read this table into our Python program. This problem can be solved using several approaches. Let’s discuss them one by one.
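Before looking at the libraries, it may help to see what "reading the table into Python" could look like as plain data. The sketch below hard-codes the example rows and turns them into one dictionary per row; the names EXPECTED_HEADER, EXPECTED_ROWS, and rows_to_records are illustrative only and do not come from any PDF library.

```python
# Illustrative only: the example table written out by hand as plain Python data.
EXPECTED_HEADER = ["User_ID", "Name", "Occupation"]
EXPECTED_ROWS = [
    ["1", "David", "Product Manage"],
    ["2", "Leo", "IT Administrator"],
    ["3", "John", "Lawyer"],
]


def rows_to_records(header, rows):
    """Pair each cell with its column name, giving one dict per table row."""
    return [dict(zip(header, row)) for row in rows]


records = rows_to_records(EXPECTED_HEADER, EXPECTED_ROWS)
for record in records:
    print(record)
```

A successful extraction should carry the same information as these records, although the libraries discussed here return pandas DataFrames and not lists of dictionaries.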
Method 1: Using tabula-py
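tabula-py is a simple Python wrapper for tabula-java, which can read tables in a PDF. Because it wraps a Java library, a Java runtime must be installed on the machine. You can install tabula-py, along with the tabulate library used here for printing, using the commands: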
pip install tabula-py
pip install tabulate
The functions used in the example are:
read_pdf(): reads the tables from the PDF file at the given path
tabulate(): formats the data as a plain-text table
The PDF file used here is abc.pdf.
Python3
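from tabula import read_pdf
from tabulate import tabulate

# In recent tabula-py versions, read_pdf returns a list of DataFrames,
# one per detected table
tables = read_pdf("abc.pdf", pages="all")
for table in tables:
    print(tabulate(table, headers="keys"))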
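The tabula-py call can also be wrapped in a small helper that checks the file path before handing it to the library. This is a minimal sketch, not the library's recommended pattern: the name extract_tables is hypothetical, and the code assumes a tabula-py version whose read_pdf accepts the pages and multiple_tables arguments and returns a list of DataFrames.

```python
from pathlib import Path


def extract_tables(pdf_path, pages="all"):
    """Return a list of pandas DataFrames, one per table tabula-py detects.

    Hypothetical helper for illustration; it is not part of tabula-py.
    """
    path = Path(pdf_path)
    if not path.is_file():
        # Fail early with a clear message before the Java process is started.
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported here so the path check above works even where
    # tabula-py or Java is not installed.
    import tabula

    # multiple_tables=True asks for one DataFrame per detected table.
    return tabula.read_pdf(str(path), pages=pages, multiple_tables=True)
```

Calling extract_tables("abc.pdf") would then give a list that can be looped over, for example to save each table with the DataFrame method to_csv(). If only a CSV file is needed, tabula-py also provides convert_into(), which writes the extracted tables straight to disk.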
Method 2: Using Camelot
Camelot is a Python library that helps to extract tables from PDF files. You can install the camelot-py library using the command:
pip install camelot-py
The functions and attributes used in the example are:
read_pdf(): reads the tables from the PDF file at the given path
tables[index].df: returns the table at the given index as a pandas DataFrame
The PDF file used here is test.pdf.
Python3
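import camelot

# read_pdf returns a TableList; by default only the first page is parsed
tables = camelot.read_pdf("test.pdf")
# Each table exposes its contents as a pandas DataFrame via .df
print(tables[0].df)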
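Camelot's read_pdf takes a pages argument (it parses only page 1 unless told otherwise) and a flavor argument: "lattice" for tables drawn with ruling lines and "stream" for tables separated by whitespace. The sketch below shows one way these options might be used together with the parsing_report that each table carries. The helper name extract_with_camelot is hypothetical, and which flavor works best depends on how the PDF was produced.

```python
from pathlib import Path


def extract_with_camelot(pdf_path, pages="1", flavor="lattice"):
    """Return a list of (DataFrame, parsing_report) pairs, one per table.

    Hypothetical helper for illustration; it is not part of Camelot.
    """
    path = Path(pdf_path)
    if not path.is_file():
        # Fail early with a clear message before Camelot opens the file.
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported here so the path check above works even where
    # camelot-py is not installed.
    import camelot

    # pages accepts strings such as "1", "1,3" or "all".
    tables = camelot.read_pdf(str(path), pages=pages, flavor=flavor)

    # parsing_report is a dict with fields such as accuracy and page number.
    return [(table.df, table.parsing_report) for table in tables]
```

For a file such as test.pdf, a low accuracy value in the report or an empty result with flavor="lattice" is often a hint to try flavor="stream".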
Last Updated: 21 Oct, 2021