Counting words and characters in text files is straightforward, but what if you need to do the same with PDF files? PDFs are widely used for sharing documents because they maintain formatting across different devices. However, their structure makes text extraction a bit more complex than with plain text files. In this article, I'll guide you through creating a Python script that can count the number of words and characters in a PDF file in Linux.
Setting Up Your Environment
Before writing the script, make sure Python 3 is installed on your system. We'll also use the PyPDF2 library to extract text from PDF files. You can install it with pip:
pip install PyPDF2
With PyPDF2 installed, we're ready to start.
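If you'd like to confirm that the library is importable before going further, a quick check along these lines may help. This is only a minimal sketch using the standard library; the helper name `is_installed` is introduced here for illustration and is not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("PyPDF2"):
        print("PyPDF2 is available.")
    else:
        print("PyPDF2 is not installed. Run: pip install PyPDF2")
```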
A Python Script to Count Words and Characters in PDF Files
The complete Python script to count the number of words and characters in a PDF file is available on our GitHub Gist page and is listed in full below.
This Python script analyzes a PDF file by extracting its text content and then counting the total number of words and characters in that text. It uses the PyPDF2 library to read the PDF file and the argparse module to handle command-line arguments.
The script reports both character counts: one that includes newline characters and one that excludes them.
#!/usr/bin/env python3
# ------------------------------------------------------------------
# Script Name:   pdfcwcount.py
# Description:   A Python Script to Count Characters and Words
#                in a PDF File.
# Website:       https://gist.github.com/ostechnix
# Version:       1.0
# Usage:         python pdfcwcount.py filename
# ------------------------------------------------------------------

import argparse

import PyPDF2


def extract_text_from_pdf(file_path):
    """Extracts text from a PDF file."""
    try:
        # PDFs are binary files, so open in 'rb' mode
        with open(file_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            text = ''
            # Concatenate the text of every page in order
            for page in reader.pages:
                text += page.extract_text()
            return text
    except FileNotFoundError:
        print(f"The file {file_path} does not exist.")
        return ''


def count_words_in_text(text):
    """Counts the number of words in a given text."""
    # split() with no arguments splits on any run of whitespace
    words = text.split()
    return len(words)


def count_characters_in_text(text, include_newlines=True):
    """Counts the number of characters in a given text."""
    if not include_newlines:
        text = text.replace('\n', '')
    return len(text)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Count the number of words and characters in a PDF file.")
    parser.add_argument("file_path", type=str, help="Path to the PDF file.")
    args = parser.parse_args()

    text = extract_text_from_pdf(args.file_path)
    if text:
        # Calculate counts
        word_count = count_words_in_text(text)
        character_count_with_newlines = count_characters_in_text(text, include_newlines=True)
        character_count_without_newlines = count_characters_in_text(text, include_newlines=False)

        # Display results in a neat format
        print("\n--- PDF File Analysis Report ---")
        print(f"File: {args.file_path}")
        print(f"Total Words: {word_count}")
        print(f"Total Characters (including newlines): {character_count_with_newlines}")
        print(f"Total Characters (excluding newlines): {character_count_without_newlines}")
        print("-----------------------------\n")
How this Script Works
- Text Extraction: The extract_text_from_pdf function opens the PDF file in binary mode, reads it with PyPDF2, and extracts the text from each page in turn. If the file does not exist, it prints an error message and returns an empty string.
- Word and Character Counting:
  - The count_words_in_text function splits the text on whitespace (spaces, tabs, and newlines) and counts the resulting words.
  - The count_characters_in_text function counts characters. It can include or exclude newline characters, depending on your needs.
- User Input: The script accepts the path to the PDF file as a command-line argument. It then processes the file and displays the word count along with the character count, both with and without newlines.
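The two counting functions can be tried on their own, without a PDF. The sketch below copies them from the script and runs them on a made-up sample string (the string is an assumption for illustration, not text from any real file). It shows that repeated spaces and tabs do not inflate the word count, and that only newline characters are dropped from the second character count.

```python
def count_words_in_text(text):
    """Counts the number of words in a given text."""
    return len(text.split())


def count_characters_in_text(text, include_newlines=True):
    """Counts the number of characters in a given text."""
    if not include_newlines:
        text = text.replace("\n", "")
    return len(text)


# Hypothetical sample: two spaces, one tab, and two newlines
sample = "Hello  PDF\tworld\nsecond line\n"

print(count_words_in_text(sample))                               # 5
print(count_characters_in_text(sample))                          # 29
print(count_characters_in_text(sample, include_newlines=False))  # 27
```

Note that spaces and tabs still count as characters in both totals; only `\n` is removed when `include_newlines` is `False`.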
Running the Script
You can run the script from the command line by specifying the path to your PDF file:
python pdfcwcount.py path/to/your/file.pdf
The script will output a neatly formatted report that shows the total number of words, characters including newlines, and characters excluding newlines.
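To see how the argument handling behaves without a PDF at hand, the parser can be exercised directly by passing a list to `parse_args`. The sketch below mirrors the script's argparse setup; the `build_parser` wrapper is a name introduced here for illustration, and the path is just the placeholder from the command above.

```python
import argparse


def build_parser():
    """Build the same single-argument parser the script uses."""
    parser = argparse.ArgumentParser(
        description="Count the number of words and characters in a PDF file."
    )
    parser.add_argument("file_path", type=str, help="Path to the PDF file.")
    return parser


# Passing a list simulates: python pdfcwcount.py path/to/your/file.pdf
args = build_parser().parse_args(["path/to/your/file.pdf"])
print(args.file_path)
```

Because `file_path` is a positional argument, running the script with no arguments makes argparse print a usage message and exit instead of reaching the counting code.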
Example:
$ python pdfcwcount.py ~/testfile.pdf
Sample Output:
--- PDF File Analysis Report ---
File: /home/ostechnix/testfile.pdf
Total Words: 6
Total Characters (including newlines): 22
Total Characters (excluding newlines): 21
-----------------------------
As you can see, testfile.pdf contains 6 words, 22 characters including newlines, and 21 characters excluding newlines.
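The gap between the two character totals is simply the number of newline characters in the extracted text, so a difference of 1 (22 versus 21) points to a single newline. A small sketch of that relationship, using a made-up string rather than the actual contents of testfile.pdf:

```python
# Hypothetical text containing exactly one newline
text = "alpha beta\ngamma"

with_newlines = len(text)
without_newlines = len(text.replace("\n", ""))

print(with_newlines)                     # 16
print(without_newlines)                  # 15
print(with_newlines - without_newlines)  # 1
print(text.count("\n"))                  # 1
```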
Conclusion
This Python script is a handy tool for analyzing PDF files, whether you're working on document processing, content analysis, or simply curious about the contents of a PDF. With the ability to count words and characters both with and without newline characters, the script provides flexibility to suit various needs. Give it a try, and you’ll find it a useful addition to your toolkit for working with PDF files.