Count Characters And Words In PDF Files Using Python In Linux

Counting words and characters in text files is straightforward, but what if you need to do the same with PDF files? PDFs are widely used for sharing documents because they maintain formatting across different devices. However, their structure makes text extraction a bit more complex than with plain text files. In this article, I'll guide you through creating a Python script that can count the number of words and characters in a PDF file in Linux.

Setting Up Your Environment

Before writing the script, you'll need to have Python installed on your system. Additionally, we'll use the PyPDF2 library to extract text from the PDF files. You can install this library using pip:

pip install PyPDF2

With PyPDF2 installed, we're ready to start.
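
If you'd like to confirm the installation before moving on, a quick check along these lines may help. This is only a minimal sketch built on the standard library's importlib.util.find_spec; the helper name is_module_available is made up for this example and is not part of the script.

```python
import importlib.util


def is_module_available(module_name):
    """Return True if a top-level module with this name can be located."""
    # find_spec returns None when no module with that name is found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # PyPDF2 is the third-party dependency; argparse ships with Python.
    for name in ("PyPDF2", "argparse"):
        status = "found" if is_module_available(name) else "missing"
        print(f"{name}: {status}")
```

If PyPDF2 is reported as missing, the pip command above was probably run for a different Python interpreter than the one running this check.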

A Python Script to Count Words and Characters in PDF Files

The complete Python script for counting the words and characters in a PDF file is also available on our GitHub Gist page, and it is reproduced in full below.

This Python script will analyze a PDF file by extracting its text content and then counting the total number of words and characters within that text. It uses the PyPDF2 library to read the PDF file and the argparse module to handle command-line arguments.

The script reports the character count both ways: with newline characters included and with them excluded.

#!/usr/bin/env python3

# ------------------------------------------------------------------
# Script Name:   pdfcwcount.py
# Description:   A Python Script to Count Characters and Words
#                in a PDF File.
# Website:       https://gist.github.com/ostechnix
# Version:       1.0
# Usage:         python pdfcwcount.py filename
# ------------------------------------------------------------------

import PyPDF2
import argparse

def extract_text_from_pdf(file_path):
    """Extracts text from a PDF file."""
    try:
        with open(file_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            text = ''
            # Iterate over the pages directly; fall back to '' if a page yields no text
            for page in reader.pages:
                text += page.extract_text() or ''
            return text
    except FileNotFoundError:
        print(f"The file {file_path} does not exist.")
        return ''

def count_words_in_text(text):
    """Counts the number of words in a given text."""
    words = text.split()
    return len(words)

def count_characters_in_text(text, include_newlines=True):
    """Counts the number of characters in a given text."""
    if not include_newlines:
        text = text.replace('\n', '')
    return len(text)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Count the number of words and characters in a PDF file.")
    parser.add_argument("file_path", type=str, help="Path to the PDF file.")
    
    args = parser.parse_args()

    text = extract_text_from_pdf(args.file_path)

    if text:
        # Calculate counts
        word_count = count_words_in_text(text)
        character_count_with_newlines = count_characters_in_text(text, include_newlines=True)
        character_count_without_newlines = count_characters_in_text(text, include_newlines=False)

        # Display results in a neat format
        print("\n--- PDF File Analysis Report ---")
        print(f"File: {args.file_path}")
        print(f"Total Words: {word_count}")
        print(f"Total Characters (including newlines): {character_count_with_newlines}")
        print(f"Total Characters (excluding newlines): {character_count_without_newlines}")
        print("-----------------------------\n")

How this Script Works

Text Extraction: The extract_text_from_pdf function reads the PDF file and extracts text from each page. This function handles PDF-specific challenges, such as reading binary data and navigating through multiple pages.
Word and Character Counting:
- The count_words_in_text function splits the text into words on any run of whitespace (spaces, tabs, and newlines) and counts them.
- The count_characters_in_text function counts characters. It also provides an option to include or exclude newline characters, depending on your needs.
User Input: The script accepts the file path of the PDF as an argument. It then processes the file and displays the results, showing both word count and character count with and without newlines.
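
The steps above can be tried out without a real PDF. The following is a rough sketch, not the script itself: StubPage and join_page_text are hypothetical names that only imitate a page object's extract_text() method, and treating a None result as an empty string is a defensive assumption. The two counting functions are copied from the script.

```python
class StubPage:
    """Stand-in for a PDF page object; it only mimics extract_text()."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


def join_page_text(pages):
    """Concatenate the text of every page, treating None as an empty string."""
    return "".join(page.extract_text() or "" for page in pages)


def count_words_in_text(text):
    """Count words by splitting on any run of whitespace."""
    return len(text.split())


def count_characters_in_text(text, include_newlines=True):
    """Count characters, optionally ignoring newline characters."""
    if not include_newlines:
        text = text.replace("\n", "")
    return len(text)


if __name__ == "__main__":
    # Two spaces, a tab and newlines all act as word separators.
    pages = [
        StubPage("Hello  PDF\tworld\n"),
        StubPage(None),  # a page that yields no text
        StubPage("second line\n"),
    ]
    text = join_page_text(pages)
    print(count_words_in_text(text))                               # 5
    print(count_characters_in_text(text))                          # 29
    print(count_characters_in_text(text, include_newlines=False))  # 27
```

Note that page texts are joined without any separator, just as in the script, so a word that ends one page and a word that starts the next are counted as a single word when no whitespace sits between them.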

Running the Script

You can run the script from the command line by specifying the path to your PDF file. On systems where the python command is not available, use python3 instead. Here’s an example:

python pdfcwcount.py path/to/your/file.pdf

The script will output a neatly formatted report that shows the total number of words, characters including newlines, and characters excluding newlines.
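
To see how the path typed on the command line ends up in args.file_path, you can hand argparse a list of arguments directly instead of reading them from the shell. This is a small sketch; the build_parser wrapper is a hypothetical helper added for illustration, while the parser definition itself matches the one in the script.

```python
import argparse


def build_parser():
    """Create the same single-argument parser the script uses."""
    parser = argparse.ArgumentParser(
        description="Count the number of words and characters in a PDF file."
    )
    parser.add_argument("file_path", type=str, help="Path to the PDF file.")
    return parser


if __name__ == "__main__":
    # Passing a list simulates: python pdfcwcount.py path/to/your/file.pdf
    args = build_parser().parse_args(["path/to/your/file.pdf"])
    print(args.file_path)  # path/to/your/file.pdf
```

Because file_path is a positional argument, running the script with no path makes argparse print a usage message and exit before any PDF is opened.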

Example:

$ python pdfcwcount.py ~/testfile.pdf

Sample Output:

--- PDF File Analysis Report ---
File: /home/ostechnix/testfile.pdf
Total Words: 6
Total Characters (including newlines): 22
Total Characters (excluding newlines): 21
-----------------------------

As you can see, testfile.pdf contains 6 words and 22 characters including newlines, or 21 characters excluding them.
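
The one-character gap between the two counts follows from how count_characters_in_text works: it strips only newline characters, so the extracted text must have contained exactly one of them. The string below is a made-up example, not the actual contents of testfile.pdf, that happens to produce the same three numbers.

```python
# Hypothetical text; NOT the real contents of testfile.pdf.
sample_text = "This is a\ntest PDF doc"

word_count = len(sample_text.split())
chars_with_newlines = len(sample_text)
chars_without_newlines = len(sample_text.replace("\n", ""))
newline_count = sample_text.count("\n")

print(word_count)              # 6
print(chars_with_newlines)     # 22
print(chars_without_newlines)  # 21

# The difference between the two character counts equals the number of newlines.
print(chars_with_newlines - chars_without_newlines == newline_count)  # True
```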

Conclusion

This Python script is a handy tool for analyzing PDF files, whether you're working on document processing, content analysis, or simply curious about the contents of a PDF. With the ability to count words and characters both with and without newline characters, the script provides flexibility to suit various needs. Give it a try, and you’ll find it a useful addition to your toolkit for working with PDF files.
