How To Convert A PDF File To Markdown (With Images) In Linux

PDF files are great for sharing documents, but they are not easy to edit or convert into other formats. If you want to convert a PDF to Markdown format (while keeping the images), this guide will show you how to do it using poppler-utils and pandoc, two powerful open-source tools used for document processing.

Why Convert PDF to Markdown?

Markdown is a lightweight text format that is easy to read and edit. Many websites, blogs, and documentation tools use Markdown. By converting a PDF to Markdown, you can:

Edit the content easily
Keep the text formatting simple
Store content in a lightweight format

Tools You Need

To perform the PDF-to-Markdown conversion, you will need:

poppler-utils
pandoc

What is poppler-utils?

poppler-utils is a collection of command-line tools for working with PDF files. It's based on the Poppler PDF rendering library, which is widely used in Linux environments.

Included tools (most-used ones):

pdftotext: Extracts text from PDFs.
pdfimages: Extracts images.
pdftoppm: Converts pages to images.
pdfinfo: Displays metadata (title, author, pages, etc.).
pdfseparate and pdfunite: Split and merge PDFs.

If you need to automate data extraction, convert PDF pages into images, or inspect file metadata, poppler-utils is a lightweight, scriptable choice.

What is pandoc?

pandoc is a document conversion tool. It’s often described as the “Swiss army knife” of markup conversion. It supports dozens of input and output formats.

Supported formats:

Input: Markdown, HTML, LaTeX, DocBook, Word DOCX, etc.
Output: PDF, HTML, EPUB, DOCX, LaTeX, Markdown, etc.

Key features:

Converts Markdown to PDF (via LaTeX or other backends).
Supports citation management.
Enables format transformations for publishing and academic writing.

If you write in Markdown and need to generate nicely formatted PDFs, DOCX files, or slideshows, pandoc is ideal.

Please note that poppler-utils and pandoc are not directly related, but they complement each other.

For instance, you can use pdftotext (from poppler-utils) to extract plain text from a PDF, then pipe that text into pandoc to convert it into HTML, Markdown, or another format.

Example:

pdftotext input.pdf - | pandoc -f markdown -t markdown -o output.md

Install poppler-utils and pandoc in Linux

poppler-utils and pandoc are packaged for most Linux distributions and are available in the default repositories.

On Debian/Ubuntu Linux, you can install poppler-utils and pandoc using the following command:

sudo apt install poppler-utils pandoc

On Fedora and Red Hat-based systems (on RHEL and its clones, pandoc may require the EPEL repository):

sudo dnf install poppler-utils pandoc
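To confirm that everything is in place before you start, a small check like the one below may help. It is only a sketch: check_tools is a hypothetical helper name, not part of either package, and it relies on the shell's built-in command -v to look up each program.

```shell
# check_tools: hypothetical helper that reports which commands are missing.
# Returns 0 when every command is found in PATH, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "Missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_tools pdftotext pdfimages pandoc && echo "All tools are installed"
```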

Important Considerations and Potential Issues

Formatting Loss: PDF is a fixed-layout format, while Markdown is a lightweight markup language focused on semantic structure. You will inevitably lose some formatting during the conversion (e.g., exact positioning of text, complex layouts, tables, images).
Table Conversion: Converting tables from PDF to Markdown can be particularly challenging. The output from pdftotext for tables is often poorly structured. You might need to manually reformat tables in the resulting Markdown file.
Image Handling: pdftotext only extracts text; it does not handle images. If your PDF contains images, they will be lost during this conversion process. You would need a different approach to extract and handle images separately.
Complex Layouts: PDFs with multi-column layouts or unusual text flow might not convert cleanly to Markdown. The text might appear in the wrong order or be difficult to read.
Manual Cleanup: Regardless of the method you choose, you will likely need to spend some time manually cleaning up and reformatting the resulting Markdown file to achieve the desired output.

Steps to Convert a PDF File to Markdown (with Images)

If you want to convert a simple PDF file that has no images or tables, the conversion is easy. You can do it with a single command:

pdftotext input.pdf - | pandoc -t markdown -o output.md

If the PDF file has images, however, you need to extract the text and the images separately, convert the text to Markdown, and then add the images to the resulting Markdown file one by one.
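If you are not sure whether a PDF contains embedded images, pdfimages can list them without extracting anything. The sketch below assumes the usual pdfimages -list output, which prints two header lines followed by one row per image. count_image_rows is a hypothetical helper name.

```shell
# count_image_rows: hypothetical helper that reads `pdfimages -list` output
# on standard input and prints the number of image rows.
# Assumption: the first two lines are the column header and a dashed rule.
count_image_rows() {
    awk 'NR > 2 && NF { n++ } END { print n + 0 }'
}

# Usage:
#   pdfimages -list input.pdf | count_image_rows
```

A result of 0 means there are no embedded raster images to extract. Vector drawings are not listed by pdfimages, so they will not be counted.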

Step 1: Extract Text from the PDF

Use pdftotext to extract text while keeping the layout:

pdftotext -layout input.pdf output.txt

This will save the text in output.txt.

Step 2: Extract Images from the PDF

Use pdfimages to extract images:

mkdir images
pdfimages -all input.pdf images/image

All images will be saved in the images/ folder, named image-000, image-001, and so on, each with an extension that matches its format.

Step 3: Convert Text to Markdown

Since pdftotext produces plain text, you can convert it to Markdown using pandoc:

pandoc -f markdown -t markdown output.txt -o output.md

pandoc has no reader for plain text, so the file is read as Markdown. pandoc also treats .txt input as Markdown by default and infers the output format from the .md extension, so you can leave out the -f and -t flags.

Simply running the following command should work fine:

pandoc output.txt -o output.md

This converts the extracted text into a Markdown file. You may need to adjust the formatting manually afterwards.
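Part of that cleanup can be scripted. With -layout, pdftotext pads lines with spaces to preserve positions, and pandoc's Markdown reader treats lines indented by four or more spaces as code blocks. pdftotext also inserts a form feed character at each page break. The sketch below strips both before the text reaches pandoc. clean_layout_text is a hypothetical helper name, and removing indentation this way is an assumption that suits running prose, not tables.

```shell
# clean_layout_text: hypothetical filter for text produced by pdftotext -layout.
# Reads standard input and writes the cleaned text to standard output.
clean_layout_text() {
    # Delete the form feed characters that mark page breaks,
    # then trim leading and trailing whitespace from every line.
    tr -d '\f' | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Usage:
#   clean_layout_text < output.txt | pandoc -f markdown -t markdown -o output.md
```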

Step 4: Embed Images in Markdown

As stated already, if your file has images, you have to place them manually at the appropriate positions in the output Markdown file.

In your Markdown file, you can add images like this:

![Description](images/image-000.png)

You need to adjust the filenames to match your extracted images.
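Writing the image lines by hand gets tedious when a PDF has many images. As a starting point, the sketch below prints one Markdown image line per extracted file. image_links is a hypothetical helper name, and the generic "Image N" description and the images directory are assumptions. The generated lines still need to be moved to the right places in the document.

```shell
# image_links: hypothetical helper that prints one Markdown image line
# for every file in the given directory (default: images).
image_links() {
    dir=${1:-images}
    n=1
    for file in "$dir"/*; do
        [ -f "$file" ] || continue      # skip when nothing matches
        printf '![Image %d](%s)\n' "$n" "$file"
        n=$((n + 1))
    done
}

# Usage:
#   image_links images >> output.md
```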

Some Markdown editors also let you copy an image and paste it directly into the Markdown file.
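If you repeat this process often, Steps 1 to 3 can be wrapped in one small function. This is a sketch under assumptions, not a definitive script: pdf_to_md is a hypothetical name, the output file names follow the ones used in this guide, and placing the images (Step 4) remains a manual task.

```shell
# pdf_to_md: hypothetical wrapper around Steps 1 to 3.
# Usage: pdf_to_md input.pdf [output-directory]
pdf_to_md() {
    pdf=${1:-}
    outdir=${2:-.}
    if [ ! -f "$pdf" ]; then
        echo "pdf_to_md: no such file: $pdf" >&2
        return 1
    fi
    mkdir -p "$outdir/images" || return 1
    # Step 1: extract the text, keeping the layout.
    pdftotext -layout "$pdf" "$outdir/output.txt" || return 1
    # Step 2: extract embedded images in their native formats.
    pdfimages -all "$pdf" "$outdir/images/image" || return 1
    # Step 3: convert the extracted text to Markdown.
    pandoc -f markdown -t markdown "$outdir/output.txt" -o "$outdir/output.md"
}

# Usage:
#   pdf_to_md input.pdf converted
```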

Conclusion

Converting a PDF to Markdown with images is simple using poppler-utils (pdftotext, pdfimages) and pandoc. Many AI and web-based conversion tools also exist to simplify the operation. For simple PDF files, I use this method in Linux.

What are your preferred methods and tools for PDF-to-Markdown conversion? Please let us know via the comment section below.
