How To Convert A PDF File To Markdown (With Images) In Linux

By sk

PDF files are great for sharing documents, but they are not easy to edit or convert into other formats. If you want to convert a PDF to Markdown format (while keeping the images), this guide will show you how to do it using poppler-utils and pandoc, two powerful open-source tools used for document processing.

Why Convert PDF to Markdown?

Markdown is a lightweight text format that is easy to read and edit. Many websites, blogs, and documentation tools use Markdown. By converting a PDF to Markdown, you can:

  • Edit the content easily,
  • Keep the text formatting simple,
  • Store content in a lightweight format.

Tools You Need

To perform the PDF to Markdown conversion, you will need:

  1. poppler-utils
  2. pandoc

What is poppler-utils?

poppler-utils is a collection of command-line tools for working with PDF files. It's based on the Poppler PDF rendering library, which is widely used in Linux environments.

Included tools (most-used ones):

  • pdftotext: Extracts text from PDFs.
  • pdfimages: Extracts images.
  • pdftoppm: Converts pages to images.
  • pdfinfo: Displays metadata (title, author, pages, etc.).
  • pdfseparate and pdfunite: Split and merge PDFs.

If you need to automate data extraction, convert PDF pages into images, or inspect file metadata, poppler-utils is a lightweight and scriptable choice.
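
As a small illustration of that scriptability, the sketch below pulls the page count out of pdfinfo output. The function name pdf_page_count is made up for this example, and it assumes pdfinfo prints a line that begins with "Pages:".

```shell
# Hypothetical helper: read `pdfinfo` output on stdin and print the page count.
# Assumes the output contains a line such as "Pages:          12".
pdf_page_count() {
    awk '/^Pages:/ { print $2 }'
}

# Usage (requires poppler-utils and a real file):
#   pdfinfo input.pdf | pdf_page_count
```

Because the helper only reads standard input, it can be tried on saved pdfinfo output as well as on a live command.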

What is pandoc?

pandoc is a document conversion tool. It’s often described as the “Swiss army knife” of markup conversion. It supports dozens of input and output formats.

Supported formats:

  • Input: Markdown, HTML, LaTeX, DocBook, Word DOCX, etc.
  • Output: PDF, HTML, EPUB, DOCX, LaTeX, Markdown, etc.

Key features:

  • Converts Markdown to PDF (via LaTeX or other backends).
  • Supports citation management.
  • Enables format transformations for publishing and academic writing.

If you write in Markdown and need to generate nicely formatted PDFs, DOCX files, or slideshows, pandoc is ideal.

Please note that poppler-utils and pandoc are not directly related, but they complement each other.

For instance, you can use pdftotext (from poppler-utils) to extract plain text from a PDF, then pipe it into pandoc to convert it into HTML, Markdown, or another format.

Example (pandoc has no "plain" input format, so the extracted text is read as Markdown):

pdftotext input.pdf - | pandoc -f markdown -t markdown -o output.md

Install poppler-utils and pandoc in Linux

poppler-utils and pandoc are packaged for most Linux distributions and are available in the default repositories.

On Debian/Ubuntu Linux, you can install poppler-utils and pandoc using the command:

sudo apt install poppler-utils pandoc

On Fedora and Red Hat-based systems:

sudo dnf install poppler-utils pandoc
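
After installing, you may want to confirm that the commands are actually on your PATH. Here is a minimal sketch using the shell's built-in command -v. The helper name missing_tools is made up for this example.

```shell
# Hypothetical helper: print the name of every listed command
# that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage:
#   missing_tools pdftotext pdfimages pandoc
# No output means all three commands were found.
```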

Important Considerations and Potential Issues

  • Formatting Loss: PDF is a fixed-layout format, while Markdown is a lightweight markup language focused on semantic structure. You will inevitably lose some formatting during the conversion (e.g., exact positioning of text, complex layouts, tables, images).
  • Table Conversion: Converting tables from PDF to Markdown can be particularly challenging. The output from pdftotext for tables is often poorly structured. You might need to manually reformat tables in the resulting Markdown file.
  • Image Handling: pdftotext only extracts text; it does not handle images. If your PDF contains images, they will be lost during this conversion process. You would need a different approach to extract and handle images separately.
  • Complex Layouts: PDFs with multi-column layouts or unusual text flow might not convert cleanly to Markdown. The text might appear in the wrong order or be difficult to read.
  • Manual Cleanup: Regardless of the method you choose, you will likely need to spend some time manually cleaning up and reformatting the resulting Markdown file to achieve the desired output.

Steps to Convert a PDF File to Markdown (with Images)

If you want to convert a simple PDF file that has no images or tables, the conversion is easy. A single command does it:

pdftotext input.pdf - | pandoc -t markdown -o output.md

But if the PDF file contains images, you need to extract the text and the images separately, convert the text to Markdown, and then add the images to the resulting Markdown file one by one.
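
To find out which case applies, you can ask pdfimages to list the embedded images without extracting them. The sketch below counts the rows of that listing. It assumes pdfimages -list prints two header lines followed by one row per image, and the function name count_listed_images is hypothetical.

```shell
# Hypothetical helper: read `pdfimages -list` output on stdin and print
# how many image rows follow the two header lines.
count_listed_images() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Usage (requires poppler-utils and a real file):
#   pdfimages -list input.pdf | count_listed_images
```

A result of 0 suggests the single-command conversion is enough; anything higher means the steps below are worth following.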

Step 1: Extract Text from the PDF

Use pdftotext to extract text while keeping the layout:

pdftotext -layout input.pdf output.txt

This will save the text in output.txt.

Step 2: Extract Images from the PDF

Use pdfimages to extract images:

mkdir images
pdfimages -all input.pdf images/image

All images will be saved in the images/ folder.

Step 3: Convert Text to Markdown

Since pdftotext produces plain text, you can convert it to Markdown using pandoc:

pandoc -t markdown output.txt -o output.md

pandoc reads a .txt input file as Markdown by default and infers the output format from the .md extension, so the -f and -t flags are optional here.

The following shorter command should work just as well:

pandoc output.txt -o output.md

This converts the extracted text into a Markdown file. You may still need to adjust some of the formatting by hand.
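
Part of that cleanup can be scripted. pdftotext inserts a form feed character at each page break unless you pass -nopgbrk, and -layout output often carries trailing spaces and runs of blank lines. The sketch below is one possible pre-processing filter, not a feature of either tool, and the name clean_pdftotext is hypothetical.

```shell
# Hypothetical filter: tidy pdftotext output before handing it to pandoc.
#   1. tr   drops the form feeds left at page breaks
#   2. sed  strips trailing whitespace from every line
#   3. cat  squeezes runs of blank lines into a single blank line
clean_pdftotext() {
    tr -d '\f' | sed 's/[[:space:]]*$//' | cat -s
}

# Usage:
#   clean_pdftotext < output.txt | pandoc -f markdown -t markdown -o output.md
```

It does not repair tables or column order, so those still need manual attention.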

Step 4: Embed Images in Markdown

As noted earlier, if your file has images, you have to insert them manually at the appropriate places in the output Markdown file.

In your Markdown file, you can add images like this:

![Description](images/image-000.png)

You need to adjust the filenames to match your extracted images.

Some Markdown editors also let you copy an image and paste it directly into the document.
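
If the PDF has many images, typing each link by hand gets tedious. The sketch below prints one Markdown image link per extracted file so that you only have to move the lines into position. The function name image_links and the numbered placeholder alt text are assumptions made for this example.

```shell
# Hypothetical helper: print one Markdown image link per file in a directory.
# The alt text is a numbered placeholder; replace it with a real description.
image_links() {
    dir=${1:-images}
    n=0
    for f in "$dir"/*; do
        # Skip anything that is not a regular file, including the
        # unexpanded glob pattern when the directory is empty.
        [ -f "$f" ] || continue
        n=$((n + 1))
        printf '![Image %d](%s)\n' "$n" "$f"
    done
}

# Usage: append the links to the end of the Markdown file, then move each
# one to the place in the text where the image belongs.
#   image_links images >> output.md
```

The links come out in filename order, which matches the order in which pdfimages numbered the images, not necessarily the order you want them in the text.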

Conclusion

Converting a PDF to Markdown with images is simple using poppler-utils (pdftotext, pdfimages) and pandoc. Many AI and web-based conversion tools also exist to simplify the operation. For simple PDF files, I use this method in Linux.

What are your preferred methods and tools for PDF to Markdown conversion? Please let us know via the comment section below.
