PDF files are great for sharing documents, but they are not easy to edit or convert into other formats. If you want to convert a PDF to Markdown format (while keeping the images), this guide will show you how to do it using poppler-utils and pandoc, two powerful open-source tools used for document processing.
Why Convert PDF to Markdown?
Markdown is a lightweight text format that is easy to read and edit. Many websites, blogs, and documentation tools use Markdown. By converting a PDF to Markdown, you can:
- Edit the content easily,
- Keep the text formatting simple,
- Store content in a lightweight format.
Tools You Need
To convert a PDF to Markdown, you will need:
- poppler-utils
- pandoc
What is poppler-utils?
poppler-utils is a collection of command-line tools for working with PDF files. It's based on the Poppler PDF rendering library, which is widely used in Linux environments.
Included tools (most-used ones):
- pdftotext: Extracts text from PDFs.
- pdfimages: Extracts images.
- pdftoppm: Converts pages to images.
- pdfinfo: Displays metadata (title, author, pages, etc.).
- pdfseparate and pdfunite: Split and merge PDFs.
If you need to automate extraction of data, convert PDFs into images, or inspect file metadata, poppler-utils is lightweight and scriptable.
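As a small illustration of that scriptability, the sketch below pulls the page count out of pdfinfo output. The function name page_count is hypothetical, and the sketch assumes pdfinfo reports the total on a line that begins with "Pages:".

```shell
# page_count: read pdfinfo-style output on stdin and print the value
# of the "Pages:" field. The function name is illustrative only.
page_count() {
  # Split each line on the colon plus any padding spaces that follow it.
  awk -F': *' '$1 == "Pages" { print $2; exit }'
}

# Typical use (requires poppler-utils and a real PDF):
#   pdfinfo input.pdf | page_count
```

Keeping the parsing in a separate function means it can be reused in larger scripts, for example to decide whether a long PDF should be split before conversion.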
What is pandoc?
pandoc is a document conversion tool. It’s often described as the “Swiss army knife” of markup conversion. It supports dozens of input and output formats.
Supported formats:
- Input: Markdown, HTML, LaTeX, DocBook, Word DOCX, etc.
- Output: PDF, HTML, EPUB, DOCX, LaTeX, Markdown, etc.
Key features:
- Converts Markdown to PDF (via LaTeX or other backends).
- Supports citation management.
- Enables format transformations for publishing and academic writing.
If you write in Markdown and need to generate nicely formatted PDFs, DOCX files, or slideshows, pandoc is ideal.
Note that poppler-utils and pandoc are not directly related, but they complement each other.
For instance, you can use pdftotext (from poppler-utils) to extract plain text from a PDF, then pipe that into pandoc to convert it into HTML, Markdown, or another format.
Example (pandoc has no plain input format, so the extracted text is read as Markdown):
pdftotext input.pdf - | pandoc -f markdown -t markdown -o output.md
Install poppler-utils and pandoc in Linux
poppler-utils and pandoc are packaged for most Linux distributions and are available in their default repositories.
On Debian/Ubuntu, you can install both with the following command:
sudo apt install poppler-utils pandoc
On Fedora and Red Hat-based systems (on RHEL itself, pandoc may require an additional repository such as EPEL):
sudo dnf install poppler-utils pandoc
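After installing, a quick check along these lines can confirm that the commands used in this guide are on your PATH. This is only a sketch: the helper name missing_tools is made up for illustration, and it checks nothing beyond whether each command can be found.

```shell
# missing_tools: print the name of every given command that is not on PATH.
# The helper name is hypothetical; it only wraps the standard `command -v`.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing=$(missing_tools pdftotext pdfimages pandoc)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Not installed:" $missing
else
  echo "All tools found."
fi
```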
Important Considerations and Potential Issues
- Formatting Loss: PDF is a fixed-layout format, while Markdown is a lightweight markup language focused on semantic structure. You will inevitably lose some formatting during the conversion (e.g., exact positioning of text, complex layouts, tables, images).
- Table Conversion: Converting tables from PDF to Markdown can be particularly challenging. The output from pdftotext for tables is often poorly structured. You might need to manually reformat tables in the resulting Markdown file.
- Image Handling: pdftotext only extracts text; it does not handle images. If your PDF contains images, they will be lost during this conversion process. You would need a different approach to extract and handle images separately.
- Complex Layouts: PDFs with multi-column layouts or unusual text flow might not convert cleanly to Markdown. The text might appear in the wrong order or be difficult to read.
- Manual Cleanup: Regardless of the method you choose, you will likely need to spend some time manually cleaning up and reformatting the resulting Markdown file to achieve the desired output.
Steps to Convert a PDF File to Markdown (with Images)
If you want to convert a simple PDF file that has no images or tables, converting it to Markdown is easy. You can do the conversion with a single command:
pdftotext input.pdf - | pandoc -t markdown -o output.md
But if the PDF file has images, you need to extract the text and the images separately, convert the text to Markdown, and then add the images to the resulting Markdown file one by one.
Step 1: Extract Text from the PDF
Use pdftotext to extract text while keeping the layout:
pdftotext -layout input.pdf output.txt
This will save the text in output.txt.
Step 2: Extract Images from the PDF
Use pdfimages to extract images:
mkdir images
pdfimages -all input.pdf images/image
All images will be saved in the images/ folder.
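With the -all option, pdfimages writes each image in its native format, so the folder can end up holding a mix of file types. The sketch below summarizes what was extracted by counting files per extension. The function name count_by_extension is hypothetical, and the sketch assumes every extracted file has an extension.

```shell
# count_by_extension: print "<count> <extension>" for the files in a directory.
# Hypothetical helper; it only inspects file names, not file contents.
count_by_extension() {
  dir=$1
  for f in "$dir"/*.*; do
    [ -f "$f" ] || continue      # skip the literal pattern if nothing matches
    echo "${f##*.}"              # keep only the text after the last dot
  done | sort | uniq -c | awk '{ print $1, $2 }'
}

# Typical use after running pdfimages:
#   count_by_extension images
# To preview the images without extracting them, pdfimages also has a -list option:
#   pdfimages -list input.pdf
```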
Step 3: Convert Text to Markdown
Since pdftotext produces plain text, you can convert it to Markdown using pandoc:
pandoc -t markdown output.txt -o output.md
pandoc reads its input as Markdown by default, so plain text does not need the -f markdown flag. The output format is also inferred from the .md extension, so this shorter command should work just as well:
pandoc output.txt -o output.md
This converts the extracted text into a Markdown file. You may need to adjust the formatting manually.
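One thing to watch for: text extracted with -layout often starts lines with runs of spaces, and a Markdown reader treats lines indented by four or more spaces as code blocks. pdftotext also inserts a form-feed character at each page break. A minimal clean-up sketch, assuming you are willing to give up the column alignment that -layout preserved (the function name clean_text is made up for illustration):

```shell
# clean_text: prepare pdftotext output for pandoc's Markdown reader.
#  - tr drops the form-feed characters emitted at page breaks
#  - sed strips leading whitespace so indented lines are not read as code blocks
# Hypothetical helper; it deliberately discards the -layout column alignment.
clean_text() {
  tr -d '\f' | sed 's/^[[:space:]]*//'
}

# Typical use:
#   clean_text < output.txt | pandoc -f markdown -t markdown -o output.md
```

Because the indentation is removed, this suits running prose better than pages with tables, where the alignment is the only structure left.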
Step 4: Embed Images in Markdown
As stated already, if your file has images, you have to insert them manually at the appropriate places in the output Markdown file.
In your Markdown file, you can add images like this:
![Image description](images/image-000.png)
You need to adjust the filenames to match your extracted images.
Some Markdown editors also let you copy an image and paste it directly into the Markdown file.
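If there are many images, typing each reference by hand gets tedious. The sketch below prints one Markdown image reference per file in a folder, which you can then cut and paste into the right positions. The function name image_links is hypothetical, the file name is used as placeholder alt text, and the script cannot know where in the document each image belongs.

```shell
# image_links: print one Markdown image reference per file in a directory.
# Hypothetical helper; the alt text is just the file name.
image_links() {
  dir=$1
  for f in "$dir"/*; do
    [ -f "$f" ] || continue      # skip subdirectories and unmatched patterns
    printf '![%s](%s)\n' "$(basename "$f")" "$f"
  done
}

# Typical use: append the references to the end of the converted file,
# then move each one to where it belongs.
#   image_links images >> output.md
```

Some formats written by pdfimages -all may not display in every Markdown viewer. pdfimages also has a -png option that writes images as PNG instead.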
Conclusion
Converting a PDF to Markdown with images is straightforward using poppler-utils (pdftotext, pdfimages) and pandoc. Many AI and web-based conversion tools also exist to simplify the operation. For simple PDF files, I use this method in Linux.
What are your preferred methods and tools for PDF to Markdown conversion? Please let us know via the comment section below.