Notes on how to losslessly extract single images from a PDF

copied from https://redmine.acdh.oeaw.ac.at/issues/11776

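The Poppler version of pdfimages has a “-list” option on the CLI. With it, pdfimages prints one line of information per embedded image instead of extracting anything.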

See [[https://poppler.freedesktop.org/]]. Poppler is available as a package for most *nix distros, and there are ports for other operating systems as well. The Poppler tools also come bundled with TeX Live [[https://www.tug.org/texlive/]] (and maybe with other TeX distros).

Example:

PS C:\Users\somebody\mydir> pdfimages.exe -list .\scan_2818_001.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    4960  7015  rgb     3   8  jpeg   no         4  0   601   601 4989K 4.9%
   2     1 image    4960  7015  rgb     3   8  jpeg   no        17  0   601   601 6087K 6.0%
   3     2 image    4960  7015  rgb     3   8  jpeg   no        27  0   601   601 3800K 3.7%
   4     3 image    4960  7015  rgb     3   8  jpeg   no        37  0   601   601 3759K 3.7%

(The original Xpdf version of pdfimages, see [[http://www.xpdfreader.com/]], did not have the -list option at the time of writing.)
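If the listing is to be used in a script, it can be split into fields. The following is only a minimal sketch in Python: the function name parse_pdfimages_list and the field labels are made up for this note, and the 16-column layout is assumed from the example output above (the “object ID” header covers two numbers, object number and generation). Rows with a different layout are simply skipped.

```python
from typing import Dict, List

# Field labels are my own; they follow the header of the example listing.
# "object ID" spans two columns there, so it is split into two fields.
FIELDS = ["page", "num", "type", "width", "height", "color", "comp", "bpc",
          "enc", "interp", "object", "generation", "x_ppi", "y_ppi",
          "size", "ratio"]

# Shortened copy of the example listing, used here only as sample input.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    4960  7015  rgb     3   8  jpeg   no         4  0   601   601 4989K 4.9%
   2     1 image    4960  7015  rgb     3   8  jpeg   no        17  0   601   601 6087K 6.0%
"""


def parse_pdfimages_list(text: str) -> List[Dict[str, str]]:
    """Turn the text printed by `pdfimages -list` into one dict per image."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip the header, the dashed rule, and any row that does not have
        # the assumed 16 whitespace-separated columns.
        if len(parts) != len(FIELDS) or not parts[0].isdigit():
            continue
        rows.append(dict(zip(FIELDS, parts)))
    return rows


for row in parse_pdfimages_list(SAMPLE):
    print(row["page"], row["enc"], row["width"], row["height"])
```

All values are kept as strings; converting e.g. width and height to int is left to the caller.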

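An actual call would probably include:

  • -all (write every image in its native format where possible, e.g. JPEG data stays JPEG; CMYK images are written as TIFF and everything else as PNG)

  • -p (include the page number the image was extracted from in the output file name)

  • the PDF’s basename as the “image root” parameter, i.e. the prefix of the output file names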

pdfimages.exe -all -p .\scan_2818_001.pdf out\scan_2818_001

This yields

ls .\out\

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----       12.09.2018     16:36        5108636 scan_2818_001-001-000.jpg
-a----       12.09.2018     16:36        6233388 scan_2818_001-002-001.jpg
-a----       12.09.2018     16:36        3891320 scan_2818_001-003-002.jpg
-a----       12.09.2018     16:36        3849424 scan_2818_001-004-003.jpg
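The file names in this listing follow the pattern root-page-number.extension. When post-processing the extracted files, the page and image number can be recovered from the name. A small Python sketch, assuming exactly this pattern (inferred from the listing above, not checked against other pdfimages versions); the name split_image_name is hypothetical:

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the listing: <root>-<page>-<image number>.<ext>.
# This is an assumption; other versions or options may name files differently.
NAME_RE = re.compile(
    r"^(?P<root>.+)-(?P<page>\d+)-(?P<num>\d+)\.(?P<ext>[A-Za-z0-9]+)$"
)


def split_image_name(filename: str) -> Optional[Tuple[str, int, int, str]]:
    """Return (root, page, image number, extension), or None if no match."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return (match.group("root"), int(match.group("page")),
            int(match.group("num")), match.group("ext"))


print(split_image_name("scan_2818_001-002-001.jpg"))
```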

Of course, this is also scriptable from the command line. There are Poppler builds for most operating systems, plus bindings for calling Poppler functions from e.g. Python ([[https://launchpad.net/poppler-python]], untested).
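As an illustration of the scripting route, here is a hedged Python sketch that runs the same call as shown earlier (-all, -p, PDF path, image root) for every PDF in a folder. It shells out to the pdfimages executable rather than using any Python binding. The function names build_command and extract_all are made up for this note, and it assumes that pdfimages is on the PATH (on Windows, pass exe="pdfimages.exe" or a full path if needed).

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf: Path, out_dir: Path, exe: str = "pdfimages") -> List[str]:
    """Build the argument list: -all, -p, the PDF, then the image root."""
    # The image root is the output folder plus the PDF's basename.
    return [exe, "-all", "-p", str(pdf), str(out_dir / pdf.stem)]


def extract_all(src_dir: Path, out_dir: Path, exe: str = "pdfimages") -> None:
    """Run pdfimages once per *.pdf file found directly in src_dir."""
    # Create the output folder up front so the image files can be written.
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(src_dir.glob("*.pdf")):
        # check=True raises CalledProcessError if pdfimages reports failure.
        subprocess.run(build_command(pdf, out_dir, exe), check=True)
```

A call such as extract_all(Path("."), Path("out")) would then process all PDFs in the current folder. Passing the arguments as a list avoids shell quoting problems with spaces in file names.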

PS: Other commands available through Poppler (list taken from the Xpdf support page [[http://www.xpdfreader.com/support.html]]). The names are self-explanatory, except for the last one, which says it extracts “embedded files”. Go ahead, daring and curious!

  • pdftotext(1)

  • pdftops(1)

  • pdftoppm(1)

  • pdftopng(1) (Xpdf only; with Poppler, use pdftoppm -png or pdftocairo -png instead)

  • pdftohtml(1)

  • pdfinfo(1)

  • pdfimages(1)

  • pdffonts(1)

  • pdfdetach(1)