Notes on how to losslessly extract single images from PDF
copied from https://redmine.acdh.oeaw.ac.at/issues/11776
The poppler version of pdfimages has a “-list” option on the CLI.
See [[https://poppler.freedesktop.org/]]. Poppler is available as a package for most *nix distros, and ports exist for other operating systems as well. The Poppler tools also come bundled with TeX Live [[https://www.tug.org/texlive/]] (and maybe other TeX distros).
Example:
PS C:\Users\somebody\mydir> pdfimages.exe -list .\scan_2818_001.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    4960  7015  rgb     3   8  jpeg   no         4  0   601   601 4989K 4.9%
   2     1 image    4960  7015  rgb     3   8  jpeg   no        17  0   601   601 6087K 6.0%
   3     2 image    4960  7015  rgb     3   8  jpeg   no        27  0   601   601 3800K 3.7%
   4     3 image    4960  7015  rgb     3   8  jpeg   no        37  0   601   601 3759K 3.7%
(The original Xpdf version [[http://www.xpdfreader.com/]] does not have the -list option.)
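The listing above is easy to post-process in a script. The following is a minimal Python sketch, assuming the column layout shown in the example output and a `pdfimages` executable on the PATH. The names `COLUMNS`, `parse_pdfimages_list` and `list_images` are made up for illustration and are not part of Poppler.

```python
import subprocess
from typing import Dict, List

# Column names taken from the header of the example listing.
# "object ID" is printed as two values (object number and generation),
# so it is split into two keys here (an assumption of this sketch).
COLUMNS = ["page", "num", "type", "width", "height", "color", "comp", "bpc",
           "enc", "interp", "object", "id", "x_ppi", "y_ppi", "size", "ratio"]


def parse_pdfimages_list(text: str) -> List[Dict[str, str]]:
    """Turn the text printed by `pdfimages -list` into one dict per image."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a page number; the header and the dashed
        # separator line do not, so they are skipped.
        if len(fields) == len(COLUMNS) and fields[0].isdigit():
            rows.append(dict(zip(COLUMNS, fields)))
    return rows


def list_images(pdf_path: str) -> List[Dict[str, str]]:
    """Run `pdfimages -list` on a PDF and parse what it prints."""
    result = subprocess.run(["pdfimages", "-list", pdf_path],
                            capture_output=True, text=True, check=True)
    return parse_pdfimages_list(result.stdout)
```

All values are kept as strings, so a caller that needs numbers (for example the width or the resolution) has to convert them itself.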
An actual call would probably include:
-all (to save all images within the PDF in their native file formats)
-p (to include the source page number in each output file name)
the PDF’s basename as the “image root” parameter
pdfimages.exe -all -p .\scan_2818_001.pdf out\scan_2818_001
This yields
ls .\out\
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----       12.09.2018     16:36        5108636 scan_2818_001-001-000.jpg
-a----       12.09.2018     16:36        6233388 scan_2818_001-002-001.jpg
-a----       12.09.2018     16:36        3891320 scan_2818_001-003-002.jpg
-a----       12.09.2018     16:36        3849424 scan_2818_001-004-003.jpg
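The file names above follow the pattern image root, page number, image number, extension. A small Python sketch for mapping an extracted file back to its page could look like this. It assumes exactly the naming seen in the listing; `ExtractedImage` and `parse_image_name` are hypothetical names for this example.

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional


class ExtractedImage(NamedTuple):
    root: str       # the "image root" passed to pdfimages
    page: int       # page the image was extracted from (from -p)
    index: int      # running image number within the PDF
    extension: str  # file type chosen by pdfimages, e.g. "jpg"


# <root>-<page>-<index>.<ext>, as in scan_2818_001-002-001.jpg
_NAME = re.compile(r"^(?P<root>.+)-(?P<page>\d+)-(?P<index>\d+)\.(?P<ext>[^.]+)$")


def parse_image_name(filename: str) -> Optional[ExtractedImage]:
    """Split an output file name into its parts, or return None if it does not match."""
    match = _NAME.match(Path(filename).name)
    if match is None:
        return None
    return ExtractedImage(match["root"], int(match["page"]),
                          int(match["index"]), match["ext"])
```

Files that do not match the pattern yield `None`, so the function can be run over a mixed directory without raising.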
Of course, this is also scriptable from the command line. There are Poppler builds for most operating systems, and there are bindings for calling Poppler functions directly from e.g. Python ([[https://launchpad.net/poppler-python]], untested).
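As a rough illustration of such scripting, here is a Python sketch that runs the call shown earlier over every PDF in a directory. It shells out to the `pdfimages` executable (assumed to be on the PATH) rather than using the untested Python bindings. The function names and the directory layout are assumptions of this sketch.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf: Path, out_dir: Path, exe: str = "pdfimages") -> List[str]:
    """Build the argument list: -all, -p, the PDF, and its basename as image root."""
    return [exe, "-all", "-p", str(pdf), str(out_dir / pdf.stem)]


def extract_all(pdf_dir: Path, out_dir: Path) -> None:
    """Extract the images of every *.pdf in pdf_dir into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        # check=True stops the loop with an exception if pdfimages fails.
        subprocess.run(build_command(pdf, out_dir), check=True)
```

A call such as `extract_all(Path("."), Path("out"))` would then process the current directory. On Windows, `exe="pdfimages.exe"` can be passed to `build_command` if needed.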
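P.S. Other commands in the same family are listed below (taken from the Xpdf support page [[http://www.xpdfreader.com/support.html]]). Most of them are also available through Poppler; pdftopng is Xpdf-only, but Poppler's pdftoppm has a -png option instead. The names are self-explanatory, except for the last one, which is said to extract “embedded files”. Go ahead, daring and curious!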
pdftotext(1)
pdftops(1)
pdftoppm(1)
pdftopng(1)
pdftohtml(1)
pdfinfo(1)
pdfimages(1)
pdffonts(1)
pdfdetach(1)