pdf to jpg without quality loss; gscan2pdf
When I convert a pdf file to a bunch of jpg files using
convert -quality 100 file.pdf page_%04d.jpg
I have appreciable quality loss.
However if I do the following, there is no (noticeable) quality loss:
Start gscan2pdf, choose file-> import (and choose file.pdf). Then go to the temporary directory of gscan2pdf. There are many pnm files (one for every page of the pdf-file). Now I do
for file in *.pnm; do convert "$file" "$file.jpg"; done
The resulting jpg-files are (roughly) of the same quality as the original pdf (which is what I want).
Now my question is whether there is a simple command-line way to convert the pdf file to a bunch of jpg files without noticeable quality loss. (The solution above is too complicated and time-consuming.)
It's not clear what you mean by "quality loss". That could mean a lot of different things. Could you post some samples to illustrate? Perhaps cut the same section out of the poor quality and good quality versions (as a PNG to avoid further quality loss).
Perhaps you need to use `-density` to do the conversion at a higher dpi:
convert -density 300 file.pdf page_%04d.jpg
(You can prepend `-units PixelsPerCentimeter` if necessary. My copy defaults to ppi.)
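To get a feel for what `-density` changes: the rendered bitmap's pixel size is roughly the page size in inches times the density. A minimal sketch, assuming a US Letter page (8.5 × 11 in) and ImageMagick's usual default of 72 dpi; the helper name `pixels_at_density` is made up for illustration.

```shell
# Hypothetical helper: pixel dimensions of a page rendered at a given density.
# Arguments: page width and height in hundredths of an inch, density in dpi.
pixels_at_density() {
    w=$1 h=$2 dpi=$3
    echo "$(( w * dpi / 100 ))x$(( h * dpi / 100 ))"
}

pixels_at_density 850 1100 72    # prints 612x792
pixels_at_density 850 1100 300   # prints 2550x3300
```

So a scanned page embedded at 300 dpi loses most of its pixels when rendered at the default density, which would explain the visible degradation.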
Update: As you pointed out, `gscan2pdf` (the way you're using it) is just a wrapper for `pdfimages`, which does not do the same thing that `convert` does when given a PDF as input.
`convert` takes the PDF, renders it at some resolution, and uses the resulting bitmap as the source image.
`pdfimages` looks through the PDF for embedded bitmap images and exports each one to a file. It simply ignores any text or vector drawing commands in the PDF.
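To check which case a given PDF falls into, you can ask `pdfimages` what it sees before extracting anything. This is only a sketch, assuming the poppler build of `pdfimages`, whose `-list` output is a table with two header lines followed by one row per embedded image; the function name `embedded_image_count` is hypothetical.

```shell
# Hypothetical helper: count the bitmap images embedded in a PDF.
# Assumes poppler's `pdfimages -list`, which prints two header lines
# and then one row per embedded image.
embedded_image_count() {
    pdfimages -list "$1" | awk 'NR > 2 { n++ } END { print n + 0 }'
}

# Usage (not run here):
#   embedded_image_count file.pdf
```

If the count is zero, the PDF is all text and vector drawing, and rendering it with `convert -density` is the only option. If it is about one per page, the PDF is probably a wrapper around scans.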
As a result, if what you have is a PDF that's just a wrapper around a series of bitmaps, `pdfimages` will do a much better job of extracting them, because it gets you the raw data at its original size. You probably also want to use the `-j` option to `pdfimages`, because a PDF can contain raw JPEG data. By default, `pdfimages` converts everything to PNM format, and converting JPEG > PPM > JPEG is a lossy process.
pdfimages -j file.pdf page
You may or may not need to follow that with a convert-to-`.jpg` step (depending on what bitmap format the PDF was using).
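The extract-then-convert sequence could be wrapped up roughly like this. It is a sketch, not a tested recipe: `extract_as_jpg` is a made-up name, and it assumes `pdfimages` names its output `<prefix>-NNN.jpg`, `.ppm`, or `.pbm`.

```shell
# Hypothetical helper: extract embedded images, keeping raw JPEG data
# untouched, then convert whatever came out as PPM/PBM to JPEG.
extract_as_jpg() {
    pdf=$1 prefix=${2:-page}
    pdfimages -j "$pdf" "$prefix" || return 1
    for f in "$prefix"-*.ppm "$prefix"-*.pbm; do
        [ -e "$f" ] || continue            # pattern matched nothing
        convert "$f" "${f%.*}.jpg"         # only this step re-encodes
    done
}

# Usage (not run here):
#   extract_as_jpg file.pdf page
```

Images that were already stored as JPEG in the PDF are written out as-is, so only the non-JPEG ones go through a lossy encode.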
I tried this command on a PDF that I had made myself from a sequence of JPEG images. The extracted JPEGs were byte-for-byte identical to the source images. You can't get higher quality than that.
+1 I am so glad I didn't submit to the snobbery misreading one of your sentences inspired in me and actually tried pdfimages -- probably the most useful program I have used in months! I'd encourage everyone to try it!
`convert` is also impractical for large PDFs. For example, it took 45 GB of memory to process a book of 700 6-megapixel pages. It also took about a thousand times longer than `pdfimages`.
For the other way round (converting images to a PDF or, better, wrapping them in one), use img2pdf: https://gitlab.mister-muffin.de/josch/img2pdf (it wraps JPEG and JPEG 2000 files into a PDF).
by the way, if you don't need jpgs specifically, but want the actual image data from the PDF regardless of format, use `-all` in place of `-j` :-)
I got strange checkered boxes all over my converted image files using the above `convert` command, until I converted from pdf to pdf first (strange, I know). After that, the above command worked and there were no checkered boxes in the images. I wonder if some form of script running in the original pdf created the checkers? The original pdf was editable; I wonder if that was why.