Implementing PDF optimization using PDFNet SDK.

Q:

I'd like to implement a similar function to Acrobat's option 'Optimize
PDF' in my program. It should make the PDF as small as possible but
still have acceptable quality for screen. The optimized PDF will not
be used for printing afterwards.

I found the JBIGTest sample for shrinking monochrome images, but I'd
like do this for color images too. I'd like to make a decission based
on the DPI but I can't find those values in the PDF Reference.

Since your site mentions PDF optimizing as a common use case scenario,
I was hoping you could give me some hints to implement this.
----
A:

If you need to develop PDF optimization functions, the JBIG2 sample project is a good place to start. You could extend this code to handle color images (e.g. down-sampling and recompression), embedded fonts (e.g. subsetting or removing embedded fonts), etc.

Because the same image may be reused multiple times on a given PDF page (or throughout the document), the PDF format does not specify a DPI (dots per inch) parameter as part of the image dictionary.

The DPI of an image drawn on a page depends on several factors:

- the image pixel dimensions (i.e. pdftron.PDF.Image.GetImageWidth() & GetImageHeight()).
- the area of the PDF page covered by the image (this can be obtained from the current transformation matrix (CTM) of the given image Element). Let's assume that this area is element_width x element_height, measured in points (1/72 inch).

So the DPI of the given image instance is: [image.GetImageWidth() * 72 / element_width, image.GetImageHeight() * 72 / element_height].

For example, if the same image is drawn at twice the size on one page relative to another, the effective DPI of the larger instance will be half that of the smaller one.
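
For illustration, here is a minimal sketch (C#, against the PDFNet API) that computes the effective DPI of every top-level image element on a page. It only walks the page's top-level display list; form XObjects would need to be processed recursively:

    using System;
    using pdftron.Common;
    using pdftron.PDF;

    static void ReportImageDPI(Page page)
    {
        ElementReader reader = new ElementReader();
        reader.Begin(page);
        Element element;
        while ((element = reader.Next()) != null)
        {
            if (element.GetType() != Element.Type.e_image) continue;

            // The CTM maps the 1x1 unit image square onto the page, so the
            // lengths of its transformed axes give the drawn size in points.
            Matrix2D ctm = element.GetCTM();
            double element_width = Math.Sqrt(ctm.m_a * ctm.m_a + ctm.m_b * ctm.m_b);
            double element_height = Math.Sqrt(ctm.m_c * ctm.m_c + ctm.m_d * ctm.m_d);

            double dpi_x = element.GetImageWidth() * 72.0 / element_width;
            double dpi_y = element.GetImageHeight() * 72.0 / element_height;
            Console.WriteLine("Image instance DPI: {0:F0} x {1:F0}", dpi_x, dpi_y);
        }
        reader.End();
    }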

For more information on this topic please see the following FAQ entry: 'How do I get the image resolution and DPI?' (http://www.pdftron.com/net/faq.html#img_01).

Using PDFNet, you can calculate the effective DPI for every instance of 'PDF.Image' on the page, and you can use this information in your sub-sampling function.

The sub-sampling function can read image data using pdftron.PDF.Image.Export() or GetBitmap(). After sub-sampling you can create a new image using pdftron.PDF.Image.Create(...) and swap the original image with the new instance using sdfdoc.Swap(old_image.GetObjNum(), new_image.GetObjNum()); i.e. similar to the JBIG2 sample code.
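
As a rough sketch of this flow (using the same pdftron.SDF 'Doc' API as in the code below; 'Resample' stands for a hypothetical helper that does the actual down-sampling, and an Image.Create overload accepting a System.Drawing.Bitmap is assumed):

    // Sketch: read, down-sample, and swap a single image. 'Resample' is a
    // hypothetical helper returning a down-sampled System.Drawing.Bitmap.
    static void ReplaceWithDownsampled(Doc cos_doc, pdftron.PDF.Image old_image,
        double scale)
    {
        System.Drawing.Bitmap bmp = old_image.GetBitmap();
        System.Drawing.Bitmap resampled = Resample(bmp, scale);  // hypothetical
        pdftron.PDF.Image new_image = pdftron.PDF.Image.Create(cos_doc, resampled);
        cos_doc.Swap(old_image.GetSDFObj().GetObjNum(),
                     new_image.GetSDFObj().GetObjNum());
    }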

Q:

I am trying to swap two JPEG images in the following code. Everything seems to work great; however, to get it to work I need to comment out the lines that copy over the ImageMask and Decode values. When this is done, the transparency and other features are missing. When I put the ImageMask line back in, the code compiles and runs fine, but it generates some errors when the file is loaded into Adobe Reader.

My method GetImage takes a JPEG image and resamples and compresses it; the return type is Image. The returned image is created with the following command:

pdftron.PDF.Image img = pdftron.PDF.Image.Create(new_doc, outFile);

private void CompressImages()
{
    Doc cos_doc = new_doc.GetSDFDoc();
    int num_objs = cos_doc.XRefSize();

    for (int i = 1; i < num_objs; ++i)
    {
        Obj obj = cos_doc.GetObj(i);

        if (obj != null && !obj.IsFree() && obj.IsStream())
        {
            DictIterator itr = obj.Find("Subtype");
            if (itr == obj.DictEnd() || itr.Value().GetName() != "Image")
                continue;

            pdftron.PDF.Image input_image = new pdftron.PDF.Image(obj);

            if (input_image != null)
            {
                pdftron.PDF.Image new_image = GetImage(input_image);
                Obj new_img_obj = new_image.GetSDFObj();

                // Bad --
                // itr = obj.Find("Decode");
                // if (itr != obj.DictEnd())
                //     new_img_obj.Put("Decode", itr.Value().Clone());

                // need to make this work
                // itr = obj.Find("ImageMask");
                // if (itr != obj.DictEnd())
                //     new_img_obj.Put("ImageMask", itr.Value().Clone());

                itr = obj.Find("Mask");
                if (itr != obj.DictEnd())
                    new_img_obj.Put("Mask", itr.Value().Clone());

                cos_doc.Swap(i, new_image.GetSDFObj().GetObjNum());
            }
        }
    }
}

---

A:

The problem is that the above code is also recompressing image masks. Image masks are by definition monochrome and 1 BPP. The conversion code would produce an 8 BPP JPEG replacement (which is also probably not very efficient). There is nothing wrong with recompressing/resampling image masks, but in that case you should keep the 1 BPP sample representation.

The second issue is related to the Decode array. The recompression method converts from an Indexed color space to DeviceRGB, but this invalidates the Decode array, which has different ranges (and meaning) depending on the image color space. For example, in an Indexed image the Decode entries map samples to palette indices, whereas in a device color space they map samples to color component values.

Inserting the following line just after 'input_image' is constructed
should resolve the problem:

    if (input_image.IsImageMask() || input_image.GetComponentNum() == 1)
        continue;

Q:
I made the change as you recommended and ran a test. Everything looks very nice, with one exception: in the compressed output file, the outline of an embedded image is a shade darker than in the original document. Maybe it is similar to the problem you originally detected?
----
A:

The problem is that images in the input PDF document can be represented using CMYK or ICC color spaces. Your optimization function converts all images from CMYK to DeviceRGB, and some color information is lost in the conversion.

If color accuracy is very important for your application, you could preserve the original color space information. For example, you would resample CMYK images as CMYK instead of normalizing everything to RGB. Similarly, you would keep grayscale images as grayscale instead of converting them to RGB (which may even result in a larger file size), etc. All of this can be implemented by extending your optimization function with code that handles these special cases.
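
A sketch of how that special-casing could look when deciding how to re-encode each image, branching on the source color space (ColorSpace type names per the PDFNet API):

    // Sketch: keep the source color family instead of normalizing to RGB.
    ColorSpace cs = input_image.GetImageColorSpace();
    switch (cs.GetType())
    {
        case ColorSpace.Type.e_device_gray:
            // re-encode as 8 BPC grayscale (e.g. DCT/Flate), not RGB
            break;
        case ColorSpace.Type.e_device_cmyk:
            // re-encode as CMYK JPEG instead of converting to DeviceRGB
            break;
        default:
            // everything else can fall back to DeviceRGB
            break;
    }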

Q:

I noticed that Adobe Acrobat's optimizer can sometimes reduce a PDF file size by a factor of 40-80.

With the 2-page sample PDF (17,808 KB) and the code that I previously sent you, our optimizer (based on PDFTron SDK) generates an 8,288 KB file, while Acrobat generates a 198 KB file. One thing that Acrobat does that may be helpful is to change the page resolution (to 600 in this case). I don't quite understand this concept, as the page itself is not constrained by resolution, only the images within it. Obviously I am missing something here. When I scale my page to a width of 380 or 760, the byte size stays the same.

All this leads up to my question. If my 2-page sample PDF only compresses to 8,251 KB from the 17,808 KB original, and the other factors that I discussed aren't making a major reduction in size, then what is Acrobat doing that allows it to reduce the size of the PDF to 198 KB? Most importantly, how can I mimic that behavior in PDFTron SDK?
----
A:

All optimizations available in Acrobat Pro can be implemented using PDFNet SDK. Looking over your image recompression code, it seems that you are not changing the image resolution (i.e. image width and height stay the same); you are only decreasing the 'quality' setting for the JPEG encoder.

For significant file optimizations (such as the ones available in Acrobat) you actually need to 'down-sample' the image (i.e. replace the image with a lower-resolution equivalent). For example, if the current dimensions of the image are 1000x500 pixels, you could replace the image with a downscaled version that is 500x250 pixels (and at the same time decrease the quality parameter).

If you are using GDI+ or the .NET framework, you may want to consider the Bitmap.GetThumbnailImage() method. GetThumbnailImage works well when the requested thumbnail has a size of about 120 x 120 pixels. If you request a large thumbnail (for example, 300 x 300) from an Image that has an embedded thumbnail, there could be a noticeable loss of quality. It might be better to scale the main image (instead of scaling the embedded thumbnail) by calling the DrawImage method. More information can be found on GDI+ related dev sites (please see http://www.codeproject.com/csharp/imageresize.asp, http://www.codeproject.com/cs/media/imageprocessing4.asp, etc.).
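
For example, a DrawImage-based down-sampling helper might look along these lines (assuming System.Drawing is referenced):

    // Sketch: down-sample a bitmap with GDI+ using high-quality interpolation.
    static System.Drawing.Bitmap Downsample(System.Drawing.Bitmap src, double factor)
    {
        int w = Math.Max(1, (int)(src.Width * factor));
        int h = Math.Max(1, (int)(src.Height * factor));
        System.Drawing.Bitmap dst = new System.Drawing.Bitmap(w, h);
        using (System.Drawing.Graphics g = System.Drawing.Graphics.FromImage(dst))
        {
            g.InterpolationMode =
                System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
            g.DrawImage(src, 0, 0, w, h);
        }
        return dst;
    }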

The page resolution parameter in Acrobat's optimizer is essentially used to compute the downscaling factor for each image on the page. Since in PDF an image may be used repeatedly throughout a document, the DPI parameter is not stored in the Image object, but it can be inferred from the image Element as described in the following FAQ: http://www.pdftron.com/net/faq.html#img_01.
So, knowing the DPI of the source image and the target 'page' resolution/DPI, you can calculate the downscaling factor.
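
In code, the factor could be computed along these lines ('element_width' is the drawn width in points, as in the DPI formula above; 150 DPI is just an example target for screen viewing):

    // Sketch: downscale factor from effective DPI and a target resolution.
    double current_dpi = image.GetImageWidth() * 72.0 / element_width;
    double target_dpi = 150.0;                               // example target
    double scale = Math.Min(1.0, target_dpi / current_dpi);  // never upscale
    int new_width = (int)Math.Round(image.GetImageWidth() * scale);
    int new_height = (int)Math.Round(image.GetImageHeight() * scale);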

Q:

I have implemented your suggested changes. I got a little better compression than I thought I would; the grayscale images in the sample now come to about 448 KB. However, that still leaves me with a best case of 7,099 KB for the sample PDF, where the Acrobat PDF size is 198 KB.
---

A:

After looking at your test file, we found that the problem is in embedded color spaces. To verify this you can select Advanced > PDF Optimizer > Audit Space Usage... in Acrobat Pro. 7.4 MB of the total file size is used on color spaces (specifically DeviceN color spaces listed in the page resource dictionaries). Using PDFNet SDK you can remove or replace all ICC color profiles with device color spaces (i.e. DeviceRGB / DeviceCMYK / DeviceGray). This should reduce the file size by at least 7 MB, bringing the total file size in line with the expected size.
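
A rough sketch of replacing ICCBased color spaces in each page's resource dictionary with the matching device space is shown below. It is only a sketch: it does not descend into form XObjects, and DeviceN spaces (arrays of the form [/DeviceN names alternate tintTransform]) could be replaced with their alternate space in a similar way:

    using pdftron.PDF;
    using pdftron.SDF;

    static void ReplaceICCWithDeviceSpaces(PDFDoc doc)
    {
        for (PageIterator pitr = doc.GetPageIterator(); pitr.HasNext(); pitr.Next())
        {
            Obj res = pitr.Current().GetResourceDict();
            if (res == null) continue;
            Obj cs_dict = res.FindObj("ColorSpace");
            if (cs_dict == null || !cs_dict.IsDict()) continue;

            // Collect replacements first so the dictionary is not mutated
            // while it is being iterated.
            System.Collections.Generic.List<string> keys =
                new System.Collections.Generic.List<string>();
            System.Collections.Generic.List<string> names =
                new System.Collections.Generic.List<string>();

            for (DictIterator i = cs_dict.GetDictIterator(); i.HasNext(); i.Next())
            {
                Obj cs = i.Value();
                if (cs.IsArray() && cs.Size() == 2 && cs.GetAt(0).IsName()
                    && cs.GetAt(0).GetName() == "ICCBased")
                {
                    // /N in the ICC stream dictionary is the component count.
                    Obj n = cs.GetAt(1).FindObj("N");
                    int comps = (n != null) ? (int)n.GetNumber() : 3;
                    string device = (comps == 1) ? "DeviceGray"
                                  : (comps == 4) ? "DeviceCMYK" : "DeviceRGB";
                    keys.Add(i.Key().GetName());
                    names.Add(device);
                }
            }

            for (int k = 0; k < keys.Count; ++k)
                cs_dict.PutName(keys[k], names[k]);
        }
    }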