How to properly separate text from other content on a PDF page?


We need to render part of a PDF page, e.g. only the text elements. For
this purpose we use the ElementReader and ElementWriter classes. In most
cases we get good results with them, but some PDF files cause
problems (such as image shifting).

A: The fastest way to render a subset of a PDF page is to modify the
page's crop box before rendering the page using PDFDraw. For example:
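  Rect tmp_rect = page.GetCropBox();  // Save the original crop box
  page.SetCropBox(sub_rect);          // sub_rect (placeholder name): the region to render
  pdfdraw.Export(page, "out.jpg", "JPEG");
  page.SetCropBox(tmp_rect);          // Restore the original crop box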

If you need to edit or separate content using ElementWriter and
ElementBuilder, there are a couple of points to keep in mind (as part of
the new PDFNet update we are planning to remove these requirements in
order to make the editing process more intuitive):
- Call new_page.SetRotation/SetMediaBox/SetCropBox just before adding
the page to the document (i.e. pdfdoc.PagePushBack(new_page)), not
when creating the new page.
- If you skip any elements, you need to update the transformation
matrix on the output page.
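
As a rough illustration of the second point, here is a minimal,
self-contained C++ sketch. It does not use PDFNet; the Mtx struct and
the Concat function are hypothetical helpers written only for this
example. It assumes that a skipped element carried a matrix change that
later elements relied on, which is one plausible way an image can end
up shifted. Folding the skipped matrix into the matrix of the element
you keep preserves its position.

```cpp
#include <cassert>

// PDF-style affine matrix [a b c d e f] (row-vector convention):
//   x' = a*x + c*y + e
//   y' = b*x + d*y + f
// Hypothetical helper type for this sketch, not a PDFNet class.
struct Mtx {
    double a, b, c, d, e, f;
};

// Returns the product m x n: apply m first, then n.
// This mirrors how a new matrix is concatenated onto the current
// transformation matrix in a PDF content stream.
Mtx Concat(const Mtx& m, const Mtx& n) {
    return {
        m.a * n.a + m.b * n.c,
        m.a * n.b + m.b * n.d,
        m.c * n.a + m.d * n.c,
        m.c * n.b + m.d * n.d,
        m.e * n.a + m.f * n.c + n.e,
        m.e * n.b + m.f * n.d + n.f
    };
}
```

For example, suppose a skipped element set a translation
Mtx outer = {1, 0, 0, 1, 100, 50} and the image you keep uses
Mtx image = {200, 0, 0, 100, 0, 0}. Written with image alone, the image
origin is (0, 0); written with Concat(image, outer), the origin is
(100, 50), where it was on the source page.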

To find more information and sample code related to this topic please
search the Knowledge Base using the following keyword "separate PDF