How to split a PDF document based on file size?

Aaron_Gravesdale · February 28, 2007, 8:17pm

Q:

We are looking to split a PDF document into multiple files based on a
maximum file size. Is there an API which splits a PDF file based on
a maximum file size?
---

A:

One way to implement this functionality (i.e. split a PDF document by
file size) is to:

1) Create a temporary document (i.e. PDFDoc tmpdoc = new PDFDoc()).
2) Import a page set into it (starting from a single page).
3) Save the temporary document. If it is smaller than the maximum file
size, add another page to the import page set and go to 1); otherwise
proceed to 4).
4) Done: the last page set that fit is one output file. Start a new
page set with the next page (i.e. go to 1).
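
The loop described in these steps can be sketched in a library-neutral way. This is only a minimal sketch, not PDFNet code: `split_by_size` and `measure` are hypothetical names, and `measure` stands in for "build a temporary document from these pages, save it, and return its size in bytes". The toy `fake_measure` uses made-up numbers so the example can run on its own.

```python
from typing import Callable, List, Sequence


def split_by_size(
    page_count: int,
    max_bytes: int,
    measure: Callable[[Sequence[int]], int],
) -> List[List[int]]:
    """Greedily group pages 1..page_count into chunks that fit max_bytes.

    `measure` is a hypothetical callback: given page numbers, it should
    build a temporary document from them, save it, and return its size.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    for page in range(1, page_count + 1):
        candidate = current + [page]
        if current and measure(candidate) > max_bytes:
            # Adding this page would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = [page]
        else:
            # The candidate fits, or the chunk is empty and the page has to
            # be taken on its own even if it alone exceeds the limit.
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Toy stand-in for the real measurement: 100 bytes of shared overhead
# plus 400 bytes per page (illustrative numbers only).
def fake_measure(pages: Sequence[int]) -> int:
    return 100 + 400 * len(pages)


chunks = split_by_size(page_count=5, max_bytes=1000, measure=fake_measure)
print(chunks)  # [[1, 2], [3, 4], [5]]
```

Note that a page which is larger than the limit by itself still ends up in a chunk of its own; the sketch does not try to shrink it.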

Since you need to work with pages, the following resources may be
helpful:

- PDFPage sample project: http://www.pdftron.com/net/samplecode.html#PDFPage
- http://www.pdftron.com/net/usermanual.html#page_manip
- http://www.pdftron.com/net/faq.html#merge_00

Aaron_Gravesdale · March 5, 2007, 9:11pm

Q:

1. To implement the file splitting logic, we are currently saving each
page of a PDF file into a new PDF document and adding pages until
the max. size limit is reached. In doing so, the total size of the
split file(s) is more than the original file.

Example: I have a 6.5MB PDF file. The max. PDF file limit is 3MB.
After splitting, the file sizes are 4MB, 4MB, 2MB, 2MB, 1MB. The code
logic is pasted below. Please advise on how to proceed.

2. Is there an API which splits a PDF file based on a maximum file size
limit? If so, please provide a pointer to it.

3. Is there a way to find the page size of a PDF file without
actually saving each page of the PDF file locally to a hard drive?

CODE LOGIC:
1. Check the file size of the whole PDF to see if it exceeds the max
file size limit.
2. If it exceeds the max file size limit, save every page of the input
file into a temp file under the Temp Output folder.
3. Find the page size after saving the file locally on disk.
4. Add the size of every page thus saved to the AccruedPageSize
counter.
5. Repeat steps 2, 3 and 4 while AccruedPageSize <=
MaxFileSizeLimit.
6. Once it exceeds the limit, save the file into the Expected Output
folder and reset the AccruedPageSize counter.
7. Repeat steps 2 to 6 until the input file's last page is reached.
---

A:

1. In doing so, the total size of the split file(s) is more than the original file.

This is expected and is not a limitation of PDFNet SDK (i.e. you will
get a similar result using any other tool). The reason is that inside
a multi-page PDF document, pages can share the same resources (e.g.
embedded fonts, color spaces, images, and Form XObjects (i.e. shared
content streams)).

After a document is split into pages, these resources can no longer be
shared, so they are replicated in each file where they are used (hence
the increase in the overall file size). When an output document (the
result of the PDF split operation) contains more than one page, you
should use doc.ImportPages() before inserting the pages into the PDF
document's page sequence, so that resources shared by those pages are
copied only once. For more information on this topic, please see
http://www.pdftron.com/net/usermanual.html#copy_pg.
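
To make the effect of shared resources concrete, here is a small toy model. It is only an illustration under simplifying assumptions: the sizes are made-up numbers, `Page`, `RESOURCE_KB` and `file_size_kb` are hypothetical names, and a file is modeled as the sum of its page contents plus one copy of every resource used by any of its pages. Real PDF files have additional overhead.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Page(NamedTuple):
    content_kb: int          # size of the page's own content stream
    uses: FrozenSet[str]     # names of shared resources the page references


# Illustrative sizes (KB) of two resources shared between pages.
RESOURCE_KB: Dict[str, int] = {"font": 300, "logo": 500}

PAGES: List[Page] = [
    Page(50, frozenset({"font", "logo"})),
    Page(50, frozenset({"font"})),
    Page(50, frozenset({"font", "logo"})),
    Page(50, frozenset({"font"})),
]


def file_size_kb(pages: List[Page]) -> int:
    """Page contents plus ONE copy of each resource used in this file."""
    used = set()
    for page in pages:
        used |= page.uses
    return sum(p.content_kb for p in pages) + sum(RESOURCE_KB[r] for r in used)


whole = file_size_kb(PAGES)
singles = sum(file_size_kb([p]) for p in PAGES)
halves = file_size_kb(PAGES[:2]) + file_size_kb(PAGES[2:])

print(whole)    # 1000: each resource stored once
print(singles)  # 2400: every single-page file carries its own copies
print(halves)   # 1800: sharing survives only inside each output file
```

In this model, adding up the sizes of individually saved pages (2400) overestimates the size of any multi-page output, which is why the size of a candidate page set has to be measured as a whole.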

2. Is there an API which splits a PDF file based on a maximum file size limit?

There is no built-in method, but this can be implemented along the
lines of your code. Please keep in mind that in some cases it is not
possible to honor a 'Maximum file size limit'. For example, due to
images, fonts, and other resources, a single page may take more than
1 MB even though you set the maximum size to 1 MB.

Again, these limitations are not related to PDFNet SDK itself but to
the nature of the PDF format.

3. Is there a way to find the page size of a PDF file without actually
saving each page of the PDF file locally to a hard drive?

The overall PDF file size is not directly related to the size of the
exported page(s) (e.g. a two-page document may take less space than
the two pages saved separately, and the file may contain other
unrelated embedded information, forms, annotations, etc.). As a
result, there is no utility method that returns the byte size of a
given page in a PDF document.
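
Although no per-page size method exists, the size of a candidate page set can in principle be measured without touching the disk by serializing it into an in-memory buffer and taking the buffer length. The sketch below is library-neutral and rests on an assumption: `write_document` is a hypothetical callback that writes the given pages to a binary stream, so whether your PDF library can save to a stream or memory buffer needs to be checked in its documentation.

```python
import io
from typing import BinaryIO, Callable, Sequence


def measure_in_memory(
    pages: Sequence[int],
    write_document: Callable[[Sequence[int], BinaryIO], None],
) -> int:
    """Return the serialized size in bytes of a document made of `pages`.

    `write_document` is a hypothetical callback that writes those pages
    as one document into the given binary stream.
    """
    buffer = io.BytesIO()
    write_document(pages, buffer)
    return buffer.getbuffer().nbytes


# Toy writer used only to exercise the function: 100 bytes of overhead
# plus 400 bytes per page (illustrative numbers, not real PDF output).
def fake_writer(pages: Sequence[int], stream: BinaryIO) -> None:
    stream.write(b"\0" * (100 + 400 * len(pages)))


print(measure_in_memory([1, 2], fake_writer))  # 900
```

The measured value is still the size of the whole temporary document, not of one page in isolation, for the reasons given above.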