The PDF to HTML converter produces one HTML page per PDF page. Another tool we use needs everything in a single HTML file. How can we do this?
Please note that a 1:1 mapping of PDF pages to HTML pages allows for better management of page layout and performance. For example, you could do a magazine-style page layout. Also, for PDF files that are hundreds of pages long, having everything in a single file could result in slow loading and poor performance.
However, if a single file is important to you, then you can follow the steps below.
Since the output is fully valid XML, you can parse it and generate a single XHTML file.
Essentially, what needs to be automated is post-processing PDFNet’s current output with the following rough steps.
1. Start a new HTML file, and begin injecting the following HTML+CSS.
2. Start the style entry and add the text of FontSrc.css.
3. Inject the OTF fonts by base64 encoding them, and add them by replacing the existing src URLs from step 2.
4. Add the TextContainer1 CSS from the first page, but rename it to just TextContainer.
5. End the style entry.
6. Begin the body.
7. For each page, take the div with id="Page", add it to the new body, and make the following three changes:
7.a) Rename the id from "Page" to "PageX" (not required, but could be helpful).
7.b) Rename the PageContainerX class name to just PageContainer.
7.c) For all pages after the first, adjust the CSS top value of the page div so that the page sits below the previous pages.
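The font-embedding step can be sketched in Python as shown below. This is only a rough illustration: the function name `inline_fonts` is hypothetical, and it assumes the font references in FontSrc.css look like `url("name.otf")`, which may not match the actual converter output.

```python
import base64
import re
from pathlib import Path

# Assumption: fonts are referenced as url("name.otf") or url(name.otf).
# The real FontSrc.css may use a different form; adjust the pattern if so.
FONT_URL = re.compile(r"""url\(\s*['"]?([^'")]+\.otf)['"]?\s*\)""", re.IGNORECASE)


def inline_fonts(css_text: str, font_dir: Path) -> str:
    """Replace each url(...otf) reference with a base64 data URI."""

    def to_data_uri(match: re.Match) -> str:
        # Read the font file relative to font_dir and base64 encode it.
        font_bytes = (font_dir / match.group(1)).read_bytes()
        encoded = base64.b64encode(font_bytes).decode("ascii")
        return f"url(data:font/otf;base64,{encoded})"

    return FONT_URL.sub(to_data_uri, css_text)
```

The returned CSS text can then be written into the style section of the combined file, so the single HTML file no longer depends on separate font files.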
Step 7.c is the most “complicated” part, as you need to track the total height of all preceding pages and position the nth page with some gap. In the example, the pages are 841.92px high, so I added a buffer of 10px, and each page is an additional 852px below the previous one. This is necessary because the output uses absolute positioning. It might be possible to switch to normal relative positioning, but that had a number of knock-on effects, so at least for now I would just do the simple calculation above, which you can see in the sample file provided.
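The offset calculation can be sketched as follows. The helper name `page_tops` is hypothetical, and rounding each stride up to a whole pixel is an assumption made to match the 841.92 + 10 → 852 figure in the example.

```python
import math


def page_tops(page_heights, gap=10.0):
    """Return the CSS top offset (in px) for each page, stacked vertically."""
    tops = []
    offset = 0
    for height in page_heights:
        tops.append(offset)
        # Assumption: round each page stride up to a whole pixel,
        # as in the 841.92px + 10px -> 852px example.
        offset += math.ceil(height + gap)
    return tops


print(page_tops([841.92, 841.92, 841.92]))  # [0, 852, 1704]
```

Because the function takes a list of heights, it also covers documents whose pages are not all the same size.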
With the above steps, and knowing that the source files are all XML compliant, your team can automate this without too much trouble and convert the existing HTML output to the style that you need.
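As a starting point for that automation, here is a minimal Python sketch of the per-page changes using the standard library XML parser. The function name `rewrite_page` is hypothetical. The sketch assumes that each page file parses as XML without undefined entities, that the numbered class looks like `PageContainer` followed by digits, and that the top offset can be set through an inline style attribute; the actual output may keep the top value in the stylesheet instead.

```python
import re
import xml.etree.ElementTree as ET


def rewrite_page(xhtml_text: str, page_number: int, top_px: int) -> ET.Element:
    """Return the page element with its id, class, and top offset adjusted."""
    root = ET.fromstring(xhtml_text)
    # Match on the id attribute only, so the XHTML namespace does not matter.
    page = next((el for el in root.iter() if el.get("id") == "Page"), None)
    if page is None:
        raise ValueError('no element with id="Page" found')

    page.set("id", f"Page{page_number}")
    # Assumption: the numbered class name is PageContainer followed by digits.
    page.set("class", re.sub(r"PageContainer\d+", "PageContainer", page.get("class", "")))

    if page_number > 1:
        # Assumption: an inline style is an acceptable place for the offset.
        existing = page.get("style", "").strip().rstrip(";")
        top_rule = f"top:{top_px}px"
        page.set("style", f"{existing};{top_rule}" if existing else top_rule)
    return page
```

Each returned element can be appended to the body of the combined document and serialized with `ET.tostring`.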