Does HTML2PDF conversion loose the information related to H1-H6 headings?

Product: PDFNet Windows

Product Version:9.2.0

I am trying to convert an HTML string to PDF using HTML2PDF and make it accessible using the tag information that comes with the PDFDoc.
HTML string conversion looks like this;
PDFDoc doc = new PDFDoc();
string html = “<html><body><h1>Heading</h1><p>Paragraph.</p></body></html>
converter.InsertFromHtmlString(html);
converter.Convert(doc);

After the conversion, when I tried to process the elements in the PDFDoc and tried to see what the MCTag returns for each text element(e_text); (in this case Heading and Paragraph)
var tag = element.GetMCTag();
tag.GetName() returns “P” as the element tag for Heading which I expected to have “h1” instead. For “Paragraph” it gave the correct MCTag which is “P”.

Am I missing something during the conversion? or is there a way to get the correct tag information when it comes to headings (H1-H6) in HTML or any way of getting the heading info after the conversion?

Thank you
waiting for a quick answer
Maneesha

Hello, I’m Ron, an automated tech support bot :robot:

While you wait for one of our customer support representatives to get back to you, please check out some of these documentation pages:

Guides:Forums:

Currently no, the output does not include H1-H6 headings.

Could you elaborate on how not having the H1-H6 headings exactly affects your users?

Hi Ryan,
Yes, we are trying to create a logical structure out of an untagged pdf (which is converted from HTML to PDF using HTML2PDF). Because the requirement is to make a PDF/UA compliant PDF. So in the logical structure tree, the headings should be identified as headings and tagged accordingly. So the assistive technology can identify what’s heading and what’s a paragraph.

Thank you for the clarifications.

I have added this feature request to the product backlog, but at this time it is not on our schedule.

If instead your input was a DOCX file, then the PDF output with our SDK is fully tagged with the H1-H6 tags. Is switching from HTML to DOCX an option for you?

No, unfortunately, DOCX type is not used within the application.
Thank you for the information :+1:

I am happy to report that starting with our next release, PDFNet 9.4, the HTML2PDF module output will be a Tagged PDF, and the H1-H6 entries will be preserved and present in the PDF output.

To get notified for the next official SDK release you can join our Discourse Announcements channel.

For platform specific notifications, such as Nuget, NPM, CocoaPods, please see the respective PDFTron product documentation page.

1 Like

That’s good news. Will you be considering nested elements as well, like Tables and lists in the output tagged PDF?

Hi @Ryan,
When can we expect the next release (9.4) of PDFTron SDK? Will it be within this year?

You can download a Preview build here:

https://pdftron.s3.amazonaws.com/custom/ID-zJWLuhTffd3c/support/html2pdf/html2pdf_chromium/HTML2PDFWindows_TaggedOutput.zip

While not as fully tested as our official releases, this should be fine for production usage, as the only change from the official release is the flag to generate a Tagged PDF.

Hi,
Thank you for the response.
I was checking the dll, but the output was the same - untagged PDF. Do I have to set some properties or is it only will be available in the released version?
I am using this on the attached file here;
html.txt (10.4 KB)

HTML2PDF converter = new HTML2PDF();
converter.InsertFromHtmlString(contentString);
converter.Convert(doc)