How to parse the original unicode (utf8) characters from the PDF source when converting to EPUB

Q)

I am using the PDF to ePub feature. In some PDFs the generated HTML contains characters that have no standard meaning in Unicode (they are in the Private Use Area). These characters are displayed correctly with the generated font, but they cause problems when we want to find text inside the HTML.

I would like to know if there is a way to replace these characters with standard Unicode characters. I don’t care about the visualization of the ePub in this case, because I will use the replaced text only for searches.

I have no problem with these characters when I display the xhtml in a browser; the content is displayed correctly. But I need to parse the content of the xhtml as standard Unicode text (UTF-8 encoded) to provide search functionality with our own search engine. I can’t use the Unicode text of the PDF because I need the xhtml content, with its tags.

A)

The root of the problem is that HTML doesn’t separate glyphs from Unicode (unlike PDF, SVG and XPS, which all do).

For instance, &#xFB01; is a numeric character reference for Unicode U+FB01, the “fi” ligature.

However, try copying and pasting the ligatures from the web page below into a text editor. You won’t see the two letters “fi”; instead you will get a garbage character.
http://adamdscott.com/typography/ligatures-on-the-web/
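
For ligatures specifically, Unicode compatibility normalization (NFKC) can fold the single ligature code point back into plain letters before indexing. The following is a minimal sketch using Python's standard library; the function name `fold_ligatures` is made up for illustration, and note that NFKC also rewrites other compatibility characters (superscripts, full-width forms and so on), which may or may not be what a search index wants. It does not change Private Use Area characters, since those have no standard decomposition.

```python
import unicodedata


def fold_ligatures(text: str) -> str:
    """Apply NFKC normalization, which expands ligatures such as U+FB01."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # U+FB01 (LATIN SMALL LIGATURE FI) followed by "nd"
    print(fold_ligatures("\ufb01nd"))
```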

There are other cases, though, where the conversion decides that a glyph needs to be mapped to the Unicode Private Use Area (PUA), to ensure that the generated HTML looks as close as possible to the input PDF page.

That being said, to include the full UTF-8 encoding of the text from a PDF, the following should work.

  1. Run ToEpub in expanded mode (unzipped)
  2. Run text extractor on the PDF, and get the text as utf8 format. Use TextExtractor::GetAsText(), and then call UString::ConvertToUtf8()
  3. Clean up the utf8 characters to ensure no xml reserved characters are included. (e.g. ‘>’ must become ‘&gt;’)
  4. Inject the xml compliant utf8 into the respective xhtml pages. I assume you could put this into its own div, and mark the div as not visible, so nothing gets rendered.
  5. Run “epubcheck -mode exp -save” on the epub folder. This will validate the epub files and then zip into a valid .epub file.
  6. You can get epubcheck here: http://code.google.com/p/epubcheck/
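
Steps 3 and 4 above could be sketched roughly as follows in Python. This is only an illustration under several assumptions: the text has already been extracted as a Python string, the helper names (`build_hidden_div`, `inject_before_body_close`), the `search-text` class and the inline `display:none` style are all made up here, and the xhtml page is assumed to contain a literal `</body>` tag. It uses only the standard library and does not call any PDFNet API.

```python
import re
from xml.sax.saxutils import escape

# Characters that are not allowed in XML 1.0 documents at all
# (most control characters, surrogates, U+FFFE and U+FFFF).
_XML_INVALID = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def build_hidden_div(text: str, css_class: str = "search-text") -> str:
    """Wrap extracted text in a non-rendered div, escaping &, < and >."""
    cleaned = _XML_INVALID.sub("", text)
    return f'<div class="{css_class}" style="display:none">{escape(cleaned)}</div>'


def inject_before_body_close(xhtml: str, div: str) -> str:
    """Insert the div just before the last </body> tag of the page."""
    idx = xhtml.rfind("</body>")
    if idx == -1:
        raise ValueError("no </body> tag found in page")
    return xhtml[:idx] + div + "\n" + xhtml[idx:]


if __name__ == "__main__":
    page = "<html><body><p>x</p></body></html>"
    print(inject_before_body_close(page, build_hidden_div("a > b & c")))
```

Whether a hidden div is the right container depends on how the search engine parses the xhtml; the result should still be validated with epubcheck as described in step 5.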

Q)

It has been some time since I wrote to you asking about the “special” character references in the HTML generated by PDFTron when I convert a PDF to ePub.

Now I’m here again with another question for you:

We have about 50 pdfs converted to ePub for our testing.

I have looked for character references in the Private Use Area of Unicode in each file of each converted PDF.

I found that 6 pdfs of the 50 don’t contain any reference to these “special” characters.

I would like to know why this happens. That is, how can I prevent these special characters from appearing in the xhtml when we create the PDFs?

A)

Unfortunately, there are several different situations in which we output PUA values, and this logic can change from version to version as we try to improve it. So whatever the logic is now, it is not something you should build around.
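
Since the set of affected files may differ between versions, one option is to scan the generated xhtml for PUA code points at indexing time rather than rely on the current behaviour. Below is a minimal sketch in Python; the function names are hypothetical, and it assumes the page has already been read into a string. It decodes character references such as `&#xE01D;` first, so both literal PUA characters and references to them are found.

```python
import html

# The three Private Use ranges defined by Unicode.
PUA_RANGES = (
    (0xE000, 0xF8FF),        # BMP Private Use Area
    (0xF0000, 0xFFFFD),      # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),    # Supplementary Private Use Area-B
)


def is_pua(ch: str) -> bool:
    """Return True if the single character ch is a Private Use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def find_pua(xhtml: str) -> list:
    """Return the sorted, distinct PUA characters found in an xhtml string."""
    decoded = html.unescape(xhtml)  # turns &#xE01D; into the actual character
    return sorted({ch for ch in decoded if is_pua(ch)})


if __name__ == "__main__":
    sample = "<p>Criterios&#xE01D;a\ue020seguir</p>"
    print([f"U+{ord(c):04X}" for c in find_pua(sample)])
```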

Here is one example of why you might see PUA used in the output.

First, to clarify terminology.

Character Code: The value in the PDF file. This is mapped to a Glyph (what you see), and mapped to one or more Unicode values (what you get in text selection).

Glyph: What the font draws on screen.

Unicode: What we take that glyph to mean.

In the string “Criterios a seguir”, on the first page of the generated HTML, the space glyphs are mapped to the PUA.

In the PDF, though, these space glyphs are actually character code 0x1D, and in the font they map to Unicode U+0020 (the space character). However, any PDF viewer and any browser will apply what is called word-spacing to character U+0020. So if we wrote U+0020 into the HTML, the positioning of the letters would be wrong. That is, the PDF used character code 0x1D, so only letter-spacing was applied, whereas if we used character code 0x20 in the HTML, word-spacing would be applied as well. The Private Use Area is used because we don’t want another (incorrect) glyph to be substituted; a PUA code point forces the browser to use the correct glyph.
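
The effect on positioning can be illustrated with a small calculation. The sketch below follows the PDF text-spacing rule for simple fonts, in which word spacing is added only for the single-byte character code 32 (0x20), while character spacing is added for every code. The function name and all numeric values (glyph width, font size, spacing amounts) are invented for illustration and are not taken from the file discussed above.

```python
def advance(glyph_width: float, font_size: float, char_code: int,
            char_spacing: float = 0.0, word_spacing: float = 0.0,
            h_scale: float = 1.0) -> float:
    """Horizontal advance of one character in a simple (single-byte) font.

    glyph_width is in 1/1000 text-space units, as in PDF font metrics.
    Word spacing is applied only when the character code is 0x20.
    """
    tw = word_spacing if char_code == 0x20 else 0.0
    return (glyph_width / 1000.0 * font_size + char_spacing + tw) * h_scale


if __name__ == "__main__":
    # Same space glyph, same settings, different character codes.
    as_0x1d = advance(250, 10, 0x1D, char_spacing=0.5, word_spacing=2.0)
    as_0x20 = advance(250, 10, 0x20, char_spacing=0.5, word_spacing=2.0)
    print(as_0x1d, as_0x20)
```

With these made-up numbers the same glyph advances further when it is encoded as 0x20, which is the kind of drift the PUA mapping is meant to avoid.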