Q)
I am using the PDF to ePub feature. For some PDFs, the generated HTML contains characters that have no standard Unicode meaning (they are in the Unicode Private Use Area). These characters display correctly with the generated font, but they cause problems when we want to find text inside the HTML.
I would like to know if there is a way to replace these characters with standard Unicode characters. I don’t care about the visualization of the ePub in this case, because I will use the replaced text only for searches.
I have no problem with these characters when I display the XHTML in a browser; the content is displayed correctly. But I need to parse the content of the XHTML, in standard UTF-8 encoding, to provide search functionality with our own search engine. I can’t use the Unicode text of the PDF because I need the XHTML content, with its tags.
A)
The root of the problem is that HTML doesn’t separate glyphs from Unicode text (unlike PDF, SVG, and XPS, which all do).
For instance, &#xFB01; is the character reference for Unicode U+FB01, the “fi” ligature.
However, try copying and pasting the ligatures from the web page below into a text editor. You won’t see “fi”; instead, you will get a garbage character.
http://adamdscott.com/typography/ligatures-on-the-web/
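For the ligature case specifically, a search indexer can fold the single ligature code point back into plain letters with Unicode compatibility normalization. The following is only a minimal sketch in Python using the standard library; the function name fold_for_search is hypothetical and not part of any SDK. Note that this does not help with Private Use Area characters, which have no standard decomposition.

```python
import unicodedata

def fold_for_search(text):
    """Apply NFKC normalization so compatibility characters such as the
    U+FB01 ligature are replaced by their plain-letter equivalents."""
    return unicodedata.normalize("NFKC", text)

# U+FB01 followed by "nd" becomes the four plain letters f, i, n, d.
print(fold_for_search("\ufb01nd"))
```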
There are other cases, though, where the conversion decides that a glyph needs to be mapped to the Unicode Private Use Area to ensure that the generated HTML looks as close as possible to the input PDF page.
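To see how much of a generated page is affected, it can help to list the Private Use Area code points it contains. This is a rough sketch, assuming the XHTML has already been read into a Python string; find_private_use is a hypothetical helper name. It relies on the Unicode general category "Co" (private use), which covers U+E000 to U+F8FF as well as the two supplementary private-use planes.

```python
import unicodedata

def find_private_use(text):
    """Return (index, 'U+XXXX') for every private-use code point in text."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = private use
    ]

# Example: one private-use character at index 2.
print(find_private_use("ab\ue001c"))
```

Pages for which this list is empty can be indexed directly; the others are the ones that need the extracted PDF text.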
That being said, to include the full UTF-8 encoding of the text from a PDF, the following should work:
- Run ToEpub in expanded mode (unzipped)
- Run the text extractor on the PDF and get the text as UTF-8. Use TextExtractor::GetAsText(), and then call UString::ConvertToUtf8().
- Clean up the UTF-8 text to ensure no XML reserved characters are included (e.g. ‘>’ must become ‘&gt;’).
- Inject the XML-compliant UTF-8 text into the respective XHTML pages. I assume you could put this into its own div, and mark the div as not visible, so nothing gets rendered.
- Run “epubcheck -mode exp -save” on the epub folder. This will validate the epub files and then zip into a valid .epub file.
- You can get epubcheck here: http://code.google.com/p/epubcheck/
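The clean-up and injection steps above could be sketched roughly as follows in Python. This is an illustration under assumptions, not the SDK's own method: the page text is assumed to have been extracted already (for example via TextExtractor::GetAsText()) and passed in as a string, the helper names to_xml_safe and inject_hidden_text and the div id "search-text" are hypothetical, and the hidden div is simply placed just before the closing body tag. Besides escaping reserved characters, the sketch also drops characters that XML 1.0 does not allow at all, such as most control characters.

```python
import re
from xml.sax.saxutils import escape

# Anything outside the XML 1.0 "Char" production is not allowed in a document.
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def to_xml_safe(text):
    """Drop characters illegal in XML 1.0, then escape &, < and >."""
    return escape(_XML_ILLEGAL.sub("", text))

def inject_hidden_text(xhtml, page_text, div_id="search-text"):
    """Insert page_text as a non-rendered div just before the last </body>."""
    hidden = '<div id="%s" style="display:none">%s</div>' % (
        div_id,
        to_xml_safe(page_text),
    )
    head, sep, tail = xhtml.rpartition("</body>")
    if not sep:
        raise ValueError("no </body> tag found in page")
    return head + hidden + sep + tail

page = "<html><body><p>x</p></body></html>"
print(inject_hidden_text(page, "1 < 2"))
```

The search engine can then index the contents of the hidden div instead of the visible text, while the visible markup and its tags stay untouched.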