Problems extracting PDF text with Mongolian content

moka · December 31, 2016, 9:13am

Hello,

We are currently working on a Win32 application written in C++ (we are evaluating the 6.6.1 SDK).

We have a document with Mongolian content (a sample is attached).
When using Adobe PDF Reader, the copied text has these Unicode code points: 1830 1822 182A 1830 1822 182D 180C
When using PDFViewSimpleTest, the copied text has these Unicode code points: 1830 1822 182A 1830 1822 182D
When using ElementReaderAdvTest, iterating with “for (CharIterator itr = element.GetCharIterator(); itr.HasNext(); itr.Next())” yields: 1830 1822 182A 1830 1822 182D

The PDFNet SDK is missing one character, “180C”.
How can we correct this?

sample.pdf (27.7 KB)

Ryan · January 4, 2017, 7:19pm

Adobe was the only PDF reader that returned U+180C. Every other reader I tried did not return this.

Note that U+180C is MONGOLIAN FREE VARIATION SELECTOR TWO.

Is getting this Unicode variation selector important for you?
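
To make the difference between the two extraction results explicit, here is a minimal sketch in plain C++ that compares the code points from a reference viewer against the code points from another extractor and reports what was dropped. The function names isMongolianFvs and droppedCodePoints are hypothetical helpers, not part of PDFNet, and the comparison assumes the extracted text is simply the reference text with some code points left out.

```cpp
#include <cstddef>
#include <vector>

// True for the Mongolian free variation selectors U+180B..U+180D
// (FVS1..FVS3). Hypothetical helper, not a PDFNet function.
bool isMongolianFvs(char32_t cp) {
    return cp >= 0x180B && cp <= 0x180D;
}

// Returns the code points of `reference` that are absent from `extracted`.
// Assumption: `extracted` is `reference` with some code points dropped,
// i.e. it is a subsequence of `reference`.
std::vector<char32_t> droppedCodePoints(const std::vector<char32_t>& reference,
                                        const std::vector<char32_t>& extracted) {
    std::vector<char32_t> dropped;
    std::size_t j = 0;  // next unmatched position in `extracted`
    for (char32_t cp : reference) {
        if (j < extracted.size() && extracted[j] == cp) {
            ++j;                    // present in both outputs
        } else {
            dropped.push_back(cp);  // only in the reference output
        }
    }
    return dropped;
}
```

Feeding it the two sequences quoted above (1830 1822 182A 1830 1822 182D 180C versus the same sequence without 180C) reports a single dropped code point, U+180C, which isMongolianFvs classifies as a variation selector.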

moka · January 5, 2017, 10:24am

This is important for us. The attachment shows how the display differs with 180C.
I would like to know how Adobe obtained MONGOLIAN FREE VARIATION SELECTOR TWO. From the font information, or from some other mapping?

Ryan · January 12, 2017, 8:27pm

It looks like the extra entry is stored in what is called Marked Content, which is a sort of metadata in the content stream.

You can parse this marked content using this sample code.
https://www.pdftron.com/pdfnet/samplecode.html#LogicalStructure

You are looking for an entry called ActualText. The sample code above dumps everything, so you can see what I mean.
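
As a rough illustration of what such an entry looks like, here is a minimal sketch in plain C++ (no PDFNet calls) that scans an already decompressed content stream for /ActualText entries and decodes them. It rests on several assumptions: the value is written inline as a hex string, it is UTF-16BE text starting with the FEFF byte order mark, and it is not stored as a literal string or referenced indirectly through a named property list. Whether sample.pdf stores its entry in exactly this form has not been verified here, and decodeUtf16BeHex and findActualTexts are hypothetical helper names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Decodes the body of a PDF hex string (the digits between '<' and '>')
// holding UTF-16BE text with a leading FEFF byte order mark.
// Returns an empty string if the BOM is missing.
std::u32string decodeUtf16BeHex(const std::string& hex) {
    std::string digits;
    for (char c : hex)
        if (std::isxdigit(static_cast<unsigned char>(c))) digits += c;

    // Only complete 4-digit groups (one UTF-16 code unit each) are used.
    std::vector<char16_t> units;
    for (std::size_t i = 0; i + 4 <= digits.size(); i += 4)
        units.push_back(static_cast<char16_t>(
            std::stoul(digits.substr(i, 4), nullptr, 16)));

    std::u32string out;
    if (units.empty() || units[0] != 0xFEFF) return out;
    for (std::size_t i = 1; i < units.size(); ++i) {
        char32_t cp = units[i];
        // Combine a high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < units.size() &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
            ++i;
        }
        out += cp;
    }
    return out;
}

// Collects every inline hex-string /ActualText value found in `content`.
std::vector<std::u32string> findActualTexts(const std::string& content) {
    static const std::string key = "/ActualText";
    std::vector<std::u32string> result;
    std::size_t pos = 0;
    while ((pos = content.find(key, pos)) != std::string::npos) {
        pos += key.size();
        while (pos < content.size() &&
               std::isspace(static_cast<unsigned char>(content[pos]))) ++pos;
        // Only hex strings are handled; "<<" would open a dictionary.
        if (pos >= content.size() || content[pos] != '<' ||
            (pos + 1 < content.size() && content[pos + 1] == '<')) continue;
        std::size_t end = content.find('>', pos);
        if (end == std::string::npos) break;
        result.push_back(
            decodeUtf16BeHex(content.substr(pos + 1, end - pos - 1)));
        pos = end;
    }
    return result;
}
```

For a made-up fragment such as "/Span <</ActualText <FEFF182D180C>>> BDC", findActualTexts returns one string holding U+182D followed by U+180C. A real extractor would go through the SDK's own marked-content handling instead of scanning raw text like this.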

moka · January 18, 2017, 1:47pm

Thank you for your reply, but unfortunately the “LogicalStructure” sample code does not work for me.

Can you give me some sample code that reads “ActualText”?

The following is the output of “LogicalStructure” for “sample.pdf”:

PDFNet is running in demo mode.

Permission: read

Sample 1 - Traverse logical structure tree…

This document does not contain any logical structure.

Done 1.

Sample 2 - Get parent logical structure elements from

layout elements.

TEXT:

TEXT: \U1830

TEXT: \U1822\U182A

TEXT: \U1830

TEXT: \U1822

TEXT: \U182D

TEXT:

PATH:

Done 2.

Sample 3 - ‘XML style’ extraction of PDF logical structure and page content.

Done 2.