How do I deal with right-to-left languages (RTL; e.g. Arabic) when extracting text from PDF using PDFNet?

Aaron_Gravesdale · March 18, 2011, 9:40pm

Q: I am using PDFNet SDK for text extraction. Everything works great,
but I am not sure how to deal with languages with right-to-left text
order (in particular, I'm trying Arabic). The characters in the output
are reversed, so the ones that appear leftmost in the document come
first. Are there any plans to support this feature in the future, or any
known workarounds?
---------------------

A: We will probably add more direct support for RTL languages in the
future (e.g. a flag that marks a word or line as RTL), but in the
meantime a simple workaround would be to reverse the order of characters
for words/lines that contain characters in the Arabic range. For
example, you can use the following snippet to identify words/lines with
RTL text:

// Check for RTL lines/words, i.e. those that start with a character
// from an RTL script.
// RTL scripts are: Arabic, Hebrew, Syriac, Thaana
// Unicode ranges reference: http://www.ssec.wisc.edu/~tomw/java/unicode.html
// string p_tag = "<p>";
// string rtl_p_tag = "<p style='direction:rtl; text-align: right'>";
// c is the first character of the word/line, as a Unicode code point.
// 1424..1983 is U+0590..U+07BF (Hebrew through Thaana).
if (c >= 1424 && c <= 1983) {
    // p_tag = rtl_p_tag;
}
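
To make the reversal step concrete, here is a minimal C++ sketch. It assumes you have already obtained each word from the extractor as a sequence of Unicode code points (std::u32string here). The helper names is_rtl_code_point and fix_rtl_word are hypothetical and are not part of the PDFNet API. It reuses the 1424..1983 range from the snippet above.

```cpp
#include <string>

// Hypothetical helper (not a PDFNet call): true if the code point lies in
// U+0590..U+07BF (decimal 1424..1983), the same range as the check above.
bool is_rtl_code_point(char32_t cp) {
    return cp >= 0x0590 && cp <= 0x07BF;
}

// Hypothetical helper (not a PDFNet call): if the word starts with an RTL
// character, return its code points in reverse order; otherwise return the
// word unchanged.
std::u32string fix_rtl_word(const std::u32string& word) {
    if (word.empty() || !is_rtl_code_point(word.front())) {
        return word;
    }
    return std::u32string(word.rbegin(), word.rend());
}
```

Note that this is only a rough workaround. A plain reversal will also flip things that should stay left to right inside an RTL word or line (for example digits), and it can move combining marks to the wrong side of their base character. Handling mixed-direction text properly would need something along the lines of the Unicode Bidirectional Algorithm.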