Extracting sentences from PDF

Ivanho · May 23, 2014, 7:13pm

Q:

I was wondering if there is a concept of a sentence. I can see lines, but nothing about sentences.

If not, is there an easy or recommended way to build up the sentences within a document? This would be useful for online speech streaming.

A:

A PDF document does not include a concept of a sentence. It does not even include a concept of a paragraph, a line, or a word. However, we offer a utility class (pdftron.PDF.TextExtractor, as shown in the TextExtract sample - http://www.pdftron.com/pdfnet/samplecode.html#TextExtract) that can be used to reconstruct words, lines, and paragraphs based on several cues (e.g. spatial info, font, size, color).

TextExtractor does not recognize sentences. However, if you are using ‘.’ as the only cue to separate sentences, it should be fairly easy to group words within a paragraph/flow. You can obtain the Unicode text data for each word/line using the GetString() method:

http://www.pdftron.com/pdfnet/docs/PDFNetC/de/d3a/classpdftron_1_1_p_d_f_1_1_text_extractor_1_1_word.html
http://www.pdftron.com/pdfnet/docs/PDFNetC/dc/db5/classpdftron_1_1_p_d_f_1_1_text_extractor_1_1_line.html
http://www.pdftron.com/pdfnet/docs/PDFNetC/d3/d88/classpdftron_1_1_p_d_f_1_1_text_extractor.html
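
As a rough illustration of the ‘.’-based grouping described above, here is a minimal Python sketch. It does not call the PDFNet API at all. It assumes you have already collected the word strings of one paragraph/flow in reading order (for example, from GetString() on each word) and that trailing punctuation stays attached to the word it follows. The function name group_into_sentences and the SENTENCE_END constant are hypothetical names chosen for this example.

```python
# Sentence-ending cue. The answer above mentions only '.'; this tuple is an
# illustrative knob and can be extended (e.g. with '?' and '!') if needed.
SENTENCE_END = (".",)


def group_into_sentences(words):
    """Group a flat list of word strings into sentence strings.

    `words` is assumed to hold the text of each word of one paragraph,
    in reading order, with punctuation still attached to the word.
    """
    sentences = []
    current = []
    for word in words:
        word = word.strip()
        if not word:
            continue  # skip empty entries
        current.append(word)
        if word.endswith(SENTENCE_END):
            # The cue was seen: close the current sentence.
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep a trailing fragment that has no closing '.'
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    sample = ["A", "PDF", "has", "no", "sentences.", "Only", "glyphs", "exist."]
    for sentence in group_into_sentences(sample):
        print(sentence)
```

Because ‘.’ is the only cue, abbreviations such as "e.g." will also end a sentence, so treat this as a starting point rather than a complete solution.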

Ivanho · May 26, 2014, 5:41am

Just to add to the above post: after extracting text with TextExtractor, you may want to look into third-party tools for grouping natural language text into sentences. One of these is NLTK’s “punkt” module:
http://www.nltk.org/_modules/nltk/tokenize/punkt.html