Do you have tools for getting info about text blocks that we can see after converting to Powerpoint? For example, I need to get info about text paragraphs, bullet and numbered lists, tables, etc. without creating a Powerpoint file.
Converting to PowerPoint gives much better results than converting to HTML or using text extraction.
Do you have SDK methods for converting PDF files to Powerpoint?
Yes, the new API that pdf.online is using will be released in the upcoming PDFTron SDK release early next year. To get notified of the release, you can subscribe here.
Do you have tools for getting info about text blocks that we can see after converting to Powerpoint?
I need to get info about text paragraphs, bullet and numbered lists, tables, etc. without creating a Powerpoint file.
Could you elaborate on why getting info about text paragraphs/bullets/etc. is important for you?
When you say “info” what do you mean exactly? X/Y position (relative to what)?
What do you do with this information once you have it? How does it help you or your users?
Once I know your overall objective then I can assist you best.
Our goal is converting pdf to specific HTML. We need a tool that can detect text paragraphs, tables, bulleted and numbered lists, etc. You have tools for converting PDF to HTML and for text extraction, but the result isn’t the same as what I can see in PowerPoint.
When I say “info”, I mean the ability to get, for example, JSON with all the information about the shapes from the PowerPoint output. We don’t need a PPT file, but we need the information it contains. For example, absolute positions of shapes, width and height, paragraphs info (margins, line height, list options, etc), text runs content and font properties (size, weight, font family, color and other).
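To make the request more concrete, here is a rough sketch of the kind of JSON I have in mind. Every class and field name below is made up for illustration; none of it is part of the PDFTron SDK or any existing API.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


# Hypothetical data model: all names here are illustrative assumptions,
# not types or fields from the PDFTron SDK.
@dataclass
class TextRun:
    text: str
    font_family: str          # original font family from the PDF, if known
    font_size: float          # in points
    bold: bool = False
    color: str = "#000000"


@dataclass
class Paragraph:
    runs: List[TextRun]
    line_height: float = 1.0
    margin_left: float = 0.0
    list_type: Optional[str] = None   # "bullet", "numbered", or None


@dataclass
class Shape:
    x: float                  # absolute position on the page
    y: float
    width: float
    height: float
    paragraphs: List[Paragraph] = field(default_factory=list)


def shape_to_json(shape: Shape) -> str:
    """Serialize a shape, including nested paragraphs and runs, to JSON."""
    return json.dumps(asdict(shape), indent=2)


example = Shape(
    x=72.0, y=90.0, width=400.0, height=60.0,
    paragraphs=[
        Paragraph(
            runs=[TextRun("First item", "Helvetica", 12.0)],
            list_type="bullet",
        )
    ],
)
print(shape_to_json(example))
```

Something with roughly this level of detail per text block would be enough for us; the exact structure does not matter.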
An additional issue is getting the original fonts or font families from the PDF. For example, the conversion to HTML uses the right fonts. I know that these fonts don’t contain all glyphs. Can we have the same fonts in PowerPoint? Or, if we are talking about getting info, can the original font families be added to the text runs?
Our goal is converting pdf to specific HTML.
Yes, but why do you want to do this?
Why not operate on the PDF itself?
How does having PowerPoint and/or HTML help you exactly?
What do you do with the HTML/Powerpoint output?
If you prefer the Powerpoint output over the HTML output, then why not use that?
absolute positions of shapes, width and height, paragraphs info (margins, line height, list options, etc), text runs content and font properties (size, weight, font family, color and other).
Why is this info important for you?
What do you do with this info?
The better I understand your overall objective/requirements, the better I can assist you.
We are creating a tool that allows users to convert PDF to HTML, edit the converted HTML, and then publish it as a website. To make editing convenient, we want to provide HTML that contains correct text paragraphs, lists, tables, etc. So we need all the info about text positions, font properties, etc. Even if we can’t embed a font in a page automatically, we want to specify the correct font family so that users can upload the needed fonts.
And when I mention PowerPoint, I mean that your conversion to PowerPoint is much better than your conversion to HTML at grouping unrelated pieces of text from the PDF into text blocks (paragraphs, tables, lists).
So I’m interested in the ability to get info about these text blocks without creating ppt file.
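To show what we would do with that info, here is a minimal sketch of turning detected text blocks into editable HTML. The input format (dicts with `list_type`, `runs`, `font_family`, `font_size`) is only an assumption on my side, not something the SDK produces today.

```python
from html import escape

# Maps the hypothetical "list_type" value to an HTML list tag.
LIST_TAGS = {"bullet": "ul", "numbered": "ol"}


def render_run(run):
    """Render one text run as a span carrying its font family and size."""
    family = escape(run["font_family"], quote=True)
    style = f"font-family: '{family}'; font-size: {run['font_size']}pt"
    return f'<span style="{style}">{escape(run["text"])}</span>'


def render_paragraphs(paragraphs):
    """Render paragraphs, grouping consecutive list items into one list."""
    out = []
    open_list = None  # tag of the list currently open, if any
    for para in paragraphs:
        tag = LIST_TAGS.get(para.get("list_type"))
        if tag != open_list:
            if open_list:
                out.append(f"</{open_list}>")
            if tag:
                out.append(f"<{tag}>")
            open_list = tag
        inner = "".join(render_run(r) for r in para["runs"])
        out.append(f"<li>{inner}</li>" if tag else f"<p>{inner}</p>")
    if open_list:
        out.append(f"</{open_list}>")
    return "\n".join(out)


paragraphs = [
    {"list_type": None,
     "runs": [{"text": "Intro & more", "font_family": "Arial", "font_size": 12}]},
    {"list_type": "bullet",
     "runs": [{"text": "First", "font_family": "Arial", "font_size": 11}]},
    {"list_type": "bullet",
     "runs": [{"text": "Second", "font_family": "Arial", "font_size": 11}]},
]
print(render_paragraphs(paragraphs))
```

Because the font family is kept on every run, the user can upload the matching font later even if we cannot embed it automatically.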