Extracting Alt Tags using marked_content flags.

Zonker_Harris · August 25, 2014, 7:09am

Hey all,

I have been trying to get this example to work.

https://groups.google.com/forum/#!topic/pdfnet-sdk/HzuCcCgSThU

DictIterator itr = mc_prop.GetDictIterator();
while (itr.HasNext()) {
    Obj key = itr.Key();
    // Console.WriteLine("{0}", key.GetName()); // Key
    Obj value = itr.Value();
    // …
    itr.Next();
}

The itr.Value() call works, but I cannot for the life of me figure out how to extract the values it returns.

I have tried a bunch of different approaches along the lines of

Console.WriteLine("{0}", value.GetACCESSOR()); // value

and nothing seems to work.

Could you post a short chunk of example code to show me what I am missing?

Thanks in advance for your help.

zonker harris
Vitalsource (Ingram)

agravesdale · August 25, 2014, 6:39pm

Hello Zonker,

The SDF.Obj API (http://www.pdftron.com/pdfnet/PDFNet/html/Methods_T_pdftron_SDF_Obj.htm) follows the composite pattern, so from an Obj you can call GetAsPDFText (http://www.pdftron.com/pdfnet/PDFNet/html/M_pdftron_SDF_Obj_GetAsPDFText.htm) to get a printable string for Name and String objects. You can also use Obj.IsNumber() / Obj.GetNumber() to obtain doubles from PDF numbers.
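As a rough Ruby sketch of that idea: check the object's type first, then call the matching getter, instead of calling getters blindly. The method names (IsNumber, IsString, IsName, IsArray, GetNumber, GetAsPDFText, GetName, Size, GetAt) are the SDF.Obj ones; FakeObj and obj_to_text are hypothetical names introduced here only so the snippet runs without the SDK installed.

```ruby
# FakeObj is a hypothetical stand-in for pdftron's SDF Obj so this sketch
# runs without PDFNet. Only the method names mirror SDF.Obj.
class FakeObj
  def initialize(v)
    @v = v
  end

  def IsNumber; @v.is_a?(Numeric); end
  def IsString; @v.is_a?(String); end
  def IsName;   @v.is_a?(Symbol); end
  def IsArray;  @v.is_a?(Array); end

  def GetNumber;    @v.to_f; end
  def GetAsPDFText; @v.to_s; end
  def GetName;      @v.to_s; end
  def Size;         @v.size; end
  def GetAt(i);     @v[i]; end
end

# obj_to_text (hypothetical helper): ask the object what it is, then use
# the getter that is valid for that type.
def obj_to_text(value)
  if value.IsNumber
    value.GetNumber.to_s
  elsif value.IsString
    value.GetAsPDFText
  elsif value.IsName
    value.GetName
  elsif value.IsArray
    items = (0...value.Size).map { |i| obj_to_text(value.GetAt(i)) }
    "[" + items.join(", ") + "]"
  else
    "(unhandled object type)"
  end
end

puts obj_to_text(FakeObj.new(5))        # a PDF number such as an MCID
puts obj_to_text(FakeObj.new(:Figure))  # a PDF name
```

With the real SDK, the value returned by itr.Value would be passed to obj_to_text in place of a FakeObj.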

Ivanho · August 27, 2014, 10:30pm

For deeper coverage of the SDF API, see https://www.pdftron.com/pdfnet/intro.html
and the SDF sample project: https://www.pdftron.com/pdfnet/samplecode.html#SDF

Zonker_Harris · August 29, 2014, 6:28pm

Hey Aaron.

I am still not getting them.

Here is the output around a Figure/Caption block

CURRENT TAG: Span
Traversing the marked content properties dictionary
Key: MCID
Text: 5.0

CURRENT TAG: Figure
Traversing the marked content properties dictionary
Key: BBox
Key: MCID
Text: 53.0
Key: Type

CURRENT TAG: Caption
Traversing the marked content properties dictionary
Key: MCID
Text: 44.0

And here is the code that generates that

      itr = mcProp.GetDictIterator
      puts "Traversing the marked content properties dictionary"
      while itr.HasNext do
        key = itr.Key
        puts "Key: " + key.GetName.to_s
        value = itr.Value
        ##this is a really dumb way to do it
        ##but if i can find the alt tag this way, can figure out a better way to
        ##extract them
        begin
          eval value.GetNumber.to_s
        rescue StandardError => boom
        else
          puts "Text: " + value.GetNumber.to_s
        end
        begin
          eval value.GetAsPDFText.to_s
        rescue StandardError => boom
        else
          puts "Text: " + value.GetAsPDFText.to_s
        end
        begin
          eval value.IsArray
        rescue StandardError => boom
        else
          puts "Warn: Is array"
        end
        itr.Next
      end
    elsif element.GetType == Element::E_marked_content_end
      puts "MC End"
    end
    puts "\n"
  end
  element = reader.Next
end

It is seeing everything except the Alt Tags.

I have checked the PDF source. The tags are there as prescribed by Adobe, so I have no idea what I am missing.

Thanks for your help in advance.

zonker

Vitalsource (Ingram)

Ivanho · September 8, 2014, 11:49pm

It looks like you are able to extract the MCID (Marked Content Identifier), so the remaining question is how to get the relevant ‘Structure Element’. This is shown in the LogicalStructure sample project:

https://www.pdftron.com/pdfnet/samplecode/LogicalStructureTest.cs.html

https://www.pdftron.com/pdfnet/samplecode.html#LogicalStructure

For more info about marked content, see Sections 14.6-7, Marked Content & Logical Structure, in the PDF Reference:
http://xodo.com/view/#/c0c11968-ee14-478e-9b09-6dc5635c0915
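To make the MCID-to-structure-element step concrete, here is a small Ruby sketch that does not use PDFNet at all. It models a structure tree as plain hashes and builds a lookup from MCID to the owning element, which is where an Alt entry typically lives. The tree layout, the alt string, and the name index_by_mcid are all made up for illustration; the MCIDs are the ones from the output posted above. A real document also scopes MCIDs per page, which this sketch ignores.

```ruby
# Hypothetical, simplified structure tree. Each element has a type (:s),
# an optional :alt string, and :kids that are either Integer MCIDs or
# nested element hashes. The alt text below is invented sample data.
TREE = {
  s: "Document",
  kids: [
    { s: "P", kids: [{ s: "Span", kids: [5] }] },
    { s: "Figure", alt: "Sample alternate description", kids: [53] },
    { s: "Caption", kids: [44] }
  ]
}

# Walk the tree and record, for every MCID, the type and Alt text of the
# structure element that directly owns it.
def index_by_mcid(elem, out = {})
  elem.fetch(:kids, []).each do |kid|
    if kid.is_a?(Integer)
      out[kid] = { type: elem[:s], alt: elem[:alt] }
    else
      index_by_mcid(kid, out)
    end
  end
  out
end

lookup = index_by_mcid(TREE)
puts lookup[53].inspect  # the Figure element, which carries the alt text
puts lookup[44].inspect  # the Caption element, which has no alt text
```

The idea is that the MCID read from the marked content properties is only a key; the alternate text is found by looking that key up in the structure tree, which is what the LogicalStructure sample does with the SDK's own tree objects.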