Does PDFNet SDK provide support for PDF V9 Portfolios?

Aaron_Gravesdale · March 5, 2009, 1:55am

Q: Does PDFNet SDK provide support for PDF V9 Portfolios?
-----
A: You can use PDFNet’s SDF/Cos API to work with PDF Portfolios (this
feature was also available in PDF 1.7 under the name "PDF package" but
was not properly implemented in Acrobat).

As an example, you could use the following lines to detect whether a
PDF document contains a portable collection (a.k.a. PDF package or PDF
portfolio):

Obj collection = doc.GetRoot().FindObj("Collection");
if (collection != NULL) {
// The PDF contains a collection.
}

When a PDF viewer first opens a PDF document containing a collection,
it should display the contents of the initial document, along with a
list of the documents present in the "EmbeddedFiles" name tree.

// C# pseudocode to list or extract the parts of a PDF package
// (a.k.a. PDF Portfolio)

pdftron.SDF.NameTree files = NameTree.Find(doc.GetSDFDoc(), "EmbeddedFiles");
if (files.IsValid()) { // Traverse the list of embedded files.
  for (NameTreeIterator i = files.GetIterator(); i.HasNext(); i.Next()) {
    string entry_name = i.Key().GetAsPDFText();
    if (extract_embedded_files) {
      FileSpec file_spec = new FileSpec(i.Value());
      Filter stm = file_spec.GetFileData();
      if (stm != null) {
        FilterReader reader = new FilterReader(stm);
        reader.Read(...);
      }
    }
  }
}

// C++ pseudocode
#include <SDF/NameTree.h>
...
pdftron::SDF::NameTree files = NameTree::Find(*doc.GetSDFDoc(), "EmbeddedFiles");
if (files.IsValid()) { // Traverse the list of embedded files.
  for (NameTreeIterator i = files.GetIterator(); i.HasNext(); i.Next()) {
    UString entry_name;
    i.Key().GetAsPDFText(entry_name);

    if (extract_embedded_files) {
      FileSpec file_spec(i.Value());
      AutoPtr<Filter> stm = file_spec.GetFileData();
      if (stm.get()) {
        FilterReader reader(*stm);
        reader.Read(...);
      }
    }
  }
}

Similarly, you can use the SDF/Cos API to edit or create PDF
package/portfolio entries. To make some package operations more
intuitive, we will introduce a small high-level API in the near future.

Aaron_Gravesdale · March 6, 2009, 8:51pm

Q: I am developing using VB.NET. Can you point me in the direction of
an example to extract an embedded stream in PDF?

Here is what I have so far, based on your example above:

Dim file_spec As FileSpec = New FileSpec(itr.Value())
Dim stm As Filter = file_spec.GetFileData()

If Not (stm Is Nothing) Then
  Dim reader As FilterReader = New FilterReader(stm)
  Dim byt(stm.Size) As Byte
  reader.Read(byt)
  ' WRITE TO NEW DOC HERE
End If
------
A: The simplest way to extract (save to disk) a file specification
(FileSpec) embedded in a PDF is to use the file_spec.Export(filename)
method. For example:

Dim file_spec As FileSpec = New FileSpec(itr.Value())
file_spec.Export("c:\" + entry_name)

'file_spec.GetFileData()' returns a data stream for the embedded file,
but this method is trickier to use (it is primarily useful when you
must extract the embedded data into a memory buffer).

Another way to save an embedded file is as follows:

Dim stm As Filter = file_spec.GetFileData()
If Not (stm Is Nothing) Then
  Dim reader As FilterReader = New FilterReader(stm)
  Dim outfile As StdFile = New StdFile("my.dat", StdFile.OpenMode.e_write_mode)
  Dim writer As FilterWriter = New FilterWriter(outfile)
  writer.WriteFilter(reader)
  writer.Flush()
  outfile.Close()
End If
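
Whichever method you use, keep in mind that the keys in the
"EmbeddedFiles" name tree are arbitrary strings, so an entry name is
not guaranteed to be a valid file name. A minimal C++ sketch of
cleaning up a name before building an output path follows.
SanitizeFileName is a hypothetical helper, not part of PDFNet, and the
set of rejected characters is an assumption based on common Windows
file-name restrictions.

```cpp
#include <string>

// Replaces characters that are unsafe in a file name with '_'.
// Hypothetical helper: the rejected set (control characters plus
// < > : " / \ | ? *) is an assumption, not a PDFNet rule.
std::string SanitizeFileName(const std::string& entry_name) {
    const std::string invalid = "<>:\"/\\|?*";
    std::string out;
    for (unsigned char c : entry_name) {
        const bool bad = c < 0x20 ||
            invalid.find(static_cast<char>(c)) != std::string::npos;
        out += bad ? '_' : static_cast<char>(c);
    }
    // Avoid names that are empty or refer to a directory.
    if (out.empty() || out == "." || out == "..") return "unnamed";
    return out;
}
```

Because path separators are replaced as well, an entry name cannot
redirect the exported file outside the chosen output folder.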

Aaron_Gravesdale · March 17, 2009, 5:51pm

Q: I just had a question about some unexpected (positive) results when
working with PDF portfolios using PDFNet SDK. It seems that the new
documents I am generating are already text searchable. I have spent
the last two weeks building an OCR module, and now it seems that some
of the documents (portfolios that I've extracted and merged using
PDFNet) may not need OCR at all.

Does this make any sense to you? Here is the code I am using to
extract and merge documents in a PDF portfolio.

Private Function mergePortfolio(ByVal filename As String) As String
    Dim appDir = Application.StartupPath

    Dim pdf As PDFDoc = New PDFDoc(filename)

    log("In mergePortfolio()")
    'Log("stageDIR=" + stageDIR)

    Dim files As NameTree = NameTree.Find(pdf, "EmbeddedFiles")
    If (files.IsValid()) Then

        ' to put all attachments in
        Dim tempMergeDoc As New PDFDoc()

        ' Traverse the list of embedded files.
        Dim itr As NameTreeIterator = files.GetIterator()
        While (itr.HasNext())

            ' get the file name
            Dim entry_name As String = itr.Key().GetAsPDFText()
            log("Breaking out entry: " + entry_name)

            ' extract the document
            Dim fileSpec As FileSpec = New FileSpec(itr.Value())
            Dim extractFilename As String = appDir + "\" + entry_name.Replace("<0>", "")
            log("extractFilename=" + extractFilename)

            fileSpec.Export(extractFilename)
            Dim extractedPDF As PDFDoc = New PDFDoc(extractFilename)

            ' Iterate over pages and add to merge document
            Dim itrPages As PageIterator = extractedPDF.GetPageIterator
            While itrPages.HasNext
                tempMergeDoc.PagePushBack(itrPages.Current)
                itrPages.Next()
            End While

            extractedPDF.Close()

            ' remove the extracted documents, we don't need them
            System.IO.File.Delete(extractFilename)
            itr.Next() ' next item in collection
        End While

        ' Substring(0, index of ".") keeps the whole base name
        Dim newFilename = filename.Substring(0, filename.LastIndexOf(".")) + "-MERGED.PDF"

        ' Log("newFilename=" + newFilename)
        tempMergeDoc.Save(newFilename, pdftron.SDF.SDFDoc.SaveOptions.e_incremental)

        pdf.Close()
        Return newFilename
    Else
        log("Invalid Portfolio PDF")
        Return ""
    End If ' file isValid
End Function
----

A: At the moment PDFNet does not come with a built-in OCR module.
However, it is quite possible that the PDF files you are working with
are already searchable (e.g. many scanning applications these days
perform automatic OCR when exporting to PDF). Copying pages into a new
document preserves whatever text they already contain.