Is it possible to pass the input file for conversion as an InputStream in Java?

Reza_Asadollahi · January 15, 2016, 9:53pm

The Convert.wordToPdf methods only accept the input as a filename. Is it possible to introduce a new method that accepts an InputStream instead? The PDFDoc.save method accepts an OutputStream, so it would make sense to be able to read from an InputStream, especially when the data comes from another source such as a URL.

Thanks,

Ryan · April 5, 2019, 5:15pm

Yes. There is now a WordToPdf API that takes in a filter. For Java it is the following.
https://www.pdftron.com/pdfnet/docs/PDFNetJava/com/pdftron/pdf/Convert.html#wordToPdf%28com.pdftron.sdf.Doc,%20com.pdftron.filters.Filter,%20com.pdftron.pdf.WordToPDFOptions%29

Latest builds here
https://www.pdftron.com/pdfnet/downloads.html

Now, how do you use it with an InputStream? The following code will do this for you. Note that currently everything needs to be loaded into memory. This is because document formats like DOCX and PDF require random access to their bytes, so the entire stream has to be loaded before conversion can start.

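InputStream stream = new FileInputStream(file); // a file is used for the demo; any InputStream can be read the same way
// available() is only a size hint: it is accurate for a local file but just an estimate for network streams
com.pdftron.filters.MemoryFilter memoryFilter = new com.pdftron.filters.MemoryFilter(stream.available(), false); // false = sink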
com.pdftron.filters.FilterWriter writer = new com.pdftron.filters.FilterWriter(memoryFilter); // helper filter to allow us to write to buffer
int buf_sz = 1024 * 1024; // set intermediate buffer to 1MiB
byte[] buf = new byte[buf_sz];
int read;
int total_read = 0;
while ((read = stream.read(buf)) != -1) {
	if(read < buf_sz) {
		// a read can return fewer bytes than the buffer holds (the last one almost always does), so write just those
		for(int i = 0; i < read; ++i) {
			writer.writeUChar(buf[i]);
		}
	} else {
		writer.writeBuffer(buf);
	}
	total_read += read;
}
writer.flush(); // Don't forget to flush!
memoryFilter.setAsInputFilter(); // switch from sink to source
Convert.officeToPdf(pdfdoc, memoryFilter, null);
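
The snippet above sizes the MemoryFilter from stream.available(), which the JDK documents as an estimate of the bytes readable without blocking, not the total length. For streams that come from a URL or socket, one option is to buffer the stream into a byte array first so the exact length is known. The following is a minimal sketch using only the JDK; the class name StreamBuffers and the method readAll are hypothetical helpers for illustration and are not part of the PDFTron SDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of the PDFTron SDK): buffers an InputStream fully in memory.
final class StreamBuffers {
    private StreamBuffers() {}

    /** Reads the stream to its end and returns every byte it produced. */
    static byte[] readAll(InputStream in, int chunkSize) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] chunk = new byte[chunkSize];
        int read;
        while ((read = in.read(chunk)) != -1) {
            // Any call to read() may return fewer bytes than requested,
            // so copy only the bytes that were actually read.
            collected.write(chunk, 0, read);
        }
        return collected.toByteArray();
    }
}
```

The length of the returned array could then be passed to the MemoryFilter constructor in place of stream.available(), and the array itself handed to writer.writeBuffer in a single call. This still holds the whole document in memory, as the conversion requires.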

Ryan · April 5, 2019, 6:39pm

Another customer asked for C# code to convert an Office file entirely in memory.

// For demo purpose use FileStream
FileStream fs = new FileStream(input_path + "simple-word_2007.docx", FileMode.Open);
pdftron.Filters.MemoryFilter memoryFilter = new pdftron.Filters.MemoryFilter((int)fs.Length, false); // false = sink
pdftron.Filters.FilterWriter writer = new pdftron.Filters.FilterWriter(memoryFilter); // helper filter to allow us to write to buffer
int bytes_read = 0;
byte[] buf = new byte[10 * 1024]; // 10 KiB buffer
do
{
    bytes_read = fs.Read(buf, 0, buf.Length);
    if(bytes_read < buf.Length)
    {
        for(int i = 0; i < bytes_read; i++)
        {
           writer.WriteUChar(buf[i]);
        }
    }
    else
    {
        writer.WriteBuffer(buf);
    }
} while (bytes_read > 0);
writer.Flush();
memoryFilter.SetAsInputFilter(); // switch from sink to source
PDFDoc pdfdoc = new PDFDoc();
pdftron.PDF.Convert.OfficeToPDF(pdfdoc, memoryFilter, null);
// For demo purpose write back to disk
pdfdoc.Save(output_path + "simple-word_2007.docx.pdf", SDFDoc.SaveOptions.e_linearized);
// But most likely you want to save in memory
byte[] pdfData = pdfdoc.Save(SDFDoc.SaveOptions.e_linearized);

alex · March 7, 2022, 4:25pm

And here's the code I used, for Python users: it passes the document as a byte stream and reads from / writes to cloud storage on Google Cloud Platform (GCP), in case it's useful.

import json, io
from google.cloud import storage
from PDFNetPython3 import *

storage_client = storage.Client()
bucket = storage_client.bucket( "template_bucket" )
office_template_file = bucket.blob( "path/to/read/docx" )

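# current_app is not defined in this snippet; it is assumed to be Flask's application proxy (from flask import current_app)
PDFNet.Initialize( current_app.config["PDFTRON_API_KEY"] )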

parameters = {
    "first_name": "Bob",
    "last_name": "Banks",
}

output_pdf_object = PDFDoc()
options = OfficeToPDFOptions()
options.SetTemplateParamsJson(json.dumps(parameters))

office_template_blob = office_template_file.download_as_bytes()
mf = MemoryFilter(len(office_template_blob), False)
fw = FilterWriter(mf)
for byte in office_template_blob:
    fw.WriteUChar(byte)
fw.Flush()
mf.SetAsInputFilter()
Convert.OfficeToPDF(output_pdf_object, mf, options)
pdf_file_in_memory = output_pdf_object.Save(SDFDoc.e_linearized)


buffer = io.BytesIO(pdf_file_in_memory)
buffer.seek(0)
bucket = storage_client.bucket("pdf_bucket")
pdf_destination_file = bucket.blob( "path/to/write/pdf" )
pdf_destination_file.upload_from_file( buffer )

mbmahesha47 · October 10, 2023, 1:45pm

Hi Ryan,

Convert.officeToPdf(pdfdoc, filter, null); uses a lot of memory for DOCX to PDF conversion, and the memory is not released afterwards.

Any suggestions on how to optimize memory usage while converting MS Office files to PDF using the PDFTron Java SDK 10.2?