How to read a PDF Portfolio using iTextSharp - c#

I'm using iTextSharp, in a C# app that reads PDF files and breaks out the pages as separate PDF documents. It works well, except in the case of portfolios. Now I'm trying to figure out how to read a PDF portfolio (or Collection, as they seem to be called in iText) that contains two embedded PDF documents. I want to simply open the portfolio, enumerate the embedded files and then save them as separate, simple PDF files.
There's a good example of how to programmatically create a PDF portfolio, here:
Kubrick Collection Example
But I haven't seen any examples that read portfolios. Any help would be much appreciated!

The example you referenced adds the embedded files as document-level attachments. So you can extract the files like this:
PdfReader reader = new PdfReader(readerPath);
PdfDictionary root = reader.Catalog;
PdfDictionary documentnames = root.GetAsDict(PdfName.NAMES);
PdfDictionary embeddedfiles =
    documentnames.GetAsDict(PdfName.EMBEDDEDFILES);
// The Names array alternates: a name string, then its file specification.
PdfArray filespecs = embeddedfiles.GetAsArray(PdfName.NAMES);
for (int i = 0; i < filespecs.Size; ) {
    filespecs.GetAsString(i++); // skip the name entry
    PdfDictionary filespec = filespecs.GetAsDict(i++);
    // EF maps keys such as F or UF to the embedded file streams.
    PdfDictionary refs = filespec.GetAsDict(PdfName.EF);
    foreach (PdfName key in refs.Keys) {
        PRStream stream = (PRStream) PdfReader.GetPdfObject(
            refs.GetAsIndirectObject(key)
        );
        // The same key in the file specification holds the file name.
        using (FileStream fs = new FileStream(
            filespec.GetAsString(key).ToString(), FileMode.OpenOrCreate
        )) {
            byte[] attachment = PdfReader.GetStreamBytes(stream);
            fs.Write(attachment, 0, attachment.Length);
        }
    }
}
Pass the output file from the Kubrick Collection Example you referenced to the PdfReader constructor (readerPath) if you want to test this.
Java version: part4.chapter16.KubrickDocumentary
C# version.
Hopefully I'll have time to update the C# examples this month from version 5.2.0.0 (the iTextSharp version is about three weeks behind the Java version right now).
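The extraction code above reads /Names directly from the EmbeddedFiles dictionary, which matches the file produced by the Kubrick example. The PDF specification also allows a name tree to be split into intermediate nodes that list their children under /Kids, so a portfolio written by another tool may have no top-level /Names array. The following is only a sketch of how both cases might be handled; the class and method names (NameTreeWalker, CollectFileSpecs) are made up for illustration, and it assumes iTextSharp 5.x.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

static class NameTreeWalker
{
    // Collects file specification dictionaries from a name tree node,
    // reading /Names on leaf nodes and recursing into /Kids otherwise.
    public static void CollectFileSpecs(PdfDictionary node, List<PdfDictionary> result)
    {
        if (node == null) return;

        PdfArray names = node.GetAsArray(PdfName.NAMES);
        if (names != null)
        {
            // Entries alternate: name string at even indexes, value at odd ones.
            for (int i = 1; i < names.Size; i += 2)
            {
                PdfDictionary filespec = names.GetAsDict(i);
                if (filespec != null) result.Add(filespec);
            }
        }

        PdfArray kids = node.GetAsArray(PdfName.KIDS);
        if (kids != null)
        {
            for (int i = 0; i < kids.Size; i++)
            {
                CollectFileSpecs(kids.GetAsDict(i), result);
            }
        }
    }
}
```

It could be called with documentnames.GetAsDict(PdfName.EMBEDDEDFILES) as the starting node, after which each collected file specification can be processed exactly as in the loop body above. Because a null node is ignored, a document without any embedded files simply yields an empty list.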

Related

GemBox DocumentModel.Load() cannot read Pdf file

Currently I am unable to load the original PDF document using GemBox; it gives me an error (originally shown in an image). I am using Acrobat 9.
I have also tried the 8/16/2018 fixes. Any suggestions will be highly appreciated.
The basic code I am using is:
using GemBox.Document;
using System;
namespace Pdf2Text
{
    class Program
    {
        [STAThread]
        static void Main(string[] args)
        {
            ComponentInfo.SetLicense("My-License");
            DocumentModel document = null;
            document = DocumentModel.Load(@"E:\data\testing\HA021.pdf");
            document.Save(@"E:\data\testing\HA021.docx");
        }
    }
}
The current implementation of the PDF reader in GemBox.Document is still in beta and cannot handle one feature of this PDF: cross-reference streams, which are cross-reference tables stored in streams.
However, GemBox.Pdf can handle cross-reference streams, so as a workaround you could do something like the following:
// Load PDF with GemBox.Pdf.
var pdfDocument = PdfDocument.Load("Sample.pdf");
pdfDocument.SaveOptions.CrossReferenceType = PdfCrossReferenceType.Table;
// Save PDF with GemBox.Pdf.
var pdfStream = new MemoryStream();
pdfDocument.Save(pdfStream);
// Load PDF with GemBox.Document.
var document = DocumentModel.Load(pdfStream, LoadOptions.PdfDefault);
Lastly, regarding the conversion of PDF to DOCX: GemBox.Document's PDF reader is currently intended for extracting text and tables from PDF files. It is not intended for any high-fidelity requirement.

How to open an existing PDF file with Migradoc PDF library

I am trying to use the Migradoc library from PDFSharp (http://www.pdfsharp.net/) to print pdf files. So far I have found that Migradoc does support printing through its MigraDoc.Rendering.Printing.MigraDocPrintDocument class. However, I have not found a way to actually open an existing PDF file with MigraDoc.
I did find a way to open an existing PDF file using PDFSharp, but I cannot successfully convert a PDFSharp.Pdf.PdfDocument into a MigraDoc.DocumentObjectModel.Document object. So far I have not found the MigraDoc and PDFSharp documentation to be very helpful.
Does anyone have any experience using these libraries to work with existing PDF files?
I wrote the following code with help from this sample, but the result when my input PDF is 2 pages is an output PDF with 2 blank pages.
using MigraDoc.DocumentObjectModel;
using MigraDoc.Rendering;
using PdfSharp.Drawing;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
...
public void PrintPDF(string filePath, string outFilePath)
{
    var document = new Document();
    var docRenderer = new DocumentRenderer(document);
    docRenderer.PrepareDocument();
    var inPdfDoc = PdfReader.Open(filePath, PdfDocumentOpenMode.Modify);
    for (var i = 0; i < inPdfDoc.PageCount; i++)
    {
        document.AddSection();
        docRenderer.PrepareDocument();
        var page = inPdfDoc.Pages[i];
        var gfx = XGraphics.FromPdfPage(page);
        docRenderer.RenderPage(gfx, i+1);
    }
    var renderer = new PdfDocumentRenderer();
    renderer.Document = document;
    renderer.RenderDocument();
    renderer.PdfDocument.Save(outFilePath);
}
Your code modifies the inPdfDoc in memory without saving the changes. Complicated code without any visual effect.
MigraDoc cannot open PDF files, MigraDoc cannot print PDF files, PDFsharp cannot print PDF files.
http://www.pdfsharp.net/wiki/PDFsharpFAQ.ashx

Word document orientation lost after using OpenXML SDK AddAlternativeFormatImportPart

I am attempting to merge several Word documents together into a single Word document. I am using the AltChunk capability from Microsoft's OpenXML SDK 2.5. The final report needs to be in landscape orientation, thus we have put each component document into landscape mode. I am merging the documents using the following code.
for (int i = 0; i < otherDocs.Length; i++)
{
    using (var headerDoc = WordprocessingDocument.Open(headerPath, true))
    {
        var mainPart = headerDoc.MainDocumentPart;
        string altChunkId = "AltChunkId" + i;
        var chunk = mainPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.WordprocessingML, altChunkId);
        using (var fileStream = File.Open(otherDocs[i], FileMode.Open))
        {
            chunk.FeedData(fileStream);
        }
        var altChunk = new AltChunk();
        altChunk.Id = altChunkId;
        mainPart.Document.Body.InsertAfter(altChunk, mainPart.Document.Body.Elements<Paragraph>().Last());
        mainPart.Document.Save();
        DocumentWriter.SetPrintOrientation(headerDoc, PageOrientationValues.Landscape);
        headerDoc.Close();
    }
}
When I run this code, the final output document has a mix of landscape and portrait orientation if the component documents each have at least one section break.
The DocumentWriter.SetPrintOrientation() method is implemented according to instructions from MSDN. It seems to have no effect on the actual orientation of the document. I have also examined the underlying XML files, and all "orient" attributes are set to landscape.
Is there some configuration option or API call I can use to ensure the final document will have landscape orientation across all sections?
An OpenXML Word document is a zipped collection of XML documents that define its content, formatting, and metadata. When Word documents are merged using AddAlternativeFormatImportPart, the documents being merged into the original document (the AltChunks) are copied into the zip archive as-is, and XML elements referencing them are added to the main document's XML. The actual merge happens later: the next time anyone opens the resulting document in Microsoft Word (or any other OpenXML-compatible editor), that application merges in the AltChunks. In this case, Microsoft Word 2010 (the version we are using) has a bug that causes it to ignore some of the formatting information defined in Word document sections. Orientation is one of the ignored settings, which makes the AltChunk approach ineffective at preserving orientation.
Instead, we used DocumentBuilder from the OpenXml Power Tools project. This resulted in much simpler code that solved our problem in two lines:
var sourceList = documentPaths.Select(doc => new Source(new WmlDocument(doc), true)).ToList();
DocumentBuilder.BuildDocument(sourceList, outputPath);
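Since the answer describes a .docx file as a zip archive into which the AltChunk documents are copied, one way to see this for yourself is to list the archive's entries. The following is a minimal sketch using only System.IO.Compression; the class and method names (DocxInspector, ListParts) are made up for illustration, and the exact part names that the SDK assigns to imported chunks are not something this sketch assumes.

```csharp
using System;
using System.Collections.Generic;
using System.IO.Compression;

static class DocxInspector
{
    // Returns the full names of all entries in the .docx zip package.
    public static List<string> ListParts(string docxPath)
    {
        var parts = new List<string>();
        using (ZipArchive archive = ZipFile.OpenRead(docxPath))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                parts.Add(entry.FullName);
            }
        }
        return parts;
    }

    static void Main(string[] args)
    {
        // args[0] is the path to a merged document (hypothetical input).
        foreach (string part in ListParts(args[0]))
        {
            Console.WriteLine(part);
        }
    }
}
```

Run against a document merged with AltChunks, the listing should include the imported documents as separate entries next to the main document part, which is consistent with the merge being deferred until an editor opens the file.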

Merging two PDF pages on top of each other

I am looking for a way to merge the content of two pdf pages.
It could be a watermark, an image or whatever.
The scenario is as follows:
I have a Word-addin that allows the user to create different templates for different customers based on several template forms. For each new customer, the user can provide a new letter paper containing header image / logos and footer. This shall be applied anyhow at runtime. Could be an image that is loaded directly into the header of the template (then I would need to render pdf to image, for the letter paper will mostly be provided as pdf-file) or when exporting the document (merging letter paper as background).
But the template shall not be accessible by the user, so this must be done programmatically.
So far, I have tried the PDFsharp library, which supports neither the PDF version of the letter papers I was given nor the version of the documents I create in Word 2007.
iTextSharp seemed very promising, but I have not managed to merge the contents so far.
I also tried pdftk.exe, but even when I ran it manually from the command line, I got the error: "Done. Input errors, so no output created."
It does not matter how it is handled; only the output matters.
I forgot to mention: there is a white line in the Word template for archiving purposes, so this part must not be added as an image, or it has to be added to the output document afterwards.
Thanks in advance!
StampStationery.cs, a sample from the Webified iTextSharp Examples which essentially are the C#/iTextSharp versions of the Java/iText samples from the book iText in Action — 2nd Edition, does show how to add the contents of a page from one PDF document as stationery behind the content of each page of another PDF.
The central method is this:
public byte[] ManipulatePdf(byte[] src, byte[] stationery)
{
    // Create readers
    PdfReader reader = new PdfReader(src);
    PdfReader s_reader = new PdfReader(stationery);
    using (MemoryStream ms = new MemoryStream())
    {
        // Create the stamper
        using (PdfStamper stamper = new PdfStamper(reader, ms))
        {
            // Add the stationery to each page
            PdfImportedPage page = stamper.GetImportedPage(s_reader, 1);
            int n = reader.NumberOfPages;
            PdfContentByte background;
            for (int i = 1; i <= n; i++)
            {
                background = stamper.GetUnderContent(i);
                background.AddTemplate(page, 0, 0);
            }
        }
        return ms.ToArray();
    }
}
This method returns the manipulated PDF as a byte[].

Strip Adobe Reader and Version requirements from PDF before outputting it to browser

I am planning on using pdf.js to display PDF content in the browser with JavaScript. The problem is that some PDFs, including the ones I am using, require Adobe Reader at a specific version. pdf.js does not yet (ever?) support spoofing these. What I need to know is whether there is a way in C# to open the PDF and remove these Reader and version requirements, and how to do it. I was planning on using iTextSharp for other server-side PDF manipulation, so an example using it would be most helpful. I plan on serving these as an ActionResult from an AJAX request via MVC 4, so a MemoryStream at the end of this manipulation would be most helpful.
So in the end pdf.js was unable to do what I needed it to. What I was able to do, however, was convert the XFA/PDF to a C# object and then send the pages as needed via JSON to my JavaScript for rendering in the HTML5 canvas. The code below takes an XFA-in-a-PDF file and turns it into a C# object with the help of iTextSharp:
PdfReader.unethicalreading = true;
PdfReader reader = new PdfReader(new FileStream(Statics.PdfUploadLocation + PdfFileName, FileMode.Open, FileAccess.Read));
XfaForm xfaForm = new XfaForm(reader);
XDocument xDoc = XDocument.Parse(xfaForm.DomDocument.InnerXml);
string xfaNamespace = @"{http://www.xfa.org/schema/xfa-template/2.6/}";
List<XElement> formPages = xDoc.Descendants(xfaNamespace + "subform").Descendants(xfaNamespace + "subform").ToList();
TotalPages = formPages.Count();
var fieldIndex = 0;
RawPdfFields = new List<XfaField>();
for (int page = 0; page < formPages.Count(); page++)
{
    RawPdfFields.AddRange(formPages[page].Descendants(xfaNamespace + "field")
        .Select(x => new XfaField
        {
            Page = page,
            Index = fieldIndex++,
            Name = (string)x.Attribute("name"),
            Height = GetUnitFromPossibleString((string)x.Attribute("h")),
            Width = GetUnitFromPossibleString((string)x.Attribute("w")),
            XPosition = GetUnitFromPossibleString((string)x.Attribute("x")),
            YPosition = GetUnitFromPossibleString((string)x.Attribute("y")),
            Reference = GetReference(x.Descendants(xfaNamespace + "traverse")),
            AssistSpeak = GetAssistSpeak(x.Descendants(xfaNamespace + "speak"))
        }).ToList());
}
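The lookups in the code above work because LINQ to XML converts a string of the form "{namespace}localName" into an XName implicitly. As a small stand-alone illustration of just that mechanism, here is a sketch against a made-up XML snippet (the class name XfaNamespaceDemo and the sample markup are assumptions, not taken from any real form):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class XfaNamespaceDemo
{
    // Counts <field> elements that belong to the given XML namespace.
    public static int CountFields(string xml, string ns)
    {
        XDocument doc = XDocument.Parse(xml);
        // A "{namespace}localName" string converts implicitly to an XName.
        return doc.Descendants("{" + ns + "}" + "field").Count();
    }

    static void Main()
    {
        const string ns = "http://www.xfa.org/schema/xfa-template/2.6/";
        string xml =
            "<template xmlns=\"" + ns + "\">" +
            "<subform><subform>" +
            "<field name=\"a\"/><field name=\"b\"/>" +
            "</subform></subform>" +
            "</template>";
        Console.WriteLine(CountFields(xml, ns)); // prints 2
    }
}
```

An equivalent and arguably more conventional form is to declare an XNamespace and write ns + "field". Either way, the namespace is hard-coded to the 2.6 template schema, so a form authored against a different XFA template version would presumably need a different namespace string.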
Your PDF file n-400.pdf uses the Adobe XML Forms Architecture (XFA). This means you need a viewer that also supports XFA, which pdf.js apparently does not.
Such a PDF normally contains some standard PDF content indicating that the PDF requires a viewer that supports XFA. In your case the content contains
If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document.
This actually indicates what an XFA-enabled viewer does: it renders pages based on the information in the XFA XML data and displays them instead of the PDF-style page descriptions.
While being defined proprietarily by Adobe, the PDF specification ISO 32000-1 describes how XFA data is to be embedded in a PDF document, cf. section 12.7.8 XFA Forms.
If you only need those forms in a flattened state, you might want to have a look at iText Demo: Dynamic XFA forms in PDF.
