C# iTextSharp: replace a page of a multi-page PDF

Say I have a 5-page PDF called 'a.pdf' in which pages 2 and 4 are empty, and another 2-page PDF called 'b.pdf'. I want to copy the first page of 'b.pdf' to page 2 of 'a.pdf' and the second page of 'b.pdf' to page 4 of 'a.pdf'.
Examples were hard to find; the only one I found is here:
http://itextsharp.10939.n7.nabble.com/Replace-Pages-with-ItextSharp-td2956.html
It uses 'PdfStamper.ReplacePage()', which I guess is what I'm looking for, but my simple demo didn't work. Could someone check it for me?
string _outMergeFile = Server.MapPath("~/11/a.pdf");
string file2 = Server.MapPath("~/11/b.pdf");
PdfReader readerA = new PdfReader(_outMergeFile);
PdfReader readerB = new PdfReader(file2);
PdfStamper cc = new PdfStamper(readerA,new MemoryStream());
cc.ReplacePage(readerB, 1, 2);
cc.ReplacePage(readerB, 2, 4);
cc.Close();
Thanks in advance.
=================================================================================
Thanks to Jose's suggestion. The code works now. I'm now providing a simple sample here for others to reference.
public void MyFunction()
{
string _outMergeFile = Server.MapPath("~/11/a.pdf");
string file2 = Server.MapPath("~/11/b.pdf");
PdfReader readerA = new PdfReader(_outMergeFile);
PdfReader readerB = new PdfReader(file2);
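// FileMode.Create overwrites any existing result.pdf; FileMode.Append would add a second PDF to the end of an existing file
PdfStamper cc = new PdfStamper(readerA, new FileStream(Server.MapPath("~/11/result.pdf"), FileMode.Create));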
cc.ReplacePage(readerB, 1, 2);
cc.Close();
}

OK, I think I've found your problem. cc is created over a MemoryStream, and I don't see any code that saves the result to a file, so the changes made in memory are lost once the stamper is closed. One option is to create it with a new FileStream() instead of a memory stream.
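If you would rather keep the MemoryStream (for example to return the PDF from a web request), the bytes have to be copied out of it and saved yourself. Below is a minimal sketch of that pattern. It uses a plain StreamWriter as a stand-in for PdfStamper so that it runs without iTextSharp; the class and method names (InMemoryOutput, Produce, Save) are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class InMemoryOutput
{
    // Stand-in for "create a writer over the stream, write, Close()".
    public static byte[] Produce(string payload)
    {
        var buffer = new MemoryStream();
        using (var writer = new StreamWriter(buffer, new UTF8Encoding(false)))
        {
            writer.Write(payload);
        } // disposing the writer flushes it and closes 'buffer'

        // ToArray() still works on a closed MemoryStream.
        return buffer.ToArray();
    }

    // Persist the in-memory result; without a step like this the bytes are simply dropped.
    public static void Save(string path, string payload)
    {
        File.WriteAllBytes(path, Produce(payload));
    }
}
```

With the real stamper, the equivalent step would be to keep a reference to the MemoryStream, close the stamper, and then write the stream's ToArray() result to disk or to the response.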

Related

Unable to cast object of type 'iTextSharp.text.pdf.PdfArray' to type 'iTextSharp.text.pdf.PRStream'

I have to replace the number "14-1" with "10-2". I am using the following iText code, but I get the type cast error above. Can anyone help me modify the program to remove the casting issue?
I have many PDFs in which I have to replace the numbers at the same location, so I also need to understand the logic of how to do this:
using System;
using System.IO;
using System.Text;
using iTextSharp.text.io;
using iTextSharp.text.pdf;
using System.Windows.Forms;
namespace iText5
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
public const string src = @"D:\test1\A.pdf";
public const string dest = @"D:\test1\ENV1.pdf";
private void button1_Click(object sender, EventArgs e)
{
FileInfo file = new FileInfo(dest);
file.Directory.Create();
manipulatePdf(src, dest);
}
public void manipulatePdf(String src, String dest)
{
PdfReader reader = new PdfReader(src);
PdfDictionary dict = reader.GetPageN(1);
PdfObject obj = dict.GetDirectObject(PdfName.CONTENTS);
PRStream stream = (PRStream)obj;
byte[] data = PdfReader.GetStreamBytes(stream);
string xyz = Encoding.UTF8.GetString(data);
byte[] newBytes = Encoding.UTF8.GetBytes(xyz.Replace("14-1", "10-2"));
stream.SetData(newBytes);
PdfStamper stamper = new PdfStamper(reader, new FileStream(dest, FileMode.Create));
stamper.Close();
reader.Close();
}
}
}
This is a problem:
PdfDictionary dict = reader.GetPageN(1);
PdfObject obj = dict.GetDirectObject(PdfName.CONTENTS);
PRStream stream = (PRStream)obj;
First you get a page dictionary. That page dictionary has a /Contents entry. If you read the PDF standard (ISO 32000), then you see that the value of the /Contents entry can be either a stream, or an array. You assume that it's always a stream. In some cases, your code will work, but in cases where the value of the /Contents entry is an array of references to a series of streams, you will get a class cast error (for the obvious reason that an array of streams is not the same as a stream).
I think that you want to do something like this:
byte[] data = reader.GetPageContent(i);
string xyz = PdfEncodings.ConvertToString(data, PdfObject.TEXT_PDFDOCENCODING);
string abc = xyz.Replace("14-1", "10-2");
reader.SetPageContent(i, PdfEncodings.ConvertToBytes(abc, PdfObject.TEXT_PDFDOCENCODING));
However, that's a very bad idea, because of the reasons explained in the answers to these questions:
Replace the text in pdf document using itextSharp
PDF text replace not working
Is there any API in C# or .net to edit pdf documents?
Using ContentByteUtils for raw PDF manipulation
...
You are making the assumption that you will find a literal string with value "14-1" in the content. That might be true for simple PDF documents, but in many cases the appearance of "14-1" on a page (that you can read with your eyes) doesn't mean the string "14-1" is present as such in the content (that you extract with GetPageContent). That string could be part of an XObject, or the syntax to render "14-1" could be constructed in such a way that xyz.Replace("14-1", "10-2") won't change xyz in any way.
Bottom line: PDF is not a format for editing. A page in a PDF file consists of content that is added at absolute positions. The content on a page doesn't reflow if you change it (e.g. the existing content won't move to the next line or to the next page if you add extra content). Instead of editing a PDF document, you should edit the source that was used to create the document, and then create a new PDF from that source.
Important: you are using an old version of iText. We abandoned the name iTextSharp more than two years ago in favor of iText for .NET. The current version of iText is iText 7.1.2; see Nuget: https://www.nuget.org/packages/itext7/
Many people think that iText 5.5.13 is the latest version. That assumption is wrong. iText 5 has been discontinued and is no longer supported. The recent 5.5.x versions are maintenance releases for paying customers who can't migrate to iText 7 right away.

Editing a pdf document and saving it using ItextSharp writes an empty pdf document [duplicate]

I can extract text from pages in a PDF in many ways:
String pageText = PdfTextExtractor.GetTextFromPage(reader, i);
This can be used to get any text on a page.
Alternatively:
byte[] contentBytes = iTextSharp.text.pdf.parser.ContentByteUtils.GetContentBytesForPage(reader, i);
Possibilities are endless.
Now I want to remove/redact a certain word, e.g. explicit words, sensitive information (putting black boxes over them obviously is a bad idea :) or whatever from the PDF (which is simple and text only). I can find that word just fine using the approach above. I can count its occurrences etc...
I do not care about layout, or the fact that PDF is not really meant to be manipulated in this way.
I just wish to know if there is a mechanism that would allow me to manipulate the raw content of my PDF in this way. You could say I'm looking for "SetContentBytesForPage()" ...
If you want to change the content of a page, it isn't sufficient to change the content stream of a page. A page may contain references to Form XObjects that contain content that you want to remove.
A secondary problem consists of images. For instance: suppose that your document consists of a scanned document that has been OCR'ed. In that case, it isn't sufficient to remove the (vector) text, you'll also need to manipulate the (pixel) text in the image.
Assuming that your secondary problem doesn't exist, you'll need a double approach:
1. Get the content from the page as text to detect in which pages there are names or words you want to remove.
2. Recursively loop over all the content streams to find that text and to rewrite those content streams without that text.
From your question, I assume that you have already solved problem 1. Solving problem 2 isn't that trivial. In chapter 15 of my book, I have an example where extracting text returns "Hello World", but when you look inside the content stream, you see:
BT
/F1 12 Tf
88.66 367 Td
(ld) Tj
-22 0 Td
(Wor) Tj
-15.33 0 Td
(llo) Tj
-15.33 0 Td
(He) Tj
ET
Before you can remove "Hello World" from this stream snippet, you'll need some heuristics so that your program recognizes the text in this syntax.
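To give an idea of what such a heuristic involves, here is a rough sketch that handles only the snippet above. It tracks the x position set by the Td operators, collects the (…) Tj fragments, and sorts them left to right. The class name TjFragments is hypothetical, and the parsing is deliberately naive: it ignores escaped parentheses, TJ arrays, font encodings, the y coordinate and every other operator.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class TjFragments
{
    // Matches either "<x> <y> Td" or "(<text>) Tj"; everything else is skipped.
    private static readonly Regex Op = new Regex(
        @"(?<x>-?\d+(?:\.\d+)?)\s+(?<y>-?\d+(?:\.\d+)?)\s+Td|\((?<text>[^()]*)\)\s*Tj");

    public static string ReadLeftToRight(string content)
    {
        double x = 0; // Td offsets are relative, so they accumulate
        var fragments = new List<KeyValuePair<double, string>>();
        foreach (Match m in Op.Matches(content))
        {
            if (m.Groups["text"].Success)
                fragments.Add(new KeyValuePair<double, string>(x, m.Groups["text"].Value));
            else
                x += double.Parse(m.Groups["x"].Value, CultureInfo.InvariantCulture);
        }
        // Reorder the fragments by horizontal position and join them.
        return string.Concat(fragments.OrderBy(f => f.Key).Select(f => f.Value));
    }
}
```

For the stream above this yields the fragments in reading order, joined without a space, because the space between the two words is only a gap in the positions and not a character in the stream. Inferring it would require the glyph widths of the font.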
Once you've found the text, you need to rewrite the stream. For inspiration, you can take a look at the OCG remover functionality in the itext-xtra package.
Long story short: if your PDFs are relatively simple, that is, the text can be easily detected in the different content streams (page content and Form XObject content), then it's simply a matter of rewriting those streams after some string manipulations.
I've made you a simple example named ReplaceStream that replaces "Hello World" with "HELLO WORLD" in a PDF.
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
PdfReader reader = new PdfReader(src);
PdfDictionary dict = reader.getPageN(1);
PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
if (object instanceof PRStream) {
PRStream stream = (PRStream)object;
byte[] data = PdfReader.getStreamBytes(stream);
stream.setData(new String(data).replace("Hello World", "HELLO WORLD").getBytes());
}
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
stamper.close();
reader.close();
}
Some caveats:
I check if object is a stream. It could also be an array of streams. In that case, you need to loop over that array.
I don't check if there are form XObjects defined for the page.
I assume that Hello World can be easily detected in the PDF Syntax.
...
In real life, PDFs are never that simple and the complexity of your project will increase dramatically with every special feature that is used in your documents.
The C# equivalent of the code by Bruno:
static void manipulatePdf(String src, String dest)
{
PdfReader reader = new PdfReader(src);
PdfDictionary dict = reader.GetPageN(1);
PdfObject pdfObject = dict.GetDirectObject(PdfName.CONTENTS);
if (pdfObject.IsStream()) {
PRStream stream = (PRStream)pdfObject;
byte[] data = PdfReader.GetStreamBytes(stream);
stream.SetData(System.Text.Encoding.ASCII.GetBytes(System.Text.Encoding.ASCII.GetString(data).Replace("Hello World", "HELLO WORLD")));
}
FileStream outStream = new FileStream(dest, FileMode.Create);
PdfStamper stamper = new PdfStamper(reader, outStream);
// Close the stamper so the modified document is actually written to dest
stamper.Close();
reader.Close();
}
I'll update this if it would turn out to still contain errors.
As a follow-up to my previous C# code and Bruno's remark that GetDirectObject(PdfName.CONTENTS) might return an array rather than a stream: in my particular case, this turned out to be true.
The PdfObject returned "true" for IsArray(). I checked, and the array elements were all PdfIndirectReference.
A further look at the API yielded two useful bits of info:
PdfIndirectReference had a "Number" property, leading you to another PdfObject.
You can get to the referenced object using reader.GetPdfObject(int ref), where ref is the "Number" property of the IndirectReferenceObject
From there on out, you get a new PdfObject that you can check using IsStream() and modify as per the previously posted code.
So it works out to this (mind you, this is quick and dirty, but it works for my particular purposes...):
// Get the contents of my page...
PdfObject pdfObject = pageDict.GetDirectObject(PdfName.CONTENTS);
// Check that this is, in fact, an array or something else...
if (pdfObject.IsArray())
{
PdfArray streamArray = pageDict.GetAsArray(PdfName.CONTENTS);
for (int j = 0; j < streamArray.Size; j++)
{
PdfIndirectReference arrayEl = (PdfIndirectReference)streamArray[j];
PdfObject refdObj = reader.GetPdfObject(arrayEl.Number);
if (refdObj.IsStream())
{
PRStream stream = (PRStream)refdObj;
byte[] data = PdfReader.GetStreamBytes(stream);
stream.SetData(System.Text.Encoding.ASCII.GetBytes(System.Text.Encoding.ASCII.GetString(data).Replace(targetedText, newText)));
}
}
}

ITextSharp/Pdftk: place Base64 Image from Web on PDF as Pseudo-Signature

I am trying to work out a way to place a base64 image onto an already rendered PDF in iText. The goal is to save the PDF to disk and then reopen it to apply the "signature" in the right spot.
I haven't had any success with finding other examples online so I'm asking Stack.
My app uses .net c#.
Any advice on how to get started?
As @mkl mentioned, the question is confusing, especially the title: usually base64 and signature do not go together. I'm guessing you want to place a base64 image from the web on the PDF as a pseudo-signature?
A quick working example to get you started:
static void Main(string[] args)
{
string currentDir = AppDomain.CurrentDomain.BaseDirectory;
// 'INPUT' => already rendered pdf in iText
PdfReader reader = new PdfReader(INPUT);
string outputFile = Path.Combine(currentDir, OUTPUT);
using (var stream = new FileStream(outputFile, FileMode.Create))
{
using (PdfStamper stamper = new PdfStamper(reader, stream))
{
AcroFields form = stamper.AcroFields;
var fldPosition = form.GetFieldPositions("lname")[0];
Rectangle rectangle = fldPosition.position;
string base64Image = "";
Regex regex = new Regex(@"^data:image/(?<mediaType>[^;]+);base64,(?<data>.*)");
Match match = regex.Match(base64Image);
Image image = Image.GetInstance(
Convert.FromBase64String(match.Groups["data"].Value)
);
// best fit if image bigger than form field
if (image.Height > rectangle.Height || image.Width > rectangle.Width)
{
image.ScaleAbsolute(rectangle);
}
// form field top left - change parameters as needed to set different position
image.SetAbsolutePosition(rectangle.Left + 2, rectangle.Top - 2);
stamper.GetOverContent(fldPosition.page).AddImage(image);
}
}
}
If you're not working with a PDF form template (AcroFields in the code snippet), explicitly set the absolute position and scale the image as needed.
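The data URI handling in the snippet can be checked on its own before any PDF work is done. Below is a small sketch using the same regular expression; the DataUri class and the TryDecode method are hypothetical names, and the sample value in the usage note is just the 8-byte PNG signature, not a complete image.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DataUri
{
    private static readonly Regex Pattern =
        new Regex(@"^data:image/(?<mediaType>[^;]+);base64,(?<data>.*)$", RegexOptions.Singleline);

    // Returns false instead of throwing when the input is empty, not a data URI,
    // or does not contain valid base64.
    public static bool TryDecode(string dataUri, out string mediaType, out byte[] bytes)
    {
        mediaType = null;
        bytes = null;
        Match match = Pattern.Match(dataUri ?? string.Empty);
        if (!match.Success) return false;

        byte[] decoded;
        try
        {
            decoded = Convert.FromBase64String(match.Groups["data"].Value);
        }
        catch (FormatException)
        {
            return false;
        }
        if (decoded.Length == 0) return false;

        mediaType = match.Groups["mediaType"].Value;
        bytes = decoded;
        return true;
    }
}
```

For example, TryDecode("data:image/png;base64,iVBORw0KGgo=", out type, out data) succeeds with type "png" and 8 bytes of data, while an empty string returns false. The decoded bytes are what would be handed to Image.GetInstance in the answer above.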

Remove outer print marks on PDF iTextSharp

I have a pdf file with a cover that looks like the following:
Now, I need to remove the so-called 'galley marks' around the edges of the cover. I am using iTextSharp with C# and I need code using iTextSharp to create a new document with only the intended cover or use PdfStamper to remove that. Or any other solution using iTextSharp that would deliver the results.
I have been unable to find any good code samples in my search to this point.
Do you have to actually remove them, or can you just crop them out? If you can just crop them out, then the code below will work. If you have to actually remove them from the file, then to the best of my knowledge there isn't a simple way to do that, because those objects aren't explicitly marked as meta-objects. The only way I can think of to remove them would be to inspect everything and see if it fits into the document's active area.
Below is sample code that reads each page in the input file and finds the various boxes that might exist, trim, art and bleed. (See this page.)
As long as it finds at least one it sets the page's crop box to the first item in the list. In your case you might actually have to perform some logic to find the "smallest" of all of those items or you might be able to just know that "art" will always work for you. See the code for additional comments. This targets iTextSharp 5.4.0.0.
//Sample input file
var inputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Binder1.pdf");
//Sample output file
var outputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Cropped.pdf");
//Bind a reader to our input file
using (var r = new PdfReader(inputFile)) {
//Get the number of pages
var pageCount = r.NumberOfPages;
//See this for a list: http://api.itextpdf.com/itext/com/itextpdf/text/pdf/PdfReader.html#getBoxSize(int, java.lang.String)
var boxNames = new string[] { "trim", "art", "bleed" };
//We'll create a list of all possible boxes to pick from later
List<iTextSharp.text.Rectangle> boxes;
//Loop through each page
for (var i = 1; i <= pageCount; i++) {
//Initialize our list for this page
boxes = new List<iTextSharp.text.Rectangle>();
//Loop through the list of known boxes
for (var j = 0; j < boxNames.Length; j++) {
//If the box exists
if(r.GetBoxSize(i, boxNames[j]) != null){
//Add it to our collection
boxes.Add(r.GetBoxSize(i, boxNames[j]));
}
}
//If we found at least one box
if (boxes.Count > 0) {
//Get the page's entire dictionary
var dict = r.GetPageN(i);
//At this point we might want to apply some logic to find the "inner most" box if our trim/bleed/art aren't all the same
//I'm just hard-coding the first item in the list for demonstration purposes
//Set the page's crop box to the specified box
dict.Put(PdfName.CROPBOX, new PdfRectangle(boxes[0]));
}
}
//Create our output file
using (var fs = new FileStream(outputFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
//Bind a stamper to our reader and output file
using(var stamper = new PdfStamper(r,fs)){
//We did all of our PDF manipulation above so we don't actually have to do anything here
}
}
}
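The "find the smallest box" logic mentioned above could look roughly like the following. This is only a sketch: Box is a hypothetical stand-in for iTextSharp.text.Rectangle so that the example runs without the library, and "smallest" is taken to mean smallest area, which is one possible reading.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for iTextSharp.text.Rectangle (lower-left and upper-right corners).
public struct Box
{
    public readonly float Left, Bottom, Right, Top;

    public Box(float left, float bottom, float right, float top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }

    public float Area
    {
        get { return Math.Max(0f, Right - Left) * Math.Max(0f, Top - Bottom); }
    }
}

public static class BoxPicker
{
    // Returns the box with the smallest area, or null when no boxes were found.
    public static Box? Smallest(IEnumerable<Box> boxes)
    {
        Box? best = null;
        foreach (Box b in boxes)
        {
            if (best == null || b.Area < best.Value.Area)
                best = b;
        }
        return best;
    }
}
```

In the loop above, the result of such a selection would replace boxes[0] when setting the crop box.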

ItextSharp Expected a Dict Object When trying to print

I have a web page that allows a user to view a PDF and print it. The print PDF is a copy of the display PDF, and I am using iTextSharp to inject the JavaScript that enables auto printing. I have a method that allows a user to upload a PDF, and it calls the code below to copy the display copy into the print PDF. Both PDFs are then saved in the database. However, when a user clicks the print button on my web page, they receive the following error: "expected a dict object". Below is my code that adds the auto print, which works fine for me but not on my client's site.
Am I doing anything wrong that could be corrupting the file? The original PDF content is passed in as a binary object.
Any help on this is much appreciated, as I am highly confused by this one. I am using ASP.NET MVC2.
MemoryStream originalPdf = new MemoryStream(Content.BinaryData);
MemoryStream updatedPdf = new MemoryStream();
updatedPdf.Write(Content.BinaryData,0, Content.BinaryData.Length);
PdfReader pdfReader = new PdfReader(originalPdf);
PdfStamper pdfStamper = new PdfStamper(pdfReader, updatedPdf);
if (autoPrinting)
{
pdfStamper.JavaScript = "this.print(true);\r";
}
else
{
pdfStamper.JavaScript = null;
}
pdfStamper.Close();
pdfReader.Close();
Content.BinaryData = updatedPdf.ToArray();
Don't write the original PDF to your output. pdfStamper.Close() will do all the writing for you, even in append mode (which you're not using).
Your code should read:
MemoryStream originalPdf = new MemoryStream(Content.BinaryData);
MemoryStream updatedPdf = new MemoryStream();
// Don't do that.
//updatedPdf.Write(Content.BinaryData,0, Content.BinaryData.Length);
PdfReader pdfReader = new PdfReader(originalPdf);
PdfStamper pdfStamper = new PdfStamper(pdfReader, updatedPdf);
if (autoPrinting) {
pdfStamper.JavaScript = "this.print(true);\r";
} else {
pdfStamper.JavaScript = null;
}
pdfStamper.Close(); // this does it for you.
pdfReader.Close();
Content.BinaryData = updatedPdf.ToArray();
I'm surprised that this "works for you". If nothing else, I'd expect the JS to fail because the byte offsets would be all wrong... in fact, all your offsets would be all wrong. I think my ignorance of C# is showing.
But Write() behaves the way I thought it would, so I'm back to being surprised.
