Remove outer print marks on PDF iTextSharp - c#

I have a pdf file with a cover that looks like the following:
Now, I need to remove the so-called 'galley marks' around the edges of the cover. I am using iTextSharp with C# and I need code using iTextSharp to create a new document with only the intended cover or use PdfStamper to remove that. Or any other solution using iTextSharp that would deliver the results.
I have been unable to find any good code samples in my search to this point.

Do you have to actually remove them, or can you just crop them out? If you can just crop them out, the code below will work. If you have to actually remove them from the file, then to the best of my knowledge there isn't a simple way to do that, because those objects aren't explicitly marked as meta-objects. The only way I can think of to remove them would be to inspect every object and check whether it falls inside the document's active area.
Below is sample code that reads each page in the input file and finds the various boxes that might exist, trim, art and bleed. (See this page.)
As long as it finds at least one it sets the page's crop box to the first item in the list. In your case you might actually have to perform some logic to find the "smallest" of all of those items or you might be able to just know that "art" will always work for you. See the code for additional comments. This targets iTextSharp 5.4.0.0.
//Sample input file
var inputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Binder1.pdf");
//Sample output file
var outputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Cropped.pdf");
//Bind a reader to our input file
using (var r = new PdfReader(inputFile)) {
    //Get the number of pages
    var pageCount = r.NumberOfPages;
    //See this for a list: http://api.itextpdf.com/itext/com/itextpdf/text/pdf/PdfReader.html#getBoxSize(int, java.lang.String)
    var boxNames = new string[] { "trim", "art", "bleed" };
    //We'll create a list of all possible boxes to pick from later
    List<iTextSharp.text.Rectangle> boxes;
    //Loop through each page
    for (var i = 1; i <= pageCount; i++) {
        //Initialize our list for this page
        boxes = new List<iTextSharp.text.Rectangle>();
        //Loop through the list of known boxes
        for (var j = 0; j < boxNames.Length; j++) {
            //If the box exists
            if(r.GetBoxSize(i, boxNames[j]) != null){
                //Add it to our collection
                boxes.Add(r.GetBoxSize(i, boxNames[j]));
            }
        }
        //If we found at least one box
        if (boxes.Count > 0) {
            //Get the page's entire dictionary
            var dict = r.GetPageN(i);
            //At this point we might want to apply some logic to find the "inner most" box if our trim/bleed/art aren't all the same
            //I'm just hard-coding the first item in the list for demonstration purposes
            //Set the page's crop box to the specified box
            dict.Put(PdfName.CROPBOX, new PdfRectangle(boxes[0]));
        }
    }
    //Create our output file
    using (var fs = new FileStream(outputFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
        //Bind a stamper to our reader and output file
        using(var stamper = new PdfStamper(r,fs)){
            //We did all of our PDF manipulation above so we don't actually have to do anything here
        }
    }
}
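The "inner most" box logic that the comments above leave open could be sketched roughly as follows. This is only an illustration: Box is a hypothetical stand-in for iTextSharp.text.Rectangle so that the snippet compiles without the library, and "innermost" is assumed here to mean the overlap of all boxes found on the page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for iTextSharp.text.Rectangle:
// lower-left and upper-right corners, in points.
public struct Box
{
    public float Left, Bottom, Right, Top;

    public Box(float left, float bottom, float right, float top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }
}

public static class BoxPicker
{
    // Returns the overlap of all boxes, i.e. the largest rectangle
    // that is contained in every one of them.
    public static Box Innermost(IList<Box> boxes)
    {
        if (boxes == null || boxes.Count == 0)
            throw new ArgumentException("At least one box is required.", "boxes");

        Box result = boxes[0];
        for (int i = 1; i < boxes.Count; i++)
        {
            // Push each edge inwards whenever another box is tighter on that side.
            result.Left = Math.Max(result.Left, boxes[i].Left);
            result.Bottom = Math.Max(result.Bottom, boxes[i].Bottom);
            result.Right = Math.Min(result.Right, boxes[i].Right);
            result.Top = Math.Min(result.Top, boxes[i].Top);
        }
        return result;
    }
}
```

In the sample above, the four edges would be read from each rectangle in the boxes list and the result used for the crop box in place of boxes[0]. If the boxes on a page do not overlap at all, the result is not a valid rectangle (Left > Right or Bottom > Top), so it is worth checking for that before writing it.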

Related

parse PDF with iTextSharp and then extract specific text to the screen

I am trying to extract certain content from a PDF file. It is an invoice, and I want to search it for "Invoice Number:" and then "First Name" and print the matching values with Console.WriteLine().
This is what I have so far, and I need to figure out how to go further.
using iTextSharp.text.pdf;
using System.IO;
using iTextSharp.text.pdf.parser;
using System;
namespace PdfProperties
{
    class Program
    {
        static void Main(string[] args)
        {
            PdfReader reader = new PdfReader("C:/PDF/invoiceDetail.pdf");
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            FileStream fs = new FileStream("C:/PDF/result0.txt", FileMode.Create);
            StreamWriter sw = new StreamWriter(fs);
            SimpleTextExtractionStrategy strategy;
            string text = "";
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                strategy = parser.ProcessContent(i, new SimpleTextExtractionStrategy());
                sw.WriteLine(strategy.GetResultantText());
                text = strategy.GetResultantText();
                String[] splitText = text.Split(new char[] {'.' });
                Console.WriteLine("Test");
                Console.WriteLine(text);
            }
            sw.Flush();
            sw.Close();
        }
    }
}
Any help would be greatly appreciated
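Hi,
you could try this:
String[] splitText = text.Split(new char[] { '.' });
for (int i = 0; i < splitText.Length; i++)
{
    // A fragment can contain a label plus its value, so test with Contains rather than ==
    if (splitText[i].Contains("Invoice Number:"))
    {
        // we have the Invoice Number
    }
    // Checked separately: the same fragment may or may not also hold the first name
    if (splitText[i].Contains("First Name"))
    {
        // now we also have the First Name
    }
}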
There are 2 ways of going about this:
1. You can try to process the invoice yourself. That means handling structure and dealing with edge cases. What if the content isn't always aligned in the same way? What if the template of the invoice changes? What if some text in the invoice is variable and you can't really rely on the precise text being extracted?
This is, in short, not a trivial problem to solve.
2. Use pdf2Data. It was specifically designed to handle documents that are rich in structure, like invoices. It uses a concept called "selectors" that allow you to define where you expect certain content to be, either by position (somewhere in the rectangle defined by coordinates ..) or by structural blocks (row .. from this table) etc.
Even though the add-on is closed source, you can always try it out by using a trial license. After evaluating pdf2Data, you can at least make a more informed decision about which route you're willing to take to tackle this problem.
Check out itextpdf.com/itext7/pdf2Data for more information.
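For the first route, and only for invoices simple enough that each label and its value end up on the same extracted line, a regular expression over the extracted text may be enough. The sketch below is not part of iTextSharp; InvoiceFields and ValueAfter are hypothetical names, and the assumption that the value follows the label on the same line will not hold for every template.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InvoiceFields
{
    // Returns the text that follows "label" on the same line,
    // or null if the label does not occur in the text.
    public static string ValueAfter(string text, string label)
    {
        // Regex.Escape keeps characters such as '.' or '(' in the label literal.
        // An optional colon and surrounding blanks are skipped before the value.
        var pattern = Regex.Escape(label) + @"[ \t]*:?[ \t]*(?<value>[^\r\n]*)";
        var match = Regex.Match(text, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups["value"].Value.Trim() : null;
    }
}
```

With the text returned by GetResultantText(), a call such as InvoiceFields.ValueAfter(text, "Invoice Number:") would give the rest of that line. The edge cases listed above (changed templates, values on the next line, variable wording) are exactly where this kind of matching breaks down.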

Copy a visio page to a new document

What I want to accomplish:
I want to copy the active page in my Visio application to a new document and save it (and make it a byte[] for the db), I am already doing this but in a slightly "wrong" way as there is too much interaction with the Visio application.
Method to copy page to byte array:
private static byte[] VisioPageToBytes()
{
    //Make a new invisible app to dump the shapes in
    var app = new InvisibleApp();
    Page page = MainForm.IVisioApplication.ActivePage;
    app.AlertResponse = 2;
    //Select all shapes and copy, then deselect
    MainForm.IVisioApplication.ActiveWindow.SelectAll();
    MainForm.IVisioApplication.ActiveWindow.Selection.Copy();
    MainForm.IVisioApplication.ActiveWindow.DeselectAll();
    //Add empty document to invisible app and dump shapes
    app.Documents.Add( string.Empty );
    app.ActivePage.Paste();
    //Save document and convert to byte[]
    app.ActiveDocument.SaveAs( Application.UserAppDataPath + @"/LastStored.vsd" );
    app.ActiveDocument.Close();
    app.Quit();
    app.AlertResponse = 0;
    var bytes = File.ReadAllBytes( Application.UserAppDataPath + @"/LastStored.vsd" );
    Clipboard.Clear();
    return bytes;
}
Why it's wrong:
This code makes selections in the Visio page and has to open an invisible window to store the page. I'm looking for a way with less interaction with the Visio application (as it's unstable). Opening the second (invisible) Visio application occasionally makes my main Visio application crash.
I would like to do something like:
Page page = MainForm.IVisioApplication.ActivePage;
Document doc;
doc.Pages.Add( page ); //Pages.Add has no parameters so this doesn't work
doc.SaveAs(Application.UserAppDataPath + @"/LastStored.vsd");
If this is not possible in a way with less interaction (by "building" the document), please comment to let me know.
TL;DR:
I want to make a new Visio document without opening Visio and copy (the content of) 1 page to it.
If you want to create a copy of a page, then you might find the Duplicate method on Page handy, but by the sounds of it, just saving the existing doc under a new name should work:
void Main()
{
    var vApp = MyExtensions.GetRunningVisio();
    var sourcePage = vApp.ActivePage;
    var sourcePageNameU = sourcePage.NameU;
    var vDoc = sourcePage.Document;
    vDoc.Save(); //to retain original
    var origFileName = vDoc.FullName;
    var newFileName = Path.Combine(vDoc.Path, $"LastStored{Path.GetExtension(origFileName)}");
    vDoc.SaveAs(newFileName);
    //Remove all other pages
    for (short i = vDoc.Pages.Count; i > 0; i--)
    {
        if (vDoc.Pages[i].NameU != sourcePageNameU)
        {
            vDoc.Pages[i].Delete(0);
        }
    }
    //Save single page state
    vDoc.Save();
    //Close copy and reopen original
    vDoc.Close();
    vDoc = vApp.Documents.Open(origFileName);
}
GetRunningVisio is my extension method for using with LinqPad:
http://visualsignals.typepad.co.uk/vislog/2015/12/getting-started-with-c-in-linqpad-with-visio.html
...but you've already got a reference to your app so you can use that instead.
Update based on comments:
Ok, so how about this modification of your original code? Note that I'm creating a new Selection object from the page but not changing the Window one, so this shouldn't interfere with what the user sees or change the source doc at all.
void Main()
{
    var vApp = MyExtensions.GetRunningVisio();
    var sourcePage = vApp.ActivePage;
    var sourceDoc = sourcePage.Document;
    var vSel = sourcePage.CreateSelection(Visio.VisSelectionTypes.visSelTypeAll);
    vSel.Copy(Visio.VisCutCopyPasteCodes.visCopyPasteNoTranslate);
    var copyDoc = vApp.Documents.AddEx(string.Empty,
        Visio.VisMeasurementSystem.visMSDefault,
        (int)Visio.VisOpenSaveArgs.visAddHidden);
    copyDoc.Pages[1].Paste(Visio.VisCutCopyPasteCodes.visCopyPasteNoTranslate);
    var origFileName = sourceDoc.FullName;
    var newFileName = Path.Combine(sourceDoc.Path, $"LastStored{Path.GetExtension(origFileName)}");
    copyDoc.SaveAs(newFileName);
    copyDoc.Close();
}
Note that this will only create a default page so you might want to include copying over page cells such as PageWidth, PageHeight, PageScale and DrawingScale etc. prior to pasting.

Editing a pdf document and saving it using ItextSharp writes an empty pdf document [duplicate]

I can extract text from pages in a PDF in many ways:
String pageText = PdfTextExtractor.GetTextFromPage(reader, i);
This can be used to get any text on a page.
Alternatively:
byte[] contentBytes = iTextSharp.text.pdf.parser.ContentByteUtils.GetContentBytesForPage(reader, i);
Possibilities are endless.
Now I want to remove/redact a certain word, e.g. explicit words, sensitive information (putting black boxes over them obviously is a bad idea :) or whatever from the PDF (which is simple and text only). I can find that word just fine using the approach above. I can count its occurrences etc...
I do not care about layout, or the fact that PDF is not really meant to be manipulated in this way.
I just wish to know if there is a mechanism that would allow me to manipulate the raw content of my PDF in this way. You could say I'm looking for "SetContentBytesForPage()" ...
If you want to change the content of a page, it isn't sufficient to change the content stream of a page. A page may contain references to Form XObjects that contain content that you want to remove.
A secondary problem consists of images. For instance: suppose that your document consists of a scanned document that has been OCR'ed. In that case, it isn't sufficient to remove the (vector) text, you'll also need to manipulate the (pixel) text in the image.
Assuming that your secondary problem doesn't exist, you'll need a double approach:
1. Get the content from the page as text to detect on which pages there are names or words you want to remove.
2. Recursively loop over all the content streams to find that text and rewrite those content streams without that text.
From your question, I assume that you have already solved problem 1. Solving problem 2 isn't that trivial. In chapter 15 of my book, I have an example where extracting text returns "Hello World", but when you look inside the content stream, you see:
BT
/F1 12 Tf
88.66 367 Td
(ld) Tj
-22 0 Td
(Wor) Tj
-15.33 0 Td
(llo) Tj
-15.33 0 Td
(He) Tj
ET
Before you can remove "Hello World" from this stream snippet, you'll need some heuristics so that your program recognizes the text in this syntax.
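As a rough illustration of what such a heuristic involves, the sketch below puts the chunks of the snippet above back into visual order by accumulating the Td offsets. It is a deliberately simplified assumption-laden example, not how iText itself parses content: it only understands Td and Tj, only unescaped literal strings, and only a single text line. TextChunkSorter is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class TextChunkSorter
{
    // Matches either "x y Td" (move the text position) or "(string) Tj" (show a string).
    private static readonly Regex Operator = new Regex(
        @"(?<x>-?\d+(\.\d+)?)\s+(?<y>-?\d+(\.\d+)?)\s+Td|\((?<s>[^()\\]*)\)\s*Tj");

    // Returns the shown strings concatenated from left to right.
    public static string ReadInVisualOrder(string contentStream)
    {
        double x = 0;
        var chunks = new List<KeyValuePair<double, string>>();
        foreach (Match m in Operator.Matches(contentStream))
        {
            if (m.Groups["s"].Success)
            {
                // Remember where this string starts horizontally.
                chunks.Add(new KeyValuePair<double, string>(x, m.Groups["s"].Value));
            }
            else
            {
                // Td offsets are relative to the previous position, so accumulate them.
                x += double.Parse(m.Groups["x"].Value, CultureInfo.InvariantCulture);
            }
        }
        // Sort by horizontal position (OrderBy is a stable sort) and join the pieces.
        return string.Concat(chunks.OrderBy(c => c.Key).Select(c => c.Value));
    }
}
```

For the snippet above this yields "HelloWorld" without a space: the stream never contains a space character, so a real extractor additionally has to infer one from the gap between "llo" and "Wor". Fonts with custom encodings, TJ arrays, and text matrices make the general case considerably harder.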
Once you've found the text, you need to rewrite the stream. For inspiration, you can take a look at the OCG remover functionality in the itext-xtra package.
Long story short: if your PDFs are relatively simple, that is: the text can be easily detected in the different content stream (page content and Form XObject content), then it's simply a matter of rewriting those streams after some string manipulations.
I've made you a simple example named ReplaceStream that replaces "Hello World" with "HELLO WORLD" in a PDF.
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.getPageN(1);
    PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
    if (object instanceof PRStream) {
        PRStream stream = (PRStream)object;
        byte[] data = PdfReader.getStreamBytes(stream);
        stream.setData(new String(data).replace("Hello World", "HELLO WORLD").getBytes());
    }
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();
}
Some caveats:
I check if object is a stream. It could also be an array of streams. In that case, you need to loop over that array.
I don't check if there are form XObjects defined for the page.
I assume that Hello World can be easily detected in the PDF Syntax.
...
In real life, PDFs are never that simple and the complexity of your project will increase dramatically with every special feature that is used in your documents.
The C# equivalent of the code by Bruno:
static void manipulatePdf(String src, String dest)
{
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.GetPageN(1);
    PdfObject pdfObject = dict.GetDirectObject(PdfName.CONTENTS);
    if (pdfObject.IsStream()) {
        PRStream stream = (PRStream)pdfObject;
        byte[] data = PdfReader.GetStreamBytes(stream);
        stream.SetData(System.Text.Encoding.ASCII.GetBytes(System.Text.Encoding.ASCII.GetString(data).Replace("Hello World", "HELLO WORLD")));
    }
    FileStream outStream = new FileStream(dest, FileMode.Create);
    PdfStamper stamper = new PdfStamper(reader, outStream);
    // As in the Java version: the stamper must be closed, otherwise nothing is written to the output
    stamper.Close();
    reader.Close();
}
I'll update this if it would turn out to still contain errors.
In follow-up to my previous C# code and the remark by Bruno that GetDirectObject(PdfName.CONTENTS) might as well return an array as opposed to a stream: In my particular case, this turned out to be true.
The returned PdfObject reported "true" for IsArray(). I checked, and the array elements were all PdfIndirectReference.
A further look at the API yielded two useful bits of info:
PdfIndirectReference had a "Number" property, leading you to another PdfObject.
You can get to the referenced object using reader.GetPdfObject(int ref), where ref is the "Number" property of the IndirectReferenceObject
From there on out, you get a new PdfObject that you can check using IsStream() and modify as per the previously posted code.
So it works out to this (mind you, this is quick and dirty, but it works for my particular purposes...):
// Get the contents of my page...
PdfObject pdfObject = pageDict.GetDirectObject(PdfName.CONTENTS);
// Check that this is, in fact, an array or something else...
if (pdfObject.IsArray())
{
    PdfArray streamArray = pageDict.GetAsArray(PdfName.CONTENTS);
    for (int j = 0; j < streamArray.Size; j++)
    {
        PdfIndirectReference arrayEl = (PdfIndirectReference)streamArray[j];
        PdfObject refdObj = reader.GetPdfObject(arrayEl.Number);
        if (refdObj.IsStream())
        {
            PRStream stream = (PRStream)refdObj;
            byte[] data = PdfReader.GetStreamBytes(stream);
            stream.SetData(System.Text.Encoding.ASCII.GetBytes(System.Text.Encoding.ASCII.GetString(data).Replace(targetedText, newText)));
        }
    }
}

c# Novacode.Picture to System.Drawing.Image

I'm reading in a .docx file using the Novacode API, and am unable to create or display any images within the file to a WinForm app due to not being able to convert from a Novacode Picture (pic) or Image to a system image. I've noticed that there's very little info inside the pic itself, with no way to get any pixel data that I can see. So I have been unable to utilize any of the usual conversion ideas.
I've also looked up how Word saves images inside the files as well as Novacode source for any hints and I've come up with nothing.
My question, then: is there a way to convert a Novacode Picture to a System.Drawing.Image, or should I use something different, like OpenXML, to gather the image data? If so, would Novacode and OpenXML conflict in any way?
There's also this answer that might be another place to start.
Any help is much appreciated.
Okay. This is what I ended up doing. Thanks to gattsbr for the advice. This only works if you can grab all the images in order, and have descending names for all the images.
using System.IO.Compression; // Had to add an assembly for this
using Novacode;
// Have to specify to remove ambiguous error from Novacode
Dictionary<string, System.Drawing.Image> images = new Dictionary<string, System.Drawing.Image>();
void LoadTree()
{
    // In case of previous exception
    if(File.Exists("Images.zip")) { File.Delete("Images.zip"); }
    // Allow the file to be open while parsing
    using(FileStream stream = File.Open("Images.docx", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        using(DocX doc = DocX.Load(stream))
        {
            // Work rest of document
            // Still parse here to get the names of the images
            // Might have to drag and drop images into the file, rather than insert through Word
            foreach(Picture pic in doc.Pictures)
            {
                string name = pic.Description;
                if(null == name) { continue; }
                name = name.Substring(name.LastIndexOf("\\") + 1);
                name = name.Substring(0, name.Length - 4);
                images[name] = null;
            }
            // Save while still open
            doc.SaveAs("Images.zip");
        }
    }
    // Use temp zip directory to extract images
    using(ZipArchive zip = ZipFile.OpenRead("Images.zip"))
    {
        // Gather all image names, in order
        // They're retrieved from the bottom up, so reverse
        string[] keys = images.Keys.OrderByDescending(o => o).Reverse().ToArray();
        for(int i = 1; ; i++)
        {
            // Also had to add an assembly for ZipArchiveEntry
            ZipArchiveEntry entry = zip.GetEntry(String.Format("word/media/image{0}.png", i));
            if(null == entry) { break; }
            Stream stream = entry.Open();
            images[keys[i - 1]] = new Bitmap(stream);
        }
    }
    // Remove temp directory
    File.Delete("Images.zip");
}

iTextSharp is giving me the error: "PDF header signature not found"

Everything I have read about this error says the file must be missing a "%PDF-1.4" or something similar at the top; however, my file includes it. I am not an expert in PDF formatting, but I did double check that I don't have multiple %%EOF or trailer tags, so now I'm at a loss as to what is causing my PDF header signature to be bad. Here is a link to the file if you would like to look at it: Poorly formatted PDF
Here is what I'm doing. I am getting each page of the PDF in the form of a MemoryStream, so I have to append each page to the end of the previous pages. In order to do this, I am using iTextSharp's PdfCopy class. Here is the code I am using:
/// <summary>
/// Takes two PDF streams and appends the second onto the first.
/// </summary>
/// <param name="firstPdf">The PDF to which the other document will be appended.</param>
/// <param name="secondPdf">The PDF to append.</param>
/// <returns>A new stream with the second PDF appended to the first.</returns>
public Stream ConcatenatePdfs(Stream firstPdf, Stream secondPdf)
{
    // If either PDF is null, then return the other one
    if (firstPdf == null) return secondPdf;
    if (secondPdf == null) return firstPdf;
    var destStream = new MemoryStream();
    // Set the PDF copier up.
    using (var document = new Document())
    {
        using (var copy = new PdfCopy(document, destStream))
        {
            document.Open();
            copy.CloseStream = false;
            // Copy the first document
            using (var reader = new PdfReader(firstPdf))
            {
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    copy.AddPage(copy.GetImportedPage(reader, i));
                }
            }
            // Copy the second document
            using (var reader = new PdfReader(secondPdf))
            {
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    copy.AddPage(copy.GetImportedPage(reader, i));
                }
            }
        }
    }
    return destStream;
}
Every time I receive a new PDF page, I pass the previously concatenated pages (firstPdf) along with the new page (secondPdf) to this function. For the first page, I don't have any previously concatenated pages, so firstPdf is null, thereby resulting in secondPdf being returned as the result. The second time I go through, the first page is passed in as firstPdf and the new second page is passed in as secondPdf. The concatenation works just fine and the results are actually in the First.pdf file linked above.
The problem is when I go to add a third page. I am using the output of the previous pass (the first two pages) as the input for the third pass, along with a new PDF stream. The exception occurs when I try to initialize the PdfReader with the previously concatenated pages.
What I find particularly interesting is that it fails to read its own output. I feel like I must be doing something wrong, but I can neither figure out how to avoid the problem, nor why there is a problem with the header; it looks perfectly normal to me. If someone could show me either what I'm doing wrong with my code or at least what is wrong with the PDF file, I would really appreciate it.
(comment to answer)
I strongly recommend not passing the raw streams themselves around and instead passing around a byte array by calling .ToArray() on your MemoryStream. iTextSharp assumes that it has a dedicated empty stream for writing to, since it can't edit existing files "in-place". Although streams essentially map to bytes, they also have inherent properties like Open and Closed and Position that can mess things up.
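The Position issue can be seen without iTextSharp at all. The sketch below (StreamPositionDemo is a hypothetical name) writes some bytes to a MemoryStream and then compares what a reader would see when handed the stream as-is with what it would see from a ToArray() copy. It is a plausible explanation for the "header signature not found" error in the question, not a confirmed diagnosis.

```csharp
using System;
using System.IO;

public static class StreamPositionDemo
{
    // Writes the payload, then reports how many bytes are visible
    // (a) when reading straight from the stream and (b) from a ToArray() copy.
    public static int[] BytesVisibleAfterWrite(byte[] payload)
    {
        using (var stream = new MemoryStream())
        {
            stream.Write(payload, 0, payload.Length);

            // Position now sits at the end, so a read from here returns nothing.
            var buffer = new byte[payload.Length];
            int readDirectly = stream.Read(buffer, 0, buffer.Length);

            // ToArray() copies the whole contents regardless of Position.
            int readFromCopy = stream.ToArray().Length;

            return new[] { readDirectly, readFromCopy };
        }
    }
}
```

Applied to the question, this would mean returning destStream.ToArray() from ConcatenatePdfs and constructing each PdfReader from a byte[] rather than from the stream that was just written to.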
