Everything I have read about this error says the file must be missing a "%PDF-1.4" or something similar at the top; however, my file includes it. I am not an expert in PDF formatting, but I did double check that I don't have multiple %%EOF or trailer tags, so now I'm at a loss as to what is causing my PDF header signature to be bad. Here is a link to the file if you would like to look at it: Poorly formatted PDF
Here is what I'm doing. I am getting each page of the PDF in the form of a MemoryStream, so I have to append each page to the end of the previous pages. In order to do this, I am using iTextSharp's PdfCopy class. Here is the code I am using:
/// <summary>
/// Takes two PDF streams and appends the second onto the first.
/// </summary>
/// <param name="firstPdf">The PDF to which the other document will be appended.</param>
/// <param name="secondPdf">The PDF to append.</param>
/// <returns>A new stream with the second PDF appended to the first.</returns>
public Stream ConcatenatePdfs(Stream firstPdf, Stream secondPdf)
{
    // If either PDF is null, then return the other one
    if (firstPdf == null) return secondPdf;
    if (secondPdf == null) return firstPdf;
    var destStream = new MemoryStream();
    // Set the PDF copier up.
    using (var document = new Document())
    {
        using (var copy = new PdfCopy(document, destStream))
        {
            document.Open();
            copy.CloseStream = false;
            // Copy the first document
            using (var reader = new PdfReader(firstPdf))
            {
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    copy.AddPage(copy.GetImportedPage(reader, i));
                }
            }
            // Copy the second document
            using (var reader = new PdfReader(secondPdf))
            {
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    copy.AddPage(copy.GetImportedPage(reader, i));
                }
            }
        }
    }
    return destStream;
}
Every time I receive a new PDF page, I pass the previously concatenated pages (firstPdf) along with the new page (secondPdf) to this function. For the first page, I don't have any previously concatenated pages, so firstPdf is null, thereby resulting in secondPdf being returned as the result. The second time I go through, the first page is passed in as firstPdf and the new second page is passed in as secondPdf. The concatenation works just fine and the results are actually in the First.pdf file linked above.
The problem is when I go to add a third page. I am using the output of the previous pass (the first two pages) as the input for the third pass, along with a new PDF stream. The exception occurs when I try to initialize the PdfReader with the previously concatenated pages.
What I find particularly interesting is that it fails to read its own output. I feel like I must be doing something wrong, but I can figure out neither how to avoid the problem nor why there is a problem with the header; it looks perfectly normal to me. If someone could show me either what I'm doing wrong with my code or at least what is wrong with the PDF file, I would really appreciate it.
(comment to answer)
I strongly recommend not passing the raw streams themselves around and instead passing around a byte array by calling .ToArray() on your MemoryStream. iTextSharp assumes that it has a dedicated empty stream for writing to, since it can't edit existing files "in-place". Although streams essentially map to bytes, they also have inherent properties like Open and Closed and Position that can mess things up.
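To illustrate the remark about Position, here is a minimal sketch that uses only System.IO and no iTextSharp. It shows that a MemoryStream that has just been written to sits at its end, so anything that reads from the current position sees no bytes, whereas ToArray() returns the full contents regardless of position. The class and method names are made up for illustration, and whether this is exactly what triggers the header error in the question is an assumption.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamPositionDemo
{
    // Hypothetical helper: writes the payload and returns the stream WITHOUT rewinding,
    // mirroring a method that returns its destination stream right after writing to it.
    public static MemoryStream WriteWithoutRewinding(byte[] payload)
    {
        var ms = new MemoryStream();
        ms.Write(payload, 0, payload.Length);
        return ms; // Position == Length at this point
    }

    // Reads up to 16 bytes starting at the stream's current position
    // and returns how many bytes were actually read.
    public static int BytesReadableFromCurrentPosition(Stream s)
    {
        var buffer = new byte[16];
        return s.Read(buffer, 0, buffer.Length);
    }

    public static void Main()
    {
        byte[] payload = Encoding.ASCII.GetBytes("%PDF-1.4");
        MemoryStream ms = WriteWithoutRewinding(payload);

        Console.WriteLine(BytesReadableFromCurrentPosition(ms)); // 0: nothing left to read
        Console.WriteLine(ms.ToArray().Length);                  // 8: ToArray ignores Position

        ms.Position = 0;                                         // rewinding also exposes the data
        Console.WriteLine(BytesReadableFromCurrentPosition(ms)); // 8
    }
}
```

Applied to the question, this suggests either returning a byte[] from the concatenation method and handing that to PdfReader, or resetting Position to 0 before the stream is reused.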
I have to replace the number "14-1" with "10-2". I am using the following iText code, but I get a type cast error. Can anyone help me modify the program to remove the casting issue?
I have many PDFs in which I have to replace the numbers at the same location. I would also like to understand the logic of how to do this:
using System;
using System.IO;
using System.Text;
using iTextSharp.text.io;
using iTextSharp.text.pdf;
using System.Windows.Forms;
namespace iText5
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }
        public const string src = @"D:\test1\A.pdf";
        public const string dest = @"D:\test1\ENV1.pdf";
        private void button1_Click(object sender, EventArgs e)
        {
            FileInfo file = new FileInfo(dest);
            file.Directory.Create();
            manipulatePdf(src, dest);
        }
        public void manipulatePdf(String src, String dest)
        {
            PdfReader reader = new PdfReader(src);
            PdfDictionary dict = reader.GetPageN(1);
            PdfObject obj = dict.GetDirectObject(PdfName.CONTENTS);
            PRStream stream = (PRStream)obj;
            byte[] data = PdfReader.GetStreamBytes(stream);
            string xyz = Encoding.UTF8.GetString(data);
            byte[] newBytes = Encoding.UTF8.GetBytes(xyz.Replace("14-1", "10-2"));
            stream.SetData(newBytes);
            PdfStamper stamper = new PdfStamper(reader, new FileStream(dest, FileMode.Create));
            stamper.Close();
            reader.Close();
        }
    }
}
This is a problem:
PdfDictionary dict = reader.GetPageN(1);
PdfObject obj = dict.GetDirectObject(PdfName.CONTENTS);
PRStream stream = (PRStream)obj;
First you get a page dictionary. That page dictionary has a /Contents entry. If you read the PDF standard (ISO 32000), then you see that the value of the /Contents entry can be either a stream, or an array. You assume that it's always a stream. In some cases, your code will work, but in cases where the value of the /Contents entry is an array of references to a series of streams, you will get a class cast error (for the obvious reason that an array of streams is not the same as a stream).
I think that you want to do something like this:
byte[] data = reader.GetPageContent(i);
string xyz = PdfEncodings.ConvertToString(data, PdfObject.TEXT_PDFDOCENCODING);
string abc = xyz.Replace("14-1", "10-2");
reader.SetPageContent(i, PdfEncodings.ConvertToBytes(abc, PdfObject.TEXT_PDFDOCENCODING));
However, that's a very bad idea, because of the reasons explained in the answers to these questions:
Replace the text in pdf document using itextSharp
PDF text replace not working
Is there any API in C# or .net to edit pdf documents?
Using ContentByteUtils for raw PDF manipulation
...
You are making the assumption that you will find a literal string with value "14-1" in the content. That might be true for simple PDF documents, but in many cases the appearance of "14-1" on a page (that you can read with your eyes) doesn't mean the string "14-1" is present as such in the content (that you extract with GetPageContent). That string could be part of an XObject, or the syntax to render "14-1" could be constructed in such a way that xyz.Replace("14-1", "10-2") won't change xyz in any way.
Bottom line: PDF is not a format for editing. A page in a PDF file consists of content that is added at absolute positions. The content on a page doesn't reflow if you change it (e.g. the existing content won't move to the next line or to the next page if you add extra content). Instead of editing a PDF document, you should edit the source that was used to create the document, and then create a new PDF from that source.
Important: you are using an old version of iText. We abandoned the name iTextSharp more than two years ago in favor of iText for .NET. The current version of iText is iText 7.1.2; see Nuget: https://www.nuget.org/packages/itext7/
Many people think that iText 5.5.13 is the latest version. That assumption is wrong. iText 5 has been discontinued and is no longer supported. The recent 5.5.x versions are maintenance releases for paying customers who can't migrate to iText 7 right away.
I can extract text from pages in a PDF in many ways:
String pageText = PdfTextExtractor.GetTextFromPage(reader, i);
This can be used to get any text on a page.
Alternatively:
byte[] contentBytes = iTextSharp.text.pdf.parser.ContentByteUtils.GetContentBytesForPage(reader, i);
Possibilities are endless.
Now I want to remove/redact a certain word, e.g. explicit words, sensitive information (putting black boxes over them obviously is a bad idea :) or whatever from the PDF (which is simple and text only). I can find that word just fine using the approach above. I can count its occurrences etc...
I do not care about layout, or the fact that PDF is not really meant to be manipulated in this way.
I just wish to know if there is a mechanism that would allow me to manipulate the raw content of my PDF in this way. You could say I'm looking for "SetContentBytesForPage()" ...
If you want to change the content of a page, it isn't sufficient to change the content stream of a page. A page may contain references to Form XObjects that contain content that you want to remove.
A secondary problem consists of images. For instance: suppose that your document consists of a scanned document that has been OCR'ed. In that case, it isn't sufficient to remove the (vector) text, you'll also need to manipulate the (pixel) text in the image.
Assuming that your secondary problem doesn't exist, you'll need a double approach:
1. Get the content from the page as text to detect on which pages there are names or words you want to remove.
2. Recursively loop over all the content streams to find that text and to rewrite those content streams without that text.
From your question, I assume that you have already solved problem 1. Solving problem 2 isn't that trivial. In chapter 15 of my book, I have an example where extracting text returns "Hello World", but when you look inside the content stream, you see:
BT
/F1 12 Tf
88.66 367 Td
(ld) Tj
-22 0 Td
(Wor) Tj
-15.33 0 Td
(llo) Tj
-15.33 0 Td
(He) Tj
ET
Before you can remove "Hello World" from this stream snippet, you'll need some heuristics so that your program recognizes the text in this syntax.
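As a rough illustration of why a plain string search is not enough, the following sketch treats the snippet above as ordinary text. It uses only the .NET base class library; the class name, method name and regular expression are made up for this example, and the regular expression is far too naive for real content streams (it ignores escapes, hex strings, TJ arrays and so on).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class ContentStreamDemo
{
    // The content stream snippet quoted above, one operator per line.
    public const string Content =
        "BT\n/F1 12 Tf\n88.66 367 Td\n(ld) Tj\n-22 0 Td\n(Wor) Tj\n" +
        "-15.33 0 Td\n(llo) Tj\n-15.33 0 Td\n(He) Tj\nET";

    // Naive extraction: concatenates the literal string of every Tj operator
    // in stream order and ignores the Td positioning operators entirely.
    public static string ShownStringsInStreamOrder(string content)
    {
        var sb = new StringBuilder();
        foreach (Match m in Regex.Matches(content, @"\(([^)]*)\)\s*Tj"))
        {
            sb.Append(m.Groups[1].Value);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // The visible text is never present as one literal string...
        Console.WriteLine(Content.Contains("Hello World"));    // False
        // ...and the fragments appear in drawing order, not reading order.
        Console.WriteLine(ShownStringsInStreamOrder(Content)); // ldWorlloHe
    }
}
```

A Replace("Hello World", ...) on this stream would therefore change nothing; the fragments would first have to be reordered using the Td offsets.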
Once you've found the text, you need to rewrite the stream. For inspiration, you can take a look at the OCG remover functionality in the itext-xtra package.
Long story short: if your PDFs are relatively simple, that is, the text can be easily detected in the different content streams (page content and Form XObject content), then it's simply a matter of rewriting those streams after some string manipulations.
I've made you a simple example named ReplaceStream that replaces "Hello World" with "HELLO WORLD" in a PDF.
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.getPageN(1);
    PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
    if (object instanceof PRStream) {
        PRStream stream = (PRStream)object;
        byte[] data = PdfReader.getStreamBytes(stream);
        stream.setData(new String(data).replace("Hello World", "HELLO WORLD").getBytes());
    }
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();
}
Some caveats:
I check if object is a stream. It could also be an array of streams. In that case, you need to loop over that array.
I don't check if there are form XObjects defined for the page.
I assume that Hello World can be easily detected in the PDF Syntax.
...
In real life, PDFs are never that simple and the complexity of your project will increase dramatically with every special feature that is used in your documents.
The C# equivalent of the code by Bruno:
static void manipulatePdf(String src, String dest)
{
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.GetPageN(1);
    PdfObject pdfObject = dict.GetDirectObject(PdfName.CONTENTS);
    // Only the single-stream case is handled here; /Contents may also be an array
    if (pdfObject.IsStream())
    {
        PRStream stream = (PRStream)pdfObject;
        byte[] data = PdfReader.GetStreamBytes(stream);
        stream.SetData(System.Text.Encoding.ASCII.GetBytes(System.Text.Encoding.ASCII.GetString(data).Replace("Hello World", "HELLO WORLD")));
    }
    FileStream outStream = new FileStream(dest, FileMode.Create);
    PdfStamper stamper = new PdfStamper(reader, outStream);
    // Closing the stamper writes the modified document to outStream, as in the Java version
    stamper.Close();
    reader.Close();
}
I'll update this if it would turn out to still contain errors.
In follow-up to my previous C# code and the remark by Bruno that GetDirectObject(PdfName.CONTENTS) might as well return an array as opposed to a stream: In my particular case, this turned out to be true.
The PdfObject returned "true" for IsArray(). I checked and the array elements were all PdfIndirectReference.
A further look at the API yielded two useful bits of info:
PdfIndirectReference had a "Number" property, leading you to another PdfObject.
You can get to the referenced object using reader.GetPdfObject(int ref), where ref is the "Number" property of the IndirectReferenceObject
From there on out, you get a new PdfObject that you can check using IsStream() and modify as per the previously posted code.
So it works out to this (mind you, this is quick and dirty, but it works for my particular purposes...):
// Get the contents of my page...
PdfObject pdfObject = pageDict.GetDirectObject(PdfName.CONTENTS);
// Check that this is, in fact, an array or something else...
if (pdfObject.IsArray())
{
    PdfArray streamArray = pageDict.GetAsArray(PdfName.CONTENTS);
    for (int j = 0; j < streamArray.Size; j++)
    {
        PdfIndirectReference arrayEl = (PdfIndirectReference)streamArray[j];
        PdfObject refdObj = reader.GetPdfObject(arrayEl.Number);
        if (refdObj.IsStream())
        {
            PRStream stream = (PRStream)refdObj;
            byte[] data = PdfReader.GetStreamBytes(stream);
            stream.SetData(System.Text.Encoding.ASCII.GetBytes(System.Text.Encoding.ASCII.GetString(data).Replace(targetedText, newText)));
        }
    }
}
I have a working solution to load and render a PDF document from a byte array in a Windows Store App. Lately some users have reported out-of-memory errors though. As you can see in the code below there is one stream I am not disposing of. I've commented out the line. If I do dispose of that stream, then the PDF document does not render anymore. It just shows a completely white image. Could anybody explain why and how I could load and render the PDF document and dispose of all disposables?
private static async Task<PdfDocument> LoadDocumentAsync(byte[] bytes)
{
    using (var stream = new InMemoryRandomAccessStream())
    {
        await stream.WriteAsync(bytes.AsBuffer());
        stream.Seek(0);
        var fileStream = RandomAccessStreamReference.CreateFromStream(stream);
        var inputStream = await fileStream.OpenReadAsync();
        try
        {
            return await PdfDocument.LoadFromStreamAsync(inputStream);
        }
        finally
        {
            // do not dispose otherwise pdf does not load / render correctly. Not disposing though may cause memory issues.
            // inputStream.Dispose();
        }
    }
}
and the code to render the PDF
private static async Task<ObservableCollection<BitmapImage>> RenderPagesAsync(
    PdfDocument document,
    PdfPageRenderOptions options)
{
    var items = new ObservableCollection<BitmapImage>();
    if (document != null && document.PageCount > 0)
    {
        for (var pageIndex = 0; pageIndex < document.PageCount; pageIndex++)
        {
            using (var page = document.GetPage((uint)pageIndex))
            {
                using (var imageStream = new InMemoryRandomAccessStream())
                {
                    await page.RenderToStreamAsync(imageStream, options);
                    await imageStream.FlushAsync();
                    var renderStream = RandomAccessStreamReference.CreateFromStream(imageStream);
                    using (var stream = await renderStream.OpenReadAsync())
                    {
                        var bitmapImage = new BitmapImage();
                        await bitmapImage.SetSourceAsync(stream);
                        items.Add(bitmapImage);
                    }
                }
            }
        }
    }
    return items;
}
As you can see I am using this RandomAccessStreamReference.CreateFromStream method in both of my methods. I've seen other examples that skip that step and use the InMemoryRandomAccessStream directly to load the PDF document or the bitmap image, but I've not managed to get the PDF to render correctly then. The images will just be completely white again. As I mentioned above, this code does actually render the PDF correctly, but does not dispose of all disposables.
Why
I assume LoadFromStreamAsync(IRandomAccessStream) does not parse the whole stream into the PdfDocument object but instead only parses the main PDF dictionaries and holds a reference to the IRandomAccessStream.
This is actually the sane thing to do: why parse the whole PDF into its own objects (possibly a very expensive operation resource-wise) if the user eventually only wants to render one page, or merely wants to query the number of pages?
Later on, when other methods of the returned PdfDocument are called, e.g. GetPage, these methods try to read the additional data they need for their task (e.g. for rendering) from the stream. Unfortunately, in your case that happens after the finally { inputStream.Dispose(); } block has already disposed of the stream.
How else
You have to postpone the inputStream.Dispose() until all operations on the PdfDocument are finished. That means some hopefully minor architectural changes for your code. Probably moving the LoadDocumentAsync code as a frame into the RenderPagesAsync method or its caller suffices.
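The ordering issue can be sketched with plain .NET streams. This is only an analogy: LazyDocument is a hypothetical stand-in for a type that keeps a reference to its source stream, not the Windows.Data.Pdf API, and the failure shows up here as an exception rather than as a white image.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for a document type that keeps a reference to its
// source stream and reads from it lazily.
public sealed class LazyDocument
{
    private readonly Stream _source;

    public LazyDocument(Stream source)
    {
        _source = source;
    }

    // Reads one byte at the given offset only when asked, the way a renderer
    // might fetch page data on demand.
    public int ReadByteAt(long offset)
    {
        _source.Position = offset;
        return _source.ReadByte();
    }
}

public static class DisposeOrderDemo
{
    // Disposes the stream before the document is used: the later read fails.
    public static bool EarlyDisposeFails()
    {
        LazyDocument doc;
        using (var stream = new MemoryStream(new byte[] { 10, 20, 30 }))
        {
            doc = new LazyDocument(stream);
        }
        try
        {
            doc.ReadByteAt(1);
            return false;
        }
        catch (ObjectDisposedException)
        {
            return true;
        }
    }

    // Keeps the stream alive until all work on the document is finished.
    public static int LateDisposeWorks()
    {
        using (var stream = new MemoryStream(new byte[] { 10, 20, 30 }))
        {
            var doc = new LazyDocument(stream);
            return doc.ReadByteAt(1);
        }
    }

    public static void Main()
    {
        Console.WriteLine(EarlyDisposeFails()); // True
        Console.WriteLine(LateDisposeWorks());  // 20
    }
}
```

The second method has the shape suggested above: the using block that owns the stream encloses every operation on the document.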
I have a pdf file with a cover that looks like the following:
Now, I need to remove the so-called 'galley marks' around the edges of the cover. I am using iTextSharp with C# and I need code using iTextSharp to create a new document with only the intended cover or use PdfStamper to remove that. Or any other solution using iTextSharp that would deliver the results.
I have been unable to find any good code samples in my search to this point.
Do you have to actually remove them or can you just crop them out? If you can just crop them out then the code below will work. If you have to actually remove them from the file, then to the best of my knowledge there isn't a simple way to do that, because those objects aren't explicitly marked as meta-objects. The only way I can think of to remove them would be to inspect everything and see if it fits into the document's active area.
Below is sample code that reads each page in the input file and finds the various boxes that might exist, trim, art and bleed. (See this page.)
As long as it finds at least one it sets the page's crop box to the first item in the list. In your case you might actually have to perform some logic to find the "smallest" of all of those items or you might be able to just know that "art" will always work for you. See the code for additional comments. This targets iTextSharp 5.4.0.0.
//Sample input file
var inputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Binder1.pdf");
//Sample output file
var outputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Cropped.pdf");
//Bind a reader to our input file
using (var r = new PdfReader(inputFile)) {
    //Get the number of pages
    var pageCount = r.NumberOfPages;
    //See this for a list: http://api.itextpdf.com/itext/com/itextpdf/text/pdf/PdfReader.html#getBoxSize(int, java.lang.String)
    var boxNames = new string[] { "trim", "art", "bleed" };
    //We'll create a list of all possible boxes to pick from later
    List<iTextSharp.text.Rectangle> boxes;
    //Loop through each page
    for (var i = 1; i <= pageCount; i++) {
        //Initialize our list for this page
        boxes = new List<iTextSharp.text.Rectangle>();
        //Loop through the list of known boxes
        for (var j = 0; j < boxNames.Length; j++) {
            //If the box exists
            if (r.GetBoxSize(i, boxNames[j]) != null) {
                //Add it to our collection
                boxes.Add(r.GetBoxSize(i, boxNames[j]));
            }
        }
        //If we found at least one box
        if (boxes.Count > 0) {
            //Get the page's entire dictionary
            var dict = r.GetPageN(i);
            //At this point we might want to apply some logic to find the "inner most" box if our trim/bleed/art aren't all the same
            //I'm just hard-coding the first item in the list for demonstration purposes
            //Set the page's crop box to the specified box
            dict.Put(PdfName.CROPBOX, new PdfRectangle(boxes[0]));
        }
    }
    //Create our output file
    using (var fs = new FileStream(outputFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
        //Bind a stamper to our reader and output file
        using (var stamper = new PdfStamper(r, fs)) {
            //We did all of our PDF manipulation above so we don't actually have to do anything here
        }
    }
}
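The "find the smallest box" logic mentioned in the comments could be sketched roughly as follows. To keep the example runnable without iTextSharp, it uses a hypothetical Box struct instead of the library's Rectangle, and it assumes that "innermost" can be approximated by "smallest area", which only holds when the boxes are nested.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal rectangle type (lower-left and upper-right corners);
// not iTextSharp's Rectangle.
public struct Box
{
    public Box(float left, float bottom, float right, float top)
    {
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }

    public float Left { get; }
    public float Bottom { get; }
    public float Right { get; }
    public float Top { get; }

    // Area of the box; degenerate boxes count as zero.
    public float Area
    {
        get { return Math.Max(0f, Right - Left) * Math.Max(0f, Top - Bottom); }
    }
}

public static class BoxPicker
{
    // Returns the box with the smallest area, or null if there are no boxes.
    public static Box? Smallest(IEnumerable<Box> boxes)
    {
        Box? best = null;
        foreach (Box b in boxes)
        {
            if (best == null || b.Area < best.Value.Area)
            {
                best = b;
            }
        }
        return best;
    }

    public static void Main()
    {
        var boxes = new List<Box>
        {
            new Box(0, 0, 612, 792),   // e.g. a bleed box
            new Box(18, 18, 594, 774), // e.g. a trim box
            new Box(9, 9, 603, 783)    // e.g. an art box
        };
        Box? inner = Smallest(boxes);
        Console.WriteLine(inner.Value.Left); // 18
    }
}
```

In the sample above, the same comparison would be applied to the rectangles collected in boxes, and the winner would be used in place of boxes[0].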
I have a stream of bytes which (if put together correctly) forms a valid Word file. I need to convert this stream into a Word file without writing it to disk. I take the original stream from a SQL Server database table:
ID Name FileData
----------------------------------------
1 Word1 292jf2jf2ofm29fj29fj29fj29f2jf29efj29fj2f9 (actual file data)
the FileData field carries the data.
Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
Microsoft.Office.Interop.Word.Document doc = new Microsoft.Office.Interop.Word.Document();
doc = word.Documents.Open(@"C:\SampleText.doc");
doc.Activate();
The above code opens and fills a Word document from the file system. I don't want that; I want to define a new Microsoft.Office.Interop.Word.Document and fill its content manually from a byte stream.
After getting the in-memory Word document, I want to do some parsing of keywords.
Any ideas?
Create an in-memory file system; there are drivers for that.
Give Word a path to an FTP server (or something else), which you then use to push the data.
One important thing to note: storing files in a database is generally not good design.
You could look at how Sharepoint solves this. They have created a web interface for documents stored in their database.
It's not that hard to create or embed a web server in your application that can serve pages to Word. You don't even have to use the standard ports.
There probably isn't any straight-forward way of doing this. I found a couple of solutions searching for it:
Use the OpenOffice SDK to manipulate the document instead of Word Interop
Write the data to the clipboard, and then from the Clipboard to Word
I don't know if this does it for you, but apparently the API doesn't provide what you're after (unfortunately).
There are really only 2 ways to open a Word document programmatically - as a physical file or as a stream. There's a "package", but that's not really applicable.
The stream method is covered here: https://learn.microsoft.com/en-us/office/open-xml/how-to-open-a-word-processing-document-from-a-stream
But even it relies on there being a physical file in order to form the stream:
string strDoc = @"C:\Users\Public\Public Documents\Word13.docx";
Stream stream = File.Open(strDoc, FileMode.Open);
The best solution I can offer would be to write the file out to a temp location where the service account for the application has permission to write:
string newDocument = @"C:\temp\test.docx";
WriteFile(byteArray, newDocument);
If it didn't have permissions on the "temp" folder in my example, you would simply just add the service account of your application (application pool, if it's a website) to have Full Control of the folder.
You'd use this WriteFile() function:
/// <summary>
/// Write a byte[] to a new file at the location where you choose
/// </summary>
/// <param name="byteArray">byte[] that consists of file data</param>
/// <param name="newDocument">Path to where the new document will be written</param>
public static void WriteFile(byte[] byteArray, string newDocument)
{
    // File.WriteAllBytes creates the file (or overwrites it) in one call,
    // so no intermediate MemoryStream is needed
    File.WriteAllBytes(newDocument, byteArray);
}
From there, you can open it with OpenXML and edit the file. There's no way to open a Word document in byte[] form directly into an instance of Word (Interop, OpenXML, or otherwise), because you need a documentPath, or the stream method mentioned earlier, which relies on there being a physical file. You can edit the content by reading the main document part into a string and then loading that string as XML, or by editing the string directly:
string docText = null;
byte[] byteArray = null;
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(documentPath, true))
{
    using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
    {
        docText = sr.ReadToEnd(); // <-- converts byte[] stream to string
    }
    // Play with the XML
    XmlDocument xml = new XmlDocument();
    xml.LoadXml(docText); // the string contains the XML of the Word document
    XmlNodeList nodes = xml.GetElementsByTagName("w:body");
    XmlNode chiefBodyNode = nodes[0];
    // add paragraphs with AppendChild...
    // remove a node by getting a ChildNode and removing it, like this...
    XmlNode firstParagraph = chiefBodyNode.ChildNodes[2];
    chiefBodyNode.RemoveChild(firstParagraph);
    // Or play with the string form
    docText = docText.Replace("John","Joe");
    // If you manipulated the XML, write it back to the string
    //docText = xml.OuterXml; // comment out the line above if XML edits are all you want to do, and uncomment out this line
    // Save the file - yes, back to the file system - required
    using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
    {
        sw.Write(docText);
    }
}
// Read it back in as bytes
byteArray = File.ReadAllBytes(documentPath); // new bytes, ready for DB saving
Reference:
https://learn.microsoft.com/en-us/office/open-xml/how-to-search-and-replace-text-in-a-document-part
I know it's not ideal, but I have searched and not found a way to edit the byte[] directly without a conversion that involves writing out the file, opening it in Word for the edits, then essentially re-uploading it to recover the new bytes. Using byte[] byteArray = Encoding.UTF8.GetBytes(docText); in place of that last line corrupts the bytes, as did every other Encoding I tried (UTF7, Default, Unicode, ASCII); I found this out when I wrote them back out with my WriteFile() function, above. When the bytes were not encoded but simply collected using File.ReadAllBytes() and then written back out using WriteFile(), it worked fine.
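One likely explanation, which is an assumption rather than something established above: a .docx file is a ZIP package, so its bytes are binary data, while docText holds only the XML text of one part inside that package. The sketch below shows the general effect that arbitrary binary data does not survive a round trip through a text encoding. It uses only the .NET base class library; the class and method names and the sample bytes are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Decodes the bytes as UTF-8 text and encodes that text back to bytes.
    public static byte[] RoundTripThroughUtf8(byte[] input)
    {
        string asText = Encoding.UTF8.GetString(input);
        return Encoding.UTF8.GetBytes(asText);
    }

    public static void Main()
    {
        // The first four bytes are the ZIP local file header signature "PK\x03\x04";
        // the trailing 0xFF 0xFE are arbitrary bytes that are not valid UTF-8.
        byte[] binary = { 0x50, 0x4B, 0x03, 0x04, 0xFF, 0xFE };
        byte[] text = Encoding.ASCII.GetBytes("<w:body/>");

        // Binary data is altered because invalid sequences are replaced while decoding.
        Console.WriteLine(binary.SequenceEqual(RoundTripThroughUtf8(binary))); // False
        // Plain ASCII text survives unchanged.
        Console.WriteLine(text.SequenceEqual(RoundTripThroughUtf8(text)));     // True
    }
}
```

This is consistent with the observation that File.ReadAllBytes() followed by WriteFile() works, since no text encoding is involved on that path.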
Update:
It might be possible to manipulate the bytes like this:
//byte[] byteArray = File.ReadAllBytes("Test.docx"); // you might be able to assign your bytes here, instead of from a file?
byte[] byteArray = GetByteArrayFromDatabase(fileId); // function you have for getting the document from the database
using (MemoryStream mem = new MemoryStream())
{
    mem.Write(byteArray, 0, (int)byteArray.Length);
    using (WordprocessingDocument wordDoc =
        WordprocessingDocument.Open(mem, true))
    {
        // do your updates -- see string or XML edits, above
        // Once done, you may need to save the changes....
        //wordDoc.MainDocumentPart.Document.Save();
    }
    // But you will still need to save it to the file system here....
    // You would update "documentPath" to a new name first...
    string documentPath = @"C:\temp\newDoc.docx";
    using (FileStream fileStream = new FileStream(documentPath,
        System.IO.FileMode.CreateNew))
    {
        mem.WriteTo(fileStream);
    }
}
// And then read the bytes back in, to save it to the database
byteArray = File.ReadAllBytes(documentPath); // new bytes, ready for DB saving
Reference:
https://learn.microsoft.com/en-us/previous-versions/office/office-12//ee945362(v=office.12)
But note that even this method will require saving the document, then reading it back in, in order to save it to bytes for the database. It will also fail if the document is in .doc format instead of .docx on that line where the document is being opened.
Instead of that last section for saving the file to the file system, you could just take the memory stream and save that back into bytes once you are outside of the WordprocessingDocument.Open() block, but still inside the using (MemoryStream mem = new MemoryStream()) { ... } statement:
// Convert
byteArray = mem.ToArray();
This will have your Word document byte[].