Can't read some PDF files with iTextSharp - c#

I have a Win32 application that reads PDFs using iTextSharp and inserts an image into each document as a seal.
It has worked fine with 99% of the files we have processed over the past year, but lately some files simply cannot be read.
When I execute the code below:
string inputfile = @"C:\test.pdf";
PdfReader reader = new PdfReader(inputfile);
It gives the exception:
System.NullReferenceException occurred
Message="Object reference not set to an instance of an object."
Source="itextsharp"
StackTrace:
at iTextSharp.text.pdf.PdfReader.ReadPages()
at iTextSharp.text.pdf.PdfReader.ReadPdf()
at iTextSharp.text.pdf.PdfReader..ctor(String filename, Byte[] ownerPassword)
at iTextSharp.text.pdf.PdfReader..ctor(String filename)
at MyApp.insertSeal() in C:\MyApp\Stamper.cs:line 659
The PDF files that throw this exception can be read normally by Adobe Reader, and when I open one of these files with Acrobat and save it, my application can read the saved file.
Are the files corrupted, yet still openable with Adobe Reader?
I am sharing two sample files:
A file that does NOT work: Not-Ok-Version.pdf
A file that works, after I opened and saved it with Acrobat: OK-Version.pdf

Here's the (java, sorry) source for readPages:
protected internal void ReadPages() {
    catalog = trailer.GetAsDict(PdfName.ROOT);
    rootPages = catalog.GetAsDict(PdfName.PAGES);
    pageRefs = new PageRefs(this);
}
trailer, catalog, rootPages, and pageRefs are all member variables of PdfReader.
If the trailer or root/catalog object of a PDF are simply missing, your PDF is REALLY BADLY BROKEN. It's more likely that the xref table is a bit off, and the objects in question simply aren't exactly where they're supposed to be (which is Bad, but recoverable).
HOWEVER, when PdfReader first opens a PDF, it parses ALL the objects in the file, and converts them to the appropriate PdfObject-derived classes.
What it isn't doing is checking to see that the object number claimed by the xref table and the object number read in from the file Actually Match. Highly Unlikely, but possible. Bad software could write out their PDF objects in the wrong order but keep the byte offsets in the xref table correct. Software that overrode the object number from the xref table with the number from that particular byte offset in the file would be fine.
iText is not fine.
I still want to see the PDF.
Yep. That PDF is broken alright. Specifically:
The file's first 70kb or so define a pretty clean little PDF. Changes were then appended to the PDF.
Check that. Someone attempted to append changes to the PDF and failed. Badly. To understand just how badly, let me explain some of the internal syntax of a PDF, illustrated with this example:
%PDF-1.6
1 0 obj
<</Type/SomeObject ...>>
endobj
2 0 obj
<</Type/SomeOtherObj /Ref 1 0 R>>
endobj
3 0 obj
...
endobj
<etc>
xref
0 10
0000000000 65535 f
0000000010 00001 n
0000000049 00002 n
0000000098 00003 n
...
trailer
<</Root 4 0 R /Size 10>>
startxref 124
%%EOF
So we have a header/version "%PDF-1.v", a list of objects (the ones here are called dictionaries), a cross-reference (xref) table listing the byte offsets of all the objects in the list, a trailer giving the root object and the number of objects in the PDF, and finally the byte offset to the 'x' in 'xref'.
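As a rough illustration of that last item, here is a minimal C# sketch that reads the number following the final startxref keyword. The class and method names (PdfTail, FindStartXref) are made up for this example and are not part of iTextSharp; it assumes the whole file fits in memory and does no validation of the offset it finds.

```csharp
using System;
using System.Text;

public static class PdfTail
{
    // Returns the number written after the last "startxref" keyword,
    // or -1 if the keyword or the number is missing.
    public static long FindStartXref(byte[] pdf)
    {
        // ISO-8859-1 maps every byte to exactly one char,
        // so string indexes are the same as byte offsets.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdf);

        int pos = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (pos < 0) return -1;

        // Skip the whitespace (space or line break) after the keyword.
        int i = pos + "startxref".Length;
        while (i < text.Length && char.IsWhiteSpace(text[i])) i++;

        // Collect the decimal digits of the offset.
        int start = i;
        while (i < text.Length && text[i] >= '0' && text[i] <= '9') i++;
        if (i == start) return -1;

        return long.Parse(text.Substring(start, i - start));
    }
}
```

Comparing the returned offset with the real position of the last "xref" keyword is a quick way to see whether a file's final cross-reference pointer is at least plausible.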
You can append changes to an existing PDF. To do so you just add any new or changed objects after the existing %%EOF, a cross reference table to those new objects, and a trailer. The trailer of an appended change should include a /Prev key with the byte offset to the previous cross reference table.
In your NOT-OKAY pdf, someone tried to append changes to a PDF, AND FAILED HORRIBLY.
The original PDF is still there, intact. That's what Reader shows you, and what you get when you save the PDF. I hacked off everything after the first %%EOF in a hex editor, and the file was fine.
So here's the layout of your NOT-OKAY pdf:
%PDF1.4.1
1 0 obj...
2 through 7
xref
0 7
<healthy xref>
trailer <</Size 8 /Root 6 0 R /Info 7 0 R>>
startxref 68308
%%EOF
So far so good. Here's where things get ugly
<binary garbage>
endstream
endobj
xref
0 7
<horribly wrong xref>
trailer <</ID [...] /Info 1 0 R /Root 2 0 R /Size 7>>
startxref 223022
%%EOF
The only thing RIGHT about that section is the startxref value.
Problems:
The second trailer has no /Prev key.
ALL the byte offsets in the second xref table are wrong.
The <binary garbage> is part of a "stream" object, but the beginning of that object IS MISSING. Streams should look something like this:
1 0 obj
<</Type/SomeType/Length 123>>
stream
123 bytes of data
endstream
endobj
The end of this file is made up of some portion of a (compressed, I'd imagine) stream... but without the dictionary at the beginning telling us what filters it's using and how long it is (to say nothing of any missing data), you can't do anything with it.
I suspect that someone tried to completely rebuild this PDF, then accidentally wrote the original 70kb over the beginning of their version. Kaboom.
It would appear that Adobe is simply ignoring the bad appended changes. iText could do this too, but so can you:
When iText fails to open a PDF:
1. Search backwards through the file looking for the second to last %%EOF. Ignore the one at the very end, we want the previous state of the file.
2. Delete everything after the 2nd-to-last %%EOF (if any), and try to open it again.
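The two steps above can be sketched in C# roughly as follows. This is only an illustration: PdfRollback and DropLastRevision are hypothetical names, the whole file is assumed to fit in memory, and the method simply returns null when there is no earlier %%EOF to fall back to.

```csharp
using System;

public static class PdfRollback
{
    // The ASCII bytes of "%%EOF".
    private static readonly byte[] EofMarker = { 0x25, 0x25, 0x45, 0x4F, 0x46 };

    // Finds the last occurrence of marker that ends at or before 'before'.
    private static int LastIndexOf(byte[] data, byte[] marker, int before)
    {
        for (int i = Math.Min(before, data.Length) - marker.Length; i >= 0; i--)
        {
            bool match = true;
            for (int j = 0; j < marker.Length && match; j++)
                match = data[i + j] == marker[j];
            if (match) return i;
        }
        return -1;
    }

    // Returns a copy of the file that ends right after the second-to-last
    // %%EOF, or null if the file has fewer than two %%EOF markers.
    public static byte[] DropLastRevision(byte[] pdf)
    {
        int last = LastIndexOf(pdf, EofMarker, pdf.Length);
        if (last < 0) return null;

        int previous = LastIndexOf(pdf, EofMarker, last);
        if (previous < 0) return null;

        int keep = previous + EofMarker.Length;
        byte[] trimmed = new byte[keep];
        Array.Copy(pdf, trimmed, keep);
        return trimmed;
    }
}
```

In the failure path you could catch the exception from new PdfReader(inputfile), pass File.ReadAllBytes(inputfile) through DropLastRevision, and, if the result is not null, retry with the PdfReader(byte[]) constructor.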
The sad thing is that this broken PDF could have been completely different from the "original" 70kb, and then some IO error overwrote the first part of the file. Unlikely, but there's no way to be sure.

Considering that they are now up to version 5.0, my guess would be that you are seeing increasing numbers of PDFs written to PDF version specs that your version of iTextSharp does not support. It may be time to do an upgrade.

Maybe this will help someone...
I had code that worked for years that started hanging on reading the bookmarks from a PDF file (outlines variable below). It turned out that it broke when the code was updated from .NET 4.0 to .NET 4.5.
As soon as I rolled it back to .NET 4.0, it worked again.
RandomAccessFileOrArray raf = null;
PdfReader reader1 = null;
System.Collections.ArrayList outlines = null;
raf = new iTextSharp.text.pdf.RandomAccessFileOrArray(sFile);
reader1 = new iTextSharp.text.pdf.PdfReader(raf, null);
outlines = iTextSharp.text.pdf.SimpleBookmark.GetBookmark(reader1);
Just for notes, the same VS web application project uses AjaxControlToolkit (from NuGet). Before I rolled it back, I also updated iTextSharp to ver 5.5.5 and it still hung on the same line.

When I pull down the source and run it against the bad PDF there's an exception in ReadPdf() in the 4th try block when it calls ReadDocObj():
"Invalid object number. at file pointer 16"
tokens.StringValue is j
@Mark Storer, you're the iText guy so maybe that means something to you.
From a higher level, at least to my eyes, it seems that when RebuildXref() is called (which I assume is when an invalid PDF is read) it rebuilds trailer but not catalog. The latter is what the NRE is complaining about. Then again, that's just a guess.

Also make sure your HTML doesn't contain an hr tag when converting HTML to PDF:
hdnEditorText.Value.Replace("\"", "'").Replace("<hr />", "").Replace("<hr/>", "")

Related

How do i start reading a text file from a specific point?

So my question is basically: how do I start reading a file from a specific line, for example from line 14 to line 18?
I'm working on a simple ContactList app and the only thing missing is deleting the information for a specific name. The user can create a new contact which has a name, a number and an address as information. I want the user to also be able to delete the data of that person by typing in their name. Then the program should read the name and all of the 4 lines under it and remove them from the text file. How could I achieve this?
You can jump to any offset within a file. However, there isn't any way to know where a particular line begins unless you know the length of every line.
If you are writing a contact app, you should not use a regular text file unless:
You pad line lengths so that you can easily calculate the position of each line.
You are loading the entire file into memory.
You can't. You need to read the first n lines in order to find out which line has which number, unless your records have a fixed length per line (which is not a good idea: there's always someone with a longer name than you could think of).
Likewise, you can't delete a line from the text file. The space on disk does not move by itself. You need an algorithm that implements safe saving and rearranges the data:
foreach line in input_file:
    if line is needed:
        write line to temporary_output_file
    else:
        ignore (don't write = delete)
delete input_file
move temporary_output_file to input_file
Disadvantage: you need about double the disk space while input_file and temporary_output_file both exist.
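For the contact list in the question, that pseudocode might look roughly like this in C#. It is a sketch under assumptions: ContactFile and RemoveRecord are hypothetical names, a record is taken to be the name line followed by a fixed number of lines, and only the first matching name is removed.

```csharp
using System;
using System.IO;

public static class ContactFile
{
    // Copies every line except the first record whose name line equals 'name'
    // (the name line plus the next linesAfterName lines) to a temporary file,
    // then puts the temporary file in place of the original.
    // Returns true if a record was removed.
    public static bool RemoveRecord(string path, string name, int linesAfterName)
    {
        string tempPath = path + ".tmp";
        bool removed = false;
        int toSkip = 0;

        using (var reader = new StreamReader(path))
        using (var writer = new StreamWriter(tempPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (toSkip > 0)
                {
                    toSkip--;            // still inside the record being deleted
                    continue;
                }
                if (!removed && line == name)
                {
                    removed = true;      // drop the name line itself
                    toSkip = linesAfterName;
                    continue;
                }
                writer.WriteLine(line);  // line is needed, keep it
            }
        }

        File.Delete(path);
        File.Move(tempPath, path);
        return removed;
    }
}
```

With the layout described in the question, a call such as ContactFile.RemoveRecord("contacts.txt", "Alice", 4) would drop Alice's name line and the four lines under it; the file name here is only a placeholder.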
With safe saving, the NTFS file system driver will give the moved file the same time stamp that it had before deleting the file. Read the Windows Internals 7 book (should be part 2, chapter 11) to understand it in detail.
Depending on how large the contact list is (probably it's less than 10M entries), there's no problem of loading the whole database into memory, deleting the record and then writing everything back.

How can I set halftone (sethalftone) to each separation color with device tiffsep1 and other separation ones?

The code works, but enabling the commented-out lines causes an error. The error is not solved by changing -sDEVICE to tiffgray, for example.
String[] ARGS = new String[] {
    "",
    "-sDEVICE=tiffsep1",
    "-r1200",
    "-o out.tiff",
    "SOSample.pdf",
    //"-c",
    //"<< /HalftoneType 1 /Frequency 300 /Angle 45 /SpotFunction {180 mul cos exch 180 mul cos add 2 div} >> sethalftone",
    //"-f"
};
How can I define sethalftone with ghostscript and how can I set it for each color of tiffsep1? What am I doing wrong with one color and how to make it for separations?
I'm using:
[DllImport("gsdll64.dll", EntryPoint = "gsapi_init_with_args")]
public static extern int INSTANCEStart(IntPtr instance, int argc, string[] argv);
and so on.
I'm working with Ghostscript 9.52.
Something that could help (\"):
"-c",
"\"<</Orientation 1>> setpagedevice\"",
You need to use the sethalftone PostScript operator in order to change the halftone. Obviously this will involve writing some PostScript.
Not only that, but you really need to set the default halftone, or set the halftone at the start of the page, because the current PDF interpreter in Ghostscript does an initgraphics at the start of every page of a PDF file.
For all of this you are going to need a copy of the PostScript Language Reference Manual, which you can get from somewhere on the Adobe web site. They keep moving stuff around so I'm not going to try and post a link, just google for the name of the manual. You want the third edition.
So you need to write a BeginPage procedure, which you will find covered in Chapter 6 under device control, pages 427 onwards.
The BeginPage procedure will need to set a halftone, and you will find halftones covered in Section 7.4, page 480 onwards. You will presumably want to use either a type 2 or type 4 halftone dictionary.
When you've assembled that, you then need to pass it to Ghostscript before you process the PDF file. The simplest method is to put the PostScript program in a file (called eg setup.ps) and then put that filename on the command line immediately before the PDF filename.
Eg:
gs -r1200 -sDEVICE=tiffsep1 -o out%d.tif setup.ps sample.pdf
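From C#, the same ordering could be sketched as below. Everything here is an assumption for illustration: HalftoneSetup and BuildArgs are hypothetical names, and the PostScript simply wraps the type 1 halftone from the question in a BeginPage procedure (which has to pop the page-count operand it is handed). It has not been tested against tiffsep1, and a per-separation setup would need a different halftone dictionary as described above.

```csharp
using System;
using System.IO;

public static class HalftoneSetup
{
    // PostScript that installs a BeginPage procedure. The procedure discards
    // the integer it receives and then sets a halftone for the page.
    // The halftone dictionary is the one from the question, used as a placeholder.
    public const string SetupPostScript =
        "<< /BeginPage {\n" +
        "  pop\n" +
        "  << /HalftoneType 1 /Frequency 300 /Angle 45\n" +
        "     /SpotFunction {180 mul cos exch 180 mul cos add 2 div} >> sethalftone\n" +
        "} >> setpagedevice\n";

    // Writes the setup file and returns an argument array in which the
    // setup file comes immediately before the PDF file.
    public static string[] BuildArgs(string setupPath, string pdfPath)
    {
        File.WriteAllText(setupPath, SetupPostScript);
        return new[]
        {
            "",                  // argv[0], ignored by Ghostscript
            "-sDEVICE=tiffsep1",
            "-r1200",
            "-o", "out%d.tif",
            setupPath,           // runs before the PDF is interpreted
            pdfPath
        };
    }
}
```

The returned array can be passed to gsapi_init_with_args in place of the ARGS array from the question.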
Note that PDF files can contain a halftone specification themselves (this is deprecated in PDF 2.0) and Ghostscript will honour any halftone in a PDF file.
Finally; this is an unusual request and, given that you are writing code to link to the Ghostscript DLL, makes me think you may be using Ghostscript commercially. You should review the AGPL to ensure you are complying with the terms of the license. If you plan on distributing your application you will almost certainly need a commercial license.

How do I access the data in a Avro.snz file with C#

I have an Avro.snz file whose
avro.codecs is snappy
This can be opened with com.databricks.avro in Spark but it seems snappy is unsupported by Apache.Avro and Confluent.Avro, they only have deflate and null. Although they can get me the Schema, I cannot get at the data.
The next method gets an error. IronSnappy is unable to decompress the file too; it says the input is corrupt.
using (Avro.File.IFileReader<generic> reader = Avro.File.DataFileReader<generic>.OpenReader(avro_path))
{
    schema = reader.GetSchema();
    Console.WriteLine(reader.HasNext()); //true
    var hi = reader.Next(); // error
    Console.WriteLine(hi.ElementAt(0).ToString()); // error
}
I'm starting to wonder if there is anything in the Azure HDInsight library, but I can't seem to find a NuGet package that gives me a way to read Avro with support for Snappy compression.
I'm open to any solution, even if that means downloading the source for Apache.Avro and adding in Snappy support manually, but to be honest, I'm sort of a newbie and have no idea how compression even works let alone add support to a library.
Can anyone help?
Update:
Just adding the snappy codec to Apache.Avro and changing the DeflateStream to Ironsnappy stream failed. It gave Corrupt input again. Is there anything anywhere that can open Snappy compressed Avro files with C#?
Or how do I determine what part of the Avro is snappy compressed and pass that to Ironsnappy.
Ok, so not even any comments on this. But I eventually solved my problem. Here is how I solved it.
I tried Apache.Avro and the Confluent version as well, but their .NET versions have no snappy support, darn. I can still get the schema, though, as that is apparently uncompressed.
Since Parquet.Net uses IronSnappy, I built a snappy codec into Apache.Avro by basically cloning its deflate code and changing a few names. Failed. "Corrupt input", IronSnappy says.
I researched Avro and saw that a file consists of an uncompressed schema, followed by the name of the compression codec for the data, then the data itself, which is divided into blocks. Well, I had no idea where a block starts and ends. The binary in the file encodes that somehow, but I couldn't work it out even with a hex editor. I think Apache.Avro reads a long or a varint, and the hex editor I used doesn't show me that.
I found the avro-tools.jar tool inside Apache.Avro. To make it easier to use, I made it an executable with launch4j (a totally superfluous move, but whatever). Then I used that to cat my Avro file into 1 row, both uncompressed and snappy. I used that as my base and followed the flow of Apache.Avro in the debugger, while also tracking the byte indexes with the hex editor and the C# debugger.
With 1 row, it is guaranteed to be 1 block. So I ran a loop over the byte start index and end index. I found my Snappy block and was able to decompress it with IronSnappy. I modified the codec portion of my Apache.Avro snappy codec code to make it work with 1 block (which was basically whatever block Apache.Avro took minus 4 bytes, which I assume is the Snappy CRC check, which I ignored).
It failed with multiple blocks. I found that's because Apache.Avro always hands the deflate codec a 4096-byte array after the first block. I reduced it to the size actually read and did the minus-4 thing again. It worked.
Success! So basically: copy deflate over as a template for snappy, reduce the block length by 4 bytes, and make sure to resize the byte array to the block size before getting IronSnappy to decompress.
// Snappy codec: decode everything except the last 4 bytes of the block
// (assumed to be the checksum, which is ignored here).
public override byte[] Decompress(byte[] compressedData)
{
    int snappySize = compressedData.Length - 4;
    byte[] compressedSnappy_Data = new byte[snappySize];
    System.Array.Copy(compressedData, compressedSnappy_Data, snappySize);
    byte[] result = IronSnappy.Snappy.Decode(compressedSnappy_Data);
    return result;
}

// In the block reader: shrink the buffer to the actual block size before decompressing.
if (_codec.GetHashCode() == DataFileConstants.SnappyCodecHash)
{
    byte[] snappyBlock = new byte[(int)_currentBlock.BlockSize];
    System.Array.Copy(_currentBlock.Data, snappyBlock, (int)_currentBlock.BlockSize);
    _currentBlock.Data = snappyBlock;
}
I didn't bother with actually using the checksum as I don't know how or need to? At least not right now. And I totally ignored the compress function.
but if you really want my compress function here it is
public override byte[] Compress(byte[] uncompressedData)
{
    return new byte[0];
}
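On the 4 bytes that the Decompress code skips: in Avro's snappy codec each compressed block is followed by a 4-byte, big-endian CRC32 of the uncompressed data. A minimal sketch of checking it is shown here, with hypothetical names (AvroSnappyChecksum, Crc32, ReadTrailingCrc, Matches) and a plain bitwise CRC32 so that no extra library is assumed.

```csharp
using System;

public static class AvroSnappyChecksum
{
    // Standard CRC-32 (reflected polynomial 0xEDB88320), computed bit by bit.
    public static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int k = 0; k < 8; k++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    // Reads the last 4 bytes of the block as a big-endian unsigned integer.
    public static uint ReadTrailingCrc(byte[] block)
    {
        int n = block.Length;
        return ((uint)block[n - 4] << 24) | ((uint)block[n - 3] << 16)
             | ((uint)block[n - 2] << 8) | block[n - 1];
    }

    // True when the checksum stored at the end of the block matches
    // the checksum of the decompressed bytes.
    public static bool Matches(byte[] block, byte[] uncompressed)
    {
        return block.Length >= 4 && ReadTrailingCrc(block) == Crc32(uncompressed);
    }
}
```

Inside Decompress, calling Matches(compressedData, result) before returning would turn a silently corrupt block into a detectable error.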
The simplest solution would be to use AvroConvert:
ResultModel resultObject = AvroConvert.Deserialize<ResultModel>(avroObject); // avroObject is a byte[]
From https://github.com/AdrianStrugala/AvroConvert
The null, deflate, snappy and gzip codecs are supported.

Remove Byte Order Mark from signed PDF file?

I am using iTextSharp 5.5.1 in order to sign PDF files digitally with a detached signature (obtained from a third party authority). Everything seems to work fine, the file is valid and e.g. Adobe Reader reports no problems, displays the signatures as valid etc.
The problem is that the Java Clients have apparently some problems with those files - the file can be neither opened nor parsed.
The files have a byte order mark in the header which seems to cause the behavior (0xEF 0xBB 0xBF).
I could identify the BOM like this:
PdfReader reader = new PdfReader(path);
byte[] metadata = reader.Metadata;
// metadata[0], metadata[1], metadata[2] contain the BOM
How can I either remove the BOM (without losing the validity of the signature), or force the iTextSharp library not to append these bytes into the files?
First things first: once a PDF is signed, you shouldn't change any byte of that PDF, because you invalidate the signature if you do.
Second observation: the byte order mark is not part of the PDF header (a PDF always starts with %PDF-1.). In this context, it is the value of the begin attribute in the processing instruction of XMP metadata. I don't know of any Java client that has a problem with that byte sequence anywhere in a file. If they do have a problem with it, there's a problem with that client, not with the file.
The Byte Order Mark is an indication of the presence of UTF-8 characters. In the context of XMP, we have a stream inside the PDF that contains a clear text XML file that can be consumed by software that is not "PDF aware". For instance:
2 0 obj
<</Type/Metadata/Subtype/XML/Length 3492>>stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.1.0-jc003">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
dc:format="application/pdf"
pdf:Keywords="Metadata, iText, PDF"
pdf:Producer="iText® 5.5.4-SNAPSHOT ©2000-2014 iText Group NV (AGPL-version); modified using iText® 5.5.4-SNAPSHOT ©2000-2014 iText Group NV (AGPL-version)"
xmp:CreateDate="2014-11-07T16:36:55+01:00"
xmp:CreatorTool="My program using iText"
xmp:ModifyDate="2014-11-07T16:36:56+01:00"
xmp:MetadataDate="2014-11-07T16:36:56+01:00">
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default">This example shows how to add metadata</rdf:li>
</rdf:Alt>
</dc:description>
<dc:creator>
<rdf:Seq>
<rdf:li>Bruno Lowagie</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:subject>
<rdf:Bag>
<rdf:li>Metadata</rdf:li>
<rdf:li>iText</rdf:li>
<rdf:li>PDF</rdf:li>
</rdf:Bag>
</dc:subject>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Hello World example</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
endstream
Such non-PDF-aware software will look for the sequence W5M0MpCehiHzreSzNTczkc9d, which is a sequence that is unlikely to appear by accident in a data stream.
The begin attribute is there to indicate that the characters in the stream use UTF-8 encoding. They are there because it is good practice for them to be there, but they are not mandatory (ISO-16684-1).
You could retrieve the metadata the way you do (byte[] metadata = reader.Metadata;), remove the bytes, and change the stream with a PdfStamper instance like this:
stamper.XmpMetadata = metadata;
After you have changed the metadata, you can sign the PDF.
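A minimal sketch of the "remove the bytes" step, assuming the stray bytes are exactly a UTF-8 BOM (0xEF 0xBB 0xBF) at the very start of the array. BomStripper and StripUtf8Bom are hypothetical names, not iTextSharp API.

```csharp
using System;

public static class BomStripper
{
    // Returns a copy of data without a leading UTF-8 BOM.
    // If there is no BOM at the start, the input array is returned unchanged.
    public static byte[] StripUtf8Bom(byte[] data)
    {
        bool hasBom = data != null && data.Length >= 3
            && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
        if (!hasBom) return data;

        byte[] trimmed = new byte[data.Length - 3];
        Array.Copy(data, 3, trimmed, 0, trimmed.Length);
        return trimmed;
    }
}
```

The result would then be assigned before signing, for example stamper.XmpMetadata = BomStripper.StripUtf8Bom(reader.Metadata);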
Note that one aspect of your question surprises me. You write:
// metadata[0], metadata[1], metadata[2] contain the BOM
It is very strange that the first three bytes of the XMP metadata contain the BOM. XMP metadata is supposed to start with <?xpacket. If it doesn't, you are doing the right thing by removing those bytes.
Caveat: a PDF can contain XMP metadata at different levels. Right now, you are examining the most common one: document-level metadata. You may encounter PDFs with page-level XMP metadata, with XMP inside an image, etc...
Just a quick approach:
First: save both files un-encrypted.
Second: remove metadata 0 through 2 before saving the file
There are some considerations however: does the signing method require a BOM? Does the encryption method require a BOM?
You will also have to ascertain at what stage the BOM is added before you can determine whether you can/should remove the BOM.
I will have a quick hunt around for my PDF structure docs and see what I can find; however, the simplest way (untried) would be to load the whole thing as a byte array and simply remove 0xEF 0xBB 0xBF from the start of the file, then do any signing/encryption. However, they may add it in again...
I will post an update over the weekend:)

Get document properties from PDF in iTextSharp

I'm trying to get some information out of a PDF file. I've tried using PdfSharp, and it has properties for the information I need, but it cannot open iref streams, so I've had to abandon it.
Instead I'm trying iTextSharp. So far I've managed to get some basic information out, like the title, author and subject, from the Info array.
However, I'm now after a bit more information, but cannot find where it is exposed (if it is exposed) in iTextSharp. The information I am after is highlighted in the image below:
I cannot figure out where this information is stored. Any and all help will be much appreciated.
For documents encrypted using standard password encryption you can retrieve the permissions after opening the file in a PdfReader pdfReader using
getPermissions() in case of iText/Java
int permissions = pdfReader.getPermissions()
Permissions in case of iTextSharp/.Net
int permissions = pdfReader.Permissions
The int value returned is the P value of the encryption dictionary which contains
A set of flags specifying which operations shall be permitted when the document is opened with user access (see Table 22).
[...]
The value of the P entry shall be interpreted as an unsigned 32-bit quantity containing a set of flags specifying which access permissions shall be granted when the document is opened with user access. Table 22 shows the meanings of these flags. Bit positions within the flag word shall be numbered from 1 (low-order) to 32 (high order). A 1 bit in any position shall enable the corresponding access permission.
[...]
Bit position Meaning
3 (Security handlers of revision 2) Print the document. (Security handlers of revision 3 or greater) Print the document (possibly not at the highest quality level, depending on whether bit 12 is also set).
4 Modify the contents of the document by operations other than those controlled by bits 6, 9, and 11.
5 (Security handlers of revision 2) Copy or otherwise extract text and graphics from the document, including extracting text and graphics (in support of accessibility to users with disabilities or for other purposes). (Security handlers of revision 3 or greater) Copy or otherwise extract text and graphics from the document by operations other than that controlled by bit 10.
6 Add or modify text annotations, fill in interactive form fields, and, if bit 4 is also set, create or modify interactive form fields (including signature fields).
9 (Security handlers of revision 3 or greater) Fill in existing interactive form fields (including signature fields), even if bit 6 is clear.
10 (Security handlers of revision 3 or greater) Extract text and graphics (in support of accessibility to users with disabilities or for other purposes).
11 (Security handlers of revision 3 or greater) Assemble the document (insert, rotate, or delete pages and create bookmarks or thumbnail images), even if bit 4 is clear.
12 (Security handlers of revision 3 or greater) Print the document to a representation from which a faithful digital copy of the PDF content could be generated. When this bit is clear (and bit 3 is set), printing is limited to a low-level representation of the appearance, possibly of degraded quality.
(Section 7.6.3.2 "Standard Encryption Dictionary" in the PDF specification ISO 32000-1)
You can use the PdfWriter.ALLOW_* constants in this context.
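To make the bit numbering concrete, here is a small sketch that tests individual bits of the P value using the 1-based positions from the table quoted above. PermissionBits, IsBitSet and Describe are hypothetical helpers for illustration, not iTextSharp API.

```csharp
using System;

public static class PermissionBits
{
    // Bit positions are numbered from 1 (low-order) to 32 (high-order),
    // so position n corresponds to the mask 1 << (n - 1).
    public static bool IsBitSet(int p, int bitPosition)
    {
        return (p & (1 << (bitPosition - 1))) != 0;
    }

    // Summarises the permission bits listed in the table.
    public static string Describe(int p)
    {
        return string.Format(
            "print={0} modify={1} copy={2} annotate={3} fillIn={4} " +
            "accessibility={5} assemble={6} highQualityPrint={7}",
            IsBitSet(p, 3), IsBitSet(p, 4), IsBitSet(p, 5), IsBitSet(p, 6),
            IsBitSet(p, 9), IsBitSet(p, 10), IsBitSet(p, 11), IsBitSet(p, 12));
    }
}
```

For example, PermissionBits.IsBitSet(pdfReader.Permissions, 3) reports whether printing is permitted under user access.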
Concerning the dialog screenshot you made, though, be aware that the operations effectively allowed do not only depend on the PDF document but also on the PDF viewer! Otherwise you might be caught in the same trap as the OP of this question.
Thanks to mkl for your answer, it was part of the story, but here is the answer which you helped me find:
using (var pdf = new PdfReader(File))
{
    Console.WriteLine(PdfEncryptor.IsModifyAnnotationsAllowed(pdf.Permissions));
}
The PdfEncryptor is what was missing, it converts the P value into a simple bool for yes or no. Other methods on there are:
IsAssemblyAllowed
IsCopyAllowed
IsDegradedPrintingAllowed
IsFillInAllowed
IsModifyAnnotationsAllowed
IsModifyContentsAllowed
IsPrintingAllowed
IsScreenReadersAllowed
As for the security method part, this is what I went with:
using (var pdf = new PdfReader(File))
{
    Console.WriteLine(!pdf.IsOpenedWithFullPermissions == Expected);
}
