I am using iTextSharp 5.5.1 in order to sign PDF files digitally with a detached signature (obtained from a third party authority). Everything seems to work fine, the file is valid and e.g. Adobe Reader reports no problems, displays the signatures as valid etc.
The problem is that the Java clients apparently have some problems with those files - the files can be neither opened nor parsed.
The files have a byte order mark in the header which seems to cause the behavior (0xEF 0xBB 0xBF).
I could identify the BOM like this:
PdfReader reader = new PdfReader(path);
byte[] metadata = reader.Metadata;
// metadata[0], metadata[1], metadata[2] contain the BOM
How can I either remove the BOM (without losing the validity of the signature), or force the iTextSharp library not to append these bytes into the files?
First things first: once a PDF is signed, you shouldn't change any byte of that PDF, because you invalidate the signature if you do.
Second observation: the byte order mark is not part of the PDF header (a PDF always starts with %PDF-1.). In this context, it is the value of the begin attribute in the processing instruction of XMP metadata. I don't know of any Java client that has a problem with that byte sequence anywhere in a file. If they do have a problem with it, there's a problem with that client, not with the file.
The Byte Order Mark indicates that the characters that follow are encoded as UTF-8. In the context of XMP, we have a stream inside the PDF that contains a clear-text XML file that can be consumed by software that is not "PDF aware". For instance:
2 0 obj
<</Type/Metadata/Subtype/XML/Length 3492>>stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.1.0-jc003">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
        xmlns:xmp="http://ns.adobe.com/xap/1.0/"
        dc:format="application/pdf"
        pdf:Keywords="Metadata, iText, PDF"
        pdf:Producer="iText® 5.5.4-SNAPSHOT ©2000-2014 iText Group NV (AGPL-version); modified using iText® 5.5.4-SNAPSHOT ©2000-2014 iText Group NV (AGPL-version)"
        xmp:CreateDate="2014-11-07T16:36:55+01:00"
        xmp:CreatorTool="My program using iText"
        xmp:ModifyDate="2014-11-07T16:36:56+01:00"
        xmp:MetadataDate="2014-11-07T16:36:56+01:00">
      <dc:description>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">This example shows how to add metadata</rdf:li>
        </rdf:Alt>
      </dc:description>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>Bruno Lowagie</rdf:li>
        </rdf:Seq>
      </dc:creator>
      <dc:subject>
        <rdf:Bag>
          <rdf:li>Metadata</rdf:li>
          <rdf:li>iText</rdf:li>
          <rdf:li>PDF</rdf:li>
        </rdf:Bag>
      </dc:subject>
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Hello World example</rdf:li>
        </rdf:Alt>
      </dc:title>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
endstream
Such non-PDF-aware software will look for the sequence W5M0MpCehiHzreSzNTczkc9d, which is a sequence that is unlikely to appear by accident in a data stream.
The begin attribute is there to indicate that the characters in the stream use UTF-8 encoding. The bytes are there because it is good practice for them to be there, but they are not mandatory (ISO 16684-1).
You could retrieve the metadata the way you do (byte[] metadata = reader.Metadata;), remove the bytes, and change the stream with a PdfStamper instance like this:
stamper.XmpMetadata = metadata;
After you have changed the metadata, you can sign the PDF.
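A minimal sketch of that approach (untested; the file names are placeholders, and it assumes the document-level XMP really does start with the three BOM bytes):
PdfReader reader = new PdfReader("input.pdf");
byte[] metadata = reader.Metadata;
if (metadata != null && metadata.Length > 3
    && metadata[0] == 0xEF && metadata[1] == 0xBB && metadata[2] == 0xBF)
{
    // Strip the BOM and write the remaining XMP back.
    byte[] stripped = new byte[metadata.Length - 3];
    Array.Copy(metadata, 3, stripped, 0, stripped.Length);
    using (var stamper = new PdfStamper(reader, new FileStream("no_bom.pdf", FileMode.Create)))
    {
        stamper.XmpMetadata = stripped;
    }
}
reader.Close();
// Sign no_bom.pdf afterwards.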
Note that one aspect of your question surprises me. You write:
// metadata[0], metadata[1], metadata[2] contain the BOM
It is very strange that the first three bytes of the XMP metadata contain the BOM: XMP metadata is supposed to start with <?xpacket. If it doesn't, you are doing the right thing by removing those bytes.
Caveat: a PDF can contain XMP metadata at different levels. Right now, you are examining the most common one: document-level metadata. You may encounter PDFs with page-level XMP metadata, with XMP inside an image, etc...
Just a quick approach:
First: save both files unencrypted.
Second: remove metadata bytes 0 through 2 before saving the file.
There are some considerations, however: does the signing method require a BOM? Does the encryption method require a BOM?
You will also have to ascertain at what stage the BOM is added before you can determine whether you can/should remove it.
I will have a quick hunt around for my PDF structure docs and see what I can get. However, the simplest way would be (untried) to load the whole thing as a byte array and simply remove 0xEF 0xBB 0xBF from the start of the file, then do any signing/encryption. However, they may add it in again...
I will post an update over the weekend :)
Related
We are developing a Pdf parser to be used along with our system.
The requirement is such that, we store all the information on any pdf documents and should be able to reproduce the document as such (with minimal changes from original document).
We did some googling and found iTextSharp to be the best mate for our purpose.
We are developing our project using .net.
You might have guessed, as I mentioned in the title, that we need a comparison between specific versions of iTextSharp (4.1.6 vs 5.x). We know that 4.1.6 is the last version of iTextSharp with the LGPL/MPL license. The 5.x versions are AGPL.
We would like to have a good comparison between the versions before choosing the LGPL version or buying a license for the AGPL version (we don't want to publish our code).
I did some browsing through the revision changes in iTextSharp, but I would like to know if any content exists that makes a good comparison between the versions.
Thanks in advance!
I'm the CTO of iText Software, so just like Michaël who already answered in the comment section, I'm at the same time the most authoritative source as well as a biased source.
There's a very simple comparison chart on the iText web site.
This chart doesn't cover text extraction, so allow me to list the relevant improvements since iText 5.
You've probably also found this page.
In case you wonder about the bug fixes and the performance improvements regarding text parsing, this is a more exhaustive list:
5.0.0: Text extraction: major overhaul to perform calculations in user space. This allows the parser to correctly determine line breaks, even if the text or page is rotated.
5.0.1: Refactored callback so method signature won't need to change as render callback API evolves.
5.0.1: Refactoring to make it easier for outside users to interact with the content stream processor. Also refactored render listener so text and image event listening occurs in the same interface (reduces a lot of non-value-add complexity)
5.0.1: New filtering functionality for text renderers.
5.0.1: Additional utility method for previewing PDF content.
5.0.1: Added a much more advanced text renderer listener that can reconstruct page content based on physical location of text on the page
5.0.1: Added support for XObject Form processing (text added via PdfTemplate can now be parsed)
5.0.1: Added rudimentary support for XObject Image callbacks
5.0.1: Bug fix - text extraction wasn't correct for certain page orientations
5.0.1: Bug fix - matrices were being concatenated in the wrong order.
5.0.1: PdfTextExtractor: changed the default render listener (new location aware strategy)
5.0.1: Getters for GraphicsState
5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
5.0.2: CMapAwareDocumentFont: Tweaks to make processing quasi-invalid PDF files more robust
5.0.2: PdfContentReaderTool: null pointer handling, plus a few well placed flush calls
5.0.2: PdfContentReaderTool: Show details on resource entries
5.0.2: PdfContentStreamProcessor: Adjustment so embedded images don't cause parsing problems and improvements to EI detection
5.0.2: LocationTextExtractionStrategy: Fixed anti-parallel algorithm, plus accounting for negative inter-character offsets. Change to text extraction strategy that builds out the text model first, then computes concatenation requirements.
5.0.2: Adjustments to the LineSegment implementation; optimization of changes made by Bruno to text extraction; for example: introduction of the class MarkedContentInfo.
5.0.3: added method to get area of image in user units
5.0.3: better parsing of inline images
5.0.3: Adding an extra check for begin/end sequences when parsing a ToUnicode stream.
5.0.4: Content streams in arrays should be parsed as if they were separated by whitespace
5.0.4: Expose CTM
5.0.4: Refactor to pull inline image processing into its own class. Added parsing of image data if there is no filter applied (there are some PDFs where there is no white space between the end of the image data and the EI operator). Ultimately, it will be best to actually parse the image data, but this will require a pretty big refactoring of the iText decoders (to work from streams instead of byte[] of known lengths).
5.0.4: Handle multi-stage filters; Correct bug that pulled whitespace as first byte of inline image stream.
5.0.4: Applying stream filters to inline images.
5.0.4: PdfReader: Expose filter decoder for arbitrary byte arrays (instead of only streams)
5.0.6: CMapParser: Fix to read broken ToUnicode cmaps.
5.0.6: handle slightly malformed embedded images
5.0.6: CMapAwareDocumentFont: Some PDFs have a diff map bigger than 256 characters.
5.0.6: performance: Cache the fonts used in text extraction
5.1.2: PRTokeniser: Made the algorithm to find startxref more memory efficient.
5.1.2: RandomAccessFileOrArray: Improved handling for huge files that can't be mapped
5.1.2: CMapAwareDocumentFont: fix NPE if mapping doesn't get initialized (I'd rather wind up with junk characters than throw an unexpected exception down the road)
5.1.3: refactoring of how filters are applied to streams, adjust parser so it can handle multi-stage filters
5.1.3: images: allow correct decoding of 1bpc bitmask images
5.1.3: images: add jbig2 streams to pass through
5.1.3: images: handle null and indirect references in decode parameters, throw exception if unable to decode an image
5.2.0: Better error messages and better handling of zero-sized files and attempts to read past the end of the file.
5.2.0: Removed restriction that using memory mapping requires the file be smaller than ~2GB.
5.2.0: Avoid NullPointerException in RandomAccessFileOrArray
5.2.0: Made a utility method in PdfContentStreamProcessor private and clarified the stateful nature of the class
5.2.0: LocationTextExtractionStrategy: bounds checking on string lengths and refactoring to make code easier to read.
5.2.0: Better handling of color space dictionaries in images.
5.2.0: improve handling of quasi improper inline image content.
5.2.0: don't decode inline image streams until we absolutely need them.
5.2.0: avoid NullPointerException if resource dictionary isn't provided.
5.3.0: LocationTextExtractionStrategy: old comparison approach caused runtime exceptions in Java 7
5.3.3: incorporate the text-rise parameter
5.3.3: expose glyph-by-glyph information
5.3.3: Bugfix: text to user space transformation was being applied multiple times for sub-textrenderinfo objects
5.3.3: Bugfix: Correct baseline calculation so it doesn't include final character spacing
5.3.4: Added low-level filtering hook to LocationTextExtractionStrategy.
5.3.5: Fixed bug in PRTokeniser: handle case where number is at end of stream.
5.3.5: Replaced StringBuffer with StringBuilder in PRTokeniser for performance reasons.
5.4.2: Added an isChunkAtWordBoundary() method to LocationTextExtractionStrategy to check if a space character should be inserted between a previous chunk and the current one.
5.4.2: Added a getCharSpaceWidth() method to LocationTextExtractionStrategy to get the width of a space character.
5.4.2: Added a getText() method to LocationTextExtractionStrategy to get the text of the current Chunk.
5.4.2: Added an appendTextChunk() method to SimpleTextExtractionStrategy to expose the append process so that subclasses can add text from outside the text parse operation.
5.4.5: Added MultiFilteredRenderListener class for PDF parser.
5.4.5: Added GlyphRenderListener and GlyphTextRenderListener classes for processing each glyph rather than processing chunks of text.
5.4.5: Added method getMcid() in TextRenderInfo.
5.4.5: fixed resource leak when many inline images were in content stream
5.5.0: CMapAwareDocumentFont: if font space width isn't defined, use the default width for the font.
5.5.0: PdfContentReader: avoid exception when displaying an empty dictionary.
There are some things that you won't be able to do if you don't upgrade. For instance, you won't be able to do the things described in these slides.
If you look at the roadmap for iText, you'll see that we'll invest even more time on text extraction in the future.
In all honesty: using the 5-year-old version wouldn't only be like reinventing the wheel, it would also mean falling into every pitfall we've fallen into over the last 5 years. I can assure you that buying a license will be less expensive.
I need to convert image files to PDF without using third party libraries in C#. The images can be in any format like (.jpg, .png, .jpeg, .tiff).
I am successfully able to do this with the help of itextsharp; here is the code.
string value = string.Empty; // value contains the data from a json file
List<string> sampleData;
byte[] newdata;
public void convertdata()
{
    //sampleData = Newtonsoft.Json.JsonConvert.DeserializeObject<List<string>>(value);
    var jsonD = System.IO.File.ReadAllLines(@"json.txt");
    sampleData = Newtonsoft.Json.JsonConvert.DeserializeObject<List<string>>(jsonD[0]);
    Document document = new Document();
    using (var stream = new FileStream("test111.pdf", FileMode.Create, FileAccess.Write, FileShare.None))
    {
        PdfWriter.GetInstance(document, stream);
        document.Open();
        foreach (var item in sampleData)
        {
            newdata = Convert.FromBase64String(item);
            var image = iTextSharp.text.Image.GetInstance(newdata);
            document.Add(image);
        }
        document.Close();
        Console.WriteLine("Conversion done, check folder");
    }
}
But now I need to do the same without using a third-party library.
I have searched the internet, but I am unable to find anything that suggests a proper answer; all I find are suggestions to use "itextsharp", "PdfSharp" or the "GhostScriptApi".
Would someone suggest a possible solution?
This is doable but not practical in the sense that it would very likely take way too much time for you to implement. The general procedure is:
Open the image file format
Either copy the encoded bytes verbatim to a stream in a PDF document you have created or decode the image data and re-encode it in a PDF stream (whether it's the former or latter depends on the image format)
Save the PDF
This looks easy (it's only three points after all :-)) but when you start to investigate you'll see that it's very complicated.
First of all you need to understand enough of the PDF specification to write a new PDF file from scratch, doing all of the right things. The PDF specification is way over 1000 pages by now; you don't need all of it but you need to support a good portion of it to write a proper PDF document.
Secondly, you will need to understand every image file format you want to support. That by itself is not trivial (the TIFF file format, for example, is so broad that it's a nightmare to support a reasonable fraction of the TIFF files out there). In some cases you'll be able to simply copy the bulk of an image file into your PDF document (JPEG files fall into that category, for example); that's a case you want to support, because uncompressing the JPEG file and then recompressing it in a PDF stream will cause quality loss.
So... possible? Yes. Plausible? No. Unless you have gotten lots and lots of time to complete this project.
The structure of the simplest PDF document with one single page and one single image is the following:
- pdf header
- pdf document catalog
- pages info
- image
- image header
- image data
- page
- reference to image
- list of references to objects inside pdf document
Check this Python code, which performs the following steps to convert an image to PDF:
Writes the PDF header;
Checks the image data to decide which filter to use. It's best to pick a single filter, such as the FlateDecode codec (used by PDF to compress images losslessly);
Writes the "catalog" object, which is basically an array of references to page objects;
Writes the image object header;
Writes the image data (pixel by pixel, converted to the given codec format) as a "stream" object in the PDF;
Writes the "page" object which contains the "image" object;
Writes the "trailer" section with the set of references to the objects inside the PDF and their starting offsets. The PDF format stores the references to objects at the end of the document.
I would write my own ASP.NET Web Service or Web API service and call it within the app :)
An unorthodox question for sure. :-) During development testing, I'd prefer to create uncompressed, non-binary PDF files with iTextSharp so that I can check their internals easily. I can always convert the finished PDF with various utilities but that takes an extra step that would be comfortable to avoid. There are some places (eg. PdfImage) where I can decide about compression level but I couldn't find anything about the compression used to output the individual PDF objects into the stream. Do you think this is possible with iTextSharp?
You have two options:
[1.] Disable compression for all your documents at once:
Please take a look at the Document class. You'll find:
public static bool Compress = true;
If you change this static bool to false, PDF syntax streams won't be compressed.
Of course, images are de facto compressed. For instance: if you add a JPEG to your document, you are adding an image that is already compressed, and iText won't uncompress it.
[2.] Disable compression for a single document at a time:
Please take a look at the PdfWriter class. You'll find:
protected internal int compressionLevel = PdfStream.DEFAULT_COMPRESSION;
If you change the value of compressionLevel to PdfStream.NO_COMPRESSION before opening the document, your PDF syntax streams won't be compressed.
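For completeness, here is what both options look like in iTextSharp; a rough sketch, assuming the C# port exposes the compression level through PdfWriter's CompressionLevel property, with a placeholder file name:
// Option 1: disable compression globally via the static field on Document
Document.Compress = false;

// Option 2: disable compression for this document only (set before Open)
Document document = new Document();
PdfWriter writer = PdfWriter.GetInstance(document,
    new FileStream("uncompressed.pdf", FileMode.Create));
writer.CompressionLevel = PdfStream.NO_COMPRESSION;
document.Open();
document.Add(new Paragraph("PDF syntax streams stay readable"));
document.Close();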
While extracting text from a PDF file using iTextSharp with the code below, I am getting this error: "Could not find image data or EI". While debugging I found that the error occurs on certain pages, not all of them. Investigating further, I found that there are generally two types of image in a PDF, XObject images and inline images, and that the code below cannot handle inline images. A few comments on other similar posts suggested using the latest version (5.5.0) of iTextSharp; I did that too, but no luck. My basic purpose is to extract the text on the page, not the images. How can I handle the inline images, or how can I extract only the text regardless of what type of image the page contains?
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    PdfContentByte pdfData = pdfStamper.GetUnderContent(page);
    LocationTextExtractionStrategy its = new LocationTextExtractionStrategy();
    string extractedTextInCurrentPage = PdfTextExtractor.GetTextFromPage(pdfReader, page, its); // the exception is thrown on this line
}
Please share your PDF.
This is why:
Your PDF contains an inline image. Inline images are problematic in ISO-32000-1, but I personally saw to it that the problem will be solved in ISO-32000-2 (for PDF 2.0, to be expected in 2017).
In ISO-32000-1, an inline image starts with the BI operator, followed by some parameters. The length of the image bytes isn't one of those parameters. The actual image bytes are enclosed by an ID and an EI operator.
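For illustration, a tiny inline image in a content stream looks something like this (a hypothetical 2x2 RGB image; the 12 bytes of raw pixel data are not shown):
BI
/W 2 /H 2 /BPC 8 /CS /RGB
ID
<12 bytes of raw pixel data>
EI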
Software parsing PDF syntax needs to search for these operators and usually does a good job at it: find BI, then take the bytes between ID and EI. However: what to do when you encounter an image of which EI is part of the image bytes?
This hardly ever happens, but it was reported to us as a problem and we solved this in recent iText versions by converting the bytes between ID and EI to an image. If that fails, iText continues searching for the next EI. If iText doesn't find that EI parameter, you get the exception you mention.
This is a cumbersome process and, being a member of the ISO committee that writes the PDF standards, I introduced a new inline image parameter into the spec: the /L parameter informs parsers exactly how many bytes to expect between the ID and EI operators. At the same time, I saw to it that the recommendation to keep inline images smaller than 4 KB became normative: in PDF 2.0, it will be illegal to have inline images with more than 4096 bytes. Of course, this doesn't help you: PDF 2.0 doesn't exist yet. My work in the ISO committee only helps to solve the problem in the long term.
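With that parameter, the same hypothetical image would announce its own length (a sketch of the PDF 2.0 syntax):
BI
/W 2 /H 2 /BPC 8 /CS /RGB /L 12
ID
<12 bytes of raw pixel data>
EI
A parser can then skip straight to EI instead of guessing where the image data ends.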
In the short term, we've written a work-around that solves the problem for the PDFs that were reported to us, but apparently you've found a PDF that escapes the workaround. If you want us to solve the problem, you'll have to share the PDF.
I have a Win32 application that reads PDFs using iTextSharp which inserts an image into the document as a seal.
It works fine with 99% of the files we are processing over a year, but these days some files just don't read.
When I execute the code below:
string inputfile = @"C:\test.pdf";
PdfReader reader = new PdfReader(inputfile);
It gives the exception:
System.NullReferenceException occurred
Message="Object reference not set to an instance of an object."
Source="itextsharp"
StackTrace:
at iTextSharp.text.pdf.PdfReader.ReadPages()
at iTextSharp.text.pdf.PdfReader.ReadPdf()
at iTextSharp.text.pdf.PdfReader..ctor(String filename, Byte[] ownerPassword)
at iTextSharp.text.pdf.PdfReader..ctor(String filename)
at MyApp.insertSeal() in C:\MyApp\Stamper.cs:line 659
The PDF files that throw this exception can be read normally by Adobe Reader, and when I open one of these files with Acrobat and save it, my application can read the saved file.
Are the files corrupted but still openable with Adobe Reader?
I am sharing with you two samples of files.
A file that NOT work : Not-Ok-Version.pdf
And a file that works, after a opened and saved it with Acrobat. Download it here OK-Version.pdf
Here's the source for ReadPages:
protected internal void ReadPages() {
catalog = trailer.GetAsDict(PdfName.ROOT);
rootPages = catalog.GetAsDict(PdfName.PAGES);
pageRefs = new PageRefs(this);
}
trailer, catalog, rootPages, and pageRefs are all member variables of PdfReader.
If the trailer or root/catalog object of a PDF are simply missing, your PDF is REALLY BADLY BROKEN. It's more likely that the xref table is a bit off, and the objects in question simply aren't exactly where they're supposed to be (which is Bad, but recoverable).
HOWEVER, when PdfReader first opens a PDF, it parses ALL the objects in the file, and converts them to the appropriate PdfObject-derived classes.
What it isn't doing is checking to see that the object number claimed by the xref table and the object number read in from the file Actually Match. Highly Unlikely, but possible. Bad software could write out their PDF objects in the wrong order but keep the byte offsets in the xref table correct. Software that overrode the object number from the xref table with the number from that particular byte offset in the file would be fine.
iText is not fine.
I still want to see the PDF.
Yep. That PDF is broken alright. Specifically:
The file's first 70kb or so define a pretty clean little PDF. Changes were then appended to the PDF.
Check that. Someone attempted to append changes to the PDF and failed. Badly. To understand just how badly, let me explain some of the internal syntax of a PDF, illustrated with this example:
%PDF-1.6
1 0 obj
<</Type/SomeObject ...>>
endobj
2 0 obj
<</Type/SomeOtherObj /Ref 1 0 R>>
endobj
3 0 obj
...
endobj
<etc>
xref
0 10
0000000000 65535 f
0000000010 00000 n
0000000049 00000 n
0000000098 00000 n
...
trailer
<</Root 4 0 R /Size 10>>
startxref 124
%%EOF
So we have a header/version ("%PDF-1.v"), a list of objects (the ones here are called dictionaries), a cross (x) reference table listing the byte offsets of all the objects in the list, and a trailer giving the root object and the number of objects in the PDF, plus the byte offset of the 'x' in 'xref'.
You can append changes to an existing PDF. To do so you just add any new or changed objects after the existing %%EOF, a cross reference table to those new objects, and a trailer. The trailer of an appended change should include a /Prev key with the byte offset to the previous cross reference table.
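For example, an appended change that replaces object 2 of the PDF above could look like this (the offsets are made up):
2 0 obj
<</Type/SomeOtherObj /Ref 1 0 R /SomeNewKey true>>
endobj
xref
0 1
0000000000 65535 f
2 1
0000000700 00000 n
trailer
<</Root 4 0 R /Size 10 /Prev 124>>
startxref 770
%%EOF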
In your NOT-OKAY pdf, someone tried to append changes to a PDF, AND FAILED HORRIBLY.
The original PDF is still there, intact. That's what Reader shows you, and what you get when you save the PDF. I hacked off everything after the first %%EOF in a hex editor, and the file was fine.
So here's the layout of your NOT-OKAY pdf:
%PDF1.4.1
1 0 obj...
2 through 7
xref
0 7
<healthy xref>
trailer <</Size 8 /Root 6 0 R /Info 7 0 R>>
startxref 68308
%%EOF
So far so good. Here's where things get ugly
<binary garbage>
endstream
endobj
xref
0 7
<horribly wrong xref>
trailer <</ID [...] /Info 1 0 R /Root 2 0 R /Size 7>>
startxref 223022
%%EOF
The only thing RIGHT about that section is the startxref value.
Problems:
The second trailer has no /Prev key.
ALL the byte offsets in the second xref table are wrong.
The binary garbage is part of a "stream" object, but the beginning of that object IS MISSING. Streams should look something like this:
1 0 obj
<</Type/SomeType/Length 123>>
stream
123 bytes of data
endstream
endobj
The end of this file is made up of some portion of a (compressed, I'd imagine) stream... but without the dictionary at the beginning telling us what filters it's using and how long it is (to say nothing of any missing data), you can't do anything with it.
I suspect that someone tried to completely rebuild this PDF, then accidentally wrote the original 70kb over the beginning of their version. Kaboom.
It would appear that Adobe is simply ignoring the bad appended changes. iText could do this too, but so can you:
When iText fails to open a PDF:
1. Search backwards through the file looking for the second to last %%EOF. Ignore the one at the very end, we want the previous state of the file.
2. Delete everything after the 2nd-to-last %%EOF (if any), and try to open it again.
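A rough C# sketch of that recovery idea (untested; it assumes the file fits in memory, uses the usual System, System.IO and System.Text imports, and the helper below is ad hoc):
byte[] data = File.ReadAllBytes("broken.pdf");
int last = FindLast(data, "%%EOF", data.Length);
int previous = last < 0 ? -1 : FindLast(data, "%%EOF", last);
if (previous >= 0)
{
    // Keep everything up to and including the 2nd-to-last %%EOF.
    byte[] trimmed = new byte[previous + "%%EOF".Length];
    Array.Copy(data, trimmed, trimmed.Length);
    File.WriteAllBytes("recovered.pdf", trimmed);
}

// Searches backwards for an ASCII pattern, starting just before 'before'.
static int FindLast(byte[] data, string ascii, int before)
{
    byte[] p = Encoding.ASCII.GetBytes(ascii);
    for (int i = Math.Min(before, data.Length) - p.Length; i >= 0; i--)
    {
        int j = 0;
        while (j < p.Length && data[i + j] == p[j]) j++;
        if (j == p.Length) return i;
    }
    return -1;
}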
The sad thing is that this broken PDF could have been completely different from the "original" 70kb, and then some IO error overwrote the first part of the file. Unlikely, but there's no way to be sure.
Considering that they are now up to version 5.0, my guess would be that you are seeing increasing numbers of PDFs written to PDF version specs that your version of iTextSharp does not support. It may be time to do an upgrade.
Maybe this will help someone...
I had code that worked for years that started hanging on reading the bookmarks from a PDF file (outlines variable below). It turned out that it broke when the code was updated from .NET 4.0 to .NET 4.5.
As soon as I rolled it back to .NET 4.0, it worked again.
var raf = new iTextSharp.text.pdf.RandomAccessFileOrArray(sFile);
var reader1 = new iTextSharp.text.pdf.PdfReader(raf, null);
System.Collections.ArrayList outlines = iTextSharp.text.pdf.SimpleBookmark.GetBookmark(reader1); // hangs here under .NET 4.5
Just for notes, the same VS web application project uses AjaxControlToolkit (from NuGet). Before I rolled it back, I also updated iTextSharp to ver 5.5.5 and it still hung on the same line.
When I pull down the source and run it against the bad PDF there's an exception in ReadPdf() in the 4th try block when it calls ReadDocObj():
"Invalid object number. at file pointer 16"
tokens.StringValue is j
@Mark Storer, you're the iText guy so maybe that means something to you.
From a higher level, at least to my eyes, it seems that when RebuildXref() is called (which I assume happens when an invalid PDF is read) it rebuilds trailer but not catalog. The latter is what the NRE is complaining about. Then again, that's just a guess.
Also make sure your HTML doesn't contain an hr tag when converting HTML to PDF:
hdnEditorText.Value.Replace("\"", "'").Replace("<hr />", "").Replace("<hr/>", "")