iTextSharp PDF Reader Accuracy - C#

Has anyone here had experience with the accuracy of iTextSharp when reading text from a multi-page scanned PDF?
The thing is, I have tried to read a PDF with both the basic search function within Adobe Reader and with iTextSharp.
iTextSharp manages to find roughly 50% of the occurrences of a given word compared to (what I call) 100% by Adobe
[iTextSharp 1000 occ // Adobe Reader >2000]
Is this a known "problem"?
Edit: I should add: it has already been OCR'ed by the time I'm searching.

As @ChrisHaas already explained, it's hard to be specific without code and PDF samples.
First of all, saying iTextSharp manages to find roughly 50% of the occurrences of a given word is a bit misleading, as iText(Sharp) does not directly expose methods for finding a specific text in a PDF and, therefore, actually finds 0%. It merely provides a framework and some simple examples for text extraction.
Using this framework to seriously search for a given word requires more than applying those simple sample usages (provided by the SimpleTextExtractionStrategy and the LocationTextExtractionStrategy, which also work under the hood when you call PdfTextExtractor.GetTextFromPage(myReader, pageNum)) in combination with some Contains(word) call. You have to:
- create a better text extraction strategy which
  - has a better algorithm to recognize which glyphs belong to which line; e.g. the sample strategies can fail utterly for scanned pages whose OCR'ed text lines are not perfectly straight but slightly ascending;
  - recognizes poor man's bold (printing the same letter twice with a very small offset to create the impression of a bold character style) and similar constructs and transforms them accordingly;
- create a text normalization which
  - resolves ligatures;
  - unifies alternative glyphs of semantically identical or similar characters;
- normalize both the extracted text and your search term, and only then search.
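As a minimal sketch of the normalization step (plain .NET, no iTextSharp; assuming Unicode compatibility normalization covers the ligatures in your documents - NFKC folds e.g. the single glyph "ﬃ" into "ffi"; the helper names are my own illustration):

```csharp
using System;
using System.Text;

class NormalizedSearch
{
    // Normalize both haystack and needle before comparing, so that
    // e.g. the ligature glyph "ﬃ" (U+FB03) matches the letters "ffi".
    static string Normalize(string s) =>
        s.Normalize(NormalizationForm.FormKC).ToLowerInvariant();

    static bool ContainsWord(string extractedText, string word) =>
        Normalize(extractedText).Contains(Normalize(word));

    static void Main()
    {
        // "eﬃcient" uses the U+FB03 ligature; a raw Contains would miss it.
        Console.WriteLine(ContainsWord("An eﬃcient algorithm", "efficient")); // True
    }
}
```

Real documents will also need the glyph-unification step (e.g. mapping typographic quotes to ASCII quotes), which NFKC alone does not cover.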
Furthermore, as @ChrisHaas mentioned, special attention has to be paid to spaces in the text.
If you create an iText-based text search with those criteria in mind, you'll surely get an acceptable hit rate. Matching Adobe Reader is quite a task, though, as Adobe has already invested considerable resources in this feature.
For completeness' sake: you should search not only the page content (and everything referred to from it) but also the annotations, which can carry quite some text content of their own and may even appear to be part of the page, e.g. in the case of free text annotations.

Without knowing the specifics of your situation (the PDF in question, the code used, etc.) we can't help you too much.
However, I can tell you that iTextSharp is more of a literal text extractor. Since text in a PDF can be, and often is, non-contiguous and non-linear, iTextSharp takes any contiguous characters and builds what we think of as words and sentences. It also tries to combine characters that appear to be "pretty much on the same line" (such as text at a slight angle, as OCR'd text often is) and does the same. Then there are "spaces", which should be simple ASCII 32 characters but often aren't; iTextSharp goes the extra mile and attempts to calculate whether two text runs should be separated by spaces.
Adobe probably has further heuristics that are able to guess even more about text. My guess would be that they have a larger threshold for combining non-linear text.
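The literal extraction described above boils down to something like this (a sketch against the iTextSharp 5.x API; CountOccurrences and its naive IndexOf loop are my own illustration, not library code):

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class Extractor
{
    // Naive search: extract each page with the location-based strategy
    // and count raw occurrences of the word in the extracted text.
    // Misses ligatures, split runs, and odd space encodings entirely.
    static int CountOccurrences(string path, string word)
    {
        int count = 0;
        var reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string text = PdfTextExtractor.GetTextFromPage(
                    reader, page, new LocationTextExtractionStrategy());
                int index = 0;
                while ((index = text.IndexOf(word, index)) >= 0)
                {
                    count++;
                    index += word.Length;
                }
            }
        }
        finally { reader.Close(); }
        return count;
    }
}
```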

Related

How to resize a PDF generated by RDLC and iTextSharp

I have generated PDFs using RDLC and then combined multiple PDF files into a single document using iTextSharp's PdfSmartCopy class. But my PDF is large and I want to reduce its size. I have tried compressing it using iTextSharp, but that's unable to compress it. When I upload the PDF file to ilivepdf.com online for compression, it compresses the 21MB file to 1MB.
Often, the problem is related to embedded fonts.
You see, PDF really strives to preserve your document exactly as you made it.
To do that, a PDF library can decide to embed a font. You can imagine this as simply putting the font file into the PDF document.
But, here comes the tricky part.
The PDF specification took into account that this may be overkill.
I mean, if you are only using the 50-something characters typically used in Western languages, it makes little sense to embed the entire font.
So PDF supports a feature called "font subsetting". This means, instead of embedding the entire font, only those characters that are actually used are embedded in the document.
So what is going wrong exactly when you're merging these documents?
(I will skip a lot of the technical details.)
In order to differentiate between a fully embedded font, system font, or subset embedded font, iText generates a new font name for your fonts whenever it embeds them.
So a document containing a subset of Times New Roman might have "Times-AUHFDI" in its resources.
Similarly, a second document (again containing a subset of Times New Roman) might list "Times-VHUIEF" as one of its resources.
I believe it simply adds a random 6-character suffix. (ex-iText developer here)
PdfSmartCopy has to decide what to do with these resources. And sadly, it doesn't know whether these fonts are actually the same. So it decides to embed both these subsets into the new document.
This is a huge file-size penalty.
If you have 100 documents, all using a subset of the same font, that subset will be embedded 100 times.
The other tool you listed might actually check whether these fonts are the same (and, if they are, embed them only once). Or it might simply not care that much and assume, based on the partial name match, that they are the same.
The ideal solution would of course be to compare the actual characters in the font, to see whether these two subsets can be merged.
But that would be much more difficult (and might potentially be a performance penalty).
What can you do?
There are 14 standard fonts that are never embedded. They are assumed to be present on every system (hence why they are never embedded).
If you have control over the process that generates the PDF documents, you could simply decide to create them using only these fonts.
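A minimal sketch of that approach with iTextSharp (5.x API names; assuming Helvetica suits your layout):

```csharp
using iTextSharp.text;
using iTextSharp.text.pdf;

class StandardFontOnly
{
    // Helvetica is one of the standard fonts, so nothing gets embedded:
    static Font CreateNonEmbeddedFont()
    {
        BaseFont helvetica = BaseFont.CreateFont(
            BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
        return new Font(helvetica, 12);
    }
    // Use this font for all text when generating the source documents;
    // the merged output then carries no duplicated font subsets.
}
```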
Alternatively you could write a smarter PdfSmartCopy. You would need to look into how fonts are built and stored, and perform the actual comparison I mentioned earlier.
Ask for technical support at iText. If enough people request this particular feature, you may get it.

Drawing Shapes on a PDF with C# WinForm

I have been searching for help with this for a long time. If it's been answered here already, I cannot find it. I'm using C# on a Windows Form.
I'm trying to create a simple program that allows me to open a PDF, flatten any layers within it, and then, for each click of the mouse, draw a circle.
Centered within each circle I need a number, beginning with "1" and increasing sequentially without limit (could be 1, could be 15000).
Finally, I need to be able to save, and print the final result.
There are other things I need to add, but if someone can get me started with this, I should be able to figure out the rest on my own.
I've been able to import the .pdf. However, any tutorial I've found for creating a transparent layer on which to draw never allows me to see the PDF behind it. Do I even need this transparent layer, or can I draw directly on the PDF? My second biggest issue is figuring out how to create the circle, with the chronologically increasing number, anywhere I choose to click my mouse.
Thanks in advance for any help.
Please see the image below for what it should look like.
You can do that with Atalasoft's DotImage/DotPdf/DotAnnotate packages (disclaimer: I used to work on these products up until 3 years ago). There are a number of ways to do this. If you don't mind the markup being an annotation, you can make an annotation with a custom appearance and add each one to the document.
If you care that the numbers get added to the document, you can use DotPdf directly to append content to the content stream of any page.
If you want to do this on your own, good luck (seriously - this is not an easy problem to solve).
Here's what you need to be able to do first (at a minimum) for putting new content into an existing page:
First, let's talk about the PDF rendering model:
PDF uses a little non-Turing-complete RPN language to place content on the page. A given page has one or more streams of code which are executed in order to render the page. If you want to add content on top, you need to render it last (there are other, more complicated ways to do this, but this is good enough). That means you either append to the existing content stream (which I wouldn't recommend), or you take advantage of the fact that a PDF page can have any number of content streams on it. You make a new stream and add the code to render the content that you want.
I'll warn you ahead of time that rendering text is non-trivial, especially if you have to embed fonts or use unicode encoding. I'll also warn you that there is no "circle" primitive in PDF. You have to approximate it with Bezier curves.
I'm studious in my laziness, so I created abstractions to make it easier to correctly create content streams. For example, I made a class called a drawing surface that I could tell "set the drawing style to this", "place a 'shape' here", "draw text here", and so on. When told to render, it would generate the matching PDF rendering program. On top of this, I had another abstraction consisting of a drawing list of higher-level objects; the drawing list, when rendered, would write to the drawing surface, which would in turn write PDF.
- Append changes (generations) to a PDF
- Create content streams
- Append a replacement page object for an existing page with the Contents changed from a stream to an array (or, if it's an array already, append to it) and with new resources added to the Resources dictionary
Here's what you need to be able to do first (at a minimum) for putting annotations on an existing page:
- Append changes (generations) to a PDF
- Append a replacement page object for an existing page with the Annots entry extended with the new annotations (or modified by inserting/appending them)
- Create appearance streams
- Create annotations using the custom appearance streams
As far as the UI goes, that's oddly straightforward as long as you have a PDF renderer. Render a page into an image and make a control that gives you a mouse click onto the page. Then build a transformation matrix that goes from image coordinates to PDF page coordinates and push the mouse coordinates through that matrix. The result will be the origin of your markup on the page (be aware that some pages are rotated, and you will need to adjust your transformation matrix to match).
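The coordinate mapping in the simplest case (unrotated page, image showing the full page, no zoom) is just a scale plus a Y flip. A sketch with hypothetical names; real pages may be rotated or have a non-zero CropBox origin, which this ignores:

```csharp
using System;

class PdfCoords
{
    // Map a mouse click in image pixels (origin top-left, y down)
    // to PDF user space (origin bottom-left, y up). Assumes the page
    // is unrotated and the image shows the full page.
    static (double x, double y) ImageToPage(
        double imgX, double imgY,
        double imgWidth, double imgHeight,
        double pageWidth, double pageHeight)
    {
        double x = imgX * pageWidth / imgWidth;
        double y = pageHeight - imgY * pageHeight / imgHeight;
        return (x, y);
    }

    static void Main()
    {
        // A click at the top-left pixel maps to the page's top-left
        // corner in PDF space: (0, pageHeight) on a US Letter page.
        var p = ImageToPage(0, 0, 800, 1035, 612, 792);
        Console.WriteLine($"{p.x}, {p.y}"); // 0, 792
    }
}
```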
Now, to be clear, when I wrote this library for Atalasoft, I already had several years of PDF experience (I worked on Acrobat v 1 - 4). While I wasn't working on the library full time, it was written over the span of 10 years. The code to append to an existing PDF took several months of time to get right. Eventually, I shed that code because of complications in appending anything but simple changes (like annotations) and wrote code that could rewrite an entire PDF with updates to existing content (page reordering, annotations, bookmarks, new content on a page, edited images, etc), while simultaneously shedding anything that is no longer needed. This is akin to adding in and clipping out sections of a directed graph with cycles and being able to ensure that you have a correct graph on the other end.
The hard part wasn't working within the specification - that's fairly straightforward for me. The problem was dealing with cockamamie PDF generated by other tools that had all kinds of bizarro spec violations and handling that correctly.
Now, I'm not saying don't do this. I'm all for people learning new things and learning about PDF. There's a lot there to learn and a lot of interesting ideas, but you need to be aware that simple-sounding problems in PDF space aren't trivial unless you have a great deal of infrastructure in place. For example, "how many pages are in a PDF?" requires a PDF scanner and code to execute or parse PDF content so you can read in the cross-reference table (which may be a compressed cross-reference stream), the document dictionary (which may include encryption), and finally the page tree: all of which can be easily derailed by non-compliant PDF.
If you're trying to balance time and cost, remember that your time is far from free, and maybe a library to do the heavy lifting is not a bad thing. iTextPdf is open source and can do all of the things you need to do; it will cost you time to learn the library, but that's a huge savings over having to write PDF tools on your own. Atalasoft's code is not free, but was written to have a much shallower learning curve than most libraries.
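For a feel of the iText(Sharp) route, here is a sketch that stamps one numbered circle onto an existing page with PdfStamper (iTextSharp 5.x API; the file paths are placeholders and error handling is omitted):

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

class CircleStamper
{
    // Stamp a numbered circle at (x, y) in page coordinates on page 1.
    static void StampCircle(string input, string output,
                            float x, float y, int number)
    {
        var reader = new PdfReader(input);
        using (var fs = new FileStream(output, FileMode.Create))
        {
            var stamper = new PdfStamper(reader, fs);
            PdfContentByte cb = stamper.GetOverContent(1); // draw on top
            cb.SetRGBColorStroke(255, 0, 0);
            cb.Circle(x, y, 10f); // iTextSharp emits the Bezier approximation
            cb.Stroke();
            // Center the number in the circle, using a non-embedded
            // standard font so nothing extra lands in the file.
            BaseFont bf = BaseFont.CreateFont(
                BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
            ColumnText.ShowTextAligned(cb, Element.ALIGN_CENTER,
                new Phrase(number.ToString(), new Font(bf, 8)), x, y - 3, 0);
            stamper.Close();
        }
        reader.Close();
    }
}
```

Saving and printing then reduce to handing the output file to the OS or a viewer; the hard part, as described above, is everything the stamper hides from you.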

Converting pdf to text

I need to create a C# or C++ (MFC) application that converts PDF files to txt. I need not only to convert, but also to remove headers, footers, some garbage characters in the left margin, etc. Thus the application should allow the user to set page margins to cut off what is not needed. I have actually already created such an application using xpdf, but it gives me some problems when I try to insert custom tags into the extracted text to preserve italics and bold. Maybe somebody could suggest something useful?
Thanks.
There are shareware and freeware utilities out there. Try fetching their source code, or perhaps use them the way they are.
A public version of the PDF specification can be found here: Adobe PDF Specification
PDF Shareware readers can be found: PDF Reader source code @ SourceForge
Please look at PoDoFo. It's an LGPL-licensed library with many powerful editing features. One of its examples, txt2pdf IIRC, is a good start: it shows basic text extraction. From there you can check whether pre-filtering (in the PDF engine) or post-filtering (on the text) suffices for your goals. I didn't get to use Pdf Hummus, but it's supposed to have these capabilities too, although it's less straightforward.

How to verify that pdf is text based using ITextSharp?

I need to verify that the pdf report is text based (and not bitmap based; however it could contain some images). I do not need to extract the text, just to verify that it is text based.
Is there a way how to perform such a verification using ITextSharp library?
Thanks in advance,
Stefan
You can look for text drawing commands easily enough. The least work on your part would be to try to extract the text and see if anything is there. Ideally you'd know some of the text it should contain and search for it. A single sentence or phrase would be plenty for this sort of testing.
Text extraction with iText is pretty trivial these days. Lots of examples floating around SO, and the web.
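A minimal sketch of that check (iTextSharp 5.x API; the whitespace test is my own heuristic, and note that a scanned page with OCR'd hidden text will also count as text-based):

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class TextBasedCheck
{
    // Heuristic: the PDF is "text based" if any page yields
    // non-blank extracted text.
    static bool HasExtractableText(string path)
    {
        var reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string text = PdfTextExtractor.GetTextFromPage(reader, page);
                if (!string.IsNullOrWhiteSpace(text))
                    return true;
            }
            return false;
        }
        finally { reader.Close(); }
    }
}
```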

Is there a way to replace a text in a PDF file with itextsharp?

I'm using itextsharp to generate the PDFs, but I need to change some text dynamically.
I know that it's possible to change it if there are AcroFields, but my PDF doesn't have any. It just has some pure text, and I need to change some of it.
Does anyone know how to do it?
Actually, I have a blog post on how to do it! But as IanGilham said, it depends on whether you have control over the original PDF. The basic idea is that you set up a form on the page and replace the form fields with the text you want. (You can style the form so it doesn't look like a form.)
If you don't have control over the PDF, I don't know of a good way to do it.
Here is a link to the full post:
Using a template to programmatically create PDFs with C# and iTextSharp
I haven't used itextsharp, but I have been using PDFNet SDK to explore the content of a large pile of PDFs for localisation over the last few weeks.
I would say that what you require is absolutely achievable, but how difficult it is will depend entirely on how much control you have over the quality of the files. In my case, the files can be constructed from any combination of images, text in any random order, tables, forms, paths, single pixel graphics and scanned pages, some of which are composed from hundreds of smaller images. Let's just say we're having fun with it.
In the PDFTron way of doing things, you would have to implement a viewer (sample available), and add some code over a text selection. Given the complexities of the format, it may be necessary to implement a simple editor in a secondary dialog with the ability to expand the selection to the next line (or whatever other fundamental object is used to make up text). The string could then be edited and applied by copying the entire page of the document into a new page, replacing the selected elements with your new string. You would probably have to do some mathematics to get this to work well though, as just about everything in PDF is located on the page by means of an affine transform.
Good luck. I'm sure there are people on here with some experience of itextsharp and PDF in general.
This question comes up from time to time on the mailing list. The same answer is given time and time again - NO. See this thread for the official answer from the person who created iText.
This question should be a FAQ on the itextsharp tag wiki.
