How to improve PDF quality before OCR using C#

I'm creating a service that monitors a folder for scanned files. Once a file arrives, the service picks it up and converts it to a readable PDF. During this process the service also searches for a barcode. After that, the text is extracted and the file, together with its text, is stored in the database of our software. The storage location is based on the barcode.
For the OCR we are using the Atalasoft SDK (http://www.atalasoft.com/).
The barcode recognizer is also included in this SDK.
But the converted text still contains some mistakes. (I ran some tests with other OCR programs, and Atalasoft came out well.)
I'm looking for software (an SDK) that lets me improve the quality of the PDF for OCR purposes.
I tested Kofax VRS Elite (http://www.kofax.com/vrs-virtualrescan/). I'm looking for something similar that can be integrated into the service through an SDK.
Has anyone done this before, or had similar problems?
Thanks in advance!

You may try and follow a different path altogether:
See if you can configure the scanner(s) to scan directly to PDF and do the OCR on the fly. The Lexmark scanners can do this. This creates PDFs with selectable and searchable text, which can then be extracted with a PDF reading library.
Alternatively you may want to have a look at http://www.abbyy.com/ and see if you get better results.
If these are not good options, you may want to break down your problem in a systematic way:
1. Is the image quality of the scanned images the problem? If so, then this will have to be fixed first. Your OCR solution may be affected by resolution, contrast, and colour.
2. Is it the OCR software? Take a highly legible document and see if the OCR software makes mistakes. If so, then you know you have to find better OCR software.
3. If your document quality is decent and your OCR software has a high success rate in deciphering a legible document, then you may want to look at the exceptions that do not work, and tackle these on a case by case basis.
If smears and background images on the documents are the cause of the problem, you may want to look into ways of avoiding them, or of cleaning them up with image processing software that exposes an API.
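As an illustration of the kind of cleanup step meant here, the sketch below binarizes a grayscale scan with Otsu's method. This is a technique swapped in for illustration, not something Atalasoft, Kofax, or ABBYY are confirmed to use. It assumes you already have the page as an array of 8-bit grayscale pixels; the class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper: names are illustrative, not part of any SDK mentioned above.
public static class ScanCleanup
{
    // Returns the Otsu threshold (0..255) for a buffer of 8-bit grayscale pixels.
    public static int OtsuThreshold(byte[] gray)
    {
        if (gray == null || gray.Length == 0)
            throw new ArgumentException("No pixels supplied.", nameof(gray));

        var histogram = new long[256];
        foreach (byte p in gray) histogram[p]++;

        long total = gray.Length;
        double sumAll = 0;
        for (int i = 0; i < 256; i++) sumAll += i * (double)histogram[i];

        double sumBackground = 0;
        long weightBackground = 0;
        double bestVariance = -1;
        int bestThreshold = 0;

        for (int t = 0; t < 256; t++)
        {
            weightBackground += histogram[t];
            if (weightBackground == 0) continue;          // no dark pixels yet
            long weightForeground = total - weightBackground;
            if (weightForeground == 0) break;             // every pixel is already background

            sumBackground += t * (double)histogram[t];
            double meanBackground = sumBackground / weightBackground;
            double meanForeground = (sumAll - sumBackground) / weightForeground;
            double diff = meanBackground - meanForeground;

            // Between-class variance (up to a constant factor); keep the first maximum.
            double variance = (double)weightBackground * weightForeground * diff * diff;
            if (variance > bestVariance)
            {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }

    // Pixels brighter than the threshold become white (255), the rest black (0).
    public static byte[] Binarize(byte[] gray, int threshold)
    {
        var result = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
            result[i] = gray[i] > threshold ? (byte)255 : (byte)0;
        return result;
    }
}
```

A global threshold like this tends to help with faint, even backgrounds. Uneven lighting or heavy smears usually need a local (adaptive) threshold instead, so treat it as a starting point for experiments.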

Related

Convert High Resolution PDF to Low Resolution PDF file C#

I'm looking for a way to convert a high-resolution PDF file to a low-resolution PDF file from an ASP.NET application (C#).
Users will import high-resolution PDFs, and the solution should then be able to provide both the high-resolution PDF and a low-resolution PDF.
I'm looking for an API to do that. I have found a lot of PDF APIs, but none of them seems to do what I'm looking for.

ABCpdf .NET will do this for you. There are a variety of functions for resizing, resampling or recompressing the images within a PDF document. However given your requirements you probably just want to use the document reduce size operation.
To do this you just need code of the following form:
Doc doc = new Doc();
doc.Read(Server.MapPath("../mypics/sample.pdf"));
using (ReduceSizeOperation op = new ReduceSizeOperation(doc)) {
    op.UnembedSimpleFonts = false; // though of course making these true...
    op.UnembedComplexFonts = false; // ... would further reduce file size.
    op.MonochromeImageDpi = 72;
    op.GrayImageDpi = 72;
    op.ColorImageDpi = 144;
    op.Compact(true);
}
doc.Save(Server.MapPath("ReduceSizeOperation.pdf"));
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)

There are a number of possible approaches to this problem, one of which would simply be to export from InDesign twice (which would give you the two required versions of the PDF). If that is not feasible (which it might not be, as exporting from InDesign can take a bit of time), there are definitely libraries on the market that can do what you want. Here too you have a number of approaches:
1) While this will get me shot by most Adobe employees, you could re-distill your PDF file into a smaller file. I would not advocate doing this; I'm mentioning it to be complete. The risks include quality loss, artefacts and so on (mostly because PostScript doesn't support a number of features that PDF does, and by redistilling you'd lose those features).
2) Use an application or library that was made for this task. callas pdfToolbox, for example (warning, I'm affiliated with this company!), is capable of taking a PDF file and running corrections on it. Among those corrections are image downsampling, conversion to RGB, image re-compression (for example with JPEG 2000), throwing away unnecessary data in the PDF and much more. Given that the application has a command-line interface, it can easily be integrated into a C# process. Other applications with similar functionality are available from companies such as Enfocus, Apago and others.
3) Use a low-level PDF library (such as the Adobe PDF Library, which can be licensed from Adobe through Datalogics) and implement the necessary changes yourself. More work, but potentially more flexible.
Whatever approach you choose, keep in mind that you are starting from a high-quality PDF file and that your process should try to retain as much of that quality as possible (depending, of course, on what the low-resolution PDF file will be used for). Make sure you don't get into trouble by losing proper overprint, transparency, etc.

Probably you can just resize images in high resolution version PDF and this will give you much smaller files.
You can resize and/or recompress images using Docotic.Pdf library. For more details please take a look at my answer to a similar question.
Disclaimer: I am one of the developers of Docotic.Pdf library.

OCR engine to capture characters from images

I'm using the C# tessnet2 wrapper for the Tesseract OCR engine to capture characters from image files. I've been searching everywhere to see whether tessnet2 has any built-in functions to overwrite certain characters and save them back into the same image file it's reading, but I have not found anything. So what I'm thinking of doing is creating a new image file based on what I receive from tessnet2, but I need the new image to look exactly the same, with just a few things changed. I'm not sure whether I'm using the correct methodology, or whether there are other C# assemblies out there that let you read characters from an image file and also manipulate them as needed.

Good luck--but tess has no way of replacing in the proper font. Raster graphics don't generally store glyph information. Even if it did, you would potentially be in violation of licenses and/or copyrights surrounding the fonts you'd be writing in. I'm not an expert in OCR, but I will confidently say that this is something not readily available out there in the wild.

To expand on Brian's answer:
You will need to do this yourself. I have not worked with Tesseract, but I have used the Nuance OCR engine. It will return you font information as well as coordinates for the character it has recognized (note that you will most likely have to compute the actual image coordinate as the OCR engine will have deskewed the image before performing the recognition). Once you get the coordinates and the deskew so that you can compute the actual coordinate, you can then use any image manipulation library (Leadtools, Accusoft, etc) or just straight GDI+ functions to clear the character, then using the font info and size info create a new character and merge it into the image. This is not trivial but certainly doable.
Edit:
It was late when I wrote the initial answer, and I wanted to clarify what is meant by font information. The OCR engine will give you information regarding the point size, whether it's bold or italic, and the font family (serif, etc.). I do not know of one that will tell you the exact font that the document is in. If you have a sample of the documents that you will process, then you can make a good guess based on the info the OCR engine gives you.
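The coordinate step mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions: the engine deskewed the page by a pure rotation about a known center and kept the canvas size. Real engines may also crop or scale, and which visual direction counts as a positive angle depends on the engine. The class and method names are hypothetical and not part of Nuance or Tesseract.

```csharp
using System;

// Hypothetical helper for mapping OCR coordinates back onto the original scan.
public static class DeskewMapping
{
    // Rotates (x, y) about (centerX, centerY) by angleDegrees using the standard
    // rotation formula: x' = cx + dx*cos - dy*sin, y' = cy + dx*sin + dy*cos.
    public static (double X, double Y) Rotate(
        double x, double y, double centerX, double centerY, double angleDegrees)
    {
        double radians = angleDegrees * Math.PI / 180.0;
        double cos = Math.Cos(radians);
        double sin = Math.Sin(radians);
        double dx = x - centerX;
        double dy = y - centerY;
        return (centerX + dx * cos - dy * sin, centerY + dx * sin + dy * cos);
    }

    // Assumption: the deskewed image equals the original rotated by deskewAngleDegrees
    // via Rotate, so undoing it is a rotation by the negated angle.
    public static (double X, double Y) DeskewedToOriginal(
        double x, double y, double centerX, double centerY, double deskewAngleDegrees)
    {
        return Rotate(x, y, centerX, centerY, -deskewAngleDegrees);
    }
}
```

With the original coordinates in hand, the character's bounding box can be cleared and redrawn with GDI+ or an imaging library, as described in the answer.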

C# solution for rendering PDFs and OCRing the resulting images? [closed]

I'm looking for a C# solution to import data from PDF documents into our database, in a commercial application. Our customers will be looking to import any arbitrary document. Ordinarily I'd write this off as a complete impossibility, but the documents they're importing will be in their own set layout.
My plan is to have the PDFs rendered to static images, then allow the users to set up their own templates, which essentially pull out text at predefined pixel-offsets in the PDF, using OCR. For tables, they define a location of the table and a bunch of further values for column and row sizes. We can then apply the template onto that document type.
So, what I'm really looking for is two libraries: one to convert PDFs to images, another to OCR those images.
Requirements:
Is pure-C# or has a supported C# wrapper onto a native DLL.
Doesn't fork out processes - wrappers that essentially just create command line parameters and launch an external executable aren't allowed in this case.
In the case of FOSS, allows us to exempt ourselves from normal FOSS license requirements (i.e. publishing our source code) by paying a license fee.
We certainly don't mind paying for a commercial solution, but we'd rather not get stuck with paying a fee per individual distribution of the software.
I know this is quite a specific requirement set - perhaps enough for some people to deem this question too localised, but I'm hoping that someone can suggest an approach and some libraries that can be helpful to me, as well as others in the future.
Stuff I've looked into for the PDF side:
iTextSharp - Documentation is a book you have to buy, not a good start. Doesn't seem to be much useful documentation regarding turning PDFs into images in the public domain. Licensing is opaque, looks like we have to pay per client we distribute to.
Docotic.Pdf - Text only, no use to us.
pdftohtml - Again, doesn't produce images. Would be a mess to port to C# too.
PdfFileParser - Still not what we need.
GhostScript - Pretty much exactly what we want, but requires forking out to a program.
For the OCR side, I'll probably end up using Tesseract, since the Apache license is permissive and it's got good reviews. If there's an alternative, I'd be interested in that too.
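To make the table part of the template idea concrete, here is a rough sketch of how a user-defined table (origin plus column widths and row heights, all in pixels of the rendered page) could be turned into per-cell rectangles to hand to an OCR engine. The type and member names are made up for illustration and are not from any library listed above.

```csharp
using System;

// Hypothetical template type: a table is an origin plus column widths and row heights.
public sealed class TableTemplate
{
    private readonly int left;
    private readonly int top;
    private readonly int[] columnWidths;
    private readonly int[] rowHeights;

    public TableTemplate(int left, int top, int[] columnWidths, int[] rowHeights)
    {
        this.left = left;
        this.top = top;
        this.columnWidths = columnWidths ?? throw new ArgumentNullException(nameof(columnWidths));
        this.rowHeights = rowHeights ?? throw new ArgumentNullException(nameof(rowHeights));
    }

    // Returns the pixel rectangle of one cell; row and column are zero-based.
    public (int X, int Y, int Width, int Height) GetCell(int row, int column)
    {
        if (row < 0 || row >= rowHeights.Length)
            throw new ArgumentOutOfRangeException(nameof(row));
        if (column < 0 || column >= columnWidths.Length)
            throw new ArgumentOutOfRangeException(nameof(column));

        // The cell starts after all preceding columns and rows.
        int x = left;
        for (int c = 0; c < column; c++) x += columnWidths[c];

        int y = top;
        for (int r = 0; r < row; r++) y += rowHeights[r];

        return (x, y, columnWidths[column], rowHeights[row]);
    }
}
```

For example, a table at (100, 200) with column widths 50, 80, 30 and two rows of height 20 puts the cell in row 1, column 2 at x = 230, y = 220 with size 30 by 20. The offsets only stay valid if every page is rendered at the same DPI as the one the template was drawn on.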

I would like to recommend Amyuni PDF Creator .Net for this task.
1st Scenario:
If your PDF files are well defined (no missing font information etc) you could directly extract the text from the PDF by specifying a rectangular region in the method GetObjectsInRectangle. You should also use the option acGetRectObjectsOptimize:
Optimize text objects before returning them. That is, combine text
objects that are close to each other into a single text object.
2nd Scenario:
If there are images involved that also contain text, rendering the whole page into an image and then applying OCR might be a better choice. You can do this with Amyuni PDF Creator .Net by using the methods ExportToTiff, ExportToJPeg, or RasterizePageRange.
From the documentation:
IacDocument.RasterizePageRange Method
The RasterizePageRange method converts page contents into a color or grey scale image. When archiving documents or performing OCR, it is sometimes preferable for all pages to be stored as images rather than complex text and graphic operations.
Then you can use our OCR add-in that integrates with Tesseract OCR and finally we fall again into the 1st Scenario (GetObjectsInRectangle). In order to apply OCR to your files you can use the method OCRPageRange.
void OCRPageRange(int startPage, int EndPage, string Language, acOCROptions Options)
About licensing, Amyuni PDF Creator .Net provides a (per application) royalty free license.
Usual disclaimer applies

I think you might want to give Docotic.Pdf another chance.
The library can extract text chunks, words and even individual characters with their bounding rectangles. Please have a look at the sample for extraction of words from PDFs.
Also, Docotic.Pdf can create images from PDFs and draw pages on a System.Drawing.Graphics. Please have a look at Draw and print Pdf group of samples.
Disclaimer: I am one of developers of the library.

.NET component for color PDF to grayscale conversion

Currently I use Ghostscript to convert color PDFs to grayscale PDFs. Now I'm looking for a reliable .NET component or library, commercial or not, to replace Ghostscript. I googled and did not find any component or library that can do this easily, or at all.
EDIT #1:
Why Ghostscript does not work for me:
I implemented Ghostscript and I'm using its native APIs. The problem is that Ghostscript does not support multiple instances of the interpreter within a single process. -dJOBSERVER mode also does not work for me, because I don't collect all jobs and then process them all at once. It can happen that Ghostscript is processing a large job that takes around 20 minutes, and meanwhile I get a smaller job that has to be processed ASAP and cannot wait 20 minutes. Another problem is that Ghostscript's page-processed events are not easy to catch. I wrote a parser for Ghostscript's stdout messages and I can read the processed page number, but not for each page as it is processed, because Ghostscript emits one message for a group of processed pages. There are a couple more problems with Ghostscript, such as producing bad PDFs and duplicated fonts.
You can find one more problem I had with Ghostscript here: Ghostscript - PS to PDF - Inverted images problem
UPDATE, a year later:
A year ago I asked this question. Later I made my own solution using iTextSharp.
You can take a look at the converting PDF to grayscale solution here:
http://habjan.blogspot.com/2013/09/proof-of-concept-converting-pdf-files.html
or
https://itextsharpextended.codeplex.com/
Works for me in most cases :)

Not quite an answer, but I think you dismiss Ghostscript too quickly.
Are you aware of the GhostScript API (for in-process Ghostscript)? Or of the -dJOBSERVER mode that can take a series of PS commands piped to its standard in?
That still won't get you your callbacks however, and it's still not multi-threaded.
As previously stated, iText could do it, but it would be a matter of walking through all the content and images looking for non-grayscale color spaces and converting them in a space-specific manner.
You'd also have to replace the pixel data in any images you might find.
The good news is that iText[Sharp] is capable of operating in multiple threads, provided each document is used from one thread at a time.
I suspect this is also the case for the suggested commercial library, which isn't such a good deal.
And then a light went on above my head... drawn in gray scale.
Blending modes and transparency groups!
Take all the current page content and stick it in a transparency group that is blended with a solid black rectangle that covers the page. I think there's even a luminosity-to-alpha blend mode... let's see here.
Yep, PDF reference section 11.6.5.2 "Soft Mask Dictionaries". You'll want a "luminosity" group.
Now, the bad news. If your goal in switching to gray scale is to save space, this will fail utterly. It'll actually make each file a little larger... say 100 bytes per page, give or take.
The software rendering the PDF better be pretty hot stuff too. Your cousin's undergrad rendering project need not apply. This is advanced graphics stuff here, infrequently used by Common PDF Files, so the last sort of thing to be implemented.
So... For each original page
Create a new page.
Cover it with a black background.
Cover it with a white rectangle (had it backwards earlier) in a transparency group that uses a soft mask dictionary set to be the luminosity of the original page's content (now stashed in an XObject Form).
Because this is all your own code, you'll have ample opportunity to do whatever it is you want to do at the beginning or end of each page.
By golly, that's just crazy enough to work! It does require some PDF-Fu, but not nearly as much as the "convert each color space and image in various ways as I step through the document". Deeper knowledge, less code to write.

This isn't a .net library, but rather a potential work-around. You could install a virtual printer that is capable of writing PDF files. I would suggest CutePDF, as it's free, easy to use and does a great job 'printing' a large number of file formats to PDF. You can do nearly everything with CutePDF that you can do with a normal printer, including printing to grayscale.
After the virtual printer is installed, you can use C# to 'print' a grayscale version.
Edit: I just remembered that the free version is not silent. Once you print to the CutePDF printer, it will ask you to 'Save As'. They do have an SDK available for purchase, but I couldn't say whether it would be able to help you convert to grayscale.

If a commercial product is a valid option for you, allow me to recommend Amyuni PDF Creator .Net. By using it you will be able to enumerate all items inside the page and change their colors accordingly, images can also be set as grayscale. Usual disclaimers apply
Sample code using Amyuni PDF Creator ActiveX, the .Net version would be similar:
pdfdoc.ReportState = ReportStateConstants.acReportStateDesign;
object[] page_items = (object[])pdfdoc.get_ObjectAttribute("Pages[1]", "Objects");
string[] color_attributes = new string[] { "TextColor", "BackColor", "BorderColor", "StrokeColor" };
foreach (acObject page_item in page_items)
{
    object _type = page_item["ObjectType"];
    if ((ACPDFCREACTIVEX.ObjectTypeConstants)_type == ACPDFCREACTIVEX.ObjectTypeConstants.acObjectTypePicture)
    {
        page_item["GrayScale"] = true;
    }
    else
        foreach (string attr_name in color_attributes)
        {
            try
            {
                Color color = System.Drawing.ColorTranslator.FromWin32((int)page_item[attr_name]);
                int grayColor = (int)(0.3 * color.R + 0.59 * color.G + 0.11 * color.B);
                int newColorRef = System.Drawing.ColorTranslator.ToWin32(Color.FromArgb(grayColor, grayColor, grayColor));
                page_item[attr_name] = newColorRef;
            }
            catch { } //not all items have all kinds of color attributes
        }
}

A year ago I asked this question. Later I made my own solution using iTextSharp.
You can take a look at the converting PDF to grayscale solution here: https://itextsharpextended.codeplex.com/

iTextPdf is a good product for creating and managing PDFs; it has both commercial and free versions.
Have a look at Aspose.Pdf for .NET; it provides the features below and a lot more.
Add and remove watermarks from PDF document
Set page margin, size, orientation, transition type, zoom factor and appearance of PDF document
..
And here is a list of open source pdf libraries.

After a lot of investigation I found out about ABCpdf from WebSupergoo. Their component can easily convert any PDF page to grayscale with a simple call to the Recolor method. The component is commercial.

How do I check for corrupt TIFF images in C#?

I searched for how to check whether a TIFF file is corrupt. Most answers suggest wrapping the Image.FromFile call in a try block: if it throws an OutOfMemoryException, the file is corrupt. Has anyone used this? Is it effective? Are there any alternatives?

Please check out the freeware library called LibTiff.NET. It can check whether each page in a TIFF file is corrupt, and it copes with partially corrupt files too.
http://bitmiracle.com/libtiff/
Thanks

Many TIFF files won't open in standard GDI+ in .NET. That is, if you're running on Windows XP; Windows 7 is much better. So any file that is not supported by GDI+ (e.g. fax, 16-bit grayscale, 48 bpp RGB, tiled TIFF, pyramidal tiled TIFF, etc.) is then seen as 'corrupt'. And not just that: anything resulting in a bitmap of more than a few hundred MB on a 32-bit system will also cause an out-of-memory exception.
If your goal is to support as much of the TIFF standard as possible, please start from LibTiff (or its derivatives). I've used LibTiff.NET from BitMiracle (LGPL), which worked well for me. Please see my other posts
Many of the TIFF utilities are also based on LibTIFF, and some of them have been ported to C#/.NET. This would be my suggestion if you want to validate the TIFF.
As for the TIFF specification suggested in other replies: of course this gives you bit-level control. But in my experience you won't need to go that low to get good TIFF support. The format is so versatile that building support from scratch will cost you an enormous amount of time.
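To get a feel for when the out-of-memory case mentioned above kicks in, the uncompressed size of a bitmap can be estimated before trying to load it. The sketch below assumes rows are padded to a multiple of 4 bytes, as GDI+ bitmaps are; the class and method names are made up for illustration.

```csharp
using System;

// Hypothetical helper for estimating how much memory a decoded bitmap needs.
public static class BitmapSizeEstimate
{
    // Uncompressed size in bytes, with each row padded to a 4-byte boundary.
    public static long EstimateBytes(int width, int height, int bitsPerPixel)
    {
        if (width <= 0 || height <= 0 || bitsPerPixel <= 0)
            throw new ArgumentOutOfRangeException(nameof(width), "Dimensions and bit depth must be positive.");

        long bitsPerRow = (long)width * bitsPerPixel;
        long stride = (bitsPerRow + 31) / 32 * 4;   // round up to whole 32-bit words
        return stride * height;
    }
}
```

An A4 page scanned at 600 dpi is roughly 4960 x 7016 pixels; at 24 bits per pixel that comes to about 104 MB for a single page, so a few such pages held in memory at once can exhaust a 32-bit process.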

It will only be corrupt in the sense that the framework's methods can't open it.
There are some TIFF types that the framework cannot open (in my case I can't remember the exact one; I think it was one of the fax types...).
That may be enough for you, if you are just looking at using the framework to manipulate images. After all, if you can't open it, you can't use it...
ImageMagick may give you more scope here.

Without looking at the TIFF, it may be difficult to tell whether it's corrupt from a visual perspective, but if you have issues with processing an image, why not just create a function that does a basic test for this type of processing and handles the error?
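One cheap basic test of this kind, sketched here with only System.IO, is to look at the first four bytes: a TIFF starts with "II" (little-endian) or "MM" (big-endian) followed by the number 42 in that byte order (43 for BigTIFF). This only catches truncated or mislabelled files. It does not validate the image directories or pixel data, so it complements the try/catch or LibTiff.NET approaches rather than replacing them. The class and method names are hypothetical.

```csharp
using System;
using System.IO;

// Hypothetical pre-check: rejects files that cannot possibly be TIFFs.
public static class TiffSanity
{
    // Returns true when the stream starts with a classic TIFF or BigTIFF header.
    public static bool HasTiffHeader(Stream stream)
    {
        if (stream == null) throw new ArgumentNullException(nameof(stream));

        var header = new byte[4];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) return false;               // shorter than four bytes
            read += n;
        }

        bool littleEndian = header[0] == 0x49 && header[1] == 0x49; // "II"
        bool bigEndian = header[0] == 0x4D && header[1] == 0x4D;    // "MM"
        if (!littleEndian && !bigEndian) return false;

        // The magic number is stored in the byte order announced by the first two bytes.
        int magic = littleEndian
            ? header[2] | (header[3] << 8)
            : (header[2] << 8) | header[3];

        return magic == 42 || magic == 43;          // 42 = TIFF, 43 = BigTIFF
    }
}
```

Typical use would be to open the file with File.OpenRead, call HasTiffHeader, and only then pass the file on to the heavier decoder inside a try/catch.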
