OCR engine to capture characters from images

I'm using the C# tessnet2 wrapper for the Tesseract OCR engine to capture the characters in image files. I have been searching everywhere to see whether tessnet2 has any built-in functions to overwrite certain characters and save them back into the same image file it is reading, but I have not found anything. So what I'm thinking of doing is creating a new image file based on what I receive from tessnet2; the new image needs to be exactly the same as the original, with just a few things changed. I'm not sure whether this is the right methodology, or whether there are other C# assemblies out there that let you read characters from an image file and also manipulate them as needed.

Good luck, but Tesseract has no way of replacing text in the proper font. Raster graphics don't generally store glyph information. Even if they did, you would potentially be in violation of licenses and/or copyrights surrounding the fonts you'd be writing in. I'm not an expert in OCR, but I will confidently say that this is not something readily available out there in the wild.

To expand on Brian's answer:
You will need to do this yourself. I have not worked with Tesseract, but I have used the Nuance OCR engine. It returns font information as well as the coordinates of each character it has recognized (note that you will most likely have to compute the actual image coordinates yourself, as the OCR engine will have deskewed the image before performing the recognition). Once you have the coordinates and the deskew information, you can compute the actual coordinates and then use any image manipulation library (Leadtools, Accusoft, etc.) or plain GDI+ functions to clear the character. Then, using the font and size information, create a new character and merge it into the image. This is not trivial, but it is certainly doable.
Edit:
It was late when I wrote the initial answer, so I wanted to clarify what is meant by font information. The OCR engine will give you the point size, whether the text is bold or italic, and the font family (serif, etc.). I do not know of one that will tell you the exact font that the document is in. If you have a sample of the documents that you will process, then you can make a good guess based on the information the OCR engine gives you.
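As a rough illustration of the coordinate step described above, here is a minimal C# sketch. It assumes, purely for illustration, that the engine deskewed the page by rotating it about its center by a known angle, with no scaling or cropping. The class name, the method name, and the sign convention are hypothetical and are not part of tessnet2 or Nuance; check how your engine reports its deskew before relying on any of this.

```csharp
using System;

public static class DeskewMapper
{
    // Maps a point reported in the deskewed image back to the original scan.
    // Assumption (hypothetical): the deskewed image is the original rotated
    // by skewDegrees about the image center, without scaling or cropping.
    public static (double X, double Y) ToOriginal(
        double x, double y, double width, double height, double skewDegrees)
    {
        double cx = width / 2.0;
        double cy = height / 2.0;

        // Rotate by the negative angle to undo the deskew rotation.
        double theta = -skewDegrees * Math.PI / 180.0;
        double dx = x - cx;
        double dy = y - cy;

        double ox = cx + dx * Math.Cos(theta) - dy * Math.Sin(theta);
        double oy = cy + dx * Math.Sin(theta) + dy * Math.Cos(theta);
        return (ox, oy);
    }
}
```

With the mapped coordinates in hand, the clearing and redrawing could then be done with GDI+ calls such as Graphics.FillRectangle and Graphics.DrawString on a Graphics object created from the original bitmap.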

Related

Using C#, how do I search for images in a Windows file system like TinEye.com does on the web?

Hi and thanks for looking!
Update
For the sake of clarity, a third-party .NET library is just fine. Preferably an open-source or free one. The solution need not be native .NET.
Background
I am working on an enterprise web application for which the client has given us thousands of pages of content in MS Word documents that we have to parse, extract data, and send to the content database.
Within these docs are various embedded images representing a larger original image in a separate folder.
The client did not provide any paths to the original source image, so when we see content with an embedded image in the MS Word doc, we have to go through several "assets" folders and look for the corresponding image which is extraordinarily time consuming.
We are already using DocX to parse the documents, so you can assume that we have a list of bitmap images to loop through that we have pulled from the document.
Question
Given a list of bitmaps that we just extracted from the document, how do we search a different folder containing hundreds of images, for the matching image, and then return the file path to it?
TinEye.com does this over the web. I am wondering if, using System.Drawing or something, we can do it on a PC with C#.
Thanks!
Matt

Hate to propose an answer to my own question, but I think I might be on to something here. Here is heuristic/pseudo code for a C# forms app--your thoughts are appreciated:
Part 1
Using System.IO, traverse the "assets" folders and get all images.
For each image, Base64 encode it.
Take the resulting string and place in an XML file:
<Image>
  <Path>C:\SomePath</Path>
  <EncodedString>[Some Base64 String]</EncodedString>
</Image>
Now we have an XML file containing all original images, in Base64 form, along with their file path.
Part 2
Using DocX, extract all images from MS Word Doc.
For each image, use Linq-to-Xml to search for an exact match in the XML file from Part 1.
If there are no exact matches, start iterating the XML file and computing the Levenshtein distance.
While in the foreach store the XML node Id (or file path) and Levenshtein Distance as a key value pair in an object.
Take the k/v pair with the lowest LD score and return the file path.
For performance, set tolerance so that the foreach stops if a certain original image has an acceptably low LD score when compared to the image extracted from the document.
Since this is a one-off task, I don't need instant performance. So, I could run this tonight before leaving the office and, hopefully, come back tomorrow to a list of paths connecting the original images to the ones embedded in the docs.
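The matching loop in Part 2 could be sketched roughly as follows. This is only a sketch under assumptions: the class and method names are made up for illustration, the XML lookup is replaced by an in-memory dictionary of path to Base64 string, and the distance is a plain dynamic-programming Levenshtein, which is quadratic in the string lengths and therefore slow on full-size images.

```csharp
using System;
using System.Collections.Generic;

public static class ImageMatcher
{
    // Classic two-row dynamic-programming Levenshtein distance.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + cost);
            }
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Returns the path whose encoded string is closest to the query.
    // Stops early as soon as a candidate is within the given tolerance.
    public static string FindClosestPath(
        string query,
        IReadOnlyDictionary<string, string> encodedByPath,
        int tolerance)
    {
        string bestPath = null;
        int bestDistance = int.MaxValue;

        foreach (var entry in encodedByPath)
        {
            // An exact match has distance 0 and needs no computation.
            int distance = entry.Value == query
                ? 0
                : Levenshtein(query, entry.Value);

            if (distance < bestDistance)
            {
                bestDistance = distance;
                bestPath = entry.Key;
            }
            if (bestDistance <= tolerance) break;
        }
        return bestPath;
    }
}
```

The tolerance parameter plays the role of the "acceptably low LD score" in the last step: a tolerance of 0 only stops early on an exact match.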
UPDATE
The heuristic above worked beautifully! I ended up using the Sift library to efficiently calculate distances between Base64 strings. Specifically, I used their FastDistance() method. I'm getting 100% accuracy on finding the images I need, even when the angle from which the photo was taken is slightly different.

There is no built-in algorithm in the .NET framework for generating image similarity. You'd need to use a third-party library or do it yourself. Lots of image similarity algo questions on SO:
Algorithm for finding similar images
How can I measure the similarity between two images?
comparing images programmatically - lib or class
One more, for .NET: Are there any OK image recognition libraries for .NET?. This one refers you to AForge, which seems to have the algorithm that you are after.
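If you go the do-it-yourself route, one simple and widely used technique is an "average hash". Note that this is not what AForge or the linked answers necessarily use; it is just a small sketch of the general idea. It assumes you have already scaled each image down to an 8x8 grayscale thumbnail (for example with System.Drawing) and passed the 64 pixel values in row by row. All names here are hypothetical.

```csharp
using System;

public static class AverageHash
{
    // Computes a 64-bit hash from 64 grayscale values
    // (an 8x8 thumbnail, row by row). Each bit records whether
    // the pixel is brighter than the thumbnail's mean brightness.
    public static ulong Compute(byte[] gray8x8)
    {
        if (gray8x8 == null || gray8x8.Length != 64)
            throw new ArgumentException("Expected exactly 64 grayscale values.");

        int sum = 0;
        foreach (byte value in gray8x8) sum += value;
        double mean = sum / 64.0;

        ulong hash = 0;
        for (int i = 0; i < 64; i++)
        {
            if (gray8x8[i] > mean) hash |= 1UL << i;
        }
        return hash;
    }

    // Number of differing bits; smaller means more similar.
    public static int HammingDistance(ulong a, ulong b)
    {
        ulong x = a ^ b;
        int count = 0;
        while (x != 0)
        {
            x &= x - 1; // clears the lowest set bit
            count++;
        }
        return count;
    }
}
```

Two images would then be treated as likely matches when the Hamming distance between their hashes falls below some threshold you pick by experiment.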

According to this SO answer to a similar question, you should look at OpenCV and VLFeat. The former has a C++ API and the latter a C API, so you would need to write your own P/Invoke wrapper or perhaps wrap them in a C++/CLI facade, which you could call from C#.

.NET component for color PDF to grayscale conversion

Currently I use Ghostscript to convert color PDFs to grayscale PDFs. Now I'm looking for a reliable .NET component/library, commercial or not, to replace Ghostscript. I googled and did not find any component/library that can do this easily, or at all.
EDIT #1:
Why Ghostscript does not work for me:
I implemented Ghostscript and I'm using its native APIs. The problem is that Ghostscript does not support multiple instances of the interpreter within a single process. -dJOBSERVER mode does not work for me either, because I don't collect all the jobs and then process them at once. It can happen that Ghostscript is processing a large job that takes around 20 minutes, and meanwhile I get a smaller job that has to be processed ASAP and cannot wait 20 minutes. Another problem is that Ghostscript's page-processed events are not easy to catch. I wrote a parser for Ghostscript's stdout messages, so I can read out the processed page number, but not for each page as it is processed, because Ghostscript emits a message per group of processed pages. There are a couple more problems with Ghostscript, such as producing bad PDFs and duplicated fonts.
You can find one more problem i had with ghostscript here: Ghostscript - PS to PDF - Inverted images problem
-
UPDATE, a year later:
I asked this question a year ago. Later I made my own solution using iTextSharp.
You can take a look at the converting PDF to grayscale solution here:
http://habjan.blogspot.com/2013/09/proof-of-concept-converting-pdf-files.html
or
https://itextsharpextended.codeplex.com/
Works for me in most cases :)

Not quite an answer, but I think you dismiss Ghostscript too quickly.
Are you aware of the GhostScript API (for in-process Ghostscript)? Or of the -dJOBSERVER mode that can take a series of PS commands piped to its standard in?
That still won't get you your callbacks however, and it's still not multi-threaded.
As previously stated, iText could do it, but it would be a matter of walking through all the content and images looking for non-grayscale color spaces and converting them in a space-specific manner.
You'd also have to replace the pixel data in any images you might find.
The good news is that iText[Sharp] is capable of operating in multiple threads, provided each document is used from one thread at a time.
I suspect this is also the case for the suggested commercial library, which isn't such a good deal.
And then a light went on above my head... drawn in gray scale.
Blending modes and transparency groups!
Take all the current page content and stick it in a transparency group that is blended with a solid black rectangle that covers the page. I think there's even a luminosity to alpha blend mode... let's see here.
Yep, PDF reference section 11.6.5.2 "Soft Mask Dictionaries". You'll want a "luminosity" group.
Now, the bad news. If your goal in switching to gray scale is to save space, this will fail utterly. It'll actually make each file a little larger... say 100 bytes per page, give or take.
The software rendering the PDF better be pretty hot stuff too. Your cousin's undergrad rendering project need not apply. This is advanced graphics stuff here, infrequently used by Common PDF Files, so the last sort of thing to be implemented.
So... For each original page
Create a new page.
Cover it with a black background.
Cover it with a white rectangle (had it backwards earlier) in a transparency group that uses a soft mask dictionary set to be the luminosity of the original page's content (now stashed in an XObject Form).
Because this is all your own code, you'll have ample opportunity to do whatever it is you want to do at the beginning or end of each page.
By golly, that's just crazy enough to work! It does require some PDF-Fu, but not nearly as much as the "convert each color space and image in various ways as I step through the document". Deeper knowledge, less code to write.

This isn't a .net library, but rather a potential work-around. You could install a virtual printer that is capable of writing PDF files. I would suggest CutePDF, as it's free, easy to use and does a great job 'printing' a large number of file formats to PDF. You can do nearly everything with CutePDF that you can do with a normal printer, including printing to grayscale.
After the virtual printer is installed, you can use c# to 'print' a greyscale version.
Edit: I just remembered that the free version is not silent. Once you print to the CutePDF printer, it will ask you to 'Save As'. They do have an SDK available for purchase, but I couldn't say whether it would be able to help you convert to grayscale.

If a commercial product is a valid option for you, allow me to recommend Amyuni PDF Creator .Net. With it you can enumerate all items inside the page and change their colors accordingly; images can also be set to grayscale. Usual disclaimers apply.
Sample code using Amyuni PDF Creator ActiveX, the .Net version would be similar:
// Switch the document to design mode so page objects can be edited.
pdfdoc.ReportState = ReportStateConstants.acReportStateDesign;
object[] page_items = (object[])pdfdoc.get_ObjectAttribute("Pages[1]", "Objects");
string[] color_attributes = new string[] { "TextColor", "BackColor", "BorderColor", "StrokeColor" };
foreach (acObject page_item in page_items)
{
    object _type = page_item["ObjectType"];
    if ((ACPDFCREACTIVEX.ObjectTypeConstants)_type == ACPDFCREACTIVEX.ObjectTypeConstants.acObjectTypePicture)
    {
        // Pictures have a dedicated grayscale switch.
        page_item["GrayScale"] = true;
    }
    else
        foreach (string attr_name in color_attributes)
        {
            try
            {
                // Replace each color attribute with its weighted gray equivalent.
                Color color = System.Drawing.ColorTranslator.FromWin32((int)page_item[attr_name]);
                int grayColor = (int)(0.3 * color.R + 0.59 * color.G + 0.11 * color.B);
                int newColorRef = System.Drawing.ColorTranslator.ToWin32(Color.FromArgb(grayColor, grayColor, grayColor));
                page_item[attr_name] = newColorRef;
            }
            catch { } //not all items have all kinds of color attributes
        }
}

iTextPdf is a good product for creating and managing PDFs; it has both commercial and free versions.
Also have a look at Aspose.Pdf for .NET, which provides the features below and a lot more.
Add and remove watermarks from PDF document
Set page margin, size, orientation, transition type, zoom factor and appearance of PDF document
..
And here is a list of open source pdf libraries.

After a lot of investigation I found out about ABCpdf from Websupergoo. Their component can easily convert any PDF page to grayscale with a simple call to its Recolor method. The component is commercial.

Howto: Improve the PDF- quality before OCR using C#

I'm creating a service that monitors a folder for scanned files. Once a file arrives, the service picks it up and converts it to a readable PDF. In this process the service also searches for a barcode. After this, the text is extracted and the file, along with its text, is stored in the database of our software. The location is based on the barcode.
Now, for the OCR we are using the SDK of Atalasoft (http://www.atalasoft.com/).
Also the Barcode recognizer is included in this SDK.
But the converted text still has some mistakes. (I ran some tests with other OCR-programs, but Atalasoft came out nice.)
I'm looking for some software (SDK-kit) which allows me to improve the quality of the PDF for OCR purposes.
I tested Kofax VRS Elite (http://www.kofax.com/vrs-virtualrescan/). I'm looking for something similar, but that can be implemented in the service using some kind of SDK-kit.
Anyone who did this before, or had similar problems?
thx in advance!

You may try and follow a different path altogether:
See if you can configure the scanner(s) to scan directly to PDF and do the OCR on the fly. The Lexmark scanners can do this. This creates PDFs with selectable and searchable text, which in turn can be extracted with a PDF reading library.
Alternatively you may want to have a look at http://www.abbyy.com/ and see if you get better results.
If these are not good options, you may want to break down your problem in a systematic way:
1. Is the image quality of the scanned images the problem? If so, then this will have to be fixed first. Your OCR solution may be affected by resolution, contrast, and colour.
2. Is it the OCR software? Take a highly legible document and see if the OCR software makes mistakes. If so, then you know you have to find better OCR software.
3. If your document quality is decent and your OCR software has a high success rate in deciphering a legible document, then you may want to look at the exceptions that do not work, and tackle these on a case by case basis.
If smears and background images on the documents are the cause of the problem, you may want to look into ways of avoiding them, or of cleaning them up with image processing software that exposes an API.

Draw emf antialiased

Is there a way to draw an EMF metafile (exported from a drawing tool) with antialiasing enabled? The tools I tried are not capable of exporting antialiased EMF files, so I wondered if I can turn it back on manually when drawing the EMF in the OnPaint override of my controls.
If anyone can confirm that it is technically possible to generate antialiased EMF files, another solution would be to use a drawing tool that can export to antialiased EMF, or to have a third-party converter do this later. If anyone knows such a tool, please let me know.
EDIT: Looking at the EMF instructions, it doesn't seem that EMF itself can actually store whether it is to be rendered antialiased or not. At least I couldn't find anything. It is more likely that the antialiasing is done by the playback engine. For example, when I open an EMF in Word 2007 it is rendered antialiased, but not when I draw it with the GDI+ "playback engine" (Graphics.DrawImage(...)) or when I view it in the standard Windows image viewer.
This makes me believe that some tools actually have their own EMF playback engine. So maybe there is a free .NET library (preferably with source code) that gives me an object model of the instructions stored in the parsed EMF file, so I can play it back myself instead of using Graphics.DrawImage(...)?

We had a similar issue in a DirectX project. Upscaling and downscaling works to a certain degree, but it's faking it. If it's something you need to do over and over, you could perhaps parse the records of the WMF and draw them with GDI+ antialiased.
The following threads back this up (but they're from 2005 so things might have changed):
http://www.dotnet247.com/247reference/msgs/28/144605.aspx
http://www.dotnetmonster.com/Uwe/Forum.aspx/dotnet-sdk/1127/Graphics-DrawImage-metafile-no-antialiasing
[Edit:]
These three programs might do the job for you: I'm assuming you're ok with doing it by hand:
http://emf-to-vector-converter-command-line-ser.smartcode.com/info.html
http://www.verypdf.com/pdf-editor/index.html
http://www.ivanview.com/converter/emf-batch-converter.html
[Edit II:]
Well, here's a program that will let you inspect an EMF in various ways:
http://download.cnet.com/windows/3055-2383_4-10558240.html?tag=pdl-redir
...and here's a freeware library that will let you parse 122 of the EMF commands and output them in GDI+. That should probably do the trick:
http://www.codeproject.com/KB/GDI-plus/emfexplorer.aspx?msg=2359423
...oh, and notice also comment #3 on the codeproject page. Looks like someone has banged their head against the wall before. Hope this solves your problem.

EMF is using GDI commands, not GDI+, so it has no notion of antialiasing. I suspect that when you ask GDI+ to render the file, it sends it to GDI and just copies the resulting bitmap.
Duplicating this in code would be the same as reimplementing GDI, so it's not terribly feasible. Not impossible, just a larger job than the benefit would justify. If there is an open source utility that can open EMF files outside of Windows, you might look into the source code.
My guess is that Word is using the downsampling trick.
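To make the downsampling trick concrete: the idea is to render the metafile at, say, 2x or 4x the target size and then average blocks of pixels down to the final size, which softens the jagged edges. The sketch below shows only the averaging step on a plain grayscale array, so it is independent of GDI or GDI+; the class and method names are hypothetical, and whether Word actually works this way is only a guess, as stated above.

```csharp
using System;

public static class Supersampling
{
    // Averages each factor x factor block of a grayscale image
    // (indexed [row, column]) into a single output pixel.
    // Rows and columns that do not fill a whole block are dropped.
    public static byte[,] Downsample(byte[,] source, int factor)
    {
        if (factor < 1) throw new ArgumentOutOfRangeException(nameof(factor));

        int height = source.GetLength(0) / factor;
        int width = source.GetLength(1) / factor;
        var result = new byte[height, width];

        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                int sum = 0;
                for (int dy = 0; dy < factor; dy++)
                    for (int dx = 0; dx < factor; dx++)
                        sum += source[y * factor + dy, x * factor + dx];

                result[y, x] = (byte)(sum / (factor * factor));
            }
        }
        return result;
    }
}
```

A hard black/white edge in the oversized rendering ends up as intermediate gray values in the output, which is the antialiased look. In a real application the oversized rendering would come from drawing the metafile into a larger bitmap first.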

An EMF file is a list of GDI commands, so it won't be anti-aliased even if you set SmoothingMode before drawing it under GDI+. You'll have to enumerate the GDI commands and then translate them into GDI+ commands.
Under Vista/Seven, you can use the GDI+ 1.1 function named GdipConvertToEmfPlus/ConvertToEmfPlus. If you want your program to work on XP, you have to write your own enumeration and then the conversion to GDI+ commands.
Enumerating the GDI commands and converting them to GDI+ is possible and has been done by emfexplorer, but I've written some code that is perhaps easier to follow, even if it's written in Delphi.
I'm only posting this answer now (I'm late) because I spent a lot of time finding a solution using ConvertToEmfPlus, and writing some tuned open source code for when this method is not available.

Where Can I Find Current Adobe Image Format Specifications? "Clipping Paths"

This is in regard to Adobe's Image Resource Blocks (IRB), which are stored in the TIFF, PSD, and JPEG formats. They are also called "8BIM". This standard was released with Adobe Photoshop 3 (November 1994).
IRBs contain information on color profiles and clipping paths (which is what I am interested in).
The only piece of documentation I can find on the internet is this 4-page document provided by Adobe in 1990.
Searching the ImageMagick source code, I found that the IRB IDs for clipping paths run from 2000 to 2998, so that's a usable 998 clipping paths.
So I managed to get an IRB byte array of each resource block from a JPEG and a TIFF file, as specified in the four-page document. I rolled my own parser and also tested Graphics Mill to see whether I got the same information.
I am not sure how to convert the clipping path byte array into anything usable, since I don't even know the format that Adobe Photoshop uses. The idea was to map the clipping path to a C# GDI+ GraphicsPath.
I think it's kind of pathetic that Adobe has been around for so many years as the industry leader in graphic design, yet they can't even provide the necessary documentation.
Can anybody suggest any documentation that I could use?
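For reference, splitting the raw IRB byte array into individual resource blocks might look like the sketch below. The layout used here (a "8BIM" signature, a 2-byte big-endian ID, a Pascal-string name padded to an even length, a 4-byte big-endian size, then the data padded to an even length) is my reading of Adobe's published Photoshop file format documentation, so treat it as an assumption to verify. The type and method names are made up for illustration, and the sketch does not decode the path records themselves.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public sealed class ImageResourceBlock
{
    public ushort Id;
    public string Name;
    public byte[] Data;
}

public static class IrbParser
{
    // Splits a raw IRB byte array into its resource blocks.
    // Parsing stops quietly at the first malformed or truncated block.
    public static List<ImageResourceBlock> Parse(byte[] irb)
    {
        var blocks = new List<ImageResourceBlock>();
        int pos = 0;

        // Smallest possible block: 4 (signature) + 2 (id) + 2 (empty name) + 4 (size).
        while (pos + 12 <= irb.Length
               && irb[pos] == '8' && irb[pos + 1] == 'B'
               && irb[pos + 2] == 'I' && irb[pos + 3] == 'M')
        {
            pos += 4;
            ushort id = (ushort)((irb[pos] << 8) | irb[pos + 1]);
            pos += 2;

            // Pascal string: one length byte plus text, padded to an even total.
            int nameLength = irb[pos];
            int nameField = 1 + nameLength;
            if (nameField % 2 != 0) nameField++;
            if (pos + nameField + 4 > irb.Length) break;
            // ASCII is an approximation; names may use a legacy Mac encoding.
            string name = Encoding.ASCII.GetString(irb, pos + 1, nameLength);
            pos += nameField;

            int size = (irb[pos] << 24) | (irb[pos + 1] << 16)
                     | (irb[pos + 2] << 8) | irb[pos + 3];
            pos += 4;
            if (size < 0 || size > irb.Length - pos) break;

            var data = new byte[size];
            Array.Copy(irb, pos, data, 0, size);
            pos += size + (size & 1); // data is padded to an even length

            blocks.Add(new ImageResourceBlock { Id = id, Name = name, Data = data });
        }
        return blocks;
    }

    // ID range for path resources as quoted in the question (2000 to 2998).
    public static bool IsPathResource(ushort id)
    {
        return id >= 2000 && id <= 2998;
    }
}
```

Blocks for which IsPathResource returns true would hold the path data; turning that data into a GraphicsPath still requires the record format from the specification.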

This is probably no longer relevant for you, but some info can be found here:
http://www.adobe.com/devnet-apps/photoshop/fileformatashtml/PhotoshopFileFormats.htm

+1 to mistika
The Adobe specification contains a description of the PSD clipping path. The last revision is dated Oct 2013, and it looks like Adobe is currently working on it. At least I have the feeling that new material was added.
If you are looking for code that uses the PSD format, take a look at libpsd. It's nice open source code that is pretty easy to read, and sometimes more informative than the spec.
As for GraphicsMill, since version 6.x it can transform an Adobe clipping path into a GdiPlus GraphicsPath.
