Parse text from a screen grab

Parse text from a screen grab - c#

Not sure the best way to explain this but i'll give it a shot. I'm trying to find a way to parse text/numbers from a screen grab in either C# or Java - whichever provides the easiest way, but preferably java.
An example would be as follows. You have a website/document/application with a block of text. You can take a screenshot of the specific area which contains this text. Once the screenshot has been taken you can extract a string from it containing the relevant characters.
Any feedback is appreciated. Thanks

You could try the Tessnet2 a .NET 2.0 Open Source OCR assembly using Tesseract engine.

If you are taking a screenshot it will be saved into an image file. You can use OCR image tools to break it down into text, but it's not 100% always in the conversion. Parsing text and/or numbers is just basic code in C# or Java.

FreeOCR is best out of all the OCRs I've tried.

Related

Image recognition to read 7segment Display

I am implementing SURF to detect Digits inside a seven segment display using some template. But it is not working fine. Is there any way that can be slow but more effective. I am using Emgu Wrapper for OpenCV

I would suggest you don't use SURF and instead look into using tesserect for character recognition.
SURF is really good for recognising patterns such as logo's and images, but for characters tesserect will not only produce better results, it's easier to implement!
You can create your own custom fonts to look for if the digits you are trying to read are non-standard.
https://www.youtube.com/watch?v=RqvvXJXuRYY&list=UUxAnMtjN08ryThpgYTBmILg
Try following this tutorial, its really helpful for getting started with OCR.
It's in VB but it won't be hard to write in C# once you have got the logic down.
Hope this helps!

OCR engine to capture characters from images

i'm using c# tessnet2 wrapper for Tesseract OCR engine to capture chracters of image files. i been searching everywhere if tessnet2 has any build in functions to overwrite certain characters and saved them into the same image file it's reading but have not found anything in regards to that. so what i'm thinking of doing is creating a new imagine file base on what i'm receiving from tessnet2 but i need to create the new image the same exact way but change just few things in the new created image. i'm not sure if i'm using the correct methology or if there is other c# assemblies out there that allow you to read characters from image file and at the same time allow you to manipulate as you need them.

Good luck--but tess has no way of replacing in the proper font. Raster graphics don't generally store glyph information. Even if it did, you would potentially be in violation of licenses and/or copyrights surrounding the fonts you'd be writing in. I'm not an expert in OCR, but I will confidently say that this is something not readily available out there in the wild.

To expand on Brian's answer:
You will need to do this yourself. I have not worked with Tesseract, but I have used the Nuance OCR engine. It will return you font information as well as coordinates for the character it has recognized (note that you will most likely have to compute the actual image coordinate as the OCR engine will have deskewed the image before performing the recognition). Once you get the coordinates and the deskew so that you can compute the actual coordinate, you can then use any image manipulation library (Leadtools, Accusoft, etc) or just straight GDI+ functions to clear the character, then using the font info and size info create a new character and merge it into the image. This is not trivial but certainly doable.
Edit:
It was late when I wrote the initial answer, wanted to clarify what is meant by font information. The OCR engine will give you information regarding the point size, whether its bold/italicized and the font family (Seriph, etc). I do not know of one that will tell you the exact font that the document is in. If you have a sample of the documents that you will process, then you can make a good guess based on the info the OCR engine gives you.

Converting pdf to text

I need to create a C# or C++ (MFC) application that converts pdf files to txt. I need not only to convert, but remove headers, footers, some garbage characters on the left margin etc. Thus the application shold allow the user to set page margins to cut off what is not needed. I actually have already created such an application using xpdf, but it gives me some problems when I am trying to insert custom tags into the extracted text to preserve italics and bold. Maybe somebody could suggest something useful?
Thanks.

There are shareware and freeware utilities out there. Try fetching their source code, or perhaps use them the way they are.
A public version of the PDF specification can be found here: Adobe PDF Specification
PDF Shareware readers can be found: PDF Reader source code # SourceForge

Please look at Podofo. It's a LGPL-licensed library that has many powerful editing features. One of it's examples, txt2pdf IIRC, is a good start: it shows basic text-extraction; From there you can check if pre (in pdf engine) or post (in text) filtering suffices to your goals. I didn't get to use Pdf Hummus, but it's supposed to have these capabilities too, although it's less straightforward.

How to verify that pdf is text based using ITextSharp?

I need to verify that the pdf report is text based (and not bitmap based; however it could contain some images). I do not need to extract the text, just to verify that it is text based.
Is there a way how to perform such a verification using ITextSharp library?
Thanks in advance,
Stefan

You can look for text drawing commands easily enough. The least work on your part would be to try to extract the text and see if anything is there. Ideally you'd know some of the text it should contain and search for it. A single sentence or phrase would be plenty for this sort of testing.
Text extraction with iText is pretty trivial these days. Lots of examples floating around SO, and the web.

How does one parse and convert AutoCAD MText entity to raw text?

I would like to parse AutoCAD's MText entity and extract the raw text. I see a pattern in the way the text is formatted. If this has already been solved, then I would not need to reinvent the wheel. I have searched online, but have not found sufficient information.
I am searching for any links or references on this subject.
Edit:
To further clarify, we are using the ODA (Open Design Aliance) libraries to access the DWG files. I am not familiar with this library. Another developer is using the library and extracting information from the files including MText entities. I am then provided with a file containing the MText text, which is what I am looking at. I am looking at the MText formatted text, which I have access to and am working with in C#.
Questions:
I asked the other developer if the ODA library provided a means to extract the raw text unformatted. His response was that it could, however that it would also result in the entity getting written back to the DWG file. I am interested in the raw text without affecting the original DWG file. Does ODA provide a way of extracting the raw text without altering the file?
I am interested in any documentation on the formatting rules of MText, so that I can consider writing a parser myself if necessary.
Is there anything out there to convert MText to RTF? I realize that RTF would not completely satisfy all formatting rules, but this could provide a satisfactory means of displaying the formatted text in a WinForms app. Given RTF I could also obtain the raw text.

This Forum thread includes a VB program to strip the control characters from the MText. The code indicates what should be done to strip each control character, so it should be straightforward to write something similar in C#.
Additionally, the documentation of the format codes is available in the AutoCAD documentation.

If you are using C# and the .NET interface, the Text property of the MText object provides the raw text:
MText mt;
...
string rawText = mt.Text;
If you want the formatting as well, the solution is different.

If you are parsing an AutoCAD file without AutoCAD, you need to specify what file type you are parsing. However, this question is basically a subset of the following questions:
Are there any libraries for parsing AutoCAD files?
Open source cad drawing (dwg) library in C#
.Net CAD component that can read/write dxf/ dwg files
Reading .DXF files
For DWG, the basic options are Open Design Alliance and AutoCAD RealDWG.
If this doesn't help, please provide more details as to exactly what you are trying to do.

If you are using C#, give the netDXF library a try.
I thought pseudo code should be like this:
DxfDocument dxf = new DxfDocument();
dxf = DxfDocument.Load(openFileDialog1.FileName);//load your file
//This extracts the raw text of your first text obj
dxf.MTexts[0].PlainText;

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.