How to extract the line item dynamically using zonal OCR method? - c#

I am currently creating an OCR application. Extracting fixed areas based on a predefined template works fine, but I am having difficulty extracting the line items from scanned invoices, because the line items differ from invoice to invoice.

It sounds like you are looking into dynamically extracting information from unstructured forms.
The term 'Unstructured forms processing' refers to capturing data from documents that do not have a fixed structure. Examples of unstructured forms are documents such as purchase orders, invoices, bills, and tabs. These types of documents have a general template but certain parts of the form can vary depending on how many line items or purchases are included in the form.
To extract the data from the form, you will need to use some sort of OCR to convert the image to text. If you are looking for an open source solution, you can use Tesseract to extract all of the data from the invoice. I searched Stack Overflow for using Tesseract on unstructured forms and came across these questions, which you can take a look at:
Tesseract receipt scanning advice needed
How to extract relevant information from receipt
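Once the OCR engine returns words with their positions, the variable-length line-item area can be found by content instead of by a fixed zone. The following is only a minimal sketch under several assumptions: the `OcrWord` type, the `LineItemExtractor` class, and the "Description" and "Total" keywords are all hypothetical and not part of Tesseract or any other library, and it needs C# 9 / .NET 5 or later for records.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one OCR result: recognized text plus its bounding box in pixels.
public record OcrWord(string Text, int X, int Y, int Width, int Height);

public static class LineItemExtractor
{
    // Groups words into text rows: a word joins the current row when its vertical
    // center lies within `tolerance` pixels of the center of that row's first word.
    public static List<List<OcrWord>> GroupIntoRows(IEnumerable<OcrWord> words, int tolerance = 8)
    {
        var rows = new List<List<OcrWord>>();
        foreach (var word in words.OrderBy(w => w.Y + w.Height / 2))
        {
            int center = word.Y + word.Height / 2;
            bool sameRow = rows.Count > 0 &&
                Math.Abs(center - (rows[^1][0].Y + rows[^1][0].Height / 2)) <= tolerance;
            if (sameRow)
                rows[^1].Add(word);
            else
                rows.Add(new List<OcrWord> { word });
        }
        // Within each row, order the words left to right.
        foreach (var row in rows)
            row.Sort((a, b) => a.X.CompareTo(b.X));
        return rows;
    }

    // Returns the text rows between the table header (a row containing headerKeyword)
    // and the first later row that starts with totalKeyword. Both keywords are assumptions.
    public static List<string> ExtractLineItems(
        IEnumerable<OcrWord> words,
        string headerKeyword = "Description",
        string totalKeyword = "Total")
    {
        var lines = GroupIntoRows(words)
            .Select(row => string.Join(" ", row.Select(w => w.Text)))
            .ToList();

        int header = lines.FindIndex(l => l.Contains(headerKeyword, StringComparison.OrdinalIgnoreCase));
        if (header < 0)
            return new List<string>();

        int total = lines.FindIndex(header + 1, l => l.StartsWith(totalKeyword, StringComparison.OrdinalIgnoreCase));
        int end = total < 0 ? lines.Count : total;
        return lines.GetRange(header + 1, end - header - 1);
    }
}
```

The fixed template still handles the header fields. Only the block between the two anchor rows is treated as dynamic, so the number of line items no longer matters.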
Another option is to look into a commercial solution that provides libraries that solve this issue for you. The company I work for, LEADTOOLS, has an Invoice Recognition and Processing library that allows you to define your Master and then easily process your filled invoices against it. Here is a video overview of the Invoice Recognition and Processing SDK:
Invoice Recognition and Processing
Screenshot of the invoice demo included with the SDK:

Related

.NET program scan renderable text in Chart in .PDF - not for words but for values - Text Location features?

Hello I have a chart that I need to have the system review and give results...
Chart image located here....
example chart .pdf http://imageshack.us/photo/my-images/651/scorecardchartexample.gif/
--Assume the chart is in .PDF and the text is renderable I.E. "highlight-able".
--Assume the chart is placed on the page exactly the same way and same position every time
--Assume the chart can change - that is to say, I need to be able to upload 1,000 of these charts, all following the exact same format but with some information differing from chart to chart.
--Assume VAST expertise in .NET - and little expertise in actual text interpretation.
--Assume expertise in interpreting .PDF that have editable fields...I am already doing this, this is limited to .PDF's I created and was able to place values on each field etc.
--Assume this chart is only deliverable in a single text renderable .PDF - that is to say - we interact with a website that creates this chart - this website has no API to interact with, we must print to PDF this chart from the webpage and that is all we can do...(government website)
Using a .NET system, I need to create a program...or incorporate an existing application into my .NET system, that will review this chart and will be able to tell what each "X" represents...that is to say an "X" one inch to the left or in the next row is an indicator of a different result (refer to chart)
I need the program to perform its search and return results based on the trigger of the .PDF document hitting a folder or whatever. This part we can handle, assuming we are creating the program from scratch...otherwise we will be limited to interacting with an existing app as needed.
We are open to a variety of strategies. Assuming such a class or object exists, we were thinking of reading text based on its location in the document, like an X,Y sort of thing. Another desirable route would be some sort of StringBuffer (assume C#), but it will need to be able to navigate the chart gridlines and count white spaces to accurately interpret the position of the "X"s and what each "X" means based on its placement. The third option is something we are unaware of.
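A minimal sketch of the "X,Y sort of thing", assuming some PDF library has already reported the page coordinates of each "X" and that the gridline positions are known because the chart is always placed identically. The `ChartGrid` class, the edge values, and the labels are hypothetical and only illustrate the lookup.

```csharp
using System;
using System.Collections.Generic;

public static class ChartGrid
{
    // Returns the index of the band that contains `value`, given ascending
    // boundaries (band i spans boundaries[i] up to boundaries[i + 1]), or -1 if outside.
    public static int FindBand(double value, IReadOnlyList<double> boundaries)
    {
        for (int i = 0; i < boundaries.Count - 1; i++)
        {
            if (value >= boundaries[i] && value < boundaries[i + 1])
                return i;
        }
        return -1;
    }

    // Translates the position of an "X" mark into its (row label, column label),
    // or null when the mark lies outside the grid.
    public static (string Row, string Column)? Interpret(
        double x, double y,
        IReadOnlyList<double> columnEdges, IReadOnlyList<string> columnLabels,
        IReadOnlyList<double> rowEdges, IReadOnlyList<string> rowLabels)
    {
        int column = FindBand(x, columnEdges);
        int row = FindBand(y, rowEdges);
        if (column < 0 || row < 0)
            return null;
        return (rowLabels[row], columnLabels[column]);
    }
}
```

With this approach there is no need to count white space. The gridline coordinates are measured once per chart template, and every "X" is then resolved by a lookup.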
If something exists and is tried and true, well, that of course would be best. In that case, any tips on interfacing with it using .NET and C# would be welcome.
Thank you all very much in advance Code Gawds!
Reel
OK, we found some software called ClearImage - it wasn't cheap but it is pretty neat. It will analyze any image in the same fashion that Adobe PDF analyzes a document to find form fields. After ClearImage does that, it gives you a list of "blobs"; you then get to dictate what each blob means and give it a unique identifier. This allows for automatic value declaration based on "blob" placement in the image.
It also allows you to sort of "fingerprint" an image, so if the same image were to show up it could recognize it...in our case we have 3 different templates for the chart, and each chart will differ in what is marked on it, but every chart produced from a given template has the same layout...this has helped our system identify which chart has been entered and then, after that first check, move on to analyzing each blob.
Anyway, it is worth a look if anyone else should come across this question and is in need of this type of function. I didn't want to leave it unanswered. I may update this as we learn more about it. I know this isn't exactly a coding question, but this type of task is coding intensive, and anyone looking to perform the same task may find their way here. I will endeavor to update in the spirit of Stack Overflow with comments relating to integration, objects, etc.
Should anyone have more questions about this software in relation to coding, you can ask here or post a new question; we will be happy to post the code (methods, classes, objects, etc.) we used (in C#) to integrate it into our/your programs.

C# solution for rendering PDFs and OCRing the resulting images? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
What I'm looking for is a C# solution to import data from PDF documents into our database, in a commercial application. Our customers will be looking to import any arbitrary document. Ordinarily I'd write this off as a complete impossibility, but the documents they're importing will be in their own set layout.
My plan is to have the PDFs rendered to static images, then allow the users to set up their own templates, which essentially pull out text at predefined pixel-offsets in the PDF, using OCR. For tables, they define a location of the table and a bunch of further values for column and row sizes. We can then apply the template onto that document type.
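A minimal sketch of such a table template, assuming pixel coordinates on the rendered page image and rows of equal height. The `Zone` and `TableTemplate` types are hypothetical names for illustration, and it needs C# 9 / .NET 5 or later for records.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical template types; all coordinates are pixels on the rendered page image.
public record Zone(int X, int Y, int Width, int Height);

public record TableTemplate(int Left, int Top, int[] ColumnWidths, int RowHeight, int RowCount)
{
    // Yields one zone per cell, row by row and left to right within each row.
    public IEnumerable<(int Row, int Column, Zone Zone)> Cells()
    {
        for (int row = 0; row < RowCount; row++)
        {
            int x = Left;
            for (int column = 0; column < ColumnWidths.Length; column++)
            {
                yield return (row, column, new Zone(x, Top + row * RowHeight, ColumnWidths[column], RowHeight));
                x += ColumnWidths[column];
            }
        }
    }
}
```

Each yielded zone would then be cropped from the page image and handed to the OCR engine, so a table costs the user one definition rather than one zone per cell.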
So, what I'm really looking for is two libraries: one to convert PDFs to images, another to OCR those images.
Requirements:
Is pure-C# or has a supported C# wrapper onto a native DLL.
Doesn't fork out processes - wrappers that essentially just create command line parameters and launch an external executable aren't allowed in this case.
In the case of FOSS, allows us to exempt ourselves from normal FOSS license requirements (i.e. publishing our source code) by paying a license fee.
We certainly don't mind paying for a commercial solution, but we'd rather not get stuck with paying a fee per individual distribution of the software.
I know this is quite a specific requirement set - perhaps enough for some people to deem this question too localised, but I'm hoping that someone can suggest an approach and some libraries that can be helpful to me, as well as others in the future.
Stuff I've looked into for the PDF side:
iTextSharp - Documentation is a book you have to buy, not a good start. Doesn't seem to be much useful documentation regarding turning PDFs into images in the public domain. Licensing is opaque, looks like we have to pay per client we distribute to.
Docotic.Pdf - Text only, no use to us.
pdftohtml - Again, doesn't produce images. Would be a mess to port to C# too.
PdfFileParser - Still not what we need.
GhostScript - Pretty much exactly what we want, but requires forking out to a program.
For the OCR side, I'll probably end up using Tesseract, since the Apache license is permissive and it's got good reviews. If there's an alternative, I'd be interested in that too.
I would like to recommend Amyuni PDF Creator .Net for this task.
1st Scenario:
If your PDF files are well defined (no missing font information etc) you could directly extract the text from the PDF by specifying a rectangular region in the method GetObjectsInRectangle. You should also use the option acGetRectObjectsOptimize:
Optimize text objects before returning them. That is, combine text
objects that are close to each other into a single text object.
2nd Scenario:
If there are images involved that also contain text, rendering the whole page into an image and then applying OCR might be a better choice. You can do this with Amyuni PDF Creator .Net by using the methods ExportToTiff, ExportToJPeg, or RasterizePageRange.
From the documentation:
IacDocument.RasterizePageRange Method
The RasterizePageRange method converts page contents into a color or grey scale image. When archiving documents or performing OCR, it is sometimes preferable for all pages to be stored as images rather than complex text and graphic operations.
Then you can use our OCR add-in that integrates with Tesseract OCR and finally we fall again into the 1st Scenario (GetObjectsInRectangle). In order to apply OCR to your files you can use the method OCRPageRange.
void OCRPageRange(int startPage, int EndPage, string Language,
acOCROptions Options)
About licensing, Amyuni PDF Creator .Net provides a (per application) royalty free license.
Usual disclaimer applies
I think you might want to give Docotic.Pdf another chance.
The library can extract text chunks, words and even individual characters with their bounding rectangles. Please have a look at the sample for extraction of words from PDFs.
Also, Docotic.Pdf can create images from PDFs and draw pages on a System.Drawing.Graphics. Please have a look at Draw and print Pdf group of samples.
Disclaimer: I am one of developers of the library.

Mail merge or merge-like functionality from C#

I need to print a few thousand stickers with a few text fields (name, position, etc) as well as a barcode image.
Each staff member gets two unique stickers, and the sticker paper has 4 per sheet so that's 2 staff per sheet.
I already have all the code to generate the barcode as an Image, and the staff details are stored in a List of object.
If possible, I'd like to avoid using MSWord directly since my development environment is quite different from the target environment and I've had issues in the past from the disparity. (Win7-64, MSOffice2010 vs. WinXP-32, MSOffice2003).
What's the best way to accomplish this?
If I save the document as an XML format and replace the mail merge fields with unique tokens which I can replace with my actual values (and I can even replace the binary image data with base-64 encoded image bytes) then that works but it's clunky. For starters, I'd have to save the XML file and then somehow print it transparent to the user (don't want Word showing up). Also, the XML template is 1 page, but I might have several dozen to print. I can send each page to the printer individually but that's not exactly ideal.
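A minimal sketch of that token-replacement idea, producing one merged XML string per sheet so that all sheets can be combined or queued together. The `{{NAME1}}`-style tokens, the `Staff` record, and the `StickerMerge` class are hypothetical; the real tokens are whatever was placed in the saved template.

```csharp
using System;
using System.Collections.Generic;
using System.Security;

// Hypothetical staff record; BarcodePng holds the already generated barcode image bytes.
public record Staff(string Name, string Position, byte[] BarcodePng);

public static class StickerMerge
{
    // Fills the page template once per sheet, with up to staffPerSheet people per sheet,
    // and returns one merged XML string per sheet.
    public static List<string> BuildPages(string pageTemplate, IReadOnlyList<Staff> staff, int staffPerSheet = 2)
    {
        var pages = new List<string>();
        for (int i = 0; i < staff.Count; i += staffPerSheet)
        {
            string page = pageTemplate;
            for (int slot = 0; slot < staffPerSheet; slot++)
            {
                // Slots on a partly used last sheet are filled with empty strings.
                bool used = i + slot < staff.Count;
                string suffix = (slot + 1) + "}}";
                string name = used ? staff[i + slot].Name : "";
                string position = used ? staff[i + slot].Position : "";
                string barcode = used ? Convert.ToBase64String(staff[i + slot].BarcodePng) : "";

                page = page
                    .Replace("{{NAME" + suffix, Escape(name))
                    .Replace("{{POSITION" + suffix, Escape(position))
                    .Replace("{{BARCODE" + suffix, barcode);
            }
            pages.Add(page);
        }
        return pages;
    }

    // Escapes characters such as & and < so that names cannot break the XML.
    private static string Escape(string value) => SecurityElement.Escape(value) ?? "";
}
```

Escaping matters here because a name containing `&` or `<` would otherwise produce an invalid document. Base-64 output needs no escaping.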
Any other suggestions?
I would use DevExpress XtraReports, as I have used it in the past in similar scenarios with great results. If you prefer other engines like Crystal or Telerik, it is the same: as easy as dragging some fields into the page detail section and assigning your object list as the data source. DevExpress also has a RichTextBox with a built-in mail merge feature. Lastly, if you decide on Word, do not forget that you can automate it while keeping it invisible, so users won't see it.

Extract data from nested tables in PDF

I have a few pdf files that were created from word or excel files.
I need to get the information that's in the tables.
The text in the document is not an image so I'm able to extract the text using tools such as pdfbox.
When I have the text I have no way of knowing what cells in the table it belongs to because I don't know where the table borders are.
I've tried a few desktop tools such as ABBYY or Solid PDF Converter, and they are able to convert the files into nice Word documents, but this doesn't suit my needs, as I want to be able to do this programmatically in C#.
Some of the tables have nested tables, which I think makes this a little bit more difficult.
I appreciate your help
The difficulty here is caused by the fact that the text in the PDF is not contained within any table. It might look like it is, but underneath the surface, it is not.
So there are a couple of options that I can think of. But none of them are going to be quite as satisfying as you'd probably like.
There are some companies that offer SDKs for PDF to Excel/Word conversion. Investintech and Iceni are a couple of examples. But these solutions are not free.
If you know the exact layout of the PDF files that you need to extract the table data from, then you can use any SDK that lets you extract text from a PDF and also tells you the exact co-ordinates of the extracted text. Using this method you need to know in advance where the text is going to be, so that you can extract text from a specific area on the page. It obviously won't work if you need to process any random document.
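A minimal sketch of that second option, assuming the SDK reports each piece of text together with its position. The `TextFragment` type and `RegionReader` class are hypothetical and stand in for whatever the chosen SDK returns. The sketch also assumes that Y grows downward and that fragments on the same line report the same Y.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical fragment type: text plus the position a PDF SDK reported for it.
public record TextFragment(string Text, double X, double Y);

public static class RegionReader
{
    // Joins all fragments whose origin lies inside the rectangle, ordered top to
    // bottom and then left to right. For PDF-native coordinates, where Y grows
    // upward, sort by descending Y instead.
    public static string ReadRegion(
        IEnumerable<TextFragment> fragments,
        double left, double top, double width, double height)
    {
        var inside = fragments
            .Where(f => f.X >= left && f.X < left + width && f.Y >= top && f.Y < top + height)
            .OrderBy(f => f.Y)
            .ThenBy(f => f.X)
            .Select(f => f.Text);
        return string.Join(" ", inside);
    }
}
```

Calling `ReadRegion` once per known cell rectangle gives the cell contents, including for nested tables, as long as every inner cell has its own rectangle.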
It's a difficult task, but hopefully this will give you a starting point.

C# - Templated Printing from Object(s)

I'm in need of a solution to print or export (pdf/doc) from C#. I want to be able to design a template with place holders, bind an object (or xml) to this template, and get out a finished document.
I'm not really sure if this is a reporting solution or not.
I also don't want to have to roll my own printing / graphics code -- I'd like all display concerns handled in a template.
I initially think of this as something Crystal Reports can do (although I've never used CR), but I'm not sure if I'm abusing the system here -- I'm not really interested in binding ADO.NET datasets at the moment (screw datasets). Can Crystal deal with binding to objects?
Does SSRS or WPF play in this field too?
A subset of WPF-P is XPS which can be used to present your objects via databinding.
One of the best choices if you are already using WPF.
Google Keywords: XPS, FixedDocument, FlowDocument, WPF Printing
Might read through this thread:
http://groups.google.com/group/nhusers/browse_thread/thread/e2c2b8f834ae7ea8
Seems a lot of people like iTextSharp
http://itextsharp.sourceforge.net/
For Word docs, look into Word's Mail Merge feature and Word automation. I did this recently in a form letter printing project. Basically what I did was create a Word template file (file extension .dot) and in this template file I defined MergeFields in a standard form letter. My application queries a database for the records it needs to print and then for each record it returns it matches fields in the database with these merge fields and sends the result (the merged doc) to the printer.
It's working really well and if I had a link that gave a definitive explanation, I'd provide it (check back here, I'll see if I can't find the most useful ones). Hopefully I've provided enough keywords to let you find your own resources. I can go into more detail if you need.
I've never had to export PDF files, but for a project I'm working on now I'll have to. For a free solution, my research has led to iTextSharp (as Will Shaver points out), but I've only done the initial investigation, and I have found a few paid solutions I might end up resorting to.
