C# solution for rendering PDFs and OCRing the resulting images? [closed]

C# solution for rendering PDFs and OCRing the resulting images? [closed] - c#

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I'm looking for is a C# solution to import data from PDF documents into our database, in a commercial application. Our customers will be looking to import any arbitrary document. Ordinarily I'd write this off as a complete impossibility, but the documents they're importing will be in their own set layout.
My plan is to have the PDFs rendered to static images, then allow the users to set up their own templates, which essentially pull out text at predefined pixel-offsets in the PDF, using OCR. For tables, they define a location of the table and a bunch of further values for column and row sizes. We can then apply the template onto that document type.
So, what I'm really looking for is two libraries: one to convert PDFs to images, another to OCR those images.
Requirements:
Is pure-C# or has a supported C# wrapper onto a native DLL.
Doesn't fork out processes - wrappers that essentially just create command line parameters and launch an external executable aren't allowed in this case.
In the case of FOSS, allows us to exempt ourselves from normal FOSS license requirements (i.e. publishing our sourcecode) by paying a license fee.
We certainly don't mind paying for a commercial solution, but we'd rather not get stuck with paying a fee per individual distribution of the software.
I know this is quite a specific requirement set - perhaps enough for some people to deem this question too localised, but I'm hoping that someone can suggest an approach and some libraries that can be helpful to me, as well as others in the future.
Stuff I've looked into for the PDF side:
iTextSharp - Documentation is a book you have to buy, not a good start. Doesn't seem to be much useful documentation regarding turning PDFs into images in the public domain. Licensing is opaque, looks like we have to pay per client we distribute to.
Docotic.Pdf - Text only, no use to us.
pdftohtml - Again, doesn't produce images. Would be a mess to port to C# too.
PdfFileParser - Still not what we need.
GhostScript - Pretty much exactly what we want, but requires forking out to a program.
For the OCR side, I'll probably end up using Tesseract, since the Apache license is permissive and it's got good reviews. If there's an alternative, I'd be interested in that too.

I would like to recommend Amyuni PDF Creator .Net for this task.
1st Scenario:
If your PDF files are well defined (no missing font information etc) you could directly extract the text from the PDF by specifying a rectangular region in the method GetObjectsInRectangle. You should also use the option acGetRectObjectsOptimize:
Optimize text objects before returning them. That is, combine text
objects that are close to each other into a single text object.
2nd Scenario:
If there are images involved that also contain text, rendering the whole page into an image and then applying OCR might be a better choice. You can do this with Amyuni PDF Creator .Net by using the methods ExportToTiff, ExportToJPeg, or RasterizePageRange.
From the documentation:
IacDocument.RasterizePageRange Method The RasterizePageRange method converts page contents into a color or grey scale image. When
archiving documents or performing OCR, it is sometimes preferable for
all pages to be stored as images rather than complex text and graphic
operations.
Then you can use our OCR add-in that integrates with Tesseract OCR and finally we fall again into the 1st Scenario (GetObjectsInRectangle). In order to apply OCR to your files you can use the method OCRPageRange.
void OCRPageRange(int startPage, int EndPage, string Language,
acOCROptions Options)
About licensing, Amyuni PDF Creator .Net provides a (per application) royalty free license.
Usual disclaimer applies

I think you might want to give Docotic.Pdf another chance.
The library can extract text chunks, words and even individual characters with their bounding rectangles. Please have a look at the sample for extraction of words from PDFs.
Also, Docotic.Pdf can create images from PDFs and draw pages on a System.Drawing.Graphics. Please have a look at Draw and print Pdf group of samples.
Disclaimer: I am one of developers of the library.

Related

Generate and design PDF with iTextSharp or similar [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
tl;dr :
Basically I'm just wondering what is the best/easiest way to design a PDF document? Is it remotely legit to actually design a whole PDF document with iTextSharp with code(i.e not loading external files)? I want the final result to look similar to a webpage with various colours, borders, images and everything.
Or do you have to rely on other documents like .doc, .html files to achieve a good design?
Originally I thought that I would use HTML markup to generate a PDF, however seeing how bad the support seems to be when it comes to styling/CSS so am I rethinking the whole situation.
Which led me to think, why even use a HTML markup or a .doc(x) file to create the PDF design when I could just do it right within the PDF without having to rely on on various files that serves no real purpose.
I started looking on this guide(it is several parts)
http://www.c-sharpcorner.com/uploadfile/f2e803/basic-pdf-creation-using-itextsharp-part-i/
A part of the guide :
using (MemoryStream ms = new MemoryStream())
{
Document document = new Document(PageSize.A4, 25, 25, 30, 30);
PdfWriter writer = PdfWriter.GetInstance(document, ms);
document.Open();
document.Add(new Paragraph("Hello World"));
document.Close();
writer.Close();
Response.ContentType = "pdf/application";
Response.AddHeader("content-disposition",
"attachment;filename=First PDF document.pdf");
Response.OutputStream.Write(ms.GetBuffer(), 0, ms.GetBuffer().Length);
}
And got a basic hang about generating PDFs and design them a bit.
But is it possible to generate and design big PDF documents this way and are there any more proper guides or similar with all the various commands to generate texts, images, borders and everything since I have no real clue about generating PDF with code.

The question is too broad, so I can only give you a very broad answer.
Option 1: you create your layout by using iText's high-level objects. There are countless applications out there that are using PdfPTable to generate complex reports. For instance: the time tables for a German Railway company are created from scratch through code; the invoices for a Belgian Telco company are created this way,... The advantage of this approach is that you can really fine-tune the layout. The disadvantage is that you need to change source code as soon as you want to change the layout.
Option 2: you create your layout by creating an AcroForm template. Every field in this template has a name and is visualized at exact positions (defined by its coordinates) on specific pages. The code to fill out such a form consists of only a handful of lines. Whenever you need to change the layout, you alter the AcroForm template. You do not need to change your code. The disadvantage is that AcroForms are very static. Compare it to a paper form: you can't insert a row in a paper form either.
Option 3: you create your data in XHTML format and your styles in CSS. A Belgian printing company responsible for creating invoices for its customers is streaming data into very simple HTML files involving a sequences of tables that never span more than a handful of pages. These files are then fed to iText's XML worker along with a CSS that is different for each of its customers. The advantage of this approach is that no extra programming is needed when a new customer joins. It's just a matter of creating a new CSS. The disadvantage is that you are limited by the HTML format. Elementary logic also tells you that you shouldn't expect URL2PDF: have you ever tried printing a website? Well, the bad quality of that print should give you an indication of the problems you'll encounter when trying to convert HTML to PDF. If you anticipate them, you can get good results. If you don't... it's a poor craftsman who blames his tools...
Option 4: define your template using the XML Forms Architecture. Such templates are usually created using Adobe LiveCycle Designer. An XSD is fed into LC Designer and the result is an empty form where the PDF format acts as a container for an XML stream. You can then use iText to inject your custom XML containing data that conforms with the XSD into the PDF and you can use XFA Worker to flatten such a form. XFA Worker is only available as a closed source product (givers need to set limits because takers rarely do).
Option 5: right now XML Worker is used to convert XHTML+CSS and XFA to PDF (ordinary PDF, PDF/A, PDF/UA). You could use the generic XML Worker engine to support your own XML format. The advantage would be a very powerful engine that you can tune to meet your exact needs. The disadvantage is that this involves a serious up-front development investment.
Option 6: use a third party tool to define the template and a third party server that uses iText under the hood to create PDFs based on the template. An example of such a third party tool is Scriptura developed by Inventive Designers. There are other tools, but Inventive Designers is a customer of iText and we know that they are using iText correctly whereas we don't have this guarantee from other vendors.

I' sure you can achieve your goal with iText. But imho, you should first investigate things like report generator:
Sql Server Reporting Services, or
Crystal Report, or
...
For quite the same time invest, you will have several format : HTML, DOCX, PDF, XLSX...
Of course the pertinence of the answer may vary according to the nature of document to produce: if we are talking about 1000 pages documentations, the report generator is not necessarily the best way.

Convert High Resolution PDF to Low Resolution PDF file C#

I´m looking for a way to convert a High Resolution PDF file to a Low Resolution PDF file from an ASP.NET applicaitn (C#).
Users will import High Resolution PDF's and the solution should then have the possibility to provide both High Resolution PDF and Low Resolution PDF.
I´m looking for a API to do that. I have found a lot of PDF apis but none of them seems to do what I´m looking for.

ABCpdf .NET will do this for you. There are a variety of functions for resizing, resampling or recompressing the images within a PDF document. However given your requirements you probably just want to use the document reduce size operation.
To do this you just need code of the following form:
Doc doc = new Doc();
doc.Read(Server.MapPath("../mypics/sample.pdf"));
using (ReduceSizeOperation op = new ReduceSizeOperation(doc)) {
op.UnembedSimpleFonts = false; // though of course making these true...
op.UnembedComplexFonts = false; // ... would further reduce file size.
op.MonochromeImageDpi = 72;
op.GrayImageDpi = 72;
op.ColorImageDpi = 144;
op.Compact(true);
}
doc.Save(Server.MapPath("ReduceSizeOperation.pdf"));
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)

There are a number of possible approaches to this problem, one of which would simply be to export from InDesign twice (which would allow you to make the two required versions of PDF). If that is not feasible (which it might not be as exporting from InDesign can take a bit of time) there are definitely libraries on the market that can perform what you want to do. Here to you have a number of approaches:
1) While this will get me shot by most Adobe employees, you could re-distill your PDF file into a smaller file. I would not advocate to do this, I'm mentioning it to be complete. Risks involved would be introducing quality loss, artefacts and so on (mostly because PostScript doesn't support a number of features PDF does support and by redistilling you'd loose those features).
2) Use an application or library that was made for this task. callas pdfToolbox for example (warning, I'm affiliated with this company!) is capable of taking a PDF file and running corrections on it. Amongst those corrections are image downsampling, conversion to RGB, image re-compression (for example with JPEG-2000), throwing away unnecessary data in the PDF and much more. Given that the application has a command-line interface, it can easily be integrated into a c# process. Other applications with similar functionality would be from companies such as Enfocus, Apago and others.
3) Use a low-level PDF library (such as the Adobe PDF library that can be licensed from Adobe through DataLogics) and implement the necessary changes yourself. More work, but potentially more flexible.
Whatever approach you choose, know that you are starting from a high-quality PDF file and that your process should try to retain as much of that quality as possible (dependant on which application you have for the low resolution PDF file of course). Make sure you don't get into trouble by loosing proper overprint, transparency etc...

Probably you can just resize images in high resolution version PDF and this will give you much smaller files.
You can resize and/or recompress images using Docotic.Pdf library. For more details please take a look at my answer to a similar question.
Disclaimer: I am one of the developers of Docotic.Pdf library.

Converting pdf to text

I need to create a C# or C++ (MFC) application that converts pdf files to txt. I need not only to convert, but remove headers, footers, some garbage characters on the left margin etc. Thus the application shold allow the user to set page margins to cut off what is not needed. I actually have already created such an application using xpdf, but it gives me some problems when I am trying to insert custom tags into the extracted text to preserve italics and bold. Maybe somebody could suggest something useful?
Thanks.

There are shareware and freeware utilities out there. Try fetching their source code, or perhaps use them the way they are.
A public version of the PDF specification can be found here: Adobe PDF Specification
PDF Shareware readers can be found: PDF Reader source code # SourceForge

Please look at Podofo. It's a LGPL-licensed library that has many powerful editing features. One of it's examples, txt2pdf IIRC, is a good start: it shows basic text-extraction; From there you can check if pre (in pdf engine) or post (in text) filtering suffices to your goals. I didn't get to use Pdf Hummus, but it's supposed to have these capabilities too, although it's less straightforward.

Howto: Improve the PDF- quality before OCR using C#

I'm creating a service that monitors a folder for scanned files. Once the file is there, The service picks it up, and convert it to a readable PDF. In this process the service also searches for a barcode. After this, the text is extracted and the file, with its text is stored into the database of our software. The location is based on the barcode.
Now, for the OCR we are using the SDK of Atalasoft (http://www.atalasoft.com/).
Also the Barcode recognizer is included in this SDK.
But the converted text still has some mistakes. (I ran some tests with other OCR-programs, but Atalasoft came out nice.)
I'm looking for some software (SDK-kit) which allows me to improve the quality of the PDF for OCR purposes.
I tested Kofax VRS Elite (http://www.kofax.com/vrs-virtualrescan/). I'm looking for something similar, but that can be implemented in the service using some kind of SDK-kit.
Anyone who did this before, or had similar problems?
thx in advance!

You may try and follow a different path altogether:
See if you can configure the scanner(s) to scan directly to PDF and do the OCR on the fly. The Lexmark scanners can do this. This creates PDF's with selectable and searchable text. This in turn can be extracted with a PDF reading library.
Alternatively you may want to have a look at http://www.abbyy.com/ and see if you get better results.
If these are not good options, you may want to break down your problem in a systematic way:
1. Is the image quality of the scanned images the problem? If so, then this will have to be fixed first. Your OCR solution may be affected by resolution, contrast, and colour.
2. Is it the OCR software? Take a highly legible document and see if the OCR software makes mistakes. If so, then you know you have to find better OCR software.
3. If your document quality is decent and your OCR software has a high success rate in deciphering a legible document, then you may want to look at the exceptions that do not work, and tackle these on a case by case basis.
If smears and background images on documents is the cause of the problem, you may want to look into ways of avoiding this, or cleaning this with image processing software that exposes an API.

Is there a way to replace a text in a PDF file with itextsharp?

I'm using itextsharp to generate the PDFs, but I need to change some text dynamically.
I know that it's possible to change if there's any AcroField, but my PDF doen's have any of it. It just has some pure texts and I need to change some of them.
Does anyone know how to do it?

Actually, I have a blog post on how to do it! But like IanGilham said, it depends on whether you have control over the original PDF. The basic idea is you setup a form on the page and replace the form fields with the text you want. (You can style the form so it doesn't look like a form)
If you don't have control over the PDF, let me know how to do it!
Here is a link to the full post:
Using a template to programmatically create PDFs with C# and iTextSharp

I haven't used itextsharp, but I have been using PDFNet SDK to explore the content of a large pile of PDFs for localisation over the last few weeks.
I would say that what you require is absolutely achievable, but how difficult it is will depend entirely on how much control you have over the quality of the files. In my case, the files can be constructed from any combination of images, text in any random order, tables, forms, paths, single pixel graphics and scanned pages, some of which are composed from hundreds of smaller images. Let's just say we're having fun with it.
In the PDFTron way of doing things, you would have to implement a viewer (sample available), and add some code over a text selection. Given the complexities of the format, it may be necessary to implement a simple editor in a secondary dialog with the ability to expand the selection to the next line (or whatever other fundamental object is used to make up text). The string could then be edited and applied by copying the entire page of the document into a new page, replacing the selected elements with your new string. You would probably have to do some mathematics to get this to work well though, as just about everything in PDF is located on the page by means of an affine transform.
Good luck. I'm sure there are people on here with some experience of itextsharp and PDF in general.

This question comes up from time to time on the mailing list. The same answer is given time and time again - NO. See this thread for the official answer from the person who created iText.
This question should be a FAQ on the itextsharp tag wiki.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.