Converting Image-Based PDF to Text-Based PDF - c#

How can I convert an image-based PDF to a text-based PDF? There are a lot of tools available, but I am looking for C# code so I can build my own application. I heard about Tesseract, but I could not find C# code for it; it seems to be available only for C/C++.
I used the MODI DLL to convert an image to text. The process is to convert each page of the PDF to an image (using the Acrobat DLL) and then run MODI on the output image (BMP/TIFF) to get the text. Is there any way to save the MODI object as a PDF?
MODI.Document doc = new MODI.Document();
doc.Create(ImagePath);
doc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, false, false);
doc.SaveAs("c://.../test.pdf", MODI.MiFILE_FORMAT.miFILE_FORMAT_DEFAULTVALUE, MODI.MiCOMP_LEVEL.miCOMP_LEVEL_HIGH);
// The line above creates a PDF file, but the file cannot be opened because of an error.
If you know any other way to do this, please let me know.
Regards,
R.Balajiprasad

You can use Google's Tesseract-OCR; the documentation can be found here. It's free and works well. There is also a NuGet package (IronOcr) which uses Tesseract; it can be found here.
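As a rough sketch of the Tesseract route, the C# program below shells out to the `tesseract` command-line tool with its `pdf` output config, which asks Tesseract to write a PDF containing the page image plus a recognized text layer. It assumes that Tesseract (with English language data) is installed and on the PATH; the class name, method name, and argument layout are made up for illustration and are not part of any library.

```csharp
using System;
using System.Diagnostics;

public static class TesseractPdfSketch
{
    // Runs: tesseract "<imagePath>" "<outputBase>" -l eng pdf
    // Tesseract appends ".pdf" to outputBase itself.
    // Returns the exit code of the tesseract process (0 means success).
    public static int CreateSearchablePdf(string imagePath, string outputBase)
    {
        ProcessStartInfo startInfo = new ProcessStartInfo();
        startInfo.FileName = "tesseract"; // assumption: the executable is on PATH
        startInfo.Arguments = "\"" + imagePath + "\" \"" + outputBase + "\" -l eng pdf";
        startInfo.UseShellExecute = false;
        startInfo.RedirectStandardError = true;
        startInfo.CreateNoWindow = true;

        using (Process process = Process.Start(startInfo))
        {
            // Read stderr before waiting so a full pipe cannot block the child.
            string diagnostics = process.StandardError.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                Console.Error.WriteLine(diagnostics);
            }
            return process.ExitCode;
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: TesseractPdfSketch <image> <outputBase>");
            return;
        }
        int exitCode = CreateSearchablePdf(args[0], args[1]);
        Console.WriteLine("tesseract exited with code " + exitCode);
    }
}
```

The page images would still come from the existing PDF-to-image step; a multi-page TIFF can be passed as the input image.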

Related

How can I convert PDF to doc without microsoft.office.interop?

I need to convert PDF files into .doc files using C#. The computer has a file system, though it doesn't have Office installed. Any good ideas how I can approach this? I did some research and most people use the interop services.
You need to understand that PDF is not really implemented as a single document format.
If your PDF docs are created by rendering text to a PDF file, then direct PDF conversion is not only possible, but can be very good (reliable).
If the source of your PDF is either a scanner or a fax (essentially a scanner...) then what you have is a document with a "picture" of text. This scenario is more difficult to deal with. If you open up the markup for such a file, there is no 'text' to be converted. In this situation you have to deal with some manner of OCR (optical character recognition), which is less reliable due to a variety of issues.
If you have the option of intercepting the data before it is rendered to PDF (say like in SSRS or Crystal) then it would be better for you to bypass the PDF stage and move your data to a Word document.
If you are constrained to receiving faxes and then needing to interpret their content, prepare for OCR hell. It has been a while since I was there, so I hope that it has gotten better.
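To tell the two kinds of PDF described above apart before choosing a conversion path, one option is to check whether any page yields extractable text. The sketch below assumes the iTextSharp 5.x library (not mentioned in this answer) and uses its `PdfReader` and `PdfTextExtractor` classes; the class and method names of the sketch itself are hypothetical.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextCheckSketch
{
    // Returns true if at least one page contains non-whitespace text.
    // A scanned or faxed PDF usually returns false and needs OCR instead.
    public static bool HasExtractableText(string pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string text = PdfTextExtractor.GetTextFromPage(reader, page);
                if (text != null && text.Trim().Length > 0)
                {
                    return true;
                }
            }
            return false;
        }
        finally
        {
            reader.Close();
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: PdfTextCheckSketch <file.pdf>");
            return;
        }
        Console.WriteLine(HasExtractableText(args[0])
            ? "Text layer found: direct conversion should be possible."
            : "No text found: the pages are probably images and need OCR.");
    }
}
```

This is only a heuristic: a scanned PDF that has already been through OCR also contains a text layer.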
Even without Office installed on your machine, you have access (with Visual Studio) to the Office developer toolkit, which will allow you to build documents to be distributed in the Word formats (.doc/.docx).
An option/idea may be to convert the PDF to Html, which can be opened in Word?
Use the Aspose PDF Kit to convert the PDF to text, and then write the text to a .doc file using a FileStream or Aspose's document library.

Open PDF and print to PDF programmatically C#

I am developing an application that can open and display PDFs, but only after I open them and print them to another PDF using CutePDF; the originals are not viewable.
I am looking for a way to programmatically open a PDF file, and print to another PDF file (not necessarily using CutePDF, just printing to another PDF is the desired functionality).
This will be integrated into a C# .NET project. Are there any suggestions how to go about doing this?
Thanks.
You could use Office Interop and generate the PDF. When you say "print to another PDF", I imagine you mean just generate one? Or are you saying you want to spool them to a PDF print driver that will essentially just create a PDF to be saved?
Use iText, which is available in Java and C# versions. I have used the Java version successfully. I recommend the iText in Action book to help you get up to speed with iText faster. The book discusses only the Java API, but I imagine you will be able to learn the principles of iText from the book and then figure out the minor differences for the C# version.
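For the C# side, a minimal sketch of rewriting an existing PDF into a new file might look like the following. It assumes iTextSharp 5.x and copies every page with `PdfCopy`; the class name and file arguments are hypothetical. Whether a rewritten copy becomes viewable depends on what is wrong with the originals, so treat this as a starting point rather than a guaranteed fix.

```csharp
using System;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class PdfRewriteSketch
{
    // Reads sourcePath and writes all of its pages to a fresh PDF at targetPath.
    public static void Rewrite(string sourcePath, string targetPath)
    {
        PdfReader reader = new PdfReader(sourcePath);
        try
        {
            using (FileStream output = new FileStream(targetPath, FileMode.Create))
            {
                Document document = new Document();
                PdfCopy copy = new PdfCopy(document, output);
                document.Open();

                // iTextSharp page numbers start at 1.
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    copy.AddPage(copy.GetImportedPage(reader, page));
                }

                // Closing the document finishes the copy and flushes the output.
                document.Close();
            }
        }
        finally
        {
            reader.Close();
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: PdfRewriteSketch <source.pdf> <target.pdf>");
            return;
        }
        Rewrite(args[0], args[1]);
    }
}
```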
To implement this you can use PDFFlow library for generating PDF files from C#. It has easy fluent syntax and many features.
Here are many examples of real complex PDF documents: examples
Good luck :)

How do I output a webpage that contains MathML to PDF?

My web application displays MathML embedded in HTML using the MathPlayer plugin. I need to output to PDF. I have PDF components (Dynamic PDF, ABCpdf), but they don't know how to parse the MathML, of course.
Is there a library that can help me translate the MathML to an image or something that I can feed to the PDF components on the fly in the web application?
Design Science has a command line Windows executable (also available as a DLL) that will convert all of the MathML in a document to EPS for use in PDF. It's the Document Composer, which is part of the MathFlow SDK. Contact us if you're interested in more info or an evaluation.
FYI, I have also found another PDF component that supports MathML called AHFormatter. I have not tried it, but it apparently works very well.

conversion of jpeg picture to pdf file

I need to convert a set of jpeg images into a pdf file (which should contain all the jpeg).
I want to do it in .NET 1.1 and programmatically in C#.
You could use iTextSharp to create the pdf and add images to it. Here's a sample.
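A minimal sketch of that approach, assuming iTextSharp 5.x, places each JPEG on its own A4 page. The class name, the A4 page size, and the 36-point margins are illustrative choices. The code avoids newer C# syntax such as `var`, but current iTextSharp builds target later frameworks, so a .NET 1.1 project would need a correspondingly old release whose API may differ.

```csharp
using System;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class JpegToPdfSketch
{
    // Writes one PDF page per JPEG, scaled down to fit inside the margins.
    public static void Convert(string[] jpegPaths, string pdfPath)
    {
        if (jpegPaths.Length == 0)
        {
            throw new ArgumentException("At least one image is required.");
        }

        Document document = new Document(PageSize.A4, 36f, 36f, 36f, 36f);
        using (FileStream output = new FileStream(pdfPath, FileMode.Create))
        {
            PdfWriter.GetInstance(document, output);
            document.Open();

            float maxWidth = document.PageSize.Width - 72f;   // left + right margin
            float maxHeight = document.PageSize.Height - 72f; // top + bottom margin

            foreach (string path in jpegPaths)
            {
                Image image = Image.GetInstance(path);
                image.ScaleToFit(maxWidth, maxHeight);
                image.Alignment = Element.ALIGN_CENTER;

                // NewPage is ignored while the current page is still empty,
                // so the first image does not leave a blank page behind.
                document.NewPage();
                document.Add(image);
            }

            document.Close();
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.Error.WriteLine("usage: JpegToPdfSketch <output.pdf> <image1.jpg> [image2.jpg ...]");
            return;
        }
        string[] images = new string[args.Length - 1];
        Array.Copy(args, 1, images, 0, images.Length);
        Convert(images, args[0]);
    }
}
```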
You could try ImageMagick.NET - it's a wrapper around ImageMagick, which can convert pretty much anything into anything. (I've only used the command line tool.)
If there's a problem with support for older .NET versions, just execute the command line tool yourself - it's the same thing.
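Running the command-line tool from C# could look roughly like this. The sketch assumes ImageMagick 7 is installed with its `magick` executable on the PATH (older installs use `convert` instead); the class and method names are hypothetical. It relies on ImageMagick writing all input images into one multi-page PDF when the output file name ends in `.pdf`.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ImageMagickPdfSketch
{
    // Runs: magick "<image1>" "<image2>" ... "<output.pdf>"
    // Returns the exit code of the ImageMagick process (0 means success).
    public static int Convert(string[] jpegPaths, string pdfPath)
    {
        StringBuilder arguments = new StringBuilder();
        foreach (string path in jpegPaths)
        {
            // Quote every path so that spaces in file names survive.
            arguments.Append('"').Append(path).Append("\" ");
        }
        arguments.Append('"').Append(pdfPath).Append('"');

        ProcessStartInfo startInfo = new ProcessStartInfo();
        startInfo.FileName = "magick"; // assumption: ImageMagick 7 on PATH
        startInfo.Arguments = arguments.ToString();
        startInfo.UseShellExecute = false;
        startInfo.CreateNoWindow = true;

        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.Error.WriteLine("usage: ImageMagickPdfSketch <output.pdf> <image1.jpg> [image2.jpg ...]");
            return;
        }
        string[] images = new string[args.Length - 1];
        Array.Copy(args, 1, images, 0, images.Length);
        Console.WriteLine("magick exited with code " + Convert(images, args[0]));
    }
}
```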
Use one of these open-source PDF libraries: http://csharp-source.net/open-source/pdf-libraries
Please try Aspose.Pdf for .NET in order to either convert the images to PDF file or add images to an existing PDF file. This works with .NET 1.1 and above. You can use it in any of your .NET applications using C# or VB.NET. It works on both 32-bit and 64-bit systems alike. Please try the component at your end.
Disclosure: I work as a developer evangelist at Aspose.

HTML to Image .tiff File

Is there a way to convert an HTML string into a .tiff image file?
I am using C# .NET 3.5. The requirement is to give the user an option to fax a confirmation. The confirmation is created with XML and an XSLT. Typically it is e-mailed.
Is there a way I can take the HTML string generated by the transformation and convert that to a .tiff or any other image that can be faxed?
3rd party software is allowed, however the cheaper the better.
We are using a 3rd party fax library that will only accept .tiff images, but if I can get the HTML to be any image I can convert it into a .tiff.
Here are some free-as-in-beer possibilities:
You can use the PDFCreator printer driver that comes with ghostscript and print
directly to a TIFF file or many other formats.
If you have MSOffice installed, the Microsoft Office Document Image Writer will produce
a file you can convert to other formats.
But in general, your best bet is to print to a driver that will produce an
image file of some kind or a Windows metafile (.wmf).
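Once one of these routes has produced an image file, re-saving it as TIFF needs only `System.Drawing` from the .NET Framework. The sketch below is a minimal example with hypothetical class and file names. It writes a TIFF with the default encoder settings; a fax library may require a specific compression or bit depth, which this sketch does not attempt to set.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;

public static class TiffConvertSketch
{
    // Loads any image format GDI+ can read (BMP, PNG, JPEG, ...) and saves it as TIFF.
    public static void SaveAsTiff(string sourcePath, string tiffPath)
    {
        using (Image image = Image.FromFile(sourcePath))
        {
            image.Save(tiffPath, ImageFormat.Tiff);
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: TiffConvertSketch <source image> <output.tiff>");
            return;
        }
        SaveAsTiff(args[0], args[1]);
    }
}
```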
Is there some reason why you can't just print-to-fax? Does the third-party software not support a printer driver? That's unusual these days.
A starting point might be the software from WebSuperGoo, which provides rich image editing products, cheap or for free.
I know for sure their PDF Writer can do basic HTML (http://www.websupergoo.com/helppdf6net/source/3-concepts/b-htmlstyles.htm). This should not be too hard to convert to TIFF.
This does not include the full HTML subset or CSS. That might require using Microsoft's IE ActiveX component.
