Converting pdf to text - c#

I need to create a C# or C++ (MFC) application that converts pdf files to txt. I need not only to convert, but remove headers, footers, some garbage characters on the left margin etc. Thus the application shold allow the user to set page margins to cut off what is not needed. I actually have already created such an application using xpdf, but it gives me some problems when I am trying to insert custom tags into the extracted text to preserve italics and bold. Maybe somebody could suggest something useful?
Thanks.

There are shareware and freeware utilities out there. Try fetching their source code, or perhaps use them the way they are.
A public version of the PDF specification can be found here: Adobe PDF Specification
PDF Shareware readers can be found: PDF Reader source code # SourceForge

Please look at Podofo. It's a LGPL-licensed library that has many powerful editing features. One of it's examples, txt2pdf IIRC, is a good start: it shows basic text-extraction; From there you can check if pre (in pdf engine) or post (in text) filtering suffices to your goals. I didn't get to use Pdf Hummus, but it's supposed to have these capabilities too, although it's less straightforward.

Related

Creating PDFs Online

We are using Report Definition laguage (RDL) templates to define various reports in one of our Sharepoint applications. These reports are (then) saved as PDFs into various SharePoint Document Library's. One report in-particular renders, but is considered to be "failing" due to the styling needs of the report. So it appears RDL only understand "very simple" HTML.
For Example:
Trademark characters are not rendering as superscript (they render as normal text instead)
The ability to assign Line Height fails
The ability to assign Word Spacing fails (so printers "leading" requirements fail)
Both of these point to various marked Microsoft limitation for RDL's to interprint various HTML...of which we are now aware.
So...
I need a better tool...and we are scratching our heads on this one!
QUESTION:
What tools take-in HTML, understand CSS (well!) and can generate PDFs from C-Sharp objects?
Please keep in-mind I need the to PDF generator tools you recommend (below) to understand CSS and HTML.
NOTE:
I looked at the various other StackEchange sites to see if there is a better forum for this particular question, but this one was the only one that seemed to fit-the-bill. If you are a mediator, and feel this question is mis-placed, please feel free to move this question.
This HTML to PDF converter has the most accurate conversion of a complex html/css page. There is also a demo to try the conversion with your html
Maybe you can give Amyuni WebkitPDF a try. It is a Free component for converting HTML+CSS into PDF files. From the home page:
Directly convert HTML files into PDF without the use of a web browser or a printer driver
Convert HTML files into XAML/XPS for rendering within Silverlight
Integrate and deploy the HTML conversion feature within your applications
Generate either a single continuous PDF page or split the HTML into multiple PDF pages
Amyuni WebkitPDF is distributed as a library with a sample application, and sample code for C++ and C#.
Disclaimer: I currently work as software developer at Amyuni Technologies.
I only know a workaround for the "leading space" issue. This example "leads" the value with 10 spaces:
=space(10) & Fields!FieldName.Value
This should work for any renderer, I'll update this if I come around other tricks.
Have a look at Aspose.Pdf for .NET: http://www.aspose.com/categories/.net-components/aspose.pdf-for-.net/default.aspx

Programmatically find and replace text in pdf

I'm trying to create a web script that will allow me to alter PDF templates that I have uploaded and re-output them. I have tried Zend already which allows me to write to a PDF but that means leaving the PDF blank in certain space which is to primitive for what I need. PDFFlip was not any better.
We need to implement functionality so we can remove content from the PDF as well as remove and replace. I have looked at CAM::PDF and changepagestring.pl but I'm not sure it's up to the job. I was hard pressed to find any real usage examples and Perl is not a language I have used before.
This is for a web project but I am flexible with the language we use, ideally PHP or ASP.NET C# would be great. Preferably not Java unless there is no other way.
I should also point out that I looked through the FoxitReader SDK without any luck. I never tried to implement it but I found no mention of find and replace like functionality.
You can tinker with PDF text but it is not straight-forward just to search and replace. The text is designed as an end-file format not for easy editing. I wrote a blog post explaining some of the issues at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text
May be as workaround it's better to hold and fill in templates in some format that is more convenient for editing? E.g., you can keep your templates as Microsoft Word templates and then export them to PDF after filling. This thread may be useful on this way.
PDF file format isn't quite appropriate for editing.
Alternatively, you may prepare your templates as PDFs containing form fields. In this case filling of form fields is common and well-known task and there a lot of pdf components for this.

How to highlight text in Pdf Winforms C#

I have a pdf file which I want to open in a Windows Forms Application and perform following tasks-
View the pdf document
Zoom +/- document
Search Text
Highlight a specific text
Show it in a listbox/dropdown
select those words and highlight in pdf
Remove selection/Highlight.
I have tried using certain libraries like pdfSharp/iTextSharp even Acrobat Reader OCX control.
Its really bugging me..is there any help??
I'd suggest looking at some means of converting the PDF if you don't have a direct need to edit it. Even then, it may be easier to convert to a different form, make changes, and then convert back. PDF is a form of PostScript, which makes it powerful, but also makes it a mess to deal with and my personal preference is to skip that headache. Not always avoidable (had a lot of fun creating Thai support in PDF print#home ticket creation once without bloating the document beyond unusable), but highly recommended where possible.
Anyways, there are a variety of PDF conversion libraries out there, some of which may be available for .NET. Worst case, you may need to create a managed C++ layer to allow your C# code to access them.
Doesn't acrobat reader OCX already have all those features ? What exactly doesnt the OCX do that you need to do in your code ?
You might try contacting Adobe and getting their full SDK for PDF. It might have controls which you can use to solve your problem.
Come to think of it , is there even an SDK for PDF from Adobe ?
You have not mentioned your preference of using Free or Commercial PDF Viewer option. If you are open to use Commercial PSF viewer, you may evaluate SyncFusion PDF Viewer control, Telerik PDF Viewer, Dynamic PDF Viewer or TallComponents. I have checked feature set and all seem to have features you are looking for. I do not represent or promote any of these SDKs, I have used TallComponents and Dynamic PDF for PDF manipulation and both have excellent support, I would say PDF Veterans in .NET space.

How can I programmatically create PDF bookmarks from PDF file?

So, I have used Pdf995's PDF print driver from a web browser to print web pages and eventually use PdfEdit995 to join these various PDF files into one large PDF.
Now I have a lot of large PDF documents that I wish to add bookmarks to, but am hoping there is a relatively easy way of doing this programmatically (using C#, preferably) - basically, I want to find, within each PDF, text that is large enough to qualify as a header, and use that text as the bookmark.
Any tips/advice/direction? Thanks!
It's definitely possible to do this, but I would recommend finding a PDF library that does most of the leg work. Technically you could do it all yourself with the aid of the PDF specification, but that'd probably take more time than it's worth.
The library will need to be able to let you find text in a document and then return the page and size, font, etc, of the text and create bookmarks (also known as outlines) based on that information programmatically.
My companies product, Quick PDF Library, can help you do this and so can PDFKit.NET. I'm sure there are other libraries out there that support this functionality too. As far as free libraries go, from what I've seen I don't believe that PDFSharp or iText will meet all of your requirements in this case, but I'm sure someone will correct me if I'm wrong.
If you'd prefer to develop a solution for this entirely yourself, then the PDF reference is available online for free.

Is there a way to replace a text in a PDF file with itextsharp?

I'm using itextsharp to generate the PDFs, but I need to change some text dynamically.
I know that it's possible to change if there's any AcroField, but my PDF doen's have any of it. It just has some pure texts and I need to change some of them.
Does anyone know how to do it?
Actually, I have a blog post on how to do it! But like IanGilham said, it depends on whether you have control over the original PDF. The basic idea is you setup a form on the page and replace the form fields with the text you want. (You can style the form so it doesn't look like a form)
If you don't have control over the PDF, let me know how to do it!
Here is a link to the full post:
Using a template to programmatically create PDFs with C# and iTextSharp
I haven't used itextsharp, but I have been using PDFNet SDK to explore the content of a large pile of PDFs for localisation over the last few weeks.
I would say that what you require is absolutely achievable, but how difficult it is will depend entirely on how much control you have over the quality of the files. In my case, the files can be constructed from any combination of images, text in any random order, tables, forms, paths, single pixel graphics and scanned pages, some of which are composed from hundreds of smaller images. Let's just say we're having fun with it.
In the PDFTron way of doing things, you would have to implement a viewer (sample available), and add some code over a text selection. Given the complexities of the format, it may be necessary to implement a simple editor in a secondary dialog with the ability to expand the selection to the next line (or whatever other fundamental object is used to make up text). The string could then be edited and applied by copying the entire page of the document into a new page, replacing the selected elements with your new string. You would probably have to do some mathematics to get this to work well though, as just about everything in PDF is located on the page by means of an affine transform.
Good luck. I'm sure there are people on here with some experience of itextsharp and PDF in general.
This question comes up from time to time on the mailing list. The same answer is given time and time again - NO. See this thread for the official answer from the person who created iText.
This question should be a FAQ on the itextsharp tag wiki.

Categories