can someone tell me if AcrobatAccessLib (Acrobat Access 3.0 Type Library) in com references can be used for text searching in pdf document?
It contains class PDDom, but I dont know if I can load document into it or, how to work with it.
(I dont wanna use iTextSharp, and others, I tryied it but not works as I wanted - pdf has corrupted number paging + contains tables, that are across 2 pages - iTextSharp finds me searching text on both pages - instead of 1, but if I use Acrobat Reader - it works well)
EDIT: Or another question, Can I use acrobat reader and its searching module in my application?
I am working in c#
Thanks a lot!
Try to use PDFLIBNET.DLL
in that dll have pdfwrapper class, this class provides lots methods to get text from pdf. The FindText method used to get a text from a particular position, and exportToText method gives content of pdf page
from that content u will search the pdf content..
am using tat DLL and searching the pdf content with out any issue..
try it and let me know..
If money is not an issue, I would by the Aspose PDF components. They work pretty well and are built for server usage.
Related
I need to create a C# or C++ (MFC) application that converts pdf files to txt. I need not only to convert, but remove headers, footers, some garbage characters on the left margin etc. Thus the application shold allow the user to set page margins to cut off what is not needed. I actually have already created such an application using xpdf, but it gives me some problems when I am trying to insert custom tags into the extracted text to preserve italics and bold. Maybe somebody could suggest something useful?
Thanks.
There are shareware and freeware utilities out there. Try fetching their source code, or perhaps use them the way they are.
A public version of the PDF specification can be found here: Adobe PDF Specification
PDF Shareware readers can be found: PDF Reader source code # SourceForge
Please look at Podofo. It's a LGPL-licensed library that has many powerful editing features. One of it's examples, txt2pdf IIRC, is a good start: it shows basic text-extraction; From there you can check if pre (in pdf engine) or post (in text) filtering suffices to your goals. I didn't get to use Pdf Hummus, but it's supposed to have these capabilities too, although it's less straightforward.
I'm currently using Aspose PDF Kit to split a 'master PDF' up into individual documents + thumbnails. This works well at the moment, but the device I'll be rendering the PDF on won't know about the annotations/links within the PDF.
I understand there is a way to parse the PDF document to detect the X/Y position of a hyperlink etc, is there an simple way to extract/iterate across the document data so I can write it to an external XML file?
You may want to try Docotic.Pdf library for this (disclaimer: I work for Bit Miracle).
The library can be used to retrieve all hyperlinks in a document. You may retrieve bounding box, text and other properties of a link, too.
Please take a look at "Extract text from link target" sample. It may help you to get started.
I am working in a desktop project in C# with .net. This project has a function that generates some information and i would like to print this generated info as a document (may be .doc, .pdf, etc). Summarizing, i need:
Get the data generated by a function;
Generate a document containing these information structured with title, texts and tables (things that every document have);
Print it;
I thought generating an .html file (because it's simple to generate this kind of file), but i couldn't find a way to print it directly from my program.
Which extension of file would you recommend to insert this kind of information and print it directly from my program??
Thanks in advance.
Here's an easy way that uses a RichTextBox
http://www.codeproject.com/KB/printing/simpleprintingcs.aspx
It's not trivial to print a PDF, HTML, or a doc unless you are going to use external programs or third-party libraries. ImageMagick/GhostScript could help you print PDF.
Disclaimer: I work at Atalasoft -- If you are willing to use commercial software, my company makes PDF rendering components for .NET. There are companies that do the same for HTML.
Directly? Open printer port...
Or you can do it with framework classes:
How to: Print with a WebBrowser Control
http://msdn.microsoft.com/en-us/library/b0wes9a3.aspx
I have a pdf file which I want to open in a Windows Forms Application and perform following tasks-
View the pdf document
Zoom +/- document
Search Text
Highlight a specific text
Show it in a listbox/dropdown
select those words and highlight in pdf
Remove selection/Highlight.
I have tried using certain libraries like pdfSharp/iTextSharp even Acrobat Reader OCX control.
Its really bugging me..is there any help??
I'd suggest looking at some means of converting the PDF if you don't have a direct need to edit it. Even then, it may be easier to convert to a different form, make changes, and then convert back. PDF is a form of PostScript, which makes it powerful, but also makes it a mess to deal with and my personal preference is to skip that headache. Not always avoidable (had a lot of fun creating Thai support in PDF print#home ticket creation once without bloating the document beyond unusable), but highly recommended where possible.
Anyways, there are a variety of PDF conversion libraries out there, some of which may be available for .NET. Worst case, you may need to create a managed C++ layer to allow your C# code to access them.
Doesn't acrobat reader OCX already have all those features ? What exactly doesnt the OCX do that you need to do in your code ?
You might try contacting Adobe and getting their full SDK for PDF. It might have controls which you can use to solve your problem.
Come to think of it , is there even an SDK for PDF from Adobe ?
You have not mentioned your preference of using Free or Commercial PDF Viewer option. If you are open to use Commercial PSF viewer, you may evaluate SyncFusion PDF Viewer control, Telerik PDF Viewer, Dynamic PDF Viewer or TallComponents. I have checked feature set and all seem to have features you are looking for. I do not represent or promote any of these SDKs, I have used TallComponents and Dynamic PDF for PDF manipulation and both have excellent support, I would say PDF Veterans in .NET space.
So, I have used Pdf995's PDF print driver from a web browser to print web pages and eventually use PdfEdit995 to join these various PDF files into one large PDF.
Now I have a lot of large PDF documents that I wish to add bookmarks to, but am hoping there is a relatively easy way of doing this programmatically (using C#, preferably) - basically, I want to find, within each PDF, text that is large enough to qualify as a header, and use that text as the bookmark.
Any tips/advice/direction? Thanks!
It's definitely possible to do this, but I would recommend finding a PDF library that does most of the leg work. Technically you could do it all yourself with the aid of the PDF specification, but that'd probably take more time than it's worth.
The library will need to be able to let you find text in a document and then return the page and size, font, etc, of the text and create bookmarks (also known as outlines) based on that information programmatically.
My companies product, Quick PDF Library, can help you do this and so can PDFKit.NET. I'm sure there are other libraries out there that support this functionality too. As far as free libraries go, from what I've seen I don't believe that PDFSharp or iText will meet all of your requirements in this case, but I'm sure someone will correct me if I'm wrong.
If you'd prefer to develop a solution for this entirely yourself, then the PDF reference is available online for free.