I have a fairly simple task: I need to read a PDF file and write out its image contents while ignoring its text contents. So essentially I need to do the complement of "save as text".
Ideally, I would prefer to avoid any sort of re-compression of the image contents but if it's not possible, it's ok too.
Are the examples of how to do it?
Thanks!
Extracting text from a PDF file with PDFsharp is not a simple task.
It was discussed recently in this thread:
https://stackoverflow.com/a/9161732/162529
Extracting text from a PDF with PdfSharp can actually be very easy, depending on the document type and what you intend to do with it. If the text is in the document as text, and not an image, and you don't care about the position or format, then it's quite simple. This code gets all of the text of the first page in the PDFs I'm working with:
var doc = PdfReader.Open(docPath);
string pageText = doc.Pages[0].Contents.Elements.GetDictionary(0).Stream.ToString();
doc.Pages.Count gives you the total number of pages, and you access each one through the doc.Pages array with the index. I don't recommend using foreach and Linq here, as the interfaces aren't implemented well. The index passed into GetDictionary is for which PDF document element - this may vary based on how the documents are produced. If you don't get the text you're looking for, try looping through all of the elements.
The text that this produces will be full of various PDF formatting codes. If all you need to do is extract strings, though, you can find the ones you want using Regex or any other appropriate string searching code. If you need to do anything with the formatting or positioning, then good luck - from what I can tell, you'll need it.
Example of PDFSharp libraries extracting images from .pdf file:
link
library
EDIT:
Then if you want to extract text from image you have to use OCR libraries.
There are two good OCRs tessnet and MODI
Link to thread on stack
But I fully can recommend MODI which I am using now. Some sample # codeproject.
EDIT 2 :
If you don't want to read text from extracted images, you should write new PDF document and put all of them into it. For writing PDFs I use MigraDoc. It is not difficult to use that library.
Related
I have PDF document data with table structure format and I would like to convert that PDF file into a text file with the same structure with margin and spaces between text in pdf
You need to write your own PDF tool then. Which is not exactly an easy task. Honestly, 3rd party tools make your job much easier, why don't you want to use one?
If you change your mind, I can suggest iTextSharp. I've used it in the past with great success. Here are some example to get you going:
http://www.codeproject.com/Articles/12445/Converting-PDF-to-Text-in-C
ps. there are 3 tools used in there.
I need to verify that the pdf report is text based (and not bitmap based; however it could contain some images). I do not need to extract the text, just to verify that it is text based.
Is there a way how to perform such a verification using ITextSharp library?
Thanks in advance,
Stefan
You can look for text drawing commands easily enough. The least work on your part would be to try to extract the text and see if anything is there. Ideally you'd know some of the text it should contain and search for it. A single sentence or phrase would be plenty for this sort of testing.
Text extraction with iText is pretty trivial these days. Lots of examples floating around SO, and the web.
I am able to read a pdf file using PDFBOX in my ASP.net application but it is not adding space for an empty cell in a table, So how to read empty fields from a pdf file using PDFBOX in C#. Is there any other method to read the pdf file .
Thanks .
You might be able to pull off this sort of thing if you know exactly where the text should be in advance and can get the locations of the text as you extract it.
If you don't know in advance where the rows and cells are, you'll have to guess based on the text locations. This will not be easy.
In general, extracting data from PDF is ill advised. PDFs don't have a concept of "tables" (unless the PDF creator goes well out of there way to use "Marked Content", which is still rare). PDFs have lines, glyphs, and images (a pile of pixels). It is Very Hard to extract formatting from that information... and sometimes it is all but impossible.
I don't know if PDFBox will give you the locations of extracted text, but iTextSharp will.
I'm currently using Aspose PDF Kit to split a 'master PDF' up into individual documents + thumbnails. This works well at the moment, but the device I'll be rendering the PDF on won't know about the annotations/links within the PDF.
I understand there is a way to parse the PDF document to detect the X/Y position of a hyperlink etc, is there an simple way to extract/iterate across the document data so I can write it to an external XML file?
You may want to try Docotic.Pdf library for this (disclaimer: I work for Bit Miracle).
The library can be used to retrieve all hyperlinks in a document. You may retrieve bounding box, text and other properties of a link, too.
Please take a look at "Extract text from link target" sample. It may help you to get started.
I'm using itextsharp to generate the PDFs, but I need to change some text dynamically.
I know that it's possible to change if there's any AcroField, but my PDF doen's have any of it. It just has some pure texts and I need to change some of them.
Does anyone know how to do it?
Actually, I have a blog post on how to do it! But like IanGilham said, it depends on whether you have control over the original PDF. The basic idea is you setup a form on the page and replace the form fields with the text you want. (You can style the form so it doesn't look like a form)
If you don't have control over the PDF, let me know how to do it!
Here is a link to the full post:
Using a template to programmatically create PDFs with C# and iTextSharp
I haven't used itextsharp, but I have been using PDFNet SDK to explore the content of a large pile of PDFs for localisation over the last few weeks.
I would say that what you require is absolutely achievable, but how difficult it is will depend entirely on how much control you have over the quality of the files. In my case, the files can be constructed from any combination of images, text in any random order, tables, forms, paths, single pixel graphics and scanned pages, some of which are composed from hundreds of smaller images. Let's just say we're having fun with it.
In the PDFTron way of doing things, you would have to implement a viewer (sample available), and add some code over a text selection. Given the complexities of the format, it may be necessary to implement a simple editor in a secondary dialog with the ability to expand the selection to the next line (or whatever other fundamental object is used to make up text). The string could then be edited and applied by copying the entire page of the document into a new page, replacing the selected elements with your new string. You would probably have to do some mathematics to get this to work well though, as just about everything in PDF is located on the page by means of an affine transform.
Good luck. I'm sure there are people on here with some experience of itextsharp and PDF in general.
This question comes up from time to time on the mailing list. The same answer is given time and time again - NO. See this thread for the official answer from the person who created iText.
This question should be a FAQ on the itextsharp tag wiki.