I am creating a program that, at the moment, is using locationtextextractionstrategy() from iTextSharp to retrieve the text inside a pdf. This works great except for how long it takes for some of the pdfs I have to run through the program. So I was curious if there was a way to only pull a certain amount of text from a pdf instead of pulling it all. Maybe use coordinates to specify a region and then extract the text from the specified area. Any input would be much appreciated, thank you!
Related
I need to verify that the pdf report is text based (and not bitmap based; however it could contain some images). I do not need to extract the text, just to verify that it is text based.
Is there a way how to perform such a verification using ITextSharp library?
Thanks in advance,
Stefan
You can look for text drawing commands easily enough. The least work on your part would be to try to extract the text and see if anything is there. Ideally you'd know some of the text it should contain and search for it. A single sentence or phrase would be plenty for this sort of testing.
Text extraction with iText is pretty trivial these days. Lots of examples floating around SO, and the web.
I am able to read a pdf file using PDFBOX in my ASP.net application but it is not adding space for an empty cell in a table, So how to read empty fields from a pdf file using PDFBOX in C#. Is there any other method to read the pdf file .
Thanks .
You might be able to pull off this sort of thing if you know exactly where the text should be in advance and can get the locations of the text as you extract it.
If you don't know in advance where the rows and cells are, you'll have to guess based on the text locations. This will not be easy.
In general, extracting data from PDF is ill advised. PDFs don't have a concept of "tables" (unless the PDF creator goes well out of there way to use "Marked Content", which is still rare). PDFs have lines, glyphs, and images (a pile of pixels). It is Very Hard to extract formatting from that information... and sometimes it is all but impossible.
I don't know if PDFBox will give you the locations of extracted text, but iTextSharp will.
I'm currently using Aspose PDF Kit to split a 'master PDF' up into individual documents + thumbnails. This works well at the moment, but the device I'll be rendering the PDF on won't know about the annotations/links within the PDF.
I understand there is a way to parse the PDF document to detect the X/Y position of a hyperlink etc, is there an simple way to extract/iterate across the document data so I can write it to an external XML file?
You may want to try Docotic.Pdf library for this (disclaimer: I work for Bit Miracle).
The library can be used to retrieve all hyperlinks in a document. You may retrieve bounding box, text and other properties of a link, too.
Please take a look at "Extract text from link target" sample. It may help you to get started.
I have a few pdf files that were created from word or excel files.
I need to get the information thats in the tables.
The text in the document is not an image so I'm able to extract the text using tools such as pdfbox.
When I have the text I have no way of knowing what cells in the table it belongs to because I don't know where the table borders are.
Iv'e tried a few desktop tools such as abby or solid pdf converter and they are able to convert the files into nice word documents but this doesn't suit my needs as I want to be able to do this programatticly in C#.
Some of the tables have nested tables wich I think makes this a little bit more diffucult.
I appreciate your help
The difficulty here is caused by the fact that the text in the PDF is not contained within any table. It might look like it is, but underneath the surface, it is not.
So there are a couple of options that I can think of. But none of them are going to be quite as satisfying as you'd probably like.
There are some companies that offer SDKs for PDF to Excel/Word conversion. Investintech and Iceni are a couple of examples. But these solutions are not free.
If you know the exact layout of the PDF files that you need to extract the table data from, then you can use any SDK that lets you extract text from a PDF and also tells you the exact co-ordinates of the extracted text. Using this method you need to know in advance where the text is going to be, so that you can extract text from a specific area on the page. It obviously won't work if you need to process any random document.
It's a difficult task, but hopefully this will give you a starting point.
I'm using itextsharp to generate the PDFs, but I need to change some text dynamically.
I know that it's possible to change if there's any AcroField, but my PDF doen's have any of it. It just has some pure texts and I need to change some of them.
Does anyone know how to do it?
Actually, I have a blog post on how to do it! But like IanGilham said, it depends on whether you have control over the original PDF. The basic idea is you setup a form on the page and replace the form fields with the text you want. (You can style the form so it doesn't look like a form)
If you don't have control over the PDF, let me know how to do it!
Here is a link to the full post:
Using a template to programmatically create PDFs with C# and iTextSharp
I haven't used itextsharp, but I have been using PDFNet SDK to explore the content of a large pile of PDFs for localisation over the last few weeks.
I would say that what you require is absolutely achievable, but how difficult it is will depend entirely on how much control you have over the quality of the files. In my case, the files can be constructed from any combination of images, text in any random order, tables, forms, paths, single pixel graphics and scanned pages, some of which are composed from hundreds of smaller images. Let's just say we're having fun with it.
In the PDFTron way of doing things, you would have to implement a viewer (sample available), and add some code over a text selection. Given the complexities of the format, it may be necessary to implement a simple editor in a secondary dialog with the ability to expand the selection to the next line (or whatever other fundamental object is used to make up text). The string could then be edited and applied by copying the entire page of the document into a new page, replacing the selected elements with your new string. You would probably have to do some mathematics to get this to work well though, as just about everything in PDF is located on the page by means of an affine transform.
Good luck. I'm sure there are people on here with some experience of itextsharp and PDF in general.
This question comes up from time to time on the mailing list. The same answer is given time and time again - NO. See this thread for the official answer from the person who created iText.
This question should be a FAQ on the itextsharp tag wiki.