Extract data from nested tables in PDF

I have a few PDF files that were created from Word or Excel files.
I need to get the information that's in the tables.
The text in the document is not an image, so I'm able to extract it using tools such as PDFBox.
Once I have the text, I have no way of knowing which table cell it belongs to, because I don't know where the table borders are.
I've tried a few desktop tools such as ABBYY and Solid PDF Converter. They convert the files into nice Word documents, but this doesn't suit my needs, as I want to do this programmatically in C#.
Some of the tables have nested tables, which I think makes this a little more difficult.
I appreciate your help.

The difficulty here is caused by the fact that the text in the PDF is not contained within any table. It might look like it is, but underneath the surface, it is not.
So there are a couple of options that I can think of. But none of them are going to be quite as satisfying as you'd probably like.
There are some companies that offer SDKs for PDF to Excel/Word conversion. Investintech and Iceni are a couple of examples. But these solutions are not free.
If you know the exact layout of the PDF files that you need to extract the table data from, then you can use any SDK that lets you extract text from a PDF and also tells you the exact co-ordinates of the extracted text. Using this method you need to know in advance where the text is going to be, so that you can extract text from a specific area on the page. It obviously won't work if you need to process any random document.
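As a rough illustration of the fixed-layout approach, here is a minimal sketch. It assumes your SDK can hand you each piece of text together with its page coordinates; the TextFragment and Region types and the ReadCell helper are made up for this example and are not part of any particular library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of what a text-extraction SDK might return:
// a piece of text plus the position of its lower-left corner on the page.
public record TextFragment(string Text, double X, double Y);

// A rectangular area on the page, in the same units as the fragments.
public record Region(double Left, double Bottom, double Right, double Top)
{
    public bool Contains(TextFragment f) =>
        f.X >= Left && f.X <= Right && f.Y >= Bottom && f.Y <= Top;
}

public static class RegionExtractor
{
    // Returns the text found inside one known cell area,
    // read top to bottom (PDF y grows upwards), then left to right.
    public static string ReadCell(IEnumerable<TextFragment> fragments, Region cell) =>
        string.Join(" ", fragments
            .Where(cell.Contains)
            .OrderByDescending(f => f.Y)
            .ThenBy(f => f.X)
            .Select(f => f.Text));
}
```

You would define one Region per table cell by measuring a sample document once, then call ReadCell for each of them on every file that shares that layout.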
It's a difficult task, but hopefully this will give you a starting point.

Related

Can I use templates from a SQL table and fill them with information saved in another table using WinForms? How can I let the user download it as a PDF file?

I am creating a template for a CV. It is stored in a database table as an image in a VARBINARY(MAX) column. I am going to load the templates into WinForms so that I can add labels on the template and fill the information into the labels. The information for the template is saved in another table of the database.
Now I need to fill the information into my templates using WinForms and then save them on the device, choosing PDF in the save dialog.
What can I use for this? Is there any tool that can help me do so?

You are asking quite a few different questions here, without giving too much detail. I'll try to give some guidance, but without a lot more detail, it's hard to be specific.
First, the question doesn't have anything to do with WinForms, what you are asking would be the same in pretty much any approach, so I'll ignore WinForms.
If you have the template in a database table, presumably as plain text (which could of course be HTML, PostScript or any other markup language) with placeholders, and you have the data in another table, then you can simply merge the two together to produce the CV in plain text.
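To make the merge step concrete, here is a minimal sketch. The {{FieldName}} placeholder syntax and the TemplateMerger class are assumptions chosen for illustration; use whatever marker style your templates actually contain.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TemplateMerger
{
    // Matches placeholders such as {{FullName}}.
    // The {{...}} syntax is an assumption made for this example.
    private static readonly Regex Placeholder = new Regex(@"\{\{(\w+)\}\}");

    // Replaces every placeholder that has a value; unknown placeholders
    // are left in place so that missing data is easy to spot.
    public static string Merge(string template, IReadOnlyDictionary<string, string> values) =>
        Placeholder.Replace(template, m =>
            values.TryGetValue(m.Groups[1].Value, out var v) ? v : m.Value);
}
```

The template text would come from one table and the dictionary from a row of the other; the merged string is what you then hand to the PDF library.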
Once you have the merged CV, converting it to PDF is best done with a library. It's not the sort of thing you want to try doing yourself. Which library you use depends a lot on your requirements and budget. There are one or two free ones, but I can't say how good they are, as I use a commercial one. I'm not making any specific recommendations here, but a bit of searching and experimenting should get you what you want.
Hope that answers your questions. If not, please edit your question to include more specific information.

Mail merge or merge-like functionality from C#

I need to print a few thousand stickers with a few text fields (name, position, etc) as well as a barcode image.
Each staff member gets two unique stickers, and the sticker paper has 4 per sheet so that's 2 staff per sheet.
I already have all the code to generate the barcode as an Image, and the staff details are stored in a List of object.
If possible, I'd like to avoid using MSWord directly since my development environment is quite different from the target environment and I've had issues in the past from the disparity. (Win7-64, MSOffice2010 vs. WinXP-32, MSOffice2003).
What's the best way to accomplish this?
If I save the document as an XML format and replace the mail merge fields with unique tokens which I can replace with my actual values (and I can even replace the binary image data with base-64 encoded image bytes) then that works but it's clunky. For starters, I'd have to save the XML file and then somehow print it transparent to the user (don't want Word showing up). Also, the XML template is 1 page, but I might have several dozen to print. I can send each page to the printer individually but that's not exactly ideal.
Any other suggestions?

I would use DevExpress XtraReports, as I have used it in the past in similar scenarios with great results. If you prefer another engine such as Crystal or Telerik, it works the same way: it is as easy as dragging some fields into the page details section and assigning your object list as the data source. DevExpress also has a RichTextBox with a built-in mail merge feature. Finally, if you decide on Word, do not forget that you can automate it while keeping it invisible, so users won't see it.
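Whichever engine ends up rendering the page, the staff list first has to be split into sheets. A minimal sketch of that step, using the numbers from the question (2 stickers per person, 4 per sheet); the Staff and Sticker types and the StickerSheets class are hypothetical names for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Staff(string Name, string Position);

// One printed sticker: who it belongs to and which of their copies it is.
public record Sticker(Staff Owner, int Copy);

public static class StickerSheets
{
    public const int StickersPerStaff = 2;
    public const int StickersPerSheet = 4;

    // Expands each staff member into their stickers, then cuts the
    // resulting list into sheets of at most StickersPerSheet entries.
    public static List<List<Sticker>> Layout(IEnumerable<Staff> staff)
    {
        var stickers = staff
            .SelectMany(s => Enumerable.Range(1, StickersPerStaff)
                                       .Select(copy => new Sticker(s, copy)))
            .ToList();

        var sheets = new List<List<Sticker>>();
        for (int i = 0; i < stickers.Count; i += StickersPerSheet)
        {
            int count = Math.Min(StickersPerSheet, stickers.Count - i);
            sheets.Add(stickers.GetRange(i, count));
        }
        return sheets;
    }
}
```

The list of sheets can then be bound as the data source of a report, or rendered as the pages of a single print job, so the whole batch goes to the printer at once.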

Use iTextsharp to edit pdf template without Acrofields

I have a PDF template without AcroFields and I need to replace text in it. The text is formatted like this: ((aFieldToReplace)), but there are also tables that need to be filled with n rows.
Is there any good tutorial, resource or sample to be found?
"Is there a way to replace a text in a PDF file with itextsharp?" asks more or less the same question, but the answer ignores the "no AcroFields" part of the question.
EDIT:
To make it even harder, I have multiple templates that I can use. The templates all have their own formatting style (font, color, ...).
EDIT 2:
The purpose is to create a report from data in a database. The data in the database comes from several forms in an ASP.NET MVC application.
The report could have several layouts depending on the chosen template.
Templates should be addable dynamically, so I can't create the layout from scratch. I really need to get the layout from a template.

Quoting the excellent iText in Action:
In a PDF document, every character or glyph on a PDF page has its fixed position, regardless of the application that’s used to view the document.
[…]
Suppose you want to replace the word “edit” with the word “manipulate” in a sentence, you’d have to reflow the text. You’d have to reposition all the characters that follow that word. Maybe you’d even have to move a portion of the text to the next page. That’s not trivial, if not impossible.
[…]
Don’t expect any tool to be able to edit a PDF file the same way you’d edit a Word document.
PDF is a document display format. If you want templating you'll probably have to use something else.

@Frederiek:
If you can spend a bit of money, this will do exactly what you want. Check out the demo, it's quite cool. It can reflow the text, replace images, etc. Quite nice.
http://www.iceni.com/infixServer.htm
Let me know if that works for you.

Reading PDF file to get tabular data in structured format

I have to read a PDF file which contains a table with several columns. Using iTextSharp I am able to read the file, but I get a bunch of unformatted text. I am not able to structure the data so that I can insert it into a database.
Any suggestions?

Unless it's structured text, there is no tagging to show columns. Tools like PDFBox make 'guesses' to try to extract the table.
There is an article explaining why text extraction is so hard at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text

If I understand it correctly, PDF text is stored positionally, so it has no concept of rows or columns. That means you have to use heuristics based on the "likelihood" that you're reading from a different column.
You can try doing this by comparing the amount of space between the words. (I'm not familiar with the iTextSharp interface, so please forgive me if I'm mentioning things it's not capable of. I'm mostly familiar with PDFNet.)
Another idea that just came to me applies if the text has visual cues, such as vertical lines separating the columns. If that's the case, you should be able to come up with heuristics to determine whether the text is left or right of the column lines.
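A minimal sketch of the spacing idea, assuming you can get each word's x position and width from your library; the Word type, the GapColumns class and the threshold value are all made up for this example, and the threshold would need tuning per document:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical word box: text, left edge and width in page units.
public record Word(string Text, double X, double Width);

public static class GapColumns
{
    // Splits the words of one text line into cells wherever the horizontal
    // gap between consecutive words is larger than gapThreshold.
    public static List<string> SplitLine(IEnumerable<Word> line, double gapThreshold)
    {
        var cells = new List<string>();
        var current = new List<string>();
        double? previousRight = null;

        foreach (var w in line.OrderBy(w => w.X))
        {
            if (previousRight is double right && w.X - right > gapThreshold)
            {
                // A wide gap: close the current cell and start a new one.
                cells.Add(string.Join(" ", current));
                current.Clear();
            }
            current.Add(w.Text);
            previousRight = w.X + w.Width;
        }

        if (current.Count > 0) cells.Add(string.Join(" ", current));
        return cells;
    }
}
```

This only handles a single line at a time and will misfire on empty cells or justified text, so treat it as a starting point rather than a general solution.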
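If you can recover the x positions of those vertical lines, assigning a column is just a matter of counting how many lines lie to the left of the text. A small sketch; the RuledColumns class is a hypothetical name, and how you obtain the separator positions depends entirely on your library:

```csharp
using System;
using System.Collections.Generic;

public static class RuledColumns
{
    // Returns the zero-based column for a piece of text at position x,
    // given the x positions of the vertical separator lines.
    // Text left of the first line is column 0, and so on.
    public static int ColumnIndex(double x, IReadOnlyList<double> separatorXs)
    {
        int index = 0;
        foreach (var separator in separatorXs)
        {
            if (x > separator) index++;
        }
        return index;
    }
}
```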
...
However the best thing to do, if possible, is to get ahold of the data in a more database friendly format. This will likely save heartaches in the long run.
-- Jason

I am concluding there is no straightforward way to do this, at least not for reading the data in tabular format. I tried the suggestions provided by Mark, but they do not seem feasible for my requirement.

Is there a way to replace a text in a PDF file with itextsharp?

I'm using iTextSharp to generate the PDFs, but I need to change some text dynamically.
I know that it's possible to change text if there's an AcroField, but my PDF doesn't have any. It just has some plain text and I need to change some of it.
Does anyone know how to do it?

Actually, I have a blog post on how to do it! But like IanGilham said, it depends on whether you have control over the original PDF. The basic idea is that you set up a form on the page and replace the form fields with the text you want. (You can style the form so it doesn't look like a form.)
If you don't have control over the PDF, let me know how to do it!
Here is a link to the full post:
Using a template to programmatically create PDFs with C# and iTextSharp

I haven't used itextsharp, but I have been using PDFNet SDK to explore the content of a large pile of PDFs for localisation over the last few weeks.
I would say that what you require is absolutely achievable, but how difficult it is will depend entirely on how much control you have over the quality of the files. In my case, the files can be constructed from any combination of images, text in any random order, tables, forms, paths, single pixel graphics and scanned pages, some of which are composed from hundreds of smaller images. Let's just say we're having fun with it.
In the PDFTron way of doing things, you would have to implement a viewer (sample available), and add some code over a text selection. Given the complexities of the format, it may be necessary to implement a simple editor in a secondary dialog with the ability to expand the selection to the next line (or whatever other fundamental object is used to make up text). The string could then be edited and applied by copying the entire page of the document into a new page, replacing the selected elements with your new string. You would probably have to do some mathematics to get this to work well though, as just about everything in PDF is located on the page by means of an affine transform.
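For reference, the mathematics mentioned above is small. A PDF transformation matrix is written as six numbers [a b c d e f] and maps a point (x, y) to (a*x + c*y + e, b*x + d*y + f). The sketch below only illustrates that formula; the PdfMatrix type is a made-up name and is not the PDFNet API.

```csharp
using System;

// The six numbers [a b c d e f] of a PDF transformation matrix.
public readonly struct PdfMatrix
{
    public double A { get; }
    public double B { get; }
    public double C { get; }
    public double D { get; }
    public double E { get; }
    public double F { get; }

    public PdfMatrix(double a, double b, double c, double d, double e, double f)
    {
        A = a; B = b; C = c; D = d; E = e; F = f;
    }

    // Maps a point through the matrix:
    // x' = a*x + c*y + e,  y' = b*x + d*y + f
    public (double X, double Y) Apply(double x, double y) =>
        (A * x + C * y + E, B * x + D * y + F);

    // A pure shift by (tx, ty).
    public static PdfMatrix Translation(double tx, double ty) =>
        new PdfMatrix(1, 0, 0, 1, tx, ty);
}
```

For example, new PdfMatrix(2, 0, 0, 2, 10, 20).Apply(3, 4) scales the point by 2 and then shifts it, giving (16, 28).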
Good luck. I'm sure there are people on here with some experience of itextsharp and PDF in general.

This question comes up from time to time on the mailing list. The same answer is given time and time again - NO. See this thread for the official answer from the person who created iText.
This question should be a FAQ on the itextsharp tag wiki.
