Reading PDF file to get tabular data in structured format - c#

I have to read a PDF file which contains a table with several columns. Using iTextSharp I am able to read the file, but I get a bunch of unformatted text. I am not able to structure the data so that I can insert it into a database.
Any suggestions?

Unless it's structured text, there is no tagging to show columns. Tools like PDFBox make 'guesses' to try to extract the table.
There is an article explaining why text extraction is so hard at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text

If I understand it correctly, PDF text is stored positionally, so it has no concept of rows or columns. That means you have to use heuristics based on the "likelihood" that you're reading from a different column.
You can try doing this by comparing the amount of space between the words. (I'm not familiar with the iTextSharp interface, so please forgive me if I'm mentioning things it's not capable of; I'm mostly familiar with PDFNet.)
Another idea that just came to me: the text may have visual cues, such as vertical lines separating the columns. If that's the case, you should be able to come up with heuristics to determine whether the text is left or right of the column lines.
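To make the spacing heuristic concrete, here is a minimal sketch in C#. It assumes the words of one line have already been extracted together with their horizontal positions. The Word class, GapSplitter and the columnGap threshold are hypothetical names invented for illustration; they are not part of iTextSharp or PDFNet.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape for one extracted word: its text plus left/right edges.
public sealed class Word
{
    public string Text { get; }
    public double Left { get; }
    public double Right { get; }

    public Word(string text, double left, double right)
    {
        Text = text;
        Left = left;
        Right = right;
    }
}

public static class GapSplitter
{
    // Splits the words of a single line into cells. A horizontal gap wider
    // than columnGap starts a new cell; narrower gaps are treated as
    // ordinary spaces inside the same cell.
    public static List<string> SplitLine(IEnumerable<Word> line, double columnGap)
    {
        var cells = new List<string>();
        Word previous = null;
        foreach (var word in line.OrderBy(w => w.Left))
        {
            if (previous == null || word.Left - previous.Right > columnGap)
                cells.Add(word.Text);                       // start a new cell
            else
                cells[cells.Count - 1] += " " + word.Text;  // continue the cell
            previous = word;
        }
        return cells;
    }
}
```

The threshold is a guess that depends on font size and layout, so expect to tune it per document. Note also that an empty cell simply looks like a very wide gap here, so this approach alone cannot tell which column is missing.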
...
However, the best thing to do, if possible, is to get hold of the data in a more database-friendly format. This will likely save heartache in the long run.
-- Jason

I am concluding there is no straightforward way to do this, at least not for reading the data in tabular format. I tried the suggestions provided by Mark, but they do not seem feasible for my requirements.

Related

Can I use templates from a SQL table and fill in the information saved in another table using WinForms? How can I let the user download it as a PDF file?

I am creating a template for a CV. It is stored in a database table as an image in a VARBINARY(MAX) column. I am going to load the templates into WinForms so that I can add labels on the template and fill the information into the labels. The information for the template is saved in another table of the database.
Now I need to fill the information into my templates using WinForms and then save them on the device, choosing PDF in the save dialog.
What can I use for this? Is there any tool which can help me do so?

You are asking quite a few different questions here, without giving too much detail. I'll try to give some guidance, but without a lot more detail, it's hard to be specific.
First, the question doesn't have anything to do with WinForms, what you are asking would be the same in pretty much any approach, so I'll ignore WinForms.
If you have the template in a database table, presumably as plain text (which could of course be HTML, PostScript or any other markup language) with placeholders, and you have the data in another table, then you can simply merge the two together to produce the CV in plain text.
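As a rough illustration of that merge step, here is a minimal sketch in C#. The {{FieldName}} placeholder syntax and the TemplateMerger class are assumptions made for the example; the question does not say what the placeholders actually look like.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TemplateMerger
{
    // Matches placeholders of the assumed form {{FieldName}}.
    private static readonly Regex Placeholder = new Regex(@"\{\{(\w+)\}\}");

    // Replaces each placeholder with the matching value. Placeholders with
    // no matching key are left untouched so they are easy to spot later.
    public static string Merge(string template, IDictionary<string, string> values)
    {
        return Placeholder.Replace(template, match =>
        {
            string value;
            return values.TryGetValue(match.Groups[1].Value, out value)
                ? value
                : match.Value;
        });
    }
}
```

The template string would come from one table and the dictionary from a row of the other. If the template is HTML or another markup, the values would also need escaping for that markup before they are inserted.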
Once you have the merged CV, converting it to PDF is best done with a library. It's not the sort of thing you want to try doing yourself. Which library you use depends a lot on your requirements and budget. There are one or two free ones, but I can't say how good they are, as I use a commercial one. I'm not making any specific recommendations here, but a bit of searching and experimenting should get you what you want.
Hope that answers your questions. If not, please edit your question to include more specific information.

Mail merge or merge-like functionality from C#

I need to print a few thousand stickers with a few text fields (name, position, etc) as well as a barcode image.
Each staff member gets two unique stickers, and the sticker paper has 4 per sheet so that's 2 staff per sheet.
I already have all the code to generate the barcode as an Image, and the staff details are stored in a List of object.
If possible, I'd like to avoid using MSWord directly since my development environment is quite different from the target environment and I've had issues in the past from the disparity. (Win7-64, MSOffice2010 vs. WinXP-32, MSOffice2003).
What's the best way to accomplish this?
If I save the document as an XML format and replace the mail merge fields with unique tokens which I can replace with my actual values (and I can even replace the binary image data with base-64 encoded image bytes) then that works but it's clunky. For starters, I'd have to save the XML file and then somehow print it transparent to the user (don't want Word showing up). Also, the XML template is 1 page, but I might have several dozen to print. I can send each page to the printer individually but that's not exactly ideal.
Any other suggestions?

I would use DevExpress XtraReports, as I have used it in the past in similar scenarios with great results. If you prefer other engines like Crystal or Telerik, it is the same: as easy as dragging some fields into the page details section and assigning your object list as the data source. DevExpress also has a RichTextBox with a built-in mail merge feature. Lastly, if you decide on Word, do not forget that you can automate it while keeping it invisible, so users won't see it.

How to verify that a PDF is text based using iTextSharp?

I need to verify that the PDF report is text based (and not bitmap based; however, it could contain some images). I do not need to extract the text, just to verify that it is text based.
Is there a way to perform such a verification using the iTextSharp library?
Thanks in advance,
Stefan

You can look for text drawing commands easily enough. The least work on your part would be to try to extract the text and see if anything is there. Ideally you'd know some of the text it should contain and search for it. A single sentence or phrase would be plenty for this sort of testing.
Text extraction with iText is pretty trivial these days. Lots of examples floating around SO, and the web.
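A minimal sketch of that check in C#, kept independent of the PDF library so that it runs on its own. It assumes the per-page text has already been extracted; in iTextSharp 5.x that would typically be done with PdfTextExtractor.GetTextFromPage(reader, pageNumber). TextBasedCheck and LooksTextBased are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TextBasedCheck
{
    // pageTexts holds the text extracted from each page by a PDF library.
    // Returns false when no page yields any non-whitespace text. Otherwise
    // returns true, unless expectedPhrase is given and no page contains it
    // (compared case-insensitively).
    public static bool LooksTextBased(IEnumerable<string> pageTexts, string expectedPhrase = null)
    {
        var pages = pageTexts.Where(t => !string.IsNullOrWhiteSpace(t)).ToList();
        if (pages.Count == 0)
            return false; // nothing extractable: probably scanned images

        if (string.IsNullOrEmpty(expectedPhrase))
            return true;

        return pages.Any(t => t.IndexOf(expectedPhrase, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

Keep in mind that a scanned document with an OCR text layer would also pass this test, so searching for a known phrase is the more reliable variant.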

How to read the empty cell in a PDF file in ASP.NET

I am able to read a PDF file using PDFBox in my ASP.NET application, but it does not add a space for an empty cell in a table. How can I read empty fields from a PDF file using PDFBox in C#? Is there any other method to read the PDF file?
Thanks.

You might be able to pull off this sort of thing if you know exactly where the text should be in advance and can get the locations of the text as you extract it.
If you don't know in advance where the rows and cells are, you'll have to guess based on the text locations. This will not be easy.
In general, extracting data from PDF is ill advised. PDFs don't have a concept of "tables" (unless the PDF creator goes well out of their way to use "Marked Content", which is still rare). PDFs have lines, glyphs, and images (a pile of pixels). It is Very Hard to extract formatting from that information... and sometimes it is all but impossible.
I don't know if PDFBox will give you the locations of extracted text, but iTextSharp will.
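For the case where the column positions are known in advance, the idea could be sketched in C# like this. It assumes the library reports each piece of text with an x/y position and that y grows upward, as in PDF user space. PositionedText, GridBuilder, columnStarts and rowTolerance are hypothetical names, not iTextSharp or PDFBox API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape for a piece of text with its position on the page.
public sealed class PositionedText
{
    public string Text { get; }
    public double X { get; }
    public double Y { get; }

    public PositionedText(string text, double x, double y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class GridBuilder
{
    // columnStarts: left edge of each column, ascending (known in advance).
    // rowTolerance: maximum Y distance from the first chunk of a row for a
    // chunk to be counted as part of that row.
    public static List<string[]> BuildRows(
        IEnumerable<PositionedText> chunks, double[] columnStarts, double rowTolerance)
    {
        // 1. Group chunks into rows, reading from the top of the page down.
        var rowGroups = new List<List<PositionedText>>();
        double rowY = 0;
        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            if (rowGroups.Count == 0 || Math.Abs(chunk.Y - rowY) > rowTolerance)
            {
                rowGroups.Add(new List<PositionedText>());
                rowY = chunk.Y;
            }
            rowGroups[rowGroups.Count - 1].Add(chunk);
        }

        // 2. Place each chunk in its column; cells nothing lands in stay "".
        var rows = new List<string[]>();
        foreach (var group in rowGroups)
        {
            var cells = Enumerable.Repeat("", columnStarts.Length).ToArray();
            foreach (var chunk in group.OrderBy(c => c.X))
            {
                int column = ColumnIndex(chunk.X, columnStarts);
                cells[column] = cells[column].Length == 0
                    ? chunk.Text
                    : cells[column] + " " + chunk.Text;
            }
            rows.Add(cells);
        }
        return rows;
    }

    // Index of the last column whose left edge is at or before x (0 if none).
    private static int ColumnIndex(double x, double[] columnStarts)
    {
        int index = 0;
        for (int i = 0; i < columnStarts.Length; i++)
            if (x >= columnStarts[i]) index = i;
        return index;
    }
}
```

Because every row starts out as an array of empty strings, a cell with no text in the PDF comes back as an empty string instead of disappearing, which is the behaviour the question asks for.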

Extract data from nested tables in PDF

I have a few PDF files that were created from Word or Excel files.
I need to get the information that's in the tables.
The text in the document is not an image, so I'm able to extract the text using tools such as PDFBox.
When I have the text, I have no way of knowing which cells in the table it belongs to, because I don't know where the table borders are.
I've tried a few desktop tools such as ABBYY or Solid PDF Converter, and they are able to convert the files into nice Word documents, but this doesn't suit my needs as I want to be able to do this programmatically in C#.
Some of the tables have nested tables, which I think makes this a little bit more difficult.
I appreciate your help.

The difficulty here is caused by the fact that the text in the PDF is not contained within any table. It might look like it is, but underneath the surface, it is not.
So there are a couple of options that I can think of. But none of them are going to be quite as satisfying as you'd probably like.
There are some companies that offer SDKs for PDF to Excel/Word conversion. Investintech and Iceni are a couple of examples. But these solutions are not free.
If you know the exact layout of the PDF files that you need to extract the table data from, then you can use any SDK that lets you extract text from a PDF and also tells you the exact co-ordinates of the extracted text. Using this method you need to know in advance where the text is going to be, so that you can extract text from a specific area on the page. It obviously won't work if you need to process any random document.
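A minimal sketch of that region-based approach in C#, assuming the SDK reports each text fragment with page coordinates and that the origin is at the bottom-left of the page. Region, Fragment and RegionExtractor are hypothetical names for illustration, and the rectangles would have to be measured from a sample document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical named rectangle in page coordinates (origin bottom-left).
public sealed class Region
{
    public string Name { get; }
    public double Left { get; }
    public double Bottom { get; }
    public double Right { get; }
    public double Top { get; }

    public Region(string name, double left, double bottom, double right, double top)
    {
        Name = name;
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }

    public bool Contains(double x, double y)
    {
        return x >= Left && x <= Right && y >= Bottom && y <= Top;
    }
}

// Hypothetical positioned text fragment as reported by a PDF SDK.
public sealed class Fragment
{
    public string Text { get; }
    public double X { get; }
    public double Y { get; }

    public Fragment(string text, double x, double y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class RegionExtractor
{
    // Returns, for each named region, the text of the fragments inside it,
    // read top to bottom and then left to right.
    public static Dictionary<string, string> Extract(
        IEnumerable<Fragment> fragments, IEnumerable<Region> regions)
    {
        var all = fragments.ToList();
        return regions.ToDictionary(
            r => r.Name,
            r => string.Join(" ", all
                .Where(f => r.Contains(f.X, f.Y))
                .OrderByDescending(f => f.Y)
                .ThenBy(f => f.X)
                .Select(f => f.Text)));
    }
}
```

For nested tables, each inner cell would simply get its own region. As noted above, this only works while every document follows the same layout.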
It's a difficult task, but hopefully this will give you a starting point.

We Keep Coding

