I am trying to build an application that converts a PDF to an Excel file with C#.
I have searched for a library to help me with this, but most of them are commercially licensed, so I ended up with iTextSharp.dll.
It's good that it is free, but I rarely find good open-source documentation for it.
These are some of the links that I have read:
https://yoda.entelect.co.za/view/9902/extracting-data-from-pdf-files
https://www.mikesdotnetting.com/article/80/create-pdfs-in-asp-net-getting-started-with-itextsharp
http://www.thedevelopertips.com/DotNet/ASPDotNet/Read-PDF-and-Convert-to-Stream.aspx?id=34
There are more, but most of them do not really explain what the code does.
So this is the most common code for iText with C#:
StringBuilder text = new StringBuilder(); // my new file that will have pdf content?
PdfReader pdfReader = new PdfReader(myPath); // This maybe how IText read the pdf?
for (int page = 1; page <= pdfReader.NumberOfPages; page++) // looping for read all content in pdf?
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy(); // ?
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy); // ?
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText))); // maybe how IText convert the data to text?
text.Append(currentText); // maybe the full content?
}
pdfReader.Close(); // to close the PdfReader?
As you can see, I still do not have a clear understanding of the iText code that I have. Please tell me whether my understanding is correct, and explain the parts of the code that I still do not understand.
Thank You.
Let me start by explaining a bit about PDF.
PDF is not a 'what you see is what you get'-format.
Internally, PDF is more like a file containing instructions for rendering software. Unless you are working with a tagged PDF file, a PDF document does not naturally have a concept of 'paragraph' or 'table'.
If you open a PDF in Notepad, for instance, you might see something like:
7 0 obj
<</BaseFont/Helvetica-Oblique/Encoding/WinAnsiEncoding/Subtype/Type1/Type/Font>>
endobj
Instructions in the document get gathered into 'objects' and objects are numbered, and can be cross-referenced.
As Bruno already indicated in the comments, this means that finding out what a table is, or what the content of a table is, can be really hard.
The PDF document itself can only tell you things like:
object 8 is a line from [50, 100] to [150, 100]
object 125 is a piece of text, in font Helvetica, at position [50, 110]
With the iText core library you can
get all of these objects (which iText calls PathRenderInfo, TextRenderInfo and ImageRenderInfo objects)
get the graphics state when the object was rendered (which font, font-size, color, etc)
This can allow you to write your own parsing logic.
For instance:
gather all the PathRenderInfo objects
remove everything that is not a perfect horizontal or vertical line
make clusters of everything that intersects at 90 degree angles
if a cluster contains more than a given threshold of lines, consider it a table
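The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: Segment is a hypothetical plain struct standing in for the line data you would collect from PathRenderInfo objects (it is not an iText type), and the tolerance eps and the threshold minLines are values you would have to tune for your own documents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for line data gathered from PathRenderInfo objects;
// this is not an iText type.
public readonly struct Segment
{
    public readonly double X1, Y1, X2, Y2;
    public Segment(double x1, double y1, double x2, double y2) { X1 = x1; Y1 = y1; X2 = x2; Y2 = y2; }
    public bool IsHorizontal(double eps) => Math.Abs(Y1 - Y2) <= eps;
    public bool IsVertical(double eps) => Math.Abs(X1 - X2) <= eps;
}

public static class TableFinder
{
    // Returns one list of segments per cluster that has at least minLines members.
    public static List<List<Segment>> FindTables(IEnumerable<Segment> segments, int minLines, double eps = 0.5)
    {
        // Step 1: drop everything that is not a horizontal or vertical line.
        var lines = segments.Where(s => s.IsHorizontal(eps) || s.IsVertical(eps)).ToList();

        // Step 2: union-find over lines that meet at a right angle.
        var parent = Enumerable.Range(0, lines.Count).ToArray();
        int Find(int i)
        {
            while (parent[i] != i) { parent[i] = parent[parent[i]]; i = parent[i]; }
            return i;
        }

        for (int i = 0; i < lines.Count; i++)
            for (int j = i + 1; j < lines.Count; j++)
                if (Crosses(lines[i], lines[j], eps))
                    parent[Find(i)] = Find(j);

        // Step 3: a cluster with enough lines is treated as a table candidate.
        return Enumerable.Range(0, lines.Count)
            .GroupBy(i => Find(i))
            .Where(g => g.Count() >= minLines)
            .Select(g => g.Select(i => lines[i]).ToList())
            .ToList();
    }

    // True when one segment is horizontal, the other vertical, and they touch.
    private static bool Crosses(Segment a, Segment b, double eps)
    {
        if (a.IsHorizontal(eps) && b.IsVertical(eps)) return Touch(a, b, eps);
        if (a.IsVertical(eps) && b.IsHorizontal(eps)) return Touch(b, a, eps);
        return false;
    }

    private static bool Touch(Segment h, Segment v, double eps)
    {
        bool xInside = v.X1 >= Math.Min(h.X1, h.X2) - eps && v.X1 <= Math.Max(h.X1, h.X2) + eps;
        bool yInside = h.Y1 >= Math.Min(v.Y1, v.Y2) - eps && h.Y1 <= Math.Max(v.Y1, v.Y2) + eps;
        return xInside && yInside;
    }
}
```

Calling TableFinder.FindTables(segments, 4) returns the clusters with four or more mutually connected ruling lines. Real documents need more than this (tables drawn as filled rectangles, borderless tables, and so on), so treat it as a starting point only.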
Luckily, the pdf2Data solution (an iText add-on) already does that kind of thing for you.
For more information go to http://pdf2data.online/
Looking for a way to convert iTextSharp.text.pdf.BarcodeQRCode to System.Drawing.Image
This is what I have so far...
public System.Drawing.Image GetQRCode(string content)
{
iTextSharp.text.pdf.BarcodeQRCode qrcode = new iTextSharp.text.pdf.BarcodeQRCode(content, 115, 115, null);
iTextSharp.text.Image img = qrcode.GetImage();
MemoryStream ms = new MemoryStream(img.OriginalData);
return System.Drawing.Image.FromStream(ms);
}
In line 3 above, using img.OriginalData returns null.
Using img.RawData on line 3 instead throws an invalid parameter error on line 4.
I've googled some code samples showing how to do what you want, and your code (the "OriginalData" approach) is basically the same: https://csharp.hotexamples.com/examples/iTextSharp.text.pdf/BarcodeQRCode/-/php-barcodeqrcode-class-examples.html .
However, I don't see how it could work. From my investigations of BarcodeQRCode#getImage it seems that OriginalData is not set while processing such a barcode, so it will always be null.
More than that, the code you mention belongs to iText 5, which is end of life and no longer maintained (with the exception of significant security fixes), so it's recommended to update to iText 7.
As for iText 7, I do see how to achieve the same in Java, since the barcode classes do have a createAwtImage method there. .NET, on the other hand, lacks such functionality, so I'd say that one unfortunately can't do it in .NET.
There are some good reasons for that. iText's Images (and a barcode can easily be converted to an iText Image object, as shown here: https://kb.itextpdf.com/home/it7kb/faq/how-to-generate-2d-barcode-as-vector-image) represent a PDF XObject. In PDF syntax, an image file (jpg, png, etc.) is an XObject with the raw image data stored inside. However, an XObject can also contain PDF syntax content (it is not just used for image files). So to render such content one needs to convert the data from PDF syntax to image syntax, which is not that easy. There are some means in Java's AWT to do so, which is why it's implemented in Java. As for .NET, since there is no out-of-the-box means to convert PDF images to System.Drawing.Image, it was decided not to implement it.
To conclude, there is another iText product, pdfRender, which allows one to convert PDF files (and you could create a page just for a barcode) to images. Perhaps you might want to play with it: https://itextpdf.com/en/products/itext-7/convert-pdf-to-image-pdfrender
While extracting text from a PDF file with iTextSharp using the code below, I get this error: "Could not find image data or EI". While debugging, I found that the error occurs on certain pages but not on all of them. On further investigation I found that there are generally two types of image in a PDF, XObject images and inline images, and that the code below cannot handle inline images. A few comments on similar posts suggested using the latest version of iTextSharp (5.5.0), which I tried, but with no luck. My basic purpose is to extract the text on the page, not the images. How can I handle the inline image, or how can I extract only the text regardless of what type of image the page contains?
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
PdfContentByte pdfData = pdfStamper.GetUnderContent(page);
LocTextExtractionStrategy its = new LocTextExtractionStrategy();
pdfData = pdfStamper.GetUnderContent(page);
string extractedTextInCurrentPage=PdfTextExtractor.GetTextFromPage(pdfReader, page, its);//In this line exception is throwing
}
Please share your PDF.
This is why:
Your PDF contains an inline image. Inline images are problematic in ISO-32000-1, but I personally saw to it that the problem will be solved in ISO-32000-2 (for PDF 2.0, to be expected in 2017).
In ISO-32000-1, an inline image starts with the BI operator, followed by some parameters. The length of the image bytes isn't one of those parameters. The actual image bytes are enclosed by an ID and an EI operator.
Software parsing PDF syntax needs to search for these operators and usually does a good job at it: find BI, then take the bytes between ID and EI. However: what to do when you encounter an image of which EI is part of the image bytes?
This hardly ever happens, but it was reported to us as a problem and we solved this in recent iText versions by converting the bytes between ID and EI to an image. If that fails, iText continues searching for the next EI. If iText doesn't find that EI operator, you get the exception you mention.
This is a cumbersome process and, being a member of the ISO committee that writes the PDF standards, I introduced a new inline image parameter into the spec: the parameter /L will inform parsers exactly how many bytes are to be expected between the ID and EI operators. At the same time, I saw to it that the recommendation of keeping inline images smaller than 4 KB became normative: in PDF 2.0, it will be illegal to have inline images with more than 4096 bytes. Of course, this doesn't help you. PDF 2.0 doesn't exist yet. My work in the ISO committee only helps to solve the problem in the long term.
In the short term, we've written a workaround that solves the problem for the PDFs that were reported to us, but apparently you've found a PDF that escapes the workaround. If you want us to solve the problem, you'll have to share the PDF.
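To make the ambiguity concrete, here is a minimal sketch of the "try each EI until the bytes decode" idea described above. It is not iText's actual code: InlineImageScanner and the isValidImage callback are hypothetical names, the callback stands in for "try to decode these bytes as an image", and the sketch assumes that EI is delimited by whitespace, which real parsers treat more leniently.

```csharp
using System;

public static class InlineImageScanner
{
    // Returns the index of the "EI" that ends the image data starting at dataStart,
    // or -1 when no candidate passes the check.
    // isValidImage is a hypothetical callback, not part of iText.
    public static int FindEndOfImage(byte[] content, int dataStart, Func<byte[], bool> isValidImage)
    {
        for (int i = dataStart; i + 1 < content.Length; i++)
        {
            if (content[i] != (byte)'E' || content[i + 1] != (byte)'I') continue;

            // Assumption: "EI" only counts as an operator when it is surrounded
            // by whitespace (or followed by the end of the data).
            bool whiteBefore = i > dataStart && IsWhite(content[i - 1]);
            bool whiteAfter = i + 2 >= content.Length || IsWhite(content[i + 2]);
            if (!whiteBefore || !whiteAfter) continue;

            // Candidate image bytes: everything up to the whitespace before "EI".
            var candidate = new byte[i - 1 - dataStart];
            Array.Copy(content, dataStart, candidate, 0, candidate.Length);
            if (isValidImage(candidate)) return i;

            // Otherwise this "EI" was part of the image bytes: keep searching.
        }
        return -1; // no usable EI found
    }

    // PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE.
    private static bool IsWhite(byte b) => b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
}
```

With an explicit length parameter such as /L, none of this guessing would be needed: a parser could simply read the stated number of bytes after ID.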
I am having trouble retrieving images and text in a PDF file at the same time. I was able to get the images and the text in a PDF file, but not at the same time (which raises the question of whether to render the image first or the text first, for example in my panel control). Maybe you can help me understand what each of the constants in PdfName means? I tried using PdfName.ALL but it returns null, whereas PdfName.RESOURCES returns ProcSet, Font and XObject. I used XObject for the images, but what are ProcSet and Font (could this be the style of the text? Is there a PdfName.TEXT for retrieving text)?
Thanks in advance.
First of all,
i am having a trouble in retrieving images and text in a pdf file at the same
for this task you should use the iText(Sharp) parser API. In iTextSharp you essentially implement IRenderListener (an interface with methods for being informed about (bitmap) images and text fragments in a content stream) and process the page contents with it:
PdfReader reader = new PdfReader(...);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
int pageNumber = [... the number of the page you are interested in; may be a loop variable ...];
IRenderListener listener = new [... your IRenderListener implementation ...];
parser.ProcessContent(pageNumber, listener);
You ask
whether to render the image first or the text first for example in my panel control
The IRenderListener methods also retrieve information on the location of the bitmap or text fragment in question.
For ideas how the text fragments may be combined in your listener, you may want to be inspired by the implementations SimpleTextExtractionStrategy or LocationTextExtractionStrategy present in iTextSharp.
If you insist on doing it manually, though...
maybe if you guys can help me define what does each constants in pdfname means?
You can find the definitions of what the names map to in the PDF specification ISO 32000-1:2008, a copy of which Adobe has made available.
when using pdfname.resources it returns procset, font and xobject. i used xobject for image, but what are procset and font (could this be the style of the text?
The contents of the page Resource Dictionaries are explained in section 7.8.3 of the specification.
does it have pdfname.text for retrieving text)?
You'll find how text is represented in page content streams and XObjects in section 9.
I have spent about 20 hours coding to produce invoices using iText in C#.
Now I want to use the same code to transform some of the tables to HTML.
Do you know if I can do this?
For instance i have this:
PdfPTable table = new PdfPTable(3);
table.DefaultCell.Border = 0;
table.DefaultCell.Padding = 3;
table.WidthPercentage = 100;
int[] widths = { 100, 200, 100};
table.SetWidths(widths);
List listOfCompanyData = (List)getCompanyData();
List listOfCumparatorDreaptaData = (List)getCumparatorDreaptaData(proformaInvoice.getCumparatorDreapta());
table.AddCell((Phrase)listOfCompanyData.Items[0]);
table.AddCell("");
table.AddCell((Phrase)listOfCumparatorDreaptaData.Items[0]);
and I want to transform this table into HTML...
Is it possible?
PDF and HTML are fundamentally different display technologies. PDF is much more complex than HTML, which is why you find so many HTML to PDF converters. The other way around is much more difficult.
iText can only convert in one direction: from HTML to PDF.
There are online converters that will take a PDF and convert it to HTML. There are also downloadable utilities.
I am not aware of any .NET library that will do this.
PDF is almost a write-only format. Any time your workflow calls for "get the data out of a PDF", you've probably screwed up.
Having said that, there are several ways to stash data within a PDF:
Form fields have no particular length limit and need not be visible. Getting form data with iText is trivial.
You can attach a file to a PDF and suck it out later, both with iText.
DocInfo fields. You can stuff a string into one of the author/title/keywords/etc metadata fields. An ugly hack, but effective.
XML metadata. The "new-fangled" metadata is stored in an XML schema. You can put pretty much whatever you want in there... though iText regenerates some of it every time it makes changes (mod date and such).
Custom keys/values. You can tack any old key/value pairs you like into any old dictionary within a PDF. Adobe would like you to register a company-specific prefix for your custom tags to avoid collisions, but I've never felt the need.
From the book iText in Action it seems that this is doable using the original Java library, but it seems that it is no longer ported in the C# library. I'm pretty sure it was in version 4 :-/
Try looking at some old source here: http://www.koders.com/csharp/fid60B0985D3A89152128B73F54EDD4EB5420A5E4D8.aspx?s=%22Ken+Auer%22
nFOP + XSLT + XML = pdf | doc | HTML
nfop.sourceforge.net/article.html should give you an idea of how to use it. You need the "Microsoft Visual J# .NET Redistributable Package" to run nFOP.
open source no cost :)
Just wondering if anyone could tell me of a simple way to create files for printing? At the moment I'm just scripting HTML, but I'm wondering if there isn't some easier way of doing it that would give me more control over what is being printed? Something along the lines of an Access printout, or Excel printout, where I could decide how to lay things out and almost "mail merge" the details in via programming.
Basically, I want to create something for print that can have tables encasing it, and could be longer or shorter for each record depending upon the number of foreign keys (e.g. one staff member could have 10 jobs today, or just 3. I want to create a document that will generate and print).
Any ideas/advice/opinions? Thank you!
EDIT: Wow, thanks for all the responses! For this particular task, FlowDocuments seems to be the closest to what I'm actually after so I'll play with that. Either way I have several really good options now.
EDIT 2: After some playing, iTextSharp has become the choice for me. For anyone wondering in the future, here is a link to a great and simple tutorial: http://www.mikesdotnetting.com/Category/20
Thanks again!
I would create a PDF file which can be viewed just about anywhere and will maintain formatting. Take a look here: http://itextsharp.sourceforge.net/
There's always FlowDocuments. Check out the overview at MSDN
http://msdn.microsoft.com/en-us/library/aa970909.aspx and see if they match what you want to do. They're pretty easy to print and can be serialized to xaml. Might not be exactly what you're after, but they're pretty useful.
We are currently using PDFSharp with great success -
http://www.pdfsharp.com/PDFsharp/
GDI+ or WPF ... all .NET, not COM or interop.
Oh, and it's open source. Here is some sample code -
http://www.pdfsharp.net/wiki/PDFsharpSamples.ashx
If you use a PDF or XPS generator, it still requires you to define the document composition very much like scripting your HTML, so I don't see that it gives you much more value, other than that the created file is in a print-ready format.
What you need is something that lets you design a template and just fill in the blanks, so I suggest that you either go for Word or Excel automation, or otherwise look at a lightweight report generation library. I came across this one, and maybe it is worth checking out too.
http://www.fyireporting.com/
Like David, I also recommend iTextSharp ;) It's really easy to create a PDF document with it ;) I use it in an ASP.NET project. It has many options for formatting a PDF file; in my example I use only the basic ones ;)
Example:
string file = @"d:\print.pdf"; //path to pdf file
Document myDocument = new Document(PageSize.A4.Rotate());
PdfWriter.GetInstance(myDocument, new FileStream(file, FileMode.Create));
myDocument.Open();
//data to save in pdf- unimportant!
Opiekun obiekun = (from opiekunTmp in db.Opiekuns where opiekunTmp.idOpiekun == nalez.Dziecko.idOpiekun select opiekunTmp).SingleOrDefault();
Dziecko dzieckoZap = (from dzieckoTmp in db.Dzieckos where dzieckoTmp.idDziecko == nalez.idDziecko select dzieckoTmp).SingleOrDefault();
//some info about font
BaseFont times = BaseFont.CreateFont(BaseFont.TIMES_ROMAN, BaseFont.CP1250, BaseFont.EMBEDDED);
Font font = new Font(times, 12);
myDocument.Add(new Paragraph("--------------------------Raport opłaty--------------------------",font));
myDocument.Add(new Paragraph("Data rozliczenia: " + (((TextBox)this.GridViewOplaty.Rows[e.RowIndex].Cells[8].Controls[0]).Text), font));
myDocument.Add(new Paragraph("Płatnik: " + obiekun.Imie + " " + obiekun.Nazwisko, font));
myDocument.Add(new Paragraph("Dziecko: " + dzieckoZap.Imie + " " + dzieckoZap.Nazwisko, font));
myDocument.Add(new Paragraph(""));
myDocument.Add(new Paragraph("Data Podpis płatnika: " + obiekun.Imie + " " + obiekun.Nazwisko, font));
myDocument.Add(new Paragraph(""));
myDocument.Add(new Paragraph(" ........... ................................."));
myDocument.Close(); //close the document, which finishes writing the pdf
System.Diagnostics.Process.Start(file); //and open the file in the default viewer, if you want that ;)
I have created printouts from web sites using Excel XML format. Basically this means you don't have to use the Office APIs to generate the document (which can be cumbersome and requires extra libraries on the web server). Instead, you can just take an XML template, use XPath, LINQ to XML, or other technologies to insert your data into the template, and then stream it to the user and they can print it.
Generating the template is easy. You just use Excel to create the document and then save it in the "XML Spreadsheet" format. The XML is a bit oppressive but it isn't terrible.
Documentation on the XML Spreadsheet format is here:
http://msdn.microsoft.com/en-us/library/aa140066%28office.10%29.aspx
Note that the documentation is for Excel 2002. The format does change in newer versions of Excel, but it is backwards compatible.
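As a rough illustration of the "insert your data into the template" step with LINQ to XML, here is a minimal sketch. The SpreadsheetTemplate class and the idea of marking cells with placeholder text such as {{StaffName}} are assumptions made for this example, not part of the format; the namespace is the one the XML Spreadsheet format uses for its Workbook, Cell and Data elements.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SpreadsheetTemplate
{
    // Namespace of the Workbook/Worksheet/Cell/Data elements in an "XML Spreadsheet" file.
    private static readonly XNamespace Ss = "urn:schemas-microsoft-com:office:spreadsheet";

    // Replaces the text of every <Data> element whose content equals the placeholder.
    // Returns the number of cells that were changed.
    // The placeholder convention is an assumption of this sketch.
    public static int Fill(XDocument template, string placeholder, string value)
    {
        var cells = template.Descendants(Ss + "Data")
                            .Where(d => d.Value == placeholder)
                            .ToList();
        foreach (var cell in cells)
        {
            cell.Value = value;
        }
        return cells.Count;
    }
}
```

In a web application you would typically load the template with XDocument.Load, call Fill once per placeholder, and then write the document to the response stream with XDocument.Save so that the user can open it in Excel and print it.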
We use ActiveReports. It is very easy to define your layout and can print and export in a number of formats, pdf, rtf, excel, tiff, etc.
If you're looking for a very simple way to create docs (prob not the best, but it sure is easy), you can set up Word docs with bookmarks and insert data into the bookmarks through code, so I'm guessing this would work for Excel too (if they have bookmarks?):
EDIT: Here's a quick translation into c# (been tested in vb, but not c#):
// Requires a reference to Microsoft.Office.Interop.Word and:
// using Word = Microsoft.Office.Interop.Word;
Word.Application oWord = new Word.Application(); // Interaction.CreateObject is VB-only
oWord.Visible = false;
// Directory and strBookmark are assumed to be defined elsewhere
Word.Document oDoc = oWord.Documents.Add(Directory + "\\MyDocument.dot");
oDoc.Bookmarks["MyBookmark"].Range.Text = strBookmark; // C# indexers use square brackets
oDoc.PrintOut();
oDoc.Close(Word.WdSaveOptions.wdDoNotSaveChanges);
oDoc = null;
oWord.Quit();
oWord = null;
I've done the Word and PDF generation things in the past, and the support for PDF generation in PDFsharp (@Kris) is pretty good; I would use it ahead of Office automation.
I hope I've not mis-read your needs but rather than exporting in a specific format and then firing the print feature I would these days re-consider plain old browser printing. In the past the awful limitations of browser printing meant that printing was a bad experience (no shrink-to-fit etc). But modern browsers have sufficient print support to be acceptable for simple jobs.
I've just checked printing this page in Firefox 3.5.8 and IE8 and both support shrink-to-fit and I reckon a simple print stylesheet will generate nice looking job sheets straight out of the browser as long as your (presumably internal) audience is guaranteed to have a modern browser.
XPS is MSFT's solution to print structured and formatted documents. I'm not saying to send someone an XPS, just to use the .NET support for printing via the XPS framework.
http://roecode.wordpress.com/2007/12/21/using-flowdocument-xaml-to-print-xps-documents/
It's very easy to do if you're doing WPF. Essentially an XPS is MSFT's PDF. Anyone running Vista or 7 can view/print them fine.