itext - pdf to html

itext - pdf to html - c#

I have spent about 20 hours coding invoice generation with iText in C#.
Now I want to use the same code to transform some of the tables to HTML.
Do you know if I can do this?
For instance, I have this:
PdfPTable table = new PdfPTable(3);
table.DefaultCell.Border = 0;
table.DefaultCell.Padding = 3;
table.WidthPercentage = 100;
int[] widths = { 100, 200, 100};
table.SetWidths(widths);
List listOfCompanyData = (List)getCompanyData();
List listOfCumparatorDreaptaData = (List)getCumparatorDreaptaData(proformaInvoice.getCumparatorDreapta());
table.AddCell((Phrase)listOfCompanyData.Items[0]);
table.AddCell("");
table.AddCell((Phrase)listOfCumparatorDreaptaData.Items[0]);
and I want to transform this table into HTML...
Is it possible?

PDFs and HTML are fundamentally different display technologies. PDF is much more complex than HTML, which is why you find so many HTML-to-PDF converters. The other way around is much more difficult.
iText can only convert from HTML to PDF.
There are online converters that will take a PDF and convert it to HTML. There are also downloadable utilities.
I am not aware of any .NET library that will do this.

PDF is almost a write-only format. Any time your workflow calls for "get the data out of a PDF", you've probably screwed up.
Having said that, there are several ways to stash data within a PDF:
Form fields have no particular length limit and need not be visible. Getting form data with iText is trivial.
You can attach a file to a PDF and suck it out later, both with iText.
DocInfo fields. You can stuff a string into one of the author/title/keywords/etc metadata fields. An ugly hack, but effective.
XML metadata. The "new-fangled" metadata is stored in an XML schema. You can put pretty much whatever you want in there... though iText regenerates some of it every time it makes changes (mod date and such).
Custom keys/values. You can tack any old key/value pairs you like into any old dictionary within a PDF. Adobe would like you to register a company-specific prefix for your custom tags to avoid collisions, but I've never felt the need.

From the book iText in Action it seems that this is doable with the original Java library, but it looks like that part is no longer ported in the C# lib. I'm pretty sure it was in version 4 :-/
Try looking at some old source here: http://www.koders.com/csharp/fid60B0985D3A89152128B73F54EDD4EB5420A5E4D8.aspx?s=%22Ken+Auer%22

nFOP + XSLT + XML = pdf | doc | HTML
nfop.sourceforge.net/article.html should give you an idea of how to use it. You need the "Microsoft Visual J# .NET Redistributable Package" to run nFOP.
open source no cost :)
K

Related

Try To Understand ITextSharp

I am trying to build an application that converts a PDF to an Excel file with C#.
I searched for a library to help with this, but most of them are commercially licensed, so I ended up with iTextSharp.dll.
It's good that it is free, but I can rarely find good documentation for it.
These are some of the links that I have read:
https://yoda.entelect.co.za/view/9902/extracting-data-from-pdf-files
https://www.mikesdotnetting.com/article/80/create-pdfs-in-asp-net-getting-started-with-itextsharp
http://www.thedevelopertips.com/DotNet/ASPDotNet/Read-PDF-and-Convert-to-Stream.aspx?id=34
There are more, but most of them do not really explain what the code does.
This is the most common code I see for iText with C#:
StringBuilder text = new StringBuilder(); // my new file that will have pdf content?
PdfReader pdfReader = new PdfReader(myPath); // This maybe how IText read the pdf?
for (int page = 1; page <= pdfReader.NumberOfPages; page++) // looping for read all content in pdf?
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy(); // ?
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy); // ?
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText))); // maybe how IText convert the data to text?
text.Append(currentText); // maybe the full content?
}
pdfReader.Close(); // to close the PdfReader?
As you can see, I still do not have a clear understanding of the iText code that I have. Please tell me whether my understanding is correct, and explain the parts of the code that I still do not understand.
Thank You.

Let me start by explaining a bit about PDF.
PDF is not a 'what you see is what you get'-format.
Internally, PDF is more like a file containing instructions for rendering software. Unless you are working with a tagged PDF file, a PDF document does not naturally have a concept of 'paragraph' or 'table'.
If you open a PDF in notepad for instance, you might see something like
7 0 obj
<</BaseFont/Helvetica-Oblique/Encoding/WinAnsiEncoding/Subtype/Type1/Type/Font>>
endobj
Instructions in the document get gathered into 'objects' and objects are numbered, and can be cross-referenced.
As Bruno already indicated in the comments, this means that finding out what a table is, or what the content of a table is, can be really hard.
The PDF document itself can only tell you things like:
object 8 is a line from [50, 100] to [150, 100]
object 125 is a piece of text, in font Helvetica, at position [50, 110]
With the iText core library you can
get all of these objects (which iText calls PathRenderInfo, TextRenderInfo and ImageRenderInfo objects)
get the graphics state when the object was rendered (which font, font-size, color, etc)
This can allow you to write your own parsing logic.
For instance:
gather all the PathRenderInfo objects
remove everything that is not a perfect horizontal or vertical line
make clusters of everything that intersects at 90 degree angles
if a cluster contains more than a given threshold of lines, consider it a table
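The four steps above can be sketched without any iText types at all. This is only an illustration under stated assumptions: `Segment` is a hypothetical stand-in for whatever line data you pull out of the PathRenderInfo objects, `TableFinder` and its `Tolerance` value are made up for the example, and real pages will need more care (rectangles drawn as filled paths, dashed lines, rotated pages).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a line segment read from a PDF page; not an iText type.
public struct Segment
{
    public double X1, Y1, X2, Y2;
    public Segment(double x1, double y1, double x2, double y2) { X1 = x1; Y1 = y1; X2 = x2; Y2 = y2; }
    public bool IsHorizontal { get { return Math.Abs(Y1 - Y2) < TableFinder.Tolerance; } }
    public bool IsVertical { get { return Math.Abs(X1 - X2) < TableFinder.Tolerance; } }
}

public static class TableFinder
{
    public const double Tolerance = 0.5; // assumed slack, in PDF user-space units

    // True when a horizontal and a vertical segment touch or cross.
    static bool Crosses(Segment h, Segment v)
    {
        double hx1 = Math.Min(h.X1, h.X2), hx2 = Math.Max(h.X1, h.X2);
        double vy1 = Math.Min(v.Y1, v.Y2), vy2 = Math.Max(v.Y1, v.Y2);
        return v.X1 >= hx1 - Tolerance && v.X1 <= hx2 + Tolerance
            && h.Y1 >= vy1 - Tolerance && h.Y1 <= vy2 + Tolerance;
    }

    // Returns the clusters that contain more than 'threshold' lines.
    public static List<List<Segment>> FindTableCandidates(IEnumerable<Segment> all, int threshold)
    {
        // Steps 1 and 2: keep only horizontal or vertical segments.
        var lines = new List<Segment>();
        foreach (var s in all)
            if (s.IsHorizontal || s.IsVertical) lines.Add(s);

        // Step 3: union-find over segments that meet at right angles.
        var parent = new int[lines.Count];
        for (int i = 0; i < parent.Length; i++) parent[i] = i;
        Func<int, int> find = null;
        find = i => parent[i] == i ? i : (parent[i] = find(parent[i]));

        for (int i = 0; i < lines.Count; i++)
            for (int j = i + 1; j < lines.Count; j++)
            {
                bool meet =
                    (lines[i].IsHorizontal && lines[j].IsVertical && Crosses(lines[i], lines[j])) ||
                    (lines[i].IsVertical && lines[j].IsHorizontal && Crosses(lines[j], lines[i]));
                if (meet) parent[find(i)] = find(j);
            }

        // Step 4: group by cluster root and apply the threshold.
        var groups = new Dictionary<int, List<Segment>>();
        for (int i = 0; i < lines.Count; i++)
        {
            int root = find(i);
            if (!groups.ContainsKey(root)) groups[root] = new List<Segment>();
            groups[root].Add(lines[i]);
        }
        var result = new List<List<Segment>>();
        foreach (var g in groups.Values)
            if (g.Count > threshold) result.Add(g);
        return result;
    }
}
```

Calling `TableFinder.FindTableCandidates(segments, 4)` on the segments of one page returns each group of connected ruling lines that is large enough to be a table candidate. The pairwise loop is quadratic, which is usually fine for the few hundred lines found on a single page.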
Luckily, the pdf2Data solution (an iText add-on) already does that kind of thing for you.
For more information go to http://pdf2data.online/

MS Word C# AddIn - how to edit xml of an open word document

Thanks for coming by :)
I need to modify the XML of an MS Word Document directly, because the Word Interop's capabilities are insufficient for what I need to do.
The trick is that I have to do it from a Word Add-In and apply it to the currently open document, so I can't open/save packages (right?). In short, several dozen articles like the one below are not applicable here:
https://msdn.microsoft.com/en-us/library/aa982683%28v=office.12%29.aspx
Any help would be appreciated :)
Example problem -- Remove custom cell margins from a really, really big table in word (think 200x10) and check "Same as whole table" for each.
A lead on a solution (currenttable is the currently selected word table):
using System.Xml.Linq; // plus all the standard Word Add-In references
...
XDocument currentablexdocument = XDocument.Parse(currenttable.Range.WordOpenXML);
currentablexdocument.Descendants().Where(e =>e.Name.LocalName.Equals("tcMar")).Remove();
currenttable.Range.Delete();
currentselection.InsertXML(currentablexdocument.ToString());
Explanation:
currenttable.Range.WordOpenXML provides me with a well-formed XML representation of the table, which I then interpret as an XDocument
tcMar = table cell margins. These XML elements exist only if a cell has custom margins. Deleting all such elements does exactly what I need.
currenttable.Range.Delete() deletes the old table
currentselection.InsertXML(...) inserts the modified table XML into the document with margins fixed. Pretty much instantaneous. Yay!
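The tcMar removal step can be tried in isolation, outside Word. The sketch below is only an illustration: `CellMarginStripper` and its `Sample` fragment are hypothetical names, and the sample is a hand-written one-cell table, far smaller than what Range.WordOpenXML actually returns.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class CellMarginStripper
{
    // Hand-written WordprocessingML fragment: one cell with a custom left margin.
    public const string Sample =
        "<w:tbl xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
        "<w:tr><w:tc><w:tcPr>" +
        "<w:tcMar><w:left w:w=\"0\" w:type=\"dxa\"/></w:tcMar>" +
        "</w:tcPr><w:p/></w:tc></w:tr></w:tbl>";

    // Removes every tcMar element, whatever its prefix, and returns the resulting XML.
    public static string RemoveCellMargins(string wordXml)
    {
        XDocument doc = XDocument.Parse(wordXml);
        doc.Descendants()
           .Where(e => e.Name.LocalName == "tcMar")
           .Remove(); // Extensions.Remove snapshots the query before deleting
        return doc.ToString();
    }
}
```

`CellMarginStripper.RemoveCellMargins(CellMarginStripper.Sample)` returns the same table with an empty tcPr element and no tcMar, which is the shape of XML that is then handed to InsertXML.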
Problem:
Deleting and inserting the table is flaky and yields undesired results. It would be much better if I could MODIFY the xml directly. Is it possible?
Disclaimer:
Any other ideas of fixing this particular issue are welcome, but I have tried a myriad of possible solutions:
applying a table style: rejected by the client,
looping "SendKeys" commands to automate use of the Word interface: too unreliable,
changing Table.XXXPadding, Row.XXXPadding, Column.XXXPadding: doesn't affect custom Cell margins (among other issues),
looping through cells to change their Cell.XXXPadding: too slow (freezes Word for several minutes on a 200x10 table). Note, it's accessing the padding that's slow; the loop itself takes 3 seconds to traverse the whole table when implemented correctly.
Of course, I tried it all with ScreenRefreshing = false and AllowAutoFit = false.
Somebody please help :)
Cheers!

Diffing large XML files in C# (.net 2.0)

I'm kind of stuck having to use .Net 2.0, so LINQ xml isn't available, although I would be interested how it would compare...
I had to write an internal program to download, extract, and compare some large XML files (about 10 megs each) that are essentially build configurations. I first attempted using libraries, such as Microsoft's XML diff/patch, but comparing the files was taking 2-3 minutes, even when ignoring whitespace, namespaces, etc. (I tested each ignore option one at a time to try to figure out which was speediest). Then I tried to implement my own ideas: lists of nodes from XmlDocument objects, dictionaries keyed on the root's direct descendants (45000 children, by the way) that pointed to ints indicating the node position in the XML document... all took at least 2 minutes to run.
My final implementation finishes in 1-2 seconds - I made a system process call to diff with a few lines of context and saved those results to display (our development machines include cygwin, thank goodness).
I can't help but think there is a better, XML specific way to do this that would be just as fast as a plain text diff - especially since all I'm really interested in is the Name element that is the child of each direct descendant, and could throw away 4/5 of the file for my purposes (we only need to know what files were included, not anything else involving language or version)
So, as popular as XML is, I'm sure somebody out there has had to do something similar. What is a fast, efficient way to compare these large XML files? (preferably open source or free)
edit: a sample of the nodes - I only need to find missing Name elements (there are over 45k nodes as well)
<file>
<name>SomeFile</name>
<version>10.234</version>
<countries>CA,US</countries>
<languages>EN</languages>
<types>blah blah</types>
<internal>N</internal>
</file>

// Index every <file> of the first document by its <name>.
XmlDocument source = new XmlDocument();
source.Load("source.xml");
Dictionary<string, XmlNode> files = new Dictionary<string, XmlNode>();
foreach (XmlNode file in source.SelectNodes("//file"))
    files.Add(file.SelectSingleNode("./name").InnerText, file); // throws if a name occurs twice
// Look up every <file> of the second document in that index.
XmlDocument source2 = new XmlDocument();
source2.Load("source2.xml");
XmlNode value;
foreach (XmlNode file in source2.SelectNodes("//file"))
{
    if (files.TryGetValue(file.SelectSingleNode("./name").InnerText, out value))
    {
        // This file is both in source and source2.
    }
    else
    {
        // This file is only in source2.
    }
}
I am not sure exactly what you want, I hope that this example will help you in your quest.

Diffing XML can be done many ways. You're not being very specific regarding the details, though. What does transpire is that the files are large and you need only about 1/5 of the information.
Well, then the algorithm is as follows:
Normalize and reduce the documents to the information that matters.
Save the results.
Compare the results.
And the implementation:
Use the XmlReader API, which is efficient, to produce plain text representations of your information. Why a plain text representation? Because diff tools are predicated on the assumption that there is plain text. And so are our eyeballs. Why XmlReader? You could use SAX, which is memory-efficient, but XmlReader is more efficient. As for the precise spec of that plain text file ... you're just not including enough information.
Save the plain text files to some temp directory.
Use a command-line diff utility like GnuWin32 diff to get some diff output. Yeah, I know, not pure and proper, but works out of the box and there's no coding to be done. If you are familiar with some C# diff API (I am not), well, then use that API instead, of course.
Delete the temp files. (Or optionally keep them if you're going to reuse them.)
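As a rough sketch of the first step, the snippet below streams through a document with XmlReader and keeps only the text of each name element. It assumes the element is called `name` as in the question's sample and contains text only; `NameDiff`, `ReadNames` and `Missing` are hypothetical names. `Missing` is a shortcut that is not part of the steps above: it compares the two lists in memory instead of writing temp files and running an external diff.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NameDiff
{
    // Streams through the XML and collects the text of every <name> element.
    public static List<string> ReadNames(TextReader input)
    {
        List<string> names = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "name")
                    names.Add(reader.ReadElementContentAsString()); // also moves past </name>
                else
                    reader.Read();
            }
        }
        return names;
    }

    // Names present in 'expected' but absent from 'actual'.
    public static List<string> Missing(IEnumerable<string> expected, IEnumerable<string> actual)
    {
        Dictionary<string, bool> seen = new Dictionary<string, bool>();
        foreach (string n in actual) seen[n] = true;
        List<string> missing = new List<string>();
        foreach (string n in expected)
            if (!seen.ContainsKey(n)) missing.Add(n);
        return missing;
    }
}
```

To follow the steps literally, sort each list returned by `ReadNames`, save it with `File.WriteAllLines`, and point the diff utility at the two text files. Because only one node is held in memory at a time, the 10 MB inputs never need to be loaded as a whole document.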

Use PDFBox to fill out a PDF Form

I have a pdf with a form in it. I am trying to write a class that will take data from my database and automatically populate the fields in the form.
I have already tried ITextSharp and their pricing is out of my budget, even though it works perfectly fine with my pdf. I need a free pdf parser that will let me import the pdf, set the data, and save the PDF out, preferably to a stream so that I can return a Stream object from my class rather than saving the pdf to the server.
I found this pdf reader and it doesn't work. Null reference errors are abundant and when I tried to "fix" them, it still couldn't find my fields.
So, I have moved on to PdfBox, as the documentation says it can manipulate a PDF, however, I cannot find any examples. Here is the code I have so far.
var document = PDDocument.load(inputPdf);
var catalog = document.getDocumentCatalog();
var form = catalog.getAcroForm();
form.getField("MY_FIELD").setValue("Test Value");
document.save("some location on my hard drive");
document.close();
The problem is that catalog.getAcroForm() is returning a null, so I can't access the fields. Does anyone know how I can use PdfBox to alter the field values and save the thing back out?
EDIT:
I did find this example, which is pretty much what I am doing. It's just that my acroform is null in pdfbox. I know there is one there because itextsharp can pull it out just fine.

Have you tried with the 1.2.1 version?
http://pdfbox.apache.org/apidocs/overview-summary.html

Creating something printable in C#

Just wondering if anyone could tell me of a simple way to create files for printing? At the moment I'm just scripting HTML, but I'm wondering if there isn't some easier way of doing it that would give me more control over what is being printed? Something along the lines of an Access printout, or Excel printout, where I could decide how to lay things out and almost "mail merge" the details in via programming.
Basically, I want to create something for print that can have tables encasing it, and could be longer or shorter for each record depending upon the number of foreign keys (e.g. one staff member could have 10 jobs today, or just 3. I want to create a document that will generate and print).
Any ideas/advice/opinions? Thank you!
EDIT: Wow, thanks for all the responses! For this particular task, FlowDocuments seems to be the closest to what I'm actually after so I'll play with that. Either way I have several really good options now.
EDIT 2: After some playing, iTextSharp has become the choice for me. For anyone wondering in the future, here is a link to a great and simple tutorial: http://www.mikesdotnetting.com/Category/20
Thanks again!

I would create a PDF file which can be viewed just about anywhere and will maintain formatting. Take a look here: http://itextsharp.sourceforge.net/

There's always FlowDocuments. Check out the overview at MSDN
http://msdn.microsoft.com/en-us/library/aa970909.aspx and see if they match what you want to do. They're pretty easy to print and can be serialized to xaml. Might not be exactly what you're after, but they're pretty useful.

We are currently using PDFSharp with great success -
http://www.pdfsharp.com/PDFsharp/
GDI+ or WPF ... all .NET, not COM or interop.
Oh, and it's open source. Here is some sample code -
http://www.pdfsharp.net/wiki/PDFsharpSamples.ashx

If you use a PDF or XPS generator, you still have to define the document composition, much like scripting your HTML, so I don't see that it gives you much more value other than that the created file is in a print-ready format.
What you need is something that lets you design a template and just fill in the blanks, so I suggest that you either go for Word or Excel automation, or look at some lightweight report generation library. I came across this one, and maybe it is worth checking out too.
http://www.fyireporting.com/

Like David, I also recommend iTextSharp ;) It's really easy to create a PDF document with it ;) I use it in an ASP.NET project. It has many options for formatting the PDF file; in my example I use only the basics ;)
Example:
string file = @"d:\print.pdf"; //path to pdf file
Document myDocument = new Document(PageSize.A4.Rotate());
PdfWriter.GetInstance(myDocument, new FileStream(file, FileMode.Create));
myDocument.Open();
//data to save in pdf- unimportant!
Opiekun obiekun = (from opiekunTmp in db.Opiekuns where opiekunTmp.idOpiekun == nalez.Dziecko.idOpiekun select opiekunTmp).SingleOrDefault();
Dziecko dzieckoZap = (from dzieckoTmp in db.Dzieckos where dzieckoTmp.idDziecko == nalez.idDziecko select dzieckoTmp).SingleOrDefault();
//some info about font
BaseFont times = BaseFont.CreateFont(BaseFont.TIMES_ROMAN, BaseFont.CP1250, BaseFont.EMBEDDED);
Font font = new Font(times, 12);
myDocument.Add(new Paragraph("--------------------------Raport opłaty--------------------------",font));
myDocument.Add(new Paragraph("Data rozliczenia: " + (((TextBox)this.GridViewOplaty.Rows[e.RowIndex].Cells[8].Controls[0]).Text), font));
myDocument.Add(new Paragraph("Płatnik: " + obiekun.Imie + " " + obiekun.Nazwisko, font));
myDocument.Add(new Paragraph("Dziecko: " + dzieckoZap.Imie + " " + dzieckoZap.Nazwisko, font));
myDocument.Add(new Paragraph(""));
myDocument.Add(new Paragraph("Data Podpis płatnika: " + obiekun.Imie + " " + obiekun.Nazwisko, font));
myDocument.Add(new Paragraph(""));
myDocument.Add(new Paragraph(" ........... ................................."));
myDocument.Close(); //close the pdf
System.Diagnostics.Process.Start(file); //and open our file if you want that ;)

I have created printouts from web sites using Excel XML format. Basically this means you don't have to use the Office APIs to generate the document (which can be cumbersome and requires extra libraries on the web server). Instead, you can just take an XML template, use XPath, LINQ to XML, or other technologies to insert your data into the template, and then stream it to the user and they can print it.
Generating the template is easy. You just use Excel to create the document and then save it in the "XML Spreadsheet" format. The XML is a bit oppressive but it isn't terrible.
Documentation on the XML Spreadsheet format is here:
http://msdn.microsoft.com/en-us/library/aa140066%28office.10%29.aspx
Note that the documentation is for Excel 2002. The format does change in newer versions of Excel, but it is backwards compatible.
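A minimal sketch of the "insert your data into the template" step with LINQ to XML follows. It assumes a template saved from Excel in the XML Spreadsheet format with a single worksheet and table; `SpreadsheetTemplate` and `Fill` are hypothetical names, and every value is written as a string cell for simplicity.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class SpreadsheetTemplate
{
    // Namespace used by the XML Spreadsheet format for both elements and ss: attributes.
    static readonly XNamespace ss = "urn:schemas-microsoft-com:office:spreadsheet";

    // Appends one <Row> of string cells per record to the first <Table> of the template.
    public static string Fill(string templateXml, IEnumerable<string[]> records)
    {
        XDocument doc = XDocument.Parse(templateXml);
        XElement table = doc.Root.Element(ss + "Worksheet").Element(ss + "Table");

        // Excel writes a fixed row count into saved files; drop it so that
        // the added rows do not contradict it (setting null removes the attribute).
        table.SetAttributeValue(ss + "ExpandedRowCount", null);

        foreach (string[] record in records)
        {
            XElement row = new XElement(ss + "Row");
            foreach (string value in record)
            {
                row.Add(new XElement(ss + "Cell",
                    new XElement(ss + "Data", new XAttribute(ss + "Type", "String"), value)));
            }
            table.Add(row);
        }
        return doc.ToString();
    }
}
```

The returned string can be streamed to the user with an Excel content type. Rows are appended after whatever header rows the template already has, so the layout work stays in Excel and the code only supplies data.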

We use ActiveReports. It is very easy to define your layout and can print and export in a number of formats, pdf, rtf, excel, tiff, etc.

If you're looking for a very simple way to create docs (prob not the best, but it sure is easy), you can set up Word docs with bookmarks and insert data into the bookmarks through code, so I'm guessing this would work for Excel too (if they have bookmarks?):
EDIT: Here's a quick translation into c# (been tested in vb, but not c#):
// Needs a reference to Microsoft.Office.Interop.Word and
// "using Word = Microsoft.Office.Interop.Word;" (C# 4 or later for the short COM call syntax).
Word.Application oWord = new Word.Application();
oWord.Visible = false;
// Directory and strBookmark are strings defined elsewhere.
Word.Document oDoc = oWord.Documents.Add(Directory + "\\MyDocument.dot");
// C# uses an indexer here, not the VB-style parentheses.
oDoc.Bookmarks["MyBookmark"].Range.Text = strBookmark;
oDoc.PrintOut();
oDoc.Close(Word.WdSaveOptions.wdDoNotSaveChanges);
oDoc = null;
oWord.Quit();
oWord = null;

I've done the Word and PDF generation things in the past, and the support for PDF generation in PDFsharp (@Kris) is pretty good; I would use it ahead of Office automation.
I hope I've not mis-read your needs but rather than exporting in a specific format and then firing the print feature I would these days re-consider plain old browser printing. In the past the awful limitations of browser printing meant that printing was a bad experience (no shrink-to-fit etc). But modern browsers have sufficient print support to be acceptable for simple jobs.
I've just checked printing this page in Firefox 3.5.8 and IE8 and both support shrink-to-fit and I reckon a simple print stylesheet will generate nice looking job sheets straight out of the browser as long as your (presumably internal) audience is guaranteed to have a modern browser.

XPS is MSFT's solution to print structured and formatted documents. I'm not saying to send someone an XPS, just to use the .NET support for printing via the XPS framework.
http://roecode.wordpress.com/2007/12/21/using-flowdocument-xaml-to-print-xps-documents/
It's very easy to do if you're doing WPF. Essentially an XPS is MSFT's PDF. Anyone running Vista or 7 can view/print them fine.
