Make a pdf conforming PDF/A with only images using iTextSharp - c#

I'm using iTextSharp to generate pdf-a documents from images. So far I've not been successful.
Edit: I'm using iTextSharp to generate the PDF
All I try is to make a pdf-a document (1a or 1b, whatever suits), with some images. This is the code I've come up so far, but I keep getting errors when I try to validate them with pdf-tools or validatepdfa.
This are the errors I get from pdf-tools (using PDF/A-1b validation):
Edit: MarkInfo and Color Space arn't yet working. The rest is okay
Validating file "0.pdf" for conformance level pdfa-1a
The key MarkInfo is required but missing.
A device-specific color space (DeviceRGB) without an appropriate output intent is used.
The document does not conform to the requested standard.
The document contains device-specific color spaces.
The document doesn't provide appropriate logical structure information.
Done.
Main flow
var output = new MemoryStream();
using (var iccProfileStream = new FileStream("ToPdfConverter/ColorProfiles/sRGB_v4_ICC_preference_displayclass.icc", FileMode.Open))
{
var document = new Document(new Rectangle(PageSize.A4.Width, PageSize.A4.Height), 0f, 0f, 0f, 0f);
var pdfWriter = PdfWriter.GetInstance(document, output);
pdfWriter.PDFXConformance = PdfWriter.PDFA1A;
document.Open();
var pdfDictionary = new PdfDictionary(PdfName.OUTPUTINTENT);
pdfDictionary.Put(PdfName.OUTPUTCONDITION, new PdfString("sRGB IEC61966-2.1"));
pdfDictionary.Put(PdfName.INFO, new PdfString("sRGB IEC61966-2.1"));
pdfDictionary.Put(PdfName.S, PdfName.GTS_PDFA1);
var iccProfile = ICC_Profile.GetInstance(iccProfileStream);
var pdfIccBased = new PdfICCBased(iccProfile);
pdfIccBased.Remove(PdfName.ALTERNATE);
pdfDictionary.Put(PdfName.DESTOUTPUTPROFILE, pdfWriter.AddToBody(pdfIccBased).IndirectReference);
pdfWriter.ExtraCatalog.Put(PdfName.OUTPUTINTENT, new PdfArray(pdfDictionary));
var image = PrepareImage(imageBytes);
document.Open();
document.Add(image);
pdfWriter.CreateXmpMetadata();
pdfWriter.CloseStream = false;
document.Close();
}
return output.GetBuffer();
This is prepareImage()
It's used to flatten the image to bmp, so I don't need to bother about alpha channels.
private Image PrepareImage(Stream stream)
{
Bitmap bmp = new Bitmap(System.Drawing.Image.FromStream(stream));
var file = new MemoryStream();
bmp.Save(file, ImageFormat.Bmp);
var image = Image.GetInstance(file.GetBuffer());
if (image.Height > PageSize.A4.Height || image.Width > PageSize.A4.Width)
{
image.ScaleToFit(PageSize.A4.Width, PageSize.A4.Height);
}
return image;
}
Can anyone help me into a direction to fix the errors?
Specifically the device-specific color spaces
Edit: More explanation: What I'm trying to achieve is, converting scanned images to PDF/A for long-term data storage
Edit: added some files I'm using to test with
PDFs and Pictures.rar (3.9 MB)
https://mega.co.nz/#!n8pClYgL!NJOJqSO3EuVrqLVyh3c43yW-u_U35NqeB0svc6giaSQ

OK, I checked one of your files in callas pdfToolbox and it says: "Device color space used but no PDF/A output intent". Which I took as a sign that you do something wrong while writing an output intent to the document. I then converted that document to PDF/A-1b with the same tool and the difference is obvious.
Perhaps there are other errors you need to fix, but the first error here is that you put a key in the catalog dict for the PDF file that is named "OutputIntent". That's wrong: page 75 of the PDF Specification states that the key should be named "OutputIntents".
Like I said, perhaps there are other problems with your file beyond this, but the wrong name for the key causes PDF/A validators not to find the Output Intent you try to put in the file...

First of all, pdfx IS NOT pdfa.
Second, you're using wrong PdfWriter. It should be PdfAWriter.
I do not have solution for image problem unfortunatelly, but I have for 1 and 2.
Regards
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.Text;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.html.simpleparser;
using iTextSharp.tool.xml;
using System.Drawing;
using System.Drawing.Imaging;
namespace Tests
{
/*
* References:
* UTF-8 encoding http://stackoverflow.com/questions/4902033/itextsharp-5-polish-character
* PDFA http://www.codeproject.com/Questions/661704/Create-pdf-A-using-itextsharp
* Images http://stackoverflow.com/questions/15896581/make-a-pdf-conforming-pdf-a-with-only-images-using-itextsharp
*/
[TestClass]
public class UnitTest1
{
/*
* IMPORTANT: Restrictions with html usage of tags and attributes
* 1. Dont use * <head> <title>Sklep</title> </head>, because title is rendered to the page
*/
// Test cases
static string contents = "<html><body style=\"font-family:arial unicode ms;font-size: 8px;\"><p style=\"text-align: center;\"> Davčna številka dolžnika: 74605968<br /> </p><table> <tr> <td><b>\u0160t. sklepa: 88711501</b></td> <td style=\"text-align: right;\">Davčna številka dolžnika: 74605968</td> </tr> </table> <br/><img src=\"http://img.rtvslo.si/_static/images/rtvslo_mmc_logo.png\" /></body></html>";
//static string contents = "<html><body style=\"font-family:arial unicode ms;font-size: 8px;\"><p style=\"text-align: center;\"> Davčna številka dolžnika: 74605968<br /> </p><table> <tr> <td><b>\u0160t. sklepa: 88711501</b></td> <td style=\"text-align: right;\">Davčna številka dolžnika: 74605968</td> </tr> </table> <br/></body></html>";
//[TestMethod]
public void CreatePdfHtml()
{
createPDF(contents, true);
}
private void createPDF(string html, bool isPdfa)
{
TextReader reader = new StringReader(html);
Document document = new Document(PageSize.A4, 30, 30, 30, 30);
HTMLWorker worker = new HTMLWorker(document);
PdfWriter writer;
if (isPdfa)
{
//set conformity level
writer = PdfAWriter.GetInstance(document, new FileStream(#"c:\temp\testA.pdf", FileMode.Create), PdfAConformanceLevel.PDF_A_1B);
//set pdf version
writer.SetPdfVersion(PdfAWriter.PDF_VERSION_1_4);
// Create XMP metadata. It's a PDF/A requirement.
writer.CreateXmpMetadata();
}
else
{
writer = PdfWriter.GetInstance(document, new FileStream(#"c:\temp\test.pdf", FileMode.Create));
}
document.Open();
if (isPdfa) // document should be opend, or it will fail
{
// Set output intent for uncalibrated color space. PDF/A requirement.
ICC_Profile icc = ICC_Profile.GetInstance(Environment.GetEnvironmentVariable("SystemRoot") + #"\System32\spool\drivers\color\sRGB Color Space Profile.icm");
writer.SetOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
}
//register font used in html
FontFactory.Register(Environment.GetEnvironmentVariable("SystemRoot") + "\\Fonts\\ARIALUNI.TTF", "arial unicode ms");
//adding custom style attributes to html specific tasks. Can be used instead of css
//this one is a must fopr display of utf8 language specific characters (čćžđpš)
iTextSharp.text.html.simpleparser.StyleSheet ST = new iTextSharp.text.html.simpleparser.StyleSheet();
ST.LoadTagStyle("body", "encoding", "Identity-H");
worker.SetStyleSheet(ST);
worker.StartDocument();
worker.Parse(reader);
worker.EndDocument();
worker.Close();
document.Close();
}
}
}

Related

ITextSharp pdf resize and data alignment

I am using ITextSharp to convert HTML to PDF but i want the PDF to be generated of size 5cm width. I used the following code
var pgSize = new iTextSharp.text.Rectangle(2.05f, 2.05f);
Document doc = new Document(pgSize);
but it is just resizing the pdf and my data disappeared in the pdf or get hide.
How can i align the data in the center in PDF or resize the pdf? Here is my code
public void ConvertHTMLToPDF(string HTMLCode)
{
try
{
System.IO.StringWriter stringWrite = new StringWriter();
System.Web.UI.HtmlTextWriter htmlWrite = new HtmlTextWriter(stringWrite);
StringReader reader = new StringReader(HTMLCode);
var pgSize = new iTextSharp.text.Rectangle(2.05f, 2.05f);
Document doc = new Document(pgSize);
HTMLWorker parser = new HTMLWorker(doc);
PdfWriter.GetInstance(doc, new FileStream(Server.MapPath("~") + "/App_Data/HTMLToPDF.pdf",
FileMode.Create));
doc.Open();
foreach (IElement element in HTMLWorker.ParseToList(
new StringReader(HTMLCode), null))
{
doc.Add(element);
}
doc.Close();
Response.End();
}
catch (Exception ex)
{
}
}
You are creating a PDF that measures 0.0723 cm by 0.0723 cm. That is much too small to add any content. If you want to create a PDF of 5 cm by 5 cm, you need to create your document like this:
var pgSize = new iTextSharp.text.Rectangle(141.732f, 141.732f);
Document doc = new Document(pgSize);
As for the alignment, that should be defined in the HTML, but you are using an old version of iText and you are using the deprecated HTMLWorker.
You should upgrade to iText 7 and pdfHTML as described here: Converting HTML to PDF using iText
Also: the size of the page can be defined in the #page-rule of the CSS. See Huge white space after header in PDF using Flying Saucer
Why would you make it difficult for yourself by using an old iText version, when the new version allows you to do this:
#page {
size: 5cm 5cm;
}

Aspx to PDF iTextSharp usage

I have this code which I merged and modified for my needs. But I still can't make it work as I need. The first part that I made, it generates PDF with an option from aspx page chosen. Second, I need to have the background over the page, so I added next code, but now it generates just the second code and not the PDF. And im not able to merge those codes together.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
public partial class CreatePDFFromScratch : System.Web.UI.Page
{
protected void btnCreatePDF_Click(object sender, EventArgs e)
{
// Create a Document object
var document = new Document(iTextSharp.text.PageSize.LETTER.Rotate(), 0f, 0f, 0f, 0f);
// Create a new PdfWrite object, writing the output to a MemoryStream
var output = new MemoryStream();
var writer = PdfWriter.GetInstance(document, output);
// Open the Document for writing
document.Open();
// First, create our fonts..
var titleFont = FontFactory.GetFont("Arial", 18, Font.BOLD);
var subTitleFont = FontFactory.GetFont("Arial", 14, Font.BOLD);
var boldTableFont = FontFactory.GetFont("Arial", 12, Font.BOLD);
var endingMessageFont = FontFactory.GetFont("Arial", 10, Font.ITALIC);
var bodyFont = FontFactory.GetFont("Arial", 12, Font.NORMAL);
// Add the "Northwind Traders Receipt" title
document.Add(new Paragraph("Northwind Traders Receipt", titleFont));
// Now add the "Thank you for shopping at Northwind Traders. Your order details are below." message
document.Add(new Paragraph("Thank you for shopping at Northwind Traders. Your order details are below.", bodyFont));
document.Add(Chunk.NEWLINE);
// Add the "Order Information" subtitle
document.Add(new Paragraph("Order Information", subTitleFont));
// Create the Order Information table
var orderInfoTable = new PdfPTable(2);
orderInfoTable.HorizontalAlignment = 0;
orderInfoTable.SpacingBefore = 10;
orderInfoTable.SpacingAfter = 10;
orderInfoTable.DefaultCell.Border = 0;
orderInfoTable.SetWidths(new int[] { 1, 4 });
orderInfoTable.AddCell(new Phrase("Order:", boldTableFont));
orderInfoTable.AddCell(txtOrderID.Text);
orderInfoTable.AddCell(new Phrase("Price:", boldTableFont));
orderInfoTable.AddCell(Convert.ToDecimal(txtTotalPrice.Text).ToString("c"));
document.Add(orderInfoTable);
// Add the "Items In Your Order" subtitle
document.Add(new Paragraph("Items In Your Order", subTitleFont));
// Create the Order Details table
var orderDetailsTable = new PdfPTable(3);
orderDetailsTable.HorizontalAlignment = 0;
orderDetailsTable.SpacingBefore = 10;
orderDetailsTable.SpacingAfter = 35;
orderDetailsTable.DefaultCell.Border = 0;
orderDetailsTable.AddCell(new Phrase("Item #:", boldTableFont));
orderDetailsTable.AddCell(new Phrase("Item Name:", boldTableFont));
orderDetailsTable.AddCell(new Phrase("Qty:", boldTableFont));
foreach (System.Web.UI.WebControls.ListItem item in cblItemsPurchased.Items)
if (item.Selected)
{
// Each CheckBoxList item has a value of ITEMNAME|ITEM#|QTY, so we split on | and pull these values out...
var pieces = item.Value.Split("|".ToCharArray());
orderDetailsTable.AddCell(pieces[1]);
orderDetailsTable.AddCell(pieces[0]);
orderDetailsTable.AddCell(pieces[2]);
}
document.Add(orderDetailsTable);
// Add ending message
var endingMessage = new Paragraph("Thank you for your business! If you have any questions about your order, please contact us at 800-555-NORTH.", endingMessageFont);
endingMessage.SetAlignment("Center");
document.Add(endingMessage);
document.Close();
Response.ContentType = "application/pdf";
Response.AddHeader("Content-Disposition", string.Format("inline;filename=Receipt-{0}.pdf", txtOrderID.Text));
///create background
Response.BinaryWrite(output.ToArray());
Response.Cache.SetCacheability(HttpCacheability.NoCache);
string imageFilePath = Server.MapPath(".") + "/images/1.jpg";
iTextSharp.text.Image jpg = iTextSharp.text.Image.GetInstance(imageFilePath);
Document pdfDoc = new Document(iTextSharp.text.PageSize.LETTER.Rotate(), 0, 0, 0, 0);
jpg.ScaleToFit(790, 777);
jpg.Alignment = iTextSharp.text.Image.UNDERLYING;
pdfDoc.Open();
pdfDoc.NewPage();
pdfDoc.Add(jpg);
pdfDoc.Close();
Response.Write(pdfDoc);
Response.End();
}
}
Thanks
I almost missed this question because it wasn't tagged as an itext question.
First let me copy/paste/adapt #mkl's comment:
The first part of your code in which you create a document document makes sense.
The second part in which you create a document pdfDoc does not.
First of all, at the end of the first part you write the pdf to
the response. That PDF is complete. It's finished. It's done.
It's ready to send to the browser.
Why do you think anything additional written to the
response thereafter might have a chance of combining with the original
written data to a properly generated PDF?
Also: the second part of your code is
written as if you want to create a new PDF from scratch; but didn't you
want to manipulate the PDF created in the first part?
All of this is true, but it doesn't solve your problem. It only reveals your deep lack of understanding in PDF.
There are different ways to achieve what you want. I see that you want to use an image as a background of all the pages of a newly created PDF. In that case, you should create a page event, and add that image underneath all the existing content in the OnEndPage() method. This is explained in the answer to How can I add an image to all pages of my PDF?
Create a PDF as is done in the first part of your code, but introduce a page event:
// step 1
Document document = new Document();
// step 2
PdfWriter writer = PdfWriter.GetInstance(document, stream);
MyEvent event = new MyEvent();
writer.PageEvent = event;
// step 3
document.Open();
// step 4
// Add whatever content you want to add
// step 5
document.Close();
What is the MyEvent class, you might ask? Well, that's a class you create yourself like this:
protected class MyEvent : PdfPageEventHelper {
Image image;
public override void OnOpenDocument(PdfWriter writer, Document document) {
image = Image.GetInstance(Server.MapPath("~/images/background.png"));
image.SetAbsolutePosition(0, 0);
}
public override void OnEndPage(PdfWriter writer, Document document) {
writer.DirectContent.AddImage(image);
}
}
Suppose that your requirement isn't as easy as adding an image in the background, then you could use the bytes created as output to create a PdfReader instance. You could then use the PdfReader to create a PdfStamper and you can use the PdfStamper to watermark the original document. If the simple solution doesn't meet your needs, create a new question that involves PdfReader/PdfStamper and don't forget to tag that question as an iText question. (And also: please read the documentation. A lot of time was spent on the iText web site. That time was wasted if you don't consult it.)

Converting aspx page to PDF with css using iTextSharp in asp.net c#? [duplicate]

I want to convert the below HTML to PDF using iTextSharp but don't know where to start:
<style>
.headline{font-size:200%}
</style>
<p>
This <em>is </em>
<span class="headline" style="text-decoration: underline;">some</span>
<strong>sample<em> text</em></strong>
<span style="color: red;">!!!</span>
</p>
First, HTML and PDF are not related although they were created around the same time. HTML is intended to convey higher level information such as paragraphs and tables. Although there are methods to control it, it is ultimately up to the browser to draw these higher level concepts. PDF is intended to convey documents and the documents must "look" the same wherever they are rendered.
In an HTML document you might have a paragraph that's 100% wide and depending on the width of your monitor it might take 2 lines or 10 lines and when you print it it might be 7 lines and when you look at it on your phone it might take 20 lines. A PDF file, however, must be independent of the rendering device, so regardless of your screen size it must always render exactly the same.
Because of the musts above, PDF doesn't support abstract things like "tables" or "paragraphs". There are three basic things that PDF supports: text, lines/shapes and images. (There are other things like annotations and movies but I'm trying to keep it simple here.) In a PDF you don't say "here's a paragraph, browser do your thing!". Instead you say, "draw this text at this exact X,Y location using this exact font and don't worry, I've previously calculated the width of the text so I know it will all fit on this line". You also don't say "here's a table" but instead you say "draw this text at this exact location and then draw a rectangle at this other exact location that I've previously calculated so I know it will appear to be around the text".
Second, iText and iTextSharp parse HTML and CSS. That's it. ASP.Net, MVC, Razor, Struts, Spring, etc, are all HTML frameworks but iText/iTextSharp is 100% unaware of them. Same with DataGridViews, Repeaters, Templates, Views, etc. which are all framework-specific abstractions. It is your responsibility to get the HTML from your choice of framework, iText won't help you. If you get an exception saying The document has no pages or you think that "iText isn't parsing my HTML" it is almost definite that you don't actually have HTML, you only think you do.
Third, the built-in class that's been around for years is the HTMLWorker however this has been replaced with XMLWorker (Java / .Net). Zero work is being done on HTMLWorker which doesn't support CSS files and has only limited support for the most basic CSS properties and actually breaks on certain tags. If you do not see the HTML attribute or CSS property and value in this file then it probably isn't supported by HTMLWorker. XMLWorker can be more complicated sometimes but those complications also make it more extensible.
Below is C# code that shows how to parse HTML tags into iText abstractions that get automatically added to the document that you are working on. C# and Java are very similar so it should be relatively easy to convert this. Example #1 uses the built-in HTMLWorker to parse the HTML string. Since only inline styles are supported the class="headline" gets ignored but everything else should actually work. Example #2 is the same as the first except it uses XMLWorker instead. Example #3 also parses the simple CSS example.
//Create a byte array that will eventually hold our final PDF
Byte[] bytes;
//Boilerplate iTextSharp setup here
//Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream()) {
//Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
using (var doc = new Document()) {
//Create a writer that's bound to our PDF abstraction and our stream
using (var writer = PdfWriter.GetInstance(doc, ms)) {
//Open the document for writing
doc.Open();
//Our sample HTML and CSS
var example_html = #"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
var example_css = #".headline{font-size:200%}";
/**************************************************
* Example #1 *
* *
* Use the built-in HTMLWorker to parse the HTML. *
* Only inline CSS is supported. *
* ************************************************/
//Create a new HTMLWorker bound to our document
using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc)) {
//HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
using (var sr = new StringReader(example_html)) {
//Parse the HTML
htmlWorker.Parse(sr);
}
}
/**************************************************
* Example #2 *
* *
* Use the XMLWorker to parse the HTML. *
* Only inline CSS and absolutely linked *
* CSS is supported *
* ************************************************/
//XMLWorker also reads from a TextReader and not directly from a string
using (var srHtml = new StringReader(example_html)) {
//Parse the HTML
iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
}
/**************************************************
* Example #3 *
* *
* Use the XMLWorker to parse HTML and CSS *
* ************************************************/
//In order to read CSS as a string we need to switch to a different constructor
//that takes Streams instead of TextReaders.
//Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css))) {
using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html))) {
//Parse the HTML
iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
}
}
doc.Close();
}
}
//After all of the PDF "stuff" above is done and closed but **before** we
//close the MemoryStream, grab all of the active bytes from the stream
bytes = ms.ToArray();
}
//Now we just need to do something with those bytes.
//Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
//You could also write the bytes to a database in a varbinary() column (but please don't) or you
//could pass them to another function for further PDF processing.
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
System.IO.File.WriteAllBytes(testFile, bytes);
2017's update
There are good news for HTML-to-PDF demands. As this answer showed, the W3C standard css-break-3 will solve the problem... It is a Candidate Recommendation with plan to turn into definitive Recommendation this year, after tests.
As not-so-standard there are solutions, with plugins for C#, as showed by print-css.rocks.
As of 2018, there is also iText7 (A next iteration of old iTextSharp library) and its HTML to PDF package available: itext7.pdfhtml
Usage is straightforward:
HtmlConverter.ConvertToPdf(
new FileInfo(#"Path\to\Html\File.html"),
new FileInfo(#"Path\to\Pdf\File.pdf")
);
Method has many more overloads.
Update: iText* family of products has dual licensing model: free for open source, paid for commercial use.
#Chris Haas has explained very well how to use itextSharp to convert HTML to PDF, very helpful
my add is:
By using HtmlTextWriter I put html tags inside HTML table + inline CSS i got my PDF as I wanted without using XMLWorker .
Edit: adding sample code:
ASPX page:
<asp:Panel runat="server" ID="PendingOrdersPanel">
<!-- to be shown on PDF-->
<table style="border-spacing: 0;border-collapse: collapse;width:100%;display:none;" >
<tr><td><img src="abc.com/webimages/logo1.png" style="display: none;" width="230" /></td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:11px;color:#10466E;padding:0px;text-align:center;"><i>blablabla</i> Pending orders report<br /></td></tr>
</table>
<asp:GridView runat="server" ID="PendingOrdersGV" RowStyle-Wrap="false" AllowPaging="true" PageSize="10" Width="100%" CssClass="Grid" AlternatingRowStyle-CssClass="alt" AutoGenerateColumns="false"
PagerStyle-CssClass="pgr" HeaderStyle-ForeColor="White" PagerStyle-HorizontalAlign="Center" HeaderStyle-HorizontalAlign="Center" RowStyle-HorizontalAlign="Center" DataKeyNames="Document#"
OnPageIndexChanging="PendingOrdersGV_PageIndexChanging" OnRowDataBound="PendingOrdersGV_RowDataBound" OnRowCommand="PendingOrdersGV_RowCommand">
<EmptyDataTemplate><div style="text-align:center;">no records found</div></EmptyDataTemplate>
<Columns>
<asp:ButtonField CommandName="PendingOrders_Details" DataTextField="Document#" HeaderText="Document #" SortExpression="Document#" ItemStyle-ForeColor="Black" ItemStyle-Font-Underline="true"/>
<asp:BoundField DataField="Order#" HeaderText="order #" SortExpression="Order#"/>
<asp:BoundField DataField="Order Date" HeaderText="Order Date" SortExpression="Order Date" DataFormatString="{0:d}"></asp:BoundField>
<asp:BoundField DataField="Status" HeaderText="Status" SortExpression="Status"></asp:BoundField>
<asp:BoundField DataField="Amount" HeaderText="Amount" SortExpression="Amount" DataFormatString="{0:C2}"></asp:BoundField>
</Columns>
</asp:GridView>
</asp:Panel>
C# code:
protected void PendingOrdersPDF_Click(object sender, EventArgs e)
{
if (PendingOrdersGV.Rows.Count > 0)
{
//to allow paging=false & change style.
PendingOrdersGV.HeaderStyle.ForeColor = System.Drawing.Color.Black;
PendingOrdersGV.BorderColor = Color.Gray;
PendingOrdersGV.Font.Name = "Tahoma";
PendingOrdersGV.DataSource = clsBP.get_PendingOrders(lbl_BP_Id.Text);
PendingOrdersGV.AllowPaging = false;
PendingOrdersGV.Columns[0].Visible = false; //export won't work if there's a link in the gridview
PendingOrdersGV.DataBind();
//to PDF code --Sam
string attachment = "attachment; filename=report.pdf";
Response.ClearContent();
Response.AddHeader("content-disposition", attachment);
Response.ContentType = "application/pdf";
StringWriter stw = new StringWriter();
HtmlTextWriter htextw = new HtmlTextWriter(stw);
htextw.AddStyleAttribute("font-size", "8pt");
htextw.AddStyleAttribute("color", "Grey");
PendingOrdersPanel.RenderControl(htextw); //Name of the Panel
Document document = new Document();
document = new Document(PageSize.A4, 5, 5, 15, 5);
FontFactory.GetFont("Tahoma", 50, iTextSharp.text.BaseColor.BLUE);
PdfWriter.GetInstance(document, Response.OutputStream);
document.Open();
StringReader str = new StringReader(stw.ToString());
HTMLWorker htmlworker = new HTMLWorker(document);
htmlworker.Parse(str);
document.Close();
Response.Write(document);
}
}
of course include iTextSharp Refrences to cs file
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.html.simpleparser;
using iTextSharp.tool.xml;
Hope this helps!
Thank you
I use the following code to create PDF
protected void CreatePDF(Stream stream)
{
using (var document = new Document(PageSize.A4, 40, 40, 40, 30))
{
var writer = PdfWriter.GetInstance(document, stream);
writer.PageEvent = new ITextEvents();
document.Open();
// instantiate custom tag processor and add to `HtmlPipelineContext`.
var tagProcessorFactory = Tags.GetHtmlTagProcessorFactory();
tagProcessorFactory.AddProcessor(
new TableProcessor(),
new string[] { HTML.Tag.TABLE }
);
//Register Fonts.
XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontProvider.Register(HttpContext.Current.Server.MapPath("~/Content/Fonts/GothamRounded-Medium.ttf"), "Gotham Rounded Medium");
CssAppliers cssAppliers = new CssAppliersImpl(fontProvider);
var htmlPipelineContext = new HtmlPipelineContext(cssAppliers);
htmlPipelineContext.SetTagFactory(tagProcessorFactory);
var pdfWriterPipeline = new PdfWriterPipeline(document, writer);
var htmlPipeline = new HtmlPipeline(htmlPipelineContext, pdfWriterPipeline);
// get an ICssResolver and add the custom CSS
var cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);
cssResolver.AddCss(CSSSource, "utf-8", true);
var cssResolverPipeline = new CssResolverPipeline(
cssResolver, htmlPipeline
);
var worker = new XMLWorker(cssResolverPipeline, true);
var parser = new XMLParser(worker);
using (var stringReader = new StringReader(HTMLSource))
{
parser.Parse(stringReader);
document.Close();
HttpContext.Current.Response.ContentType = "application /pdf";
if (base.View)
HttpContext.Current.Response.AddHeader("content-disposition", "inline;filename=\"" + OutputFileName + ".pdf\"");
else
HttpContext.Current.Response.AddHeader("content-disposition", "attachment;filename=\"" + OutputFileName + ".pdf\"");
HttpContext.Current.Response.Cache.SetCacheability(HttpCacheability.NoCache);
HttpContext.Current.Response.WriteFile(OutputPath);
HttpContext.Current.Response.End();
}
}
}
Here's the link I used as a guide. Hope this helps!
Converting HTML to PDF using ITextSharp
protected void Page_Load(object sender, EventArgs e)
{
try
{
string strHtml = string.Empty;
//HTML File path -http://aspnettutorialonline.blogspot.com/
string htmlFileName = Server.MapPath("~") + "\\files\\" + "ConvertHTMLToPDF.htm";
//pdf file path. -http://aspnettutorialonline.blogspot.com/
string pdfFileName = Request.PhysicalApplicationPath + "\\files\\" + "ConvertHTMLToPDF.pdf";
//reading html code from html file
FileStream fsHTMLDocument = new FileStream(htmlFileName, FileMode.Open, FileAccess.Read);
StreamReader srHTMLDocument = new StreamReader(fsHTMLDocument);
strHtml = srHTMLDocument.ReadToEnd();
srHTMLDocument.Close();
strHtml = strHtml.Replace("\r\n", "");
strHtml = strHtml.Replace("\0", "");
CreatePDFFromHTMLFile(strHtml, pdfFileName);
Response.Write("pdf creation successfully with password -http://aspnettutorialonline.blogspot.com/");
}
catch (Exception ex)
{
Response.Write(ex.Message);
}
}
public void CreatePDFFromHTMLFile(string HtmlStream, string FileName)
{
try
{
object TargetFile = FileName;
string ModifiedFileName = string.Empty;
string FinalFileName = string.Empty;
/* To add a Password to PDF -http://aspnettutorialonline.blogspot.com/ */
TestPDF.HtmlToPdfBuilder builder = new TestPDF.HtmlToPdfBuilder(iTextSharp.text.PageSize.A4);
TestPDF.HtmlPdfPage first = builder.AddPage();
first.AppendHtml(HtmlStream);
byte[] file = builder.RenderPdf();
File.WriteAllBytes(TargetFile.ToString(), file);
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(TargetFile.ToString());
ModifiedFileName = TargetFile.ToString();
ModifiedFileName = ModifiedFileName.Insert(ModifiedFileName.Length - 4, "1");
string password = "password";
iTextSharp.text.pdf.PdfEncryptor.Encrypt(reader, new FileStream(ModifiedFileName, FileMode.Append), iTextSharp.text.pdf.PdfWriter.STRENGTH128BITS, password, "", iTextSharp.text.pdf.PdfWriter.AllowPrinting);
//http://aspnettutorialonline.blogspot.com/
reader.Close();
if (File.Exists(TargetFile.ToString()))
File.Delete(TargetFile.ToString());
FinalFileName = ModifiedFileName.Remove(ModifiedFileName.Length - 5, 1);
File.Copy(ModifiedFileName, FinalFileName);
if (File.Exists(ModifiedFileName))
File.Delete(ModifiedFileName);
}
catch (Exception ex)
{
throw ex;
}
}
You can download the sample file. Just place the html you want to convert in the files folder and run. It will automatically generate the pdf file and place it in the same folder. But in your case, you can specify your html path in the htmlFileName variable.

images and CSS using itextsharp [duplicate]

I want to convert the below HTML to PDF using iTextSharp but don't know where to start:
<style>
.headline{font-size:200%}
</style>
<p>
This <em>is </em>
<span class="headline" style="text-decoration: underline;">some</span>
<strong>sample<em> text</em></strong>
<span style="color: red;">!!!</span>
</p>
First, HTML and PDF are not related although they were created around the same time. HTML is intended to convey higher level information such as paragraphs and tables. Although there are methods to control it, it is ultimately up to the browser to draw these higher level concepts. PDF is intended to convey documents and the documents must "look" the same wherever they are rendered.
In an HTML document you might have a paragraph that's 100% wide and depending on the width of your monitor it might take 2 lines or 10 lines and when you print it it might be 7 lines and when you look at it on your phone it might take 20 lines. A PDF file, however, must be independent of the rendering device, so regardless of your screen size it must always render exactly the same.
Because of the musts above, PDF doesn't support abstract things like "tables" or "paragraphs". There are three basic things that PDF supports: text, lines/shapes and images. (There are other things like annotations and movies but I'm trying to keep it simple here.) In a PDF you don't say "here's a paragraph, browser do your thing!". Instead you say, "draw this text at this exact X,Y location using this exact font and don't worry, I've previously calculated the width of the text so I know it will all fit on this line". You also don't say "here's a table" but instead you say "draw this text at this exact location and then draw a rectangle at this other exact location that I've previously calculated so I know it will appear to be around the text".
Second, iText and iTextSharp parse HTML and CSS. That's it. ASP.Net, MVC, Razor, Struts, Spring, etc, are all HTML frameworks but iText/iTextSharp is 100% unaware of them. Same with DataGridViews, Repeaters, Templates, Views, etc. which are all framework-specific abstractions. It is your responsibility to get the HTML from your choice of framework, iText won't help you. If you get an exception saying The document has no pages or you think that "iText isn't parsing my HTML" it is almost definite that you don't actually have HTML, you only think you do.
Third, the built-in class that's been around for years is the HTMLWorker however this has been replaced with XMLWorker (Java / .Net). Zero work is being done on HTMLWorker which doesn't support CSS files and has only limited support for the most basic CSS properties and actually breaks on certain tags. If you do not see the HTML attribute or CSS property and value in this file then it probably isn't supported by HTMLWorker. XMLWorker can be more complicated sometimes but those complications also make it more extensible.
Below is C# code that shows how to parse HTML tags into iText abstractions that get automatically added to the document that you are working on. C# and Java are very similar so it should be relatively easy to convert this. Example #1 uses the built-in HTMLWorker to parse the HTML string. Since only inline styles are supported the class="headline" gets ignored but everything else should actually work. Example #2 is the same as the first except it uses XMLWorker instead. Example #3 also parses the simple CSS example.
//Create a byte array that will eventually hold our final PDF
Byte[] bytes;
//Boilerplate iTextSharp setup here
//Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream()) {
//Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
using (var doc = new Document()) {
//Create a writer that's bound to our PDF abstraction and our stream
using (var writer = PdfWriter.GetInstance(doc, ms)) {
//Open the document for writing
doc.Open();
//Our sample HTML and CSS
var example_html = #"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
var example_css = #".headline{font-size:200%}";
/**************************************************
* Example #1 *
* *
* Use the built-in HTMLWorker to parse the HTML. *
* Only inline CSS is supported. *
* ************************************************/
//Create a new HTMLWorker bound to our document
using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc)) {
//HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
using (var sr = new StringReader(example_html)) {
//Parse the HTML
htmlWorker.Parse(sr);
}
}
/**************************************************
* Example #2 *
* *
* Use the XMLWorker to parse the HTML. *
* Only inline CSS and absolutely linked *
* CSS is supported *
* ************************************************/
//XMLWorker also reads from a TextReader and not directly from a string
using (var srHtml = new StringReader(example_html)) {
//Parse the HTML
iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
}
/**************************************************
* Example #3 *
* *
* Use the XMLWorker to parse HTML and CSS *
* ************************************************/
//In order to read CSS as a string we need to switch to a different constructor
//that takes Streams instead of TextReaders.
//Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css))) {
using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html))) {
//Parse the HTML
iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
}
}
doc.Close();
}
}
//After all of the PDF "stuff" above is done and closed but **before** we
//close the MemoryStream, grab all of the active bytes from the stream
bytes = ms.ToArray();
}
//Now we just need to do something with those bytes.
//Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
//You could also write the bytes to a database in a varbinary() column (but please don't) or you
//could pass them to another function for further PDF processing.
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
System.IO.File.WriteAllBytes(testFile, bytes);
2017's update
There are good news for HTML-to-PDF demands. As this answer showed, the W3C standard css-break-3 will solve the problem... It is a Candidate Recommendation with plan to turn into definitive Recommendation this year, after tests.
As not-so-standard there are solutions, with plugins for C#, as showed by print-css.rocks.
As of 2018, there is also iText7 (A next iteration of old iTextSharp library) and its HTML to PDF package available: itext7.pdfhtml
Usage is straightforward:
HtmlConverter.ConvertToPdf(
new FileInfo(#"Path\to\Html\File.html"),
new FileInfo(#"Path\to\Pdf\File.pdf")
);
Method has many more overloads.
Update: iText* family of products has dual licensing model: free for open source, paid for commercial use.
#Chris Haas has explained very well how to use itextSharp to convert HTML to PDF, very helpful
my add is:
By using HtmlTextWriter I put html tags inside HTML table + inline CSS i got my PDF as I wanted without using XMLWorker .
Edit: adding sample code:
ASPX page:
<asp:Panel runat="server" ID="PendingOrdersPanel">
<!-- to be shown on PDF-->
<table style="border-spacing: 0;border-collapse: collapse;width:100%;display:none;" >
<tr><td><img src="abc.com/webimages/logo1.png" style="display: none;" width="230" /></td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:11px;color:#10466E;padding:0px;text-align:center;"><i>blablabla</i> Pending orders report<br /></td></tr>
</table>
<asp:GridView runat="server" ID="PendingOrdersGV" RowStyle-Wrap="false" AllowPaging="true" PageSize="10" Width="100%" CssClass="Grid" AlternatingRowStyle-CssClass="alt" AutoGenerateColumns="false"
PagerStyle-CssClass="pgr" HeaderStyle-ForeColor="White" PagerStyle-HorizontalAlign="Center" HeaderStyle-HorizontalAlign="Center" RowStyle-HorizontalAlign="Center" DataKeyNames="Document#"
OnPageIndexChanging="PendingOrdersGV_PageIndexChanging" OnRowDataBound="PendingOrdersGV_RowDataBound" OnRowCommand="PendingOrdersGV_RowCommand">
<EmptyDataTemplate><div style="text-align:center;">no records found</div></EmptyDataTemplate>
<Columns>
<asp:ButtonField CommandName="PendingOrders_Details" DataTextField="Document#" HeaderText="Document #" SortExpression="Document#" ItemStyle-ForeColor="Black" ItemStyle-Font-Underline="true"/>
<asp:BoundField DataField="Order#" HeaderText="order #" SortExpression="Order#"/>
<asp:BoundField DataField="Order Date" HeaderText="Order Date" SortExpression="Order Date" DataFormatString="{0:d}"></asp:BoundField>
<asp:BoundField DataField="Status" HeaderText="Status" SortExpression="Status"></asp:BoundField>
<asp:BoundField DataField="Amount" HeaderText="Amount" SortExpression="Amount" DataFormatString="{0:C2}"></asp:BoundField>
</Columns>
</asp:GridView>
</asp:Panel>
C# code:
protected void PendingOrdersPDF_Click(object sender, EventArgs e)
{
if (PendingOrdersGV.Rows.Count > 0)
{
//to allow paging=false & change style.
PendingOrdersGV.HeaderStyle.ForeColor = System.Drawing.Color.Black;
PendingOrdersGV.BorderColor = Color.Gray;
PendingOrdersGV.Font.Name = "Tahoma";
PendingOrdersGV.DataSource = clsBP.get_PendingOrders(lbl_BP_Id.Text);
PendingOrdersGV.AllowPaging = false;
PendingOrdersGV.Columns[0].Visible = false; //export won't work if there's a link in the gridview
PendingOrdersGV.DataBind();
//to PDF code --Sam
string attachment = "attachment; filename=report.pdf";
Response.ClearContent();
Response.AddHeader("content-disposition", attachment);
Response.ContentType = "application/pdf";
StringWriter stw = new StringWriter();
HtmlTextWriter htextw = new HtmlTextWriter(stw);
htextw.AddStyleAttribute("font-size", "8pt");
htextw.AddStyleAttribute("color", "Grey");
PendingOrdersPanel.RenderControl(htextw); //Name of the Panel
Document document = new Document();
document = new Document(PageSize.A4, 5, 5, 15, 5);
FontFactory.GetFont("Tahoma", 50, iTextSharp.text.BaseColor.BLUE);
PdfWriter.GetInstance(document, Response.OutputStream);
document.Open();
StringReader str = new StringReader(stw.ToString());
HTMLWorker htmlworker = new HTMLWorker(document);
htmlworker.Parse(str);
document.Close();
Response.Write(document);
}
}
of course include iTextSharp Refrences to cs file
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.html.simpleparser;
using iTextSharp.tool.xml;
Hope this helps!
Thank you
I use the following code to create PDF
protected void CreatePDF(Stream stream)
{
using (var document = new Document(PageSize.A4, 40, 40, 40, 30))
{
var writer = PdfWriter.GetInstance(document, stream);
writer.PageEvent = new ITextEvents();
document.Open();
// instantiate custom tag processor and add to `HtmlPipelineContext`.
var tagProcessorFactory = Tags.GetHtmlTagProcessorFactory();
tagProcessorFactory.AddProcessor(
new TableProcessor(),
new string[] { HTML.Tag.TABLE }
);
//Register Fonts.
XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontProvider.Register(HttpContext.Current.Server.MapPath("~/Content/Fonts/GothamRounded-Medium.ttf"), "Gotham Rounded Medium");
CssAppliers cssAppliers = new CssAppliersImpl(fontProvider);
var htmlPipelineContext = new HtmlPipelineContext(cssAppliers);
htmlPipelineContext.SetTagFactory(tagProcessorFactory);
var pdfWriterPipeline = new PdfWriterPipeline(document, writer);
var htmlPipeline = new HtmlPipeline(htmlPipelineContext, pdfWriterPipeline);
// get an ICssResolver and add the custom CSS
var cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);
cssResolver.AddCss(CSSSource, "utf-8", true);
var cssResolverPipeline = new CssResolverPipeline(
cssResolver, htmlPipeline
);
var worker = new XMLWorker(cssResolverPipeline, true);
var parser = new XMLParser(worker);
using (var stringReader = new StringReader(HTMLSource))
{
parser.Parse(stringReader);
document.Close();
HttpContext.Current.Response.ContentType = "application /pdf";
if (base.View)
HttpContext.Current.Response.AddHeader("content-disposition", "inline;filename=\"" + OutputFileName + ".pdf\"");
else
HttpContext.Current.Response.AddHeader("content-disposition", "attachment;filename=\"" + OutputFileName + ".pdf\"");
HttpContext.Current.Response.Cache.SetCacheability(HttpCacheability.NoCache);
HttpContext.Current.Response.WriteFile(OutputPath);
HttpContext.Current.Response.End();
}
}
}
Here's the link I used as a guide. Hope this helps!
Converting HTML to PDF using ITextSharp
protected void Page_Load(object sender, EventArgs e)
{
try
{
string strHtml = string.Empty;
//HTML File path -http://aspnettutorialonline.blogspot.com/
string htmlFileName = Server.MapPath("~") + "\\files\\" + "ConvertHTMLToPDF.htm";
//pdf file path. -http://aspnettutorialonline.blogspot.com/
string pdfFileName = Request.PhysicalApplicationPath + "\\files\\" + "ConvertHTMLToPDF.pdf";
//reading html code from html file
FileStream fsHTMLDocument = new FileStream(htmlFileName, FileMode.Open, FileAccess.Read);
StreamReader srHTMLDocument = new StreamReader(fsHTMLDocument);
strHtml = srHTMLDocument.ReadToEnd();
srHTMLDocument.Close();
strHtml = strHtml.Replace("\r\n", "");
strHtml = strHtml.Replace("\0", "");
CreatePDFFromHTMLFile(strHtml, pdfFileName);
Response.Write("pdf creation successfully with password -http://aspnettutorialonline.blogspot.com/");
}
catch (Exception ex)
{
Response.Write(ex.Message);
}
}
public void CreatePDFFromHTMLFile(string HtmlStream, string FileName)
{
try
{
object TargetFile = FileName;
string ModifiedFileName = string.Empty;
string FinalFileName = string.Empty;
/* To add a Password to PDF -http://aspnettutorialonline.blogspot.com/ */
TestPDF.HtmlToPdfBuilder builder = new TestPDF.HtmlToPdfBuilder(iTextSharp.text.PageSize.A4);
TestPDF.HtmlPdfPage first = builder.AddPage();
first.AppendHtml(HtmlStream);
byte[] file = builder.RenderPdf();
File.WriteAllBytes(TargetFile.ToString(), file);
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(TargetFile.ToString());
ModifiedFileName = TargetFile.ToString();
ModifiedFileName = ModifiedFileName.Insert(ModifiedFileName.Length - 4, "1");
string password = "password";
iTextSharp.text.pdf.PdfEncryptor.Encrypt(reader, new FileStream(ModifiedFileName, FileMode.Append), iTextSharp.text.pdf.PdfWriter.STRENGTH128BITS, password, "", iTextSharp.text.pdf.PdfWriter.AllowPrinting);
//http://aspnettutorialonline.blogspot.com/
reader.Close();
if (File.Exists(TargetFile.ToString()))
File.Delete(TargetFile.ToString());
FinalFileName = ModifiedFileName.Remove(ModifiedFileName.Length - 5, 1);
File.Copy(ModifiedFileName, FinalFileName);
if (File.Exists(ModifiedFileName))
File.Delete(ModifiedFileName);
}
catch (Exception ex)
{
throw ex;
}
}
You can download the sample file. Just place the html you want to convert in the files folder and run. It will automatically generate the pdf file and place it in the same folder. But in your case, you can specify your html path in the htmlFileName variable.

How to convert HTML to PDF using iTextSharp

I want to convert the below HTML to PDF using iTextSharp but don't know where to start:
<style>
.headline{font-size:200%}
</style>
<p>
This <em>is </em>
<span class="headline" style="text-decoration: underline;">some</span>
<strong>sample<em> text</em></strong>
<span style="color: red;">!!!</span>
</p>
First, HTML and PDF are not related although they were created around the same time. HTML is intended to convey higher level information such as paragraphs and tables. Although there are methods to control it, it is ultimately up to the browser to draw these higher level concepts. PDF is intended to convey documents and the documents must "look" the same wherever they are rendered.
In an HTML document you might have a paragraph that's 100% wide and depending on the width of your monitor it might take 2 lines or 10 lines and when you print it it might be 7 lines and when you look at it on your phone it might take 20 lines. A PDF file, however, must be independent of the rendering device, so regardless of your screen size it must always render exactly the same.
Because of the musts above, PDF doesn't support abstract things like "tables" or "paragraphs". There are three basic things that PDF supports: text, lines/shapes and images. (There are other things like annotations and movies but I'm trying to keep it simple here.) In a PDF you don't say "here's a paragraph, browser do your thing!". Instead you say, "draw this text at this exact X,Y location using this exact font and don't worry, I've previously calculated the width of the text so I know it will all fit on this line". You also don't say "here's a table" but instead you say "draw this text at this exact location and then draw a rectangle at this other exact location that I've previously calculated so I know it will appear to be around the text".
Second, iText and iTextSharp parse HTML and CSS. That's it. ASP.Net, MVC, Razor, Struts, Spring, etc, are all HTML frameworks but iText/iTextSharp is 100% unaware of them. Same with DataGridViews, Repeaters, Templates, Views, etc. which are all framework-specific abstractions. It is your responsibility to get the HTML from your choice of framework, iText won't help you. If you get an exception saying The document has no pages or you think that "iText isn't parsing my HTML" it is almost definite that you don't actually have HTML, you only think you do.
Third, the built-in class that's been around for years is the HTMLWorker however this has been replaced with XMLWorker (Java / .Net). Zero work is being done on HTMLWorker which doesn't support CSS files and has only limited support for the most basic CSS properties and actually breaks on certain tags. If you do not see the HTML attribute or CSS property and value in this file then it probably isn't supported by HTMLWorker. XMLWorker can be more complicated sometimes but those complications also make it more extensible.
Below is C# code that shows how to parse HTML tags into iText abstractions that get automatically added to the document that you are working on. C# and Java are very similar so it should be relatively easy to convert this. Example #1 uses the built-in HTMLWorker to parse the HTML string. Since only inline styles are supported the class="headline" gets ignored but everything else should actually work. Example #2 is the same as the first except it uses XMLWorker instead. Example #3 also parses the simple CSS example.
//Create a byte array that will eventually hold our final PDF
Byte[] bytes;
//Boilerplate iTextSharp setup here
//Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream()) {
//Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
using (var doc = new Document()) {
//Create a writer that's bound to our PDF abstraction and our stream
using (var writer = PdfWriter.GetInstance(doc, ms)) {
//Open the document for writing
doc.Open();
//Our sample HTML and CSS
var example_html = #"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
var example_css = #".headline{font-size:200%}";
/**************************************************
* Example #1 *
* *
* Use the built-in HTMLWorker to parse the HTML. *
* Only inline CSS is supported. *
* ************************************************/
//Create a new HTMLWorker bound to our document
using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc)) {
//HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
using (var sr = new StringReader(example_html)) {
//Parse the HTML
htmlWorker.Parse(sr);
}
}
/**************************************************
* Example #2 *
* *
* Use the XMLWorker to parse the HTML. *
* Only inline CSS and absolutely linked *
* CSS is supported *
* ************************************************/
//XMLWorker also reads from a TextReader and not directly from a string
using (var srHtml = new StringReader(example_html)) {
//Parse the HTML
iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
}
/**************************************************
* Example #3 *
* *
* Use the XMLWorker to parse HTML and CSS *
* ************************************************/
//In order to read CSS as a string we need to switch to a different constructor
//that takes Streams instead of TextReaders.
//Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css))) {
using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html))) {
//Parse the HTML
iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
}
}
doc.Close();
}
}
//After all of the PDF "stuff" above is done and closed but **before** we
//close the MemoryStream, grab all of the active bytes from the stream
bytes = ms.ToArray();
}
//Now we just need to do something with those bytes.
//Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
//You could also write the bytes to a database in a varbinary() column (but please don't) or you
//could pass them to another function for further PDF processing.
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
System.IO.File.WriteAllBytes(testFile, bytes);
2017's update
There are good news for HTML-to-PDF demands. As this answer showed, the W3C standard css-break-3 will solve the problem... It is a Candidate Recommendation with plan to turn into definitive Recommendation this year, after tests.
As not-so-standard there are solutions, with plugins for C#, as showed by print-css.rocks.
As of 2018, there is also iText7 (A next iteration of old iTextSharp library) and its HTML to PDF package available: itext7.pdfhtml
Usage is straightforward:
HtmlConverter.ConvertToPdf(
new FileInfo(#"Path\to\Html\File.html"),
new FileInfo(#"Path\to\Pdf\File.pdf")
);
Method has many more overloads.
Update: iText* family of products has dual licensing model: free for open source, paid for commercial use.
#Chris Haas has explained very well how to use itextSharp to convert HTML to PDF, very helpful
my add is:
By using HtmlTextWriter I put html tags inside HTML table + inline CSS i got my PDF as I wanted without using XMLWorker .
Edit: adding sample code:
ASPX page:
<asp:Panel runat="server" ID="PendingOrdersPanel">
<!-- to be shown on PDF-->
<table style="border-spacing: 0;border-collapse: collapse;width:100%;display:none;" >
<tr><td><img src="abc.com/webimages/logo1.png" style="display: none;" width="230" /></td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:11px;color:#10466E;padding:0px;text-align:center;"><i>blablabla</i> Pending orders report<br /></td></tr>
</table>
<asp:GridView runat="server" ID="PendingOrdersGV" RowStyle-Wrap="false" AllowPaging="true" PageSize="10" Width="100%" CssClass="Grid" AlternatingRowStyle-CssClass="alt" AutoGenerateColumns="false"
PagerStyle-CssClass="pgr" HeaderStyle-ForeColor="White" PagerStyle-HorizontalAlign="Center" HeaderStyle-HorizontalAlign="Center" RowStyle-HorizontalAlign="Center" DataKeyNames="Document#"
OnPageIndexChanging="PendingOrdersGV_PageIndexChanging" OnRowDataBound="PendingOrdersGV_RowDataBound" OnRowCommand="PendingOrdersGV_RowCommand">
<EmptyDataTemplate><div style="text-align:center;">no records found</div></EmptyDataTemplate>
<Columns>
<asp:ButtonField CommandName="PendingOrders_Details" DataTextField="Document#" HeaderText="Document #" SortExpression="Document#" ItemStyle-ForeColor="Black" ItemStyle-Font-Underline="true"/>
<asp:BoundField DataField="Order#" HeaderText="order #" SortExpression="Order#"/>
<asp:BoundField DataField="Order Date" HeaderText="Order Date" SortExpression="Order Date" DataFormatString="{0:d}"></asp:BoundField>
<asp:BoundField DataField="Status" HeaderText="Status" SortExpression="Status"></asp:BoundField>
<asp:BoundField DataField="Amount" HeaderText="Amount" SortExpression="Amount" DataFormatString="{0:C2}"></asp:BoundField>
</Columns>
</asp:GridView>
</asp:Panel>
C# code:
protected void PendingOrdersPDF_Click(object sender, EventArgs e)
{
if (PendingOrdersGV.Rows.Count > 0)
{
//to allow paging=false & change style.
PendingOrdersGV.HeaderStyle.ForeColor = System.Drawing.Color.Black;
PendingOrdersGV.BorderColor = Color.Gray;
PendingOrdersGV.Font.Name = "Tahoma";
PendingOrdersGV.DataSource = clsBP.get_PendingOrders(lbl_BP_Id.Text);
PendingOrdersGV.AllowPaging = false;
PendingOrdersGV.Columns[0].Visible = false; //export won't work if there's a link in the gridview
PendingOrdersGV.DataBind();
//to PDF code --Sam
string attachment = "attachment; filename=report.pdf";
Response.ClearContent();
Response.AddHeader("content-disposition", attachment);
Response.ContentType = "application/pdf";
StringWriter stw = new StringWriter();
HtmlTextWriter htextw = new HtmlTextWriter(stw);
htextw.AddStyleAttribute("font-size", "8pt");
htextw.AddStyleAttribute("color", "Grey");
PendingOrdersPanel.RenderControl(htextw); //Name of the Panel
Document document = new Document();
document = new Document(PageSize.A4, 5, 5, 15, 5);
FontFactory.GetFont("Tahoma", 50, iTextSharp.text.BaseColor.BLUE);
PdfWriter.GetInstance(document, Response.OutputStream);
document.Open();
StringReader str = new StringReader(stw.ToString());
HTMLWorker htmlworker = new HTMLWorker(document);
htmlworker.Parse(str);
document.Close();
Response.Write(document);
}
}
of course include iTextSharp Refrences to cs file
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.html.simpleparser;
using iTextSharp.tool.xml;
Hope this helps!
Thank you
I use the following code to create PDF
protected void CreatePDF(Stream stream)
{
using (var document = new Document(PageSize.A4, 40, 40, 40, 30))
{
var writer = PdfWriter.GetInstance(document, stream);
writer.PageEvent = new ITextEvents();
document.Open();
// instantiate custom tag processor and add to `HtmlPipelineContext`.
var tagProcessorFactory = Tags.GetHtmlTagProcessorFactory();
tagProcessorFactory.AddProcessor(
new TableProcessor(),
new string[] { HTML.Tag.TABLE }
);
//Register Fonts.
XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontProvider.Register(HttpContext.Current.Server.MapPath("~/Content/Fonts/GothamRounded-Medium.ttf"), "Gotham Rounded Medium");
CssAppliers cssAppliers = new CssAppliersImpl(fontProvider);
var htmlPipelineContext = new HtmlPipelineContext(cssAppliers);
htmlPipelineContext.SetTagFactory(tagProcessorFactory);
var pdfWriterPipeline = new PdfWriterPipeline(document, writer);
var htmlPipeline = new HtmlPipeline(htmlPipelineContext, pdfWriterPipeline);
// get an ICssResolver and add the custom CSS
var cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);
cssResolver.AddCss(CSSSource, "utf-8", true);
var cssResolverPipeline = new CssResolverPipeline(
cssResolver, htmlPipeline
);
var worker = new XMLWorker(cssResolverPipeline, true);
var parser = new XMLParser(worker);
using (var stringReader = new StringReader(HTMLSource))
{
parser.Parse(stringReader);
document.Close();
HttpContext.Current.Response.ContentType = "application /pdf";
if (base.View)
HttpContext.Current.Response.AddHeader("content-disposition", "inline;filename=\"" + OutputFileName + ".pdf\"");
else
HttpContext.Current.Response.AddHeader("content-disposition", "attachment;filename=\"" + OutputFileName + ".pdf\"");
HttpContext.Current.Response.Cache.SetCacheability(HttpCacheability.NoCache);
HttpContext.Current.Response.WriteFile(OutputPath);
HttpContext.Current.Response.End();
}
}
}
Here's the link I used as a guide. Hope this helps!
Converting HTML to PDF using ITextSharp
protected void Page_Load(object sender, EventArgs e)
{
try
{
string strHtml = string.Empty;
//HTML File path -http://aspnettutorialonline.blogspot.com/
string htmlFileName = Server.MapPath("~") + "\\files\\" + "ConvertHTMLToPDF.htm";
//pdf file path. -http://aspnettutorialonline.blogspot.com/
string pdfFileName = Request.PhysicalApplicationPath + "\\files\\" + "ConvertHTMLToPDF.pdf";
//reading html code from html file
FileStream fsHTMLDocument = new FileStream(htmlFileName, FileMode.Open, FileAccess.Read);
StreamReader srHTMLDocument = new StreamReader(fsHTMLDocument);
strHtml = srHTMLDocument.ReadToEnd();
srHTMLDocument.Close();
strHtml = strHtml.Replace("\r\n", "");
strHtml = strHtml.Replace("\0", "");
CreatePDFFromHTMLFile(strHtml, pdfFileName);
Response.Write("pdf creation successfully with password -http://aspnettutorialonline.blogspot.com/");
}
catch (Exception ex)
{
Response.Write(ex.Message);
}
}
public void CreatePDFFromHTMLFile(string HtmlStream, string FileName)
{
try
{
object TargetFile = FileName;
string ModifiedFileName = string.Empty;
string FinalFileName = string.Empty;
/* To add a Password to PDF -http://aspnettutorialonline.blogspot.com/ */
TestPDF.HtmlToPdfBuilder builder = new TestPDF.HtmlToPdfBuilder(iTextSharp.text.PageSize.A4);
TestPDF.HtmlPdfPage first = builder.AddPage();
first.AppendHtml(HtmlStream);
byte[] file = builder.RenderPdf();
File.WriteAllBytes(TargetFile.ToString(), file);
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(TargetFile.ToString());
ModifiedFileName = TargetFile.ToString();
ModifiedFileName = ModifiedFileName.Insert(ModifiedFileName.Length - 4, "1");
string password = "password";
iTextSharp.text.pdf.PdfEncryptor.Encrypt(reader, new FileStream(ModifiedFileName, FileMode.Append), iTextSharp.text.pdf.PdfWriter.STRENGTH128BITS, password, "", iTextSharp.text.pdf.PdfWriter.AllowPrinting);
//http://aspnettutorialonline.blogspot.com/
reader.Close();
if (File.Exists(TargetFile.ToString()))
File.Delete(TargetFile.ToString());
FinalFileName = ModifiedFileName.Remove(ModifiedFileName.Length - 5, 1);
File.Copy(ModifiedFileName, FinalFileName);
if (File.Exists(ModifiedFileName))
File.Delete(ModifiedFileName);
}
catch (Exception ex)
{
throw ex;
}
}
You can download the sample file. Just place the html you want to convert in the files folder and run. It will automatically generate the pdf file and place it in the same folder. But in your case, you can specify your html path in the htmlFileName variable.

Categories