How to Page Break HTML Content in HTML Renderer - c#

I have a project where HTML code is converted to a PDF using HTML Renderer. The HTML code contains a single table. The PDF is displayed but the issue is that the contents of the table are cut off at the end. So is there any solution to the problem?
PdfDocument pdf=new PdfDocument();
var config = new PdfGenerateConfig()
{
    MarginBottom = 20,
    MarginLeft = 20,
    MarginRight = 20,
    MarginTop = 20,
};
//config.PageOrientation = PageOrientation.Landscape;
config.ManualPageSize = new PdfSharp.Drawing.XSize(1080, 828);
pdf = PdfGenerator.GeneratePdf(html, config);
byte[] fileContents = null;
using (MemoryStream stream = new MemoryStream())
{
    pdf.Save(stream, true);
    fileContents = stream.ToArray();
    return new FileStreamResult(new MemoryStream(fileContents.ToArray()), "application/pdf");
}

HTMLRenderer should be able to break the table to the next page.
See also:
https://github.com/ArthurHub/HTML-Renderer/pull/41
Make sure you are using the latest version. You may have to add those CSS properties.
Also see this answer:
https://stackoverflow.com/a/37833107/162529

As far as I know page breaks are not supported, but I've done a bit of a work-around (which may not work for all cases) by splitting the HTML into separate pages using a page break class, then adding each page to the pdf.
See example code below:
//This will only work on page break elements that are direct children of the body element.
//Each page's content must be inside the pagebreak element
private static PdfDocument SplitHtmlIntoPagedPdf(string html, string pageBreakBeforeClass, PdfGenerateConfig config, PdfDocument pdf)
{
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    var htmlBodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");
    var tempHtml = string.Empty;
    foreach (var bodyNode in htmlBodyNode.ChildNodes)
    {
        if (bodyNode.Attributes["class"]?.Value == pageBreakBeforeClass)
        {
            if (!string.IsNullOrWhiteSpace(tempHtml))
            {
                //add any content found before the page break
                AddPageToPdf(htmlDoc,tempHtml,config,ref pdf);
                tempHtml = string.Empty;
            }
            AddPageToPdf(htmlDoc,bodyNode.OuterHtml,config,ref pdf);
        }
        else
        {
            tempHtml += bodyNode.OuterHtml;
        }
    }
    if (!string.IsNullOrWhiteSpace(tempHtml))
    {
        //add any content found after the last page break
        AddPageToPdf(htmlDoc, tempHtml, config, ref pdf);
    }
    return pdf;
}
private static void AddPageToPdf(HtmlDocument htmlDoc, string html, PdfGenerateConfig config, ref PdfDocument pdf)
{
    var tempDoc = new HtmlDocument();
    tempDoc.LoadHtml(htmlDoc.DocumentNode.OuterHtml);
    var docNode = tempDoc.DocumentNode;
    docNode.SelectSingleNode("//body").InnerHtml = html;
    var nodeDoc = PdfGenerator.GeneratePdf(docNode.OuterHtml, config);
    using (var tempMemoryStream = new MemoryStream())
    {
        nodeDoc.Save(tempMemoryStream, false);
        var openedDoc = PdfReader.Open(tempMemoryStream, PdfDocumentOpenMode.Import);
        foreach (PdfPage page in openedDoc.Pages)
        {
            pdf.AddPage(page);
        }
    }
}
Then call the code as follows:
var pdf = new PdfDocument();
var config = new PdfGenerateConfig()
{
    MarginLeft = 5,
    MarginRight = 5,
    PageOrientation = PageOrientation.Portrait,
    PageSize = PageSize.A4
};
if (!string.IsNullOrWhiteSpace(pageBreakBeforeClass))
{
    pdf = SplitHtmlIntoPagedPdf(html, pageBreakBeforeClass, config, pdf);
}
else
{
    pdf = PdfGenerator.GeneratePdf(html, config);
}
For any html that you want to have in its own page, just put the html inside a div with a class of "pagebreak" (or whatever you want to call it). If you want to, you could add that class to your css and give it "page-break-before: always;", so that the html will be print-friendly.
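As a rough illustration of the input this work-around expects, the sketch below builds a document whose body contains one div per page. The helper name PagedHtml.Build and the sample fragments are hypothetical and not part of HTML Renderer. Note that the class attribute has to equal the pageBreakBeforeClass value exactly, because the splitting code above compares the whole attribute value.

```csharp
using System.Collections.Generic;
using System.Text;

public static class PagedHtml
{
    // Hypothetical helper: wraps each fragment in its own page-break div,
    // so that every div ends up as a direct child of <body>.
    public static string Build(IEnumerable<string> pageFragments, string pageBreakClass)
    {
        var sb = new StringBuilder();
        sb.Append("<html><head><style>");
        // Print-friendly rule suggested in the answer above.
        sb.Append("." + pageBreakClass + " { page-break-before: always; }");
        sb.Append("</style></head><body>");
        foreach (var fragment in pageFragments)
        {
            sb.Append("<div class=\"" + pageBreakClass + "\">");
            sb.Append(fragment);
            sb.Append("</div>");
        }
        sb.Append("</body></html>");
        return sb.ToString();
    }
}
```

The result of something like PagedHtml.Build(new[] { "<h1>Page 1</h1>", "<h1>Page 2</h1>" }, "pagebreak") could then be passed to SplitHtmlIntoPagedPdf together with the same class name.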

I've just figured out how to make it work: rather than setting page-break-inside on the TD, set it on the TABLE. Here's the code:
table { page-break-inside: avoid; }
I'm currently on the following versions (not working on stable versions at the moment):
HtmlRenderer on v1.5.1-beta1
PDFsharp on v1.51.5185-beta
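A minimal sketch of how that rule might be wired in, assuming the table markup arrives as a string and the style is embedded in the document head. The class name TableBreakExample and the tableHtml parameter are made up for illustration; the GeneratePdf(html, PageSize.A4) overload is the one used elsewhere on this page. Whether the rule is honoured depends on the HtmlRenderer version, as noted above.

```csharp
using PdfSharp;
using PdfSharp.Pdf;
using TheArtOfDev.HtmlRenderer.PdfSharp;

public static class TableBreakExample
{
    // Hypothetical wrapper: embeds the CSS rule from the answer and renders to PDF.
    public static PdfDocument Render(string tableHtml)
    {
        string html =
            "<html><head><style>table { page-break-inside: avoid; }</style></head>" +
            "<body>" + tableHtml + "</body></html>";
        return PdfGenerator.GeneratePdf(html, PageSize.A4);
    }
}
```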

Related

Append HTML header to Spire PDF

I am using Spire PDF to convert my HTML template to PDF file. Here is the sample code for the same:
class Program
{
    static void Main(string[] args)
    {
        //Create a pdf document.
        PdfDocument doc = new PdfDocument();
        PdfPageSettings setting = new PdfPageSettings();
        setting.Size = new SizeF(1000,1000);
        setting.Margins = new Spire.Pdf.Graphics.PdfMargins(20);
        PdfHtmlLayoutFormat htmlLayoutFormat = new PdfHtmlLayoutFormat();
        htmlLayoutFormat.IsWaiting = true;
        String url = "https://www.wikipedia.org/";
        Thread thread = new Thread(() =>
        { doc.LoadFromHTML(url, false, false, false, setting,htmlLayoutFormat); });
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
        //Save pdf file.
        doc.SaveToFile("output-wiki.pdf");
        doc.Close();
        //Launching the Pdf file.
        System.Diagnostics.Process.Start("output-wiki.pdf");
    }
}
This works as expected, but now I want to add a header and footer to all the pages. Adding a header and footer is possible with Spire.PDF, but my requirement is to use an HTML template for the header, which I have not been able to achieve. Is there any way to render an HTML template in the header and footer?
Spire.PDF provides a class, PdfHTMLTextElement, that can render simple HTML tags (Font, B, I, U, Sub, Sup and BR) on a PDF page. You can append HTML to the header space of an existing PDF document using the following code snippet. As far as I know, Spire.PDF has no way to render complicated HTML as only a part of the document.
//load an existing pdf document
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"C:\Users\Administrator\Desktop\sample.pdf");
//loop through the pages
for (int i = 0; i < doc.Pages.Count; i++)
{
    //get the specific page
    PdfPageBase page = doc.Pages[i];
    //define HTML string
    string htmlText = "<b>XXX lnc.</b><br/><i>Tel:889 974 544</i><br/><font color='#FF4500'>Website:www.xxx.com</font>";
    //render HTML text
    PdfFont font = new PdfFont(PdfFontFamily.Helvetica, 12);
    PdfBrush brush = PdfBrushes.Black;
    PdfHTMLTextElement richTextElement = new PdfHTMLTextElement(htmlText, font, brush);
    richTextElement.TextAlign = TextAlign.Left;
    //draw html string at the top white space
    richTextElement.Draw(page.Canvas, new RectangleF(70, 20, page.GetClientSize().Width - 140, page.GetClientSize().Height - 20));
}
//save to file
doc.SaveToFile("output.pdf");

How to embed all fonts from other PDF iText 7

I am trying to overlay two PDF files using iText7/C#.
The first one is a kind of background and the second one contains form fields.
Everything works fine; the only problem is that I lose the fonts from the second file.
My attempt is as follows:
static public bool Overlay(string back_path, string front_path, string merge_path)
{
    PdfReader reader;
    PdfDocument pdf = null, front;
    try
    {
        reader = new PdfReader(back_path);
        pdf = new PdfDocument(reader, new PdfWriter(merge_path));
        front = new PdfDocument(new PdfReader(front_path));
        var form = PdfAcroForm.GetAcroForm(front, false);
        PdfAcroForm dform = PdfAcroForm.GetAcroForm(pdf, true);
        IDictionary<String, PdfFormField> fields = form.GetFormFields();
        // copy styles
        dform.SetDefaultResources(form.GetDefaultResources());
        dform.SetDefaultAppearance(form.GetDefaultAppearance().GetValue());
        // do overlay
        foreach (KeyValuePair<string, PdfFormField> pair in fields)
        {
            try
            {
                var field = pair.Value;
                PdfPage page = field.GetWidgets().First().GetPage();
                int pg_no = front.GetPageNumber(page);
                if (pg_no < front_start_page || pg_no > front_end_page)
                    continue;
                PdfObject copied = field.GetPdfObject().CopyTo(pdf, true);
                PdfFormField copiedField = PdfFormField.MakeFormField(copied, pdf);
                // The following returns null. If it returns something, I think I could use copiedField.setFont(font).
                // var font = field.GetFont();
                dform.AddField(copiedField, pdf.GetPage(pg_no));
            }
            catch (Exception ex)
            {
                System.Diagnostics.Debug.WriteLine($"Overlaying field {pair.Key} failed. ({ex.Message})");
            }
        }
        pdf.Close();
        return true;
    }
    catch (Exception ex)
    {
        throw new OverlayException(ex.Message);
    }
}
public static PdfDictionary get_font_dict(PdfDocument pdfDoc)
{
    PdfDictionary acroForm = pdfDoc.GetCatalog().GetPdfObject().GetAsDictionary(PdfName.AcroForm);
    if (acroForm == null)
    {
        return null;
    }
    PdfDictionary dr = acroForm.GetAsDictionary(PdfName.DR);
    if (dr == null)
    {
        return null;
    }
    PdfDictionary font = dr.GetAsDictionary(PdfName.Font);
    return font;
}
So basically I get all fonts from the second PDF and copy them to the final PDF.
But it does not work.
Logically, I think setting font of the original field to the copied one is the right way.
I mean PdfFormField.GetFont() and SetFont().
But it always returns null.
In a comment you clarified:
the background PDF can be assumed not to have form fields or annotations. I mean we can assume background PDF only contains static content (scanned form) and the front PDF only contains formfields.
In that case the easiest way to implement your method is to add the background as xobject to the form PDF instead of adding the form to the background PDF.
You can simply do that like this:
PdfReader formReader = new PdfReader(front_path);
PdfReader backReader = new PdfReader(back_path);
PdfWriter writer = new PdfWriter(merge_path);
using (PdfDocument source = new PdfDocument(backReader))
using (PdfDocument target = new PdfDocument(formReader, writer))
{
    PdfFormXObject xobject = source.GetPage(1).CopyAsFormXObject(target);
    PdfPage targetFirstPage = target.GetFirstPage();
    PdfStream stream = targetFirstPage.NewContentStreamBefore();
    PdfCanvas pdfCanvas = new PdfCanvas(stream, targetFirstPage.GetResources(), target);
    Rectangle cropBox = targetFirstPage.GetCropBox();
    pdfCanvas.AddXObject(xobject, cropBox.GetX(), cropBox.GetY());
}
Depending on the exact static contents of the background and the form PDF, you might want to use NewContentStreamAfter instead of NewContentStreamBefore or even to use some nifty blend mode to get the exact static content look you want.
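The snippet above handles only the first page. If both files have several pages, the same idea might be extended with a loop, roughly as sketched here. This is an untested sketch that reuses the same iText 7 calls as the snippet; the class and method names are hypothetical, and pairing page i of the background with page i of the form is an assumption about how the two files correspond. Newer iText releases may expose the drawing call under a different name than AddXObject.

```csharp
using System;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas;
using iText.Kernel.Pdf.Xobject;

public static class BackgroundOverlay
{
    // Hypothetical helper: draws page i of the background behind page i of the form PDF.
    public static void AddBackgrounds(string backPath, string frontPath, string mergePath)
    {
        using (PdfDocument source = new PdfDocument(new PdfReader(backPath)))
        using (PdfDocument target = new PdfDocument(new PdfReader(frontPath), new PdfWriter(mergePath)))
        {
            // Assumption: only pages present in both documents are processed.
            int pageCount = Math.Min(source.GetNumberOfPages(), target.GetNumberOfPages());
            for (int i = 1; i <= pageCount; i++)
            {
                PdfFormXObject xobject = source.GetPage(i).CopyAsFormXObject(target);
                PdfPage targetPage = target.GetPage(i);
                PdfStream stream = targetPage.NewContentStreamBefore();
                PdfCanvas canvas = new PdfCanvas(stream, targetPage.GetResources(), target);
                Rectangle cropBox = targetPage.GetCropBox();
                canvas.AddXObject(xobject, cropBox.GetX(), cropBox.GetY());
            }
        }
    }
}
```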

HTML Renderer/PDFsharp Combine Two HTML-Generated PDF Documents

I am trying to add two pages in one document. These two pages are generated from HTML.
Info : HTML Renderer for PDF using PDFsharp, HtmlRenderer.PdfSharp 1.5.0.6
var config = new PdfGenerateConfig
{
    PageOrientation = PageOrientation.Portrait,
    PageSize = PageSize.A4,
    MarginBottom = 0,
    MarginLeft = 0,
    MarginRight = 0,
    MarginTop = 0
};
string pdfFirstPage = CreateHtml();
string pdfsecondPage = CreateHtml2();
PdfDocument doc=new PdfDocument();
doc.AddPage(new PdfPage(PdfGenerator.GeneratePdf(pdfFirstPage, config)));
doc.AddPage(new PdfPage(PdfGenerator.GeneratePdf(pdfsecondPage, config)));
I tried a few ways, but the most common error concerns Import Mode. This is my latest attempt, and it is not successful. How can I combine two pages generated from HTML strings as 2 pages in 1 document and download it?
Here is code that works:
static void Main(string[] args)
{
    PdfDocument pdf1 = PdfGenerator.GeneratePdf("<p><h1>Hello World</h1>This is html rendered text #1</p>", PageSize.A4);
    PdfDocument pdf2 = PdfGenerator.GeneratePdf("<p><h1>Hello World</h1>This is html rendered text #2</p>", PageSize.A4);
    PdfDocument pdf1ForImport = ImportPdfDocument(pdf1);
    PdfDocument pdf2ForImport = ImportPdfDocument(pdf2);
    var combinedPdf = new PdfDocument();
    combinedPdf.Pages.Add(pdf1ForImport.Pages[0]);
    combinedPdf.Pages.Add(pdf2ForImport.Pages[0]);
    combinedPdf.Save("document.pdf");
}
private static PdfDocument ImportPdfDocument(PdfDocument pdf1)
{
    using (var stream = new MemoryStream())
    {
        pdf1.Save(stream, false);
        stream.Position = 0;
        var result = PdfReader.Open(stream, PdfDocumentOpenMode.Import);
        return result;
    }
}
I save each PDF document to a MemoryStream and reopen it for import. This allows its pages to be added to a new PdfDocument. Only the first page of each document is used for simplicity; add loops as needed.
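The "add loops as needed" part could look roughly like the following sketch, which copies every page of every generated document rather than only the first. The class and method names are hypothetical; the PDFsharp calls are the same ones used in the answer above.

```csharp
using System.Collections.Generic;
using System.IO;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

public static class PdfMerge
{
    // Hypothetical helper: appends all pages of each source document to one new document.
    public static PdfDocument Combine(IEnumerable<PdfDocument> documents)
    {
        var combined = new PdfDocument();
        foreach (PdfDocument document in documents)
        {
            using (var stream = new MemoryStream())
            {
                // Round-trip through a stream so the document can be opened in Import mode.
                document.Save(stream, false);
                stream.Position = 0;
                PdfDocument importable = PdfReader.Open(stream, PdfDocumentOpenMode.Import);
                foreach (PdfPage page in importable.Pages)
                {
                    combined.AddPage(page);
                }
            }
        }
        return combined;
    }
}
```

With this, something like PdfMerge.Combine(new[] { pdf1, pdf2 }).Save("document.pdf") would replace the two explicit Pages.Add calls.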

How to add a Pdfptable little by little to a document when generating a Pdf from a Html

The application I'm working on generates pdf reports from html files using itextsharp library. Apparently, when large tables are generated, itextsharp allocates a lot of memory (~300 MB for a 200KB file).
A solution to this problem is adding the table little by little to the document (so all existing data in table will be flushed), as described in the following links:
IText Large Tables
How to reduce memory consumption of PdfPTable with many cells
Question: How can I add a pdfptable in steps when generating the pdf from an existing html file?
Here is my code:
public byte[] GetReportPdf(string template, string cssString)
{
    byte[] result;
    using (var stream = new MemoryStream())
    {
        using (
            var doc = new Document(
                this.Settings.Size,
                this.Settings.Margins.Left,
                this.Settings.Margins.Right,
                this.Settings.Margins.Top,
                this.Settings.Margins.Bottom))
        {
            using (var writer = PdfWriter.GetInstance(doc, stream))
            {
                // adding the page event, or null
                writer.PageEvent = this.Settings.PageEvent;
                doc.Open();
                // CSS
                var cssResolver = new StyleAttrCSSResolver();
                using (var cssStream = new MemoryStream(Encoding.UTF8.GetBytes(cssString)))
                {
                    var cssFile = XMLWorkerHelper.GetCSS(cssStream);
                    cssResolver.AddCss(cssFile);
                }
                // HTML
                var fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
                var cssAppliers = new CssAppliersImpl(fontProvider);
                var htmlContext = new HtmlPipelineContext(cssAppliers);
                htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
                // pipelines
                var pdf = new PdfWriterPipeline(doc, writer);
                var html = new HtmlPipeline(htmlContext, pdf);
                var css = new CssResolverPipeline(cssResolver, html);
                // XML worker
                var worker = new XMLWorker(css, true);
                var parser = new XMLParser(worker);
                using (var stringReader = new StringReader(template))
                {
                    parser.Parse(stringReader);
                }
                doc.Close();
            }
        }
        result = stream.ToArray();
    }
    return result;
}
Notes:
The solutions in the previous links do not use HTML to create the PDF.
The steps described are: set table complete property to false, add every 50 table rows to the document, set the table complete property to true.
Using an AbstractTagProcessor, I managed to set the table complete property when the html is parsed, but found no option on how to trigger table adding while it's generated.
itextsharp version 5.5.10.0
itextsharp.xmlworker version 5.5.10.0
var tagFactory = Tags.GetHtmlTagProcessorFactory();
tagFactory.AddProcessor(new TableTagProcessor(doc), new string[]{"table"});
public class TableTagProcessor : iTextSharp.tool.xml.html.table.Table
{
    public override IList<IElement> Start(IWorkerContext ctx, Tag tag)
    {
        var result = base.Start(ctx, tag);
        foreach (PdfPTable table in result.OfType<PdfPTable>())
        {
            table.Complete = false;
        }
        return result;
    }
    public override IList<IElement> End(IWorkerContext ctx, Tag tag, IList<IElement> currentContent)
    {
        var result = base.End(ctx, tag, currentContent);
        foreach (PdfPTable table in result.OfType<PdfPTable>())
        {
            table.Complete = true;
        }
        return result;
    }
}
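For reference, the three steps listed in the notes (mark the table incomplete, add it every 50 rows, mark it complete) can be sketched for a table that is built directly in code. This does not solve the HTML case asked about; it only shows the pattern the tag processor would have to reproduce. The class name, column count and cell texts are made up for illustration, and rowCount is assumed to be greater than zero.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class LargeTableSketch
{
    // Hypothetical example: writes a two-column table in chunks of 50 rows.
    public static byte[] Build(int rowCount)
    {
        using (var stream = new MemoryStream())
        {
            var doc = new Document();
            PdfWriter.GetInstance(doc, stream);
            doc.Open();

            var table = new PdfPTable(2);
            table.Complete = false;            // step 1: more rows will follow

            for (int row = 1; row <= rowCount; row++)
            {
                table.AddCell("Row " + row);
                table.AddCell("Value " + row);
                if (row % 50 == 0)
                {
                    doc.Add(table);            // step 2: write the rows collected so far
                }
            }

            table.Complete = true;             // step 3: no more rows
            doc.Add(table);                    // write whatever is left
            doc.Close();
            return stream.ToArray();
        }
    }
}
```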

Using PDFsharp and MigraDoc to write to and then read from a PDF

I'm trying to write verification code for our PDF generating routines, and I'm having difficulty getting PDFsharp to extract text from files created with MigraDoc. The ExtractText code works with other PDFs, but not with the PDFs that I generate with MigraDoc (see code below.)
Any tips on what I'm doing wrong?
//Create the Doc
var doc = new MigraDoc.DocumentObjectModel.Document();
doc.Info.Title = "VerifyReadWrite";
var section = doc.AddSection();
section.AddParagraph("ABCDEF abcdef");
//Render the PDF
var renderer = new PdfDocumentRenderer(true);
var pdf = new PdfDocument();
renderer.PdfDocument = pdf;
renderer.Document = doc;
renderer.RenderDocument();
var msOut = new MemoryStream();
pdf.Save(msOut, true);
var pdfBytes = msOut.ToArray();
//Read the PDF into PdfSharp
var ms = new MemoryStream(pdfBytes);
var pdfRead = PdfSharp.Pdf.IO.PdfReader.Open(ms, PdfDocumentOpenMode.ReadOnly);
var segments = pdfRead.Pages[0].ExtractText().ToList();
Results in the following:
segments[0] = "\0$\0%\0&\0'\0(\0)"
segments[1] = "\0D\0E\0F\0G\0H\0I"
I'd expect to see:
segments[0] = "ABCDEF"
segments[1] = "abcdef"
I'm using the ExtractText code from here:
C# Extract text from PDF using PdfSharp
and it works very well for all but PDFs generated with MigraDoc.
public static IEnumerable<string> ExtractText(this PdfPage page)
{
    var content = ContentReader.ReadContent(page);
    var text = content.ExtractText();
    return text.Select(x => x.Trim());
}
public static IEnumerable<string> ExtractText(this CObject cObject)
{
    if (cObject is COperator)
    {
        var cOperator = (COperator) cObject;
        if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
            cOperator.OpCode.Name == OpCodeName.TJ.ToString())
        {
            foreach (var cOperand in cOperator.Operands)
                foreach (var txt in ExtractText(cOperand))
                    yield return txt;
        }
    }
    else
    {
        var sequence = cObject as CSequence;
        if (sequence != null)
        {
            var cSequence = sequence;
            foreach (var element in cSequence)
                foreach (var txt in ExtractText(element))
                    yield return txt;
        }
        else if (cObject is CString)
        {
            var cString = (CString) cObject;
            yield return cString.Value;
        }
    }
}
It seems the code used to extract text does not support all cases.
Try new PdfDocumentRenderer(false) (instead of 'true'). AFAIK this will lead to a different encoding and the text extraction might work.
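One plausible reading of the output in the question: with new PdfDocumentRenderer(true) the strings in the content stream hold two-byte glyph indices rather than character codes. In this particular sample every index is 29 less than the character code ('$' is 36 and 'A' is 65; 'D' is 68 and 'a' is 97). The sketch below only demonstrates that arithmetic. The offset of 29 is inferred from this one sample and is an assumption, not a PDFsharp constant; a general extractor would need the font's own glyph-to-Unicode mapping. GlyphDecode is a hypothetical name.

```csharp
using System.Text;

public static class GlyphDecode
{
    // Reads the raw string as big-endian two-byte glyph indices and shifts
    // each index by the given offset to get a character code.
    public static string Decode(string raw, int offset)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 1 < raw.Length; i += 2)
        {
            int glyphIndex = (raw[i] << 8) | raw[i + 1];
            sb.Append((char)(glyphIndex + offset));
        }
        return sb.ToString();
    }
}
```

For the two segments from the question, GlyphDecode.Decode("\0$\0%\0&\0'\0(\0)", 29) gives "ABCDEF" and GlyphDecode.Decode("\0D\0E\0F\0G\0H\0I", 29) gives "abcdef".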
