Merge PDF files with TOC element

I'm merging PDF files using GemBox.Pdf as shown here. This works great and I can easily add outlines.
I've previously done a similar thing and merged Word files with GemBox.Document as shown here.
But now my problem is that there is no TOC element in GemBox.Pdf. I want a Table of Contents to be generated automatically while merging multiple PDF files into one.
Am I missing something, or is there really no such element for PDF?
Do I need to recreate it myself? If so, how would I do that?
I can add a bookmark, but I don't know how to add a link to it.

There is no such element in PDF files, so we need to create this content ourselves.
Now one way would be to create text elements, outlines, and link annotations, position them appropriately, and set the link destinations to outlines.
However, this could be quite some work so perhaps it would be easier to just create the desired TOC element with GemBox.Document, save it as a PDF file, and then import it into the resulting PDF.
// Requires: using System.IO; using System.Linq; using System.Collections.Generic;
// plus the GemBox.Document, GemBox.Pdf and GemBox.Pdf.Objects namespaces.

// Source data for creating TOC entries with specified text and associated PDF files.
var pdfEntries = new[]
{
    new { Title = "First Document Title", Pdf = PdfDocument.Load("input1.pdf") },
    new { Title = "Second Document Title", Pdf = PdfDocument.Load("input2.pdf") },
    new { Title = "Third Document Title", Pdf = PdfDocument.Load("input3.pdf") },
};

/***************************************************************/
/* Create new document with TOC element using GemBox.Document. */
/***************************************************************/

// Create new document.
var tocDocument = new DocumentModel();
var section = new Section(tocDocument);
tocDocument.Sections.Add(section);

// Create and add TOC element.
var toc = new TableOfEntries(tocDocument, FieldType.TOC);
section.Blocks.Add(toc);
section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));

// Create heading style.
// By default, when updating the TOC element, a TOC entry is created for each paragraph that has a heading style.
var heading1Style = (ParagraphStyle)tocDocument.Styles.GetOrAdd(StyleTemplateType.Heading1);

// Add heading and empty (placeholder) pages.
// The number of added placeholder pages depends on the number of pages in the actual PDF file, so that TOC entries have correct page numbers.
int totalPageCount = 0;
foreach (var pdfEntry in pdfEntries)
{
    section.Blocks.Add(new Paragraph(tocDocument, pdfEntry.Title) { ParagraphFormat = { Style = heading1Style } });
    section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));

    int currentPageCount = pdfEntry.Pdf.Pages.Count;
    totalPageCount += currentPageCount;

    while (--currentPageCount > 0)
        section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));
}

// Remove the last, extra empty page.
section.Blocks.RemoveAt(section.Blocks.Count - 1);

// Update TOC element and save the document as PDF stream.
toc.Update();
var pdfStream = new MemoryStream();
tocDocument.Save(pdfStream, new GemBox.Document.PdfSaveOptions());

/***************************************************************/
/* Merge PDF files into PDF with TOC element using GemBox.Pdf. */
/***************************************************************/

// Load a PDF stream using GemBox.Pdf.
var pdfDocument = PdfDocument.Load(pdfStream);

var rootDictionary = (PdfDictionary)((PdfIndirectObject)pdfDocument.GetDictionary()[PdfName.Create("Root")]).Value;
var pagesDictionary = (PdfDictionary)((PdfIndirectObject)rootDictionary[PdfName.Create("Pages")]).Value;
var kidsArray = (PdfArray)pagesDictionary[PdfName.Create("Kids")];
var pageIds = kidsArray.Cast<PdfIndirectObject>().Select(obj => obj.Id).ToArray();

// Remove empty (placeholder) pages.
while (totalPageCount-- > 0)
    pdfDocument.Pages.RemoveAt(pdfDocument.Pages.Count - 1);

// Add pages from PDF files.
foreach (var pdfEntry in pdfEntries)
    foreach (var page in pdfEntry.Pdf.Pages)
        pdfDocument.Pages.AddClone(page);

/*****************************************************************************/
/* Update TOC links from placeholder pages to actual pages using GemBox.Pdf. */
/*****************************************************************************/

// Create a mapping from the ID of an empty (placeholder) page indirect object to an actual page indirect object.
var pageCloneMap = new Dictionary<PdfIndirectObjectIdentifier, PdfIndirectObject>();
for (int i = 0; i < kidsArray.Count; ++i)
    pageCloneMap.Add(pageIds[i], (PdfIndirectObject)kidsArray[i]);

foreach (var entry in pageCloneMap)
{
    // If page was updated, it means that we passed TOC pages, so break from the loop.
    if (entry.Key != entry.Value.Id)
        break;

    // For each TOC page, get its 'Annots' entry.
    // For each link annotation from the 'Annots' get the 'Dest' entry.
    // Update the first item in the 'Dest' array so that it no longer points to a removed page.
    if (((PdfDictionary)entry.Value.Value).TryGetValue(PdfName.Create("Annots"), out PdfBasicObject annotsObj))
        foreach (PdfIndirectObject annotObj in (PdfArray)annotsObj)
            if (((PdfDictionary)annotObj.Value).TryGetValue(PdfName.Create("Dest"), out PdfBasicObject destObj))
            {
                var destArray = (PdfArray)destObj;
                destArray[0] = pageCloneMap[((PdfIndirectObject)destArray[0]).Id];
            }
}

// Save resulting PDF file.
pdfDocument.Save("Result.pdf");
pdfDocument.Close();
This way you can easily customize the TOC element by using the TOC switches and styles. For more info, see the Table Of Content example from GemBox.Document.
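The placeholder pages exist only so that each heading lands on the page number it will have after merging. As a rough illustration of that arithmetic, here is a minimal sketch using only the base class library. The helper name StartPages and the sample page counts are hypothetical, and it assumes the number of pages the TOC occupies is already known.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of GemBox): given the number of pages the TOC
// occupies and the page count of each merged PDF, return the 1-based page
// number on which each PDF starts in the merged result.
static List<int> StartPages(int tocPageCount, IReadOnlyList<int> pageCounts)
{
    var result = new List<int>();
    int next = tocPageCount + 1; // first page after the TOC
    foreach (int count in pageCounts)
    {
        result.Add(next);
        next += count; // the following document starts after this one ends
    }
    return result;
}

// Assumed sample data: a one-page TOC and three PDFs with 3, 2 and 5 pages.
var startPages = StartPages(1, new[] { 3, 2, 5 });
Console.WriteLine(string.Join(", ", startPages)); // 2, 5, 7
```

These are the page numbers the TOC entries should show, which is what the heading plus page-break placeholders reproduce before toc.Update() is called.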

Related

How do I create a new PDF file every time I iterate or loop through the documents list?

I am new to PDF creation and am extending existing code that creates a PDF file.
I loop through a list of documents and give each one a new file name.
How do I create a new PDF file on every iteration of the loop over the documents list?
resultCollection - the list of documents
currentCompanySegmentsSetings - an object with the details
creatingTableOfContent - creates the table of contents
What I have tried
foreach (var item in resultCollection)
{
var guidID = Guid.NewGuid().ToString();
var newFileName = $"{currentCompanySegmentsSetings.FriendlySegmentName}-{Translator.TranslateDocumentType("invoice", currentCompanySegmentsSetings).ToLower()}-{guidID}-{message.documents.First().AccountNumber}.pdf";
outputFileNames.Add(newFileName);
//Create PDF's and send to the location
System.IO.Directory.CreateDirectory(currentOutputDirectory);
var firstDocsMetadata = resultCollection.First().MetaData;
string generatedPDFLocation = System.IO.Path.Combine(currentOutputDirectory, newFileName);
var file = DocumentsToPDFDocs(financialDocument);
PdfDocument pdfDoc = new PdfDocument(new PdfWriter(generatedPDFLocation, CreateEncryptionWriteProperties()));
Document doc = new Document(pdfDoc);
//Bookmarks
pdfDoc.GetCatalog().SetPageMode(PdfName.UseOutlines);
doc.SetMargins(22f, 22f, 22f, 22f);
doc.SetFontSize(8);
doc.SetFontColor(Color.BLACK);
//1.Create table of contents
var tableOfContentTopMargin = 176;
Table tableOfContent = creatingTableOfContent(file, currentCompanySegmentsSetings);
tableOfContent.SetDestination("p" + "index");
doc.Add(tableOfContent.SetMarginTop(tableOfContentTopMargin));
//How do i continue from here to create pdf to the directory
message.FinancialDocumentAttachments.Add(new MessageQueueAttachment()
{
Location = documentPath,
IsNew = true,
Id = Guid.NewGuid()
});
}

I assume you want to create a separate PDF file for each item in resultCollection.
You already generate a filename for each file:
var newFileName = ...
string generatedPDFLocation = System.IO.Path.Combine(currentOutputDirectory, newFileName);
And you make a new PdfDocument and Document instance, to be written to a file with that filename:
PdfDocument pdfDoc = new PdfDocument(new PdfWriter(generatedPDFLocation,
CreateEncryptionWriteProperties()));
Document doc = new Document(pdfDoc);
Then just add content to the Document instance (or to the PdfDocument). I assume that content is in the item or resultCollection somehow.
doc.Add(new Paragraph("content for this document"));
Finally, close the document which will flush the PDF file to disk.
doc.Close();
The next iteration of the foreach loop will generate a different filename and thus the next document will be written to a different file.
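To make the per-iteration naming concrete, here is a minimal sketch that uses only the base class library and leaves out the PDF writing itself. The helper BuildOutputPath, the sample item names, and the output directory are all hypothetical stand-ins for the values in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: builds a unique output path for one document.
static string BuildOutputPath(string directory, string prefix) =>
    Path.Combine(directory, $"{prefix}-{Guid.NewGuid()}.pdf");

// Stand-ins for the items in resultCollection (assumed names).
var items = new[] { "invoice-a", "invoice-b", "invoice-c" };
string outputDirectory = Path.Combine(Path.GetTempPath(), "pdf-output");

var outputPaths = new List<string>();
foreach (var item in items)
{
    // Computed inside the loop, so every iteration targets a different file.
    outputPaths.Add(BuildOutputPath(outputDirectory, item));
}

foreach (var path in outputPaths)
    Console.WriteLine(path);
```

In the real loop, each path would be passed to a new PdfWriter, and the Document would be closed before the next iteration begins.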

You're using:
var firstDocsMetadata = resultCollection.First().MetaData;
When it should be:
var firstDocsMetadata = item.MetaData;
Since you're iterating the resultCollection and item is the element that is obtained in each loop.

Processing word document using OpenXML and C#

So I'm trying to populate the content controls in a word document by matching the Tag and populating the text within that content control.
The following displays in a MessageBox all of the tags I have in my document.
//Create a copy of the template file and open the document
File.Delete(hhscDocument);
File.Copy(hhscTemplate, hhscDocument, true);
//Open the word document specified by location
using (var document = WordprocessingDocument.Open(hhscDocument, true))
{
//Change the document type from template to document
var mainDocument = document.MainDocumentPart.Document;
if (mainDocument.Body.Descendants<Tag>().Any())
{
//MessageBox.Show(mainDocument.Body.Descendants<Table>().Count().ToString());
var tags = mainDocument.Body.Descendants<Tag>().ToList();
var aString = string.Empty;
foreach(var tag in tags)
{
aString += string.Format("{0}{1}", tag.Val, Environment.NewLine);
}
MessageBox.Show(aString);
}
}
However when I try the following it doesn't work.
//Create a copy of the template file and open the document
File.Delete(hhscDocument);
File.Copy(hhscTemplate, hhscDocument, true);
//Open the word document specified by location
using (var document = WordprocessingDocument.Open(hhscDocument, true))
{
//Change the document type from template to document
var mainDocument = document.MainDocumentPart.Document;
if (mainDocument.Body.Descendants<Tag>().Any())
{
//MessageBox.Show(mainDocument.Body.Descendants<Table>().Count().ToString());
var tags = mainDocument.Body.Descendants<Tag>().ToList();
var bString = string.Empty;
foreach(var tag in tags)
{
bString += string.Format("{0}{1}", tag.Parent.GetFirstChild<Text>().Text, Environment.NewLine);
}
MessageBox.Show(bString);
}
}
My objective in the end is if I match the appropriate tag I want to populate/change the text in the content control that tag belongs to.

So I basically used FirstChild and InnerXml to pick apart the document's XML contents. From there I developed the following, which does what I need.
//Open the word document specified by location
using (var document = WordprocessingDocument.Open(hhscDocument, true))
{
    var mainDocument = document.MainDocumentPart.Document;
    if (mainDocument.Body.Descendants<Tag>().Any())
    {
        //Find all elements(descendants) of type tag
        var tags = mainDocument.Body.Descendants<Tag>().ToList();
        //Foreach of these tags
        foreach (var tag in tags)
        {
            //Jump up two levels (.Parent.Parent) in the XML element and then jump down to the run level
            var run = tag.Parent.Parent.Descendants<Run>().ToList();
            //I access the 1st element because there is only one element in run
            run[0].GetFirstChild<Text>().Text = "<new_text_value>";
        }
    }
    mainDocument.Save();
}
This finds all the tags inside your document and stores the elements in a list:
var tags = mainDocument.Body.Descendants<Tag>().ToList();
This part of the code starts at the tag element of the XML. From there I call Parent twice to move up two levels, so that I can reach the run level using Descendants.
var run = tag.Parent.Parent.Descendants<Run>().ToList();
Finally, the following code stores a new value in the text part of the plain-text content control.
run[0].GetFirstChild<Text>().Text = "<new_text_value>";
One thing I noticed is that the XML hierarchy is awkward to navigate. I find it easier to access these elements from the bottom up, which is why I started with the tags and moved up.
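To show why two Parent steps are needed, here is a small sketch that walks a simplified content-control fragment with System.Xml.Linq rather than the Open XML SDK. The XML is a hand-written illustration of an inline plain-text control, not output from a real document, and the tag value is made up.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

// Simplified markup of one plain-text content control (illustrative only).
var sdt = XElement.Parse(@"<w:sdt xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
  <w:sdtPr><w:tag w:val='CustomerName'/></w:sdtPr>
  <w:sdtContent><w:r><w:t>placeholder</w:t></w:r></w:sdtContent>
</w:sdt>");

var tag = sdt.Descendants(w + "tag").First();

// The tag's parent is w:sdtPr, which holds properties only and no text.
Console.WriteLine(tag.Parent.Name.LocalName);        // sdtPr

// Two levels up is w:sdt, which also contains w:sdtContent and its runs.
Console.WriteLine(tag.Parent.Parent.Name.LocalName); // sdt

// From w:sdt, the text element is reachable as a descendant.
var text = tag.Parent.Parent.Descendants(w + "t").First();
text.Value = "new text";
Console.WriteLine(text.Value);                       // new text
```

This also suggests why tag.Parent.GetFirstChild<Text>() in the question found nothing: the properties element has no text child, so the search has to start from the enclosing content control.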

Convert a Word (DOCX) file to a PDF in C# on cloud environment

I have generated a Word file using Open XML and I need to send it as an email attachment in PDF format, but I cannot save any physical PDF or Word file to disk because my application runs in a cloud environment (CRM Online).
The only way I have found is Aspose.Words for .NET:
http://www.aspose.com/docs/display/wordsnet/How+to++Convert+a+Document+to+a+Byte+Array But it is too expensive.
Then I found another approach: convert Word to HTML, then convert the HTML to PDF. But my Word document contains a picture, and I cannot get that to work.

The most accurate conversion from DOCX to PDF is going to be through Word. Your best option for that is setting up a server with OWAS (Office Web Apps Server) and doing your conversion through that.
You'll need to set up a WOPI endpoint on your application server and call:
/wv/WordViewer/request.pdf?WOPISrc={WopiUrl}&type=downloadpdf
OR
/wv/WordViewer/request.pdf?WOPISrc={WopiUrl}&type=printpdf
Alternatively you could try and do it using OneDrive and Word Online, but you'll need to work out the parameters Word Online uses as well as whether that's permitted within the Ts & Cs.

You can try Gnostice XtremeDocumentStudio .NET.
Converting From DOCX To PDF Using XtremeDocumentStudio .NET
http://www.gnostice.com/goto.asp?id=24900&t=convert_docx_to_pdf_using_xdoc.net
In the published article, conversion has been demonstrated to save to a physical file. You can use documentConverter.ConvertToStream method to convert a document to a Stream as shown below in the code snippet.
DocumentConverter documentConverter = new DocumentConverter();
// input can be a FilePath, Stream, list of FilePaths or list of Streams
Object input = "InputDocument.docx";
string outputFileFormat = "pdf";
ConversionMode conversionMode = ConversionMode.ConvertToSeperateFiles;
List<Stream> outputStreams = documentConverter.ConvertToStream(input, outputFileFormat, conversionMode);
Disclaimer: I work for Gnostice.

If you want to convert a byte array, you can use Metamorphosis:
// 'p' is not declared in the original snippet; it is assumed here to be a SautinSoft.PdfMetamorphosis instance.
string docxPath = @"example.docx";
string pdfPath = Path.ChangeExtension(docxPath, ".pdf");
byte[] docx = File.ReadAllBytes(docxPath);
// Convert DOCX to PDF in memory
byte[] pdf = p.DocxToPdfConvertByte(docx);
if (pdf != null)
{
    // Save the PDF document to a file for a viewing purpose.
    File.WriteAllBytes(pdfPath, pdf);
    System.Diagnostics.Process.Start(pdfPath);
}
else
{
    System.Console.WriteLine("Conversion failed!");
    Console.ReadLine();
}

I have recently used the SautinSoft 'Document .Net' library to convert DOCX to PDF in my React (frontend), .NET Core (microservices backend) application. It takes only 15 seconds to generate a PDF with 23 pages. Those 15 seconds include getting data from the database, merging the data with a DOCX template, and converting the result to PDF. The code has been deployed to an Azure Linux box and works fine.
https://sautinsoft.com/products/document/
Sample code
public string GeneratePDF(PDFDocumentModel document)
{
byte[] output = null;
using (var outputStream = new MemoryStream())
{
// Create single pdf.
DocumentCore singlePDF = new DocumentCore();
var documentCores = new List<DocumentCore>();
foreach (var section in document.Sections)
{
documentCores.Add(GenerateDocument(section));
}
foreach (var dc in documentCores)
{
// Create import session.
ImportSession session = new ImportSession(dc, singlePDF, StyleImportingMode.KeepSourceFormatting);
// Loop through all sections in the source document.
foreach (Section sourceSection in dc.Sections)
{
// Because we are copying a section from one document to another,
// it is required to import the Section into the destination document.
// This adjusts any document-specific references to styles, bookmarks, etc.
// Importing a element creates a copy of the original element, but the copy
// is ready to be inserted into the destination document.
Section importedSection = singlePDF.Import<Section>(sourceSection, true, session);
// First section start from new page.
if (dc.Sections.IndexOf(sourceSection) == 0)
importedSection.PageSetup.SectionStart = SectionStart.NewPage;
// Now the new section can be appended to the destination document.
singlePDF.Sections.Add(importedSection);
//Paging
HeaderFooter footer = new HeaderFooter(singlePDF, HeaderFooterType.FooterDefault);
// Create a new paragraph to insert a page numbering.
// So that, our page numbering looks as: Page N of M.
Paragraph par = new Paragraph(singlePDF);
par.ParagraphFormat.Alignment = HorizontalAlignment.Center;
CharacterFormat cf = new CharacterFormat() { FontName = "Consolas", Size = 11.0 };
par.Content.Start.Insert("Page ", cf.Clone());
// Page numbering is a Field.
Field fPage = new Field(singlePDF, FieldType.Page);
fPage.CharacterFormat = cf.Clone();
par.Content.End.Insert(fPage.Content);
par.Content.End.Insert(" of ", cf.Clone());
Field fPages = new Field(singlePDF, FieldType.NumPages);
fPages.CharacterFormat = cf.Clone();
par.Content.End.Insert(fPages.Content);
footer.Blocks.Add(par);
importedSection.HeadersFooters.Add(footer);
}
}
var pdfOptions = new PdfSaveOptions();
pdfOptions.Compression = false;
pdfOptions.EmbedAllFonts = false;
pdfOptions.EmbeddedImagesFormat = PdfSaveOptions.EmbImagesFormat.Png;
pdfOptions.EmbeddedJpegQuality = 100;
//dont allow editing after population, also ensures content can be printed.
pdfOptions.PreserveFormFields = false;
pdfOptions.PreserveContentControls = false;
if (!string.IsNullOrEmpty(document.PdfProperties.Title))
{
singlePDF.Document.Properties.BuiltIn[BuiltInDocumentProperty.Title] = document.PdfProperties.Title;
}
if (!string.IsNullOrEmpty(document.PdfProperties.Author))
{
singlePDF.Document.Properties.BuiltIn[BuiltInDocumentProperty.Author] = document.PdfProperties.Author;
}
if (!string.IsNullOrEmpty(document.PdfProperties.Subject))
{
singlePDF.Document.Properties.BuiltIn[BuiltInDocumentProperty.Subject] = document.PdfProperties.Subject;
}
singlePDF.Save(outputStream, pdfOptions);
output = outputStream.ToArray();
}
return Convert.ToBase64String(output);
}

Blank pages when converting to PDF with TuesPechkin

I'd like to convert several different web pages into one PDF document. I found Pechkin / TuesPechkin, which has been a wonderful discovery, but I am running into one problem: only the last Object gets converted, and all the other PDF pages are blank. What could be causing this problem?
var document = new HtmlToPdfDocument
{
    GlobalSettings =
    {
        Margins =
        {
            All = 1.375,
            Unit = Unit.Centimeters
        }
    }
};
// Each "page" variable contains one HTML page
foreach (var page in pages)
    document.Objects.Add(new ObjectSettings { HtmlText = page.Html });
// Create converter
var converter = Factory.Create();
// Convert!
var result = converter.Convert(document);
// Save
File.WriteAllBytes(path, result);

Turns out that this is a confirmed bug.
https://github.com/tuespetre/TuesPechkin/issues/23
I ended up solving the issue by generating one page at a time and merging the pages with iTextSharp.

Heading 1, Heading 2 is not highlighted in style ribbon of document after merging docx file

I am merging a few docx files that were created using Open XML and WordprocessingML through C#. The files contain text tagged with heading styles such as Heading 1 and Heading 2. When the files are created individually, clicking or selecting text tagged with Heading 1 or Heading 2 highlights the corresponding style in the style ribbon, and the navigation pane also lists those headings. After merging the documents, however, clicking or selecting the same text no longer highlights Heading 1 or Heading 2 in the style ribbon. The merging code is given here:
MemoryStream ms = new MemoryStream();
using (WordprocessingDocument myDoc =
WordprocessingDocument.Create(ms, WordprocessingDocumentType.Document))
{
MainDocumentPart mainPart = myDoc.AddMainDocumentPart();
mainPart.Document = new Document { Body = new Body() };
int counter = 1;
foreach (var sectionOutput in sectionOutputs)
{
foreach (var outputFile in sectionOutput.Files)
{
Paragraph sectionBreakPara = null;
if (!sectionOutput.SectionType.Equals(sectionOutputs[sectionOutputs.Count - 1].SectionType))
{
if (outputFile == sectionOutput.Files.Last())
//check whether this is the last file in this section
{
using (
WordprocessingDocument pkgSourceDoc =
WordprocessingDocument.Open(outputFile.OutputStream, true))
{
var sourceBody = pkgSourceDoc.MainDocumentPart.Document.Body;
SectionProperties docSectionBreak =
sourceBody.Descendants<SectionProperties>().LastOrDefault();
if (docSectionBreak != null)
{
var clonedSectionBreak = (SectionProperties)docSectionBreak.CloneNode(true);
clonedSectionBreak.RemoveAllChildren<FooterReference>();
clonedSectionBreak.RemoveAllChildren<HeaderReference>();
sectionBreakPara = new Paragraph();
ParagraphProperties sectionParaProp = new ParagraphProperties();
sectionParaProp.AppendChild(clonedSectionBreak);
sectionBreakPara.AppendChild(sectionParaProp);
}
}
}
}
string altChunkId = string.Format("altchunkId{0}", counter);
AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
AlternativeFormatImportPartType.WordprocessingML, altChunkId);
outputFile.OutputStream.Seek(0, SeekOrigin.Begin);
chunk.FeedData(outputFile.OutputStream);
AltChunk altChunk = new AltChunk(new AltChunkProperties(new MatchSource { Val = new OnOffValue(true) })) { Id = altChunkId };
mainPart.Document.Body.AppendChild(altChunk);
if (sectionBreakPara != null)
{
mainPart.Document
.Body
.AppendChild(sectionBreakPara);
}
counter++;
}
}
mainPart.Document.Save();
}
return ms;

In general, this symptom arises when the style definition is not present in the styles.xml part. If during the merge process the document content was carried over but the styles parts weren't, that could cause this problem.
In a new Word document, there are only a very few basic styles, like Normal. A style definition like Heading 1 is not added to the styles.xml until you assign that style to a paragraph. When a paragraph element contains a style assignment for a style not present in the package, the style is ignored.
It can also arise in table cells, where a table setting is overriding the style. For example, in a table you can say the first row (like headings) should appear in a particular font and color, and that will override a style setting.
If neither of those works, if you post a smallish amount of the XML that's generated, right around one of the paragraphs and its immediate context, that might give some clues.
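The first cause described above, a paragraph referencing a style that the styles part does not define, can be checked mechanically. Here is a minimal sketch using System.Xml.Linq on hand-written fragments. The helper FindMissingStyleIds and the sample XML are hypothetical; with a real package you would load the main document part and the styles part instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

// Illustrative fragments, not real package parts: the body uses Heading1,
// but the styles part defines only Normal.
var body = XElement.Parse(@"<w:body xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
  <w:p><w:pPr><w:pStyle w:val='Heading1'/></w:pPr><w:r><w:t>Title</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val='Normal'/></w:pPr><w:r><w:t>Text</w:t></w:r></w:p>
</w:body>");
var styles = XElement.Parse(@"<w:styles xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
  <w:style w:type='paragraph' w:styleId='Normal'><w:name w:val='Normal'/></w:style>
</w:styles>");

// Hypothetical helper: returns the style IDs that paragraphs reference
// but that have no matching w:style definition.
static List<string> FindMissingStyleIds(XElement bodyXml, XElement stylesXml, XNamespace ns)
{
    var defined = new HashSet<string>(
        stylesXml.Elements(ns + "style").Select(s => (string)s.Attribute(ns + "styleId")));
    return bodyXml.Descendants(ns + "pStyle")
        .Select(p => (string)p.Attribute(ns + "val"))
        .Where(id => !defined.Contains(id))
        .Distinct()
        .ToList();
}

var missing = FindMissingStyleIds(body, styles, w);
Console.WriteLine(string.Join(", ", missing)); // Heading1
```

An empty result would point away from missing style definitions and toward the other causes mentioned, such as table formatting overriding the style.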
