Processing word document using OpenXML and C#


I'm trying to populate the content controls in a Word document by matching each control's Tag and setting the text inside that control.
The following code displays all of the tags in my document in a MessageBox.
//Create a copy of the template file and open the document
File.Delete(hhscDocument);
File.Copy(hhscTemplate, hhscDocument, true);
//Open the Word document specified by location
using (var document = WordprocessingDocument.Open(hhscDocument, true))
{
    //Get the root element of the main document part
    var mainDocument = document.MainDocumentPart.Document;
    if (mainDocument.Body.Descendants<Tag>().Any())
    {
        //MessageBox.Show(mainDocument.Body.Descendants<Table>().Count().ToString());
        var tags = mainDocument.Body.Descendants<Tag>().ToList();
        var aString = string.Empty;
        foreach (var tag in tags)
        {
            aString += string.Format("{0}{1}", tag.Val, Environment.NewLine);
        }
        MessageBox.Show(aString);
    }
}
However, when I try the following to read the text that belongs to each tag, it doesn't work.
//Create a copy of the template file and open the document
File.Delete(hhscDocument);
File.Copy(hhscTemplate, hhscDocument, true);
//Open the Word document specified by location
using (var document = WordprocessingDocument.Open(hhscDocument, true))
{
    //Get the root element of the main document part
    var mainDocument = document.MainDocumentPart.Document;
    if (mainDocument.Body.Descendants<Tag>().Any())
    {
        //MessageBox.Show(mainDocument.Body.Descendants<Table>().Count().ToString());
        var tags = mainDocument.Body.Descendants<Tag>().ToList();
        var bString = string.Empty;
        foreach (var tag in tags)
        {
            bString += string.Format("{0}{1}", tag.Parent.GetFirstChild<Text>().Text, Environment.NewLine);
        }
        MessageBox.Show(bString);
    }
}
My end goal: when I find the matching tag, I want to populate or change the text in the content control that the tag belongs to.

I used FirstChild and InnerXml to pick apart the document's XML contents. From there I developed the following, which does what I need.
//Open the Word document specified by location
using (var document = WordprocessingDocument.Open(hhscDocument, true))
{
    var mainDocument = document.MainDocumentPart.Document;
    if (mainDocument.Body.Descendants<Tag>().Any())
    {
        //Find all elements (descendants) of type Tag
        var tags = mainDocument.Body.Descendants<Tag>().ToList();
        //For each of these tags
        foreach (var tag in tags)
        {
            //Jump up two levels (.Parent.Parent) in the XML and then jump down to the run level
            var run = tag.Parent.Parent.Descendants<Run>().ToList();
            //I access the first element because there is only one element in run
            run[0].GetFirstChild<Text>().Text = "<new_text_value>";
        }
    }
    mainDocument.Save();
}
This finds all the tags inside your document and stores the elements in a list:
var tags = mainDocument.Body.Descendants<Tag>().ToList();
This part of the code starts at the tag element of the XML. From there I call Parent twice to jump up two levels, so that I can reach the Run level using Descendants:
var run = tag.Parent.Parent.Descendants<Run>().ToList();
Finally, the following line stores a new value in the text part of the plain-text content control:
run[0].GetFirstChild<Text>().Text = "<new_text_value>";
One thing I noticed is that the XML hierarchy is not obvious. I find it easier to work from the bottom up, which is why I started with the tags and moved up.
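To make the hierarchy concrete, here is a minimal sketch that builds one plain-text content control in memory and walks it from the tag upwards. It assumes the Open XML SDK (DocumentFormat.OpenXml) is referenced; the tag value "CustomerName" and the text values are made up for illustration. It also shows why tag.Parent.GetFirstChild<Text>() in the question fails: the tag's parent is the properties element, which holds no text.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Wordprocessing;

// Illustrative content control (values are hypothetical):
// <w:sdt>
//   <w:sdtPr><w:tag w:val="CustomerName"/></w:sdtPr>
//   <w:sdtContent><w:r><w:t>old</w:t></w:r></w:sdtContent>
// </w:sdt>
var sdt = new SdtRun(
    new SdtProperties(new Tag { Val = "CustomerName" }),
    new SdtContentRun(new Run(new Text("old"))));

var tag = sdt.Descendants<Tag>().First();

// One level up is the properties element, which has no Text child.
Console.WriteLine(tag.Parent.GetType().Name);                 // SdtProperties
Console.WriteLine(tag.Parent.GetFirstChild<Text>() == null);  // True

// Two levels up is the content control itself; its text sits below it.
var text = tag.Parent.Parent.Descendants<Text>().First();
text.Text = "new";
Console.WriteLine(sdt.InnerText);                             // new
```

Calling .Text on the null result of GetFirstChild<Text>() is what throws in the second snippet of the question.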

Related

Merge PDF files with TOC element

I'm merging PDF files using GemBox.Pdf as shown here. This works great and I can easily add outlines.
I've previously done a similar thing and merged Word files with GemBox.Document as shown here.
My problem now is that GemBox.Pdf has no TOC element. I want a table of contents to be generated automatically while merging multiple PDF files into one.
Am I missing something, or is there really no such element for PDF?
Do I need to recreate it myself? If so, how would I do that?
I can add a bookmark, but I don't know how to add a link to it.

There is no such element in PDF files, so we need to create this content ourselves.
Now one way would be to create text elements, outlines, and link annotations, position them appropriately, and set the link destinations to outlines.
However, this could be quite some work so perhaps it would be easier to just create the desired TOC element with GemBox.Document, save it as a PDF file, and then import it into the resulting PDF.
// Source data for creating TOC entries with specified text and associated PDF files.
var pdfEntries = new[]
{
new { Title = "First Document Title", Pdf = PdfDocument.Load("input1.pdf") },
new { Title = "Second Document Title", Pdf = PdfDocument.Load("input2.pdf") },
new { Title = "Third Document Title", Pdf = PdfDocument.Load("input3.pdf") },
};
/***************************************************************/
/* Create new document with TOC element using GemBox.Document. */
/***************************************************************/
// Create new document.
var tocDocument = new DocumentModel();
var section = new Section(tocDocument);
tocDocument.Sections.Add(section);
// Create and add TOC element.
var toc = new TableOfEntries(tocDocument, FieldType.TOC);
section.Blocks.Add(toc);
section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));
// Create heading style.
// By default, when updating TOC element a TOC entry is created for each paragraph that has heading style.
var heading1Style = (ParagraphStyle)tocDocument.Styles.GetOrAdd(StyleTemplateType.Heading1);
// Add heading and empty (placeholder) pages.
// The number of placeholder pages added depends on the number of pages in the actual PDF file, so that TOC entries get correct page numbers.
int totalPageCount = 0;
foreach (var pdfEntry in pdfEntries)
{
    section.Blocks.Add(new Paragraph(tocDocument, pdfEntry.Title) { ParagraphFormat = { Style = heading1Style } });
    section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));
    int currentPageCount = pdfEntry.Pdf.Pages.Count;
    totalPageCount += currentPageCount;
    while (--currentPageCount > 0)
        section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));
}
// Remove the extra empty page added last.
section.Blocks.RemoveAt(section.Blocks.Count - 1);
// Update TOC element and save the document as PDF stream.
toc.Update();
var pdfStream = new MemoryStream();
tocDocument.Save(pdfStream, new GemBox.Document.PdfSaveOptions());
/***************************************************************/
/* Merge PDF files into PDF with TOC element using GemBox.Pdf. */
/***************************************************************/
// Load a PDF stream using GemBox.Pdf.
var pdfDocument = PdfDocument.Load(pdfStream);
var rootDictionary = (PdfDictionary)((PdfIndirectObject)pdfDocument.GetDictionary()[PdfName.Create("Root")]).Value;
var pagesDictionary = (PdfDictionary)((PdfIndirectObject)rootDictionary[PdfName.Create("Pages")]).Value;
var kidsArray = (PdfArray)pagesDictionary[PdfName.Create("Kids")];
var pageIds = kidsArray.Cast<PdfIndirectObject>().Select(obj => obj.Id).ToArray();
// Remove empty (placeholder) pages.
while (totalPageCount-- > 0)
pdfDocument.Pages.RemoveAt(pdfDocument.Pages.Count - 1);
// Add pages from PDF files.
foreach (var pdfEntry in pdfEntries)
foreach (var page in pdfEntry.Pdf.Pages)
pdfDocument.Pages.AddClone(page);
/*****************************************************************************/
/* Update TOC links from placeholder pages to actual pages using GemBox.Pdf. */
/*****************************************************************************/
// Create a mapping from the ID of an empty (placeholder) page indirect object to an actual page indirect object.
var pageCloneMap = new Dictionary<PdfIndirectObjectIdentifier, PdfIndirectObject>();
for (int i = 0; i < kidsArray.Count; ++i)
    pageCloneMap.Add(pageIds[i], (PdfIndirectObject)kidsArray[i]);
foreach (var entry in pageCloneMap)
{
    // If page was updated, it means that we passed TOC pages, so break from the loop.
    if (entry.Key != entry.Value.Id)
        break;
    // For each TOC page, get its 'Annots' entry.
    // For each link annotation from the 'Annots' get the 'Dest' entry.
    // Update the first item in the 'Dest' array so that it no longer points to a removed page.
    if (((PdfDictionary)entry.Value.Value).TryGetValue(PdfName.Create("Annots"), out PdfBasicObject annotsObj))
        foreach (PdfIndirectObject annotObj in (PdfArray)annotsObj)
            if (((PdfDictionary)annotObj.Value).TryGetValue(PdfName.Create("Dest"), out PdfBasicObject destObj))
            {
                var destArray = (PdfArray)destObj;
                destArray[0] = pageCloneMap[((PdfIndirectObject)destArray[0]).Id];
            }
}
// Save resulting PDF file.
pdfDocument.Save("Result.pdf");
pdfDocument.Close();
This way you can easily customize the TOC element by using the TOC switches and styles. For more info, see the Table Of Content example from GemBox.Document.

How to insert text into a content control with the Open XML SDK

I'm trying to develop a solution that takes input from an ASP.NET web page and embeds the input values into the corresponding content controls in an MS Word document. The document also contains static data, plus some dynamic data that has to be embedded into the header and footer fields.
The solution should be web based. Can I use OpenXML for this purpose, or is there another approach you can suggest?
Thank you very much in advance for your input.

I have a little code sample from my project, to insert a few words in a content control you've created in a Word document:
public static WordprocessingDocument InsertText(this WordprocessingDocument doc, string contentControlTag, string text)
{
    //SdtProperties can be missing on a content control, hence the null-conditional access
    SdtElement element = doc.MainDocumentPart.Document.Body.Descendants<SdtElement>()
        .FirstOrDefault(sdt => sdt.SdtProperties?.GetFirstChild<Tag>()?.Val == contentControlTag);
    if (element == null)
        throw new ArgumentException($"ContentControlTag \"{contentControlTag}\" doesn't exist.");
    //Put the new text in the first Text element and remove any remaining ones
    element.Descendants<Text>().First().Text = text;
    element.Descendants<Text>().Skip(1).ToList().ForEach(t => t.Remove());
    return doc;
}
It simply looks for the first content control in the document with a specific Tag (you can set the tag by enabling designer mode in Word and right-clicking the content control) and replaces the current text with the text passed into the method. Afterwards the document still contains the content controls, which may not be desired. So when I'm done editing the document, I run the following method to get rid of them:
internal static WordprocessingDocument RemoveSdtBlocks(this WordprocessingDocument doc, IEnumerable<string> contentBlocks)
{
    List<SdtElement> SdtBlocks = doc.MainDocumentPart.Document.Descendants<SdtElement>().ToList();
    if (contentBlocks == null)
        return doc;
    foreach (var s in contentBlocks)
    {
        SdtElement currentElement = SdtBlocks.FirstOrDefault(sdt => sdt.SdtProperties?.GetFirstChild<Tag>()?.Val == s);
        if (currentElement == null)
            continue;
        IEnumerable<OpenXmlElement> elements = null;
        if (currentElement is SdtBlock)
            elements = (currentElement as SdtBlock).SdtContentBlock.Elements();
        else if (currentElement is SdtCell)
            elements = (currentElement as SdtCell).SdtContentCell.Elements();
        else if (currentElement is SdtRun)
            elements = (currentElement as SdtRun).SdtContentRun.Elements();
        //Skip content control types that are not handled above (for example SdtRow)
        if (elements == null)
            continue;
        //Move a copy of the content in front of the control, then drop the control
        foreach (var el in elements)
            currentElement.InsertBeforeSelf(el.CloneNode(true));
        currentElement.Remove();
    }
    return doc;
}
There is plenty of information available online on how to open a WordprocessingDocument from a template and edit it.
Edit:
Here is a little sample code that opens and saves a document while working with it in a MemoryStream. In real code you should of course handle this in a separate repository class that manages the document:
byte[] byteArray = File.ReadAllBytes(@"C:\...\Template.dotx");
using (var stream = new MemoryStream())
{
    stream.Write(byteArray, 0, byteArray.Length);
    using (WordprocessingDocument doc = WordprocessingDocument.Open(stream, true))
    {
        //Needed because I'm working with template dotx file,
        //remove this if the template is a normal docx.
        doc.ChangeDocumentType(DocumentFormat.OpenXml.WordprocessingDocumentType.Document);
        doc.InsertText("contentControlName","testtesttesttest");
    }
    using (FileStream fs = new FileStream(@"C:\...\newFile.docx", FileMode.Create))
    {
        stream.WriteTo(fs);
    }
}
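The question also mentions header and footer fields, which InsertText above does not reach because it only searches the Body. A minimal sketch of a wider search follows, assuming the same Open XML SDK; ContentControlSearch and FindByTag are hypothetical names, not part of the SDK or of the answer's project.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class ContentControlSearch
{
    // Hypothetical helper: returns every content control with the given tag
    // from the main document, all header parts and all footer parts.
    public static IEnumerable<SdtElement> FindByTag(WordprocessingDocument doc, string tag)
    {
        MainDocumentPart main = doc.MainDocumentPart;

        // Collect the root element of each part that can hold content controls.
        var roots = new List<OpenXmlElement> { main.Document };
        roots.AddRange(main.HeaderParts.Select(p => p.Header));
        roots.AddRange(main.FooterParts.Select(p => p.Footer));

        return roots
            .Where(root => root != null)
            .SelectMany(root => root.Descendants<SdtElement>())
            .Where(sdt => sdt.SdtProperties?.GetFirstChild<Tag>()?.Val?.Value == tag);
    }
}
```

Each returned element can then be filled the same way InsertText does it, by setting the first Text descendant and removing the rest.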

Convert a Word (DOCX) file to a PDF in C# on cloud environment

I have generated a Word file using Open XML and need to send it as an email attachment in PDF format, but I cannot save a physical PDF or Word file to disk because my application runs in a cloud environment (CRM Online).
The only way I have found is "Aspose Word to .Net".
http://www.aspose.com/docs/display/wordsnet/How+to++Convert+a+Document+to+a+Byte+Array But it is too expensive.
I then tried converting the Word file to HTML and the HTML to PDF, but my Word file contains a picture and I cannot get that to work.

The most accurate conversion from DOCX to PDF is going to be through Word. Your best option for that is setting up a server with OWAS (Office Web Apps Server) and doing your conversion through that.
You'll need to set up a WOPI endpoint on your application server and call:
/wv/WordViewer/request.pdf?WOPISrc={WopiUrl}&type=downloadpdf
OR
/wv/WordViewer/request.pdf?WOPISrc={WopiUrl}&type=printpdf
Alternatively you could try and do it using OneDrive and Word Online, but you'll need to work out the parameters Word Online uses as well as whether that's permitted within the Ts & Cs.

You can try Gnostice XtremeDocumentStudio .NET.
Converting From DOCX To PDF Using XtremeDocumentStudio .NET
http://www.gnostice.com/goto.asp?id=24900&t=convert_docx_to_pdf_using_xdoc.net
In the published article, conversion has been demonstrated to save to a physical file. You can use documentConverter.ConvertToStream method to convert a document to a Stream as shown below in the code snippet.
DocumentConverter documentConverter = new DocumentConverter();
// input can be a FilePath, Stream, list of FilePaths or list of Streams
Object input = "InputDocument.docx";
string outputFileFormat = "pdf";
ConversionMode conversionMode = ConversionMode.ConvertToSeperateFiles;
List<Stream> outputStreams = documentConverter.ConvertToStream(input, outputFileFormat, conversionMode);
Disclaimer: I work for Gnostice.

If you want to convert a byte array, you can use Metamorphosis:
string docxPath = @"example.docx";
string pdfPath = Path.ChangeExtension(docxPath, ".pdf");
byte[] docx = File.ReadAllBytes(docxPath);
// Note: 'p' is not declared in this snippet; it is presumably an instance of
// the converter class (SautinSoft.PdfMetamorphosis) created beforehand.
// Convert DOCX to PDF in memory
byte[] pdf = p.DocxToPdfConvertByte(docx);
if (pdf != null)
{
    // Save the PDF document to a file for viewing purposes.
    File.WriteAllBytes(pdfPath, pdf);
    System.Diagnostics.Process.Start(pdfPath);
}
else
{
    System.Console.WriteLine("Conversion failed!");
    Console.ReadLine();
}

I recently used the SautinSoft 'Document .Net' library to convert DOCX to PDF in my application (React frontend, .NET Core microservices backend). It takes only 15 seconds to generate a 23-page PDF. Those 15 seconds include getting the data from the database, merging the data with the DOCX template, and converting the result to PDF. The code has been deployed to an Azure Linux box and works fine.
https://sautinsoft.com/products/document/
Sample code
public string GeneratePDF(PDFDocumentModel document)
{
byte[] output = null;
using (var outputStream = new MemoryStream())
{
// Create single pdf.
DocumentCore singlePDF = new DocumentCore();
var documentCores = new List<DocumentCore>();
foreach (var section in document.Sections)
{
documentCores.Add(GenerateDocument(section));
}
foreach (var dc in documentCores)
{
// Create import session.
ImportSession session = new ImportSession(dc, singlePDF, StyleImportingMode.KeepSourceFormatting);
// Loop through all sections in the source document.
foreach (Section sourceSection in dc.Sections)
{
// Because we are copying a section from one document to another,
// it is required to import the Section into the destination document.
// This adjusts any document-specific references to styles, bookmarks, etc.
// Importing a element creates a copy of the original element, but the copy
// is ready to be inserted into the destination document.
Section importedSection = singlePDF.Import<Section>(sourceSection, true, session);
// First section start from new page.
if (dc.Sections.IndexOf(sourceSection) == 0)
importedSection.PageSetup.SectionStart = SectionStart.NewPage;
// Now the new section can be appended to the destination document.
singlePDF.Sections.Add(importedSection);
//Paging
HeaderFooter footer = new HeaderFooter(singlePDF, HeaderFooterType.FooterDefault);
// Create a new paragraph to insert a page numbering.
// So that, our page numbering looks as: Page N of M.
Paragraph par = new Paragraph(singlePDF);
par.ParagraphFormat.Alignment = HorizontalAlignment.Center;
CharacterFormat cf = new CharacterFormat() { FontName = "Consolas", Size = 11.0 };
par.Content.Start.Insert("Page ", cf.Clone());
// Page numbering is a Field.
Field fPage = new Field(singlePDF, FieldType.Page);
fPage.CharacterFormat = cf.Clone();
par.Content.End.Insert(fPage.Content);
par.Content.End.Insert(" of ", cf.Clone());
Field fPages = new Field(singlePDF, FieldType.NumPages);
fPages.CharacterFormat = cf.Clone();
par.Content.End.Insert(fPages.Content);
footer.Blocks.Add(par);
importedSection.HeadersFooters.Add(footer);
}
}
var pdfOptions = new PdfSaveOptions();
pdfOptions.Compression = false;
pdfOptions.EmbedAllFonts = false;
pdfOptions.EmbeddedImagesFormat = PdfSaveOptions.EmbImagesFormat.Png;
pdfOptions.EmbeddedJpegQuality = 100;
//dont allow editing after population, also ensures content can be printed.
pdfOptions.PreserveFormFields = false;
pdfOptions.PreserveContentControls = false;
if (!string.IsNullOrEmpty(document.PdfProperties.Title))
{
singlePDF.Document.Properties.BuiltIn[BuiltInDocumentProperty.Title] = document.PdfProperties.Title;
}
if (!string.IsNullOrEmpty(document.PdfProperties.Author))
{
singlePDF.Document.Properties.BuiltIn[BuiltInDocumentProperty.Author] = document.PdfProperties.Author;
}
if (!string.IsNullOrEmpty(document.PdfProperties.Subject))
{
singlePDF.Document.Properties.BuiltIn[BuiltInDocumentProperty.Subject] = document.PdfProperties.Subject;
}
singlePDF.Save(outputStream, pdfOptions);
output = outputStream.ToArray();
}
return Convert.ToBase64String(output);
}

Using OpenXML SDK to replace text on a docx file with a line break (newline)

I am trying to use C# to replace a specific string of text on an entire DOCX file with a line break (newline).
The string of text that I am searching for could be in a paragraph or in a table in the file.
I am currently using the code below to replace text.
using (WordprocessingDocument doc = WordprocessingDocument.Open("yourdoc.docx", true))
{
    var body = doc.MainDocumentPart.Document.Body;
    foreach (var text in body.Descendants<Text>())
    {
        if (text.Text.Contains("##Text1##"))
        {
            text.Text = text.Text.Replace("##Text1##", Environment.NewLine);
        }
    }
}
ISSUE: When I run this code, the output DOCX file has the text replaced with a space (i.e. " ") instead of a line break.
How can I change this code to make this work?

Try a Break element instead. You just have to append a Break to the run.
The text of paragraphs, smart tags, and hyperlinks always sits inside a Run, so the same approach applies to them.
The same holds for text inside a table: it is always inside a Run.
If the replace only ever produces an empty string, I would try this:
using (WordprocessingDocument doc =
    WordprocessingDocument.Open(@"yourpath\testdocument.docx", true))
{
    var body = doc.MainDocumentPart.Document.Body;
    var paras = body.Elements<Paragraph>();
    foreach (var para in paras)
    {
        foreach (var run in para.Elements<Run>())
        {
            foreach (var text in run.Elements<Text>())
            {
                if (text.Text.Contains("text-to-replace"))
                {
                    text.Text = text.Text.Replace("text-to-replace", "");
                    run.AppendChild(new Break());
                }
            }
        }
    }
}
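Some background on why the question's code produces a space: in WordprocessingML a newline character inside a text element is treated as ordinary whitespace, and a line break is a separate break element in the run. The snippet above appends the Break at the end of the run, so any text that followed the marker stays on the same line as the text before it. A minimal sketch that splits the text at the marker instead is shown below; LineBreaks and ReplaceWithBreak are hypothetical names, and the sketch assumes the marker is contained in a single Text element (Word sometimes splits a string across several runs, which this does not handle).

```csharp
using System;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class LineBreaks
{
    // Hypothetical helper: replaces every occurrence of 'marker' inside one
    // Text element with a Break, keeping the text on both sides of it.
    public static void ReplaceWithBreak(Text text, string marker)
    {
        string[] parts = text.Text.Split(new[] { marker }, StringSplitOptions.None);
        if (parts.Length == 1)
            return; // marker not present in this element

        // The original element keeps the text before the first marker.
        text.Text = parts[0];
        text.Space = SpaceProcessingModeValues.Preserve;

        // After it, insert Break + Text pairs for the remaining pieces.
        OpenXmlElement last = text;
        for (int i = 1; i < parts.Length; i++)
        {
            last = last.InsertAfterSelf(new Break());
            last = last.InsertAfterSelf(
                new Text(parts[i]) { Space = SpaceProcessingModeValues.Preserve });
        }
    }
}
```

It could be called as foreach (var t in body.Descendants<Text>().ToList()) LineBreaks.ReplaceWithBreak(t, "##Text1##"); the ToList() takes a snapshot first, because the helper adds elements to the tree while the loop runs. Since Descendants covers the whole body, text inside tables is included.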

Instead of adding a line break, try making two paragraphs, one before the "text to be replaced" and one after. Doing that will automatically add a line break between two paragraphs.

text.Parent.Append(new DocumentFormat.OpenXml.Wordprocessing.Break());

// Open a WordprocessingDocument for editing using the filepath.
using (WordprocessingDocument wordprocessingDocument = WordprocessingDocument.Open(filepath, true))
{
    // Assign a reference to the existing document body.
    Body body = wordprocessingDocument.MainDocumentPart.Document.Body;
    // Iterate over a snapshot of the text elements: the loop appends new Text
    // elements, which a live Descendants<Text>() enumeration would visit again.
    foreach (var text in body.Descendants<Text>().ToList())
    {
        text.Parent.Append(new Text("Text:"));
        text.Parent.Append(new Break());
        text.Parent.Append(new Text("Text"));
    }
}
This will return:
Text:Text:

Hiding Word Content Control With Custom XML Parts

I have a Word document with a bunch of content controls in it. These are mapped to a custom XML part. To build the document on the fly, I simply overwrite the custom XML part.
The problem I'm having is that if I don't define a particular item, its space is still visible in the document, pushing down the content below it and looking inconsistent with the rest of the document.
Here's a basic example of my code:
var path = HttpContext.Current.Server.MapPath("~/Classes/Word/LawyerBio.docx");
using (WordprocessingDocument myDoc = WordprocessingDocument.Open(path, true))
{
    //create new XML string
    //these values will populate the template word doc
    string newXML = "<root>";
    if (!String.IsNullOrEmpty(_lawyer["Recognition"]))
    {
        newXML += "<recognition>";
        newXML += _text.Field("Recognition Title");
        newXML += "</recognition>";
    }
    if (!String.IsNullOrEmpty(_lawyer["Board Memberships"]))
    {
        newXML += "<boards>";
        newXML += _text.Field("Board Memberships Title");
        newXML += "</boards>";
    }
    newXML += "</root>";
    MainDocumentPart mainPart = myDoc.MainDocumentPart;
    //delete old xml part
    mainPart.DeleteParts<CustomXmlPart>(mainPart.CustomXmlParts);
    //add new xml part
    CustomXmlPart customXml = mainPart.AddCustomXmlPart(CustomXmlPartType.CustomXml);
    using (StreamWriter ts = new StreamWriter(customXml.GetStream()))
    {
        ts.Write(newXML);
    }
    myDoc.Close();
}
Is there any way to make these content controls actually collapse/hide?

I think you will have to do either some preprocessing before the docx is opened in Word, or some postprocessing (e.g. via a macro).
As an example of the preprocessing approach, OpenDoPE defines a "condition" which you could use to exclude the undefined stuff.
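Without OpenDoPE, a rough sketch of the preprocessing idea is to delete the unused content controls from the document before it is delivered. This assumes each content control also carries a Tag that identifies it, which the question does not state; EmptyControlCleanup and RemoveControls are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class EmptyControlCleanup
{
    // Hypothetical helper: removes every content control in the body whose
    // tag is listed in 'tagsToRemove', together with the control's content.
    public static void RemoveControls(WordprocessingDocument doc, IEnumerable<string> tagsToRemove)
    {
        var wanted = new HashSet<string>(tagsToRemove);

        // Take a snapshot first, because removing elements changes the tree.
        var controls = doc.MainDocumentPart.Document.Body
            .Descendants<SdtElement>()
            .Where(sdt =>
            {
                string tag = sdt.SdtProperties?.GetFirstChild<Tag>()?.Val?.Value;
                return tag != null && wanted.Contains(tag);
            })
            .ToList();

        foreach (var control in controls)
            control.Remove();
    }
}
```

In the question's code this would be called for the items that were skipped while building newXML. Take care with controls inside tables: removing a cell-level control, or the only block in a table cell, can leave the table structure invalid.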
