OpenXML Find Replace Text - c#

Environment
Visual Studio 2017 C# (Word .docx file)
Problem
The find/replace only replaces "{Today}" - it fails to replace the "{ConsultantName}" field. I've checked the document and tried using different approaches (see commented-out code) but no joy.
The Word document has just a few paragraphs of text - there are no tables or text boxes in the document. What am I doing wrong?
Update
When I inspect doc_text string, I can see "{Today}" but "{ConsultantName}" is split into multiple runs. The opening and closing braces are not together with the word - there are XML tags between them:
{</w:t></w:r><w:proofErr w:type="spellStart"/><w:r w:rsidR="00544806"><w:t>ConsultantName</w:t></w:r><w:proofErr w:type="spellEnd"/><w:r w:rsidR="00544806"><w:t>}
Code
string doc_text = string.Empty;
List<string> s_find = new List<string>();
List<string> s_replace = new List<string>();
// Regex regexText = null;
s_find.Add("{Today}");
s_replace.Add("24 Sep 2018");
s_find.Add("{ConsultantName}");
s_replace.Add("John Doe");
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filePath, true))
{
// read document
using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
doc_text = sr.ReadToEnd();
}
// find replace
for (byte b = 0; b < s_find.Count; b++)
{
doc_text = new Regex(s_find[b], RegexOptions.IgnoreCase).Replace(doc_text, s_replace[b]);
// regexText = new Regex(s_find[b]);
// doc_text = doc_text.Replace(s_find[b], s_replace[b]);
// doc_text = regexText.Replace(doc_text, s_replace[b]);
}
// update document
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
sw.Write(doc_text);
}
}

Note: I want to avoid using Word Interop. I don't want to create an instance of Word and use Word's object model to do the Find/Replace.
There is no way to avoid Word splitting text into multiple runs. It happens even if you type text directly into the document, make no changes and apply no formatting.
However, I found a way around the problem by adding custom fields to the document as follows:
Open Word document. Go to File->Info
Click the Properties heading and select Advanced Properties.
Select the Custom tab.
Add the field names you want to use and Save.
In the document click Insert on the main menu.
Click Explore Quick Parts icon and select Field...
Drop-down Categories and select Document Information.
Under Field names: select DocProperty.
Select your custom field name in the "Property" list and click ok.
This inserts the field into your document and even if you apply formatting, the field name will be whole and not be broken into multiple runs.
Update
To save users the laborious task of manually adding a lot of custom properties to a document, I wrote a method to do this using OpenXML.
Add the following usings:
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.CustomProperties;
using DocumentFormat.OpenXml.VariantTypes;
Code to add custom (text) properties to the document:
static public bool RunWordDocumentAddProperties(string filePath, List<string> strName, List<string> strVal)
{
bool is_ok = true;
try
{
if (File.Exists(filePath) == false)
return false;
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filePath, true))
{
var customProps = wordDoc.CustomFilePropertiesPart;
if (customProps == null)
{
// no custom properties? Add the part, and the collection of properties
customProps = wordDoc.AddCustomFilePropertiesPart();
customProps.Properties = new DocumentFormat.OpenXml.CustomProperties.Properties();
}
for (int b = 0; b < strName.Count; b++)
{
var props = customProps.Properties;
if (props != null)
{
var newProp = new CustomDocumentProperty();
newProp.VTLPWSTR = new VTLPWSTR(strVal[b].ToString());
newProp.FormatId = "{D5CDD505-2E9C-101B-9397-08002B2CF9AE}";
newProp.Name = strName[b];
// append the new property, and fix up all the property ID values
// property ID values must start at 2
props.AppendChild(newProp);
int pid = 2;
foreach (CustomDocumentProperty item in props)
{
item.PropertyId = pid++;
}
props.Save();
}
}
}
}
catch (Exception ex)
{
is_ok = false;
ProcessError(ex);
}
return is_ok;
}

You only need to do this:
*.csproj
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>netcoreapp3.1</TargetFramework>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="DocumentFormat.OpenXml" Version="2.12.3" />
</ItemGroup>
</Project>
Add these using directives:
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
and put this code in your program:
using (WordprocessingDocument wordprocessingDocument =
WordprocessingDocument.Open(filepath, true))
{
var body = wordprocessingDocument.MainDocumentPart.Document.Body;
var paras = body.Elements<Paragraph>();
foreach (var para in paras)
{
foreach (var run in para.Elements<Run>())
{
foreach (var text in run.Elements<Text>())
{
if (text.Text.Contains("#_KEY_1_#"))
{
text.Text = text.Text.Replace("#_KEY_1_#", "replaced-text");
}
}
}
}
}
done
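Note that the per-Text loop above only matches a key that sits entirely inside one run, which is exactly what fails for "{ConsultantName}" in the question. A minimal sketch of one way around that, assuming it is acceptable for the replaced text to take the formatting of the first run in the paragraph: join the text of all Text elements in a paragraph, replace in the joined string, write the result into the first Text element and empty the others. The class and method names (RunTextReplacer, ReplaceAcrossRuns) are made up for illustration and are not part of the Open XML SDK.

```csharp
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class RunTextReplacer
{
    // Replaces 'find' in a paragraph even when Word has split it across several runs.
    // Returns true when a replacement was made.
    // Assumption: losing per-run formatting inside the paragraph is acceptable,
    // because all text ends up in the first Text element.
    public static bool ReplaceAcrossRuns(Paragraph paragraph, string find, string replace)
    {
        var texts = paragraph.Descendants<Text>().ToList();
        if (texts.Count == 0)
        {
            return false;
        }

        // Text as the reader sees it, ignoring run boundaries and proofErr tags.
        string joined = string.Concat(texts.Select(t => t.Text));
        if (!joined.Contains(find))
        {
            return false;
        }

        // Keep leading/trailing spaces, then blank out the remaining Text elements.
        texts[0].Text = joined.Replace(find, replace);
        texts[0].Space = SpaceProcessingModeValues.Preserve;
        foreach (var text in texts.Skip(1))
        {
            text.Text = string.Empty;
        }
        return true;
    }
}
```

Inside the foreach (var para in paras) loop this would be called as RunTextReplacer.ReplaceAcrossRuns(para, "{ConsultantName}", "John Doe"). Paragraphs that do not contain the key are left untouched, so their run formatting is preserved.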

Related

How to delete paragraph and run with change tracking mode?

I have an existing document. I am able to open the document in change tracking mode using TrackRevisions. Now, how can I mark a few selected paragraphs and runs as deleted? I want to save the document in such a state that when a user opens it in Word, the deleted content is displayed as strikethrough, and if the user accepts all changes, all the deleted content is removed.
Is it feasible to do? Any sample code would be highly appreciated. Thank you in advance!
I tried the following. It generates markup with the w:del element as a child of the paragraph. However, I expected all the children of the paragraph to be under the w:del element. I tried adding the paragraph's run elements to deletedParagraph (commented-out code), but it throws the error "Non-composite elements do not have child elements.".
using (var document = WordprocessingDocument.Open(@"C:\Data\Test.docx", true))
{
// Change tracking code
DocumentSettingsPart documentSettingsPart = document.MainDocumentPart.DocumentSettingsPart ?? document.MainDocumentPart.AddNewPart<DocumentSettingsPart>();
Settings settings = documentSettingsPart.Settings ?? new Settings();
TrackRevisions trackRevisions = new TrackRevisions();
trackRevisions.Val = new DocumentFormat.OpenXml.OnOffValue(true);
settings.AppendChild(trackRevisions);
foreach(var paragraph in document.MainDocumentPart.Document.Body.Descendants<Paragraph>())
{
Deleted deletedParagraph = new Deleted();
deletedParagraph.Author = "Author Name";
deletedParagraph.Date = DateTime.Now;
paragraph.AppendChild(deletedParagraph);
foreach (var run in paragraph.Elements<Run>())
{
foreach(var text in run.Elements<Text>())
{
DeletedText deletedText = new DeletedText(text.InnerText);
run.ReplaceChild(deletedText, text);
// This throws error
//deletedParagraph.AppendChild(run.Clone() as Run);
//run.Remove();
}
}
}
document.Save();
}
The above code generates xml like this:
<w:body>
<w:p w:rsidRPr="0081286C" w:rsidR="003F5596" w:rsidP="0081286C" w:rsidRDefault="001B56FE">
<w:bookmarkStart w:name="_GoBack" w:id="0"/>
<w:bookmarkEnd w:id="0"/>
<w:r>
<w:delText>This is a sentence</w:delText>
</w:r>
<w:del w:author="Author Name" w:date="2022-07-26T07:38:26.7978264-04:00"/>
</w:p>
<w:sectPr w:rsidRPr="0081286C" w:rsidR="003F5596">
<w:pgSz w:w="12240" w:h="15840"/>
<w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/>
<w:cols w:space="708"/>
<w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
To mark a paragraph or some text as deleted in MS Word track changes mode, the OOXML of the document needs the following.
Replace the w:t/Text elements with w:delText/DeletedText elements.
foreach (var text in run.Descendants<Text>().ToList())
{
DeletedText deletedText = new DeletedText(text.InnerText)
{
Space = SpaceProcessingModeValues.Preserve
};
run.ReplaceChild(deletedText, text);
}
Surround the w:r/Run elements with w:del/Deleted elements.
var deleted = new Deleted
{
Author = "RevisionAuthor",
Date = DateTime.Now
};
private static void ReplaceRunWithDelete(this Run run, Deleted deleted)
{
if(run.Descendants<FieldCode>().Any() && !run.Descendants<Deleted>().Any())
{
return;
}
var parent = run.Parent;
deleted.Id = Convert.ToString(++deletedCount);
XElement xDelete = XElement.Parse(deleted.OuterXml);
xDelete.Add(XElement.Parse(run.OuterXml));
parent.ReplaceChild(ToOpenXmlElement(xDelete), run);
}
public static OpenXmlElement ToOpenXmlElement(this XElement xElement)
{
OpenXmlElement openXmlElement = null;
using (StreamWriter sw = new StreamWriter(new MemoryStream()))
{
sw.Write(xElement.ToString());
sw.Flush();
sw.BaseStream.Seek(0, SeekOrigin.Begin);
OpenXmlReader re = OpenXmlReader.Create(sw.BaseStream);
re.Read();
openXmlElement = re.LoadCurrentElement();
re.Close();
}
return openXmlElement;
}
If you want to delete the whole paragraph (all the runs surrounded by deleted), then add a w:del/Deleted element to the run properties inside the paragraph properties.
private static void AddDeleteToRunProps(this Paragraph paragraph, Deleted deleted)
{
// All the runs deleted so paragraph should be deleted
if (!paragraph.Descendants<Run>().Any(run => run.Descendants<Text>().Any()))
{
paragraph.ParagraphProperties ??= new ParagraphProperties();
var runProps = paragraph.ParagraphProperties.Elements<RunProperties>().FirstOrDefault();
if (runProps == null)
{
// Only append a newly created element; re-appending one that already has a parent throws
runProps = new RunProperties();
paragraph.ParagraphProperties.AppendChild(runProps);
}
deleted.Id = Convert.ToString(++deletedCount);
runProps.AppendChild(deleted.Clone() as OpenXmlElement);
}
}
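The XElement round trip above can probably be avoided. As far as I know, the SDK has two classes that both serialize as w:del: Deleted is a leaf element (the deletion mark that goes into run properties, which would explain the "Non-composite elements do not have child elements" error in the question), while DeletedRun is the composite element that wraps runs inside a paragraph. A minimal sketch under that assumption follows; TrackedDelete, MarkRunsDeleted and the nextId counter are hypothetical names introduced here, and the sketch does not special-case field codes the way ReplaceRunWithDelete does.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class TrackedDelete
{
    // Marks every direct run of the paragraph as a tracked deletion.
    // nextId is a caller-owned counter used for the w:id attribute (an assumption
    // of this sketch: the caller keeps the ids unique within the document).
    public static void MarkRunsDeleted(Paragraph paragraph, string author, ref int nextId)
    {
        // ToList() so the tree can be modified while iterating.
        foreach (var run in paragraph.Elements<Run>().ToList())
        {
            // w:t becomes w:delText inside a deleted run.
            foreach (var text in run.Elements<Text>().ToList())
            {
                var deletedText = new DeletedText(text.Text)
                {
                    Space = SpaceProcessingModeValues.Preserve
                };
                run.ReplaceChild(deletedText, text);
            }

            // Wrap the run: <w:del w:id w:author w:date><w:r>...</w:r></w:del>
            var deletedRun = new DeletedRun
            {
                Id = (nextId++).ToString(),
                Author = author,
                Date = DateTime.Now
            };
            run.InsertBeforeSelf(deletedRun);
            run.Remove();
            deletedRun.AppendChild(run);
        }
    }
}
```

Called for each paragraph in the question's loop, this should give w:del elements that contain the runs instead of an empty w:del at the end of the paragraph.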

Aspose.Words ImportNode ignores font formatting when appending child

I am currently using Aspose.Words to open a document, pull content between a bookmark start and a bookmark end and then place that content into another document. The issue that I'm having is that when I use the ImportNode method, the content is imported into my document, but all of the fonts change from Calibri to Times New Roman and the font size changes from whatever it was in the original document to 12pt.
The way I'm obtaining the content from the bookmark is by using the Aspose ExtractContent method.
Because I'm having the issue with the ImportNode stripping my font formatting I tried making some adjustments and saving each node to an HTML string using ToString(HtmlSaveOptions). This works mostly but the problem with this is it is stripping out my returns on the word document so none of my text has the appropriate spacing. My returns end up coming in as HTML in the following format
"<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\"> </span></p>"
When using
DocumentBuilder.InsertHtml("<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\"> </span></p>");
it does not correctly add the return on the word document.
Here is the code I'm using, please forgive the comments etc... this has been my attempts at correcting this.
public async Task<string> GenerateHtmlString(Document srcDoc, ArrayList nodes)
{
// Create a blank document.
Document dstDoc = new Document();
ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Open"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
// Remove the first paragraph from the empty document.
dstDoc.FirstSection.Body.RemoveAllChildren();
// Create a new Builder for the temporary document that gets generated with the header or footer data.
// This allows us to control font and styles separately from the main document being built.
var newBuilder = new DocumentBuilder(dstDoc);
Aspose.Words.Saving.HtmlSaveOptions htmlSaveOptions = new Aspose.Words.Saving.HtmlSaveOptions();
htmlSaveOptions.ExportImagesAsBase64 = true;
htmlSaveOptions.SaveFormat = SaveFormat.Html;
htmlSaveOptions.ExportFontsAsBase64 = true;
htmlSaveOptions.ExportFontResources = true;
htmlSaveOptions.ExportTextBoxAsSvg = true;
htmlSaveOptions.ExportRoundtripInformation = true;
htmlSaveOptions.Encoding = Encoding.UTF8;
// Obtain all the links from the source document
// This is used later to add hyperlinks to the html
// because by default extracting nodes using Aspose
// does not pull in the links in a usable way.
var srcDocLinks = srcDoc.Range.Fields.GroupBy(x => x.DisplayResult).Select(x => x.First()).Where(x => x.Type == Aspose.Words.Fields.FieldType.FieldHyperlink).Distinct().ToList();
var childNodes = nodes.Cast<Node>().Select(x => x).ToList();
var oldBuilder = new DocumentBuilder(srcDoc);
oldBuilder.MoveToBookmark("Header");
var allchildren = oldBuilder.CurrentParagraph.Runs;
var allChildNodes = childNodes[0].Document.GetChildNodes(NodeType.Any, true);
var headerText = allChildNodes[0].Range.Bookmarks["Header"].BookmarkStart.GetText();
foreach (Node node in nodes)
{
var html = node.ToString(htmlSaveOptions);
try
{
// &#xa0; is used by aspose because it works in XML
// If we see this character and the text of the node is \r we need to insert a break
if (html.Contains("&#xa0;") && node.Range.Text == "\r")
{
newBuilder.InsertHtml(html, false);
// Change the node into an HTML string
/*var htmlString = node.ToString(SaveFormat.Html);
var tempHtmlLinkDoc = new HtmlDocument();
tempHtmlLinkDoc.LoadHtml(htmlString);
// Get all the child nodes of the html document
var allChildNodes = tempHtmlLinkDoc.DocumentNode.SelectNodes("//*");
// Loop over all child nodes so we can make sure we apply the correct font family and size to the break.
foreach (var childNode in allChildNodes)
{
// Get the style attribute from the child node
var childNodeStyles = childNode.GetAttributeValue("style", "").Split(';');
foreach (var childNodeStyle in childNodeStyles)
{
// Apply the font name and size to the new builder on the document.
if (childNodeStyle.ToLower().Contains("font-family"))
{
newBuilder.Font.Name = childNodeStyle.Split(':')[1].Trim();
}
if (childNodeStyle.ToLower().Contains("font-size"))
{
newBuilder.Font.Size = Convert.ToDouble(childNodeStyle.Split(':')[1]
.Replace("pt", "")
.Replace("px", "")
.Replace("em", "")
.Replace("rem", "")
.Replace("%", "")
.Trim());
}
}
}
// Insert the break with the corresponding font size and name.
newBuilder.InsertBreak(BreakType.ParagraphBreak);*/
}
else
{
// Loop through the source document links so the link can be applied to the HTML.
foreach (var srcDocLink in srcDocLinks)
{
if (html.Contains(srcDocLink.DisplayResult))
{
// Now that we know the html string has one of the links in it we need to get the address from the node.
var linkAddress = srcDocLink.Start.NextSibling.GetText().Replace(" HYPERLINK \"", "").Replace("\"", "");
//Convert the node into an HTML String so we can get the correct font color, name, size, and any text decoration.
var htmlString = srcDocLink.Start.NextSibling.ToString(SaveFormat.Html);
var tempHtmlLinkDoc = new HtmlDocument();
tempHtmlLinkDoc.LoadHtml(htmlString);
var linkStyles = tempHtmlLinkDoc.DocumentNode.ChildNodes[0].GetAttributeValue("style", "").Split(';');
var linkStyleHtml = "";
foreach (var linkStyle in linkStyles)
{
if (linkStyle.ToLower().Contains("color"))
{
linkStyleHtml += $"color:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("font-family"))
{
linkStyleHtml += $"font-family:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("font-size"))
{
linkStyleHtml += $"font-size:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("text-decoration"))
{
linkStyleHtml += $"text-decoration:{linkStyle.Split(':')[1].Trim()};";
}
}
if (linkAddress.ToLower().Contains("mailto:"))
{
// Since the link has mailto included don't add the target attribute to the link.
html = new Regex($@"\b{srcDocLink.DisplayResult}\b").Replace(html, $"{srcDocLink.DisplayResult}");
//html = html.Replace(srcDocLink.DisplayResult, $"{srcDocLink.DisplayResult}");
}
else
{
// Since the links is not an email include the target attribute.
html = new Regex($@"\b{srcDocLink.DisplayResult}\b").Replace(html, $"{srcDocLink.DisplayResult}");
//html = html.Replace(srcDocLink.DisplayResult, $"{srcDocLink.DisplayResult}");
}
}
}
// Insert the HTML String into the temporary document.
newBuilder.InsertHtml(html, false);
}
}
catch (Exception ex)
{
throw;
}
}
// This is just for debugging/troubleshooting purposes and to make sure things look correct
string tempDocxPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.docx");
dstDoc.Save(tempDocxPath);
// We generate this HTML file then load it back up and pass the DocumentNode.OuterHtml back to the requesting method.
ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Save"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
string tempHtmlPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.html");
dstDoc.Save(tempHtmlPath, htmlSaveOptions);
var tempHtmlDoc = new HtmlDocument();
tempHtmlDoc.Load(tempHtmlPath);
var htmlText = tempHtmlDoc.DocumentNode.OuterHtml;
// Clean up our mess...
if (File.Exists(tempDocxPath))
{
File.Delete(tempDocxPath);
}
if (File.Exists(tempHtmlPath))
{
File.Delete(tempHtmlPath);
}
// Return the generated HTML string.
return htmlText;
}
Saving each node to HTML and then inserting it into the destination document is not a good idea, because not all nodes can be properly saved to HTML and some formatting can be lost in the Aspose.Words DOM -> HTML -> Aspose.Words DOM roundtrip.
Regarding the original issue, the problem might occur because you are using ImportFormatMode.UseDestinationStyles. In this case the styles and defaults of the destination document are used, so the font might change. If you need to keep the source document formatting, you should use ImportFormatMode.KeepSourceFormatting.
If the problem occurs even with ImportFormatMode.KeepSourceFormatting, this must be a bug and you should report it to Aspose.Words staff in the support forum.
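For illustration, a minimal sketch of importing the extracted nodes with KeepSourceFormatting instead of going through HTML. It assumes the nodes are block-level nodes (paragraphs, tables) that belong to srcDoc, as the ExtractContent helper usually returns them; BookmarkContentCopier and AppendKeepingFormatting are hypothetical names, not part of Aspose.Words.

```csharp
using System.Collections;
using Aspose.Words;

public static class BookmarkContentCopier
{
    // Appends nodes taken from srcDoc to the body of dstDoc,
    // keeping the fonts and sizes they had in the source document.
    public static void AppendKeepingFormatting(Document srcDoc, Document dstDoc, ArrayList nodes)
    {
        // One importer for all nodes, so styles are translated consistently.
        var importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);

        foreach (Node node in nodes)
        {
            // true = deep import, including runs and their font attributes.
            Node imported = importer.ImportNode(node, true);
            dstDoc.FirstSection.Body.AppendChild(imported);
        }
    }
}
```

With this, the destination document can be saved directly with dstDoc.Save(...) and the InsertHtml step is no longer needed for the imported content.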

Replace bookmarks content without removing the bookmark

I want to replace the text content of bookmarks without losing the bookmark.
foreach(Bookmark b in document.Bookmarks)
{
b.Range.Text = "newtext"; // text is set in document but bookmark is gone
}
I tried to set the new Range of the bookmark before the Text setting but I still have the same problem.
I also tried to re-add the bookmark with document.Bookmarks.Add(name, range); but I can't create an instance of range.
I had to re-add the bookmarks and save the range temporarily. I also had to add a list of processed items to avoid an endless loop.
List<string> bookmarksProcessed = new List<string>();
foreach (Bookmark b in document.Bookmarks)
{
if (!bookmarksProcessed.Contains(b.Name))
{
string text = getTextFromBookmarkName(b.Name);
var newend = b.Range.Start + text.Length;
var name = b.Name;
Range rng = b.Range;
b.Range.Text = text;
rng.End = newend;
document.Bookmarks.Add(name, rng);
bookmarksProcessed.Add(name);
}
}
Looks like you solved your problem, but here is a cleaner way to do it:
using Office = Microsoft.Office.Core;
using Microsoft.Office.Tools.Word;
using System.Text.RegularExpressions;
using Word = Microsoft.Office.Interop.Word;
//declare and get the current document
Document extendedDocument = Globals.Factory.GetVstoObject(Globals.ThisAddIn.Application.ActiveDocument);
List<string> bookmarksProcessed = new List<string>();
foreach(Word.Bookmark oldBookmark in extendedDocument.Bookmarks)
{
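if(!bookmarksProcessed.Contains(oldBookmark.Name))
{
string newText = getTextFromBookmarkName(oldBookmark.Name);
Word.Range newRange = oldBookmark.Range;
// End is a document position, so offset it from the start of the range
newRange.End = newRange.Start + newText.Length;
// remember the name so the re-added bookmark is not processed again
bookmarksProcessed.Add(oldBookmark.Name);
Word.Bookmark newBookmark = extendedDocument.Controls.AddBookmark(newRange, oldBookmark.Name);
newBookmark.Text = newText;
oldBookmark.Delete();
}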
}
Code isn't tested but should work.
With the above approach, you still lose the bookmark before it is added back in. If you really need to preserve the bookmark, I find that you can create an inner bookmark that wraps around the text (a bookmark within bookmark). After having the inner bookmark, you simply need to do:
innerBookmark.Range.Text = newText;
After the text replacing, the inner bookmark is gone and the outer bookmark is preserved. No need to set range.End.
You can create the inner bookmark manually or programmatically depending on your situation.

How to add custom tags to powerpoint slides using OpenXml in c#

I'm trying to add custom tags to PowerPoint slides using OpenXml component in c#.
But after adding the tags dynamically, the presentation gets corrupted and is causing the powerpoint to stop working.
Following is the code I'm using for adding the tags:
public static void AddSlideTag(string presentationPath, string tagValue)
{
using (PresentationDocument presentation = PresentationDocument.Open(presentationPath, true))
{
var presPart = presentation.PresentationPart;
// Copy each slide in the source presentation, in order, to
// the destination presentation.
var slideIndex = 0;
foreach (var openXmlElement in presPart.Presentation.SlideIdList)
{
var slideId = (SlideId)openXmlElement;
// Create a unique relationship id.
var sp = (SlidePart)presPart.GetPartById(slideId.RelationshipId);
//create userDefinedTag
Tag slideObjectTag = new Tag() { Name = "CustomTag", Val = tagValue};
UserDefinedTagsPart userDefinedTagsPart1 = sp.AddNewPart<UserDefinedTagsPart>();
if (userDefinedTagsPart1.TagList == null)
userDefinedTagsPart1.TagList = new TagList();
userDefinedTagsPart1.TagList.Append(slideObjectTag);
//add tag to CustomerDataList element
var id = sp.GetIdOfPart(userDefinedTagsPart1);
if (sp.Slide.CommonSlideData == null)
{
sp.Slide.CommonSlideData = new CommonSlideData();
}
if (sp.Slide.CommonSlideData.CustomerDataList == null)
sp.Slide.CommonSlideData.CustomerDataList = new CustomerDataList();
CustomerDataTags tags = new CustomerDataTags();
tags.Id = id;
sp.Slide.CommonSlideData.CustomerDataList.AppendChild(tags);
slideIndex++;
sp.Slide.Save();
}
presPart.Presentation.Save();
}
}
It seems like the xml of the slides is broken after executing this code, but I cannot figure out what is the problem.
Does anyone know what is wrong with the code above, or what is causing the PowerPoint file to get corrupted after running this code? Am I missing something here?
I'm using DocumentFormat.OpenXml.2.5 package.

Replace MergeFields in a Word 2003 document and keep style

I've been trying to create a library to replace the MergeFields on a Word 2003 document, everything works fine, except that I lose the style applied to the field when I replace it, is there a way to keep it?
This is the code I'm using to replace the fields:
private void FillFields2003(string template, Dictionary<string, string> values)
{
object missing = Missing.Value;
var application = new ApplicationClass();
var document = new Microsoft.Office.Interop.Word.Document();
try
{
// Open the file
foreach (Field mergeField in document.Fields)
{
if (mergeField.Type == WdFieldType.wdFieldMergeField)
{
string fieldText = mergeField.Code.Text;
string fieldName = Extensions.GetFieldName(fieldText);
if (values.ContainsKey(fieldName))
{
mergeField.Select();
application.Selection.TypeText(values[fieldName]);
}
}
}
document.Save();
}
finally
{
// Release resources
}
}
I tried using the CopyFormat and PasteFormat methods in the selection, also using the get_style and set_style but to no avail.
Instead of using TypeText over the top of your selection, use the Result property of the Field:
if (values.ContainsKey(fieldName))
{
// Result is a Range, so assign the new text through its Text property
mergeField.Result.Text = values[fieldName];
}
This will ensure any formatting in the field is retained.
