Replace bookmarks content without removing the bookmark - c#

I want to replace the text content of bookmarks without loosing the bookmark.
foreach(Bookmark b in document.Bookmarks)
{
b.Range.Text = "newtext"; // text is set in document but bookmark is gone
}
I tried to set the new Range of the bookmark before the Text setting but I still have the same problem.
I also tried to re-add the bookmark with document.Bookmarks.Add(name, range); but I can't create an instance of range.

I had to readd the bookmarks and save the range temporarily. I also had to add a list of processed items to evade an endless loop.
List<string> bookmarksProcessed = new List<string>();
foreach (Bookmark b in document.Bookmarks)
{
if (!bookmarksProcessed.Contains(b.Name))
{
string text = getTextFromBookmarkName(b.Name);
var newend = b.Range.Start + text.Length;
var name = b.Name;
Range rng = b.Range;
b.Range.Text = text;
rng.End = newend;
document.Bookmarks.Add(name, rng);
bookmarksProcessed.Add(name);
}
}

Looks like you solved your problem, but here is a cleaner way to do it:
using Office = Microsoft.Office.Core;
using Microsoft.Office.Tools.Word;
using System.Text.RegularExpressions;
using Word = Microsoft.Office.Interop.Word;
//declare and get the current document
Document extendedDocument = Globals.Factory.GetVstoObject(Globals.ThisAddIn.Application.ActiveDocument);
List<string> bookmarksProcessed = new List<string>();
foreach(Word.Bookmark oldBookmark in extendedDocument.Bookmarks)
{
if(bookmarksProcessed.Contains(oldBookmark.Name))
{
string newText = getTextFromBookmarkName(oldBookmark.Name)
Word.Range newRange = oldBookmark.Range;
newRange.End = newText.Length;
Word.Bookmark newBookmark = extendedDocument.Controls.AddBookmark(newRange, oldBookmark.Name);
newBookmark.Text = newText;
oldBookmark.Delete();
}
}
Code isn't tested but should work.

With the above approach, you still lose the bookmark before it is added back in. If you really need to preserve the bookmark, I find that you can create an inner bookmark that wraps around the text (a bookmark within bookmark). After having the inner bookmark, you simply need to do:
innerBookmark.Range.Text = newText;
After the text replacing, the inner bookmark is gone and the outer bookmark is preserved. No need to set range.End.
You can create the inner bookmark manually or programmatically depending on your situation.

Related

Apose.Words ImportNode ignores font formatting when appendingchild

I am currently using Aspose.Words to open a document, pull content between a bookmark start and a bookmark end and then place that content into another document. The issue that I'm having is that when using the ImportNode method is imports onto my document but changes all of the fonts from Calibri to Times New Roman and changes the font size from whatever it was on the original document to 12pt.
The way I'm obtaining the content from the bookmark is by using the Aspose ExtractContent method.
Because I'm having the issue with the ImportNode stripping my font formatting I tried making some adjustments and saving each node to an HTML string using ToString(HtmlSaveOptions). This works mostly but the problem with this is it is stripping out my returns on the word document so none of my text has the appropriate spacing. My returns end up coming in as HTML in the following format
"<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\"> </span></p>"
When using
DocumentBuilder.InsertHtml("<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\"> </span></p>");
it does not correctly add the return on the word document.
Here is the code I'm using, please forgive the comments etc... this has been my attempts at correcting this.
public async Task<string> GenerateHtmlString(Document srcDoc, ArrayList nodes)
{
// Create a blank document.
Document dstDoc = new Document();
ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Open"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
// Remove the first paragraph from the empty document.
dstDoc.FirstSection.Body.RemoveAllChildren();
// Create a new Builder for the temporary document that gets generated with the header or footer data.
// This allows us to control font and styles separately from the main document being built.
var newBuilder = new DocumentBuilder(dstDoc);
Aspose.Words.Saving.HtmlSaveOptions htmlSaveOptions = new Aspose.Words.Saving.HtmlSaveOptions();
htmlSaveOptions.ExportImagesAsBase64 = true;
htmlSaveOptions.SaveFormat = SaveFormat.Html;
htmlSaveOptions.ExportFontsAsBase64 = true;
htmlSaveOptions.ExportFontResources = true;
htmlSaveOptions.ExportTextBoxAsSvg = true;
htmlSaveOptions.ExportRoundtripInformation = true;
htmlSaveOptions.Encoding = Encoding.UTF8;
// Obtain all the links from the source document
// This is used later to add hyperlinks to the html
// because by default extracting nodes using Aspose
// does not pull in the links in a usable way.
var srcDocLinks = srcDoc.Range.Fields.GroupBy(x => x.DisplayResult).Select(x => x.First()).Where(x => x.Type == Aspose.Words.Fields.FieldType.FieldHyperlink).Distinct().ToList();
var childNodes = nodes.Cast<Node>().Select(x => x).ToList();
var oldBuilder = new DocumentBuilder(srcDoc);
oldBuilder.MoveToBookmark("Header");
var allchildren = oldBuilder.CurrentParagraph.Runs;
var allChildNodes = childNodes[0].Document.GetChildNodes(NodeType.Any, true);
var headerText = allChildNodes[0].Range.Bookmarks["Header"].BookmarkStart.GetText();
foreach (Node node in nodes)
{
var html = node.ToString(htmlSaveOptions);
try
{
//   is used by aspose because it works in XML
// If we see this character and the text of the node is \r we need to insert a break
if (html.Contains(" ") && node.Range.Text == "\r")
{
newBuilder.InsertHtml(html, false);
// Change the node into an HTML string
/*var htmlString = node.ToString(SaveFormat.Html);
var tempHtmlLinkDoc = new HtmlDocument();
tempHtmlLinkDoc.LoadHtml(htmlString);
// Get all the child nodes of the html document
var allChildNodes = tempHtmlLinkDoc.DocumentNode.SelectNodes("//*");
// Loop over all child nodes so we can make sure we apply the correct font family and size to the break.
foreach (var childNode in allChildNodes)
{
// Get the style attribute from the child node
var childNodeStyles = childNode.GetAttributeValue("style", "").Split(';');
foreach (var childNodeStyle in childNodeStyles)
{
// Apply the font name and size to the new builder on the document.
if (childNodeStyle.ToLower().Contains("font-family"))
{
newBuilder.Font.Name = childNodeStyle.Split(':')[1].Trim();
}
if (childNodeStyle.ToLower().Contains("font-size"))
{
newBuilder.Font.Size = Convert.ToDouble(childNodeStyle.Split(':')[1]
.Replace("pt", "")
.Replace("px", "")
.Replace("em", "")
.Replace("rem", "")
.Replace("%", "")
.Trim());
}
}
}
// Insert the break with the corresponding font size and name.
newBuilder.InsertBreak(BreakType.ParagraphBreak);*/
}
else
{
// Loop through the source document links so the link can be applied to the HTML.
foreach (var srcDocLink in srcDocLinks)
{
if (html.Contains(srcDocLink.DisplayResult))
{
// Now that we know the html string has one of the links in it we need to get the address from the node.
var linkAddress = srcDocLink.Start.NextSibling.GetText().Replace(" HYPERLINK \"", "").Replace("\"", "");
//Convert the node into an HTML String so we can get the correct font color, name, size, and any text decoration.
var htmlString = srcDocLink.Start.NextSibling.ToString(SaveFormat.Html);
var tempHtmlLinkDoc = new HtmlDocument();
tempHtmlLinkDoc.LoadHtml(htmlString);
var linkStyles = tempHtmlLinkDoc.DocumentNode.ChildNodes[0].GetAttributeValue("style", "").Split(';');
var linkStyleHtml = "";
foreach (var linkStyle in linkStyles)
{
if (linkStyle.ToLower().Contains("color"))
{
linkStyleHtml += $"color:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("font-family"))
{
linkStyleHtml += $"font-family:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("font-size"))
{
linkStyleHtml += $"font-size:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("text-decoration"))
{
linkStyleHtml += $"text-decoration:{linkStyle.Split(':')[1].Trim()};";
}
}
if (linkAddress.ToLower().Contains("mailto:"))
{
// Since the link has mailto included don't add the target attribute to the link.
html = new Regex($#"\b{srcDocLink.DisplayResult}\b").Replace(html, $"{srcDocLink.DisplayResult}");
//html = html.Replace(srcDocLink.DisplayResult, $"{srcDocLink.DisplayResult}");
}
else
{
// Since the links is not an email include the target attribute.
html = new Regex($#"\b{srcDocLink.DisplayResult}\b").Replace(html, $"{srcDocLink.DisplayResult}");
//html = html.Replace(srcDocLink.DisplayResult, $"{srcDocLink.DisplayResult}");
}
}
}
// Inseret the HTML String into the temporary document.
newBuilder.InsertHtml(html, false);
}
}
catch (Exception ex)
{
throw;
}
}
// This is just for debugging/troubleshooting purposes and to make sure thigns look correct
string tempDocxPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.docx");
dstDoc.Save(tempDocxPath);
// We generate this HTML file then load it back up and pass the DocumentNode.OuterHtml back to the requesting method.
ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Save"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
string tempHtmlPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.html");
dstDoc.Save(tempHtmlPath, htmlSaveOptions);
var tempHtmlDoc = new HtmlDocument();
tempHtmlDoc.Load(tempHtmlPath);
var htmlText = tempHtmlDoc.DocumentNode.OuterHtml;
// Clean up our mess...
if (File.Exists(tempDocxPath))
{
File.Delete(tempDocxPath);
}
if (File.Exists(tempHtmlPath))
{
File.Delete(tempHtmlPath);
}
// Return the generated HTML string.
return htmlText;
}
Saving each node to HTML and then inserting them into the destination document is not a good idea. Because not all nodes can be properly saved to HTML and some formatting can be lost after Aspose.Words DOM -> HTML -> Aspose.Words DOM roundtrip.
Regarding the original issue, the problem might occur because you are using ImportFormatMode.UseDestinationStyles, in this case styles and default of the destination document are used and font might be changed. If you need to keep the source document formatting, you should use ImportFormatMode.KeepSourceFormatting.
If the problem occurs even with ImportFormatMode.KeepSourceFormatting this must be a bug and you should report this to Aspose.Words staff in the support forum.

Itext7 PdfAction.CreateGoTo() Links not working in final document

I've parsed html into a PDF and created a table of contents from the Header tags. The bookmarks in the document work fine, but clicking on the line in the table of contents doesn't do anything. The cursor doesn't change icons like it does if I put a URL in the link.
I used Itext RUPS to inspect the final PDF and the named destinations are in the final file.
I tried hard coding a couple of the names in just to see what happens, but they also didn't work. Putting in .CreateURL and google.com works fine.
The one thing I'm doing that may or may not be an issue is I'm creating the body document, then creating the table of contents and merging the two documents.
Maybe Bruno can make a cameo on this one.
private static List ProcessOutlineChildren(PdfDocument pdfDocument, List tableOfContents, IEnumerable<PdfOutline> pdfOutlines, IDictionary<String, PdfObject> names = null)
{
List<TabStop> tabStops = new List<TabStop>();
tabStops.Add(new TabStop(580, TabAlignment.RIGHT));
foreach (var o in pdfOutlines)
{
ListItem currentOutlineItem = new ListItem();
Paragraph paragraph = new Paragraph();
paragraph.AddTabStops(tabStops);
paragraph.Add(o.GetTitle());
paragraph.Add(new Tab());
paragraph.Add((pdfDocument.GetPageNumber((PdfDictionary) o.GetDestination().GetDestinationPage(names))).ToString());
paragraph.SetAction(PdfAction.CreateGoTo(o.GetDestination()));
currentOutlineItem.Add(paragraph);
if (o.GetAllChildren().Any())
{
currentOutlineItem.Add(ProcessOutlineChildren(pdfDocument, new List(), o.GetAllChildren(), names));
}
tableOfContents.Add(currentOutlineItem);
}
return tableOfContents;
}
public class CustomOutlineHandler : OutlineHandler
{
//PDF's require a unique name for destinations, this is how the actions/bookmarks jump to a location.
protected override string GenerateUniqueDestinationName(IElementNode element)
{
string destinationName = base.GenerateUniqueDestinationName(element);
if ("p".Equals(element.Name()))
{
destinationName = destinationName.Replace(GetDestinationNamePrefix(), "paragraph-prefix-");
}
return destinationName;
}
}
//From my main method converting things into PDF.
OutlineHandler customOutlineHandler = new CustomOutlineHandler().PutAllTagPriorityMappings(priorityMappings);
customOutlineHandler.SetDestinationNamePrefix("destination-name-");
properties.SetOutlineHandler(customOutlineHandler);

OpenXML Find Replace Text

Environment
Visual Studio 2017 C# (Word .docx file)
Problem
The find/replace only replaces "{Today}" - it fails to replace the "{ConsultantName}" field. I've checked the document and tried using different approaches (see commented-out code) but no joy.
The Word document has just a few paragraphs of text - there are no tables or text boxes in the document. What am I doing wrong?
Update
When I inspect doc_text string, I can see "{Today}" but "{ConsultantName}" is split into multiple runs. The opening and closing braces are not together with the word - there are XML tags between them:
{</w:t></w:r><w:proofErr w:type="spellStart"/><w:r w:rsidR="00544806"><w:t>ConsultantName</w:t></w:r><w:proofErr w:type="spellEnd"/><w:r w:rsidR="00544806"><w:t>}
Code
string doc_text = string.Empty;
List<string> s_find = new List<string>();
List<string> s_replace = new List<string>();
// Regex regexText = null;
s_find.Add("{Today}");
s_replace.Add("24 Sep 2018");
s_find.Add("{ConsultantName}");
s_replace.Add("John Doe");
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filePath, true))
{
// read document
using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
doc_text = sr.ReadToEnd();
}
// find replace
for (byte b = 0; b < s_find.Count; b++)
{
doc_text = new Regex(s_find[b], RegexOptions.IgnoreCase).Replace(doc_text, s_replace[b]);
// regexText = new Regex(s_find[b]);
// doc_text = doc_text.Replace(s_find[b], s_replace[b]);
// doc_text = regexText.Replace(doc_text, s_replace[b]);
}
// update document
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
sw.Write(doc_text);
}
}
Note: I want to avoid using Word Interop. I don't want to create an instance of Word and use Word's object model to do the Find/Replace.
There is no way to avoid Word splitting text into multiple runs. It happens even if you type text directly into the document, make no changes and apply no formatting.
However, I found a way around the problem by adding custom fields to the document as follows:
Open Word document. Go to File->Info
Click the Properties heading and select Advanced Properties.
Select the Custom tab.
Add the field names you want to use and Save.
In the document click Insert on the main menu.
Click Explore Quick Parts icon and select Field...
Drop-down Categories and select Document Information.
Under Field names: select DocProperty.
Select your custom field name in the "Property" list and click ok.
This inserts the field into your document and even if you apply formatting, the field name will be whole and not be broken into multiple runs.
Update
To save users the laborious task of manually adding a lot of custom properties to a document, I wrote a method to do this using OpenXML.
Add the following usings:
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.CustomProperties;
using DocumentFormat.OpenXml.VariantTypes;
Code to add custom (text) properties to the document:
static public bool RunWordDocumentAddProperties(string filePath, List<string> strName, List<string> strVal)
{
bool is_ok = true;
try
{
if (File.Exists(filePath) == false)
return false;
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filePath, true))
{
var customProps = wordDoc.CustomFilePropertiesPart;
if (customProps == null)
{
// no custom properties? Add the part, and the collection of properties
customProps = wordDoc.AddCustomFilePropertiesPart();
customProps.Properties = new DocumentFormat.OpenXml.CustomProperties.Properties();
}
for (byte b = 0; b < strName.Count; b++)
{
var props = customProps.Properties;
if (props != null)
{
var newProp = new CustomDocumentProperty();
newProp.VTLPWSTR = new VTLPWSTR(strVal[b].ToString());
newProp.FormatId = "{D5CDD505-2E9C-101B-9397-08002B2CF9AE}";
newProp.Name = strName[b];
// append the new property, and fix up all the property ID values
// property ID values must start at 2
props.AppendChild(newProp);
int pid = 2;
foreach (CustomDocumentProperty item in props)
{
item.PropertyId = pid++;
}
props.Save();
}
}
}
}
catch (Exception ex)
{
is_ok = false;
ProcessError(ex);
}
return is_ok;
}
You only need to do this:
*.csproj
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>netcoreapp3.1</TargetFramework>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="DocumentFormat.OpenXml" Version="2.12.3" />
</ItemGroup>
</Project>
add these packages:
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
and put this code in your system
using (WordprocessingDocument wordprocessingDocument =
WordprocessingDocument.Open(filepath, true))
{
var body = wordprocessingDocument.MainDocumentPart.Document.Body;
var paras = body.Elements<Paragraph>();
foreach (var para in paras)
{
foreach (var run in para.Elements<Run>())
{
foreach (var text in run.Elements<Text>())
{
if (text.Text.Contains("#_KEY_1_#"))
{
text.Text = text.Text.Replace("#_KEY_1_#", "replaced-text");
}
}
}
}
}
done

Aspose PDF - get text from page that has a matching string

I'm working with an existing library - the goal of the library is to pull text out of PDFs to verify against expected values to quality check recorded data vs data in pdf.
I'm looking for a way to succinctly pull a specific page worth of text given a string that should only fall on that specific page.
var pdfDocument = new Document(file.PdfFilePath);
var textAbsorber = new TextAbsorber{
ExtractionOptions = {
FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
}
};
pdfDocument.Pages.Accept(textAbsorber);
foreach (var page in pdfDocument.Pages)
{
}
I'm stuck inside the foreach(var page in pdfDocument.Pages) portion... or is that the right area to be looking?
Answer: Text Absorber recreated each page - inside the foreach loop.
If the absorber isn't recreated, it keeps text from previous loops.
public List<string> ProcessPage(MyInfoClass file, string find)
{
var pdfDocument = new Document(file.PdfFilePath);
foreach (Page page in pdfDocument.Pages)
{
var textAbsorber = new TextAbsorber {
ExtractionOptions = {
FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
}
};
page.Accept(textAbsorber);
var ext = textAbsorber.Text;
var exts = ext.Replace("\n", "").Split('\r').ToList();
if (ext.Contains(find))
return exts;
}
return null;
}

Replace MergeFields in a Word 2003 document and keep style

I've been trying to create a library to replace the MergeFields on a Word 2003 document, everything works fine, except that I lose the style applied to the field when I replace it, is there a way to keep it?
This is the code I'm using to replace the fields:
private void FillFields2003(string template, Dictionary<string, string> values)
{
object missing = Missing.Value;
var application = new ApplicationClass();
var document = new Microsoft.Office.Interop.Word.Document();
try
{
// Open the file
foreach (Field mergeField in document.Fields)
{
if (mergeField.Type == WdFieldType.wdFieldMergeField)
{
string fieldText = mergeField.Code.Text;
string fieldName = Extensions.GetFieldName(fieldText);
if (values.ContainsKey(fieldName))
{
mergeField.Select();
application.Selection.TypeText(values[fieldName]);
}
}
}
document.Save();
}
finally
{
// Release resources
}
}
I tried using the CopyFormat and PasteFormat methods in the selection, also using the get_style and set_style but to no exent.
Instead of using TypeText over the top of your selection use the the Result property of the Field:
if (values.ContainsKey(fieldName))
{
mergeField.Result = (values[fieldName]);
}
This will ensure any formatting in the field is retained.

Categories