OpenXML: replacing text in multiple runs - C#

I have a Word document with fields that I need to change (see below), but for a reason I don't understand, my modifications are not saved.
I'm using the OpenXML .NET SDK in C#.
Code:
using (WordprocessingDocument myDoc = WordprocessingDocument.Open(destinationFile, true))
{
    var body = myDoc.MainDocumentPart.Document.Body;
    foreach (var headerParts in myDoc.MainDocumentPart.HeaderParts)
    {
        foreach (var Para in headerParts.Header.Descendants<DocumentFormat.OpenXml.Wordprocessing.Paragraph>())
        {
            foreach (var run in Para.Descendants<DocumentFormat.OpenXml.Wordprocessing.Run>())
            {
                foreach (var text in run.Descendants<DocumentFormat.OpenXml.Wordprocessing.Text>())
                {
                    text.Text = text.Text.Replace("Nom", cv.firstName);
                    text.Text = text.Text.Replace("Prenom", cv.secondName);
                    text.Text = text.Text.Replace("NbAnnee", cv.nbAnneeExp.ToString());
                    text.Text = text.Text.Replace("Objet", cv.objet);
                }
            }
        }
    }
    myDoc.MainDocumentPart.Document.Save();
}
I don't know where I went wrong; I followed many of the examples posted on SO.
Does anyone have an idea?

I found the answer myself: my document uses content controls, and because they are not editable, the changes could not be saved.
Moral: this method works... only if you don't use content controls :D

Related

No Errors yet nothing in Console?

I'm after some help with code I've written to extract data using HttpClient.
I am new to writing code and can't find my problem. Could someone please help me troubleshoot it?
I expect the data of the table I'm scraping to be written to the console.
Any help appreciated.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using HtmlAgilityPack;

namespace weatherCheck
{
    class Program
    {
        private static void Main(string[] args)
        {
            GetHtmlAsync();
            Console.ReadLine();
        }

        protected static async void GetHtmlAsync()
        {
            var url = "https://www.weatherzone.com.au/vic/melbourne/melbourne";
            var httpClient = new HttpClient();
            var html = await httpClient.GetStringAsync(url);
            var htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(html);
            //grab the rain chance, rain in mm and date
            var MyTable = Enumerable.FirstOrDefault(htmlDocument.DocumentNode.Descendants("table")
                .Where(table => table.Attributes.Contains("id"))
                , table => table.Attributes["id"].Value == "forecast-table");
            List<HtmlNode> rows = htmlDocument.DocumentNode.SelectNodes("//tr").ToList();
            foreach (var row in rows)
            {
                try
                {
                    if (MyTable != null)
                    {
                        Console.WriteLine(MyTable.GetAttributeValue("forecast-table", " "));
                    }
                }
                catch (Exception)
                {
                }
            }
        }
    }
}
I used your code to look up the values, but it didn't produce anything for me either. When I look at htmlDocument.DocumentNode.OuterHtml to view the entire HTML it is scraping, I don't see anything in the document that reflects an attribute named forecast-table.
Also, you are checking MyTable on every iteration of the loop over rows. You should check row != null instead, and print the attribute from row.
var MyTable = Enumerable.FirstOrDefault(htmlDocument.DocumentNode.Descendants("table")
    .Where(table => table.Attributes.Contains("id")), table => table.Attributes["id"].Value == "forecast-table");
List<HtmlNode> rows = htmlDocument.DocumentNode.SelectNodes("//tr").ToList();
foreach (var row in rows)
{
    try
    {
        if (row != null) // Check row here, not MyTable, and print from row on the next line.
            Console.WriteLine(row.GetAttributeValue("forecast-table", " "));
    }
    catch (Exception)
    {
    }
}
The problem is as follows.
You should also know that the HTML you see in Chrome's Dev Tools is not the same as the HTML HtmlAgilityPack sees. Chrome renders the page after executing its scripts, whereas HtmlAgilityPack only gives you the raw HTML of the page as downloaded. This is why you are not able to get the value of forecast-table.
According to the documentation, GetAttributeValue(name, def) returns def if the attribute is not found.
So in your case it prints " " (the single-space default you passed), because no attribute named forecast-table is found.
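To make the point about GetAttributeValue concrete, here is a minimal sketch of reading the rows of a table selected by its id. In the question, "forecast-table" is the value of the id attribute, not the name of an attribute, so the cell text has to come from InnerText instead. The class name ForecastTableReader and the method ReadRows are hypothetical names chosen for illustration, and the sketch assumes the table is actually present in the raw HTML (which, as noted above, may not be true for the live page).

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ForecastTableReader
{
    // Hypothetical helper: returns one "cell | cell" string per row of the
    // table with the given id, or an empty list when the table is absent.
    public static List<string> ReadRows(string html, string tableId)
    {
        var result = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches.
        var table = doc.DocumentNode.SelectSingleNode($"//table[@id='{tableId}']");
        if (table == null) return result;

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = table.SelectNodes(".//tr");
        if (rows == null) return result;

        foreach (var row in rows)
        {
            var cells = row.SelectNodes("./td");
            if (cells == null) continue; // e.g. a header row made only of <th>

            // The visible text lives in InnerText, not in an attribute.
            result.Add(string.Join(" | ", cells.Select(c => c.InnerText.Trim())));
        }
        return result;
    }
}
```

Calling ForecastTableReader.ReadRows(html, "forecast-table") on the downloaded string and writing each entry with Console.WriteLine would print one line per row. An empty result suggests the table is generated by script and is not in the raw HTML.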
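Alternatively, remove async and await and block on the task that httpClient.GetStringAsync(url) returns (fine in a small console program, although blocking on .Result can deadlock in UI or classic ASP.NET code):
var html = httpClient.GetStringAsync(url).Result;
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
Then print with a visible default value:
Console.WriteLine(MyTable.GetAttributeValue("forecast-table", "SOME_TEXT_HERE").ToString());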

OpenXML Find Replace Text

Environment
Visual Studio 2017 C# (Word .docx file)
Problem
The find/replace only replaces "{Today}" - it fails to replace the "{ConsultantName}" field. I've checked the document and tried using different approaches (see commented-out code) but no joy.
The Word document has just a few paragraphs of text - there are no tables or text boxes in the document. What am I doing wrong?
Update
When I inspect doc_text string, I can see "{Today}" but "{ConsultantName}" is split into multiple runs. The opening and closing braces are not together with the word - there are XML tags between them:
{</w:t></w:r><w:proofErr w:type="spellStart"/><w:r w:rsidR="00544806"><w:t>ConsultantName</w:t></w:r><w:proofErr w:type="spellEnd"/><w:r w:rsidR="00544806"><w:t>}
Code
string doc_text = string.Empty;
List<string> s_find = new List<string>();
List<string> s_replace = new List<string>();
// Regex regexText = null;
s_find.Add("{Today}");
s_replace.Add("24 Sep 2018");
s_find.Add("{ConsultantName}");
s_replace.Add("John Doe");
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filePath, true))
{
    // read document
    using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
    {
        doc_text = sr.ReadToEnd();
    }
    // find replace
    for (byte b = 0; b < s_find.Count; b++)
    {
        doc_text = new Regex(s_find[b], RegexOptions.IgnoreCase).Replace(doc_text, s_replace[b]);
        // regexText = new Regex(s_find[b]);
        // doc_text = doc_text.Replace(s_find[b], s_replace[b]);
        // doc_text = regexText.Replace(doc_text, s_replace[b]);
    }
    // update document
    using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
    {
        sw.Write(doc_text);
    }
}
Note: I want to avoid using Word Interop. I don't want to create an instance of Word and use Word's object model to do the Find/Replace.
There is no way to avoid Word splitting text into multiple runs. It happens even if you type text directly into the document, make no changes and apply no formatting.
However, I found a way around the problem by adding custom fields to the document as follows:
Open Word document. Go to File->Info
Click the Properties heading and select Advanced Properties.
Select the Custom tab.
Add the field names you want to use and Save.
In the document click Insert on the main menu.
Click Explore Quick Parts icon and select Field...
Drop-down Categories and select Document Information.
Under Field names: select DocProperty.
Select your custom field name in the "Property" list and click ok.
This inserts the field into your document and even if you apply formatting, the field name will be whole and not be broken into multiple runs.
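If adding custom fields is not an option, another way to cope with split runs is to join the text of all runs in a paragraph before replacing. The following is a minimal sketch of that idea, not the method used in this answer: it works on the document XML with System.Xml.Linq, and the class name SplitRunReplacer and method ReplaceAcrossRuns are hypothetical. It assumes it is acceptable for the replaced paragraph to take the formatting of its first run, since the merged text is written there and the other text elements are emptied.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class SplitRunReplacer
{
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Replaces 'find' with 'replacement' in every paragraph under 'root',
    // even when the placeholder is split across several <w:t> elements.
    public static void ReplaceAcrossRuns(XElement root, string find, string replacement)
    {
        foreach (var paragraph in root.DescendantsAndSelf(W + "p"))
        {
            var texts = paragraph.Descendants(W + "t").ToList();
            if (texts.Count == 0) continue;

            // Join the pieces so a placeholder spanning runs becomes visible.
            string joined = string.Concat(texts.Select(t => t.Value));
            if (!joined.Contains(find)) continue;

            // Put the whole result in the first text element and keep its spaces.
            texts[0].Value = joined.Replace(find, replacement);
            texts[0].SetAttributeValue(XNamespace.Xml + "space", "preserve");

            // Empty the remaining pieces; their runs stay in place.
            foreach (var t in texts.Skip(1)) t.Value = string.Empty;
        }
    }
}
```

With the document XML loaded into an XElement (for example via XElement.Parse on the string read from the main document part), calling SplitRunReplacer.ReplaceAcrossRuns(root, "{ConsultantName}", "John Doe") handles the split shown in the question. The trade-off is that formatting which differed between the merged runs is lost.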
Update
To save users the laborious task of manually adding a lot of custom properties to a document, I wrote a method to do this using OpenXML.
Add the following usings:
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.CustomProperties;
using DocumentFormat.OpenXml.VariantTypes;
Code to add custom (text) properties to the document:
static public bool RunWordDocumentAddProperties(string filePath, List<string> strName, List<string> strVal)
{
    bool is_ok = true;
    try
    {
        if (File.Exists(filePath) == false)
            return false;
        using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filePath, true))
        {
            var customProps = wordDoc.CustomFilePropertiesPart;
            if (customProps == null)
            {
                // no custom properties? Add the part, and the collection of properties
                customProps = wordDoc.AddCustomFilePropertiesPart();
                customProps.Properties = new DocumentFormat.OpenXml.CustomProperties.Properties();
            }
            for (byte b = 0; b < strName.Count; b++)
            {
                var props = customProps.Properties;
                if (props != null)
                {
                    var newProp = new CustomDocumentProperty();
                    newProp.VTLPWSTR = new VTLPWSTR(strVal[b].ToString());
                    newProp.FormatId = "{D5CDD505-2E9C-101B-9397-08002B2CF9AE}";
                    newProp.Name = strName[b];
                    // append the new property, and fix up all the property ID values
                    // property ID values must start at 2
                    props.AppendChild(newProp);
                    int pid = 2;
                    foreach (CustomDocumentProperty item in props)
                    {
                        item.PropertyId = pid++;
                    }
                    props.Save();
                }
            }
        }
    }
    catch (Exception ex)
    {
        is_ok = false;
        ProcessError(ex);
    }
    return is_ok;
}
You only need to do this:
*.csproj
<Project Sdk="Microsoft.NET.Sdk">
    <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>netcoreapp3.1</TargetFramework>
    </PropertyGroup>
    <ItemGroup>
        <PackageReference Include="DocumentFormat.OpenXml" Version="2.12.3" />
    </ItemGroup>
</Project>
Add these using directives:
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
and put this code in your program:
using (WordprocessingDocument wordprocessingDocument =
    WordprocessingDocument.Open(filepath, true))
{
    var body = wordprocessingDocument.MainDocumentPart.Document.Body;
    var paras = body.Elements<Paragraph>();
    foreach (var para in paras)
    {
        foreach (var run in para.Elements<Run>())
        {
            foreach (var text in run.Elements<Text>())
            {
                if (text.Text.Contains("#_KEY_1_#"))
                {
                    text.Text = text.Text.Replace("#_KEY_1_#", "replaced-text");
                }
            }
        }
    }
}
Done.

Aspose PDF - get text from page that has a matching string

I'm working with an existing library - the goal of the library is to pull text out of PDFs to verify against expected values to quality check recorded data vs data in pdf.
I'm looking for a way to succinctly pull a specific page worth of text given a string that should only fall on that specific page.
var pdfDocument = new Document(file.PdfFilePath);
var textAbsorber = new TextAbsorber
{
    ExtractionOptions =
    {
        FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
    }
};
pdfDocument.Pages.Accept(textAbsorber);
foreach (var page in pdfDocument.Pages)
{
}
I'm stuck inside the foreach(var page in pdfDocument.Pages) portion... or is that the right area to be looking?
Answer: recreate the TextAbsorber for each page, inside the foreach loop.
If the absorber isn't recreated, it keeps text from previous loops.
public List<string> ProcessPage(MyInfoClass file, string find)
{
    var pdfDocument = new Document(file.PdfFilePath);
    foreach (Page page in pdfDocument.Pages)
    {
        var textAbsorber = new TextAbsorber
        {
            ExtractionOptions =
            {
                FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
            }
        };
        page.Accept(textAbsorber);
        var ext = textAbsorber.Text;
        var exts = ext.Replace("\n", "").Split('\r').ToList();
        if (ext.Contains(find))
            return exts;
    }
    return null;
}

How to add custom tags to PowerPoint slides using OpenXml in C#

I'm trying to add custom tags to PowerPoint slides using the OpenXml component in C#.
But after the tags are added dynamically, the presentation is corrupted and causes PowerPoint to stop working.
This is the code I'm using to add the tags:
public static void AddSlideTag(string presentationPath, string tagValue)
{
    using (PresentationDocument presentation = PresentationDocument.Open(presentationPath, true))
    {
        var presPart = presentation.PresentationPart;
        // Copy each slide in the source presentation, in order, to
        // the destination presentation.
        var slideIndex = 0;
        foreach (var openXmlElement in presPart.Presentation.SlideIdList)
        {
            var slideId = (SlideId)openXmlElement;
            // Create a unique relationship id.
            var sp = (SlidePart)presPart.GetPartById(slideId.RelationshipId);
            //create userDefinedTag
            Tag slideObjectTag = new Tag() { Name = "CustomTag", Val = tagValue };
            UserDefinedTagsPart userDefinedTagsPart1 = sp.AddNewPart<UserDefinedTagsPart>();
            if (userDefinedTagsPart1.TagList == null)
                userDefinedTagsPart1.TagList = new TagList();
            userDefinedTagsPart1.TagList.Append(slideObjectTag);
            //add tag to CustomerDataList element
            var id = sp.GetIdOfPart(userDefinedTagsPart1);
            if (sp.Slide.CommonSlideData == null)
            {
                sp.Slide.CommonSlideData = new CommonSlideData();
            }
            if (sp.Slide.CommonSlideData.CustomerDataList == null)
                sp.Slide.CommonSlideData.CustomerDataList = new CustomerDataList();
            CustomerDataTags tags = new CustomerDataTags();
            tags.Id = id;
            sp.Slide.CommonSlideData.CustomerDataList.AppendChild(tags);
            slideIndex++;
            sp.Slide.Save();
        }
        presPart.Presentation.Save();
    }
}
It seems that the XML of the slides is broken after this code runs, but I cannot figure out what the problem is.
Does anyone know what is wrong with the code above, or what causes the presentation to become corrupted after running it? Am I missing something here?
I'm using the DocumentFormat.OpenXml 2.5 package.

Replace MergeFields in a Word 2003 document and keep style

I've been trying to create a library to replace the MergeFields in a Word 2003 document. Everything works fine, except that I lose the style applied to the field when I replace it. Is there a way to keep it?
This is the code I'm using to replace the fields:
private void FillFields2003(string template, Dictionary<string, string> values)
{
    object missing = Missing.Value;
    var application = new ApplicationClass();
    var document = new Microsoft.Office.Interop.Word.Document();
    try
    {
        // Open the file
        foreach (Field mergeField in document.Fields)
        {
            if (mergeField.Type == WdFieldType.wdFieldMergeField)
            {
                string fieldText = mergeField.Code.Text;
                string fieldName = Extensions.GetFieldName(fieldText);
                if (values.ContainsKey(fieldName))
                {
                    mergeField.Select();
                    application.Selection.TypeText(values[fieldName]);
                }
            }
        }
        document.Save();
    }
    finally
    {
        // Release resources
    }
}
I tried using the CopyFormat and PasteFormat methods on the selection, and also get_style and set_style, but to no avail.
Instead of using TypeText over the top of your selection, use the Result property of the Field (Result is a Range, so set its Text):
if (values.ContainsKey(fieldName))
{
    mergeField.Result.Text = values[fieldName];
}
This will ensure any formatting in the field is retained.
