YASR - Yet another search and replace question - c#

Environment: asp.net c# openxml
Ok, so I've been reading a ton of snippets and trying to reinvent the wheel, but I'm hoping that someone can help me get to my destination faster. I have multiple documents that I need to merge together... check... I'm able to do that with the Open XML SDK. Birds are singing, sun is shining so far. Now that I have the document the way I want it, I need to search and replace text and/or content controls.
I've tried using my own placeholder text, {replace this}, but when I look at the XML (rename the docx to zip and view the file), the { is nowhere near the text. So I either need to know how to protect that within the document so they don't diverge, or I need to find another way to search and replace.
I'm able to search/replace if it is an XML file, but then I'm back to not being able to combine the documents easily.
Code below... and as I mentioned... document merge works fine... just need to replace stuff.
* Update * changed my replace call to go after the tag instead of regex. I have the right info now, but the .Replace call doesn't seem to want to work. Last four lines are for validation that I was seeing the right tag contents. I simply want to replace those contents now.
protected void exeProcessTheDoc(object sender, EventArgs e)
{
string doc1 = Server.MapPath("~/Templates/doc1.docx");
string doc2 = Server.MapPath("~/Templates/doc2.docx");
string final_doc = Server.MapPath("~/Templates/extFinal.docx");
File.Delete(final_doc);
File.Copy(doc1, final_doc);
using (WordprocessingDocument myDoc = WordprocessingDocument.Open(final_doc, true))
{
string altChunkId = "AltChunkId2";
MainDocumentPart mainPart = myDoc.MainDocumentPart;
AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
AlternativeFormatImportPartType.WordprocessingML, altChunkId);
using (FileStream fileStream = File.Open(doc2, FileMode.Open))
chunk.FeedData(fileStream);
AltChunk altChunk = new AltChunk();
altChunk.Id = altChunkId;
mainPart.Document.Body.InsertAfter(altChunk, mainPart.Document.Body.Elements<Paragraph>().Last());
mainPart.Document.Save();
}
exeSearchReplace(final_doc);
}
public static void GetPropertyFromDocument(string document, string outdoc)
{
XmlDocument xmlProperties = new XmlDocument();
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, false))
{
ExtendedFilePropertiesPart appPart = wordDoc.ExtendedFilePropertiesPart;
xmlProperties.Load(appPart.GetStream());
}
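XmlNodeList chars = xmlProperties.GetElementsByTagName("Company");
// Strings are immutable: Replace returns a new string, so assign the result back.
// This only changes the in-memory XmlDocument; the .docx itself was opened read-only.
chars.Item(0).InnerText = chars.Item(0).InnerText.Replace("{ClientName}", "Penn Inc.");
// Write the replaced value out for validation.
using (StreamWriter sw = File.CreateText(outdoc))
{
sw.WriteLine(chars.Item(0).InnerText);
}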
}
}
}

If I'm reading this right, you have something like "{replace me}" in a .docx, and when you loop through the XML you find things like <t>{replace</t><t> me</t><t>}</t> or some such havoc. With XML like that, a plain string replace over the individual text elements will never find "{replace me}".
If that's the case, then it's very likely related to the fact that the placeholder is considered a proofing error, i.e. it's misspelled as far as Word is concerned. The cause is that you've opened the document in Word with proofing turned on. As such, the text is marked as "isDirty" and split up into different runs.
There are two ways to fix this:
Client-side. In Word, make sure all proofing errors are either corrected or ignored.
Format-side. Use the MarkupSimplifier tool that is part of the Open XML Package Editor Power Tool for Visual Studio 2010 to fix this outside of the client. Eric White has a great write-up on it (and timely for you, it's just a few days old): Getting Started with Open XML PowerTools Markup Simplifier
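Before picking either fix, it can help to see how Word actually split the placeholder. The following is only a diagnostic sketch, assuming the Open XML SDK (the DocumentFormat.OpenXml package) is referenced; the default file name is borrowed from the question and the class name is made up for illustration. It prints each paragraph with brackets around every run and counts the proofing markers in it.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the SDK: dumps run boundaries per paragraph.
class RunDump
{
    static void Main(string[] args)
    {
        // Assumption: path comes from the command line, else the question's file name.
        string path = args.Length > 0 ? args[0] : "extFinal.docx";

        // Open read-only; nothing is modified.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;
            foreach (Paragraph p in body.Descendants<Paragraph>())
            {
                // One bracket pair per run, so a split placeholder shows up
                // as something like [{replace][ me][}].
                string pieces = string.Join("", p.Descendants<Run>().Select(r => "[" + r.InnerText + "]"));

                // w:proofErr elements mark where Word flagged spelling or grammar.
                int proofMarks = p.Descendants<ProofError>().Count();

                Console.WriteLine(pieces + "  (proofErr elements: " + proofMarks + ")");
            }
        }
    }
}
```

If the placeholder shows up in a single bracket pair, a simple replace on that run's text should be enough; if it is spread over several, the run-merging approaches in this thread apply.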

If you want to search and replace text in a WordprocessingML document, there is a fairly easy algorithm that you can use:
Break all runs into runs of a single character. This includes runs that have special characters such as a line break, carriage return, or hard tab.
It is then pretty easy to find a set of runs that match the characters in your search string.
Once you have identified a set of runs that match, then you can replace that set of runs with a newly created run (which has the run properties of the run containing the first character that matched the search string).
After replacing the single-character runs with a newly created run, you can then consolidate adjacent runs with identical formatting.
I've written a blog post and recorded a screen-cast that walks through this algorithm.
Blog post: http://openxmldeveloper.org/archive/2011/05/12/148357.aspx
Screen cast: http://www.youtube.com/watch?v=w128hJUu3GM
-Eric
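The four steps above can be sketched without touching the SDK at all. This is a simplified model, not the code from the blog post: a run is reduced to its text plus a formatting label standing in for the run properties, and the names RunReplaceSketch and TextRun are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RunReplaceSketch
{
    // Simplified stand-in for a w:r element: text plus a label for its formatting.
    public sealed class TextRun
    {
        public string Text { get; }
        public string Format { get; }
        public TextRun(string text, string format) { Text = text; Format = format; }
    }

    public static List<TextRun> Replace(IEnumerable<TextRun> runs, string search, string replacement)
    {
        if (string.IsNullOrEmpty(search)) return runs.ToList();

        // Step 1: break all runs into runs of a single character.
        List<TextRun> chars = runs
            .SelectMany(r => r.Text.Select(c => new TextRun(c.ToString(), r.Format)))
            .ToList();

        // Steps 2 and 3: find each sequence of single-character runs that spells
        // the search string and replace it with one new run that takes the
        // formatting of the first matched character.
        var replaced = new List<TextRun>();
        int i = 0;
        while (i < chars.Count)
        {
            bool match = i + search.Length <= chars.Count
                && Enumerable.Range(0, search.Length).All(k => chars[i + k].Text[0] == search[k]);
            if (match)
            {
                replaced.Add(new TextRun(replacement, chars[i].Format));
                i += search.Length;
            }
            else
            {
                replaced.Add(chars[i]);
                i++;
            }
        }

        // Step 4: consolidate adjacent runs with identical formatting.
        var merged = new List<TextRun>();
        foreach (TextRun r in replaced)
        {
            int last = merged.Count - 1;
            if (last >= 0 && merged[last].Format == r.Format)
                merged[last] = new TextRun(merged[last].Text + r.Text, r.Format);
            else
                merged.Add(r);
        }
        return merged;
    }

    public static void Main()
    {
        // The placeholder is split over three runs, as in the question.
        var runs = new List<TextRun>
        {
            new TextRun("Dear {Client", "plain"),
            new TextRun("Name", "bold"),
            new TextRun("}, welcome.", "plain"),
        };
        foreach (TextRun r in Replace(runs, "{ClientName}", "Penn Inc."))
            Console.WriteLine("[" + r.Format + "] " + r.Text);
        // prints: [plain] Dear Penn Inc., welcome.
    }
}
```

In a real document the same steps would run per paragraph on the actual run elements, and special characters such as tabs and line breaks need their own single-character runs, as the first step notes.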

Related

Descendants<T> gets zero elements in Word doc

I am having trouble updating a Hyperlink in a Word doc (Q How to update the body and a hyperlink in a Word doc ) and am zooming in on the Descendants<T>() call not working. Here is my code:
using DocumentFormat.OpenXml.Packaging; //from NuGet ClosedXML
using DocumentFormat.OpenXml.Wordprocessing; //from NuGet ClosedXML
WordprocessingDocument doc = WordprocessingDocument.Open(...filename..., true);
MainDocumentPart mainPart = doc.MainDocumentPart;
IEnumerable<Hyperlink> hLinks = mainPart.Document.Body.Descendants<Hyperlink>();
The doc is opened OK because mainPart gets a value. But hLinks has no elements. If I open the Word doc in Word, a hyperlink is present and working.
In the Immediate Window I see the following values:
mainPart.Document.Body
-->
{DocumentFormat.OpenXml.Wordprocessing.Body}
ChildElements: {DocumentFormat.OpenXml.OpenXmlChildElements}
ExtendedAttributes: {DocumentFormat.OpenXml.EmptyEnumerable<DocumentFormat.OpenXml.OpenXmlAttribute>}
FirstChild: {DocumentFormat.OpenXml.OpenXmlUnknownElement}
HasAttributes: false
HasChildren: true
InnerText: "
lots of data, e.g:
...<w:t>100</w:t>...
mainPart.Document.Body.Descendants<Text>().First()
-->
Exception: "Sequence contains no elements"
If I cannot even find the text parts, how should I ever find and replace the hyperlink?
If you are sure there are elements in your file that you are searching with linq, and nothing is returning or you are getting exceptions, that typically points to a namespace problem.
If you post your entire file, I can better help you, but check to see if you can alias your namespace like so:
using W = DocumentFormat.OpenXml.Wordprocessing;
and then in your Descendants call you do something like this:
var hLinks = mainPart.Document.Body.Descendants<W.Hyperlink>();
This answer demonstrates another namespace trick to try also.
Something seems to be wrong with my Word doc; it was generated with a tool. Testing with another Word doc, created with Word, gives better results. I am working on it ...
With a regular Word doc, looking at
doc.MainDocumentPart.Document.Body.InnerXml
the value starts with:
<w:p w:rsidR=\"00455325\" w:rsidRDefault=\"00341915\"
xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">
<w:r>
<w:t>Hello World!
but with the word doc I am testing with, which comes from a tool I myself made:
<w:body xmlns:w=\"http://schemas.openxmlforma...
This explains a lot.
I will have to fix my tool :-)
Update:
The fix was that this did not give the correct part of data to insert in the Word Doc:
string strDocumentXml = newWordContent.DocumentElement.InnerXml;
but instead this is the correct data:
string strDocumentXml = newWordContent.DocumentElement.FirstChild.OuterXml;
Inspection with the debugger of:
doc.MainDocumentPart.Document.Body.InnerXml
as mentioned above, confirmed it. The Descendants call now returns the expected data, and updating the hyperlink works.
Side note:
I clearly fixed a bug in my app, but, apart from updating the hyperlink, the app worked perfectly OK before, with that bug :-)
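The symptom in this thread (FirstChild is an OpenXmlUnknownElement and Descendants<Text>() is empty even though the XML contains w:t) can be checked for directly. A small diagnostic sketch, assuming the Open XML SDK is referenced; the class name and default file name are hypothetical:

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper: reports elements the SDK could not map to typed classes.
class UnknownElementCheck
{
    static void Main(string[] args)
    {
        // Assumption: path comes from the command line, else a placeholder name.
        string path = args.Length > 0 ? args[0] : "generated.docx";

        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;

            // Elements that are not valid at their position (for example a w:body
            // nested inside w:body) are loaded as OpenXmlUnknownElement.
            var unknown = body.Descendants<OpenXmlUnknownElement>().ToList();
            Console.WriteLine("Unknown elements: " + unknown.Count);

            // Show the first few, with the parent they were found under.
            foreach (OpenXmlUnknownElement e in unknown.Take(5))
                Console.WriteLine("  <" + e.Prefix + ":" + e.LocalName + "> inside <" + e.Parent.LocalName + ">");

            // Typed queries only see elements the SDK recognised.
            Console.WriteLine("Typed Text elements: " + body.Descendants<Text>().Count());
        }
    }
}
```

A non-zero unknown count together with zero typed Text elements points at malformed structure in the generated XML rather than at a namespace alias problem in the calling code.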

Set BaseUrl of an existing Pdf Document

We're having trouble setting a BaseUrl using iTextSharp. We used Adobe's implementation for this in the past, but ran into severe performance issues, so we switched to iTextSharp, which is approximately 10 times faster.
Adobe enabled us to set a base URL for each document. We really need this in order to deploy our documents on different servers, but we can't seem to find the right code to do it.
This code is what we used with Adobe:
public bool SetBaseUrl(object jso, string baseUrl)
{
try
{
object result = jso.GetType().InvokeMember("baseURL", BindingFlags.SetProperty, null, jso, new Object[] {baseUrl });
return result != null;
}
catch
{
return false;
}
}
A lot of solutions describe how you can insert links in new or empty documents. But our documents already exist and contain more than just text. We want to overlay specific words with a link that leads to one or more other documents. Therefore, it's really important to us that we can insert a link without accessing the text itself, maybe by laying a box on top of these words and setting its position (since we know where the words are located in the document).
We have tried different implementations using the setAction method, but it doesn't seem to work properly. In most cases we saw our box, but there was no link inside or associated with it (the cursor didn't change and nothing happened when I clicked inside the box).
Any help is appreciated.
I've made you a couple of examples.
First, let's take a look at BaseURL1. In your comment, you referred to JavaScript, so I created a document to which I added a snippet of document-level JavaScript:
writer.addJavaScript("this.baseURL = \"http://itextpdf.com/\";");
This works perfectly in Adobe Acrobat, but when you try this in Adobe Reader, you get the following error:
NotAllowedError: Security settings prevent access to this property or
method. Doc.baseURL:1:Document-Level:0000000000000000
This is consistent with the JavaScript reference for Acrobat where it is clearly indicated that special permissions are needed to change the base URL.
So instead of following your suggested path, I consulted ISO-32000-1 (which was what I asked you to do, but... I've beaten you in speed).
I discovered that you can add a URI dictionary to the catalog with a Base entry. So I wrote a second example, BaseURL2, where I add this dictionary to the root dictionary of the PDF:
PdfDictionary uri = new PdfDictionary(PdfName.URI);
uri.put(new PdfName("Base"), new PdfString("http://itextpdf.com/"));
writer.getExtraCatalog().put(PdfName.URI, uri);
Now the BaseURL works in both Acrobat and Reader.
Assuming that you want to add a BaseURL to existing documents, I wrote BaseURL3. In this example, we add the same dictionary to the root dictionary of an existing PDF:
PdfReader reader = new PdfReader(src);
PdfDictionary uri = new PdfDictionary(PdfName.URI);
uri.put(new PdfName("Base"), new PdfString("http://itextpdf.com/"));
reader.getCatalog().put(PdfName.URI, uri);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
stamper.close();
Using this code, you can change a link that points to "index.php" (base_url.pdf) into a link that points to "http://itextpdf.com/index.php" (base_url_3.pdf).
Now you can replace your Adobe license with a less expensive iTextSharp license ;-)
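The snippets above are Java (iText), while the question is about iTextSharp. A rough C# port of the BaseURL3 idea might look like the following. It is a sketch that assumes iTextSharp 5.x, where method names follow .NET capitalisation; the class and method names and the command-line arguments are made up for illustration.

```csharp
using System.IO;
using iTextSharp.text.pdf;

// Hypothetical wrapper around the BaseURL3 approach, assuming iTextSharp 5.x.
class BaseUrlStamper
{
    static void SetBaseUrl(string src, string dest, string baseUrl)
    {
        PdfReader reader = new PdfReader(src);
        try
        {
            // Build the URI dictionary with a Base entry (ISO-32000-1).
            PdfDictionary uri = new PdfDictionary(PdfName.URI);
            uri.Put(new PdfName("Base"), new PdfString(baseUrl));

            // Attach it to the catalog (root dictionary) of the existing PDF.
            reader.Catalog.Put(PdfName.URI, uri);

            // Stamping without further changes writes the modified catalog to dest.
            using (FileStream output = new FileStream(dest, FileMode.Create))
            {
                PdfStamper stamper = new PdfStamper(reader, output);
                stamper.Close();
            }
        }
        finally
        {
            reader.Close();
        }
    }

    static void Main(string[] args)
    {
        // Usage (assumed): BaseUrlStamper <source.pdf> <dest.pdf> <base url>
        SetBaseUrl(args[0], args[1], args[2]);
    }
}
```

As in the Java version, relative link targets such as "index.php" in the stamped file are then resolved by the viewer against the base URL.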

Inserting word content into a VSTO document level customization

I have a VSTO document level customization that performs specific functionality when opened from within our application. Basically, we open normal documents from inside of our application and I copy the content from the normal docx file into the VSTO document file which is stored inside of our database.
var app = new Microsoft.Office.Interop.Word.Application();
var docs = app.Documents;
var vstoDoc = docs.Open(vstoDocPath);
var doc = docs.Open(currentDocPath);
doc.Range().Copy();
vstoDoc.Range().PasteAndFormat(WdRecoveryType.wdFormatOriginalFormatting);
This mostly works, but the code above drops certain formatting from the document. The code below fixes the issues I've found so far, but I will most likely come across more, and I would have to address them one by one ...
for (int i = 0; i < doc.Sections.Count; i++)
{
var footerFont = doc.Sections[i + 1].Footers.GetEnumerator();
var headerFont = doc.Sections[i + 1].Headers.GetEnumerator();
var footNoteFont = doc.Footnotes.GetEnumerator();
foreach (HeaderFooter foot in vstoDoc.Sections[i + 1].Footers)
{
footerFont.MoveNext();
foot.Range.Font.Name = ((HeaderFooter)footerFont.Current).Range.Font.Name;
}
foreach (HeaderFooter head in vstoDoc.Sections[i + 1].Headers)
{
headerFont.MoveNext();
head.Range.Font.Name = ((HeaderFooter)headerFont.Current).Range.Font.Name;
}
foreach (Footnote footNote in vstoDoc.Footnotes)
{
footNoteFont.MoveNext();
footNote.Range.Font.Name = ((Footnote)footNoteFont.Current).Range.Font.Name;
}
}
I need a foolproof, safe way of copying the content of one docx file to another docx file while preserving formatting and eliminating the risk of corrupting the document. I've tried using reflection to copy the properties of one document to the other, but the code starts to look ugly and I always worry that some of the properties I'm setting may have undesirable side effects. I've also tried unzipping the docx files, editing the XML manually and rezipping afterwards. That hasn't worked too well either: I ended up corrupting a few of the documents in the process.
If anyone has dealt with a similar issue in the past, please could you point me in the right direction.
Thank you for your time
This code copies and keeps source formatting.
bookmark.Range.Copy();
Document newDocument = WordInstance.Documents.Add();
newDocument.Activate();
newDocument.Application.CommandBars.ExecuteMso("PasteSourceFormatting");
There is a more elegant way to manage it, based on ImportFragment:
Globals.ThisAddIn.Application.ActiveDocument.Range().ImportFragment(filePath);
or you can do the following
Globals.ThisAddIn.Application.Selection.Range.ImportFragment(filePath);
to import at the current selection, where filePath is the path to the document you are copying from.

How to find exact word from word document using Open XML in C#?

I need to find the exact word that I want to replace in a Word document using Open XML in C#.
The purpose is to replace the user's personal details with a special character so that they are not visible to the reader.
For example, the user has an address entered in his form, which is stored in the database.
He has also uploaded a Word document, which contains the following kind of text matching his address. My purpose is to mask the address with ### signs so that other users can't see it.
e.g.
"422, Plot no. 1000/A, The Moon Residency II, Shree Nagrik Co. Op. Society, Sardarnagar, Ahmedabad.
Looking for an opportunity that surpasses in making me a personality that influences the masses and that too effectively. Organizationally, I would strive to work at a single
place with no professional switches being made and would love to work in an environment that demands constant evolution with variable domains incorporated to deal
with."
I want to replace "Co" and "Op" with a "#" sign.
My output would be this:
"422, Plot no. 1000/A, The Moon Residency II, Shree Nagrik #. #. Society, Sardarnagar, Ahmedabad.
Looking for an opportunity that surpasses in making me a personality that influences the masses and that too effectively. Organizationally, I would strive to work at a single
place with no professional switches being made and would love to work in an environment that demands constant evolution with variable domains incorporated to deal
with. "
Now I have several questions:
1. How can I search for a whole word? Right now my code turns "opportunity" into "##portunity" because the word contains "Op", and likewise "constant" into "##nstant". I need to replace only if the whole word matches.
2. How can I match the whole line, or maybe the whole address? The address should be replaced as a whole; if that's not possible, it should replace 70-80% of it.
Currently my code to replace words in the Word file is as below.
MemoryStream m = new System.IO.MemoryStream();
//strResumeName contain my word file url
m = objBlob.GetResumeFile(strResumeName);
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(m, true))
{
body = wordDoc.MainDocumentPart.Document.Body;
colT = body.Descendants<DocumentFormat.OpenXml.Wordprocessing.Text>();
foreach (DocumentFormat.OpenXml.Wordprocessing.Text c in colT)
{
if (c.InnerText.Trim() != String.Empty)
{
sb.Append(c.InnerText.Trim() + " ");
}
}
string[] strParts = sb.ToString().Split(' ');
HyperLinkList = HyperLinksList(wordDoc);
redactionTags = GetReductionstrings(strParts);
}
using (Novacode.DocX document = Novacode.DocX.Load(m))
{
//objCandidateLogin.Address contain my address
if (!String.IsNullOrEmpty(objCandidateLogin.Address))
{
string[] strParts = objCandidateLogin.Address.Replace(",", " ").Split(' ');
for (int I = 0; I <= strParts.Length - 1; I++)
{
if (strParts[I].Trim().Length > 1)
{
document.ReplaceText(strParts[I].Trim(), "#############", false, RegexOptions.IgnoreCase);
}
}
}
}
You can use the TextReplacer class in PowerTools for Open XML to accomplish what you want. Then you can do something like this:
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;
using System.IO;
namespace SearchAndReplace
{
internal class Program
{
private static void Main(string[] args)
{
using (WordprocessingDocument doc = WordprocessingDocument.Open("Test01.docx", true))
TextReplacer.SearchAndReplace(wordDoc:doc, search:"the", replace:"this", matchCase:false);
}
}
}
To install the Nuget package for OpenXml Power Tools, run the following command in the Package Manager Console
PM > Install-Package OpenXmlPowerTools
You're mixing Open XML with Novacode; you should consider using just Open XML.
About the replacing text with "#". You will have to iterate through all paragraphs in the word document and check the Text elements within them to see if the text you're looking for exists and if it exists you can replace the text.
Nothing else to it. Hope this helps.
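// document is the Document of the open file, e.g. wordDoc.MainDocumentPart.Document
IEnumerable<Paragraph> paragraphs = document.Body.Descendants<Paragraph>();
foreach (Paragraph para in paragraphs)
{
// Each Text element holds one fragment of the paragraph's text
foreach (Text text in para.Descendants<Text>())
{
//Code to replace text.Text with "#"
}
}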
I've written this code out of memory, but if you proceed on these lines, you will find your solution.
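For the whole-word part of the question, one option is a regular expression with \b word boundaries instead of a plain substring replace. This is a sketch using only System.Text.RegularExpressions; the class and method names are made up, and the sample sentence is shortened from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: masks only whole-word, case-insensitive matches.
public static class WholeWordRedactor
{
    public static string Redact(string text, string[] words, string mask)
    {
        // Skip fragments of one character or less, as the question's code does,
        // and escape the rest so characters like '.' or '/' are matched literally.
        string[] usable = words
            .Select(w => w.Trim())
            .Where(w => w.Length > 1)
            .Select(w => Regex.Escape(w))
            .ToArray();
        if (usable.Length == 0) return text;

        // \b only matches at a word boundary, so "Op" no longer hits "opportunity".
        string pattern = @"\b(?:" + string.Join("|", usable) + @")\b";
        return Regex.Replace(text, pattern, mask, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string sample = "Shree Nagrik Co. Op. Society; an opportunity for constant growth";
        Console.WriteLine(Redact(sample, new[] { "Co", "Op" }, "#"));
        // prints: Shree Nagrik #. #. Society; an opportunity for constant growth
    }
}
```

Inside the paragraph loop above, this could be applied as text.Text = WholeWordRedactor.Redact(text.Text, parts, "#"). It will still miss a word that Word has split across two Text elements, and address parts that begin or end with punctuation interact differently with \b, so treat it as a starting point.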
There is an Open XML PowerTools class for searching and replacing text in an Open XML document.
You can get it from here:
http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2011/08/04/introducing-textreplacer-a-new-class-for-powertools-for-open-xml.aspx
Hope this helps.
