How to "walk" through Word document making changes to the content?

How to "walk" through Word document making changes to the content? - c#

I want to replace each character in file with another one.
Now I'm implementing it by using Find.Execute() method, but in this case it spends time for searching and then replaces it, then search for another character from the beginning of file again, so if I want to replace all the alphabetic letters it will go through the whole document 26 x2 (lower case and upper case) =48 times, but I want it to replace by 1 lookup, so like: It get the first character it is "a" replace with " a' ", if the next char is "c" replace with "s" etc, make it by one look up, so it goes through the whole document only one time.
I know I can implement it just by writing my own code, but I'm wondering may be there is some built-in class that can ease my life :)

What about:
using Word = Microsoft.Office.Interop.Word;
//...
Word.Application app = new Word.Application();
Word.Document myDoc = app.Documents.Add(pathToMyDoc);
for(int n = 0; n < myDoc.Characters.Count; ++n)
{
myDoc.Characters[n].Text = LookupReplacement(myDoc.Characters[n].Text);
}
Completely untested but might help you. Link I looked at:
http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.documentclass(v=office.11).aspx

Try this for reference , Hope this helps:
http://weblogs.asp.net/guystarbuck/archive/2008/05/13/automated-search-and-replace-in-multiple-word-2007-documents-with-c.aspx

Related

c# Interop: How to restart Selection.Find to beginning of doc?

Simple question that I'm not finding an answer to. My code below is in a loop and finds the first text matching "{{foo}}" in a Word doc. I then want to reset the Find so that it begins its next search at the beginning of the doc again. Currently, it picks up where after the "foo".
Selection sel = application.Selection;
sel.Find.ClearFormatting();
sel.Find.MatchWildcards = true;
sel.Find.Text = #"\{\{?#\}\}";
sel.Find.Forward = true;
sel.Find.Execute();
How do I reset the starting location of Find?

It's always "better" to use Range rather than Selection in Word, whenever possible. You can have only one selection, but code can work with multiple ranges. In addition, the screen is quieter and execution tends to be faster. There are situations where Selection is necessary, but this is not one of them.
To get the Range of the entire document
Word.Range rngDoc = document.Content;
To "find" using the range:
rngDoc.Find.ClearFormatting();
rngDoc.Find.MatchWildcards = true;
rngDoc.Find.Text = #"\{\{?#\}\}";
rngDoc.Find.Forward = true;
rngDoc.Find.Wrap = Word.WdFindWrap.wdFindStop //ensure Word won't entire an infinite loop
rngDoc.Find.Execute();
When "find" is successful, the Range (or Selection) contains only what was found. To "reset" to start again from the beginning of the document (including the whole document):
rngDoc = document.Content;
And (what people ask more frequently) to continue searching from just beyond the "found" term to the end of the document:
object oCollapseEnd = Word.WdCollapseDirection.wdCollapseEnd;
rngDoc.Collapse(ref oCollapseEnd); //go just beyond what was found
rngDoc.End = document.Content.End;

In VBA we'd use:
Selection.HomeKey Word.WdUnits.wdStory
So, in your C# code would that convert to?
Sel.HomeKey(Word.wdUnits.wdStory);

.Find.Execute always resets the range to the found range. Accordingly, you need to either re-set it to the entire document (as per Olivier's comment) or use the C# equivalent of .Wrap = wdFindContinue (VBA) to instruct Word to continue searching from the top after getting to the end of the document.

Optimize string.Replace method

I have a list of 200+ words that are not allowed on a website. The string.Replace method below takes ~80ms. If I increase s < 1000 by a factor of 10.00 to s < 10,000 this delay goes to ~834ms, a 10.43 increase. I am woried about the scalability of this function, especially if the list increases in size. I was told strings are immutable and text.Replace() is creating 200 new strings in memory. Is there something similar to a Stringbuilder for this?
List<string> FilteredWords = new List<string>();
FilteredWords.Add("RED");
FilteredWords.Add("GREEN");
FilteredWords.Add("BLACK");
for (int i = 1; i < 200; i++)
{ FilteredWords.Add("STRING " + i.ToString()); }
string text = "";
//simulate a large dynamically generated html page
for (int s = 1; s < 1000; s++)
{ text += #"Lorem ipsum dolor sit amet, minim BLACK cetero cu nam.
No vix platonem sententiae, pro wisi congue graecis id, GREEN assum interesset in vix.
Eum tamquam RED pertinacia ex."; }
// This is the function I seek to optimize
foreach (string s in FilteredWords)
{ text = text.Replace(s, "[REMOVED]"); }

Use StringBuilder.Replace and try to do it as a batch operation. That is to say you should try to only create the StringBuilder once as it has some overhead. It won't necessarily be a lot faster but it will be much more memory efficient.
You should also probably only do this sanitation once instead of every time data is requested. If you're reading the data from the database you should consider sanitizing it once when the data is inserted into the database, so there is less work to do when reading and displaying it to the page.

If you expect most of the text to be relatively nice than scanning whole text first for matching words could be better approach. You can also normalize words text at the same time to catch some standard replacements.
I.e. scan string by matching individual words (i.e. Regular expression like "\w+"), than for each detected word lookup (potentially normalized value) in dictionary of words to replace.
You can either simply scan first to get list of "words to replace" and than just replace individual word later, or scan and build resulting string at the same time (using StringBuilder or StreamWriter, obviously not String.Concat / +).
Note: Unicode provides large number of good characters to use, so don't expect your effort to be very successful. I.e. try to find "cool" in following text: "you are сооl".
Sample code (relying on Regex.Replace for tokenization and building the string and HashSet for matches).
var toFind = FilteredWords.Aggregate(
new HashSet<string>(), (c, i) => { c.Add(i); return c;});
text = new Regex(#"\w+")
.Replace(text, m => toFind.Contains(m.Value) ? "[REMOVED]" : m.Value));

There may be a better way, but this is how I would go about solving the problem.
You will need to create a tree structure that contains your dictionary of words to be replaced. The class may be something like:
public class Node
{
public Dictionary<char, Node> Children;
public bool IsWord;
}
Using a dictionary for the Children may not be the best choice, but it provides the easiest example here. Also, you will need a constructor to initialize the Children field. The IsWord field is used to deal with the possibility that a redacted "word" may be the prefix of another redacted "word". For example, if you want to remove both "red" and "redress".
You will build the tree from each character in each of the replacement words. For example:
public void AddWord ( string word )
{
// NOTE: this assumes word is non-null and contains at least one character...
Node currentNode = Root;
for (int iIndex = 0; iIndex < word.Length; iIndex++)
{
if (currentNode.Children.ContainsKey(word[iIndex])))
{
currentNode = currentNode.Children[word[iIndex];
continue;
}
Node newNode = new Node();
currentNode.Children.Add(word[iIndex], newNode);
currentNode = newNode;
}
// finished, mark the last node as being a complete word..
currentNode.IsWord = true;
}
You'll need to deal with case sensitivity somewhere in there. Also, you only need to build the tree once, afterwards you can use it from any number of threads without worrying about locking because you will be only reading from it. (Basically, I'm saying: store it in a static somewhere.)
Now, when you are ready to remove words from your string you will need to do the following:
Create a StringBuilder instance to store the result
Parse through your source string, looking for the start and stop of a "word". How you define "word" will matter. For simplicity I would suggest starting with Char.IsWhitespace as defining word separators.
Once you have determined that a range of character is a "word", starting from the root of the tree, locate the child node associated with the first character in "word".
If you do not find a child node, the entire word is added to the StringBuilder
If you find a child node, you continue with the next character matching against Children of the current node, until you either run out of characters or out of nodes.
If you reach the end of the "word", check the last node's IsWord field. If true the word is excluded, do not add it to the StringBuilder. If IsWord is false, the word is not replaced and you add it to the StringBuilder
Repeat until you have exhausted the input string.
You will also need to add word separators to the StringBuilder, hopefully that will be obvious as you parse the input string. If you are careful to only use the start and stop indices within the input string, you should be able to parse the entire string without creating any garbage strings.
When all of this is done, use StringBuilder.ToString() to get your final result.
You may also need to consider Unicode surrogate codepoints, but you can probably get away without worrying about it.
Beware, I typed this code here directly, so syntax errors, typos and other accidental misdirections are probably included.

The real regular expression solution would be:
var filteredWord = new Regex(#"\b(?:" + string.Join("|", FilteredWords.Select(Regex.Escape)) + #")\b", RegexOptions.Compiled);
text = filteredWord.Replace(text, "[REMOVED]");
I don’t know whether this is faster (but note that it also only replaces whole words).

Parsing a CSV file

We have an integration with another system that relies on passing CSV files back and forth (really old school).
The structure is generally:
ID, Name, PhoneNumber, comments, fathersname
1, tom, 555-1234, just some random text, bill
2, jill smith, 555-4234, other random text, richard
Every so often we see this:
3, jacked up, 999-1231, here
be dragons
amongst us, ted
The primary problem I care about is detecting that a line breaker (\n) occurs in the middle of the record when that is the record terminator.
Is there anyway I can preprocess this to reliably fix it?
Note that we have zero control over what the other system emits.

So you should be able to do something more or less like this:
for (int i = 0; i < lines.Count; i++)
{
var fields = lines[i].Split(',').ToList();
while (fields.Count < numFields)//here be dragons amonst us
{
i++;//include next line in this line
//check to make sure we haven't run out of lines.
//combine end of previous field with start of the next one,
//and add the line break back in.
var innerFields = lines[i].Split(',');
fields[fields.Count - 1] += "\n" + innerFields[0];
fields.AddRange(innerFields.Skip(1));
}
//we now know we have a "real" full line
processFields(fields);
}
(For simplicity I assumed all lines were read in at the start; I assume you could alter it to lazily fetch each line easily enough.)

Let me start and say that the CSV file in your example is invalid. If a line break occurs inside a string, it should be wrapped with double quote characters.
Now for the answer - In order to parse this invalid csv format you must do several assumptions. In this case I made 2 assumptions: 1) The ID column must be numeric 2) The comment field can not contain digits.
Based on these assumptions you can check the first character after the line break character. If it is digit, you assume its a new record. If not you should treat it as a continue value of the comment field.
I don't know if the second assumption is valid, if not, you can enhance the logic so it will cover the business rules of the system.
Good Luck!

Firstly I would recommend using a tool to manage reading and writing your csv files, I use the FileHelpers library which is great.
You can essentially type your records and it will do all the validation and such for you. Worth the effort.
To your question perhaps you can do some preprocessing on the file and use Regex to replace any line breaks with a space?
I do something similar (not with files but) try
line.Replace(Environment.NewLine, " ");
With FileHelpers you could write a custom converter to do this during processing, or hook into the BeforeRead event.

Replacing part of text in richtextbox

I need to compare a value in a string to what user typed in a richtextbox.
For example: if a richtextbox holds string rtbText = "aaaka" and I compare this to another variable string comparable = "ka"(I want it to compare backwards). I want the last 2 letters from rtbText (comparable has only 2 letters) to be replaced with something that was predetermined(doesn't really matter what).
So rtbText should look like this:
rtbText = "aaa(something)"
This doesn't really have to be compared it can just count letters in comparable and based on that it can remove 2 letters from rtbText and replace them with something else.
UPDATE:
Here is what I have:
int coLen = comparable.Length;
comparable = null;
TextPointer caretBack = rtb.CaretPosition.GetPositionAtOffset(coLen, LogicalDirection.Backward);
TextRange rtbText = new TextRange(rtb.CaretPosition, caretBack);
string text = rtbText.Text;
rtbText returns an empty string or I get an error for everything longer than 3 characters. What am I doing wrong?
Let me elaborate it a little bit further. I have a listbox that holds replacements for values that user types in rtb. The values(replacements) are coming from there, meaning that I don't really need to go through the whole text to check values. I just need to check the values right before caret. I am comparing these values to what I have stored in another variable (comparable).
Please let me know if you don't understand something.
I did my best to explain what needs to be done.
Thank you

You could use Regex.Replace.
// this replaces all occurances of "ka" with "Replacement"
Regex replace = new Regex("ka");
string result = replace.Replace("aaaka","Replacemenet");

gumenimeda, I had similar problems few weeks ago. I found my self doing the following (I asume you will have more than one occurance in the RichTextBox that you will need to change), note that I did it for Windows Forms where I have access directly to the Rtf text of the control, not quite sure if it will work well in your scenario:
I find all the occurancies of the string (using IndexOf for example) and store them in a List for example.
Sort the list in descending order (max index goes first, the one before him second, etc)
Start replacing the occurancies directly in the RichTextBox, by removing the characters I don't need and appending the characters I need.
The sorting in step 2 is necessary as we always want to start from the last occurance going up to the first. Starting from the first occurance or any other and going down will have an unpleasant surprise - if the length of the chunk you want to remove and the length of the chunk you want to append are different in length, the string will be modified and all other occurancies will be invalid (for example if the second occurance was in at position 12 and your new string is 2 characters longer than the original, it will become 14th). This is not an issue if we go from the last to the first occurance as the change in string will not affect the next occurance in the list).
Ofcourse I can not be sure that this is the fastest way that can be used to achieve the desired result. It's just what I came up with and what worked for me.
Good luck!

parsing words in a continuous string

If a have a string with words and no spaces, how should I parse those words given that I have a dictionary/list that contains those words?
For example, if my string is "thisisastringwithwords" how could I use a dictionary to create an output "this is a string with words"?
I hear that using the data structure Tries could help but maybe if someone could help with the pseudo code? For example, I was thinking that maybe you could index the dictionary into a trie structure, then follow each char down the trie; problem is, I'm unfamiliar with how to do this in (pseudo)code.

I'm assuming that you want an efficient solution, not the obvious one where you repeatedly check if your text starts with a dictionary word.
If the dictionary is small enough, I think you could try and modify the standard KMP algorithm. Basically, build a finite-state machine on your dictionary which consumes the text character by character and yields the constructed words.
EDIT: It appeared that I was reinventing tries.

I already did something similar. You cannot use a simple dictionary. The result will be messy. It depends if you only have to do this once or as whole program.
My solution was to:
Connect to a database with working
words from a dictionary list (for
example online dictionary)
Filter long and short words in dictionary and check if you want to trim stuff (for example don't use words with only one character like 'I')
Start with short words and compare your bigString with the database dictionary.
Now you need to create a "table of possibility". Because a lot of words can fit into 100% but are wrong. As longer the word as more sure you are, that this word is the right one.
It is cpu intensive but it can work precise in the result.
So lets say, you are using a small dictionary of 10,000 words and 3,000 of them are with a length of 8 characters, you need to compare your bigString at start with all 3,000 words and only if result was found, it is allowed to proceed to the next word. If you have 200 characters in your bigString you need about (2000chars / 8 average chars) = 250 full loops minimum with comparation.
For me, I also did a small verification of misspelled words into the comparation.
example of procedure (don't copy paste)
Dim bigString As String = "helloworld.thisisastackoverflowtest!"
Dim dictionary As New List(Of String) 'contains the original words. lets make it case insentitive
dictionary.Add("Hello")
dictionary.Add("World")
dictionary.Add("this")
dictionary.Add("is")
dictionary.Add("a")
dictionary.Add("stack")
dictionary.Add("over")
dictionary.Add("flow")
dictionary.Add("stackoverflow")
dictionary.Add("test")
dictionary.Add("!")
For Each word As String In dictionary
If word.Length < 1 Then dictionary.Remove(word) 'remove short words (will not work with for each in real)
word = word.ToLower 'make it case insentitive
Next
Dim ResultComparer As New Dictionary(Of String, Double) 'String is the dictionary word. Double is a value as percent for a own function to weight result
Dim i As Integer = 0 'start at the beginning
Dim Found As Boolean = False
Do
For Each word In dictionary
If bigString.IndexOf(word, i) > 0 Then
ResultComparer.Add(word, MyWeightOfWord) 'add the word if found, long words are better and will increase the weight value
Found = True
End If
Next
If Found = True Then
i += ResultComparer(BestWordWithBestWeight).Length
Else
i += 1
End If
Loop

I told you that it seems like an impossible task. But you can have a look at this related SO question - it may help you.

If you are sure you have all the words of the phrase in the dictionary, you can use that algo:
String phrase = "thisisastringwithwords";
String fullPhrase = "";
Set<String> myDictionary;
do {
foreach(item in myDictionary){
if(phrase.startsWith(item){
fullPhrase += item + " ";
phrase.remove(item);
break;
}
}
} while(phrase.length != 0);
There are so many complications, like, some items starting equally, so the code will be changed to use some tree search, BST or so.

This is the exact problem one has when trying to programmatically parse languages like Chinese where there are no spaces between words. One method that works with those languages is to start by splitting text on punctuation. This gives you phrases. Next you iterate over the phrases and try to break them into words starting with the length of the longest word in your dictionary. Let's say that length is 13 characters. Take the first 13 characters from the phrase and see if it is in your dictionary. If so, take it as a correct word for now, move forward in the phrase and repeat. Otherwise, shorten your substring to 12 characters, then 11 characters, etc.
This works extremely well, but not perfectly because we've accidentally put in a bias towards words that come first. One way to remove this bias and double check your result is to repeat the process starting at the end of the phrase. If you get the same word breaks you can probably call it good. If not, you have an overlapping word segment. For example, when you parse your sample phrase starting at the end you might get (backwards for emphasis)
words with string a Isis th
At first, the word Isis (Egyptian Goddess) appears to be the correct word. When you find that "th" is not in your dictionary, however, you know there is a word segmentation problem nearby. Resolve this by going with the forward segmentation result "this is" for the non-aligned sequence "thisis" since both words are in the dictionary.
A less common variant of this problem is when adjacent words share a sequence which could go either way. If you had a sequence like "archand" (to make something up), should it be "arc hand" or "arch and"? The way to determine is to apply a grammar checker to the results. This should be done to the whole text anyway.

Ok, I will make a hand wavy attempt at this. The perfect(ish) data structure for your problem is (as you've said a trie) made up of the words in the dictionary. A trie is best visualised as a DFA, a nice state machine where you go from one state to the next on every new character. This is really easy to do in code, a Java(ish) style class for this would be :
Class State
{
String matchedWord;
Map<char,State> mapChildren;
}
From hereon, building the trie is easy. Its like having a rooted tree structure with each node having multiple children. Each child is visited on one character transition. The use of a HashMap kind of structure trims down time to look up character to next State mappings. Alternately if all you have are 26 characters for the alphabet, a fixed size array of 26 would do the trick as well.
Now, assuming all of that made sense, you have a trie, your problem still isn't fully solved. This is where you start doing things like regular expressions engines do, walk down the trie, keep track of states which match to a whole word in the dictionary (thats what I had the matchedWord for in the State structure), use some backtracking logic to jump to a previous match state if the current trail hits a dead end. I know its general but given the trie structure, the rest is fairly straightforward.

If you have dictionary of words and need a quick implmentation this can be solved efficiently with dynamic programming in O(n^2) time, assuming the dictionary lookups are O(1). Below is some C# code, the substring extraction could and dictionary lookup could be improved.
public static String[] StringToWords(String str, HashSet<string> words)
{
//Index of char - length of last valid word
int[] bps = new int[str.Length + 1];
for (int i = 0; i < bps.Length; i++)
bps[i] = -1;
for (int i = 0; i < str.Length; i++)
{
for (int j = i + 1; j <= str.Length ; j++)
{
if (bps[j] == -1)
{
//Destination cell doesn't have valid backpointer yet
//Try with the current substring
String s = str.Substring(i, j - i);
if (words.Contains(s))
bps[j] = i;
}
}
}
//Backtrack to recovery sequence and then reverse
List<String> seg = new List<string>();
for (int bp = str.Length; bps[bp] != -1 ;bp = bps[bp])
seg.Add(str.Substring(bps[bp], bp - bps[bp]));
seg.Reverse();
return seg.ToArray();
}
Building a hastset with the word list from /usr/share/dict/words and testing with
foreach (var s in StringSplitter.StringToWords("thisisastringwithwords", dict))
Console.WriteLine(s);
I get the output "t hi sis a string with words". Because as others have pointed out this algorithm will return a valid segmentation (if one exists), however this may not be the segmentation you expect. The presence of short words is reducing the segmentation quality, you might be able to add heuristic to favour longer words if two valid sub-segmentation enter an element.
There are more sophisticated methods that finite state machines and language models that can generate multiple segmentations and apply probabilistic ranking.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.