Read a string in a Word .docx with Microsoft.Office.Interop - C#

I can find a string in a Word .docx, but I want to read the next two values that follow it.
Example:
[string id_string, 3, 1000]
I know the prefix [string id_string, ...] and I can find it with this:
Microsoft.Office.Interop.Word.Range range = Document.Range();
bool found = range.Find.Execute(FindText: "[string id_string, ");
How can I read the next two values?
Thanks for your help!

Regex to the Rescue
This seems like an opportunity for regular expression matching.
Add Imports System.Text.RegularExpressions at the start of your code to enable use of the Regex class.
Try adding the following code:
Dim docText = range.Text
Const regularExpression As String = "\[string id_string,\s[^\]]+\]"
Dim regex = New Regex(regularExpression)
Dim match = regex.Match(docText)
Dim foundString = match.Value
Assumptions
I'm assuming the following. If my assumptions are incorrect, the answer above may not be quite what you're looking for.
You are using Visual Basic.
If "[string id_string, " is encountered, this absolutely ensures you're at the string you want and there will be a closing bracket to complete the matched set of strings. (This helps keep the regular expression simple, but depending on the content of the text, it could return unexpected results.)
You want the matched [] brackets and all three strings. (This keeps the regular expression simpler than using look ahead/behind to ignore the brackets after pattern-matching.)
You just want the entire list of strings returned in one string, as opposed to the range, line number, position, etc. (You should be able to use the string.Split function to pull out the individual strings if needed.)
You just want the first match you come across. (You can use regex.Matches to get all matches if necessary.)
Reference
For a detailed presentation of using Regex in Word, see the following site from 2008:
http://www.codeproject.com/Articles/26922/Unleashing-the-Full-Power-of-Regular-Expressions-i
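Since the question is tagged C#, here is a minimal C# sketch of the same idea, extended with two capture groups so the two values after the known prefix come back separately. The class name, method name, and group names are illustrative assumptions, and the sample text stands in for what Document.Range().Text would return; it also assumes the values themselves contain no commas or closing brackets.

```csharp
using System;
using System.Text.RegularExpressions;

static class BracketFieldReader
{
    // Matches "[string id_string, <first>, <second>]" and captures the two
    // values that follow the known prefix. Group names are illustrative.
    static readonly Regex Pattern = new Regex(
        @"\[string id_string,\s*(?<first>[^,\]]+),\s*(?<second>[^,\]]+)\]");

    // Returns the two trimmed values, or null when the text has no match.
    public static string[] ReadNextTwo(string docText)
    {
        Match match = Pattern.Match(docText);
        if (!match.Success)
        {
            return null;
        }
        return new[]
        {
            match.Groups["first"].Value.Trim(),
            match.Groups["second"].Value.Trim()
        };
    }

    static void Main()
    {
        // Stand-in for the document text (assumption: plain text from the Word range).
        string docText = "Intro text [string id_string, 3, 1000] more text";
        string[] values = ReadNextTwo(docText);
        Console.WriteLine(values == null ? "no match" : string.Join(" | ", values));
    }
}
```

Working on the plain text keeps the Word interop calls out of the parsing step, at the cost of losing the match position inside the document.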

Related

Finding and replacing multiple matches in a single line with regex

I have a query regarding making multiple replacements within a string using a regular expression.
The platform is C#, so .NET's System.Text.RegularExpressions implementation.
Let's say I have a string -- in this case, an XML fragment, but it could be any text at all, so no assumptions on the syntax:
<key val1="C:\SomeDir\SomePath\FOLDER1" val2="C:\SomeDir\SomePath\FOLDER2" />
I want to replace the last part of both of these paths -- let's say, change it to FOLDER3.
I currently have the expression (C:\\SomeDir\\SomePath)(\\\w*\\) which gives me two groups -- the first part of the path and the bit I want to replace.
I can use the replacement string ${1}\FOLDER3\ which properly replaces the part of the path I want to change.
However: this only works for the first match in the string. So, FOLDER1 will be replaced with FOLDER3 but FOLDER2 remains unchanged.
I thought I could apply the match/replace operation in a loop until the line no longer changed, but of course this doesn't work as the match regex always stops on the first match.
Any help greatly appreciated!
Use the Replace method of the Regex class. It replaces every match in the input, not just the first:
string s = "<key val1=\"C:\\SomeDir\\SomePath\\FOLDER1\" val2=\"C:\\SomeDir\\SomePath\\FOLDER2\" />";
Regex regex = new Regex(@"(C:\\SomeDir\\SomePath)(\\\w*)");
string result = regex.Replace(s, x => x.Groups[1] + @"\FOLDER3");
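The same replacement can also be written with a replacement string instead of a lambda, which is closer to the ${1} form used in the question. This is a small sketch; the class and method names are made up for illustration. In .NET replacement patterns only $ is special, so the backslash before FOLDER3 is taken literally.

```csharp
using System;
using System.Text.RegularExpressions;

static class FolderReplaceDemo
{
    // Replaces the last path segment after C:\SomeDir\SomePath in every match.
    public static string ReplaceLastFolder(string input)
    {
        return Regex.Replace(
            input,
            @"(C:\\SomeDir\\SomePath)(\\\w*)",
            @"${1}\FOLDER3");
    }

    static void Main()
    {
        string s = @"<key val1=""C:\SomeDir\SomePath\FOLDER1"" val2=""C:\SomeDir\SomePath\FOLDER2"" />";
        // Both FOLDER1 and FOLDER2 are replaced in a single call.
        Console.WriteLine(ReplaceLastFolder(s));
    }
}
```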

Regex for ignoring consecutive quotation marks in string

I have built a parser in Sprache and C# for files using a format I don't control. Using it I can correctly convert:
a = "my string";
into
my string
The parser (for the quoted text only) currently looks like this:
public static readonly Parser<string> QuotedText =
from open in Parse.Char('"').Token()
from content in Parse.CharExcept('"').Many().Text().Token()
from close in Parse.Char('"').Token()
select content;
However, the format I'm working with escapes quotation marks by doubling them, e.g.:
a = "a ""string"".";
When attempting to parse this nothing is returned. It should return:
a ""string"".
Additionally
a = "";
should be parsed into a string.Empty or similar.
I've tried regexes unsuccessfully based on answers like this doing things like "(?:[^;])*", or:
public static readonly Parser<string> QuotedText =
from content in Parse.Regex("""(?:[^;])*""").Token()
This doesn't work (i.e. no matches are returned in the above cases). I think my beginner's regex skills are getting in the way. Does anybody have any hints?
EDIT: I was testing it here - http://regex101.com/r/eJ9aH1
If I'm understanding you correctly, this is the kind of regex you're looking for:
"(?:""|[^"])*"
See the demo.
1. " matches an opening quote
2. (?:""|[^"])* matches two quotes or any chars that are not a quote (including newlines), repeating
3. " matches the closing quote.
But it always boils down to whether your input is balanced. If it isn't, you'll get false positives. And if you have a string such as "string"", which should be matched: "string"", "", or nothing? That's a tough decision, one that, fortunately, you don't have to make if you are sure of your input.
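To try the pattern outside Sprache, here is a minimal sketch that uses plain System.Text.RegularExpressions and returns the text between the outer quotes. The class and method names are hypothetical, and the doubled quotes are left in the result because that is the output the question asks for.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuotedTextDemo
{
    // The pattern "(?:""|[^"])*" written as a C# verbatim string,
    // where every quote character has to be doubled.
    static readonly Regex Quoted = new Regex(@"""(?:""""|[^""])*""");

    // Returns the content between the outer quotes, or null if nothing matches.
    public static string ExtractContent(string input)
    {
        Match match = Quoted.Match(input);
        if (!match.Success)
        {
            return null;
        }
        // Drop the opening and closing quote; inner doubled quotes are kept.
        return match.Value.Substring(1, match.Value.Length - 2);
    }

    static void Main()
    {
        Console.WriteLine(ExtractContent("a = \"a \"\"string\"\".\";"));
        Console.WriteLine(ExtractContent("a = \"\";") == string.Empty);
    }
}
```

If the unescaped text is wanted instead, a follow-up Replace of each doubled quote with a single one on the returned content would do it.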
You can likely adapt your desired output from this pattern:
"(.+".+")"|(".+?")|("")
example:
http://regex101.com/r/lO1vZ4
If you only want to ignore consecutive double quotes, try this:
("{2,})
Live demo
This regex "("+) might help you to match extra unwanted double quotes.
here is the DEMO

Regex replace all matching words that do not contain a certain string

How can I use regex to replace matching strings that do not include a specific string?
input string
Keepword mywordsecond mythirdword myfourthwordKeep
string to replace
word
exclude string
Keep
Desired output
Keepword mysecond mythird myfourthKeep
Will "word" ever occur more than once within a single word? If so, do you want to replace all of the occurrences? If not, this should sort you out:
Regex r = new Regex(@"\b((?:(?!Keep|word)\w)*)word((?:(?!Keep)\w)*)\b");
s1 = r.Replace(s0, "$1$2");
to explain:
First, \b((?:(?!Keep|word)\w)*) captures whatever text precedes the first occurrence of word or Keep.
The next thing it sees must be word. If it sees Keep or the end of the string instead, the match attempt immediately fails.
Then ((?:(?!Keep)\w)*)\b captures the remainder of the text in order to ensure it doesn't contain Keep.
When faced with a problem like this, most users' first impulse is to match (in the sense of consuming) only the part of the string they're interested in, using lookarounds to establish the context. It's usually much easier to write the regex so that it always moves forward through the string as it matches. You capture the parts you want to retain so you can plug them back into the result string by means of group references ($1, $2, etc.).
Given that you're using C#, you could use the lookaround approach:
Regex r = new Regex(@"(?<!Keep\w*)word(?!\w*Keep)");
s1 = r.Replace(s0, "");
But please don't. There are very few regex flavors that support unrestricted lookbehinds like .NET does, and most problems don't work so neatly as this one anyway.
string str = "Keepword mywordsecond mythirdword myfourthwordKeep";
str = Regex.Replace(str, "(?<!Keep)word", "");
I'd also point you to a good regular expressions cheat sheet.
This works in Notepad++:
(?<!Keep)word(?!Keep)
It uses a negative lookbehind and a negative lookahead.
You can use a negative lookbehind assertion if you want to remove every "word" that is not preceded by "Keep":
String input = "Keepword mywordsecond mythirdword myfourthwordKeep";
String pattern = "(?<!Keep)word";
String output = Regex.Replace(input, pattern, "");
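The patterns suggested in this thread do not all treat the last sample word the same way, so it can help to run them side by side on the question's input. The sketch below does that; the class and method names are illustrative only. The lookbehind-only pattern removes "word" from myfourthwordKeep, while the variants that also look ahead for Keep leave that word untouched.

```csharp
using System;
using System.Text.RegularExpressions;

static class KeepWordDemo
{
    public const string Input = "Keepword mywordsecond mythirdword myfourthwordKeep";

    // Removes "word" unless it is directly preceded by "Keep".
    public static string LookbehindOnly(string s) =>
        Regex.Replace(s, @"(?<!Keep)word", "");

    // Removes "word" unless "Keep" appears earlier or later in the same word.
    public static string BothLookarounds(string s) =>
        Regex.Replace(s, @"(?<!Keep\w*)word(?!\w*Keep)", "");

    // Consumes the whole word and rebuilds it from the captured parts.
    public static string CaptureForward(string s) =>
        Regex.Replace(s, @"\b((?:(?!Keep|word)\w)*)word((?:(?!Keep)\w)*)\b", "$1$2");

    static void Main()
    {
        Console.WriteLine(LookbehindOnly(Input));  // Keepword mysecond mythird myfourthKeep
        Console.WriteLine(BothLookarounds(Input)); // Keepword mysecond mythird myfourthwordKeep
        Console.WriteLine(CaptureForward(Input));  // Keepword mysecond mythird myfourthwordKeep
    }
}
```

Which result is right depends on whether "exclude Keep" means "directly preceded by Keep" or "anywhere in the same word"; the question's desired output corresponds to the first reading.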

C# .NET Regex remove all quotes of quotes excluding one instance in a sentence

I have a description field which is:
16" Alloy Upgrade
In CSV format it appears like this:
"16"" Alloy Upgrade "
What would be the best use of regex to maintain the original format? As I'm learning, I would appreciate it being broken down for my understanding.
I'm already using Regex to split some text separating 2 fields which are: code, description. I'm using this:
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))
My thoughts are to remove the quotes, then remove the delimiter excluding use in sentences.
Thanks in advance.
If you don't want to/can't use a standard CSV parser (which I'd recommend), you can strip all non-doubled quotes using a regex like this:
Regex.Replace(text, @"(?<!"")""(?!"")", string.Empty)
That regex will match every " character not preceded or followed by another ".
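As a worked example of that idea, the sketch below applies a negative lookbehind and negative lookahead around a quote to the sample field from the question, then collapses the remaining doubled quote. The class and method names are made up for illustration, and the second step is an assumption about the desired end result (the original 16" text). A field containing three or more consecutive quotes is not handled by this simple approach.

```csharp
using System;
using System.Text.RegularExpressions;

static class CsvQuoteDemo
{
    public static string StripFieldQuotes(string field)
    {
        // Step 1: remove quotes that are neither preceded nor followed by another quote.
        string withoutOuter = Regex.Replace(field, @"(?<!"")""(?!"")", string.Empty);
        // Step 2: collapse each remaining doubled quote into a single quote.
        return withoutOuter.Replace("\"\"", "\"");
    }

    static void Main()
    {
        string csvField = "\"16\"\" Alloy Upgrade \"";
        // The trailing space inside the field is preserved, hence the brackets.
        Console.WriteLine("[" + StripFieldQuotes(csvField) + "]");
    }
}
```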
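I wouldn't use a regex, since they are often confusing and it can be unclear what they do (like the one in your question, for example). Instead, this method should do the trick:
public string CleanField(string input)
{
    // An escaped field starts and ends with a quote.
    // The length check guards against a field that is a single quote character.
    if (input.Length >= 2 && input.StartsWith("\"") && input.EndsWith("\""))
    {
        // Take the part between the outer quotes.
        string output = input.Substring(1, input.Length - 2);
        // Collapse each doubled quote into a single quote.
        output = output.Replace("\"\"", "\"");
        return output;
    }
    else
    {
        // If it doesn't start and end with quotes then it doesn't look like it's been escaped, so just hand it back.
        return input;
    }
}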
It may need tweaking, but in essence it checks whether the string starts and ends with a quote (which it should if it is an escaped field). If so, it takes the inside part (with Substring) and then replaces each doubled quote with a single quote character. The code is a bit ugly due to all the escaping, but there is no avoiding that.
The nice thing is that this can be used easily with a bit of LINQ to take an existing array and convert it.
processedFieldArray = inputfieldArray.Select(CleanField).ToArray();
I'm using arrays here purely because your linked page seems to use them where you are wanting this solution.

C# Trouble with Regex.Replace

Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (it replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worried that this will only work for single digits (\"1\" rather than \"11\"). That led me to think about using "*" or "+" rather than ".", but I foresaw the problem of this picking up all of the text in between the desired characters (which are dotted all over the place), whereas I only want to replace the ones with numeric characters between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)
Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str , "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.
This works
String input = @"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
//replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, @"\\""(\d+)\\""", match => match.Result("$1"));
//control string for expected result
String shouldBe = @"?\""width=1height=1border=0style=\""display:none;\"">');";
//prints true
Console.WriteLine(result.Equals(shouldBe).ToString());
