Regex Problems, extracting data to groups - c#

How I love regex!
I have a string which will be a mangled form of XML, like:
<Category>DIR</Category><Location>DL123A</Location><Reason>Because</Reason><Qty>42</Qty><Description>Some Desc</Description><IPAddress>127.0.0.1</IPAddress>
Everything will be on one line, but the 'headers' will often be different.
So what I need to do is extract all the information from the string above and put it into a Dictionary/Hashtable.
--
string myString = @"<Category>DIR</Category><Location>DL123A</Location><Reason>Because</Reason><Qty>42</Qty><Description>Some Desc</Description><IPAddress>127.0.0.1</IPAddress>";
// this will extract the name of the label in the header
Regex r = new Regex(@"(?<header><[A-Za-z]+>?)");
// Create a collection of matches
MatchCollection mc = r.Matches(myString);
foreach (Match m in mc)
{
    headers.Add(m.Groups["header"].Value);
}
// this will try and get the values.
r = new Regex(@"(?'val'>[A-Za-z0-9\s]*</?)");
mc = r.Matches(myString);
foreach (Match m in mc)
{
    string match = m.Groups["val"].Value;
    if (string.IsNullOrEmpty(match) || match == "><" || match == "> <")
        continue;
    else
        values.Add(match);
}
--
I hacked that together from previous work with regexes, as close as I could get.
But it doesn't really work the way I want it to.
The 'header' also pulls the angle brackets in.
The 'value' pulls in a lot of empties (hence the dodgy if statement in the loop). It also doesn't work on strings with periods, commas, spaces, etc.
It would also be much better if I could combine the two statements so I don't have to loop through the regex twice.
Can anyone give me some pointers on how I can improve it?

If it looks like XML, why not use the XML parsing functionality of .NET? All you need to do is add a root element around it:
string myString = @"<Category>DIR</Category><Location>DL123A</Location><Reason>Because</Reason><Qty>42</Qty><Description>Some Desc</Description><IPAddress>127.0.0.1</IPAddress>";
var values = new Dictionary<string, string>();
var xml = XDocument.Parse("<root>" + myString + "</root>");
foreach (var e in xml.Root.Elements()) {
    values.Add(e.Name.ToString(), e.Value);
}

This should strip the angle brackets:
Regex r = new Regex(@"<(?<header>[A-Za-z]+)>");
and this should get rid of empty spaces:
r = new Regex(@">\s*(?'val'[A-Za-z0-9\s]*)\s*</");

This will match the headers without <>:
(?<=<)(?<header>[A-Za-z]+)(?=>)
This will get all values (I'm not sure what should be accepted as a value):
(?<=>)(?'val'[^<]*)(?=</)
However, this is all XML, so you can do this instead:
XDocument doc = XDocument.Parse(string.Format("<root>{0}</root>", myString));
var pairs = doc.Root.Descendants().Select(node => new KeyValuePair<string, string>(node.Name.LocalName, node.Value));
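If you would rather stay with a regex, the two passes from the question can be combined into one pattern. This is only a sketch, assuming flat, non-nested elements whose names consist of plain letters; the names `TagParser` and `ToDictionary` are made up for illustration:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagParser
{
    // One pattern captures the element name and its text. The backreference
    // \k<header> requires the closing tag to repeat the opening name.
    private static readonly Regex Pair =
        new Regex(@"<(?<header>[A-Za-z]+)>(?<val>[^<]*)</\k<header>>");

    public static Dictionary<string, string> ToDictionary(string input)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in Pair.Matches(input))
        {
            // Indexer assignment: a repeated header overwrites the earlier value
            // instead of throwing, which is an assumption about the desired behaviour.
            result[m.Groups["header"].Value] = m.Groups["val"].Value;
        }
        return result;
    }
}
```

Because the value group is `[^<]*`, periods, commas and spaces in the values are accepted, and an empty element simply yields an empty string.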

Related

Fixing RegEx Split() function - Empty string as first entry

Here is the code up front, to help visualize the problem I am facing.
This is the text that needs to be split:
:20:0444453880181732
:21:0444453880131350
:22:CANCEL/ABCDEF0131835055
:23:BUY/CALL/E/EUR
:82A:ABCDEFZZ80A
:87A:4444655604
:30:061123
:31G:070416/1000/USNY
:31E:070418
:26F:PRINCIPAL
:32B:EUR1000000,00
:36:1,31000000
:33B:USD1310000,00
:37K:PCT1,60000000
:34P:061127USD16000,00
:57A:ABCDEFZZ80A
This is my Regex
Regex r = new Regex(@"\:\d{2}\w*\:", RegexOptions.Multiline);
MatchCollection matches = r.Matches(Content);
string[] items = r.Split(Content);
// ----- Fix for first entry being empty string.
int index = items[0] == string.Empty ? 1 : 0;
foreach (Match match in matches)
{
    MessageField field = new MessageField();
    field.FieldIdExtended = match.Value;
    field.Content = items[index];
    Fields.Add(field);
    index++;
}
As you can see from the comments the problem occurs with the splitting of the string.
It returns as first item an empty string.
Is there any elegant way to solve this?
Thanks, Dimi
The reason you are getting this behaviour is that the first delimiter in the split has nothing before it, and thus the first entry is blank.
The proper way to solve this is probably to capture the values you want in the regular expression and then read them from your match set.
At a rough first guess you probably want something like:
Regex r = new Regex(@"^:(?<id>\d{2}\w*):(?<content>.*)$", RegexOptions.Multiline);
MatchCollection matches = r.Matches(Content);
foreach (Match match in matches)
{
    MessageField field = new MessageField();
    field.FieldIdExtended = match.Groups["id"].ToString();
    field.Content = match.Groups["content"].ToString();
    Fields.Add(field);
}
The use of named capture groups makes it easy to extract stuff. You may need to tweak the regex to be more as you want. Currently it gets 20 as id and 0444453880181732 as content. I wasn't 100% clear on what you needed to capture but you look ok with regex so I assume that isn't a problem. :)
Essentially here you are not really trying to split stuff but match stuff and pull it out.
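use:
string[] items = r.Split(Content).Where(s => s.Length > 0).ToArray();
to remove empty entries. (Regex.Split has no overload that takes StringSplitOptions.RemoveEmptyEntries, so the filtering is done with LINQ; this needs using System.Linq.)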
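To see where the empty entry comes from, here is a minimal sketch. `Regex.Split` returns whatever precedes the first delimiter, which is an empty string when the input starts with a delimiter. The `SplitDemo` class and its method names are assumptions made for this example:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    private static readonly Regex Delimiter = new Regex(@":\d{2}\w*:");

    // Raw split: the first element is "" when the content starts with a field tag.
    public static string[] RawSplit(string content)
    {
        return Delimiter.Split(content);
    }

    // Filtered split: trims surrounding whitespace (including line breaks)
    // and drops the pieces that end up empty.
    public static string[] CleanSplit(string content)
    {
        return Delimiter.Split(content)
                        .Select(s => s.Trim())
                        .Where(s => s.Length > 0)
                        .ToArray();
    }
}
```

Note that dropping empty pieces also drops fields whose content is genuinely empty, so the field tags and contents can get out of step; capturing both in one match avoids that.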

Regex to strip characters except given ones?

I would like to strip strings but only leave the following:
[a-zA-Z]+[_a-zA-Z0-9-]*
I am trying to output strings that start with a character, then can have alphanumeric, underscores, and dashes. How can I do this with RegEx or another function?
Because everything in the second part of the regex is in the first part, you could do something like this:
String foo = "_-abc.!@#$5o993idl;)"; // your string here.
//First replace removes all the characters you don't want.
foo = Regex.Replace(foo, "[^_a-zA-Z0-9-]", "");
//Second replace removes any characters from the start that aren't allowed there.
foo = Regex.Replace(foo, "^[^a-zA-Z]+", "");
So start out by paring it down to only the allowed characters. Then get rid of any allowed characters that can't be at the beginning.
Of course, if your regex gets more complicated, this solution falls apart fairly quickly.
Assuming that you've got the strings in a collection, I would do it this way:
foreach element in the collection try match the regex
if !success, remove the string from the collection
Or the other way round - if it matches, add it to a new collection.
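The filtering idea just described might look like this in C#. It is a sketch that assumes the strings are already in a collection and that a string counts as valid only if the whole of it matches, hence the `^` and `$` anchors; `IdentifierFilter` is a hypothetical name:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class IdentifierFilter
{
    // A leading letter, then any number of letters, digits, underscores or dashes.
    private static readonly Regex Valid = new Regex(@"^[a-zA-Z][_a-zA-Z0-9-]*$");

    // Builds a new list holding only the strings that match in full.
    public static List<string> KeepValid(IEnumerable<string> candidates)
    {
        return candidates.Where(s => Valid.IsMatch(s)).ToList();
    }
}
```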
If the strings are not in a collection can you add more details as to what your input looks like ?
If you want to pull out all of the identifiers matching your regular expression, you can do it like this:
var input = " _wontmatch f_oobar0 another_valid ";
var re = new Regex( @"\b[a-zA-Z][_a-zA-Z0-9-]*\b" );
foreach( Match match in re.Matches( input ) )
    Console.WriteLine( match.Value );
Use MatchCollection matchColl = Regex.Matches("input string","your regex");
Then use:
string [] outStrings = new string[matchColl.Count]; //A string array to contain all required strings
for (int i=0; i < matchColl.Count; i++ )
outStrings[i] = matchColl[i].ToString();
You will have all the required strings in outStrings. Hope this helps.
Edited
var s = Regex.Matches(input_string, "[a-z]+(_*-*[a-z0-9]*)*", RegexOptions.IgnoreCase);
string output_string="";
foreach (Match m in s)
{
output_string = output_string + m;
}
MessageBox.Show(output_string);

Get string between quotes using regex

I have a string that is basically an XML node, and I need to extract the values of the attributes. I am trying to use the following C# code to accomplish this:
string line = "<log description="Reset Controls - MFB - SkipSegment = True" start="09/13/2011 10:29:58" end="09/13/2011 10:29:58" timeMS="0" serviceCalls="0">"
string pattern = "\"[\\w ]*\"";
Regex r = new Regex(pattern);
foreach (Match m in Regex.Matches(line, pattern))
{
MessageBox.Show(m.Value.Substring(1, m.Value.Length - 2));
}
The problem is that this is only returning the last occurrence from the string ("0" in the above example), when each string contains 5 occurrences. How do I get every occurrence using C#?
Don't try to parse XML with regular expressions. Use an XML API instead. It's simply a really bad idea to try to hack together "just enough of an XML parser" - you'll end up with incredibly fragile code.
Now your line isn't actually a valid XML element at the moment - but if you add a </log> it will be.
XElement element = XElement.Parse(line + "</log>");
foreach (XAttribute attribute in element.Attributes())
{
Console.WriteLine("{0} = {1}", attribute.Name, attribute.Value);
}
That's slightly hacky, but it's better than trying to fake XML parsing yourself.
As an aside, you need to escape your string for double-quotes and add a semi-colon:
string line = "<log description=\"Reset Controls - MFB - SkipSegment = True\" start=\"09/13/2011 10:29:58\" end=\"09/13/2011 10:29:58\" timeMS=\"0\" serviceCalls=\"0\">";
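To actually answer your question, your pattern should probably be "\"[^\"]*\"" because \w (even with the space added to the character class) won't match symbols such as -, =, / and :, all of which appear in your attribute values.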
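If you do stay with a regex, a sketch using the `[^"]*` pattern could look like this. It captures the text between each pair of double quotes in a group, so no `Substring` arithmetic is needed. `QuotedValues.Extract` is a name invented for this example, and the approach assumes the values never contain an escaped quote:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedValues
{
    public static List<string> Extract(string line)
    {
        var values = new List<string>();
        // Group 1 holds everything between an opening and a closing double quote.
        foreach (Match m in Regex.Matches(line, "\"([^\"]*)\""))
        {
            values.Add(m.Groups[1].Value);
        }
        return values;
    }
}
```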

Increasing Regex Efficiency

I have about 100k Outlook mail items with about 500-600 chars per Body. I have a list of 580 keywords that must be searched for in each body, and the words found are then appended at the bottom.
I believe I've made most of the function efficient, but it still takes a lot of time. Even for 100 emails it takes about 4 seconds.
I run two functions, one for each keyword list (290 keywords per list).
public List<string> Keyword_Search(HtmlNode nSearch)
{
    var wordFound = new List<string>();
    foreach (string currWord in _keywordList)
    {
        bool isMatch = Regex.IsMatch(nSearch.InnerHtml, "\\b" + @currWord + "\\b",
                                     RegexOptions.IgnoreCase);
        if (isMatch)
        {
            wordFound.Add(currWord);
        }
    }
    return wordFound;
}
Is there any way I can increase the efficiency of this function?
The other thing that might be slowing it down is that I use HTML Agility Pack to navigate through some nodes and pull out the body (nSearch.InnerHtml). The _keywordList is a List, not an array.
I assume that the call to nSearch.InnerHtml is pretty slow, and you repeat it for every single word that you are checking. You can simply cache the result of the call:
public List<string> Keyword_Search(HtmlNode nSearch)
{
    var wordFound = new List<string>();
    // cache inner HTML
    string innerHtml = nSearch.InnerHtml;
    foreach (string currWord in _keywordList)
    {
        bool isMatch = Regex.IsMatch(innerHtml, "\\b" + @currWord + "\\b",
                                     RegexOptions.IgnoreCase);
        if (isMatch)
        {
            wordFound.Add(currWord);
        }
    }
    return wordFound;
}
Another optimization would be the one suggested by Jeff Yates, e.g. using a single pattern:
string pattern = @"(\b(?:" + string.Join("|", _keywordList) + @")\b)";
I don't think this is a job for regular expressions. You might be better off searching each message word by word and checking each word against your word list. With the approach you have, you're searching each message n times where n is the number of words you want to find - it's no wonder that it takes a while.
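The word-by-word approach can be sketched as follows. It assumes every keyword is a single word and that the body has already been reduced to plain text (HTML tag names would otherwise be treated as words too); `WordScan` is a hypothetical name:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordScan
{
    // Splits the text into words once, then checks each word against a hash set,
    // so the text is scanned a single time regardless of how many keywords there are.
    public static List<string> FindKeywords(string text, IEnumerable<string> keywords)
    {
        var wanted = new HashSet<string>(keywords, StringComparer.OrdinalIgnoreCase);
        var found = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string word in Regex.Split(text, @"\W+"))
        {
            if (word.Length > 0 && wanted.Contains(word))
                found.Add(word); // stored in the casing used in the text
        }
        return new List<string>(found);
    }
}
```

The hash set lookup is what removes the per-keyword pass over each message.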
Most of the time comes from matches that fail, so you want to minimize failures.
If the search keywords are not frequent, you can test for all of them at the same time (with the regexp \b(aaa|bbb|ccc|....)\b) and exclude the emails with no matches. For the ones that have at least one match, you then do a thorough search.
One thing you can easily do is match against all the words in one go by building an expression like:
\b(?:word1|word2|word3|....)\b
Then you can precompile the pattern and reuse it to look up all occurrences for each email (not sure how you do this with the .NET API, but there must be a way).
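In .NET, precompiling amounts to building the `Regex` object once and reusing it for every email, optionally with `RegexOptions.Compiled`. A sketch, assuming the keywords are literal text (hence `Regex.Escape`) that begin and end with a word character; the `KeywordMatcher` class is hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public sealed class KeywordMatcher
{
    private readonly Regex _regex;

    public KeywordMatcher(IEnumerable<string> keywords)
    {
        // Escape each keyword so characters like '.' are matched literally.
        string alternation = string.Join("|", keywords.Select(k => Regex.Escape(k)));
        _regex = new Regex(@"\b(?:" + alternation + @")\b",
                           RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }

    // Returns each distinct hit once, in the casing found in the text.
    public List<string> Find(string text)
    {
        return _regex.Matches(text)
                     .Cast<Match>()
                     .Select(m => m.Value)
                     .Distinct(StringComparer.OrdinalIgnoreCase)
                     .ToList();
    }
}
```

Create one `KeywordMatcher` per keyword list up front and call `Find` for each message. Whether `RegexOptions.Compiled` pays off depends on how many messages are processed, so it is worth measuring.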
Another thing: instead of using the ignore-case flag, you could convert everything to lowercase, which might give you a small speed boost (you need to profile it, as it's implementation dependent). Don't forget to warm up the CLR when you profile.
This may be faster. You can leverage Regex Groups like this:
public List<string> Keyword_Search(HtmlNode nSearch)
{
    var wordFound = new List<string>();
    // cache inner HTML
    string innerHtml = nSearch.InnerHtml;
    string pattern = "(\\b" + string.Join("\\b)|(\\b", _keywordList) + "\\b)";
    Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
    MatchCollection myMatches = myRegex.Matches(innerHtml);
    foreach (Match myMatch in myMatches)
    {
        // Group 0 represents the entire match so we skip that one
        for (int i = 1; i < myMatch.Groups.Count; i++)
        {
            if (myMatch.Groups[i].Success)
                wordFound.Add(_keywordList[i-1]);
        }
    }
    return wordFound;
}
This way you're only using one regular expression. And the indices of the Groups should correlate with your _keywordList by an offset of 1, hence the line wordFound.Add(_keywordList[i-1]);
UPDATE:
After looking at my code again I just realized that putting the matches into Groups is really unnecessary, and Regex Groups have some overhead. Instead, you could remove the capturing parentheses from the pattern and simply add the matches themselves to the wordFound list. This would produce much the same result (the list then holds the text as it appears in the email rather than the keyword as spelled in your list), but it'd be faster.
It'd be something like this:
public List<string> Keyword_Search(HtmlNode nSearch)
{
    var wordFound = new List<string>();
    // cache inner HTML
    string innerHtml = nSearch.InnerHtml;
    string pattern = "\\b(?:" + string.Join("|", _keywordList) + ")\\b";
    Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
    MatchCollection myMatches = myRegex.Matches(innerHtml);
    foreach (Match myMatch in myMatches)
    {
        wordFound.Add(myMatch.Value);
    }
    return wordFound;
}
Regular expressions can be optimized quite a bit when you just want to match against a fixed set of constant strings. Instead of several matches, e.g. against "winter", "win" or "wombat", you can just match against "w(in(ter)?|ombat)", for example (Jeffrey Friedl's book can give you lots of ideas like this). This kind of optimisation is also built into some programs, notably emacs ('regexp-opt'). I'm not too familiar with .NET, but I assume someone has programmed similar functionality - google for "regexp optimization".
If the regular expression is indeed the bottleneck, and even optimizing it (by concatenating the search words into one expression) doesn’t help, consider using a multi-pattern search algorithm, such as Wu-Manber.
I’ve posted a very simple implementation here on Stack Overflow. It’s written in C++ but since the code is straightforward it should be easy to translate it to C#.
Notice that this will find words anywhere, not just at word boundaries. However, this can be easily tested after you’ve checked whether the text contains any words; either once again with a regular expression (now you only test individual emails – much faster) or manually by checking the characters before and after the individual hits.
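Checking the characters around a hit by hand can be sketched as follows. It assumes a "word character" means a letter, digit or underscore, roughly as for the regex `\b`, and that each keyword itself begins and ends with such a character; `BoundaryCheck` is an invented helper, and plain `IndexOf` stands in here for whatever search algorithm produced the hit:

```csharp
using System;

public static class BoundaryCheck
{
    private static bool IsWordChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '_';
    }

    // True if keyword occurs in text with no word character directly before or after it.
    public static bool ContainsWholeWord(string text, string keyword)
    {
        if (string.IsNullOrEmpty(keyword)) return false;
        int index = text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase);
        while (index >= 0)
        {
            int end = index + keyword.Length;
            bool startOk = index == 0 || !IsWordChar(text[index - 1]);
            bool endOk = end == text.Length || !IsWordChar(text[end]);
            if (startOk && endOk) return true;
            // This hit was embedded in a longer word; look for the next one.
            index = text.IndexOf(keyword, index + 1, StringComparison.OrdinalIgnoreCase);
        }
        return false;
    }
}
```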
If your problem is about searching for Outlook items containing a certain string, you should gain from using Outlook's search facilities.
See:
http://msdn.microsoft.com/en-us/library/bb644806.aspx
If your keyword search is for straight literals, i.e. the keywords do not contain further regex patterns, then another method may be more appropriate. The following code demonstrates one such method. It only goes through each email once, whereas your code went through each email 290 times (twice).
public List<string> FindKeywords(string emailbody, List<string> keywordList)
{
    // may want to clean up the input a bit, such as replacing '.' and ',' with a space
    // and removing double spaces
    List<string> emailBodyAsList = new List<string>(emailbody.Split(' '));
    // compare case-insensitively so that neither side has to be upper-cased first
    List<string> foundKeywords = new List<string>(
        keywordList.Intersect(emailBodyAsList, StringComparer.OrdinalIgnoreCase));
    return foundKeywords;
}
If you can use .NET 3.5+ and LINQ you could do something like this.
public static class HtmlNodeTools
{
    public static IEnumerable<string> MatchedKeywords(
        this HtmlNode nSearch,
        IEnumerable<string> keywordList)
    {
        //// as regex
        //var innerHtml = nSearch.InnerHtml;
        //return keywordList.Where(kw =>
        //    Regex.IsMatch(innerHtml,
        //        @"\b" + kw + @"\b",
        //        RegexOptions.IgnoreCase)
        //    );
        //would be faster if you don't need the pattern matching
        var innerHtml = ' ' + nSearch.InnerHtml + ' ';
        return keywordList.Where(kw => innerHtml.Contains(kw));
    }
}
class Program
{
    static void Main(string[] args)
    {
        var keywordList = new string[] { "hello", "world", "nomatch" };
        var h = new HtmlNode()
        {
            InnerHtml = "hi there hello other world"
        };
        var matched = h.MatchedKeywords(keywordList).ToList();
        //hello, world
    }
}
... reused regex example ...
public static class HtmlNodeTools
{
    public static IEnumerable<string> MatchedKeywords(
        this HtmlNode nSearch,
        IEnumerable<KeyValuePair<string, Regex>> keywordList)
    {
        // as regex
        var innerHtml = nSearch.InnerHtml;
        return from kvp in keywordList
               where kvp.Value.IsMatch(innerHtml)
               select kvp.Key;
    }
}
class Program
{
    static void Main(string[] args)
    {
        var keywordList = new string[] { "hello", "world", "nomatch" };
        var h = new HtmlNode()
        {
            InnerHtml = "hi there hello other world"
        };
        var keywordSet = keywordList.Select(kw =>
            new KeyValuePair<string, Regex>(kw,
                new Regex(
                    @"\b" + kw + @"\b",
                    RegexOptions.IgnoreCase)
                )
            ).ToArray();
        var matched = h.MatchedKeywords(keywordSet).ToList();
        //hello, world
    }
}

.NET regular expression find the number and group the number

I have a question about .NET regular expressions.
Now I have several strings in a list. There may be a number in each string, and the rest of the string is the same, like this:
string[] strings = {"var1", "var2", "var3", "array[0]", "array[1]", "array[2]"};
I want the result to be {"var$i", "array[$i]"}, together with a record of the numbers that were matched, like a dictionary:
var$i {1,2,3} &
array[$i] {0, 1 ,2}
I defined a regex like this
var numberReg = new Regex(@".*(<number>\d+).*");
foreach (string str in strings) {
    var matchResult = numberReg.Match(str);
    if (matchResult.Success) {
        var number = matchResult.Groups["number"].ToString();
        //blablabla
But the regex here does not seem to work (the match never succeeds). I am new to regex, and I want to solve this problem ASAP.
Try this as your regex:
(?<number>\d+)
It is not clear to me what exactly you want. However looking into your code, I assume you have to somehow extract the numbers (and maybe variable names) from your list of values. Try this:
// values
string[] myStrings = { "var1", "var2", "var3", "array[0]", "array[1]", "array[2]" };
// matches
Regex x = new Regex(@"(?<pre>\w*)(?<number>\d+)(?<post>\w*)");
MatchCollection matches = x.Matches(String.Join(",", myStrings));
// get the numbers
foreach (Match m in matches)
{
    string number = m.Groups["number"].Value;
    ...
}
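To build the template-to-numbers dictionary the question describes, one possible sketch replaces the number with the `$i` placeholder and uses the result as the key. It assumes at most one number per string matters (only the first run of digits is used); `NumberGrouper` is a made-up name:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumberGrouper
{
    private static readonly Regex Number = new Regex(@"[0-9]+");

    public static Dictionary<string, List<int>> Group(IEnumerable<string> names)
    {
        var groups = new Dictionary<string, List<int>>();
        foreach (string name in names)
        {
            Match m = Number.Match(name);
            if (!m.Success) continue; // strings without a number are skipped

            // "$$" is a literal dollar sign in a .NET replacement pattern;
            // the count argument 1 limits the replacement to the first match.
            string template = Number.Replace(name, "$$i", 1);

            List<int> numbers;
            if (!groups.TryGetValue(template, out numbers))
            {
                numbers = new List<int>();
                groups[template] = numbers;
            }
            numbers.Add(int.Parse(m.Value));
        }
        return groups;
    }
}
```

For the sample array this gives the key `var$i` with the numbers 1, 2, 3 and the key `array[$i]` with 0, 1, 2.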
