I have the following string:
String myNarrative = "ID: 4393433 This is the best narration";
I want to split this into 2 strings;
myId = "ID: 4393433";
myDesc = "This is the best narration";
How do I do this in Regex.Split()?
Thanks for your help.
If it is a fixed format as shown, use Regex.Match with Capturing Groups (see Matched Subexpressions). Split is useful for dividing up a repeating sequence of unbounded length; the input does not represent such a sequence but rather a fixed set of fields/values.
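// No trailing \s+ after (.*): it would force the last word out of the capture.
var m = Regex.Match(inp, @"ID:\s+(\d+)\s+(.*)");
if (m.Success) {
    var number = m.Groups[1].Value; // the digits after "ID:"
    var rest = m.Groups[2].Value;   // everything after the number
} else {
    // Failed to match.
}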
Alternatively, one could use Named Groups and have a read through the Regular Expression Language quick-reference.
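A minimal sketch of the named-group variant, assuming the input always starts with "ID:" followed by digits. The names NarrativeParser and TryParse are made up for illustration; the "id" group keeps the "ID: " prefix because the question asks for myId = "ID: 4393433".

```csharp
using System.Text.RegularExpressions;

public static class NarrativeParser
{
    // "ID:", whitespace and digits go into group "id";
    // everything after the following whitespace goes into group "desc".
    private static readonly Regex Pattern =
        new Regex(@"^(?<id>ID:\s+\d+)\s+(?<desc>.*)$");

    // Returns false (and empty strings) when the input does not have the expected shape.
    public static bool TryParse(string input, out string id, out string desc)
    {
        Match m = Pattern.Match(input);
        id = m.Success ? m.Groups["id"].Value : string.Empty;
        desc = m.Success ? m.Groups["desc"].Value : string.Empty;
        return m.Success;
    }
}
```

Called with the string from the question, id should come back as "ID: 4393433" and desc as "This is the best narration".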
Up front, here is the code, to illustrate the problem I am facing:
This is the text that needs to be split.
:20:0444453880181732
:21:0444453880131350
:22:CANCEL/ABCDEF0131835055
:23:BUY/CALL/E/EUR
:82A:ABCDEFZZ80A
:87A:4444655604
:30:061123
:31G:070416/1000/USNY
:31E:070418
:26F:PRINCIPAL
:32B:EUR1000000,00
:36:1,31000000
:33B:USD1310000,00
:37K:PCT1,60000000
:34P:061127USD16000,00
:57A:ABCDEFZZ80A
This is my Regex
Regex r = new Regex(@"\:\d{2}\w*\:", RegexOptions.Multiline);
MatchCollection matches = r.Matches(Content);
string[] items = r.Split(Content);
// ----- Fix for first entry being empty string.
int index = items[0] == string.Empty ? 1 : 0;
foreach (Match match in matches)
{
MessageField field = new MessageField();
field.FieldIdExtended = match.Value;
field.Content = items[index];
Fields.Add(field);
index++;
}
As you can see from the comments the problem occurs with the splitting of the string.
It returns an empty string as the first item.
Is there any elegant way to solve this?
Thanks, Dimi
The reason you are getting this behaviour is that the first delimiter in the split has nothing before it, and thus the first entry is blank.
The way to solve this properly is probably to capture the value that you want in the regular expression and then just get it from your match set.
At a rough first guess you probably want something like:
Regex r = new Regex(@"^:(?<id>\d{2}\w*):(?<content>.*)$", RegexOptions.Multiline);
MatchCollection matches = r.Matches(Content);
foreach (Match match in matches)
{
    MessageField field = new MessageField();
    field.FieldIdExtended = match.Groups["id"].ToString();
    field.Content = match.Groups["content"].ToString();
    Fields.Add(field);
}
The use of named capture groups makes it easy to extract stuff. You may need to tweak the regex to suit your needs. Currently it gets 20 as id and 0444453880181732 as content. I wasn't 100% clear on what you needed to capture but you look ok with regex so I assume that isn't a problem. :)
Essentially here you are not really trying to split stuff but match stuff and pull it out.
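As a hedged sketch of that match-and-extract idea, the following assumes each field sits on its own line and that lines may end in either "\n" or "\r\n". FieldParser and its Parse method are hypothetical names, and plain key/value pairs stand in for the MessageField class from the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldParser
{
    // ^ and $ match at line boundaries because of RegexOptions.Multiline.
    // [^\r\n]* followed by \r? keeps a trailing '\r' out of the content group.
    private static readonly Regex FieldPattern = new Regex(
        @"^:(?<id>\d{2}\w*):(?<content>[^\r\n]*)\r?$",
        RegexOptions.Multiline);

    // Returns (id, content) pairs in the order they appear in the message.
    public static List<KeyValuePair<string, string>> Parse(string message)
    {
        var fields = new List<KeyValuePair<string, string>>();
        foreach (Match m in FieldPattern.Matches(message))
        {
            fields.Add(new KeyValuePair<string, string>(
                m.Groups["id"].Value, m.Groups["content"].Value));
        }
        return fields;
    }
}
```

Because nothing is split, there is no empty first entry to skip, and the id and its content always come from the same match.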
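Regex.Split has no overload that takes StringSplitOptions (that option belongs to String.Split), so filter out the empty entries afterwards, for example with LINQ (requires using System.Linq):
string[] items = r.Split(Content).Where(s => s.Length > 0).ToArray();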
The input string is something like this: OU=TEST:This001. We need to extract "This001", preferably in C#.
What about :
/OU=.*?:(.*)/
Here is how it works:
OU= // Must contain OU=
. // Any character
* // Repeated but not mandatory
? // Ungreedy (lazy) (Don't try to match everything)
: // Match the colon
( // Start to capture a group
. // Any character
* // Repeated but not mandatory
) // End of the group
The / characters are delimiters that mark where the regex starts and ends (and allow adding options). In C# you omit them and pass the pattern as a string.
The captured group will contain This001.
But it would be faster with a simple Substring().
yourString.Substring(yourString.IndexOf(":")+1);
Resources :
regular-expressions.info
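For reference, both suggestions could look roughly like this in C#. It is only a sketch: OuExtractor and its method names are invented here, and returning null when the input has no colon is an assumption about the desired behaviour.

```csharp
using System.Text.RegularExpressions;

public static class OuExtractor
{
    // Regex variant: capture everything after the first colon that follows "OU=".
    public static string WithRegex(string input)
    {
        Match m = Regex.Match(input, @"OU=.*?:(.*)");
        return m.Success ? m.Groups[1].Value : null;
    }

    // Substring variant: everything after the first colon, or null if there is none.
    public static string WithSubstring(string input)
    {
        int colon = input.IndexOf(':');
        return colon >= 0 ? input.Substring(colon + 1) : null;
    }
}
```

Note that the Substring variant does not check for the "OU=" prefix, so it is only equivalent when the input is known to have that form.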
"OU=" smells like you're doing an Active Directory or LDAP search and responding to the results. While regex is a brilliant tool, I just wanted to make sure that you're also aware of the excellent System.DirectoryServices.Protocols classes that were made for parsing, filtering and manipulating just this sort of data.
The SearchResult, SearchResultEntry and DirectoryAttribute in particular would be the friends you might be looking for. I don't doubt that you can regex or substring as cleverly as the next guy but it's also nice to have another good tool in the toolbox.
Have you tried these classes?
A solution without regex:
var str = "OU=TEST:This00:1";
var result = str.Split(new char[] { ':' }, 2)[1];
// result == This00:1
Regex vs Split vs IndexOf
Split
var str = "OU=TEST:This00:1";
var sw = new Stopwatch();
sw.Start();
var result = str.Split(new char[] { ':' }, 2)[1];
sw.Stop();
// sw.ElapsedTicks == 15
Regex
var str = "OU=TEST:This00:1";
var sw = new Stopwatch();
sw.Start();
var result = (new Regex(":(.*)", RegexOptions.Compiled)).Match(str).Groups[1];
sw.Stop();
// sw.ElapsedTicks == 7000 (Compiled)
IndexOf
var str = "OU=TEST:This00:1";
var sw = new Stopwatch();
sw.Start();
var result = str.Substring(str.IndexOf(":") + 1);
sw.Stop();
// sw.ElapsedTicks == 40
Winner: Split
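A single Stopwatch reading around one call mostly measures one-time costs such as JIT compilation and, in the Regex case, constructing and compiling the pattern. The sketch below is one way to get steadier numbers: it warms each method up once and then times many iterations. ExtractBenchmark and Measure are hypothetical names, and no result figures are claimed here because they depend on the machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class ExtractBenchmark
{
    // Built once, so construction cost is not counted in every iteration.
    private static readonly Regex AfterColon = new Regex(":(.*)", RegexOptions.Compiled);

    public static string BySplit(string s) { return s.Split(new[] { ':' }, 2)[1]; }
    public static string ByRegex(string s) { return AfterColon.Match(s).Groups[1].Value; }
    public static string ByIndexOf(string s) { return s.Substring(s.IndexOf(':') + 1); }

    // Runs the extractor once as a warm-up, then returns the elapsed
    // Stopwatch ticks for the given number of iterations.
    public static long Measure(Func<string, string> extract, string input, int iterations)
    {
        extract(input);
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            extract(input);
        }
        sw.Stop();
        return sw.ElapsedTicks;
    }
}
```

A call such as ExtractBenchmark.Measure(ExtractBenchmark.BySplit, "OU=TEST:This00:1", 1000000) can then be repeated for each method and the totals compared.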
If OU=TEST: must precede the string you want to match, use this regex:
(?<=OU\s*=\s*TEST\s*:\s*).*
This regex matches any length of text after the colon; the text before the colon is only a required prefix (a lookbehind) and is not part of the match.
You can replace TEST with [A-Za-z]+ to accept any alphabetic name instead of TEST, or with \w+ to accept any combination of letters, digits, and underscores.
\s* means there may be any amount of whitespace, or none, at that position; remove it if you don't need that check.
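Here is a small sketch of the lookbehind pattern in use, with TEST generalised to \w+ as suggested. OuLookbehind and Extract are illustrative names only. .NET allows variable-length lookbehind, which this pattern relies on.

```csharp
using System.Text.RegularExpressions;

public static class OuLookbehind
{
    // The lookbehind requires "OU=<name>:" before the match but does not consume it,
    // so Match.Value is only the text after the colon.
    private static readonly Regex AfterOu =
        new Regex(@"(?<=OU\s*=\s*\w+\s*:\s*).*");

    // Returns the text after the colon, or null when the prefix is missing.
    public static string Extract(string input)
    {
        Match m = AfterOu.Match(input);
        return m.Success ? m.Value : null;
    }
}
```

One caveat: the match starts at the earliest position the lookbehind accepts, which is directly after the colon, so whitespace following the colon ends up in the result. Call Trim() on the value if that matters.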
I have about 100k Outlook mail items with about 500-600 chars per Body. I have a list of 580 keywords that I must search for in each body, then append the words found at the bottom.
I believe I've increased the efficiency of the majority of the function, but it still takes a lot of time. Even for 100 emails it takes about 4 seconds.
I run two functions for each keyword list (290 keywords each list).
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
foreach (string currWord in _keywordList)
{
bool isMatch = Regex.IsMatch(nSearch.InnerHtml, "\\b" + @currWord + "\\b",
RegexOptions.IgnoreCase);
if (isMatch)
{
wordFound.Add(currWord);
}
}
return wordFound;
}
Is there any way I can increase the efficiency of this function?
The other thing that might be slowing it down is that I use HTML Agility Pack to navigate through some nodes and pull out the body (nSearch.InnerHtml). The _keywordList is a List item, and not an array.
I assume that the property call nSearch.InnerHtml is pretty slow, and you repeat it for every single word that you are checking. You can simply cache the result of the call:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
foreach (string currWord in _keywordList)
{
bool isMatch = Regex.IsMatch(innerHtml, "\\b" + @currWord + "\\b",
RegexOptions.IgnoreCase);
if (isMatch)
{
wordFound.Add(currWord);
}
}
return wordFound;
}
Another optimization would be the one suggested by Jeff Yates. E.g. by using a single pattern:
string pattern = @"(\b(?:" + string.Join("|", _keywordList) + @")\b)";
I don't think this is a job for regular expressions. You might be better off searching each message word by word and checking each word against your word list. With the approach you have, you're searching each message n times where n is the number of words you want to find - it's no wonder that it takes a while.
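A rough sketch of that word-by-word approach, assuming every keyword is a single word: the body is split once and each word is looked up in a case-insensitive HashSet. WordScanner and FindKeywords are hypothetical names, and Regex.Split is used only for tokenising. HTML tag names in the body would be treated as words too.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordScanner
{
    // Splits the text on runs of non-word characters and looks each word up
    // in a case-insensitive set, so the text is scanned only once.
    public static List<string> FindKeywords(string text, IEnumerable<string> keywords)
    {
        var wanted = new HashSet<string>(keywords, StringComparer.OrdinalIgnoreCase);
        var found = new List<string>();
        foreach (string word in Regex.Split(text, @"\W+"))
        {
            // Remove() returns true only the first time a keyword is seen,
            // so each keyword is reported once, spelled as it appears in the text.
            if (wanted.Remove(word))
            {
                found.Add(word);
            }
        }
        return found;
    }
}
```

The cost per email is then proportional to the number of words in the body rather than to the number of keywords.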
Most of the time comes from matches that fail, so you want to minimize failures.
If the search keywords are not frequent, you can test for all of them at the same time (with the regexp \b(aaa|bbb|ccc|....)\b) and exclude the emails with no matches. For the ones that have at least one match, you then do a thorough search.
One thing you can easily do is match against all the words in one go by building an expression like:
\b(?:word1|word2|word3|....)\b
Then you can precompile the pattern and reuse it to look up all occurrences for each email (not sure how you do this with the .NET API, but there must be a way).
Another thing is instead of using the ignorecase flag, if you convert everything to lowercase, that might give you a small speed boost (need to profile it as it's implementation dependent). Don't forget to warm up the CLR when you profile.
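In .NET, one way to precompile and reuse such a pattern is to build the Regex once with RegexOptions.Compiled and keep the instance around. The sketch below assumes a non-empty keyword list; KeywordMatcher is a made-up class name, and Regex.Escape is added so that keywords containing regex metacharacters are treated literally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public sealed class KeywordMatcher
{
    private readonly Regex _pattern;

    // Builds \b(?:word1|word2|...)\b once; the instance is then reused for every email.
    public KeywordMatcher(IEnumerable<string> keywords)
    {
        string alternation = string.Join("|",
            keywords.Select(k => Regex.Escape(k)).ToArray());
        _pattern = new Regex(@"\b(?:" + alternation + @")\b",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }

    // Returns each keyword found in the text once, ignoring case differences.
    public List<string> Find(string text)
    {
        var found = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in _pattern.Matches(text))
        {
            found.Add(m.Value);
        }
        return found.ToList();
    }
}
```

Create one KeywordMatcher per keyword list before the loop over the emails and call Find for each body. Whether RegexOptions.Compiled pays off depends on how many bodies are scanned, so it is worth profiling both with and without it.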
This may be faster. You can leverage Regex Groups like this:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
string pattern = "(\\b" + string.Join("\\b)|(\\b", _keywordList) + "\\b)";
Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection myMatches = myRegex.Matches(innerHtml);
foreach (Match myMatch in myMatches)
{
// Group 0 represents the entire match so we skip that one
for (int i = 1; i < myMatch.Groups.Count; i++)
{
if (myMatch.Groups[i].Success)
wordFound.Add(_keywordList[i-1]);
}
}
return wordFound;
}
This way you're only using one regular expression. And the indices of the Groups should correlate with your _keywordList by an offset of 1, hence the line wordFound.Add(_keywordList[i-1]);
UPDATE:
After looking at my code again I just realized that putting the matches into Groups is really unnecessary. And Regex Groups have some overhead. Instead, you could drop the capturing parentheses from the pattern (keeping a single non-capturing group around the alternatives), and then simply add the matches themselves to the wordFound list. This would produce much the same effect (the list then holds the matched text as it appears in the body), but it'd be faster.
It'd be something like this:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
string pattern = "\\b(?:" + string.Join("|", _keywordList) + ")\\b";
Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection myMatches = myRegex.Matches(innerHtml);
foreach (Match myMatch in myMatches)
{
wordFound.Add(myMatch.Value);
}
return wordFound;
}
Regular expressions can be optimized quite a bit when you just want to match against a fixed set of constant strings. Instead of several matches, e.g. against "winter", "win" or "wombat", you can just match against "w(in(ter)?|ombat)", for example (Jeffrey Friedl's book can give you lots of ideas like this). This kind of optimisation is also built into some programs, notably emacs ('regexp-opt'). I'm not too familiar with .NET, but I assume someone has programmed similar functionality - google for "regexp optimization".
If the regular expression is indeed the bottleneck, and even optimizing it (by concatenating the search words into one expression) doesn’t help, consider using a multi-pattern search algorithm, such as Wu-Manber.
I’ve posted a very simple implementation here on Stack Overflow. It’s written in C++ but since the code is straightforward it should be easy to translate it to C#.
Notice that this will find words anywhere, not just at word boundaries. However, this can be easily tested after you’ve checked whether the text contains any words; either once again with a regular expression (now you only test individual emails – much faster) or manually by checking the characters before and after the individual hits.
If your problem is about searching for Outlook items containing a certain string, you should get a gain from using Outlook's search facilities...
see:
http://msdn.microsoft.com/en-us/library/bb644806.aspx
If your keywords are plain literals, i.e. they contain no regex syntax, then another method may be more appropriate. The following code demonstrates one such method. It goes through each email only once, whereas your code went through each email 290 times (twice).
public List<string> FindKeywords(string emailbody, List<string> keywordList)
{
    // may want to clean up the input a bit, such as replacing '.' and ',' with a space
    // and remove double spaces
    string[] emailBodyWords = emailbody.Split(' ');
    // compare case-insensitively so the keyword list does not have to be upper-case
    // (Intersect needs using System.Linq)
    List<string> foundKeywords = new List<string>(
        keywordList.Intersect(emailBodyWords, StringComparer.OrdinalIgnoreCase));
    return foundKeywords;
}
If you can use .Net 3.5+ and LINQ you could do something like this.
public static class HtmlNodeTools
{
public static IEnumerable<string> MatchedKeywords(
this HtmlNode nSearch,
IEnumerable<string> keywordList)
{
//// as regex
//var innerHtml = nSearch.InnerHtml;
//return keywordList.Where(kw =>
// Regex.IsMatch(innerHtml,
//	@"\b" + kw + @"\b",
// RegexOptions.IgnoreCase)
// );
//would be faster if you don't need the pattern matching
var innerHtml = ' ' + nSearch.InnerHtml + ' ';
// pad the keyword too, so only space-delimited whole words match
return keywordList.Where(kw => innerHtml.Contains(' ' + kw + ' '));
}
}
class Program
{
static void Main(string[] args)
{
var keyworkList = new string[] { "hello", "world", "nomatch" };
var h = new HtmlNode()
{
InnerHtml = "hi there hello other world"
};
var matched = h.MatchedKeywords(keyworkList).ToList();
//hello, world
}
}
... reused regex example ...
public static class HtmlNodeTools
{
public static IEnumerable<string> MatchedKeywords(
this HtmlNode nSearch,
IEnumerable<KeyValuePair<string, Regex>> keywordList)
{
// as regex
var innerHtml = nSearch.InnerHtml;
return from kvp in keywordList
where kvp.Value.IsMatch(innerHtml)
select kvp.Key;
}
}
class Program
{
static void Main(string[] args)
{
var keyworkList = new string[] { "hello", "world", "nomatch" };
var h = new HtmlNode()
{
InnerHtml = "hi there hello other world"
};
var keyworkSet = keyworkList.Select(kw =>
new KeyValuePair<string, Regex>(kw,
new Regex(
@"\b" + kw + @"\b",
RegexOptions.IgnoreCase)
)
).ToArray();
var matched = h.MatchedKeywords(keyworkSet).ToList();
//hello, world
}
}
Please help me with this problem.
I want to split "-action=1" to "action" and "1".
string pattern = @"^-(\S+)=(\S+)$";
Regex regex = new Regex(pattern);
string myText = "-action=1";
string[] result = regex.Split(myText);
I don't know why result has length 4.
result[0] = ""
result[1] = "action"
result[2] = "1"
result[3] = ""
Please help me.
P/S: I am using .NET 2.0.
Thanks.
Hello, I tested with the string @"-destination=C:\Program Files\Release", but the result is wrong: I don't understand why the result's length is 1. I think it is because the string contains white space.
I want to split it to "destination" & "C:\Program Files\Release"
More info: This is my requirement:
-string1=string2 -> split it to: string1 & string2.
string1 and string2 don't contain the characters '-' or '=', but they can contain white space.
Please help me. Thanks.
Don't use split, just use Match, and then get the results from the Groups collection by index (index 1 and 2).
Match match = regex.Match(myText);
if (!match.Success) {
// the regex didn't match - you can do error handling here
}
string action = match.Groups[1].Value;
string number = match.Groups[2].Value;
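As for why Split returned four elements: Regex.Split includes the text of every capturing group in the result array, and because the pattern is anchored with ^ and $ it matches the whole input, leaving an empty string before and after the match. The sketch below puts the two calls side by side; SplitVersusMatch is an illustrative name only.

```csharp
using System.Text.RegularExpressions;

public static class SplitVersusMatch
{
    private static readonly Regex Pattern = new Regex(@"^-(\S+)=(\S+)$");

    // Split returns the text around the match plus every captured group.
    public static string[] UsingSplit(string text)
    {
        return Pattern.Split(text);
    }

    // Match exposes only the captured groups, which is what the question needs.
    public static string[] UsingMatch(string text)
    {
        Match m = Pattern.Match(text);
        return m.Success
            ? new[] { m.Groups[1].Value, m.Groups[2].Value }
            : new string[0];
    }
}
```

For "-action=1", UsingSplit gives the four entries reported in the question ("", "action", "1", ""), while UsingMatch gives just "action" and "1".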
Try this (updated to add Regex.Split):
string victim = "-action=1";
string[] stringSplit = victim.Split("-=".ToCharArray());
string[] regexSplit = Regex.Split(victim, "[-=]");
EDIT: Using your example:
string input = @"-destination=C:\Program Files\Release -action=value";
foreach(Match match in Regex.Matches(input, @"-(?<name>\w+)=(?<value>[^=-]*)"))
{
Console.WriteLine("{0}", match.Value);
Console.WriteLine("\tname = {0}", match.Groups["name" ].Value);
Console.WriteLine("\tvalue = {0}", match.Groups["value"].Value);
}
Console.ReadLine();
Of course, this code has issues if your path contains a - character.
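One possible way around the hyphen problem, offered only as a sketch, is to let a value run lazily until the next " -name=" token or the end of the input. This assumes arguments are separated by whitespace and that no value itself contains such a token; ArgumentParser is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ArgumentParser
{
    // A value runs lazily until the next " -name=" or the end of the input,
    // so hyphens and spaces inside a value are kept.
    private static readonly Regex ArgPattern = new Regex(
        @"-(?<name>\w+)=(?<value>.*?)(?=\s+-\w+=|$)");

    // Maps each argument name to its value; a repeated name keeps the last value.
    public static Dictionary<string, string> Parse(string input)
    {
        var args = new Dictionary<string, string>();
        foreach (Match m in ArgPattern.Matches(input))
        {
            args[m.Groups["name"].Value] = m.Groups["value"].Value;
        }
        return args;
    }
}
```

With an input such as -destination=C:\Program Files\My-App -action=value, the hyphen in My-App stays inside the destination value because it is not preceded by whitespace and followed by name=.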
In .NET Regex, you can name your groups.
string pattern = @"^-(?<MyKey>\S+)=(?<MyValue>\S+)$";
Regex regex = new Regex(pattern);
string myText = "-action=1";
Then do a "Match" and get the values by your group names.
Match theMatch = regex.Match(myText);
if (theMatch.Success)
{
Console.Write(theMatch.Groups["MyKey"].Value); // This is "action"
Console.Write(theMatch.Groups["MyValue"].Value); // This is "1"
}
What's wrong with using string.Split()?
string test = "-action=1";
string[] splitUp = test.Split("-=".ToCharArray());
I admit, though, that this still gives you possibly more entries than you'd like to see in the split-up array...
[0] = ""
[1] = "action"
[2] = "1"
In his talk Regular Expression Mastery, Mark Dominus attributes the following helpful rule to Learning Perl author (and fellow StackOverflow user) Randal Schwartz:
Use capturing or m//g [or regex.Match(...)] when you know what you want to keep.
Use split when you know what you want to throw away.