Fixing RegEx Split() function - Empty string as first entry - c#

upfront the code to visualize a bit the problem I am facing:
This is the text that needs to be split.
:20:0444453880181732
:21:0444453880131350
:22:CANCEL/ABCDEF0131835055
:23:BUY/CALL/E/EUR
:82A:ABCDEFZZ80A
:87A:4444655604
:30:061123
:31G:070416/1000/USNY
:31E:070418
:26F:PRINCIPAL
:32B:EUR1000000,00
:36:1,31000000
:33B:USD1310000,00
:37K:PCT1,60000000
:34P:061127USD16000,00
:57A:ABCDEFZZ80A
This is my Regex
Regex r = new Regex(#"\:\d{2}\w*\:", RegexOptions.Multiline);
MatchCollection matches = r.Matches(Content);
string[] items = r.Split(Content);
// ----- Fix for first entry being empty string.
int index = items[0] == string.Empty ? 1 : 0;
foreach (Match match in matches)
{
MessageField field = new MessageField();
field.FieldIdExtended = match.Value;
field.Content = items[index];
Fields.Add(field);
index++;
}
As you can see from the comments the problem occurs with the splitting of the string.
It returns as first item an empty string.
Is there any elegant way to solve this?
Thanks, Dimi

The reason that you are getting this behaviour is that your first delimiter from the split has nothing before it and this the first entry is blank.
The way to solve this properly is probably to capture the value that you want in the regular expression and then just get it from your match set.
At a rough first guess you probably want something like:
Regex r = new Regex(#"^:(?<id>\d{2}\w*):(?<content>.*)$", RegexOptions.Multiline);
MatchCollection matches = r.Matches(Content);
foreach (Match match in matches)
{
MessageField field = new MessageField();
field.FieldIdExtended = match.Groups["id"].ToString()
field.Content = match.Groups["content"].ToString();
Fields.Add(field);
}
The use of named capture groups makes it easy to extract stuff. You may need to tweak the regex to be more as you want. Currently it gets 20 as id and 0444453880181732 as content. I wasn't 100% clear on what you needed to capture but you look ok with regex so I assume that isn't a problem. :)
Essentially here you are not really trying to split stuff but match stuff and pull it out.

use:
string[] items = r.Split(Content, StringSplitOptions.RemoveEmptyEntries);
to remove empty entries.

Related

I'm having trouble with a multiline regex in C#, how do I fix this?

I have the following code to attempt to extract the content of li tags.
string blah = #"<ul>
<li>foo</li>
<li>bar</li>
<li>oof</li>
</ul>";
string liRegexString = #"(?:.)*?<li>(.*?)<\/li>(?:.?)*";
Regex liRegex = new Regex(liRegexString, RegexOptions.Multiline);
Match liMatches = liRegex.Match(blah);
if (liMatches.Success)
{
foreach (var group in liMatches.Groups)
{
Console.WriteLine(group);
}
}
Console.ReadLine();
The Regex started much simpler and without the multiline option, but I've been tweaking it to try to make it work.
I want results foo, bar and oof but instead I get <li>foo</li> and foo.
On top of this I it seems to work fine in Regex101, https://regex101.com/r/jY6rnz/1
Any thoughts?
I will start by saying that I think as mentioned in comments you should be parsing HTML with a proper HTML parser such as the HtmlAgilityPack. Moving on to actually answer your question though...
The problem is that you are getting a single match because liRegex.Match(blah); only returns a single match. What you want is liRegex.Matches(blah) which will return all matches.
So your use would be:
var liMatches = liRegex.Matches(blah);
foreach(Match match in liMatches)
{
Console.WriteLine(match.Groups[1].Value);
}
Your regex produces multiple matches when matched with blah. The method Match only returns the first match, which is the foo one. You are printing all groups in that first match. That will get you 1. the whole match 2. group 1 of the match.
If you want to get foo and bar, then you should print group 1 of each match. To do this you should get all the matches using Matches first. Then iterate over the MatchCollection and print Groups[1]:
string blah = #"<ul>
<li>foo</li>
<li>bar</li>
<li>oof</li>
</ul>";
string liRegexString = #"(?:.)*?<li>(.*?)<\/li>(?:.?)*";
Regex liRegex = new Regex(liRegexString, RegexOptions.Multiline);
MatchCollection liMatches = liRegex.Matches(blah);
foreach (var match in liMatches.Cast<Match>())
{
Console.WriteLine(match.Groups[1]);
}

How can I split a regex into exact words?

I need a little help regarding Regular Expressions in C#
I have the following string
"[[Sender.Name]]\r[[Sender.AdditionalInfo]]\r[[Sender.Street]]\r[[Sender.ZipCode]] [[Sender.Location]]\r[[Sender.Country]]\r"
The string could also contain spaces and theoretically any other characters. So I really need do match the [[words]].
What I need is a text array like this
"[[Sender.Name]]",
"[[Sender.AdditionalInfo]]",
"[[Sender.Street]]",
// ... And so on.
I'm pretty sure that this is perfectly doable with:
var stringArray = Regex.Split(line, #"\[\[+\]\]")
I'm just too stupid to find the correct Regex for the Regex.Split() call.
Anyone here that can tell me the correct Regular Expression to use in my case?
As you can tell I'm not that experienced with RegEx :)
Why dont you split according to "\r"?
and you dont need regex for that just use the standard string function
string[] delimiters = {#"\r"};
string[] split = line.Split(delimiters,StringSplitOptions.None);
Do matching if you want to get the [[..]] block.
Regex rgx = new Regex(#"\[\[.*?\]\]");
foreach (Match m in rgx.Matches(input))
Console.WriteLine(m.Groups[0].Value);
IDEONE
The regex you are using (\[\[+\]\]) will capture: literal [s 2 or more, then 2 literal ]s.
A regex solution is capturing all the non-[s inside doubled [ and ]s (and the string inside the brackets should not be empty, I guess?), and cast MatchCollection to a list or array (here is an example with a list):
var str = "[[Sender.Name]]\r[[Sender.AdditionalInfo]]\r[[Sender.Street]]\r[[Sender.ZipCode]] [[Sender.Location]]\r[[Sender.Country]]\r";
var rgx22 = new Regex(#"\[\[[^]]+?\]\]");
var res345 = rgx22.Matches(str).Cast<Match>().ToList();
Output:

what is a good pattern to processes each individual regex match through a method

I'm trying to figure out a pattern where I run a regex match on a long string, and each time it finds a match, it runs a replace on it. The thing is, the replace will vary depending on the matched value. This new value will be determined by a method. For example:
var matches = Regex.Match(myString, myPattern);
while(matches.Success){
Regex.Replace(myString, matches.Value, GetNewValue(matches.Groups[1]));
matches = matches.NextMatch();
}
The problem (i think) is that if I run the Regex.Replace, all of the match indexes get messed up so the result ends up coming out wrong. Any suggestions?
If you replace each pattern with a fixed string, Regex.replace does that for you. You don't need to iterate the matches:
Regex.Replace(myString, myPattern, "replacement");
Otherwise, if the replacement depends upon the matched value, use the MatchEvaluator delegate, as the 3rd argument to Regex.Replace. It receives an instance of Match and returns string. The return value is the replacement string. If you don't want to replace some matches, simply return match.Value:
string myString = "aa bb aa bb";
string myPattern = #"\w+";
string result = Regex.Replace(myString, myPattern,
match => match.Value == "aa" ? "0" : "1" );
Console.WriteLine(result);
// 0 1 0 1
If you really need to iterate the matches and replace them manually, you need to start replacement from the last match towards the first, so that the index of the string is not ruined for the upcoming matches. Here's an example:
var matches = Regex.Matches(myString, myPattern);
var matchesFromEndToStart = matches.Cast<Match>().OrderByDescending(m => m.Index);
var sb = new StringBuilder(myString);
foreach (var match in matchesFromEndToStart)
{
if (IsGood(match))
{
sb.Remove(match.Index, match.Length)
.Insert(match.Index, GetReplacementFor(match));
}
}
Console.WriteLine(sb.ToString());
Just be careful, that your matches do not contain nested instances. If so, you either need to remove matches which are inside another match, or rerun the regex pattern to generate new matches after each replacement. I still recommend the second approach, which uses the delegates.
If I understand your question correctly, you want to perform a replace based on a constant Regular Expression, but the replacement text you use will change based on the actual text that the regex matches on.
The Captures property of the Match Class (not the Match method) returns a collection of all the matches with your regex within the input string. It contains information like the position within the string, the matched value and the length of the match. If you iterate over this collection with a foreach loop you should be able to treat each match individually and perform some string manipulations where you can dynamically modify the replacement value.
I would use something like
Regex regEx = new Regex("some.*?pattern");
string input = "someBLAHpattern!";
foreach (Match match in regEx.Matches(input))
{
DoStuffWith(match.Value);
}

Regex to strip characters except given ones?

I would like to strip strings but only leave the following:
[a-zA-Z]+[_a-zA-Z0-9-]*
I am trying to output strings that start with a character, then can have alphanumeric, underscores, and dashes. How can I do this with RegEx or another function?
Because everything in the second part of the regex is in the first part, you could do something like this:
String foo = "_-abc.!##$5o993idl;)"; // your string here.
//First replace removes all the characters you don't want.
foo = Regex.Replace(foo, "[^_a-zA-Z0-9-]", "");
//Second replace removes any characters from the start that aren't allowed there.
foo = Regex.Replace(foo, "^[^a-zA-Z]+", "");
So start out by paring it down to only the allowed characters. Then get rid of any allowed characters that can't be at the beginning.
Of course, if your regex gets more complicated, this solution falls apart fairly quickly.
Assuming that you've got the strings in a collection, I would do it this way:
foreach element in the collection try match the regex
if !success, remove the string from the collection
Or the other way round - if it matches, add it to a new collection.
If the strings are not in a collection can you add more details as to what your input looks like ?
If you want to pull out all of the identifiers matching your regular expression, you can do it like this:
var input = " _wontmatch f_oobar0 another_valid ";
var re = new Regex( #"\b[a-zA-Z][_a-zA-Z0-9-]*\b" );
foreach( Match match in re.Matches( input ) )
Console.WriteLine( match.Value );
Use MatchCollection matchColl = Regex.Matches("input string","your regex");
Then use:
string [] outStrings = new string[matchColl.Count]; //A string array to contain all required strings
for (int i=0; i < matchColl.Count; i++ )
outStrings[i] = matchColl[i].ToString();
You will have all the required strings in outStrings. Hope this helps.
Edited
var s = Regex.Matches(input_string, "[a-z]+(_*-*[a-z0-9]*)*", RegexOptions.IgnoreCase);
string output_string="";
foreach (Match m in s)
{
output_string = output_string + m;
}
MessageBox.Show(output_string);

Regex Problems, extracting data to groups

How I love regex!
I have a string which will be a mangled form of XML, like:
<Category>DIR</Category><Location>DL123A</Location><Reason>Because</Reason><Qty>42</Qty><Description>Some Desc</Description><IPAddress>127.0.0.1</IPAddress>
Everything will all be on one line, however the 'headers' will often be different.
So what I need to do is extract all information from the string above, putting it into a Dictionary/Hashtable
--
string myString = #"<Category>DIR</Category><Location>DL123A</Location><Reason>Because</Reason><Qty>42</Qty><Description>Some Desc</Description><IPAddress>127.0.0.1</IPAddress>";
//this will extract the name of the label in the header
Regex r = new Regex(#"(?<header><[A-Za-z]+>?)");
//Create a collection of matches
MatchCollection mc = r.Matches(myString);
foreach (Match m in mc)
{
headers.Add(m.Groups["header"].Value);
}
//this will try and get the values.
r = new Regex(#"(?'val'>[A-Za-z0-9\s]*</?)");
mc = r.Matches(myString);
foreach (Match m in mc)
{
string match = m.Groups["val"].Value;
if (string.IsNullOrEmpty(match) || match == "><" || match == "> <")
continue;
else
values.Add(match);
}
--
I hacked that together from previous work with regexes to the closest I could.
But it doesnt really work the way I want it.
the 'header' also pulls the angle brackets in.
The 'value' pulls in a lot of empties (hence the dodgy if statement in the loop). It also doesnt work on strings with periods, commas, spaces, etc.
It would also be much better if I could combine the two statements so I dont have to loop through the regex twice.
Can anyone give me some info where I can improve it?
If it looks like XML, why not use the XML parser functionalities of .net? All you need to do is to add a root element around it:
string myString = #"<Category>DIR</Category><Location>DL123A</Location><Reason>Because</Reason><Qty>42</Qty><Description>Some Desc</Description><IPAddress>127.0.0.1</IPAddress>";
var values = new Dictionary<string, string>();
var xml = XDocument.Parse("<root>" + myString + "</root>");
foreach(var e in xml.Root.Elements()) {
values.Add(e.Name.ToString(), e.Value);
}
This should strip the angle brackets:
Regex r = new Regex(#"<(?<header>[A-Za-z]+)>");
and this should get rid of empty spaces:
r = new Regex(#">\s*(?'val'[A-Za-z0-9\s]*)\s*</");
This will match the headers without <>:
(?<=<)(?<header>[A-Za-z]+)(?=>)
This will get all values (i'm not sure about what can be accepted as a value) :
(?<=>)(?'val'[^<]*)(?=</)
However this is all xml so You can :
XDocument doc = XDocument.Parse(string.Format("<root>{0}</root>",myString));
var pairs = doc.Root.Descendants().Select(node => new KeyValuePair<string, string>(node.Name.LocalName, node.Value));

Categories