Capture two blocks in a string - c#

I have a string that's in this format:
Message: Something bad happened in This.Place < Description> Some sort of information here< /Description>< Error> Some other stuff< /Error>< Message> Some message here.
I can't seem to figure out how to match everything in the Description block and also everything in the Message block using regex.
My question is in two parts: 1.) Is regex the right choice for this?
2.) If so, how can I match those two blocks and exclude the rest?
I can match the first part with a simple < Description>.*< /Description>, but can't match < Message>. I've tried excluding everything inbetween by trying to use what's described here http://blog.codinghorror.com/excluding-matches-with-regular-expressions/

With all the disclaimers about parsing XML with regex, it's still good to know how to do this with regex.
For instance, if you had your back against the wall, this would work for the < Description> tag (adapt it for the other tag).
(?<=< Description>).*?(?=< /Description>)
Some things you need to know:
The (?<=< Description>) is a lookbehind that asserts that at that position in the string, what precedes is < Description>. So if you change the spaces in your tag, all bets are off. To handle potential typing errors (depending on the origin of your text), you can insert optional spaces: (?<=< *Description *>) where the * repeats the space character zero or more times. The lookbehind is only an assertion, it does not consume any characters.
The .*? lazily eats up all characters until it can find what follows...
Which is the (?=< /Description>) lookahead that asserts that at that position in the string, what follows is < /Description>
In code, this becomes something like:
description = Regex.Match(yourstring, "(?<=< *Description *>).*?(?=< */Description *>)").Value;
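The lookbehind/lookahead pair above can be wrapped in a small helper so that both blocks are pulled out the same way. This is only a sketch: TagExtractor and Extract are hypothetical names, and it assumes that a block without a closing tag (like < Message> in the sample) runs to the end of the string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // Hypothetical helper: returns the text after "< name>" up to "< /name>",
    // or up to the end of the input when no closing tag is present.
    public static string Extract(string input, string name)
    {
        string tag = Regex.Escape(name);
        // Lookbehind for the opening tag, lazy match, lookahead for the closing tag or the end.
        string pattern = "(?<=< *" + tag + " *>).*?(?=< */" + tag + " *>|$)";
        Match m = Regex.Match(input, pattern, RegexOptions.Singleline);
        return m.Success ? m.Value.Trim() : null;
    }

    public static void Main()
    {
        string text = "Message: Something bad happened in This.Place < Description> Some"
            + " sort of information here< /Description>< Error> Some other stuff"
            + "< /Error>< Message> Some message here.";
        Console.WriteLine(Extract(text, "Description")); // Some sort of information here
        Console.WriteLine(Extract(text, "Message"));     // Some message here.
    }
}
```

The leading "Message:" is not picked up because it is not followed by ">", so the lookbehind only succeeds after the real < Message> tag.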

This is how I'd parse it. Caveat: I've written the regex assuming the format shown in the example you've provided is pretty rigid; if the data varies a little (say, there isn't always a space after the '<' characters), you'll need to tweak it a little. But this should get you going.
var text = "Message: Something bad happened in This.Place < Description> Some"+
" sort of information here< /Description>< Error> Some other stuff"+
"< /Error>< Message> Some message here.";
var regex = new Regex(
"^.*?<\\sDescription\\>(?<description>.*?)<\\s/Description\\>"+
".*?<\\sMessage\\>(?<message>.*?)$",
RegexOptions.IgnoreCase | RegexOptions.Singleline
);
var matches = regex.Match(text);
if (matches.Success) {
var desc = matches.Groups["description"].Value;
// " Some sort of information here"
var msg = matches.Groups["message"].Value;
// " Some message here."
}

It was fairly difficult to try to remove the non-XML-formatted data from the text, so IndexOf and Substring ended up being what I used. IndexOf will find the index of a specified character or string, and Substring captures characters based on a starting point and a count of how many it should capture.
foreach (string j in errorList)
{
    // IndexOf returns the position of the tag's first character (or -1 if the tag is missing)
    int descriptionBegin = j.IndexOf("<Description>") + 13; // starts after the opening tag
    int descriptionEnd = j.IndexOf("</Description>"); // ends where the closing tag begins
    int messageBegin = j.IndexOf("<Message>") + 9; // starts after the opening tag
    int messageEnd = j.IndexOf("</Message>"); // ends where the closing tag begins
    int descriptionDiff = descriptionEnd - descriptionBegin; // amount of chars between tags
    int messageDiff = messageEnd - messageBegin; // amount of chars between tags
    string description = j.Substring(descriptionBegin, descriptionDiff); // grabs only specified amt of chars
    string message = j.Substring(messageBegin, messageDiff); // grabs only specified amt of chars
}
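The same IndexOf/Substring idea can be factored into one reusable method that also copes with a missing tag. This is a minimal sketch; TextSlicer and Between are hypothetical names, and the sample string in Main is made up for illustration.

```csharp
using System;

public static class TextSlicer
{
    // Hypothetical helper: returns the text between the first `open` marker and the
    // next `close` marker after it, or null when either marker is missing.
    public static string Between(string s, string open, string close)
    {
        int start = s.IndexOf(open, StringComparison.Ordinal);
        if (start < 0) return null;
        start += open.Length; // move past the opening marker
        int end = s.IndexOf(close, start, StringComparison.Ordinal);
        if (end < 0) return null;
        return s.Substring(start, end - start);
    }

    public static void Main()
    {
        // Made-up input in the tag format the loop above expects.
        string j = "prefix <Description>Disk full</Description><Message>Retry later</Message>";
        Console.WriteLine(Between(j, "<Description>", "</Description>")); // Disk full
        Console.WriteLine(Between(j, "<Message>", "</Message>"));         // Retry later
    }
}
```

Using open.Length instead of a hard-coded 13 or 9 keeps the offsets right if the tag names change.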
Thanks @Lucius for the suggestion.
@Darryl that actually looks like it might work. Thanks for the thorough answer...I might try that out for other stuff in the future (non-XML of course :))

Related

Replacing first occurrence of a word enclosed in spaces in a string

// There are some similar questions on SO, but none of them seem to cover both replacing a whole word (enclosed in spaces) and replacing only its first occurrence. Doing both at the same time is what is causing me problems.
I want to replace the first occurrence of a word surrounded by spaces and I'm running into some problems.
I have a long string in range.Text. I want to find words like "#val1", "#val2" etc. and replace them with values from my values list. Here is how I do that:
while(i < valueCount && range.Text.Contains("#val"))
{
for (int j = 0; j < valueLimit; j++)
{
string pattern = $@"\b#val{ j + 1 }\b";
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
Match match = regex.Match(range.Text);
if (match.Success)
{
range.Text = regex.Replace(range.Text, values[j], 1);
i++;
}
}
}
Now the problem is that for some reason match.Success is never true, even though I'm pretty sure that there's plenty of values like those I search for in it.
// Example string -
"1\t#val1\r2\t#val2\r3\t#val3\r4\t#val4\r5\t#val5\r6\t#val6\r7\t#val7\r8\t#val8\r9\t#val9\r10\t#val10\r11\t#val11\r12\t#val12\r13\t#val13\r14\t#val14\r15\t#val15\r\r"
The \t s and \r s I expect to be ignored, but spaces are what is important to me. Otherwise I'll have #val110 replaced when loop is at #val11 or #val10. Two vals will never be separated with just a tab. They will always be enclosed in two spaces in the long string.
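The issue appears to be the leading \b in your pattern. \b only matches between a word character and a non-word character (or at the start or end of the string). # is a non-word character, and in your string it is preceded by a tab or a space, which are also non-word characters, so there is no word boundary in front of #val and the match always fails.
The trailing one is essential so that #val1 doesn't also incorrectly match #val10, but I'm not seeing what the leading one is for, and it's causing the match to fail.
Try changing:
string pattern = $@"\b#val{ j + 1 }\b";
to
string pattern = $@"#val{ j + 1 }\b";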
Other than that, the code seems to achieve what you describe.
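If the placeholders really must be delimited by whitespace on both sides, one option (an assumption on my part, not something the question's code uses) is to replace \b with the lookarounds (?<!\S) and (?!\S), which only require that no non-whitespace character sits next to the match. A minimal sketch, with BoundaryDemo and ReplaceFirst as hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Shows the original problem: \b cannot match between whitespace and '#'.
    public static bool LeadingBoundaryMatches(string input)
    {
        return Regex.IsMatch(input, @"\b#val1\b");
    }

    // Hypothetical helper: replaces only the first whitespace-delimited "#val<n>".
    public static string ReplaceFirst(string input, int n, string value)
    {
        var regex = new Regex(@"(?<!\S)#val" + n + @"(?!\S)", RegexOptions.IgnoreCase);
        // A MatchEvaluator keeps '$' in the value from being read as a substitution.
        return regex.Replace(input, m => value, 1);
    }

    public static void Main()
    {
        string text = "1\t#val1\r2\t#val10\r";
        Console.WriteLine(LeadingBoundaryMatches(text));       // False
        Console.WriteLine(ReplaceFirst(text, 1, "X") == "1\tX\r2\t#val10\r"); // True
    }
}
```

Because (?!\S) fails when a digit follows, #val1 is not matched inside #val10.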
Hope this helps

C#: Remove Excess Text From String

Okay, so after looking around here on SO, I have found a solution that meets about 95% of my requirement, although I believe it may need to be redone at this point.
ISSUE
Say I have a value range supplied as "1000 - 1009 ABC1 ABC SOMETHING ELSE" where I just need the 1000 - 1009 part. I need to be able to remove excess characters from the string supplied, even if they truly are accepted characters, but only if they are part of secondary strings with text. (Sorry if that description seems odd, my mind isn't full power today.)
CURRENT SOLUTION
I currently have a simple method utilizing Linq to return only accepted characters; however, this returns "1000-10091" (the 1 from ABC1 is kept and the spaces are dropped), which is not the range I need. I've thought about looping through the string's individual characters and comparing each to the previous character as I go, using IsDigit and IsLetter to my advantage, but then comes the issue of replacing the unacceptable characters or removing them. I think if I gave it a day or two I could figure it out with a clear mind, but it needs to be done by the end of the day, and I am banging my head against the keyboard.
void RemoveExcessText(ref string val) {
string allowedChars = "0123456789-+>";
val = new string(val.Where(c => allowedChars.Contains(c)).ToArray());
}
// Alternatively?
char previousChar = ' ';
for (int i = 0; i < val.Length; i++) {
if (char.IsLetter(val[i])) {
previousChar = val[i];
val.Remove(i, 1);
} else if (char.IsDigit(val[i])) {
if (char.IsLetter(previousChar)) {
val.Remove(i, 1);
}
}
}
But how do I account for white space and leave in the +, -, and > characters? I am losing my mind on this one today.
Why not use a regular expression?
Regex.Match("1000 - 1009 ABC1 ABC SOMETHING ELSE", @"^(\d+)([\s\-]+)(\d+)");
Should give you what you want
I made a fiddle
You use a regular expression with a capturing group:
Regex r = new Regex("^(?<v>[-0-9 ]+)");
This means "from the start of the input string (^) match [0 to 9 or space or hyphen], keep going for as many occurrences of these characters as are available (+), and store the result in the named group v (?<v>...)". Note that the quantifier has to be the greedy +; a lazy +? would stop after a single character. The capture also includes the trailing space before ABC1, so Trim() it if that matters.
We get it out like this:
r.Matches(input)[0].Groups["v"].Value
Note though that if the input string doesn't match, the match collection will be empty and indexing it with [0] will throw. You might want to make it more robust with some extra error checking:
MatchCollection mc = r.Matches(input);
if (mc.Count > 0)
    MessageBox.Show(mc[0].Groups["v"].Value);
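For a single expected match, Regex.Match together with Match.Success avoids indexing into a collection at all. The sketch below assumes the range is always two runs of digits separated by a hyphen with optional spaces (the + and > characters from the question are not handled); RangeParser and ExtractRange are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeParser
{
    // Hypothetical helper: returns "start - end" from the front of the input,
    // or null when the input does not begin with a numeric range.
    public static string ExtractRange(string input)
    {
        Match m = Regex.Match(input, @"^\s*(\d+)\s*-\s*(\d+)");
        if (!m.Success) return null;
        return m.Groups[1].Value + " - " + m.Groups[2].Value;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractRange("1000 - 1009 ABC1 ABC SOMETHING ELSE")); // 1000 - 1009
        Console.WriteLine(ExtractRange("ABC 1000 - 1009") == null);             // True
    }
}
```

Rebuilding the result from the two digit groups also normalizes the spacing around the hyphen.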
You could match this with a regular expression. \d{1,4} means match a decimal digit at least once and up to 4 times. It is followed by space, hyphen, space, and 1 to 4 digits again, then anything else. Only the part inside the parentheses is output in your results.
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var pattern = @"(^\d{1,4} - \d{1,4}).*";
string input = ("1000 - 1009 ABC1 ABC SOMETHING ELSE");
string replacement = "$1";
string result = Regex.Replace(input, pattern, replacement);
Console.WriteLine(result);
}
}
https://dotnetfiddle.net/cZGlX4

Using Regex.Replace to keep characters that can vary

I have the following:
string text = "version=\"1,0\"";
I want to replace the comma with a dot while keeping the 1 and 0, BUT keeping in mind that they may be different in different situations! It could be version="2,3".
The smart ass and noob-unworking way to do it would be:
for (int i = 0; i <= 9; i++)
{
for (int z = 0; z <= 9; z++)
{
text = Regex.Replace(text, "version=\"i,z\"", "version=\"i.z\"");
}
}
But of course.. it's a string, and I don't want i and z to be treated as literal text in there.
I could also try the lame but working way:
text = Regex.Replace(text, "version=\"1,", "version=\"1.");
text = Regex.Replace(text, "version=\"2,", "version=\"2.");
text = Regex.Replace(text, "version=\"3,", "version=\"3.");
And so on.. but it would be lame.
Any hints on how to single-handedly handle this?
Edit: I have other commas that I don't wanna replace, so text.Replace(",",".") can't do
You need a regex like this to locate the comma
Regex reg = new Regex("(version=\"[0-9]),([0-9]\")");
Then do the replacement:
text = reg.Replace(text, "$1.$2");
You can use $1, $2, etc. to refer to the matching groups in order.
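(?<=version=")(\d+),
You can try this. See the demo. Replace with "$1." (that is, the captured digits followed by a literal dot), so the comma is swapped for a dot rather than just removed.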
https://regex101.com/r/sJ9gM7/52
You can perhaps use capture groups to keep the numbers in front and after for replacement afterwards for a more 'traditional way' to do it:
string text = "version=\"1,0\"";
var regex = new Regex(@"version=""(\d*),(\d*)""");
var result = regex.Replace(text, "version=\"$1.$2\"");
Using parens like the above in a regex is to create a capture group (so the matched part can be accessed later when needed) so that in the above, the digits before and after the comma will be stored in $1 and $2 respectively.
But I decided to delve a little bit further and let's consider the case if there are more than one comma to replace in the version, i.e. if the text was version="1,1,0". It would actually be tedious to do the above, and you would have to make one replace for each 'type' of version. So here's one solution that is sometimes called a callback in other languages (not a C# dev, but I fiddled around lambda functions and it seems to work :)):
private static string SpecialReplace(string text)
{
var result = text.Replace(',', '.');
return result;
}
public static void Main()
{
string text = "version=\"1,0,0\"";
var regex = new Regex(@"version=""[\d,]*""");
var result = regex.Replace(text, x => SpecialReplace(x.Value));
Console.WriteLine(result);
}
The above gives version="1.0.0".
@"version=""[\d,]*""" will first match any sequence of digits and commas within version="...", then pass it to the next line for the replace.
The replace takes the matched text, passes it to the lambda function which takes it to the function SpecialReplace, where a simple text replace is carried out only on the matched part.
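The callback can also be written inline as a lambda, which makes it easy to check that commas outside version="..." are left alone (the constraint from the question's edit). A minimal sketch, with VersionFixer and FixVersion as hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class VersionFixer
{
    // Hypothetical helper: swaps commas for dots only inside version="...".
    public static string FixVersion(string text)
    {
        // The lambda is converted to a MatchEvaluator and runs once per match.
        return Regex.Replace(text, @"version=""[\d,]*""", m => m.Value.Replace(',', '.'));
    }

    public static void Main()
    {
        string text = "a,b version=\"1,0,0\" c,d";
        Console.WriteLine(FixVersion(text)); // a,b version="1.0.0" c,d
    }
}
```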
ideone demo

How to find repeatable characters

I can't understand how to solve the following problem:
I have input string "aaaabaa" and I'm trying to search for string "aa" (I'm looking for positions of characters)
Expected result is
0 1 2 5
aa aabaa
a aa abaa
aa aa baa
aaaab aa
This problem is already solved by me using another approach (non-RegEx).
But I need a RegEx I'm new to RegEx so google-search can't help me really.
Any help appreciated! Thanks!
P.S.
I've tried to use (aa)* and "\b(\w+(aa))*\w+" but those expressions are wrong
You can solve this by using a lookahead
a(?=a)
will find every "a" that is followed by another "a".
If you want to do this more generally
(\p{L})(?=\1)
This will find every letter that is followed by the same letter. Each letter found is stored in a capturing group (because of the parentheses around it); the positive lookahead assertion (the (?=...)) then reuses that group through the backreference \1, which holds the character that was just matched.
\p{L} matches any Unicode code point in the category "letter".
Code
String text = "aaaabaa";
Regex reg = new Regex(#"(\p{L})(?=\1)");
MatchCollection result = reg.Matches(text);
foreach (Match item in result) {
Console.WriteLine(item.Index);
}
Output
0
1
2
5
The following code should work with any regular expression without having to change the actual expression:
Regex rx = new Regex(@"(a)\1"); // or any other word you're looking for; the verbatim string keeps \1 as a backreference
int position = 0;
string text = "aaaaabbbbccccaaa";
int textLength = text.Length;
Match m = rx.Match(text, position);
while (m != null && m.Success)
{
Console.WriteLine(m.Index);
if (m.Index <= textLength)
{
m = rx.Match(text, m.Index + 1);
}
else
{
m = null;
}
}
Console.ReadKey();
It uses the option to change the start index of a regex search for each consecutive search. The actual problem comes from the fact that the Regex engine, by default, will always continue searching after the previous match. So it will never find a possible match within another match, unless you instruct it to by using a Look ahead construction or by manually setting the start index.
Another, relatively easy, solution is to just stick the whole expression in a lookahead:
string expression = @"(a)\1";
Regex rx2 = new Regex("(?=" + expression + ")");
MatchCollection ms = rx2.Matches(text);
var indexes = ms.Cast<Match>().Select(match => match.Index);
That way the engine will automatically advance the index by one for every match it finds.
From the docs:
When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.
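Put together, the lookahead approach for a literal search word might look like the sketch below. OverlapFinder and OverlappingIndexes are hypothetical names, and Regex.Escape is used on the assumption that the search term is plain text rather than a pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class OverlapFinder
{
    // Hypothetical helper: returns every start index of `word` in `text`,
    // including occurrences that overlap each other.
    public static List<int> OverlappingIndexes(string text, string word)
    {
        // The lookahead matches an empty string, so the engine advances one
        // character at a time and can report overlapping positions.
        var rx = new Regex("(?=" + Regex.Escape(word) + ")");
        return rx.Matches(text).Cast<Match>().Select(m => m.Index).ToList();
    }

    public static void Main()
    {
        List<int> hits = OverlappingIndexes("aaaabaa", "aa");
        Console.WriteLine(string.Join(" ", hits)); // 0 1 2 5
    }
}
```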
Try this:
How can I find repeated characters with a regex in Java?
It is in java, but the regex and non-regex way is there. C# Regex is very similar to the Java way.

C# splitting a string into three pieces

Hello everybody, I asked this question a few hours ago: C# get username from string. split
Now I have a more difficult problem. I'm trying to get Acid, Player and m249 from this string:
L 02/28/2012 - 06:14:22: "Acid<1><VALVE_ID_PENDING><CT>"
killed "Player<2><VALVE_ID_PENDING><TERRORIST>" with "m249"
I tried this
int start = Data.ToString().IndexOf('"') + 1;
int end = Data.ToString().IndexOf('<');
var Killer = Data.ToString().Substring(start, end - start);
int start1 = Data.ToString().IndexOf("killed") + 1;
int end1 = Data.ToString().IndexOf('<') + 4;
var Victim = Data.ToString().Substring(start1, end1 - start1);
but it throws this exception on the last line:
Length cannot be less than zero.
Parameter name: length
Is it possible to get both player names and the last string (m249)?
Thanks
Here is a simple example of how you can do it with regex. Depending on how much the string varies, this one may work for you. I'm assuming that quotes (") are consistent as well as the text between them. You'll need to add this line at the top:
using System.Text.RegularExpressions;
Code:
string input = "L 02/28/2012 - 06:14:22: \"Acid<1><VALVE_ID_PENDING><CT>\" killed \"Player<2><VALVE_ID_PENDING><TERRORIST>\" with \"m249\"";
Regex reg = new Regex("[^\"]+\"([^<]+)<[^\"]+\" killed \"([A-Za-z0-9]+)[^\"]+\" with \"([A-Za-z0-9]+)\"");
Match m = reg.Match(input);
if (m.Success)
{
string player1 = m.Groups[1].ToString();
string player2 = m.Groups[2].ToString();
string weapon = m.Groups[3].ToString();
}
The syntax breakdown for the regex is this:
[^\"]+
means, go till we hit a double quote (")
\"
means take the quote as the next part of the string, since the previous term brings us to it, but doesn't go past it.
([^<]+)<
The parentheses mean we are interested in the results of this part; we will seek till we hit a less than (<). Since this is the first "group" we're looking to extract, it's referred to as Groups[1] in the match. Again we include the character we were searching for, to consume it and continue our search.
[^\"]+\" killed \"
This will again search, without keeping the results due to no parenthesis, till we hit the next quote mark. We then manually specify the string of (" killed ") since we're interested in what's after that.
([A-Za-z0-9]+)
This will capture any characters for our Groups[2] result that are alphanumeric, upper or lowercase.
[^\"]+\"
Search and ignore the rest till we hit the next double quote
with \"
Another literal string that we're using as a marker
([A-Za-z0-9]+)
Same as above, return alphanumeric characters as our Groups[3] thanks to the parentheses
\"
End it off with the last quote.
Hopefully this explains it. A google for "Regular Expressions Cheat Sheet" is very useful for remembering these rules.
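A variant of the same idea with named groups can make the extraction easier to read. This is a sketch under the same assumption that the quotes and the " killed " / " with " markers are consistent; KillLogParser, TryParse and the group names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KillLogParser
{
    // Each name runs up to the first '<'; the rest of the quoted block is skipped.
    private static readonly Regex KillLine = new Regex(
        @"""(?<killer>[^<""]+)<[^""]*"" killed ""(?<victim>[^<""]+)<[^""]*"" with ""(?<weapon>[^""]+)""");

    // Hypothetical helper: returns false (and nulls) when the line is not a kill entry.
    public static bool TryParse(string line, out string killer, out string victim, out string weapon)
    {
        Match m = KillLine.Match(line);
        killer = m.Success ? m.Groups["killer"].Value : null;
        victim = m.Success ? m.Groups["victim"].Value : null;
        weapon = m.Success ? m.Groups["weapon"].Value : null;
        return m.Success;
    }

    public static void Main()
    {
        string input = "L 02/28/2012 - 06:14:22: \"Acid<1><VALVE_ID_PENDING><CT>\" killed "
            + "\"Player<2><VALVE_ID_PENDING><TERRORIST>\" with \"m249\"";
        if (TryParse(input, out string killer, out string victim, out string weapon))
        {
            Console.WriteLine(killer + " / " + victim + " / " + weapon); // Acid / Player / m249
        }
    }
}
```

Unlike [A-Za-z0-9]+, the [^<"]+ classes also accept names containing spaces or punctuation, which may or may not be what you want.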
Should be super easy to parse. I recognized that it was CS. Take a look at Valve's documentation here:
https://developer.valvesoftware.com/wiki/HL_Log_Standard#057._Kills
Update:
If you're not comfortable with regular expressions, this implementation will do what you want as well and is along the lines of what you attempted to do:
public void Parse(string killLog)
{
string[] parts = killLog.Split(new[] { " killed ", " with " }, StringSplitOptions.None);
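int nameStart = parts[0].IndexOf('"') + 1; // skip the timestamp prefix and the opening quote
string player1 = parts[0].Substring(nameStart, parts[0].IndexOf('<') - nameStart);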
string player2 = parts[1].Substring(1, parts[1].IndexOf('<') - 1);
string weapon = parts[2].Replace("\"", "");
}
Personally, I would use a RegEx.
