I'm a bit lost creating a regex in C#/.NET.
I'm writing something like a parser, so I use Regex.Replace to search text for certain "variables" and replace them with their "values".
Each variable starts with an ampersand ("&") and ends with an ampersand (the beginning of another variable) or a dot.
Each variable (as well as the text surrounding variables) can only consist of alphanumeric characters and certain "special" characters, namely "$", "#", "@" and "-".
Neither the variables nor the rest of the text can contain space characters (" ").
Now, the problem is that I'm trying to figure out a regex that replaces one possible ending character (".") while not replacing the other possible ending character ("&").
This turns out to be quite an issue:
"&"+variable+"[^A-Za-z0-9#@$]" does what I want, except that it also replaces "&", which is not acceptable.
"&"+variable+"(.)?\b" replaces the dot, but only if it is followed by a word character, not if it is followed by one of & # @ $ -, and that could occur, so this doesn't work either.
"&"+variable+"(.)?(?!A-Za-z0-9)" does exactly what I want as far as the ending characters go, except that it doesn't recognize the true end of a variable: a search-and-replace for "&DEN" also replaces that part of another variable called "&DENV", of which "&DEN" is a prefix. This would create false or misleading results, which is totally unacceptable.
These were all the possibilities I could think of (and search for). Is it possible to do what I need with a single regex at all, using the .NET regex engine?
Just to illustrate desired function:
string variable="DEN";
string replaceWith="28";
string replText;
string regex = "<desired regex>";
replText = Regex.Replace(replText, "&"+variable+regex, replaceWith);
replText="&DEN";
=> replaced => repltext=="28"
replText="&DENV"
=> not replaced => repltext=="&DENV"
replText="&DEN&DEN"
=> replaced => repltext=="2828"
replText="&DEN&DENV"
=> replaced, not replaced => repltext=="28&DENV"
replText="&DEN.anything"
=> replaced and dot removed => repltext=="28anything"
replText="&DEN..anything"
=> replaced and first dot removed => repltext=="28.anything"
The variable could also be something like "#DE#N-$".
The following works correctly on all of your examples. I assumed that a variable &FOO should only be replaced if it's followed by ., &, or end-of-string $. If it's followed by anything else, it's not replaced.
In order to match but not consume a terminating &, I used a lookahead assertion (?=&). A lookahead requires the text at that position to match, but it doesn't consume any characters, so those characters aren't replaced. A trailing . is still consumed and replaced as part of the match, however.
Finally, a MatchEvaluator is specified to use the captured pattern to do a lookup in the replacements dictionary for the replacement value. If the pattern (variable name) is not found, the text is effectively untouched (the full original capture is returned).
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
    static string ReplaceVariables(Dictionary<string, string> replacements, string input)
    {
        // Group 1 is the variable name; the tail consumes a dot, or only checks for & or end of input.
        return Regex.Replace(input, @"&([\w\d$#@-]+)(\.|(?=&)|$)", m =>
        {
            string replacement = null;
            return replacements.TryGetValue(m.Groups[1].Value, out replacement)
                ? replacement
                : m.Groups[0].Value;
        });
    }
    static void Main(string[] args)
    {
        string[] tests = new[]
        {
            "&DEN", "&DENV", "&DEN&DEN",
            "&DEN&DENV", "&DEN.anything",
            "&DEN..anything", "&DEN Foo",
            "&DEN&FOO&DEN"
        };
        var replace = new Dictionary<string, string>
        {
            { "DEN", "28" },
            { "FOO", "42" }
        };
        foreach (var test in tests)
        {
            Console.WriteLine("{0} -> {1}", test, ReplaceVariables(replace, test));
        }
    }
}
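For the single-variable form shown in the question, the same idea can be sketched without a dictionary. This is only a sketch under assumptions: the names VariableReplacer and ReplaceVariable are made up for illustration, and it assumes a variable ends at a dot, at the next &, or at the end of the input.

```csharp
using System;
using System.Text.RegularExpressions;

static class VariableReplacer
{
    // Hypothetical helper, not part of the original answer.
    public static string ReplaceVariable(string text, string variable, string value)
    {
        // Regex.Escape keeps characters such as $ and # in the variable name literal.
        // The tail consumes a single dot, or only checks for a following & or the end of input.
        string pattern = "&" + Regex.Escape(variable) + @"(?:\.|(?=&)|\z)";
        // A MatchEvaluator returns the value verbatim, so a $ in the value
        // is not interpreted as a substitution pattern.
        return Regex.Replace(text, pattern, m => value);
    }
}

static class Demo
{
    static void Main()
    {
        foreach (var s in new[] { "&DEN", "&DENV", "&DEN&DENV", "&DEN..anything" })
        {
            Console.WriteLine("{0} -> {1}", s, VariableReplacer.ReplaceVariable(s, "DEN", "28"));
        }
    }
}
```

Because "&DENV" is followed by neither a dot, an &, nor the end of the input right after "&DEN", it is left alone, while "&DEN&DENV" becomes "28&DENV".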
Ok, I think I finally found it, using ORs. Regex
(.)?([^A-Za-z0-9#\#\$\&\,\;\:-\<>()\ ]|(?=\&)|\b)
seems to work fine. I'm posting it in case anyone finds it helpful.
EDIT: Sorry, I hadn't refreshed the page, so I posted this without knowing that Chris Schmich had already provided a better answer.
We have a requirement to extract and manipulate strings in C#/.NET. We have a string like this:
($name$:('George') AND $phonenumer$:('456456') AND
$emailaddress$:("test@test.com"))
We need to extract the strings between the $ characters.
In the end, we need to get a list of strings containing name, phonenumber, emailaddress.
What would be the ideal way to do it? Are there any out-of-the-box features available for this?
Regards,
John
The simplest way is to use a regular expression to match all word characters between two $ characters:
var regex = new Regex(@"\$\w+\$");
var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"test@test.com\"))";
var matches = regex.Matches(input);
This returns a collection of matches. The .Value property of each match contains the matching string. \$ is used because $ has a special meaning in regular expressions: it matches the end of a string. \w means a word character (a letter, digit or underscore). + means one or more.
Since this is a collection, you can use LINQ on it to get e.g. an array with the values:
var values = matches.OfType<Match>().Select(m => m.Value).ToArray();
That array will contain the values $name$, $phonenumer$, $emailaddress$.
Capture by name
You can specify groups in the pattern and attach names to them. For example, you can group the field name values:
var regex = new Regex(@"\$(?<name>\w+)\$");
var names = regex.Matches(input)
                 .OfType<Match>()
                 .Select(m => m.Groups["name"].Value);
This will return name, phonenumer, emailaddress. Parentheses are used for grouping, and (?<somename>pattern) attaches a name to the group.
Extract both names and values
You can also capture the field values and extract them as a separate field. Once you have the field name and value, you can return them, eg as an object or anonymous type.
The pattern in this case is more complex:
@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)"
The parentheses around the value are escaped because we want to match them literally. Both ' and " characters are used to quote values, so ['"] is used to accept either one. The pattern is a verbatim string (i.e. it starts with @), so the double quotes have to be escaped by doubling them: ['""]. Any character in the value has to be matched (.+), but only up to the next part of the pattern, hence the lazy .+?. Without the ?, the greedy .+ would match as much as possible and run on to the last closing quote and parenthesis in the input instead of stopping at the first.
Putting this together:
var regex = new Regex(@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)");
var myValues = regex.Matches(input)
                    .OfType<Match>()
                    .Select(m => new { Name = m.Groups["name"].Value,
                                       Value = m.Groups["value"].Value
                                     })
                    .ToArray();
Turn them into a dictionary
Instead of ToArray() you could convert the objects to a dictionary with ToDictionary(), eg with .ToDictionary(it=>it.Name,it=>it.Value). You could omit the select step and generate the dictionary from the matches themselves :
var myDict = regex.Matches(input)
.OfType<Match>()
.ToDictionary(m=>m.Groups["name"].Value,
m=>m.Groups["value"].Value);
Regular expressions are generally fast because they don't split the string. The pattern is converted to efficient code that scans the input and skips non-matching input immediately. Each match and group holds only the indexes of its starting and ending characters in the input string. A string is only generated when .Value is called.
Regex objects are thread-safe, which means a single Regex object can be stored in a static field and reused from multiple threads. That helps in web applications, as there's no need to create a new Regex object for each request.
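A minimal sketch of that reuse pattern, assuming a hypothetical FieldNames helper class (the class, field and method names are illustrative, not from the original answer). RegexOptions.Compiled is optional here; it trades slower construction for faster matching.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class FieldNames
{
    // Built once and shared by all callers; a Regex instance is immutable and thread-safe.
    private static readonly Regex NamePattern =
        new Regex(@"\$(?<name>\w+)\$", RegexOptions.Compiled);

    // Hypothetical helper: returns the text between each pair of $ characters.
    public static List<string> Extract(string input)
    {
        return NamePattern.Matches(input)
                          .OfType<Match>()
                          .Select(m => m.Groups["name"].Value)
                          .ToList();
    }
}

static class Demo
{
    static void Main()
    {
        var input = "($name$:('George') AND $phonenumer$:('456456'))";
        Console.WriteLine(string.Join(",", FieldNames.Extract(input)));
        // prints: name,phonenumer
    }
}
```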
Because of these two advantages, regular expressions are used extensively to parse log files and extract specific fields. Compared to splitting, performance can be 10 times better or more, while memory usage remains low. Splitting can easily result in memory usage that's multiple times bigger than the original input file.
Can it go faster?
Yes. Regular expressions produce parsing code that may not be as efficient as possible. A hand-written parser could be faster. In this particular case, we want to start capturing text when a $ is detected and continue up to the next $. This can be done with the following method:
IEnumerable<string> GetNames(string input)
{
    var builder = new StringBuilder(20);
    bool started = false;
    foreach (var c in input)
    {
        if (started)
        {
            if (c != '$')
            {
                builder.Append(c);
            }
            else
            {
                started = false;
                var value = builder.ToString();
                yield return value;
                builder.Clear();
            }
        }
        else if (c == '$')
        {
            started = true;
        }
    }
}
A string is an IEnumerable<char> so we can inspect one character at a time without having to copy them. By using a single StringBuilder with a predetermined capacity we avoid reallocations, at least until we find a key that's larger than 20 characters.
Modifying this code to extract values though isn't so easy.
Here's one way to do it, but certainly not very elegant. Basically splitting the string on the '$' and taking every other item will give you the result (after some additional trimming of unwanted characters).
In this example, I'm also grabbing the value of each item and then putting both in a dictionary:
var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"test@test.com\"))";
var inputParts = input.Replace(" AND ", "")
    .Trim(')', '(')
    .Split(new[] {'$'}, StringSplitOptions.RemoveEmptyEntries);
var keyValuePairs = new Dictionary<string, string>();
for (int i = 0; i < inputParts.Length - 1; i += 2)
{
    var key = inputParts[i];
    var value = inputParts[i + 1].Trim('(', ':', ')', '"', '\'', ' ');
    keyValuePairs[key] = value;
}
foreach (var kvp in keyValuePairs)
{
    Console.WriteLine($"{kvp.Key} = {kvp.Value}");
}
// Wait for input before closing
Console.WriteLine("\nDone!\nPress any key to exit...");
Console.ReadKey();
In C# I have two strings: [I/text] and [S/100x20].
So, the first one is [I/ followed by text and ending in ].
And the second is [S/ followed by an integer, then x, then another integer, and ending in ].
I need to check whether a given string matches one of these formats. I tried the following:
(?<word>.*?) and (?<word>[0-9]x[0-9])
But this does not seem to work and I am missing the [I/...] and [S/...] parts.
How can I do this?
This should do nicely:
Regex rex = new Regex(@"\[I/[^\]]+\]|\[S/\d+x\d+\]");
If the text in [I/text] is supposed to include only alphanumeric characters, then @Oleg's use of \w instead of [^\]] would be better. Also, + means there must be at least one character from the preceding character class, whereas * would allow the class to match nothing at all. Adjust as needed.
And use:
string testString1 = "[I/text]";
if(rex.IsMatch(testString1))
{
// should match..
}
string testString2 = "[S/100x20]";
if(rex.IsMatch(testString2))
{
// should match..
}
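Note that IsMatch succeeds if the pattern is found anywhere in the input. If the whole string has to be in one of the two formats, the pattern can be anchored. The following is a sketch of that variation, not the answer's own code; the TokenFormat class and IsValid method are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class TokenFormat
{
    // \A and \z pin the match to the very start and very end of the input,
    // so extra text before or after the bracketed token is rejected.
    private static readonly Regex Whole =
        new Regex(@"\A(?:\[I/[^\]]+\]|\[S/\d+x\d+\])\z");

    // Hypothetical helper: true only if the entire string is [I/text] or [S/NxM].
    public static bool IsValid(string s)
    {
        return Whole.IsMatch(s);
    }
}

static class Demo
{
    static void Main()
    {
        Console.WriteLine(TokenFormat.IsValid("[I/text]"));     // True
        Console.WriteLine(TokenFormat.IsValid("[S/100x20]"));   // True
        Console.WriteLine(TokenFormat.IsValid("x[S/100x20]"));  // False
    }
}
```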
The following regex does it. It matches the whole string:
"(\[I/\w+\])|(\[S/\d+x\d+\])"
The two alternatives are:
(\[I/\w+\])
(\[S/\d+x\d+\])
The above works.
Use http://regexr.com?34543 to play with your expressions.
I know what is going on, but I am trying to make my .Split() ignore certain characters.
Sample:
1|2|3|This is a string|type:1
The part "This is a string" is user input. The user could enter the splitting character (| in this case), so I wanted to escape it as \|. The string still seems to be split on it. This is being done on the web, so I was thinking that a smart move might actually be to just use JSON.encode(user_in) to get around it?
1|2|3| This is \|a string|type:1
This still splits on the escaped character because I didn't define it as a special case. How would I get around this issue?
You could use Regex.Split instead and split on a | that is not preceded by a \.
// -- regex for | not preceded by a \
string input = @"1|2|3|This is a string\|type:1";
string pattern = @"(?<!\\)[|]";
string[] substrings = Regex.Split(input, pattern);
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
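The split leaves the escape sequence \| in the resulting fields. A possible follow-up step, sketched here with a hypothetical EscapedSplit helper (the name is not from the original answer), turns it back into a plain |. It assumes that \| is the only escape in use, so a literal backslash directly before a delimiter is not handled.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class EscapedSplit
{
    // Hypothetical helper: split on | not preceded by a backslash,
    // then replace each escaped \| in the fields with a plain |.
    public static string[] Split(string input)
    {
        return Regex.Split(input, @"(?<!\\)\|")
                    .Select(part => part.Replace(@"\|", "|"))
                    .ToArray();
    }
}

static class Demo
{
    static void Main()
    {
        foreach (var field in EscapedSplit.Split(@"1|2|3|This is \|a string|type:1"))
        {
            Console.WriteLine("'{0}'", field);
        }
    }
}
```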
You can replace the escaped delimiter with a special placeholder first, then split, and finally put the escaped delimiter back.
var initial = @"1|2|3|This is \| a string|type:1";
var modified = initial.Replace(@"\|", "###");
IEnumerable<string> result = modified.Split('|');
result = result.Select(i => i.Replace("###", @"\|"));
I have the following patterns:
private static Regex rgxDefinitionDoMatch = new Regex(@"d:(?<value>(?:(?!c:|d:|p:).)+)", RegexOptions.Compiled);
private static Regex rgxDefinitionDontMatch = new Regex(@"\!d:(?<value>(?:(?!c:|d:|p:).)+)", RegexOptions.Compiled);
private static Regex rgxDefinitionExactDoMatch = new Regex(@"d:(?<value>\""(?:(?!c:|d:|p:).)+)\""", RegexOptions.Compiled);
private static Regex rgxDefinitionExactDontMatch = new Regex(@"\!d:(?<value>\""(?:(?!c:|d:|p:).)+)\""", RegexOptions.Compiled);
Here is an example string to match:
c:matchThis !c:dontMatchThis p:matchThis !p:dontMatchThis d:def !d:defDont d:"def" !d:"defDont"
Now here are some issues:
1. When I use rgxDefinitionDontMatch, I get both !d:defDont and d:"defDont".
2. When I use rgxDefinitionDoMatch it is even worse... I get !d:defDont, d:"defDont", !d:def and d:"def".
For number 2, I have tried different combinations to ignore the exclamation mark at the front of rgxDefinitionDoMatch, ^(?!\!) for example, but it then just doesn't match anything. I'm not sure what to do.
I will also need a way of ignoring quotes for both problems 1 and 2.
Can anyone help? I've been trying for some time now.
Is this what you're looking for?
Regex[] rgxs = {
    new Regex(@"(?<!\S)d:(?:""(?<value>[^""]+)""|(?<value>\S+))"),
    new Regex(@"(?<!\S)!d:(?:""(?<value>[^""]+)""|(?<value>\S+))")
};
string input = @"c:matchThis !c:dontMatchThis p:matchThis !p:dontMatchThis d:def !d:defDont d:""def"" !d:""defDont""";
foreach (Regex r in rgxs)
{
    Console.WriteLine(r.ToString());
    foreach (Match m in r.Matches(input))
    {
        foreach (String name in r.GetGroupNames())
        {
            Console.WriteLine("{0,-6} => {1}", name, m.Groups[name].Value);
        }
    }
    Console.WriteLine();
}
(?<!\S)d:(?:"(?<value>[^"]+)"|(?<value>\S+))
0 => d:def
value => def
0 => d:"def"
value => def
(?<!\S)!d:(?:"(?<value>[^"]+)"|(?<value>\S+))
0 => !d:defDont
value => defDont
0 => !d:"defDont"
value => defDont
As I was trying to figure out what you were asking, I finally decided the simplest course was to post my code and get your feedback. I'll try to refine it as needed, and (of course) explain it. :D
EDIT: Here's the separate regexes you asked for in the comments:
Regex[] rgxs = {
    new Regex(@"(?<!\S)d:(?<value>\S+)"),
    new Regex(@"(?<!\S)!d:(?<value>\S+)"),
    new Regex(@"(?<!\S)d:""(?<value>[^""]+)"""),
    new Regex(@"(?<!\S)!d:""(?<value>[^""]+)""")
};
Combining them the way I did, it doesn't matter if the "value" part is quoted or not, it's still captured--without the quotes, if they're present. (I thought that's what you meant by "ignoring quotes".) What's interesting about the combined form is how I used the same group name twice in the same regex-- something few regex flavors support.
(?<!\S), a negative lookbehind for a non-whitespace character, solves the question you posed in your comment: it ensures that every match starts either at the beginning of the string or after a whitespace character. Similarly, the \S+ ensures that the match ends at the end of the string or before the next whitespace character.
"[^"]+", obviously, matches anything enclosed in quotes, except other quotes. It permits the value to contain whitespace, which I presumed was the reason for the separate regexes. But I mainly wanted to point out that you didn't need to use backslashes to escape the quotes. In a C# verbatim string, it's the extra quote that does the escaping: @"""[^""]+""".
i have strings in the form [abc].[some other string].[can.also.contain.periods].[our match]
i now want to match the string "our match" (i.e. without the brackets), so i played around with lookarounds and whatnot. i now get the correct match, but i don't think this is a clean solution.
(?<=\.?\[) starts with '[' or '.['
([^\[]*) our match, i couldn't find a way to not use a negated character group
`.*?` non-greedy did not work as expected with lookarounds,
it would still match from the first match
(matches might contain escaped brackets)
(?=\]$) string ends with an ]
language is .net/c#. if there is an easier solution not involving a regex i'd be also happy to know
what really irritates me is the fact, that i cannot use (.*?) to capture the string, as it seems non-greedy does not work with lookbehinds.
i also tried: Regex.Split(str, @"\]\.\[").Last().TrimEnd(']');, but i'm not really proud of this solution either
The following should do the trick, assuming the string ends right after the last match.
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
var search = new Regex("\\.\\[(.*?)\\]$", RegexOptions.RightToLeft);
string ourMatch = search.Match(input).Groups[1].Value;
Assuming you can guarantee the input format, and it's just the last entry you want, LastIndexOf could be used:
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
int lastBracket = input.LastIndexOf("[");
string result = input.Substring(lastBracket + 1, input.Length - lastBracket - 2);
With String.Split():
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
char[] seps = {'[',']','\\'};
string[] splitted = input.Split(seps,StringSplitOptions.RemoveEmptyEntries);
You get "our match" in splitted[6], and can.also.contain.periods is left as one string (splitted[4]).
Edit: the array will have the string inside [] and then . and so on, so if you have a variable number of groups, you can use that to get the value you want (or remove the strings that are just '.')
Edited to add the backslash to the separator to treat cases like '\[abc\]'
Edit2: for nested []:
string input = @"[abc].[some other string].[can.also.contain.periods].[our [the] match]";
string[] seps2 = { "].[" };
string[] splitted = input.Split(seps2, StringSplitOptions.RemoveEmptyEntries);
You get our [the] match] in the last element (index 3), and you'd have to remove the extra ].
You have several options:
RegexOptions.RightToLeft - yes, .NET regex can do this! Use it!
Match the whole thing with greedy prefix, use brackets to capture the suffix that you're interested in
So generally, pattern becomes .*(pattern)
In this case, .*\[([^\]]*)\], then extract what \1 captures (see this on rubular.com)
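The greedy-prefix option could look roughly like this in C#. It is a sketch only; the LastBracket class and Extract method are hypothetical names, and it assumes the last bracketed entry contains no nested or escaped brackets.

```csharp
using System;
using System.Text.RegularExpressions;

static class LastBracket
{
    // Hypothetical helper: the greedy .* first runs to the end of the input and then
    // backtracks, so the capturing group ends up on the last [...] pair.
    public static string Extract(string input)
    {
        Match m = Regex.Match(input, @".*\[([^\]]*)\]");
        return m.Success ? m.Groups[1].Value : null;
    }
}

static class Demo
{
    static void Main()
    {
        string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
        Console.WriteLine(LastBracket.Extract(input)); // prints: our match
    }
}
```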
References
regular-expressions.info/Grouping with brackets