Regex Issues (In C#)

I have the following patterns:
private static Regex rgxDefinitionDoMatch = new Regex(@"d:(?<value>(?:(?!c:|d:|p:).)+)", RegexOptions.Compiled);
private static Regex rgxDefinitionDontMatch = new Regex(@"\!d:(?<value>(?:(?!c:|d:|p:).)+)", RegexOptions.Compiled);
private static Regex rgxDefinitionExactDoMatch = new Regex(@"d:(?<value>\""(?:(?!c:|d:|p:).)+)\""", RegexOptions.Compiled);
private static Regex rgxDefinitionExactDontMatch = new Regex(@"\!d:(?<value>\""(?:(?!c:|d:|p:).)+)\""", RegexOptions.Compiled);
Here is an example string to match:
c:matchThis !c:dontMatchThis p:matchThis !p:dontMatchThis d:def !d:defDont d:"def" !d:"defDont"
Now here are some issues:
1. When I use rgxDefinitionDontMatch, I get both !d:defDont and d:"defDont"
2. When I use rgxDefinitionDoMatch it is even worse... I get !d:defDont, d:"defDont", !d:def and d:"def".
For number 2, I have tried different combinations to ignore the exclamation mark on the front of rgxDefinitionDoMatch ^(?!\!) for example, but it then just doesn't match anything. I'm not sure what to do.
I will also need a way of ignoring quotes for both problems 1. and 2.
Can anyone help? I've been trying for some time now.

Is this what you're looking for?
Regex[] rgxs = {
new Regex(@"(?<!\S)d:(?:""(?<value>[^""]+)""|(?<value>\S+))"),
new Regex(@"(?<!\S)!d:(?:""(?<value>[^""]+)""|(?<value>\S+))")
};
string input = @"c:matchThis !c:dontMatchThis p:matchThis !p:dontMatchThis d:def !d:defDont d:""def"" !d:""defDont""";
foreach (Regex r in rgxs)
{
Console.WriteLine(r.ToString());
foreach (Match m in r.Matches(input))
{
foreach (String name in r.GetGroupNames())
{
Console.WriteLine("{0,-6} => {1}", name, m.Groups[name].Value);
}
}
Console.WriteLine();
}
(?<!\S)d:(?:"(?<value>[^"]+)"|(?<value>\S+))
0 => d:def
value => def
0 => d:"def"
value => def
(?<!\S)!d:(?:"(?<value>[^"]+)"|(?<value>\S+))
0 => !d:defDont
value => defDont
0 => !d:"defDont"
value => defDont
As I was trying to figure out what you were asking, I finally decided the simplest course was to post my code and get your feedback. I'll try to refine it as needed, and (of course) explain it. :D
EDIT: Here's the separate regexes you asked for in the comments:
Regex[] rgxs = {
new Regex(@"(?<!\S)d:(?<value>\S+)"),
new Regex(@"(?<!\S)!d:(?<value>\S+)"),
new Regex(@"(?<!\S)d:""(?<value>[^""]+)"""),
new Regex(@"(?<!\S)!d:""(?<value>[^""]+)""")
};
Combining them the way I did, it doesn't matter if the "value" part is quoted or not, it's still captured--without the quotes, if they're present. (I thought that's what you meant by "ignoring quotes".) What's interesting about the combined form is how I used the same group name twice in the same regex-- something few regex flavors support.
(?<!\S), a negative lookbehind for a non-whitespace character, solves the question you posed in your comment: it ensures that every match starts either at the beginning of the string or after a whitespace character. Similarly, the \S+ ensures that the match continues to the end of the string or up to the next whitespace character.
"[^"]+", obviously, matches anything enclosed in quotes, except other quotes. It permits the value to contain whitespace, which I presumed was the reason for the separate regexes. But I mainly wanted to point out that you didn't need to use backslashes to escape the quotes. In a C# verbatim string, it's the extra quote that does the escaping: @"""[^""]+""".
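To see the lookbehind at work in isolation, here is a minimal sketch using the two unquoted patterns from above. The class name, the Values helper, and the shortened input string are made up for illustration; they are not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class DefinitionTokens
{
    // Hypothetical helper: returns the "value" group of every match of the pattern.
    public static string[] Values(string pattern, string input)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups["value"].Value)
                    .ToArray();
    }

    static void Main()
    {
        string input = "d:def !d:defDont d:other";

        // The 'd' in "!d:defDont" is preceded by '!', a non-whitespace
        // character, so (?<!\S) rejects it for the plain d: pattern.
        Console.WriteLine(string.Join(", ", Values(@"(?<!\S)d:(?<value>\S+)", input)));

        // The negated pattern only matches tokens that start with "!d:".
        Console.WriteLine(string.Join(", ", Values(@"(?<!\S)!d:(?<value>\S+)", input)));
    }
}
```

With this input the first call should yield def and other, and the second only defDont.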

Related

How to remove substring after occurrence of certain characters in a string?

I have the requirement as follows:
input => "Employee.Addresses[].Address.City"
output => "Employee.Addresses[].City"
(Address is removed which is present after [].)
input => "Employee.Addresses[].Address.Lanes[].Lane.Name"
output => "Employee.Addresses[].Lanes[].Name"
(Address is removed which is present after []. and Lane is removed which is present after [].)
How to do this in C#?

private static IEnumerable<string> Filter(string input)
{
var subWords = input.Split('.');
bool skip = false;
foreach (var word in subWords)
{
if (skip)
{
skip = false;
}
else
{
yield return word;
}
if (word.EndsWith("[]"))
{
skip = true;
}
}
}
And now you use it like this:
var filtered = string.Join(".", Filter(input));

How about a regular expression?
Regex rgx = new Regex(@"(?<=\[\])\..+?(?=\.)");
string output = rgx.Replace(input, String.Empty);
Explanation:
(?<=\[\]) //positive lookbehind for the brackets
\. //match literal period
.+? //match any character at least once, but as few times as possible
(?=\.) //positive lookahead for a literal period
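As a quick check of the pattern explained above, the following sketch runs it against both inputs from the question. The class and method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class BracketSegmentFilter
{
    // Same pattern as in the answer: a ".segment" that directly follows "[]"
    // and is itself followed by another dot.
    static readonly Regex AfterBrackets = new Regex(@"(?<=\[\])\..+?(?=\.)");

    // Hypothetical wrapper name; removes each matched segment.
    public static string RemoveSegmentAfterBrackets(string path)
    {
        return AfterBrackets.Replace(path, string.Empty);
    }

    static void Main()
    {
        Console.WriteLine(RemoveSegmentAfterBrackets("Employee.Addresses[].Address.City"));
        Console.WriteLine(RemoveSegmentAfterBrackets("Employee.Addresses[].Address.Lanes[].Lane.Name"));
    }
}
```

One thing to be aware of: because of the lookahead, a segment after [] that is the last one in the string (no dot after it) is left in place.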

Your description of what you need is lacking. Please correct me if I have understood it incorrectly.
You need to find the pattern "[]." and then remove everything after this pattern until the next dot .
If this is the case, I believe using a Regular Expression could solve the problem easily.
So, the pattern "[]." can be written in a regular expression as
"\[\]\."
Then you need to find everything after this pattern until the next dot: ".*?\." (The .*? matches any character zero or more times, but as few times as possible, so it stops at the first dot it finds).
So, the whole pattern would be:
var regexPattern = @"\[\]\..*?\.";
And you want to replace all matches of this pattern with "[]." (i.e. removing what was matched after the brackets until the dot).
So you call the Replace method in the Regex class:
var result = Regex.Replace(input, regexPattern, "[].");

C# creating a string which will be parsed, based on user input fails when they enter a tokenizing character

I know what is going on, but I was trying to make my .Split() ignore certain characters.
sample:
1|2|3|This is a string|type:1
The part "This is a string" is user input. The user could enter a splitting character, | in this case, so I wanted to escape it with \|. It still seems to split on that. This is being done on the web, so I was thinking that a smart move might be to just JSON.encode(user_in) to get around it?
1|2|3| This is \|a string|type:1
It still splits on the escaped character because I didn't define it as a special case. How would I get around this issue?

You could use Regex.Split instead and split on | not preceded by a \.
// -- regex for | not preceded by a \
string input = @"1|2|3|This is a string\|type:1";
string pattern = @"(?<!\\)[|]";
string[] substrings = Regex.Split(input, pattern);
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
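The split above leaves the escape sequence \| inside the field. A possible follow-up, sketched here with invented names, is to unescape each field after splitting. It assumes a backslash is only ever used to escape |; an escaped backslash directly before a delimiter (\\|) would need a real tokenizer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class EscapedSplit
{
    // Hypothetical helper: splits on '|' not preceded by a backslash,
    // then turns every "\|" in a field back into a plain '|'.
    public static string[] SplitFields(string line)
    {
        return Regex.Split(line, @"(?<!\\)\|")
                    .Select(field => field.Replace(@"\|", "|"))
                    .ToArray();
    }

    static void Main()
    {
        foreach (string field in SplitFields(@"1|2|3|This is \|a string|type:1"))
        {
            Console.WriteLine("'{0}'", field);
        }
    }
}
```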

You can replace your delimiter with something special first, next split it and finally replace it back.
var initial = @"1|2|3|This is \| a string|type:1";
var modified = initial.Replace(@"\|", "###");
IEnumerable<string> result = modified.Split('|');
result = result.Select(i => i.Replace("###", @"\|"));

Multifunction RegEx for parsing JCL variables - out of working solutions

I'm a bit lost creating a RegEx under C#.NET.
I'm doing something like parser, so I use Regex.Replace to search text for certain "variables" and replace them with their "values".
Each variable starts with ampersand ("&") and ends with ampersand (beginning of another variable) or dot.
Each variable (as well as text surrounding variables) can only consist of alphanumerical characters and certain "special" characters, namely "$", "#", "@" and "-".
Neither variables nor the rest of the text can contain space characters (" ").
Now, the problem is that I'm trying to figure out a RegEx replacing one possible ending character ("."), while not replacing the other possible ending character ("&").
Which happens to be quite an issue:
"&"+variable+"[^A-Za-z0-9#@$]" does what I want, except for it also replaces "&" - not acceptable.
"&"+variable+"(.)?\b" replaces dot, but only if followed by literal character - not if it's followed by \&\#@\$\- and that could occur, so this doesn't work either.
"&"+variable+"(.)?(?!A-Za-z0-9)" does exactly what I want as for the ending characters, except it doesn't recognize true end of variable - this way, search-and-replace for "&DEN" also replaces that part in another variable, called "&DENV" - of which "&DEN" is a substring. This would create false/misleading results - totally unacceptable.
These were all the possibilities I could think of (and search of); is it possible to do the task I require with one RegEx at all? Under C#.NET RegEx parser?
Just to illustrate desired function:
string variable="DEN";
string replaceWith="28";
string replText;
string regex = "<desired regex>";
replText = Regex.Replace(replText, "&"+variable+regex, replaceWith);
replText="&DEN";
=> replaced => repltext=="28"
replText="&DENV"
=> not replaced => repltext=="&DENV"
replText="&DEN&DEN"
=> replaced => repltext=="2828"
replText="&DEN&DENV"
=> replaced, not replaced => repltext=="28&DENV"
replText="&DEN.anything"
=> replaced and dot removed => repltext=="28anything"
replText="&DEN..anything"
=> replaced and first dot removed => repltext=="28.anything"
variable could also be like "#DE#N-$".

The following works correctly on all of your examples. I assumed that a variable &FOO should only be replaced if it's followed by ., &, or end-of-string $. If it's followed by anything else, it's not replaced.
In order to match but not capture a terminating &, I used a lookahead assertion (?=&). Assertions force the string to match the regex, but they don't consume any characters, so those characters aren't replaced. Trailing . are still captured and replaced as part of the variable, however.
Finally, a MatchEvaluator is specified to use the captured pattern to do a lookup in the replacements dictionary for the replacement value. If the pattern (variable name) is not found, the text is effectively untouched (the full original capture is returned).
class Program
{
static string ReplaceVariables(Dictionary<string, string> replacements, string input)
{
return Regex.Replace(input, @"&([\w\d$#@-]+)(\.|(?=&)|$)", m =>
{
string replacement = null;
return replacements.TryGetValue(m.Groups[1].Value, out replacement)
? replacement
: m.Groups[0].Value;
});
}
static void Main(string[] args)
{
string[] tests = new[]
{
"&DEN", "&DENV", "&DEN&DEN",
"&DEN&DENV", "&DEN.anything",
"&DEN..anything", "&DEN Foo",
"&DEN&FOO&DEN"
};
var replace = new Dictionary<string, string>
{
{ "DEN", "28" },
{ "FOO", "42" }
};
foreach (var test in tests)
{
Console.WriteLine("{0} -> {1}", test, ReplaceVariables(replace, test));
}
}
}

Ok, I think I finally found it, using ORs. Regex
(.)?([^A-Za-z0-9@\#\$\&\,\;\:-\<>()\ ]|(?=\&)|\b)
seems to work fine. I'm just posting this in case anyone finds it helpful.
EDIT: sorry, I haven't refreshed the page and thus reacted without knowing there is a better answer provided by Chris Schmich

c# regex to extract link after =

Couldn't find better title but i need a Regex to extract link from sample below.
snip... flashvars.image_url = 'http://domain.com/test.jpg' ..snip
assuming regex is the best way.
thanks

Consider the following sample code. It shows how one might extract the link from your supplied string, though I have expanded the test data a little. Generally, .* is too inclusive (as the example below demonstrates).
The main point is that there are several ways to do what you are asking: the first pattern uses "look-around" while the others use the "Groups" approach. The choice mainly depends on your actual data.
string[] tests = {
@"snip... flashvars.image_url = 'http://domain.com/test.jpg' ..snip",
@"snip... flashvars.image_url = 'http://domain.com/test.jpg' flashvars2.image_url = 'http://someother.domain.com/test.jpg'",
};
string[] patterns = {
@"(?<==\s')[^']*(?=')",
@"=\s*'(.*)'",
@"=\s*'([^']*)'",
};
foreach (string pattern in patterns)
{
Console.WriteLine();
foreach (string test in tests)
foreach (Match m in Regex.Matches(test, pattern))
{
if (m.Groups.Count > 1)
Console.WriteLine("{0}", m.Groups[1].Value);
else
Console.WriteLine("{0}", m.Value);
}
}

A simple regex for this would be @"=\s*'(.*)'".

Edit: New regex matching your edited question:
You need to match what's between quotes, after a =, right?
@"(?<==\s*')[^']*(?=')"
should do.
(?<==\s*') asserts that there is a =, optionally followed by whitespace, followed by a ', just before our current position (positive lookbehind).
[^']* matches any number of non-' characters.
(?=') asserts that the match stops before the next '.
This regex doesn't check if there is indeed a URL inside those quotes. If you want to do that, use
@"(?<==\s*')(?=(?:https?|ftp|mailto)\b)[^']*(?=')"
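For reference, here is a small sketch of the first look-around pattern, (?<==\s*')[^']*(?=') without the URL scheme check, applied to the sample from the question. The class and method names are made up. It relies on .NET allowing variable-length lookbehind (the \s* inside (?<=...)), which many other regex flavors do not support.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuotedValueExtractor
{
    // Hypothetical helper: returns the text between the single quotes that
    // follow an '=' sign, or null when nothing matches.
    public static string ExtractQuotedValue(string text)
    {
        Match m = Regex.Match(text, @"(?<==\s*')[^']*(?=')");
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        string sample = "snip... flashvars.image_url = 'http://domain.com/test.jpg' ..snip";
        Console.WriteLine(ExtractQuotedValue(sample));
    }
}
```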

Formatting sentences in a string using C#

I have a string with multiple sentences. How do I capitalize the first letter of the first word in every sentence? Something like paragraph formatting in Word.
e.g. "this is some code. the code is in C#. "
The output must be "This is some code. The code is in C#".
One way would be to split the string on '.', capitalize the first letter of each piece, and then rejoin.
Is there a better solution?

In my opinion, when it comes to potentially complex rules-based string matching and replacing - you can't get much better than a Regex-based solution (despite the fact that they are so hard to read!). This offers the best performance and memory efficiency, in my opinion - you'll be surprised at just how fast this'll be.
I'd use the Regex.Replace overload that accepts an input string, regex pattern and a MatchEvaluator delegate. A MatchEvaluator is a function that accepts a Match object as input and returns a string replacement.
Here's the code:
public static string Capitalise(string input)
{
//now the first character
return Regex.Replace(input, @"(?<=(^|[.;:])\s*)[a-z]",
(match) => { return match.Value.ToUpper(); });
}
The regex uses the (?<=) construct (zero-width positive lookbehind) to restrict matches to a-z characters preceded by the start of the string, or by the punctuation marks you want. In the [.;:] bit you can add the extra ones you want (e.g. [.;:?"] to add ? and " characters).
This means, also, that your MatchEvaluator doesn't have to do any unnecessary string joining (which you want to avoid for performance reasons).
All the other stuff mentioned by one of the other answerers about using the RegexOptions.Compiled is also relevant from a performance point of view. The static Regex.Replace method does offer very similar performance benefits, though (there's just an additional dictionary lookup).
Like I say - I'll be surprised if any of the other non-regex solutions here will work better and be as fast.
EDIT
Have put this solution up against Ahmad's as he quite rightly pointed out that a look-around might be less efficient than doing it his way.
Here's the crude benchmark I did:
public string LowerCaseLipsum
{
get
{
//went to lipsum.com and generated 10 paragraphs of lipsum
//which I then initialised into the backing field with @"[lipsumtext]".ToLower()
return _lowerCaseLipsum;
}
}
[TestMethod]
public void CapitaliseAhmadsWay()
{
List<string> results = new List<string>();
DateTime start = DateTime.Now;
Regex r = new Regex(@"(^|\p{P}\s+)(\w+)", RegexOptions.Compiled);
for (int f = 0; f < 1000; f++)
{
results.Add(r.Replace(LowerCaseLipsum, m => m.Groups[1].Value
+ m.Groups[2].Value.Substring(0, 1).ToUpper()
+ m.Groups[2].Value.Substring(1)));
}
TimeSpan duration = DateTime.Now - start;
Console.WriteLine("Operation took {0} seconds", duration.TotalSeconds);
}
[TestMethod]
public void CapitaliseLookAroundWay()
{
List<string> results = new List<string>();
DateTime start = DateTime.Now;
Regex r = new Regex(@"(?<=(^|[.;:])\s*)[a-z]", RegexOptions.Compiled);
for (int f = 0; f < 1000; f++)
{
results.Add(r.Replace(LowerCaseLipsum, m => m.Value.ToUpper()));
}
TimeSpan duration = DateTime.Now - start;
Console.WriteLine("Operation took {0} seconds", duration.TotalSeconds);
}
In a release build, my solution was about 12% faster than Ahmad's (1.48 seconds as opposed to 1.68 seconds).
Interestingly, however, if it was done through the static Regex.Replace method, both were about 80% slower, and my solution was slower than Ahmad's.

Here's a regex solution that uses the punctuation category to avoid having to specify .!?" etc. although you should certainly check if it covers your needs or set them explicitly. Read up on the "P" category under the "Supported Unicode General Categories" section located on the MSDN Character Classes page.
string input = @"this is some code. the code is in C#? it's great! In ""quotes."" after quotes.";
string pattern = @"(^|\p{P}\s+)(\w+)";
// compiled for performance (might want to benchmark it for your loop)
Regex rx = new Regex(pattern, RegexOptions.Compiled);
string result = rx.Replace(input, m => m.Groups[1].Value
+ m.Groups[2].Value.Substring(0, 1).ToUpper()
+ m.Groups[2].Value.Substring(1));
If you decide not to use the \p{P} class you would have to specify the characters yourself, similar to:
string pattern = @"(^|[.?!""]\s+)(\w+)";
EDIT: below is an updated example to demonstrate 3 patterns. The first shows how all punctuations affect casing. The second shows how to pick and choose certain punctuation categories by using class subtraction. It uses all punctuations while removing specific punctuation groups. The third is similar to the 2nd but using different groups.
The MSDN link doesn't spell out what some of the punctuation categories refer to, so here's a breakdown:
P: all punctuations (comprises all of the categories below)
Pc: underscore _
Pd: dash -
Ps: open parenthesis, brackets and braces ( [ {
Pe: closing parenthesis, brackets and braces ) ] }
Pi: initial single/double quotes (MSDN says it "may behave like Ps/Pe depending on usage")
Pf: final single/double quotes (MSDN Pi note applies)
Po: other punctuation such as commas, colons, semi-colons and slashes ,, :, ;, \, /
Carefully compare how the results are affected by these groups. This should grant you a great degree of flexibility. If this doesn't seem desirable then you may use specific characters in a character class as shown earlier.
string input = @"foo ( parens ) bar { braces } foo [ brackets ] bar. single ' quote & "" double "" quote.
dash - test. Connector _ test. Comma, test. Semicolon; test. Colon: test. Slash / test. Slash \ test.";
string[] patterns = {
@"(^|\p{P}\s+)(\w+)", // all punctuation chars
@"(^|[\p{P}-[\p{Pc}\p{Pd}\p{Ps}\p{Pe}]]\s+)(\w+)", // all punctuation chars except Pc/Pd/Ps/Pe
@"(^|[\p{P}-[\p{Po}]]\s+)(\w+)" // all punctuation chars except Po
};
// compiled for performance (might want to benchmark it for your loop)
foreach (string pattern in patterns)
{
Console.WriteLine("*** Current pattern: {0}", pattern);
string result = Regex.Replace(input, pattern,
m => m.Groups[1].Value
+ m.Groups[2].Value.Substring(0, 1).ToUpper()
+ m.Groups[2].Value.Substring(1));
Console.WriteLine(result);
Console.WriteLine();
}
Notice that "Dash" is not capitalized using the last pattern and it's on a new line. One way to make it capitalized is to use the RegexOptions.Multiline option. Try the above snippet with that to see if it meets your desired result.
Also, for the sake of example, I didn't use RegexOptions.Compiled in the above loop. To use both options OR them together: RegexOptions.Compiled | RegexOptions.Multiline.

You have a few different options:
Your approach of splitting the string, capitalizing and then re-joining
Using regular expressions to perform a replace of the expressions (which can be a bit tricky for case)
Write a C# iterator that iterates over each character and yields a new IEnumerable<char> with the first letter after a period in upper case. May offer benefit of a streaming solution.
Loop over each char and upper-case those that appear immediately after a period (whitespace ignored) - a StringBuilder may make this easier.
The code below uses the last approach, with a StringBuilder:
public static string ToSentenceCase( string someString )
{
var sb = new StringBuilder( someString.Length );
bool wasPeriodLastSeen = true; // We want first letter to be capitalized
foreach( var c in someString )
{
if( wasPeriodLastSeen && !char.IsWhiteSpace( c ) )
{
sb.Append( char.ToUpper( c ) );
wasPeriodLastSeen = false;
}
else
{
if( c == '.' ) // you may want to expand this to other punctuation
wasPeriodLastSeen = true;
sb.Append( c );
}
}
return sb.ToString();
}

I don't know why, but I decided to give yield return a try, based on what LBushkin had suggested. Just for fun.
static IEnumerable<char> CapitalLetters(string sentence)
{
//capitalize first letter
bool capitalize = true;
char lastLetter;
for (int i = 0; i < sentence.Length; i++)
{
lastLetter = sentence[i];
yield return (capitalize) ? Char.ToUpper(sentence[i]) : sentence[i];
if (Char.IsWhiteSpace(lastLetter) && capitalize == true)
continue;
capitalize = false;
if (lastLetter == '.' || lastLetter == '!') //etc
capitalize = true;
}
}
To use it:
string sentence = new String(CapitalLetters("this is some code. the code is in C#.").ToArray());

Do your work in a StringBuilder.
Lowercase the whole thing.
Loop through and uppercase leading chars.
Call ToString.
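The four steps above could be sketched roughly as follows. The names are invented, and the choice of '.', '!' and '?' as sentence terminators is an assumption. Note that lowercasing everything first also lowercases things like the "C" in "C#", which may or may not be what you want.

```csharp
using System;
using System.Text;

static class SentenceCase
{
    // Hypothetical implementation of: lowercase everything, then
    // uppercase the first letter of each sentence in place.
    public static string ToSentenceCase(string text)
    {
        var sb = new StringBuilder(text.ToLowerInvariant());
        bool capitalizeNext = true; // the first letter of the text starts a sentence

        for (int i = 0; i < sb.Length; i++)
        {
            char c = sb[i];
            if (capitalizeNext && char.IsLetter(c))
            {
                sb[i] = char.ToUpperInvariant(c);
                capitalizeNext = false;
            }
            else if (c == '.' || c == '!' || c == '?') // assumed terminators
            {
                capitalizeNext = true;
            }
            else if (!char.IsWhiteSpace(c))
            {
                // A digit or symbol right after a terminator (as in "3.14")
                // cancels the pending capitalization.
                capitalizeNext = false;
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(ToSentenceCase("this is some code. the code is in C#. "));
    }
}
```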
