Regular Expression question - c#

RegexBuddy shows the matches are OK, but in C#, when I try to use Replace, a semicolon and a curly bracket are not replaced.
The expression I am using is the following:
@"({\\)(.+?)(}+)|(\s?\\)(.+?)(\b)|}$"
and the input text (rtf) is included in the screenshot.
This is the code:
Regex reg2 = new Regex(@"\\b([\s\S]+?)\\b0");
MatchCollection matches = reg2.Matches(text);
foreach (Match match in matches)
{
    string output = reg.Replace(match.Value, "");
    MessageBox.Show(output);
}

You are trying to match nested structures with regular expressions. Look at your screenshot: in the first line there are three opening braces and one closing brace; in your third line you have one opening and two closing braces, and so on.
While .NET does provide ways to do nested pattern matching with regexes, your regex is not using them (and it's extremely mystifying to me what exactly you're hoping to achieve).
You most certainly need to use a different way to parse RTF files; unfortunately I don't know whether the .NET libraries provide an RTF parser.
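
As a rough illustration of the nested matching that .NET supports, here is a minimal sketch using balancing groups. The class and method names (BraceGroups, TopLevel) are made up for this example, and it only counts raw { and } characters: it does not know about RTF escapes such as \{ or about control words, so it is not an RTF parser.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class BraceGroups
{
    // Balancing-group pattern: every inner '{' pushes onto the "depth" stack,
    // every inner '}' pops it, and (?(depth)(?!)) rejects the match if
    // anything is still open when the closing '}' is reached.
    private static readonly Regex Balanced = new Regex(
        @"\{(?>[^{}]+|\{(?<depth>)|\}(?<-depth>))*(?(depth)(?!))\}");

    // Returns each outermost balanced {...} group found in the text.
    public static List<string> TopLevel(string text)
    {
        var groups = new List<string>();
        foreach (Match m in Balanced.Matches(text))
            groups.Add(m.Value);
        return groups;
    }
}
```

Calling BraceGroups.TopLevel("x {a{b}c} y {d}") yields the two groups {a{b}c} and {d}. For real RTF input a dedicated parser is still the safer route.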

Related

Regular expression with backreferences

I am attempting to write a regular expression in a C# application to find "{value}", along with a backreference to the text before it up to "[[", and another backreference to the text after it up to "]]". For example:
This is some text [[backreference one {value}
backreference two]]
Would match "[[backreference one ", "{value}", and "\r\nbackreference two]]".
I have tried modified versions of the following with no luck. I believe I am missing word boundaries, and may be having trouble because of "{" in the text I am trying to find.
\[\[(^[\{value\}]+)\{value\}(^\]\]+)\]\]
I'm not sure if it would be possible with regular expressions, but it would be ideal if it could find the matching closing bracket, for example the following would find "[[backreferenc[[e]] one ", "{value}", and "ba[[ckref[[e]]rence t]]wo]]":
This is some text [[backreferenc[[e]] one {value}
ba[[ckref[[e]]rence t]]wo]]

You need to use a MatchEvaluator with Regex.Replace. It would also make your life easier to break the matches up into named capture groups, which simplifies the processing inside the match evaluator. Let me explain.
What the MatchEvaluator does is let you intercede in the match process with a C# delegate: the delegate examines the actual match that was captured and returns the text that should replace it. That way you can do your text processing as needed.
Here is a basic example where it handles the sections in a basic way, but the structure is there to add your business logic:
string text = @"This is some text [[Name: {name}]] at [[Address: {address}]].";
Regex.Replace(text,
    @"(?:\[\[)(?<Section>[^\:]+)(?:\:)(?<Data>[^\]]+)(?:\]\])",
    new MatchEvaluator((mtch) =>
    {
        if (mtch.Groups["Section"].Value == "Name")
            return "Jabberwocky";
        return "120 Main";
    }));
The result of Regex Replace is:
This is some text Jabberwocky at 120 Main.

For the first part of your question, try this:
\[\[(.*)({value})(.*)\]\]
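
A possible way to apply this in C#, sketched under a few assumptions: the quantifiers are made lazy, the braces are escaped for clarity, and RegexOptions.Singleline is added so that . can cross the line break in the sample text. The names ValueFinder and Parts are invented for the example, and the groups capture the text between the brackets rather than the brackets themselves.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class ValueFinder
{
    // Lazy quantifiers keep each capture as short as possible;
    // RegexOptions.Singleline lets '.' match the line break as well.
    private static readonly Regex Pattern = new Regex(
        @"\[\[(.*?)(\{value\})(.*?)\]\]", RegexOptions.Singleline);

    // Returns the text before {value}, the literal {value}, and the text
    // after it, or an empty array when there is no match.
    public static string[] Parts(string text)
    {
        Match m = Pattern.Match(text);
        if (!m.Success) return new string[0];
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }
}
```

For the sample text from the question, Parts returns "backreference one ", "{value}", and "\r\nbackreference two". This sketch does not handle the nested [[ ]] case.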

Why does changing this regex class to .+ not provide any match?

If I use this
string showPattern = @"return new_lightox\(this\);"">[a-zA-Z0-9(\s),!\?\-:'&%]+</a>";
MatchCollection showMatches = Regex.Matches(pageSource, showPattern);
I get some matches, but I want to get rid of [a-zA-Z0-9(\s),!\?\-:'&%]+ and use any character (.+) instead,
but if I do this I get no match at all.
What am I doing wrong?

By default "." does not match newlines, but the class \s does.
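
A small sketch of that difference, assuming a link whose text spans two lines. The class name DotDemo and the sample markup in the usage note are made up for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class DotDemo
{
    private const string Pattern = @"return new_lightox\(this\);"">.+</a>";

    // Default options: '.' matches any character except '\n'.
    public static bool MatchesByDefault(string html)
    {
        return Regex.IsMatch(html, Pattern);
    }

    // Singleline: '.' matches '\n' as well.
    public static bool MatchesSingleline(string html)
    {
        return Regex.IsMatch(html, Pattern, RegexOptions.Singleline);
    }
}
```

With the input "<a onclick=\"return new_lightox(this);\">First\nline</a>", MatchesByDefault returns false and MatchesSingleline returns true, because the link text contains a newline that only \s (or . in Singleline mode) can consume.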

To let . match newlines, turn on Singleline (DOTALL) mode, either using a flag in the function call (as Abel's answer shows) or using the inline modifier (?s), like this for the whole expression:
@"(?s)return new_lightox\(this\);"">.+</a>"
Or for just the specific part of it:
@"return new_lightox\(this\);"">(?s:.+)</a>"
It might be better to take that a step further and do this:
@"return new_lightox\(this\);"">(?s:(?:(?!</?a).)+)</a>"
Here the negative lookahead is checked before every character, which should prevent the closing </a> from belonging to a different link.
However, you need to be very wary here. It's not clear what you're doing overall, but regex is not a good tool for parsing HTML and can cause all sorts of problems. Look at using an HTML DOM parser instead, such as HtmlAgilityPack.

You're matching a tag, so you probably want something along these lines, instead of .+:
string showPattern = @"return new_lightox\(this\);"">[^<]+</a>";
The reason the .+ version doesn't match is possibly that you are missing the Singleline flag and the closing tag is on the next line. In other words, this should work too:
// The Singleline option changes the dot (.) to match newlines too
MatchCollection showMatches = Regex.Matches(
    pageSource,
    showPattern,
    RegexOptions.Singleline);

Regular Expression to reject special characters other than commas

I am working in ASP.NET and using a RegularExpressionValidator.
Could you please help me create a regular expression that does not allow special characters other than the comma? The comma has to be allowed.
I checked regexlib, but I could not find a match. I tried ^(a-z|A-Z|0-9)*[^#$%^&*()']*$ . When I add other characters as invalid, it does not work.
Also could you please suggest me a place where I can find a good resource of regular expressions? regexlib seems to be big; but any other place which lists very limited but most used examples?
Also, can I create expressions using C# code? Any articles for that?

[\w\s,]+
works fine, as you can see below.
RegExr is a great place to test your regular expressions with real-time results; it also comes with a very complete list of common expressions.
[] is a character class.
\w matches any word character (alphanumeric and underscore).
\s matches any whitespace character (spaces, tabs, line breaks).
, includes the comma.
+ is a greedy quantifier that matches the preceding element one or more times.

[\d\w\s,]*
Just a guess

To answer your question about articles: I got started here, and I find it to be an excellent resource:
http://www.regular-expressions.info/
For your current problem, try something like this:
[\w\s,]*
Here's a breakdown:
Match a single character present in the list below «[\w\s,]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
A word character (letters, digits, etc.) «\w»
A whitespace character (spaces, tabs, line breaks, etc.) «\s»
The character “,” «,»

For a single character that is not a comma, [^,] should work perfectly fine.

You can try the regular expression [\w\s,]. This character class matches only word characters (letters, digits, and underscore), whitespace, and the comma; add a quantifier, as in [\w\s,]*, to cover the whole input. If any other character appears within the text, it won't match.
For your second question, regarding regular expression resources, you can go to
http://www.regular-expressions.info/
This website has a lot of tutorials on regex, plus a lot of useful information.
"Also, can I create expressions using C# code? Any articles for that?"
By this, do you mean that you want to know which classes and methods to use to execute regular expressions? Or do you want a tool that will create regular expressions for you?

You can create expressions with C#, something like this usually does the trick:
// Inside a character class, '|' and spaces would be literal characters,
// so only the allowed characters are listed.
Regex regex = new Regex(@"^[a-z0-9,]*$", RegexOptions.IgnoreCase);
System.Console.Write("Enter Text");
String s = System.Console.ReadLine();
Match match = regex.Match(s);
if (match.Success)
{
    System.Console.WriteLine("True");
}
else
{
    System.Console.WriteLine("False");
}
System.Console.ReadLine();
You need to import the System.Text.RegularExpressions namespace (using System.Text.RegularExpressions;).
The regular expression above accepts only numbers, letters (both upper and lower case) and the comma.
For a small introduction to regular expressions, I think the book for MCTS 70-536 can be a big help; I am pretty sure that you can either download it from somewhere or obtain a copy.
I am assuming that you never messed around with regular expressions in C#, hence I provided the code above.
Hope this helps.

Thank you, all..
[\w\s,]* works
Let me go through regular-expressions.info and come back if I need further support.
Let me try the C# code approach and come back if I need further support.
[This forum is awesome. Quality replies, and so quick.]
Thanks again

(…) denotes a group, not a character set; a character set is denoted with […]. So try this:
^[a-zA-Z0-9,]*$
This will only allow alphanumeric characters and the comma.
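
To try that pattern from C# code, a minimal sketch could look like the following. The class and method names (CommaOnlyValidator, IsValid) are invented for the example and are not part of ASP.NET.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class CommaOnlyValidator
{
    // ^ and $ anchor the pattern, so every character of the input
    // must be a letter, a digit, or a comma.
    private static readonly Regex Allowed = new Regex(@"^[a-zA-Z0-9,]*$");

    public static bool IsValid(string input)
    {
        return Allowed.IsMatch(input);
    }
}
```

IsValid("abc,123") returns true, while IsValid("abc#123") and IsValid("a b") return false. Because of the * quantifier an empty string is also accepted; use + instead if at least one character is required.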

Regex for string enclosed in <*>, C#

I am trying to get all strings enclosed in <*> by using following Regex:
Regex regex = new Regex(@"\<(?<name>\S+)\>", RegexOptions.IgnoreCase);
string name = e.Match.Groups["name"].Value;
But in some cases where I have text like :
<Vendors><Vtitle/> <VSurname/></Vendors>
It's returning two strings instead of four, i.e. above Regex outputs
<Vendors><Vtitle/> //as one string and
<VSurname/></Vendors> //as second string
Where as I am expecting four strings:
<Vendors>
<Vtitle/>
<VSurname/>
</Vendors>
Could you please guide me what change I need to make to my Regex.
I tried adding '\b' to specify a word boundary:
new Regex(@"\b\<(?<name>\S+)\>\b", RegexOptions.IgnoreCase);
but that didn't help.

You'll get most of what you want by using the regex <([^>]*)>. (There is no need to escape the angle brackets, as they aren't special characters in most regex engines, including the .NET engine.) The regex I provided will also capture trailing whitespace and any attributes on the tag; parsing those things reliably is way, way beyond the scope of a reasonable regex.
However, be aware that if you're trying to parse XML/HTML with a regex, that way lies madness.

Regexes are the wrong tool for parsing XML. Try using the System.Xml.Linq (XElement) API.
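
A minimal sketch of that approach, assuming the input is a well-formed XML fragment like the one in the question. The names XmlNames and ElementNames are made up for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class XmlNames
{
    // Parses the fragment and lists the root element and all of its
    // descendant elements by name, in document order.
    public static List<string> ElementNames(string xml)
    {
        XElement root = XElement.Parse(xml);
        return root.DescendantsAndSelf()
                   .Select(e => e.Name.LocalName)
                   .ToList();
    }
}
```

For "<Vendors><Vtitle/> <VSurname/></Vendors>" this returns Vendors, Vtitle, and VSurname. Note that it reports elements, not raw tags, so the closing </Vendors> is not listed separately, and XElement.Parse throws on malformed input.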

Your regex is using \S+ as the wildcard. In English, this is "a series of one or more characters, none of which is whitespace". In other words, when the regex <(?<name>\S+)> is applied to the string <Vendors><Vtitle/>, it will match the entire string, because angle brackets are non-whitespace too.
I think what you want is "a series of one or more characters, none of which is an angle bracket".
The regex for that is <(?<name>[^>]+)> .
Ahhh, regular expressions. The language designed to look like cartoon swearing.
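
Applied to the sample text from the question, the [^>]+ version can be sketched like this. TagFinder and Tags are hypothetical names used only for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class TagFinder
{
    // [^>]+ stops at the first '>', so each tag becomes its own match.
    private static readonly Regex Tag = new Regex(@"<(?<name>[^>]+)>");

    // Returns every <...> span in the text, in order of appearance.
    public static List<string> Tags(string text)
    {
        var tags = new List<string>();
        foreach (Match m in Tag.Matches(text))
            tags.Add(m.Value);
        return tags;
    }
}
```

For "<Vendors><Vtitle/> <VSurname/></Vendors>" this gives the four expected strings: <Vendors>, <Vtitle/>, <VSurname/>, and </Vendors>. The name group holds the text between the brackets, including any slash.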

Capture a single match only - Regex

I want to capture only the first match of the expression
<p>.*?</p>
I have tried <p>.*?</p>{1}, but it is not working: it returns all the p tags in the HTML document. Please help.

It looks like you are using a method that returns every match of a regex in the given string. That being the case, you need to anchor the regex to the beginning of the string so that it returns only the first match instead of every match:
^.*?<p>.*?</p>
Use parentheses to capture what you want to capture.
PS: Here goes the standard 'avoid using regex to parse HTML, use a proper HTML parser' advice. This simple regex will fail for nested <p> sections (which I don't recall if are valid in HTML, but still you can probably get them even if they aren't).

The Regex.Match method does this by default, and the regular expression is correct.
Regex regex = new Regex("<p>(.*?)</p>");
Match match = regex.Match("<p>1</p><p>2</p>");
Console.WriteLine("{0}", match.Value);
Running this program will print <p>1</p>, the first match only; match.Groups[1].Value holds just the captured 1.
