c# regex to match specific text - c#

I'm looking to match all text in the format foo:12345 that is not contained within an HTML anchor. For example, I'd like to match lines 1 and 3 from the following:
foo:123456
<a href="...">foo:123456</a>
foo:123456
I've tried these regexes with no success:
Negative lookahead attempt ( incorrectly matches, but doesn't include the last digit )
foo:(\d+)(?!</a>)
Negative lookahead with non-capturing grouping
(?:foo:(\d+))(?!</a>)
Negative lookbehind attempt ( wildcards don't seem to be supported )
(?<!<a[^>]>)foo:(\d+)

If you want to start analysing HTML like this then you probably want to actually parse the HTML instead of using regular expressions. The HTML Agility Pack is the usual first port of call. With regular expressions it becomes hard to deal with things like <a></a>foo:123456<a></a>, which of course should pull out the middle bit, but it's extremely hard to write a regex that will do that.
I should add that I am assuming that you do in fact have a block of HTML rather than just individual short strings such as each of your lines above. Partly I ruled that out because matching it when it is the only thing on the line is pretty easy, so I figured you'd have got it if you wanted that. :)

Regex is usually not the best tool for the job, but if your case is very specific like in your example you could use:
foo:((?>\d+))(?!</a>)
Your first expression didn't work because \d+ would backtrack till (?!</a>) matches. This can be fixed by not allowing \d+ to backtrack, as above with help of an atomic/nonbacktracking group, or you could also make the lookahead fail in case \d+ backtracks, like:
foo:(\d+)(?!</a>|\d)
Although that is not as efficient.
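To make the backtracking behaviour concrete, here is a small sketch that runs three patterns against an anchored and a plain occurrence: the plain lookahead, the atomic group, and a variant whose lookahead also rejects a following digit. The class name, the helper and the sample href are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Returns the digits captured by group 1, or "" when the pattern does not match.
    public static string CaptureDigits(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : "";
    }

    public static void Main()
    {
        // Hypothetical sample input; the href value is invented.
        string anchored = "<a href=\"#\">foo:123456</a>";
        string plain = "foo:123456";

        string[] patterns =
        {
            @"foo:(\d+)(?!</a>)",      // \d+ is allowed to backtrack
            @"foo:((?>\d+))(?!</a>)",  // atomic group, no backtracking
            @"foo:(\d+)(?!</a>|\d)",   // lookahead also rejects a following digit
        };

        foreach (string p in patterns)
        {
            Console.WriteLine($"{p}  anchored='{CaptureDigits(p, anchored)}'  plain='{CaptureDigits(p, plain)}'");
        }
    }
}
```

With the first pattern the anchored input still matches, minus the last digit, which is the symptom described in the question; the other two reject the anchored input and capture all six digits from the plain one.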

Note that in many regex engines lookbehind will not work with a variable-length pattern inside (.NET is an exception), so you may want to work it out differently, for example:
Find and mark all foo-s that are contained in an anchor
Do what you need with all the others
Remove the marks
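The three steps above could be sketched roughly as follows. This is only one way to read them: the marker character, the class and method names, and the anchor pattern are all assumptions, and a regex for <a> tags has the usual limits on real-world HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MarkAndSkipDemo
{
    // Hypothetical marker: a control character assumed not to occur in the input.
    private const string Mark = "\u0001";

    public static List<string> FindOutsideAnchors(string html)
    {
        // Step 1: inside every <a ...>...</a>, prefix each foo:digits with the marker.
        string marked = Regex.Replace(
            html,
            @"<a\b[^>]*>.*?</a>",
            anchor => Regex.Replace(anchor.Value, @"foo:\d+", Mark + "$0"),
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Step 2: collect only the occurrences that were not marked.
        var found = new List<string>();
        foreach (Match m in Regex.Matches(marked, @"(?<!\u0001)foo:\d+"))
        {
            found.Add(m.Value);
        }

        // Step 3: if the text itself is reused, the marks come off with marked.Replace(Mark, "").
        return found;
    }

    public static void Main()
    {
        string html = "foo:1 <a href=\"#\">foo:2</a> foo:3";
        Console.WriteLine(string.Join(", ", FindOutsideAnchors(html)));
    }
}
```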

This is probably a long-winded way of doing this, but you could simply bring back all occurrences of foo: followed by some digits, then exclude the unwanted ones afterwards.
string pattern = @"foo:\d+ |" +
                 @"foo:\d+[<]";
Then use a MatchCollection:
MatchCollection m0 = Regex.Matches(file, pattern, RegexOptions.Singleline);
Then loop through each occurrence:
foreach (Match m in m0)
{
    // Exclude the matches that contain the "<"
    if (m.Value.Contains("<")) continue;
    // ... use m.Value here
}

I would use LINQ and treat the HTML like XML, for example:
var query = MyHtml.Descendants().ToArray();
foreach (XElement result in query)
{
    // Skip anchors; note that Value also includes the text of child elements
    if (Regex.IsMatch(result.Value, @"foo:123456") && result.Name.ToString() != "a")
    {
        //do something...
    }
}
Perhaps there's a better way, but I don't know it... this seems pretty straightforward to me :P

Related

My regular expression isn't returning what I need

I have a block of text as such.
google.sbox.p50 && google.sbox.p50(["how to",[["how to tie a tie",0],["how to train your dragon 2 trailer",0],["how to do the cup song",0],["how to get a six pack in 3 minutes",0],["how to make a paper gun that shoots",0],["how to basic",0],["how to love lil wayne",0],["how to sing like your favorite artist",0],["how to be a heartbreaker marina and the diamonds",0],["how to tame a horse in minecraft",0]],{"q":"XJW--0IKH6sqOp0ME-x5B7b_5wY","j":"5","k":1}])
Using \\[([^]]+)\\] I am able to get everything I need, but with a little extra that I don't. I do not need the ["how to",[[. I only need the blocks that are formatted like,
["how to tie a tie",0]
Can someone please help me modify my expression to only get what I need? I've been at it for hours and I can't grasp the idea of RegEx.
Put both the opening and closing square brackets in the negated character class?
\\[([^][]+)\\]
\\[ matches a literal [
\\] matches a literal ]
[^][] is a negated class, which for instance matches any character except ][. It might be a little difficult to see it, but it's equivalent to [^\\]\\[]. Here the double escapes are not required because you are using a character class (just like \\. is equivalent to [.])
([^][]+) captures everything within square brackets, making sure there's no ] or [ inside.
In C#, you can use the @ symbol (a verbatim string) to avoid having to double escape every time; the only catch is that a literal quote is then written as "". Using this makes the regex look like this:
var regex = new Regex(@"\[([^][]+)\]");
Note: This regex will capture everything within square brackets. If you wish to specifically get the format ["how to tie a tie",0], you can be more precise. After all, the regex will only match stuff you make it match:
var regex = new Regex(@"\[""[^""]+"",0\]");
Here, we have another negated character class: [^"]. This will match any character which is not a quote character.
This one assumes that the digit is always 0, as depicted in your sample text block. If you have multiple possibilities of numbers, you can use the character class [0-9]+:
var regex = new Regex(@"\[""[^""]+"",[0-9]+\]");
You can use \d+ as well, but in .NET \d also matches digits from other scripts (any Unicode decimal digit) unless RegexOptions.ECMAScript is set, which may or may not make the regex worse for your data. If you want to be even more cautious by allowing possible spaces, tabs, newlines and form feeds in between the characters, you can use this regex:
var regex = new Regex(@"\[\s*""[^""]+""\s*,\s*[0-9]+\s*\]");
In conclusion, there might be many regexes which suit what you need; just make sure you know how your data is coming through so you can pick one which has the right amount of leeway.
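As a rough illustration of the whitespace-tolerant pattern in use, the sketch below adds a capture group around the quoted text (that group is my addition, not part of the answer) and collects the suggestions from a shortened copy of the sample response. The class and method names are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SuggestionRegexDemo
{
    // The whitespace-tolerant pattern, plus an assumed capture group around the quoted text.
    private static readonly Regex Entry =
        new Regex(@"\[\s*""([^""]+)""\s*,\s*[0-9]+\s*\]");

    public static List<string> ExtractSuggestions(string text)
    {
        var suggestions = new List<string>();
        foreach (Match m in Entry.Matches(text))
        {
            suggestions.Add(m.Groups[1].Value);
        }
        return suggestions;
    }

    public static void Main()
    {
        // Shortened version of the block of text from the question.
        string text = @"google.sbox.p50 && google.sbox.p50([""how to"",[[""how to tie a tie"",0],[""how to basic"",0]],{""q"":""abc"",""j"":""5"",""k"":1}])";
        foreach (string s in ExtractSuggestions(text))
        {
            Console.WriteLine(s);
        }
    }
}
```

The leading ["how to",[[ is skipped because the comma after the first quoted string is followed by a bracket rather than a number.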
I think this is what you are looking for to match the format of ["how to tie a tie",0]:
(\["[^"]+",\d\])
( ) - around the whole thing so it all gets captured in this group
\[" - find ["
[^"]+ - find one or more of anything except "
", - find ",
\d - find a number, if you want more than just a single digit, do \d+
\] - match the ending ]
The only variable things in this regex are whatever is within the quotes ([^"]+) and the number (\d+).
If you don't want the square brackets in the capture group, you can do it like this:
\[("[^"]+",\d+)\]
I assume you don't want to match if there are quotes within your quotes as it would probably break whatever purpose you are using it for, but if you do, this should work:
\[("[^[\]]+",\d+)\]
You must use this pattern:
@"\[[^][]+\]"
More information about square brackets here.
I think you need this one: (\[[^\[\]]+?\])
What you missed is the ? (smallest match) and excluding any [ or ].
Seemingly the text in the outer brackets is a JSON representation of an object. Instead of a regular expression I'd just:
Strip off the stuff before the first bracket, including the bracket itself (google.sbox.p50 && google.sbox.p50(), and strip off the trailing bracket ). There are several ways to do this, and it can be more efficient than a regex.
JSON parse the remaining inner part.
From that point you have the object representation: you can leave out the first element of the array, which you don't need, and you have everything else in a traversable form.
There's session information at the end along with parameters anyway (in {} brackets), so you may end up parsing it in the end. Better not to reinvent the wheel (JSON parsing).
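A minimal sketch of that approach, assuming System.Text.Json is available (.NET Core 3.0 or later, or the NuGet package) and that the response always has the shape shown in the question. The answer does not name a JSON library, so this choice, like the class and method names, is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class SuggestionJsonDemo
{
    public static List<string> ParseSuggestions(string response)
    {
        // Strip everything up to and including the first "(" and the trailing ")".
        int open = response.IndexOf('(');
        int close = response.LastIndexOf(')');
        if (open < 0 || close <= open)
        {
            throw new FormatException("Response is not in the expected callback(...) form.");
        }
        string json = response.Substring(open + 1, close - open - 1);

        // Parse the remaining inner part: [query, [[suggestion, number], ...], {session info}].
        using JsonDocument doc = JsonDocument.Parse(json);
        var suggestions = new List<string>();
        foreach (JsonElement entry in doc.RootElement[1].EnumerateArray())
        {
            suggestions.Add(entry[0].GetString() ?? "");
        }
        return suggestions;
    }

    public static void Main()
    {
        // Shortened version of the block of text from the question.
        string response = @"google.sbox.p50 && google.sbox.p50([""how to"",[[""how to tie a tie"",0],[""how to basic"",0]],{""q"":""abc"",""j"":""5"",""k"":1}])";
        Console.WriteLine(string.Join("; ", ParseSuggestions(response)));
    }
}
```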

Regex in a string

I need some help on a problem.
I am trying to check an image's type by its hexadecimal signature.
string JpgHex = "FF-D8-FF-E0-xx-xx-4A-46-49-46-00";
Then I have a condition on
string.StartsWith(pngHex).
The problem is that the "x" characters present in my "JpgHex" string can be anything.
I think I need a regex to check that, but I don't know how!
Thanks a lot!
I'm not quite clear what exactly you want to do, but the dot '.' character represents any character in Regex.
So the regex "^FF-D8-FF-E0-..-..-4A-46-49-46-00" will probably do the trick. '^' = Start of input.
If you want to allow only hex chars you can use "^FF-D8-FF-E0-[0-9A-F]{2}-[0-9A-F]{2}-4A-46-49-46-00".
Like I said, I'd need a better idea of what pattern you need to match.
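For what it's worth, here is how the hex-only pattern might be wired up, assuming the hyphenated string comes from BitConverter.ToString (the question's format suggests this, but it is a guess). The class name, method name and sample bytes are invented for the example. The pattern is anchored only at the start, to mirror the StartsWith check in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class JpegHeaderDemo
{
    // Anchored with ^ only, so it behaves like StartsWith with two wildcard bytes.
    private static readonly Regex JfifHeader =
        new Regex(@"^FF-D8-FF-E0-[0-9A-F]{2}-[0-9A-F]{2}-4A-46-49-46-00");

    public static bool LooksLikeJpeg(byte[] data)
    {
        // BitConverter.ToString yields upper-case hex pairs separated by hyphens, e.g. "FF-D8-FF".
        string hex = BitConverter.ToString(data);
        return JfifHeader.IsMatch(hex);
    }

    public static void Main()
    {
        // Invented sample bytes: a JFIF-style header and a PNG-style header.
        byte[] jpeg = { 0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10, 0x4A, 0x46, 0x49, 0x46, 0x00, 0x01 };
        byte[] png = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

        Console.WriteLine(LooksLikeJpeg(jpeg));
        Console.WriteLine(LooksLikeJpeg(png));
    }
}
```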
Here are some examples:
Regex rgx =
    new Regex(@"^FF-D8-FF-E0-[a-zA-Z0-9]{2}-[a-zA-Z0-9]{2}-4A-46-49-46-00$");
rgx.IsMatch(pngHex); // IsMatch returns a bool
I use [a-zA-Z0-9]{2} to denote two instances of a character, caps or small or a number. So the above regex would match :
FF-D8-FF-E0-aa-zZ-4A-46-49-46-00
FF-D8-FF-E0-11-22-4A-46-49-46-00
.. etc
Based on your need change the regex accordingly so for capitals and numbers only you change to [A-Z0-9]. The {2} denotes two occurrences.
The ^ denotes the string should start with FF and $ means the string should end with 00.
Let's say you wanted to only match two digits, so you would use \d{2}; the whole thing would look like this:
Regex rgx = new Regex(@"^FF-D8-FF-E0-\d{2}-\d{2}-4A-46-49-46-00$");
rgx.IsMatch(pngHex);
How do I know of these magical characters? Simple, there are docs everywhere. See this MSDN page for some basic regex patterns. This page shows some quantifiers, those are things like match one or more or match only one.
Cheat-sheets also come in handy.
A regex would help you; you can use the following tool to help you test and learn:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
I recommend you have a play because then you'll learn!
To simply match any character in place of the x, the following should work:
"^FF-D8-FF-E0-..-..-4A-46-49-46-00$"
In C#, it would be something like this:
var test = "FF-D8-FF-E0-AB-CD-4A-46-49-46-00";
var foo = new Regex("^FF-D8-FF-E0-..-..-4A-46-49-46-00$");
if (foo.IsMatch(test))
{
// Do magic
}
You will need to read up on regular expressions to understand some of the characters that may not look familiar, i.e. ^ and $. See http://www.regular-expressions.info/

Why does changing this regex class to .+ not provide any match?

If I use this
string showPattern = @"return new_lightox\(this\);"">[a-zA-Z0-9(\s),!\?\-:'&%]+</a>";
MatchCollection showMatches = Regex.Matches(pageSource, showPattern);
I get some matches, but I want to get rid of [a-zA-Z0-9(\s),!\?\-:'&%]+ and use any char .+
but if I do this I get no match at all.
What am I doing wrong?
By default "." does not match newlines, but the class \s does.
To let . match newlines, turn on Singleline (DOTALL) mode, either using a flag in the function call (as Abel's answer shows), or using the inline modifier (?s), like this for the whole expression:
@"(?s)return new_lightox\(this\);"">.+</a>"
Or for just the specific part of it:
@"return new_lightox\(this\);"">(?s:.+)</a>"
It might be better to take that a step further and do this:
@"return new_lightox\(this\);"">(?s:(?:(?!</?a).)+)</a>"
Which should prevent the closing </a> from belonging to a different link, because the lookahead is checked before every character that . consumes.
However, you need to be very wary here - it's not clear what you're doing overall, but regex is not a good tool for parsing HTML with, and can cause all sorts of problems. Look at using a HTML DOM parser instead, such as HtmlAgilityPack.
You're matching a tag, so you probably want something along these lines, instead of .+:
string showPattern = @"return new_lightox\(this\);"">[^<]+</a>";
The reason that the match doesn't hit is possibly that you are missing the Singleline flag and the closing tag is on the next line. In other words, this should work too:
// Singleline option changes the dot (.) to match newlines too
MatchCollection showMatches = Regex.Matches(
    pageSource,
    showPattern,
    RegexOptions.Singleline);
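The effect is easy to reproduce with a link whose text spans two lines. The sketch below uses an invented page fragment; only the patterns come from the question and answers.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // The .+ version of the pattern from the question.
    public const string DotPattern = @"return new_lightox\(this\);"">.+</a>";

    // The [^<]+ version suggested above.
    public const string NegatedPattern = @"return new_lightox\(this\);"">[^<]+</a>";

    public static bool Matches(string pageSource, string pattern, RegexOptions options)
    {
        return Regex.IsMatch(pageSource, pattern, options);
    }

    public static void Main()
    {
        // Invented fragment: the link text contains a newline.
        string pageSource = "<a onclick=\"return new_lightox(this);\">First\nshow</a>";

        Console.WriteLine(Matches(pageSource, DotPattern, RegexOptions.None));       // . stops at the newline
        Console.WriteLine(Matches(pageSource, DotPattern, RegexOptions.Singleline)); // . now crosses the newline
        Console.WriteLine(Matches(pageSource, NegatedPattern, RegexOptions.None));   // [^<] matches newlines anyway
    }
}
```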

Regular Expression to reject special characters other than commas

I am working in asp.net. I am using Regular Expression Validator
Could you please help me in creating a regular expression for not allowing special characters other than comma. Comma has to be allowed.
I checked in regexlib; however, I could not find a match. I tried with ^(a-z|A-Z|0-9)*[^#$%^&*()']*$ . When I add other characters as invalid, it does not work.
Also could you please suggest me a place where I can find a good resource of regular expressions? regexlib seems to be big; but any other place which lists very limited but most used examples?
Also, can I create expressions using C# code? Any articles for that?
[\w\s,]+
works fine, as you can see below.
RegExr is a great place to test your regular expressions with real-time results; it also comes with a very complete list of common expressions.
[] character class
\w matches any word character (alphanumeric & underscore)
\s matches any whitespace character (spaces, tabs, line breaks)
, includes the comma
+ is a greedy match, which will match the previous item 1 or more times
[\d\w\s,]*
Just a guess
To answer on any articles, I got started here, find it to be an excellent resource:
http://www.regular-expressions.info/
For your current problem, try something like this:
[\w\s,]*
Here's a breakdown:
Match a single character present in the list below «[\w\s,]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
A word character (letters, digits, etc.) «\w»
A whitespace character (spaces, tabs, line breaks, etc.) «\s»
The character “,” «,»
For a single character that is not a comma, [^,] should work perfectly fine.
You can try the [\w\s,] regular expression. This regex will match only alphanumeric characters, underscores, whitespace and commas. If any other character appears within the text, then this won't match.
For your second question regarding regular expression resources, you can go to
http://www.regular-expressions.info/
This website has a lot of tutorials on regex, plus it has a lot of useful information.
Also, can I create expressions using
C# code? Any articles for that?
By this, do you mean that you want to know which classes and methods are used to execute regular expressions? Or do you want a tool that will create regular expressions for you?
You can create expressions with C#; something like this usually does the trick:
Regex regex = new Regex(@"^[a-z0-9,]*$", RegexOptions.IgnoreCase);
System.Console.Write("Enter Text");
String s = System.Console.ReadLine();
Match match = regex.Match(s);
if (match.Success)
{
    System.Console.WriteLine("True");
}
else
{
    System.Console.WriteLine("False");
}
System.Console.ReadLine();
You need to import the System.Text.RegularExpressions namespace.
The regular expression above accepts only numbers, letters (both upper and lower case) and the comma.
For a small introduction to Regular Expressions, I think that the book for MCTS 70-536 can be of a big help, I am pretty sure that you can either download it from somewhere or obtain a copy.
I am assuming that you never messed around with regular expressions in C#, hence I provided the code above.
Hope this helps.
Thank you, all..
[\w\s,]* works
Let me go through regular-expressions.info and come back if I need further support.
Let me try the C# code approach and come back if I need further support.
[This forum is awesome. Quality replies so quick..]
Thanks again
(…) denotes a grouping, not a character set, which is denoted with […]. So try this:
^[a-zA-Z0-9,]*$
This will only allow alphanumeric characters and the comma.
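A quick way to try that last pattern from code, as a sketch with invented class and method names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommaValidatorDemo
{
    // Letters, digits and commas only; the anchors force the whole input to match.
    private static readonly Regex Allowed = new Regex(@"^[a-zA-Z0-9,]*$");

    public static bool IsValid(string input)
    {
        return Allowed.IsMatch(input);
    }

    public static void Main()
    {
        foreach (string sample in new[] { "abc,123", "abc#def", "a b", "" })
        {
            Console.WriteLine($"'{sample}' -> {IsValid(sample)}");
        }
    }
}
```

Because of the * quantifier an empty string is accepted, and in .NET $ also matches just before a final newline; \z in place of $ is the stricter choice if that matters.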

Regex with < and >

OK, I have a file that may or may not contain newlines or carriage returns. Frankly, I need to ignore that. I need to search the document, find all the < and matching > tags, and remove everything inside them. I've been trying to get this to work for a bit; my current regex is:
private Regex BracketBlockRegex = new Regex("<.*>", RegexOptions.Singleline);
....
resultstring = BracketBlockRegex.Replace(filecontents, "");
but this doesn't seem to be working because it catches WAY too much. Any clues? Is there something weird with the < and > symbols in C#?
Replace
<.*>
with
<.*?>
Try a variant of your regex that cannot run past the first >:
<[^>]*>
What you have, <.*>, will match the first < followed by everything up to the last >, whereas what you want is to match to the first one.
Quantifiers in regular expressions are greedy by default, and you've got a period, which equates to ANYTHING and so just so happens to include the greater than and less than characters.
Try this...
<[^<>]*>
Arguably the best Regular Expression resource on the Internet.
Try:
private Regex BracketBlockRegex = new Regex("<.*?>", RegexOptions.Singleline);
Note you may need to add some parsing qualifiers about how to interpret the source data.
An HTML tag can be split up at white space onto different lines.
<IMG
SRC="blah.jpg"
ALT="blah">
Some regular expression parsers may, or may not, match . to \r or \n depending on settings.
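The difference between the suggested patterns shows up on a small input. In this sketch the class name, helper and sample strings are made up; the patterns are the ones from the answers above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // Removes every match of the given pattern from the input.
    public static string Strip(string input, string pattern, RegexOptions options = RegexOptions.None)
    {
        return Regex.Replace(input, pattern, "", options);
    }

    public static void Main()
    {
        string oneLine = "<b>bold</b> text <i>x</i>";
        string splitTag = "<IMG\nSRC=\"a.jpg\">after";

        // Greedy: runs from the first < to the last >, so everything goes.
        Console.WriteLine("[" + Strip(oneLine, "<.*>", RegexOptions.Singleline) + "]");
        // Lazy: stops at the first > after each <.
        Console.WriteLine("[" + Strip(oneLine, "<.*?>", RegexOptions.Singleline) + "]");
        // Lazy without Singleline: . cannot cross the newline inside the tag.
        Console.WriteLine("[" + Strip(splitTag, "<.*?>") + "]");
        // Negated class: [^>] matches newlines regardless of options.
        Console.WriteLine("[" + Strip(splitTag, "<[^>]*>") + "]");
    }
}
```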
