OK, I have a file that may or may not contain newlines or carriage returns; frankly, I need to ignore that. I need to search the document, find all the < and matching > tags, and remove everything inside them. I've been trying to get this to work for a while. My current regex is:
private Regex BracketBlockRegex = new Regex("<.*>", RegexOptions.Singleline);
....
resultstring = BracketBlockRegex.Replace(filecontents, "");
But this doesn't seem to be working because it catches WAY too much. Any clues? Is there something weird with the < and > symbols in C#?
Replace
<.*>
with
<.*?>
Try a variant of your regex that cannot run past the first closing bracket:
<[^>]*>
What you have, <.*>, will match the first < followed by everything up to the last >, whereas what you want is to match to the first one.
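A minimal sketch of the difference between the three patterns suggested in this thread, assuming C# 9 or later (top-level statements). The sample input is made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up fragment; the <i> tag is deliberately split across a line break.
string input = "<b>bold</b> and\r\n<i\r\nclass=\"x\">italic</i>";

// Greedy: runs from the first '<' to the LAST '>' in the whole input.
string greedy = Regex.Replace(input, "<.*>", "", RegexOptions.Singleline);

// Lazy: stops at the first '>' after each '<'.
string lazy = Regex.Replace(input, "<.*?>", "", RegexOptions.Singleline);

// Negated class: [^>] already matches \r and \n, so no option is needed.
string negated = Regex.Replace(input, "<[^>]*>", "");

Console.WriteLine($"greedy:  \"{greedy}\"");
Console.WriteLine($"lazy:    \"{lazy}\"");
Console.WriteLine($"negated: \"{negated}\"");
```

With this input the greedy pattern removes everything, while the lazy and negated-class patterns leave only the text between the tags.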
Regular expression quantifiers are greedy by default, and you've got a period, which matches ANYTHING; that just so happens to include the greater-than and less-than characters.
Try this...
<[^<>]*>
Arguably the best Regular Expression resource on the Internet.
Try:
private Regex BracketBlockRegex = new Regex("<.*?>", RegexOptions.Singleline);
Note that you may need to add some parsing qualifiers about how to interpret the source data.
An HTML tag can be split up at white space onto different lines:
<IMG
SRC="blah.jpg"
ALT="blah">
Some regular expression parsers may, or may not, match . to \r or \n depending on settings.
I'm trying to crawl a web page and get all interesting elements with a regex including the following term:
<font\s+face=""Arial"">(?<value>.+)</font>
I don't understand very well why there is a "?" before my "<value>"; could someone explain it to me? (This syntax works.)
For each match, I get my value like this:
var value = m.Groups["value"].Value;
My only problem is that when my <value> includes a CRLF, the pattern does not match, even if I specify "RegexOptions.Multiline" in C#.
Thanks for your answers.
The parentheses are the capturing part of the regex; (?<name>pattern) assigns a name to the capturing group, which is why you can refer to the match with ...Groups["value"]... instead of by the group's number, as is otherwise usual with regexes.
Use RegexOptions.Singleline to solve your problem (DOTALL in other regex flavours).
To clarify: RegexOptions.Multiline changes the meaning of ^ and $, while RegexOptions.Singleline changes the meaning of .; I found a full list here: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx
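A small sketch of the two options side by side, assuming C# 9 or later (top-level statements). The input string is made up; the pattern is the one from the question:

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up input whose <font> content spans a CRLF.
string html = "<font face=\"Arial\">first line\r\nsecond line</font>";
string pattern = @"<font\s+face=""Arial"">(?<value>.+)</font>";

// Multiline only changes ^ and $, so '.' still stops at '\n' and nothing matches.
bool withMultiline = Regex.IsMatch(html, pattern, RegexOptions.Multiline);

// Singleline lets '.' match '\n' as well, so the capture can span both lines.
Match m = Regex.Match(html, pattern, RegexOptions.Singleline);
string value = m.Groups["value"].Value;

Console.WriteLine($"Multiline matches: {withMultiline}");
Console.WriteLine($"Singleline matches: {m.Success}");
Console.WriteLine(value);
```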
I solved my problem using this syntax:
(?<value>.+(\n.*)?)
But now I don't understand another thing. Why, when I have this string:
style='font-family:Arial; font-size:10pt; mso-bidi-font-size:10.0pt;mso-bidi-font-family:"Times New Roman"'>Milord</span></b></p>
The term "Milord" is not matching in < value > with this pattern:
style='font\-family\:Arial;\s+font\-size\:10pt;\s+mso\-bidi\-font\-size\:10\.0pt;mso\-bidi\-font-family\:\n?"Times\s+New\s+Roman"'>(<font\s+face="Arial">?)(?<value>.+(\n.*)?)(</font>?)</span></b></p>
while I have specified these strings as optional
(<font\s+face="Arial">?)
(</font>?)
I really don't understand; I tried so many syntaxes with different places for the "?" and nothing gives my expected result!
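One likely explanation, sketched under assumptions: a "?" written inside the parentheses applies only to the token right before it (the final ">"), not to the whole group. The strings below are shortened, made-up versions of the markup in the question, and the lazy ".+?" is a change introduced here so that the capture does not swallow a trailing </font>. Assumes C# 9 or later (top-level statements):

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened, made-up versions of the markup from the question.
string plain = "'>Milord</span>";
string wrapped = "'><font face=\"Arial\">Milord</font></span>";

// As written in the question: each '?' makes only the final '>' optional,
// so "<font face="Arial"" and "</font" are still required.
string asWritten = @"'>(<font\s+face=""Arial"">?)(?<value>.+?)(</font>?)</span>";

// '?' moved outside the parentheses: each whole group becomes optional.
string regrouped = @"'>(<font\s+face=""Arial"">)?(?<value>.+?)(</font>)?</span>";

bool asWrittenMatchesPlain = Regex.IsMatch(plain, asWritten);
string fromPlain = Regex.Match(plain, regrouped).Groups["value"].Value;
string fromWrapped = Regex.Match(wrapped, regrouped).Groups["value"].Value;

Console.WriteLine($"as written matches plain text: {asWrittenMatchesPlain}");
Console.WriteLine($"regrouped, plain:   {fromPlain}");
Console.WriteLine($"regrouped, wrapped: {fromWrapped}");
```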
Dialects of regex differ, but for your newline issue, look for a regex flag called MULTILINE or DOTALL.
If the only problem is with line breaks, one of those should fix it.
I can't answer the angle brackets part, I think it's specific to your dialect of Regex as well (in C#)
I'm looking to match all text in the format foo:12345 that is not contained within an HTML anchor. For example, I'd like to match lines 1 and 3 from the following:
foo:123456
foo:123456
foo:123456
I've tried these regexes with no success:
Negative lookahead attempt ( incorrectly matches, but doesn't include the last digit )
foo:(\d+)(?!</a>)
Negative lookahead with non-capturing grouping
(?:foo:(\d+))(?!</a>)
Negative lookbehind attempt ( wildcards don't seem to be supported )
(?<!<a[^>]>)foo:(\d+)
If you want to start analysing HTML like this then you probably want to actually parse the HTML instead of using regular expressions. The HTML Agility Pack is the usual first port of call. Using regular expressions, it becomes hard to deal with things like <a></a>foo:123456<a></a>, which of course should pull out the middle bit, but it's extremely hard to write a regex that will do that.
I should add that I am assuming that you do in fact have a block of HTML rather than just individual short strings such as each of your lines above. Partly I ruled that out because matching it when it is the only thing on the line is pretty easy, so I figured you'd have got it if you wanted that. :)
Regex is usually not the best tool for the job, but if your case is very specific like in your example you could use:
foo:((?>\d+))(?!</a>)
Your first expression didn't work because \d+ would backtrack till (?!</a>) matches. This can be fixed by not allowing \d+ to backtrack, as above with help of an atomic/nonbacktracking group, or you could also make the lookahead fail in case \d+ backtracks, like:
foo:((?>\d+))(?!</a>|\d)
Although that is not as efficient.
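A sketch comparing the plain lookahead with the atomic-group and guarded-lookahead variants, assuming C# 9 or later (top-level statements). The anchor markup on the second line of the sample is an assumption made for illustration, and Digits is a hypothetical helper:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Made-up input; the anchor on the second line is an assumption for illustration.
string text = "foo:123456\n<a href=\"#\">foo:123456</a>\nfoo:123456";

// Hypothetical helper: returns the captured digits of every match of the pattern.
string[] Digits(string pattern) =>
    Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Groups[1].Value).ToArray();

// \d+ gives back the last digit so that the lookahead can succeed inside the anchor.
string[] backtracking = Digits(@"foo:(\d+)(?!</a>)");

// The atomic group (?>...) forbids giving digits back, so the anchored foo is skipped.
string[] atomic = Digits(@"foo:((?>\d+))(?!</a>)");

// Alternative: the lookahead also rejects a following digit, which defeats the backtracking.
string[] guarded = Digits(@"foo:(\d+)(?!</a>|\d)");

Console.WriteLine("backtracking: " + string.Join(", ", backtracking));
Console.WriteLine("atomic:       " + string.Join(", ", atomic));
Console.WriteLine("guarded:      " + string.Join(", ", guarded));
```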
Note that in many regex flavours a lookbehind does not work with a variable-length pattern inside (.NET is one of the exceptions), so you may want to work it out differently, for example:
Find and mark all foo-s that are contained in an anchor
Find all the others and do what you need with them
Remove the marks
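The three steps above could be sketched roughly as follows, assuming C# 9 or later (top-level statements). The marker text, the sample input, and the bracket-wrapping "goal" are all placeholders, and the sketch only handles anchors whose entire content is foo: plus digits:

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up input; the marker text and the bracket "goal" are placeholders.
string html = "foo:111 <a href=\"#\">foo:222</a> foo:333";
const string Marker = "@@FOO@@";

// 1. Mark every foo: that sits directly inside an anchor.
string marked = Regex.Replace(html, @"(<a\b[^>]*>)foo:(\d+</a>)", "${1}" + Marker + ":${2}");

// 2. Do the real work on the remaining foo: occurrences (here: wrap the digits in brackets).
string processed = Regex.Replace(marked, @"foo:(\d+)", "[$1]");

// 3. Remove the marks again.
string result = processed.Replace(Marker, "foo");

Console.WriteLine(result);
```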
This is probably a long-winded way of doing this, but you could simply bring back all occurrences of foo: followed by some digits, then exclude the unwanted ones afterwards:
string pattern = @"foo:\d+ |" +
                 @"foo:\d+[<]";
Then use a MatchCollection:
MatchCollection m0 = Regex.Matches(file, pattern, RegexOptions.Singleline);
Then loop through each occurrence:
foreach (Match m in m0)
{
    // . . . exclude the matches that contain the "<"
}
I would use LINQ and treat the HTML like XML, for example:
var query = MyHtml.Descendants().ToArray();
foreach (XElement result in query)
{
    // Value is the element's text content; skip anything that is itself an <a> element.
    if (Regex.IsMatch(result.Value, @"foo:123456") && result.Name.ToString() != "a")
    {
        //do something...
    }
}
Perhaps there's a better way, but I don't know it... this seems pretty straightforward to me :P
I am working in asp.net. I am using Regular Expression Validator
Could you please help me in creating a regular expression for not allowing special characters other than comma. Comma has to be allowed.
I checked in regexlib, however I could not find a match. I tried with ^(a-z|A-Z|0-9)*[^#$%^&*()']*$ . When I add other characters as invalid, it does not work.
Also, could you please suggest a place where I can find a good resource on regular expressions? regexlib seems to be big; is there any other place which lists a limited set of the most used examples?
Also, can I create expressions using C# code? Any articles for that?
[\w\s,]+
works fine.
RegExr is a great place to test your regular expressions with real-time results; it also comes with a very complete list of common expressions.
[] character class
\w matches any word character (alphanumeric and underscore)
\s matches any whitespace character (spaces, tabs, line breaks)
, includes the comma
+ is a greedy quantifier that matches the preceding element one or more times
[\d\w\s,]*
Just a guess
To answer on any articles, I got started here, find it to be an excellent resource:
http://www.regular-expressions.info/
For your current problem, try something like this:
[\w\s,]*
Here's a breakdown:
Match a single character present in the list below «[\w\s,]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
A word character (letters, digits, etc.) «\w»
A whitespace character (spaces, tabs, line breaks, etc.) «\s»
The character “,” «,»
For a single character that is not a comma, [^,] should work perfectly fine.
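A minimal sketch of using this character class from plain C# code, assuming C# 9 or later (top-level statements). Regex.IsMatch succeeds on a partial match, so the sketch adds ^ and $ to make the class cover the whole input. IsAllowed is a hypothetical helper name:

```csharp
using System;
using System.Text.RegularExpressions;

// ^ and $ force the character class to cover the whole input.
// Note that \w also accepts the underscore, and * accepts an empty string.
bool IsAllowed(string s) => Regex.IsMatch(s, @"^[\w\s,]*$");

Console.WriteLine(IsAllowed("red, green, blue")); // letters, spaces and commas only
Console.WriteLine(IsAllowed("50% off"));          // '%' is not in the class
Console.WriteLine(IsAllowed(""));                 // empty input passes because of '*'
```

If empty input should be rejected, swapping * for + is one option.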
You can try the [\w\s,] character class. It matches only word characters (letters, digits, underscore), whitespace, and the comma. If any other character appears in the text, it won't match.
For your second question, regarding regular expression resources, you can go to
http://www.regular-expressions.info/
This website has a lot of tutorials on regex, plus a lot of useful information.
Also, can I create expressions using C# code? Any articles for that?
By this, do you mean that you want to know which classes and methods to use to execute regular expressions? Or do you want a tool that will create regular expressions for you?
You can create expressions with C#; something like this usually does the trick:
// Inside a character class, '|', ' ' and '/' are literals, so they are left out here.
Regex regex = new Regex(@"^[a-z0-9,]*$", RegexOptions.IgnoreCase);
System.Console.Write("Enter Text");
String s = System.Console.ReadLine();
Match match = regex.Match(s);
if (match.Success)
{
    System.Console.WriteLine("True");
}
else
{
    System.Console.WriteLine("False");
}
System.Console.ReadLine();
You need to import the System.Text.RegularExpressions namespace.
The regular expression above, accepts only numbers, letters (both upper and lower case) and the comma.
For a small introduction to regular expressions, I think that the book for MCTS 70-536 can be a big help; I am pretty sure that you can either download it from somewhere or obtain a copy.
I am assuming that you never messed around with regular expressions in C#, hence I provided the code above.
Hope this helps.
Thank you, all..
[\w\s,]* works
Let me go through regular-expressions.info and come back if I need further support.
Let me try the C# code approach and come back if I need further support.
[This forum is awesome. Quality replies so quick..]
Thanks again
(…) denotes a grouping, not a character set; a character set is denoted with […]. So try this:
^[a-zA-Z0-9,]*$
This will only allow alphanumeric characters and the comma.
I have a string like:
string str = "https://abce/MyTest";
I want to check if the particular string starts with https:// and ends with /MyTest.
How can I achieve that?
This regular expression:
^https://.*/MyTest$
will do what you ask.
^ matches the beginning of the string.
https:// will match exactly that.
.* will match any number of characters (the * part) of any kind (the . part). If you want to make sure there is at least one character in the middle, use .+ instead.
/MyTest matches exactly that.
$ matches the end of the string.
To verify the match, use:
Regex.IsMatch(str, @"^https://.*/MyTest$");
More info at the MSDN Regex page.
Try the following:
var str = "https://abce/MyTest";
var match = Regex.IsMatch(str, "^https://.+/MyTest$");
The ^ anchor matches the start of the string, while the $ anchor matches the end of the string. The .+ bit simply means any non-empty sequence of characters.
You need to import the System.Text.RegularExpressions namespace for this, of course.
I want to check if the particular string starts with "https://" and ends with "/MyTest".
Well, you could use regex for that. But it's clearer (and probably quicker) to just say what you mean:
str.StartsWith("https://") && str.EndsWith("/MyTest")
You then don't have to worry about whether any of the characters in your match strings need escaping in regex. (For this example, they don't.)
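A short sketch comparing the two approaches, assuming C# 9 or later (top-level statements). ViaRegex and ViaString are hypothetical helper names, and the ordinal comparison is a choice made here to keep the string checks culture-independent. The last sample shows one case where the two disagree, because the prefix and suffix are allowed to overlap in the string version:

```csharp
using System;
using System.Text.RegularExpressions;

bool ViaRegex(string s) => Regex.IsMatch(s, @"^https://.*/MyTest$");

// Ordinal comparison avoids culture-sensitive matching.
bool ViaString(string s) =>
    s.StartsWith("https://", StringComparison.Ordinal) &&
    s.EndsWith("/MyTest", StringComparison.Ordinal);

foreach (string s in new[] { "https://abce/MyTest", "http://abce/MyTest", "https://MyTest" })
{
    Console.WriteLine($"{s}: regex={ViaRegex(s)}, string={ViaString(s)}");
}
```

In "https://MyTest" the second slash of the prefix doubles as the first character of the suffix, so the string checks pass while the regex, which needs a separate slash after the prefix, does not.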
In .NET:
^https://.*/MyTest$
Try Expresso, good for building .NET regexes and teaching you the syntax at the same time.
Handy tool for generating regular expressions
http://txt2re.com/
Ok sorry this might seem like a dumb question but I cannot figure this thing out :
I am trying to parse a string and simply want to check whether it only contains the following characters : '0123456789dD+ '
I have tried many things but just can't get to figure out the right regex to use!
Regex oReg = new Regex(@"[\d dD+]+");
oReg.IsMatch("e4");
will return true even though e is not allowed...
I've tried many strings, including Regex("[1234567890 dD+]+")...
It always works on Regex Pal but not in C#...
Please advise, and again I apologize if this seems like a very silly question.
Try this:
@"^[0-9dD+ ]+$"
The ^ and $ at the beginning and end signify the beginning and end of the input string respectively. Thus between the beginning and the end only the stated characters are allowed. In your example, the regex matches if the string contains one of the characters, even if it contains other characters as well.
@comments: Thanks, I fixed the missing + and space.
Oops, you forgot the boundaries, try:
Regex oReg = new Regex(@"^[0-9dD +]+$");
oReg.IsMatch("e4");
^ matches the beginning of the text stream, $ matches the end.
It is matching the 4; you need ^ and $ to terminate the regex if you want a full match for the entire string - i.e.
Regex re = new Regex(@"^[\d dD+]+$");
Console.WriteLine(re.IsMatch("e4"));
Console.WriteLine(re.IsMatch("4"));
This is because regular expressions can also match parts of the input; in this case it just matches the "4" of "e4". If you want to match the whole string, you have to surround the regex with "^" (matches the start of the input) and "$" (matches the end of the input).
So to make your example work, you have to write it as follows:
Regex oReg = new Regex(@"^[\d dD+]+$");
oReg.IsMatch("e4");
I believe it's returning True because it's finding the 4. Nothing in the regex excludes the letter e from the results.
Another option is to invert everything, so it matches on characters you don't want to allow:
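// The space inside the class keeps spaces allowed, as in the original character list.
Regex oReg = new Regex(@"[^0-9dD+ ]");
bool isValid = !oReg.IsMatch("e4");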
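The three approaches from this thread can be compared in a small sketch, assuming C# 9 or later (top-level statements). The helper names and sample inputs are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Unanchored: true as soon as ANY run of allowed characters is found.
bool Unanchored(string s) => Regex.IsMatch(s, @"[0-9dD+ ]+");

// Anchored: the allowed characters must span the whole input.
bool Anchored(string s) => Regex.IsMatch(s, @"^[0-9dD+ ]+$");

// Inverted: valid when no disallowed character is found (an empty string passes).
bool Inverted(string s) => !Regex.IsMatch(s, @"[^0-9dD+ ]");

foreach (string s in new[] { "e4", "2d6 + 3", "" })
{
    Console.WriteLine($"\"{s}\": unanchored={Unanchored(s)}, anchored={Anchored(s)}, inverted={Inverted(s)}");
}
```

The anchored and inverted versions agree on everything except the empty string, which the inverted check accepts because there is nothing disallowed in it.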