regex matching <value> with a CRLF - c#

I'm trying to crawl a web page and get all interesting elements with a regex including the following term:
<font\s+face=""Arial"">(?<value>.+)</font>
I don't really understand why there is a "?" before my "<value>"; could someone explain it to me? (This syntax works.)
For each match, I get my value like this:
var value = m.Groups["value"].Value;
My only problem is that when my <value> includes a CRLF, the pattern does not match, even if I specify "RegexOptions.Multiline" in C#.
Thanks for your answers.

The parentheses are the capturing part of the regex. (?<name>pattern) assigns a name to the capturing group, which is why you can refer to the match with ...Groups["value"]... instead of the group number, as is otherwise usual with regexps.
Use RegexOptions.Singleline to solve your problem (DOTALL in other regexp flavours).
To clarify: RegexOptions.Multiline changes the meaning of ^ and $, while RegexOptions.Singleline changes the meaning of . so that it also matches a newline. I found a full list here: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx
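A minimal sketch of the difference between the two options, assuming a made-up input string (the `html` value below is hypothetical, not taken from the crawled page):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: the text between the tags spans two lines (CRLF).
string html = "<font face=\"Arial\">first line\r\nsecond line</font>";
string pattern = @"<font\s+face=""Arial"">(?<value>.+)</font>";

// Multiline only changes ^ and $, so '.' still stops at '\n' and nothing matches.
Match multi = Regex.Match(html, pattern, RegexOptions.Multiline);

// Singleline lets '.' match '\n' as well, so the group can cross the line break.
Match single = Regex.Match(html, pattern, RegexOptions.Singleline);

Console.WriteLine(multi.Success);                 // False
Console.WriteLine(single.Groups["value"].Value);  // both lines, CRLF included
```

With Singleline, a greedy .+ can run across several tags on a page, so a lazy .+? may be the safer choice when the input contains more than one font element.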

I solved my problem using this syntax:
(?<value>.+(\n.*)?)
But now there is another thing I don't understand. When I have this string:
style='font-family:Arial; font-size:10pt; mso-bidi-font-size:10.0pt;mso-bidi-font-family:"Times New Roman"'>Milord</span></b></p>
The term "Milord" is not captured in <value> by this pattern:
style='font\-family\:Arial;\s+font\-size\:10pt;\s+mso\-bidi\-font\-size\:10\.0pt;mso\-bidi\-font-family\:\n?"Times\s+New\s+Roman"'>(<font\s+face="Arial">?)(?<value>.+(\n.*)?)(</font>?)</span></b></p>
even though I have specified these strings as optional:
(<font\s+face="Arial">?)
(</font>?)
I really don't understand. I tried so many variations with the "?" in different places, and none of them gives the result I expect!

Dialects of regex differ, but for your newline issue, look for a regex flag called MULTILINE and/or DOTALL.
If the only problem is with line breaks, one of those should fix it.
I can't answer the angle brackets part; I think it's specific to your dialect of regex as well (in C#).
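On the optional font tags from the follow-up question: one likely cause is the position of the "?". A quantifier applies to the single token just before it, so inside the parentheses it only makes the final ">" optional. The sketch below uses a shortened, hypothetical input and pattern to show the two placements side by side; it is an illustration, not the full original pattern:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical, shortened input: no <font> wrapper around the text.
string input = "'>Milord</span>";

// Here '?' follows '>' inside the group, so only '>' is optional;
// the text <font face="Arial" is still required and the match fails.
string quantifierInside = @"'>(<font\s+face=""Arial"">?)(?<value>.+?)(</font>?)</span>";

// Here '?' follows the closing parenthesis, so each whole group is optional.
string quantifierOutside = @"'>(<font\s+face=""Arial"">)?(?<value>.+?)(</font>)?</span>";

Console.WriteLine(Regex.IsMatch(input, quantifierInside));  // False
Match m = Regex.Match(input, quantifierOutside);
Console.WriteLine(m.Groups["value"].Value);                 // Milord
```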

Related

Use OR in Regex Expression

I have a regex to match the following:
somedomain.com/services/something
Basically I need to ensure that /services is present.
The regex I am using and which is working is:
\/services*
But I need to match /services OR /servicos. I tried the following:
(\/services|\/servicos)*
But this shows 24 matches?! https://regex101.com/r/jvB1lr/1
How to create this regex?
The (\/services|\/servicos)* matches 0+ occurrences of /services or /servicos, and that means it can match an empty string anywhere inside the input string.
You can group the alternatives like /(services|servicos) and remove the * quantifier, but for this case, it is much better to use a character class [oe] as the strings only differ in 1 char.
You want to use the following pattern:
/servic[eo]s
See the regex demo
To make sure you match a whole subpart, you may append (?:/|$) at the pattern end, /servic[eo]s(?:/|$).
In C#, you may use Regex.IsMatch with the pattern to see if there is a match in a string:
var isFound = Regex.IsMatch(s, @"/servic[eo]s(?:/|$)");
Note that you do not need to escape / in a .NET regex as it is not a special regex metacharacter.
Pattern details
/ - a /
servic[eo]s - services or servicos
(?:/|$) - / or end of string.
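A short sketch of how the pattern behaves, using a few made-up URLs (the second and third inputs are assumptions added for illustration), plus a count showing why the starred version reports so many matches:

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"/servic[eo]s(?:/|$)";

Console.WriteLine(Regex.IsMatch("somedomain.com/services/something", pattern)); // True
Console.WriteLine(Regex.IsMatch("somedomain.com/servicos", pattern));           // True
Console.WriteLine(Regex.IsMatch("somedomain.com/servicesabc", pattern));        // False

// The starred version can match the empty string, so it "matches" at every position.
int emptyMatches = Regex.Matches("abc", @"(/services|/servicos)*").Count;
Console.WriteLine(emptyMatches); // 4: one empty match before each character and one at the end
```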
Well the * quantifier means zero or more, so that is the problem. Remove that and it should work fine:
(\/services|\/servicos)
Keep in mind that in your example, you have a typo in the URL so it will correctly not match anything as it stands.
Here is an example with the typo in the URL fixed, so it shows 1 match as expected.
First off, you specify C# in this post (really, .NET is the library which holds the regex engine, not the language), but regex101 in your example is set to PHP. That gives you misleading information, such as the need to escape a forward slash / as \/, which is unnecessary in .NET regular expressions. The regex language is largely the same, but different tools behave differently, and PHP's regex is not like .NET's.
Secondly, the star * on the ( ) says that there may be nothing in the parentheses, so the pattern produces an empty match at every position where nothing else matches.
Thirdly, one does not need to spell out each whole word. I would just extract what differs between the words into a set [ ]. That gives the "or-ness" you need to match either services or servicos, such as:
(/servic[oe]s)
That will tell you whether services is found or not. Nothing else is needed.

Extract string from a pattern preceded by any length

I'm looking for a regular expression to extract a string from a file name
eg if filename format is "anythingatallanylength_123_TESTNAME.docx", I'm interested in extracting "TESTNAME" ... probably fixed length of 8. (btw, 123 can be any three digit number)
I think I can use regex match ...
".*_[0-9][0-9][0-9]_[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z].docx$"
However this matches the whole thing. How can I just get "TESTNAME"?
Thanks
Use parentheses to capture a specific piece of the whole regex.
You can also use curly braces to specify counts of matching characters, and \d for [0-9].
In C#:
var myRegex = new Regex(@".*_\d{3}_([A-Za-z]{8})\.docx$");
Now "TESTNAME", or whatever your 8-letter piece is, will be found in the Groups collection of the match (as Groups[1]) after using the regex.
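A minimal sketch of reading the captured piece back out, using the file name from the question:

```csharp
using System;
using System.Text.RegularExpressions;

var myRegex = new Regex(@".*_\d{3}_([A-Za-z]{8})\.docx$");
Match m = myRegex.Match("anythingatallanylength_123_TESTNAME.docx");

// Groups[0] is the whole match; Groups[1] is the first parenthesised part.
string name = m.Success ? m.Groups[1].Value : "";
Console.WriteLine(name); // TESTNAME
```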
Also note, there will be a performance overhead for look-ahead and look-behind, as presented in some other solutions.
You can use a look-behind and a look-ahead to check parts without matching them:
(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)
Note that this is case-sensitive, you may want to use other character classes and/or quantifiers to fit your exact pattern.
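A short sketch of the look-around version in C#: here the whole match is already the wanted piece, so no group index is needed. The lowercase file name is a made-up input to show the case sensitivity:

```csharp
using System;
using System.Text.RegularExpressions;

string lookaround = @"(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)";

// The look-behind and look-ahead are checked but not included in the match.
Match upper = Regex.Match("anythingatallanylength_123_TESTNAME.docx", lookaround);
Console.WriteLine(upper.Value); // TESTNAME

// Hypothetical lowercase name: [A-Z] does not match it.
Match lower = Regex.Match("anythingatallanylength_123_testname.docx", lookaround);
Console.WriteLine(lower.Success); // False
```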
In your file name format "anythingatallanylength_123_TESTNAME.docx", the part you are trying to match is the string between the last underscore _ and .docx. Keeping in mind that the underscore before it should not be part of the match, I came up with the following solution.
Regex: (?<=_)[A-Za-z]*(?=\.docx$)
Flags used:
g global search
m multi-line search.
Explanation:
(?<=_) checks if there is an underscore before the file name.
(?=\.docx$) checks for extension at the end.
[A-Za-z]* checks the required match.
Regex101 Demo
Thanks to @Lucero @noob @JamesFaix I came up with ...
@"(?<=.*_[0-9]{3}_)[A-Z]{8}(?=\.docx$)"
So, a look-behind (in brackets, starting with ?<=) for anything (i.e. zero or more of any char, denoted by ".*") followed by an underscore, followed by three digits, followed by an underscore. That's the end of the look-behind. Then comes what I need to match (eight letters). Finally, the look-ahead (in brackets, starting with ?=), which is the .docx.
Nice work, fellas. Thunderbirds are go.

Regex for git's repository

I want to use regex to validate git repository url. I found a few answers on stackoverflow but none of them passes my tests.
The debug is here: http://regexr.com/39qia
How can I make it pass the last four cases?
git@git.host.hy:group-name/project-name.git
git@git.ho-st.hy:group-name/project-name.git
http://host.xy/agroup-name/project-name.git
http://ho-st.xy/agroup-name/project-name.git
I can't be certain since I'm not familiar with git link syntaxes, but the following regex will additionally match the 4 next values:
((git|ssh|http(s)?)|(git@[\w.-]+))(:(//)?)([\w.@\:/~-]+)(\.git)(/)?
^ ^^ ^
I have indicated the changed parts; namely:
Added - to the part after @ because ho-st was not passing otherwise.
Moved - to the end of the character class because otherwise /-~ would mean the character range / to ~, which matches a lot of characters.
Escaped the final dot (thanks @MatiCicero).
There are a lot of things that could be simplified from the above, but since I don't know your exact goals, I'm leaving the regex as close as possible to the one you have.
You can try this one:
(?'protocol'git@|https?:\/\/)(?'domain'[a-zA-Z0-9\.\-_]+)(\/|:)(?'group'[a-zA-Z0-9\-]+)\/(?'project'[a-zA-Z0-9\-]+)\.git
You can then extract the needed information from the matched groups.
You can test this regex on: Regex101
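A sketch of extracting the named groups in C#, assuming the git@ form of the first alternative and using one of the test URLs from the question. .NET accepts the (?'name'...) syntax as well as (?<name>...):

```csharp
using System;
using System.Text.RegularExpressions;

var repoRegex = new Regex(
    @"(?'protocol'git@|https?:\/\/)(?'domain'[a-zA-Z0-9\.\-_]+)(\/|:)" +
    @"(?'group'[a-zA-Z0-9\-]+)\/(?'project'[a-zA-Z0-9\-]+)\.git");

Match m = repoRegex.Match("git@git.ho-st.hy:group-name/project-name.git");

// Each named group is available by name once the match succeeds.
Console.WriteLine(m.Groups["protocol"].Value); // git@
Console.WriteLine(m.Groups["domain"].Value);   // git.ho-st.hy
Console.WriteLine(m.Groups["group"].Value);    // group-name
Console.WriteLine(m.Groups["project"].Value);  // project-name
```

For validation rather than extraction, the pattern would probably also need ^ and $ anchors, since without them it can match in the middle of a longer string.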
Ok, the following expression matches all of your current test-text and does not match any of your false positives provided before:
((((git|user)@[\w.-]+)|(git|ssh|http(s)?|file))(:(\/){0,3}))?([\w.@\:/~\-]+)(\.git)(\/)?
See also, regex.
Caveat: Be aware, that currently input is matched with '~' and '-' appearing in places where they shouldn't.

Regular Expression to reject special characters other than commas

I am working in ASP.NET. I am using a Regular Expression Validator.
Could you please help me create a regular expression that rejects special characters other than the comma? The comma has to be allowed.
I checked regexlib, but I could not find a match. I tried ^(a-z|A-Z|0-9)*[^#$%^&*()']*$ . When I add other characters as invalid, it does not work.
Also, could you please suggest a good resource for regular expressions? regexlib seems to be big; is there any other place that lists a limited set of the most used examples?
Also, can I create expressions using C# code? Any articles for that?
[\w\s,]+
works fine, as you can see below.
RegExr is a great place to test your regular expressions with real time results, it also comes with a very complete list of common expressions.
[] character class
\w matches any word character (alphanumeric & underscore).
\s matches any whitespace character (spaces, tabs, line breaks).
, includes the comma.
+ is a greedy quantifier, which matches the previous token 1 or more times.
[\d\w\s,]*
Just a guess
To answer on any articles, I got started here, find it to be an excellent resource:
http://www.regular-expressions.info/
For your current problem, try something like this:
[\w\s,]*
Here's a breakdown:
Match a single character present in the list below «[\w\s,]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
A word character (letters, digits, etc.) «\w»
A whitespace character (spaces, tabs, line breaks, etc.) «\s»
The character “,” «,»
For a single character that is not a comma, [^,] should work perfectly fine.
You can try the [\w\s,] regular expression. It matches only alphanumeric characters, the underscore, whitespace and the comma. If any other character appears within the text, it won't match.
For your second question regarding regular expression resources, you can go to
http://www.regular-expressions.info/
This website has a lot of tutorials on regex, plus a lot of useful information.
Also, can I create expressions using C# code? Any articles for that?
By this, do you mean that you want to know which classes and methods are used to execute regular expressions? Or do you want a tool that will create the regular expression for you?
You can create expressions with C#, something like this usually does the trick:
Regex regex = new Regex(@"^[a-z0-9,]*$", RegexOptions.IgnoreCase);
System.Console.Write("Enter Text");
String s = System.Console.ReadLine();
Match match = regex.Match(s);
if (match.Success == true)
{
System.Console.WriteLine("True");
}
else
{
System.Console.WriteLine("False");
}
System.Console.ReadLine();
You need to import the System.Text.RegularExpressions namespace (using System.Text.RegularExpressions;).
The regular expression above, accepts only numbers, letters (both upper and lower case) and the comma.
For a small introduction to Regular Expressions, I think that the book for MCTS 70-536 can be of a big help, I am pretty sure that you can either download it from somewhere or obtain a copy.
I am assuming that you never messed around with regular expressions in C#, hence I provided the code above.
Hope this helps.
Thank you, all..
[\w\s,]* works
Let me go through regular-expressions.info and come back if I need further support.
Let me try the C# code approach and come back if I need further support.
[This forum is awesome. Quality replies so quick..]
Thanks again
(…) denotes a group, not a character set; a character set is denoted with […]. So try this:
^[a-zA-Z0-9,]*$
This will only allow alphanumeric characters and the comma.
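A minimal sketch of checking input against this pattern in C# (the sample strings are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

// ^ and $ force the whole input to consist of allowed characters.
var allowed = new Regex(@"^[a-zA-Z0-9,]*$");

Console.WriteLine(allowed.IsMatch("abc,123")); // True
Console.WriteLine(allowed.IsMatch("abc#123")); // False
Console.WriteLine(allowed.IsMatch(""));        // True: '*' also accepts empty input
```

If empty input should be rejected, replacing * with + is one option.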

Regex with < and >

OK, I have a file that may or may not contain newlines or carriage returns; frankly, I need to ignore that. I need to search the document, find all the < and matching > tags, and remove everything inside them. I've been trying to get this to work for a bit. My current regex is:
private Regex BracketBlockRegex = new Regex("<.*>", RegexOptions.Singleline);
....
resultstring = BracketBlockRegex.Replace(filecontents, "");
But this doesn't seem to be working, because it catches WAY too much. Any clues? Is there something weird with the < and > symbols in C#?
Replace
<.*>
with
<.*?>
Try a non-greedy variant of your regex:
<[^>]*>
What you have, <.*>, will match the first < followed by everything up to the last >, whereas what you want is to match to the first one.
Quantifiers in regular expressions are greedy by default, and you've got a period, which equates to ANYTHING, which just so happens to include the greater-than and less-than characters.
Try this...
<[^<>]*>
Arguably the best Regular Expression resource on the Internet.
Try:
private Regex BracketBlockRegex = new Regex("<.*?>", RegexOptions.Singleline);
Note that you may need to add some parsing qualifiers about how to interpret the source data.
An HTML tag can be split up at white space onto different lines.
<IMG
SRC="blah.jpg"
ALT="blah">
Some regular expression parsers may, or may not, match . to \r or \n depending on settings.
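A small sketch comparing the three patterns suggested above on a made-up two-line input (the `input` string is an assumption for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input with tags on two lines separated by CRLF.
string input = "a <b>bold</b>\r\nand <i>italic</i> z";

// Greedy: runs from the first '<' to the last '>' in the whole text.
string greedy = Regex.Replace(input, "<.*>", "", RegexOptions.Singleline);

// Lazy: stops at the first '>' after each '<'.
string lazy = Regex.Replace(input, "<.*?>", "", RegexOptions.Singleline);

// Negated class: cannot cross a '>' and matches newlines without Singleline.
string negated = Regex.Replace(input, "<[^>]*>", "");

Console.WriteLine(greedy == "a  z");                   // True
Console.WriteLine(lazy == "a bold\r\nand italic z");   // True
Console.WriteLine(negated == lazy);                    // True
```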
