Regular expression not capturing matches in the middle of a string

Regular expression not capturing matches in the middle of a string - c#

The regular expression I'm starting with is:
^(((http|ftp|https|www)://)?([\w+?.\w+])+([a-zA-Z0-9\~!\##\$\%\^\&*()_-\=+\/\?.\:\;\'\,]*)?)$
I'm using this to find URLs in the middle of user-supplied text and replace it with a hyperlink. This works fine and matches the following:
http://www.google.com
www.google.com
google.com
www.google.com?id=5
etc...
However, it doesn't find a match if there is any text on either side of it (kind of defeats the purpose of what I'm doing). :)
No match:
Go to www.google.com
www.google.com is the best.
I go to www.google.com all the time.
etc...
How can I change this so that it will match no matter where in the string it appears? I'm terrible with regular expressions...

You have a bug in your original regex. The square brackets make \w+?\.\w+ a character class:
(((http|ftp|https|www)://)?([\w+?\.\w+])+([a-zA-Z0-9\~\!\#\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?)
^ ^
After removing them (and the anchors ^ and $), your regex will not match obvious non-URLs.
I suggest using http://regexpal.com/ for testing regexes, as it has syntax highlighting within the regex.

i think you should use a positive look ahead, that is going to search for a given url to first of all check two possibilities, either is at the beginning or at the middile of the whole string.
but you should you use something like ^((?=url)?|.?(?=url).*?$))
that is just the beginning , i am not giving you an answer, just an idea.
i would do it, but at the moment i am lazy and your regex looks for a 20 minutes analisis.
stackoverflow erase some things of my example

Related

Regular expression to match specific filename pattern containing underscores

I'm trying to create a regular expression that would match files of this pattern:
Id_Name_processID_timestamp_logName.txt
Example of filename: abcd_Service_11234_15112013_Log.txt
I don't need perfect matching something that would match anything_anything_anything_anything_anything.txt would work for me.
I haven't tried anything just lost time starring at this Regex Tutorial for quite a long time, i don t know where to start :(.

Go to this site: http://regexpal.com/
Put abcd_Service_11234_15112013_Log.txt in the lower box.
Start writing your rexex on the top box, until it matches (it's a simple one, really, chars, underscore, rinse and repeat) ... You'll be ok ...

My regex, a short simple one.
^\w+_\w+.txt
Edit:
I do agree with the 1st answer: You really need to try something on your own but that website must be the least userfriendly page on regex. You get my answer out of sympathy ;)

Converting wildcard pattern to regular expression

I am new to regular expressions. Recently I was presented with a task to convert a wildcard pattern to regular expression. This will be used to check if a file path matches the regex.
For example if my pattern is *.jpg;*.png;*.bmp
I was able to generate the regex by spliting on semicolons, escaping the string and replaceing the escaped * with .*
String regex = "((?i)" + Regex.Escape(extension).Replace("\\*", ".*") + "$)";
So my resulting regex will be for jpg ((?i).*\.jpg)$)
Thien I combine all my extensions using the OR operator.
Thus my final expression for this example will be:
((?i).*\.jpg)$)|((?i).*\.png)$)|((?i).*\.bmp)$)
I have tested it and it worked yet I am not sure if I should add or remove any expression to cover other cases or is there a better format the whole thing
Also bear in mind that I can encounter a wildcard like *myfile.jpg where it should match all files whose names end with myfile.jpg
I can encounter patterns like *myfile.jpg;*.png;*.bmp

There's a lot of grouping going on there which isn't really needed... well unless there's something you haven't mentioned this regex would do the same for less:
/.*\.(jpg|png|bmp)$/i
That's in regex notation, in C# that would be:
String regex=new RegEx(#".*\.(jpg|png|bmp)$",RegexOptions.IgnoreCase);
If you have to programatically translate between the two, you've started on the right track - split by semicolon, group your extensions into the set (without the preceding dot). If your wildcard patterns can be more complicated (extensions with wildcards, multi-wildcard starting matches) it might need a bit more work ;)
Edit: (For your update)
If the wild cards can be more complicated, then you're almost there. There's an optimization in my above code that pulls the dot out (for extension) which has to be put back in so you'd end up with:
/.*(myfile\.jpg|\.png|\.bmp)$/i
Basically '*' -> '.*', '.' -> '\.'(gets escaped), rest goes into the set. Basically it says match anything ending (the dollar sign anchors to the end) in myfile.jpg, .png or .bmp.

How to use non capture groups with the "or" character in regular expressions

So basically I have this giant regular expression pattern, and somewhere in the middle of it is the expression (?:\s(\d\d\d)|(\d\d\d\d)). At this part of the parse I'm wanting to capture either 3 digits that follows a space or 4 digits, but I don't want the capture that comes from using the parenthesis around the whole thing (doesn't ?: make something non-capture). I have to use parenthesis so that the "or" logic works (I think).
So potential example inputs would be something like...
input1= giantexpression 123more characters after
input2= giantexpression1234blahblahblah
I tried (?:\s(\d\d\d)|(\d\d\d\d)) and it gave an extra capture at least in the case where I have 4 digits. So am I doing this right or am I messed up somewhere?
Edit:
To go into more detail... here's the current regular expression I'm working with.
pattern = #".?(\d{1,2})\s*(\w{2}).?.?.?(?:\s(\d\d\d)|(\d\d\d\d)).*"
There's a bit of parsing I have to do at the beginning. I think Sean Johnson's answer would still work because I wouldn't need to use "or". But is there a way to do it in which you DO use "or"? I think eventually I'll need that capability.

This should work:
(?:\s(\d{3,4}))
If you aren't doing any logic on that subpattern, you don't even need the parenthesis surrounding it if all you want to do is capture the digits. The following pattern:
\s(\d{3,4})
will capture three or four digits directly following a space character.

Designing Regular Expressions in C#

I've run into a bit of an issue designing RegEx in C#. I have to parse a text document that has multiple urls embedded in it, and I have to extract those
...url=http://www.cnn.com?id=abc,def&system=2&mode=2&quality=ade,url=http://www.bbc.com...
(^ I've added ellipsis to show that its part of content, ... won't actually be in the text)
The begining part is easy as I can start regex with 'url=', however, I can't come up with a way of ending the match
RegEx = (?<IgnoreFirst>[,]url=)(?<Url>[^,]+)
This regex stop at first comma - so just after 'abc' and doesn't return the entire url
RegEx = (?<IgnoreFirst>[,]url=)(?<Url>[^,]+)(?<IgnoreSecond>url)
This doesn't work either because the match stops at first comma and then looks for 'url', which it couldn't find. From some of the reading I've done it seems like its an issue of backtracking etc, so if anyone can help me out with the correct regex, that'd be great!
PS. while we're at it, if I wanted to extract url just before &quality, how would I do that?

How about using something like this:
RegEx = url=(?<Url>.+?)(?=,url|$)
The lookahead at the end will force matching to stop either at the next ",url" or at the end of the string or line.

Regular Expression to reject special characters other than commas

I am working in asp.net. I am using Regular Expression Validator
Could you please help me in creating a regular expression for not allowing special characters other than comma. Comma has to be allowed.
I checked in regexlib, however I could not find a match. I treid with ^(a-z|A-Z|0-9)*[^#$%^&*()']*$ . When I add other characters as invalid, it does not work.
Also could you please suggest me a place where I can find a good resource of regular expressions? regexlib seems to be big; but any other place which lists very limited but most used examples?
Also, can I create expressions using C# code? Any articles for that?

[\w\s,]+
works fine, as you can see bellow.
RegExr is a great place to test your regular expressions with real time results, it also comes with a very complete list of common expressions.
[] character class \w Matches any word character (alphanumeric & underscore). \s
Matches any whitespace character (spaces, tabs, line breaks). , include comma + is greedy match; which will match the previous 1 or more times.

[\d\w\s,]*
Just a guess

To answer on any articles, I got started here, find it to be an excellent resource:
http://www.regular-expressions.info/
For your current problem, try something like this:
[\w\s,]*
Here's a breakdown:
Match a single character present in the list below «[\w\s,]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
A word character (letters, digits, etc.) «\w»
A whitespace character (spaces, tabs, line breaks, etc.) «\s»
The character “,” «,»

For a single character that is not a comma, [^,] should work perfectly fine.

You can try [\w\s,] regular expression. This regex will match only alpha-numeric characters and comma. If any other character appears within text, then this wont match.
For your second question regarding regular expression resource, you can goto
http://www.regular-expressions.info/
This website has lot of tutorials on regex, plus it has lot of usefult information.
Also, can I create expressions using
C# code? Any articles for that?
By this, do you mean to say you want to know which class and methods for regular expression execution? Or you want tool that will create regular expression for you?

You can create expressions with C#, something like this usually does the trick:
Regex regex = new Regex(#"^[a-z | 0-9 | /,]*$", RegexOptions.IgnoreCase);
System.Console.Write("Enter Text");
String s = System.Console.ReadLine();
Match match = regex.Match(s);
if (match.Success == true)
{
System.Console.WriteLine("True");
}
else
{
System.Console.WriteLine("False");
}
System.Console.ReadLine();
You need to import the System.Text.RegularExpressions;
The regular expression above, accepts only numbers, letters (both upper and lower case) and the comma.
For a small introduction to Regular Expressions, I think that the book for MCTS 70-536 can be of a big help, I am pretty sure that you can either download it from somewhere or obtain a copy.
I am assuming that you never messed around with regular expressions in C#, hence I provided the code above.
Hope this helps.

Thank you, all..
[\w\s,]* works
Let me go through regular-expressions.info and come back if I need further support.
Let me try the C# code approach and come back if I need further support.
[This forum is awesome. Quality replies so qucik..]
Thanks again

(…) is denoting a grouping and not a character set that’s denoted with […]. So try this:
^[a-zA-Z0-9,]*$
This will only allow alphanumeric characters and the comma.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Regular expression not capturing matches in the middle of a string - c#

Related

Regular expression to match specific filename pattern containing underscores

Converting wildcard pattern to regular expression

How to use non capture groups with the "or" character in regular expressions

Designing Regular Expressions in C#

Regular Expression to reject special characters other than commas

Categories

Resources