Why does changing this regex class to .+ not provide any match? - c#

If I use this
string showPattern = @"return new_lightox\(this\);"">[a-zA-Z0-9(\s),!\?\-:'&%]+</a>";
MatchCollection showMatches = Regex.Matches(pageSource, showPattern);
I get some matches, but I want to get rid of [a-zA-Z0-9(\s),!\?\-:'&%]+ and use "any character" (.+) instead.
But if I do this, I get no match at all.
What am I doing wrong?

By default "." does not match newlines, but the class \s does.

To let . match newline, turn on SingleLine/DOTALL mode - either using a flag in the function call (as Abel's answer shows), or using the inline modifier (?s), like this for the whole expression:
"(?s)return new_lightox\(this\);"">.+</a>"
Or for just the specific part of it:
"return new_lightox\(this\);"">(?s:.+)</a>"
It might be better to take that a step further and check the lookahead before every character, like this:
"return new_lightox\(this\);"">(?s:(?:(?!</?a).)+)</a>"
Which should prevent the closing </a> from belonging to a different link.
However, you need to be very wary here - it's not clear what you're doing overall, but regex is not a good tool for parsing HTML with, and can cause all sorts of problems. Look at using a HTML DOM parser instead, such as HtmlAgilityPack.
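The effect of the single-line (DOTALL) mode described above can be reproduced with Python's re module, which behaves the same way on this point. This is only a sketch: the page_source fragment is a made-up example, not the asker's real page.

```python
import re

# Hypothetical page fragment: the link text is split across two lines.
page_source = '<a onclick="return new_lightox(this);">Some\nShow</a>'

prefix = r'return new_lightox\(this\);">'

# Without DOTALL, "." stops at the newline, so nothing matches.
no_flag = re.findall(prefix + r'(.+)</a>', page_source)

# With DOTALL (as a flag or as the inline (?s) modifier), "." also matches "\n".
with_flag = re.findall(prefix + r'(.+)</a>', page_source, re.DOTALL)
inline = re.findall(r'(?s)' + prefix + r'(.+)</a>', page_source)

print(no_flag)    # []
print(with_flag)  # ['Some\nShow']
print(inline)     # ['Some\nShow']
```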

You're matching a tag, so you probably want something along these lines, instead of .+:
string showPattern = @"return new_lightox\(this\);"">[^<]+</a>";
The reason that the match doesn't hit is possibly that you are missing the Singleline flag and the closing tag is on the next line. In other words, this should work too:
// The Singleline option changes the dot (.) to match newlines too
MatchCollection showMatches = Regex.Matches(
    pageSource,
    showPattern,
    RegexOptions.Singleline);
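For comparison, a negated character class such as [^<]+ needs no flag at all, because it matches newlines by definition. A small Python sketch (the page_source fragment is an assumed example):

```python
import re

# Hypothetical page fragment with a line break inside the link text.
page_source = '<a onclick="return new_lightox(this);">Some\nShow</a>'

# [^<]+ matches any run of characters other than "<", newlines included.
pattern = r'return new_lightox\(this\);">([^<]+)</a>'
titles = re.findall(pattern, page_source)

print(titles)  # ['Some\nShow']
```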

Related

Get what is in string from one quotation mark to other [duplicate]

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?
In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print(re.findall(r'"(.*?)"', string))
['Foo Bar', 'Another Value']
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
(["']) match a quote; ((?=(\\?))\2.) if a backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, so as not to eat the closing quote); \1 match the same quote that was used for opening.
I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
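One practical difference between the two patterns is worth seeing: without a single-line flag, the dot does not match a newline, while the negated class does. A short Python sketch with an assumed input string:

```python
import re

# Hypothetical input: the first quoted value contains a line break.
text = '"Foo\nBar" "Baz"'

# The lazy dot cannot cross the newline, so the quotes get paired up wrongly.
lazy_dot = re.findall(r'"(.*?)"', text)

# [^"]* matches anything that is not a double quote, including "\n".
negated_class = re.findall(r'"([^"]*)"', text)

print(lazy_dot)       # [' ']
print(negated_class)  # ['Foo\nBar', 'Baz']
```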
Let's see two efficient ways that deal with escaped quotes. These patterns are not designed to be concise or pretty, but to be efficient.
These ways use first-character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to quickly discard characters that are not quotes without testing the two branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
To deal with strings that don't have balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround that emulates them, to prevent too much backtracking. You can also decide that a quoted part runs from an opening quote to the next (non-escaped) quote or to the end of the string. In that case there is no need for possessive quantifiers; you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
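As a rough illustration, the backslash-escaped variant of the unrolled content subpattern can be wrapped in double quotes and tried in Python. The sample text is an assumption, and re.DOTALL is used so that an escaped newline would also be accepted.

```python
import re

# Opening quote, unrolled loop for the content, closing quote.
DOUBLE_QUOTED = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"', re.DOTALL)

# Hypothetical input containing escaped quotes inside a quoted string.
text = r'say "a \"quoted\" word" and "plain"'

matches = DOUBLE_QUOTED.findall(text)
print(matches)  # ['"a \\"quoted\\" word"', '"plain"']
```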
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation, but with ["'] factored out at the beginning.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(Note that (?s:...) is syntactic sugar to switch on dotall/singleline mode inside the non-capturing group. If this syntax is not supported, you can easily switch the mode on for the whole pattern or replace the dot with [\s\S].)
(The way this pattern is written is totally "hand-driven" and doesn't take into account any internal engine optimizations.)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'
Peculiarly, none of these answers produces a regex where the returned match is the text inside the quotes, which is what was asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be:
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the positive lookbehind at the start (the ?<=) and the positive lookahead at the end (the ?=). The lookbehind looks behind the current character to check for a quote and, if one is found, starts from there; the lookahead then checks the character ahead for a quote and, if one is found, stops on that character. The lookbehind group (the ["']) is wrapped in brackets to create a group for whichever quote was found at the start; this is then used in the lookahead at the end, (?=\1), to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.
The regex in the accepted answer returns the values including their surrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are regexes which return only the values between the quotation marks (as the questioner asked for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
All support escaped and nested quotes.
I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string from $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
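A quick way to try the grouped variant is shown below in Python. This is a sketch under assumptions: the sample text is invented, and group 2 is read from each match as the quoted content.

```python
import re

# Group 1 is the opening quote, group 2 the content between the quotes.
QUOTED = re.compile(r'''(['"])((?:(?!\1|\\).|\\.)*)\1''')

# Hypothetical input with escaped quotes of both kinds.
text = r'''"Foo \"Bar\"" 'it\'s' plain'''

contents = [m.group(2) for m in QUOTED.finditer(text)]
print(contents)
```

The escape sequences stay in the extracted content as written; unescaping them would be a separate step.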
A very late answer, but I'd still like to contribute:
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1
The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job, but I am concerned about its performance (it's not bad, but it could be better). Mine below is ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this: DON'T USE IT!
For instance, it cannot capture many strings (if needed I can provide an exhaustive test case), like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.
This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
MORE ANSWERS! Here is the solution I used:
\"([^\"]*?icon[^\"]*?)\"
TL;DR:
Replace the word icon with what you're looking for inside the quotes and voilà!
This works by looking for the keyword without caring what else is between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
The regex looks for a quotation mark ",
then for any run of characters that are not ",
until it finds icon,
then any run of characters that are not ",
and finally a closing ".
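The walkthrough above can be checked with a few lines of Python. The input string reuses the three ids from the example plus one assumed non-matching attribute.

```python
import re

# Quoted values that contain the keyword "icon" somewhere inside.
pattern = r'"([^"]*?icon[^"]*?)"'

# Hypothetical markup: the last attribute has no "icon" in it.
markup = 'id="fb-icon" id="icon-close" id="large-icon-close" class="header"'

hits = re.findall(pattern, markup)
print(hits)  # ['fb-icon', 'icon-close', 'large-icon-close']
```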
I liked Axeman's more expansive version, but had some trouble with it. For example, it didn't match
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly. So I tried to fix it:
# opening quote
(["'])
(
    # repeat (non-greedy, so we don't span multiple strings)
    (?:
        # anything, except not the opening quote, and not
        # a backslash, which are handled separately.
        (?!\1)[^\\]
    |
        # consume any double backslash (unnecessary?)
        (?:\\\\)*
    |
        # Allow backslash to escape characters
        \\.
    )*?
)
# same character as opening quote
\1
import re
string = "\" foo bar\" \"loloo\""
print(re.findall(r'"(.*?)"', string))
Just try this out; it works like a charm!
The backslash (\) is the escape character.
Unlike Adam's answer, I have a simple one that works:
(["'])(?:\\\1|.)*?\1
And just add parentheses if you want to get the content inside the quotes, like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.
All the answers above are good... except that they do NOT support all Unicode characters in ECMAScript (JavaScript)!
If you are a Node user, you might want this modified version of the accepted answer, which supports all Unicode characters:
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.
My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["']) -> Matches either ' or " and stores it in backreference \1 once a match is found.
.* -> Greedily matches everything (zero or more characters) up to the end of the line; the engine then backtracks one character at a time until the rest of the pattern can match.
\1 -> Matches the same character that was matched by the first capture group.
(?![^\s]) -> Negative lookahead to ensure that no non-space character follows the previous match.
echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I show the result strings between > and < for clarity. Using the non-greedy equivalent ([^"]*), this sed command first throws out the junk before and after the quoted part, then replaces the whole thing with the part between the quotes, surrounded by > and <.
From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match; no partial match should trigger a hit,
e.g. "test" could not match for "test2".
import re
reg = r"""(['"])(%s)\1"""
if re.search(reg % (needle), haystack, re.IGNORECASE):
    print("winning...")
Hunter
If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?)\"\.localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
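The same idea in Python, as a sketch: the dot before the suffix is escaped so it only matches a literal ".", and the source line is the example from above held in a string.

```python
import re

# Quoted strings that are immediately followed by the literal suffix ".localized".
pattern = r'"([^"]*)"\.localized'

source = ('print("this is something I need to return".localized + '
          '"so is this".localized + "but this is not")')

localized = re.findall(pattern, source)
print(localized)  # ['this is something I need to return', 'so is this']
```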
A supplementary answer for Microsoft VBA coders only: one can use the Microsoft VBScript Regular Expressions 5.5 library, which gives the following code:
Sub TestRegularExpression()
    Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
    Set oRE = New VBScript_RegExp_55.RegExp
    oRE.Pattern = """([^""]*)"""
    oRE.Global = True
    Dim sTest As String
    sTest = """Foo Bar"" ""Another Value"" something else"
    Debug.Assert oRE.test(sTest)
    Dim oMatchCol As VBScript_RegExp_55.MatchCollection
    Set oMatchCol = oRE.Execute(sTest)
    Debug.Assert oMatchCol.Count = 2
    Dim oMatch As Match
    For Each oMatch In oMatchCol
        Debug.Print oMatch.SubMatches(0)
    Next oMatch
End Sub

c# regex to match specific text

I'm looking to match all text in the format foo:12345 that is not contained within an HTML anchor. For example, I'd like to match lines 1 and 3 from the following:
foo:123456
foo:123456
foo:123456
I've tried these regexes with no success:
Negative lookahead attempt ( incorrectly matches, but doesn't include the last digit )
foo:(\d+)(?!</a>)
Negative lookahead with non-capturing grouping
(?:foo:(\d+))(?!</a>)
Negative lookbehind attempt ( wildcards don't seem to be supported )
(?<!<a[^>]>)foo:(\d+)
If you want to start analysing HTML like this, then you probably want to actually parse the HTML instead of using regular expressions. The HTML Agility Pack is the usual first port of call. With regular expressions it becomes hard to deal with things like <a></a>foo:123456<a></a>, which of course should pull out the middle bit, but it's extremely hard to write a regex that will do that.
I should add that I am assuming that you do in fact have a block of HTML rather than just individual short strings like each of your lines above. I partly ruled that out because matching it when it is the only thing on the line is pretty easy, so I figured you'd have got it if you wanted that. :)
Regex is usually not the best tool for the job, but if your case is very specific like in your example you could use:
foo:((?>\d+))(?!</a>)
Your first expression didn't work because \d+ would backtrack till (?!</a>) matches. This can be fixed by not allowing \d+ to backtrack, as above with help of an atomic/nonbacktracking group, or you could also make the lookahead fail in case \d+ backtracks, like:
foo:(\d+)(?!</a>|\d)
Although that is not as efficient.
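The backtracking problem is easy to reproduce. The Python sketch below uses the lookahead-based variant (a lookahead that also rejects a following digit), since that form works in every Python version; the html string is an assumed example.

```python
import re

# Hypothetical input: the middle foo is inside an anchor.
html = 'foo:111 <a href="#">foo:222</a> foo:333'

# Naive lookahead: \d+ backtracks to "22", after which "</a>" no longer follows.
naive = re.findall(r'foo:(\d+)(?!</a>)', html)

# Rejecting a following digit as well makes every backtracked position fail.
fixed = re.findall(r'foo:(\d+)(?!</a>|\d)', html)

print(naive)  # ['111', '22', '333']
print(fixed)  # ['111', '333']
```

Note that this only checks for a closing tag directly after the digits; it does not know whether the text is really inside an anchor.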
Note that in many regex engines a lookbehind cannot contain a variable-length pattern (.NET is an exception), so you may want to approach this differently,
for example:
Find and mark all the foo-s that are contained in an anchor
Find all the others and do what you need with them
Remove the marks
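The three steps could look roughly like this in Python. Everything here is an assumption made for illustration: the NUL character used as the mark, the simple anchor pattern, and the sample html. It is not a robust HTML parser.

```python
import re

html = 'foo:111 <a href="#">foo:222</a> foo:333'

# Assumed mark: a NUL character that should never occur in the real text.
MARKED = 'foo\x00:'
ANCHOR = re.compile(r'<a\b[^>]*>.*?</a>', re.DOTALL | re.IGNORECASE)

# Step 1: mark every foo: that sits inside an anchor element.
marked = ANCHOR.sub(lambda m: m.group(0).replace('foo:', MARKED), html)

# Step 2: work with the unmarked occurrences only.
found = re.findall(r'foo:(\d+)', marked)

# Step 3: remove the marks again.
restored = marked.replace(MARKED, 'foo:')

print(found)  # ['111', '333']
```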
This is probably a long-winded way of doing it, but you could simply find all occurrences of foo: followed by digits and then exclude some of them afterwards.
string pattern = @"foo:\d+ |" +
                 @"foo:\d+[<]";
Then use a MatchCollection:
MatchCollection m0 = Regex.Matches(file, pattern, RegexOptions.Singleline);
Then loop through each occurrence:
foreach (Match m in m0)
{
    // ... exclude the matches that contain the "<"
}
I would use LINQ and treat the HTML like XML, for example:
var query = MyHtml.Descendants().ToArray();
foreach (XElement result in query)
{
    if (Regex.IsMatch(result.Value, @"foo:123456") && result.Name.ToString() != "a")
    {
        //do something...
    }
}
Perhaps there's a better way, but I don't know it... this seems pretty straightforward to me :P

regex c# optional group - should act greedy?

having regex ~like this:
blablabla.+?(?:<a href="(http://.+?)" target="_blank">)?
I want to capture an url if I find one... finds stuff but I don't get the link (capture is always empty). Now if I remove the question mark at the end like this
blablabla.+?(?:<a href="(http://.+?)" target="_blank">)
This will only match stuff that has the link at the end... it's 2.40 am... and I've got no ideas...
--Edit--
sample input:
blablabla asd 1234t535 <a href="http://google.com" target="_blank">
expected output:
match 0:
group 1: <a href="http://google.com" target="_blank">
group 2: http://google.com
I just want "http://google.com" or ""
Are you doing a whole-string match? If so, try adding .* to the end of the first regex and see what it matches. The problem with the first regex is that it can match anything after blablabla because of the .+? (leading to an empty capture), but the parenthesized part still won't match an a tag unless it's at the end of the string. By the way, looking at your expected output, capture 1 will be the URL; the parentheses around the whole HTML tag are non-capturing because of the ?: at the beginning.
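The empty capture can be reproduced in Python, which treats lazy quantifiers and optional groups the same way here. The second pattern is only one possible rearrangement, not taken from the answers: it moves the lazy part inside the optional group so that the engine tries to reach the link before giving up on it.

```python
import re

sample = 'blablabla asd 1234t535 <a href="http://google.com" target="_blank">'

# Original: .+? is satisfied by a single character, then the optional group matches nothing.
original = r'blablabla.+?(?:<a href="(http://.+?)" target="_blank">)?'

# Assumed rearrangement: the filler and the tag are optional together.
rearranged = r'blablabla(?:.*?<a href="(http://.+?)" target="_blank">)?'

print(re.search(original, sample).group(1))    # None
print(re.search(rearranged, sample).group(1))  # http://google.com

# With no link present the group is None, which can be mapped to "".
url = re.search(rearranged, 'blablabla no link here').group(1) or ""
print(repr(url))  # ''
```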
you shouldn't need .+? at the start, the regex is going to search the whole input anyway
you also have the closing '>' right after blank which will limit your matches
(?:<a href="(http://.+?)" target="_blank".*?>)
regex test
It's the trailing ? that's doing you in. Reason: By marking it as optional, you're allowing the .+? to grab it.
blablabla.*(?:<a href="((http://)?.*)".+target="_blank".*>)
I modified it slightly... .+? is basically the same as .*, and if you may have nothing in your href (you indicated you wanted ""), you need to make the http optional as well as the trailing text. Also, .+ in front of target means you have at least one space or character, but may have more (multiple blanks or other attributes). .* before the > means you can have blanks or other attributes trailing after.
This will not match a line at all if there's no <a href...>, but that's what you want, right?
The (?: ... ) can be dropped completely, if you don't need to capture the whole <a href...> portion.
This will fail if the attributes are not listed in the order specified... which is one of the reasons regex can't really be used to parse html. But if you're certain the href will always come before the target, this should do what you need.

What is wrong with my regex (simple)?

I am trying to make a regex that matches all occurrences of words that are at the start of a line and begin with #.
For example in:
#region #like
#hey
It would match #region and #hey.
This is what I have right now:
^#\w*
I apologize for posting this question. I'm sure it has a very simple answer, but I have been unable to find it. I admit that I am a regex noob.
What you've got should work, depending on what flags you pass for RegexOptions. You need to make sure you pass RegexOptions.Multiline:
var matches = Regex.Matches(input, @"^#\w*", RegexOptions.Multiline);
See the documentation I linked to above:
Multiline Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.
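Python's re.MULTILINE flag changes ^ in the same way, so the difference can be seen with the example from the question:

```python
import re

text = "#region #like\n#hey"

# Without the flag, ^ only matches at the very start of the string.
default_mode = re.findall(r'^#\w*', text)

# With MULTILINE, ^ also matches right after every newline.
multiline_mode = re.findall(r'^#\w*', text, re.MULTILINE)

print(default_mode)    # ['#region']
print(multiline_mode)  # ['#region', '#hey']
```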
The regex looks fine. Make sure you're using a verbatim string literal (@ prefix) to define your regex, i.e. @"^#\w*"; otherwise the backslash will be treated as the start of an escape sequence.
Use this regex
^#.+?\b
.+ ensures at least one character after #, and \b indicates a word boundary. The ? makes the + operator non-greedy, so as to avoid matching the whole string #region #like.

How to perform link rewriting with a .NET Regex?

I need to use C# Regex to do a link rewrite for html pages and I need to replace the links enclosed with quotes (") with my own ones. Say for example, I need to replace the following
"slashdot.org/index.rss"
into
"MY_OWN_LINK"
However, the actual link can be of the form
"//slashdot.org/index.rss" or
"/slashdot.org/index.rss"
where there can be other values that comes before "slashdot.org/index.rss" but after the quote (") which I don't care about.
To summarize, as long as the link ends with "slashdot.org/index.rss", I would want to replace the entire link with "MY_OWN_LINK".
How can I use Regex.Replace for the above?
edit: updated answer according to comment.
First, you don't have to use a regular expression for this job. Just check whether the string ends with "slashdot.org/index.rss" and, if it does, replace the entire string.
If you do use a regular expression, you'd better just test whether the string ends with
"slashdot.org/index.rss" and act accordingly, like so:
if (Regex.IsMatch(str, @"slashdot\.org/index\.rss$")) { str = new_str; }
If you insist on using Regex.Replace, go for
Regex.Replace(str, @"^.*slashdot\.org/index\.rss$", "MY_OWN_LINK");
where ^ and $ stand for the beginning and end of the string, respectively. The leading .* means "match the start of the URL, whatever it is". The dots in the domain and file name are preceded by a backslash, because an unescaped dot means "any character".
For additional info, see this cheat sheet of regular expression in C#.
Try this. Will work with no slash, single and two slashes.
string pattern = @"[/]{0,2}slashdot\.org[/]{0,2}index\.rss";
test1 = Regex.Replace(test1, pattern, "MY_OWN_LINK");
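If the links sit inside a larger HTML string rather than each in its own string, one option is to match the whole quoted value and replace it in place. The Python sketch below assumes the links are always wrapped in double quotes; the html sample is made up.

```python
import re

# A double-quoted value that ends with slashdot.org/index.rss, whatever precedes it.
LINK = re.compile(r'"[^"]*slashdot\.org/index\.rss"')

# Hypothetical attributes: three variants of the link plus one unrelated value.
html = ('a="slashdot.org/index.rss" b="//slashdot.org/index.rss" '
        'c="/slashdot.org/index.rss" d="example.org/feed"')

rewritten = LINK.sub('"MY_OWN_LINK"', html)
print(rewritten)
```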
