Regex Match Failure Parsing HTML Nodes - c#

I have a string:
<graphic id="8374932">Translating Cowl (Inner/Outer Bondments</graphic>
And my pattern:
"<graphic id=\"(.*?)\">(.*?)</graphic>"
But it fails for second group, saying: "Not enough )'s." How should I prevent it?

EDIT: First off, if your goal is to parse HTML or XML with regex, I strongly advise against it. If your goal is to learn, or to surgically grab a single element node, then regex may (and I say may) be a tool to use. I am answering this on the assumption that you are using the HTML pattern to learn from.
I believe you have confused your data with your pattern and the regex pattern is failing.
I recommend these things
Don't use .*? to get text. It is too nebulous for the regex parser. Be more specific in your pattern.
Since you know that the text is enclosed in quotes or by >xxx<, use those as anchors.
Once the anchors are determined, extract the text.
Place captured text into named capture groups.
How to get the text? Tell the regex parser to get everything that is not an anchor character by using a negated set, where ^ means "not" when it is the first character inside [ ]. For example, ([^\"]+) says match everything that is not a quote.
Change your pattern to this which demonstrates the above suggestions:
string data = @"<graphic id=""8374932"">Translating Cowl (Inner/Outer Bondments</graphic>";
// \x22 is the hex escape for the quote, makes it easier to read.
string pattern = @"
(?:graphic\s+id=\x22) # Match but don't capture (MBDC) the beginning of the element
(?<ID>[^\x22]+) # Get all that is not a quote
(?:\x22>) # MBDC the quote
(?<Content>[^<+]+) # Place into the Content match capture group all text that is not + or <
(?:\</graphic) # MBDC The graphic";
// Ignore Pattern whitespace only allows us to comment, does not influence regex processing.
var mt = Regex.Match(data, pattern, RegexOptions.IgnorePatternWhitespace);
Console.WriteLine ("ID: {0} Content: {1}", mt.Groups["ID"], mt.Groups["Content"]);
// Outputs:
// ID: 8374932 Content: Translating Cowl (Inner/Outer Bondments
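The diagnosis above (data confused with pattern) can be reproduced with a small sketch. It assumes the error came from passing the arguments to Regex.Match in the wrong order, which the question does not show, so treat it as one plausible cause rather than the confirmed one. The exact wording of the exception message differs between runtimes.

```csharp
using System;
using System.Text.RegularExpressions;

string data = @"<graphic id=""8374932"">Translating Cowl (Inner/Outer Bondments</graphic>";
string pattern = "<graphic id=\"(.*?)\">(.*?)</graphic>";

// Correct order: input first, pattern second. The question's pattern matches.
Match ok = Regex.Match(data, pattern);
Console.WriteLine(ok.Groups[2].Value);

// Swapped order (assumed mistake): the data is parsed as a pattern,
// and its lone "(" opens a group that is never closed.
bool threw = false;
try
{
    Regex.Match(pattern, data);
}
catch (ArgumentException ex)
{
    threw = true;
    Console.WriteLine(ex.Message);
}
```

If the swapped call is the culprit, the pattern itself needs no change to get rid of the error.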

Related

RegEx expression works on regex101 but not in C# [duplicate]

https://regex101.com/r/sB9wW6/1
(?:(?<=\s)|^)#(\S+) <-- the problem in positive lookbehind
Working like this on prod: (?:\s|^)#(\S+), but I need a correct start index (without space).
Here is in JS:
var regex = new RegExp(/(?:(?<=\s)|^)#(\S+)/g);
Error parsing regular expression: Invalid regular expression:
/(?:(?<=\s)|^)#(\S+)/
What am I doing wrong?
UPDATE
Ok, no lookbehind in JS :(
But anyways, I need a regex to get the proper start and end index of my match. Without leading space.
Make sure you always select the right regex engine at regex101.com. See an issue that occurred due to using a JS-only compatible regex with [^] construct in Python.
JS regex, at the time of answering this question, did not support lookbehinds. They were introduced in ECMAScript 2018 and have since been adopted by more and more environments. You do not really need one here since you can use capturing groups:
var re = /(?:\s|^)#(\S+)/g;
var str = 's #vln1\n#vln2\n';
var res = [];
var m; // declared explicitly so the loop does not create an implicit global
while ((m = re.exec(str)) !== null) {
  res.push(m[1]); // Group 1: the text after #
}
console.log(res);
The (?:\s|^)#(\S+) matches a whitespace or the start of string with (?:\s|^), then matches #, and then matches and captures into Group 1 one or more non-whitespace chars with (\S+).
To get the start/end indices, use
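var re = /(\s|^)#\S+/g;
var str = 's #vln1\n#vln2\n';
var pos = [];
var m; // declared explicitly so the loop does not create an implicit global
while ((m = re.exec(str)) !== null) {
  // Skip the captured leading whitespace (if any) to get the index of #
  pos.push([m.index + m[1].length, m.index + m[0].length]);
}
console.log(pos);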
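For comparison, the .NET engine does support lookbehind, so in C# the original pattern from the question could be kept and the indices read straight off each match. This is a minimal sketch using the same test string as the JS code above; the tuple list `pos` is just an illustrative container.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string str = "s #vln1\n#vln2\n";
var pos = new List<(int Start, int End)>();

// The lookbehind checks for whitespace without consuming it,
// so Match.Index already points at the '#'.
foreach (Match m in Regex.Matches(str, @"(?:(?<=\s)|^)#(\S+)"))
{
    pos.Add((m.Index, m.Index + m.Length));
}

foreach (var p in pos)
{
    Console.WriteLine($"[{p.Start}, {p.End})");
}
```

The start/end pairs are the same ones the JS workaround computes by adding the length of the captured whitespace.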
BONUS
My regex works at regex101.com, but not in...
First of all, have you checked the Code Generator link in the Tools pane on the left?
All languages - "Literal string" vs. "String literal" alert - Make sure you test against the same text used in code, literal string, at the regex tester. A common scenario is copy/pasting a string literal value directly into the test string field, with all string escape sequences like \n (line feed char), \r (carriage return), \t (tab char). See Regex_search c++, for example. Mind that they must be replaced with their literal counterparts. So, if you have in Python text = "Text\n\n abc", you must use Text, two line breaks, abc in the regex tester text field. Text.*?abc will never match it although you might think it "works". Yes, . does not always match line break chars, see How do I match any character across multiple lines in a regular expression?
All languages - Backslash alert - Make sure you correctly use a backslash in your string literal, in most languages, in regular string literals, use double backslash, i.e. \d used at regex101.com must be written as \\d. In raw string literals, use a single backslash, same as at regex101. Escaping word boundary is very important, since, in many languages (C#, Python, Java, JavaScript, Ruby, etc.), "\b" is used to define a BACKSPACE char, i.e. it is a valid string escape sequence. PHP does not support \b string escape sequence, so "/\b/" = '/\b/' there.
All languages - Default flags - Global and Multiline - Note that by default m and g flags are enabled at regex101.com. So, if you use ^ and $, they will match at the start and end of lines correspondingly. If you need the same behavior in your code check how multiline mode is implemented and either use a specific flag, or - if supported - use an inline (?m) modifier. The g flag enables multiple occurrence matching, it is often implemented using specific functions/methods. Check your language reference to find the appropriate one.
line-breaks - Line endings at regex101.com are LF only, you can't test strings with CRLF endings, see regex101.com VS myserver - different results. Solutions can be different for each regex library: either use \R (PCRE, Java, Ruby) or some kind of \v (Boost, PCRE), \r?\n, (?:\r\n?|\n)/(?>\r\n?|\n) (good for .NET) or [\r\n]+ in other libraries (see answers for C#, PHP). Another issue related to the fact that you test your regex against a multiline string (not a list of standalone strings/lines) is that your patterns may consume the end of line, \n, char with negated character classes, see an issue like that. \D matched the end of line char, and in order to avoid it, [^\d\n] could be used, or other alternatives.
php - You are dealing with Unicode strings, or want shorthand character classes to match Unicode characters, too (e.g. \w+ to match Стрибижев or Stribiżew, or \s+ to match hard spaces), then you need to use u modifier, see preg_match() returns 0 although regex testers work - To match all occurrences, use preg_match_all, not preg_match with /...pattern.../g, see PHP preg_match to find multiple occurrences and "Unknown modifier 'g' in..." when using preg_match in PHP?- Your regex with inline backreference like \1 refuses to work? Are you using a double quoted string literal? Use a single-quoted one, see Backreference does not work in PHP
phplaravel - Mind you need the regex delimiters around the pattern, see https://stackoverflow.com/questions/22430529
python - Note that re.search, re.match, re.fullmatch, re.findall and re.finditer accept the regex as the first argument, and the input string as the second argument. Not re.findall("test 200 300", r"\d+"), but re.findall(r"\d+", "test 200 300"). If you test at regex101.com, please check the "Code Generator" page. - You used re.match that only searches for a match at the start of the string, use re.search: Regex works fine on Pythex, but not in Python - If the regex contains capturing group(s), re.findall returns a list of captures/capture tuples. Either use non-capturing groups, or re.finditer, or remove redundant capturing groups, see re.findall behaves weird - If you used ^ in the pattern to denote start of a line, not start of the whole string, or used $ to denote the end of a line and not a string, pass re.M or re.MULTILINE flag to re method, see Using ^ to match beginning of line in Python regex
- If you try to match some text across multiple lines, and use re.DOTALL or re.S, or [\s\S]* / [\s\S]*?, and still nothing works, check if you read the file line by line, say, with for line in file:. You must pass the whole file contents as the input to the regex method, see Getting Everything Between Two Characters Across New Lines. - Having trouble adding flags to regex and trying something like pattern = r"/abc/gi"? See How to add modifers to regex in python?
c#, .net - .NET regex does not support possessive quantifiers like ++, *+, ?+, {1,10}+, see .NET regex matching digits between optional text with possessive quantifer is not working
- When you match against a multiline string and use the RegexOptions.Multiline option (or the inline (?m) modifier) with a $ anchor in the pattern to match entire lines, and get no match in code, you need to add \r? before $, see .Net regex matching $ with the end of the string and not of line, even with multiline enabled
- To get multiple matches, use Regex.Matches, not Regex.Match, see RegEx Match multiple times in string
- Similar case as above: splitting a string into paragraphs, by a double line break sequence - C# / Regex Pattern works in online testing, but not at runtime
- You should remove regex delimiters, i.e. @"/\d+/" must actually look like @"\d+", see Simple and tested online regex containing regex delimiters does not work in C# code
- If you unnecessarily used Regex.Escape to escape all characters in a regular expression (like Regex.Escape(@"\d+\.\d+")) you need to remove Regex.Escape, see Regular Expression working in regex tester, but not in c#
dartflutter - Use raw string literal, RegExp(r"\d"), or double backslashes (RegExp("\\d")) - https://stackoverflow.com/questions/59085824
javascript - Double escape backslashes in a RegExp("\\d"): Why do regex constructors need to be double escaped?
- (Negative) lookbehinds unsupported by most browsers: Regex works on browser but not in Node.js - Strings are immutable, assign the .replace result to a var - The .replace() method does change the string in place - Retrieve all matches with str.match(/pat/g) - Regex101 and Js regex search showing different results or, with RegExp#exec, RegEx to extract all matches from string using RegExp.exec- Replace all pattern matches in string: Why does javascript replace only first instance when using replace?
javascriptangular - Double the backslashes if you define a regex with a string literal, or just use a regex literal notation, see https://stackoverflow.com/questions/56097782
java - Word boundary not working? Make sure you use double backslashes, "\\b", see Regex \b word boundary not works - Getting invalid escape sequence exception? Same thing, double backslashes - Java doesn't work with regex \s, says: invalid escape sequence - No match found is bugging you? Run Matcher.find() / Matcher.matches() - Why does my regex work on RegexPlanet and regex101 but not in my code? - .matches() requires a full string match, use .find(): Java Regex pattern that matches in any online tester but doesn't in Eclipse - Access groups using matcher.group(x): Regex not working in Java while working otherwise - Inside a character class, both [ and ] must be escaped - Using square brackets inside character class in Java regex - You should not run matcher.matches() and matcher.find() consecutively, use only if (matcher.matches()) {...} to check if the pattern matches the whole string and then act accordingly, or use if (matcher.find()) to check if there is a single match or while (matcher.find()) to find multiple matches (or Matcher#results()). See Why does my regex work on RegexPlanet and regex101 but not in my code?
scala - Your regex attempts to match several lines, but you read the file line by line (e.g. use for (line <- fSource.getLines))? Read it into a single variable (see matching new line in Scala regex, when reading from file)
kotlin - You have Regex("/^\\d+$/")? Remove the outer slashes, they are regex delimiter chars that are not part of a pattern. See Find one or more word in string using Regex in Kotlin - You expect a partial string match, but .matchEntire requires a full string match? Use .find, see Regex doesn't match in Kotlin
mongodb - Do not enclose /.../ with single/double quotation marks, see mongodb regex doesn't work
c++ - regex_match requires a full string match, use regex_search to find a partial match - Regex not working as expected with C++ regex_match - regex_search finds the first match only. Use sregex_token_iterator or sregex_iterator to get all matches: see What does std::match_results::size return? - When you read a user-defined string using std::string input; std::cin >> input;, note that cin will only get to the first whitespace, to read the whole line properly, use std::getline(std::cin, input); - C++ Regex to match '+' quantifier - "\d" does not work, you need to use "\\d" or R"(\d)" (a raw string literal) - This regex doesn't work in c++ - Make sure the regex is tested against a literal text, not a string literal, see Regex_search c++
go - Double backslashes or use a raw string literal: Regular expression doesn't work in Go - Go regex does not support lookarounds, select the right option (Go) at regex101.com before testing! Regex expression negated set not working golang
groovy - Return all matches: Regex that works on regex101 does not work in Groovy
r - Double escape backslashes in the string literal: "'\w' is an unrecognized escape" in grep - Use perl=TRUE to switch to the PCRE engine ((g)sub/(g)regexpr): Why is this regex using lookbehinds invalid in R?
oracle - Greediness of all quantifiers is set by the first quantifier in the regex, see Regex101 vs Oracle Regex (then, you need to make all the quantifiers as greedy as the first one) - \b does not work? Oracle regex does not support word boundaries at all, use workarounds as shown in Regex matching works on regex tester but not in oracle
firebase - Double escape backslashes, make sure ^ only appears at the start of the pattern and $ is located only at the end (if any), and note you cannot use more than 9 inline backreferences: Firebase Rules Regex Birthday
firebasegoogle-cloud-firestore - In Firestore security rules, the regular expression needs to be passed as a string, which also means it shouldn't be wrapped in / symbols, i.e. use allow create: if docId.matches("^\\d+$").... See https://stackoverflow.com/questions/63243300
google-data-studio - /pattern/g in REGEXP_REPLACE must contain no / regex delimiters and flags (like g) - see How to use Regex to replace square brackets from date field in Google Data Studio?
google-sheets - If you think REGEXEXTRACT does not return full matches, truncates the results, you should check if you have redundant capturing groups in your regex and remove them, or convert the capturing groups to non-capturing by adding ?: after the opening (, see Extract url domain root in Google Sheet
sed - Why does my regular expression work in X but not in Y?
word-boundarypcrephp - [[:<:]] and [[:>:]] do not work in the regex tester, although they are valid constructs in PCRE, see https://stackoverflow.com/questions/48670105
snowflake-cloud-data-platform snowflake-sql - If you are writing a stored procedure, and \\d does not work, you need to double them again and use \\\\d, see REGEX conversion of VARCHAR value to DATE in Snowflake stored procedure using RLIKE not consistent.
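A few of the checklist items above can be seen side by side in C#. This is a minimal sketch with made-up input strings, covering the backslash alert, CRLF line endings with RegexOptions.Multiline, and Regex.Matches versus Regex.Match.

```csharp
using System;
using System.Text.RegularExpressions;

// 1. Backslash alert: in a regular string literal "\b" is a BACKSPACE char (U+0008);
//    in a verbatim string @"\b" reaches the regex engine as a word boundary.
bool regularLiteral = Regex.IsMatch("a cat sat", "\bcat\b");
bool verbatimLiteral = Regex.IsMatch("a cat sat", @"\bcat\b");
Console.WriteLine($"{regularLiteral} {verbatimLiteral}");

// 2. Line endings: with CRLF input, $ in Multiline mode only matches before \n,
//    so the \r left in front of it has to be allowed for explicitly.
string text = "a1\r\nb2\r\n";
int withoutCr = Regex.Matches(text, @"^\w+$", RegexOptions.Multiline).Count;
MatchCollection withCr = Regex.Matches(text, @"^(\w+)\r?$", RegexOptions.Multiline);
Console.WriteLine($"{withoutCr} {withCr.Count}");

// 3. Regex.Matches returns every occurrence; Regex.Match would return only the first.
foreach (Match m in withCr)
{
    Console.WriteLine(m.Groups[1].Value);
}
```

The capture group in the second pattern keeps the optional \r out of the extracted value.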

C# regex can't find text with whitespace if input is escaped

While trying to find a bit of text with a single whitespace between two words, I encountered something that seems like a bug. I'm using a pattern like (abc)\s(abc), to find two specific words. Now I'm escaping my input using Regex.Escape, but then my regex doesn't match anymore because spaces are escaped (to \space), and then not matched. Is this intended?
My text comes from user input, so as far as I know it should be escaped.
To clarify my question, the following code:
Console.WriteLine("Original text: " + text);
Console.WriteLine("Escaped text: " + Regex.Escape(text));
Console.WriteLine("Matches non-escaped text: " + Regex.IsMatch(text, @"(abc)\s(abc)", RegexOptions.IgnoreCase));
Console.WriteLine("Matches escaped text: " + Regex.IsMatch(Regex.Escape(text), @"(abc)\s(abc)", RegexOptions.IgnoreCase));
Gives the following result for input abc abc
Original text: abc abc
Escaped text: abc\ abc
Matches non-escaped text: True
Matches escaped text: False
While I would expect it to still match on spaces
My text comes from user input, so as far as I know it should be escaped.
This is a faulty premise. If you assume this, then every time someone uses any of your apps to create a record for an employee named Shamus A. O'Leary, they'll probably end up being inserted into the db as Shamus A\. O\'Leary, Shamus A. O&#39;Leary, Shamus+A%2E+O'Leary etc., depending on where the data came from and how you decided it needed to be escaped.
Just because a user provides text doesn't mean it needs to be escaped. You have to apply escaping contextually rather than as a blanket rule based on where the text comes from. Generally, escaping is used to make sure data can survive being put through some transport channel that doesn't support all the characters, or that would process some of the characters as having a special meaning when they should not. So instead of looking at escaping as something that must be done depending on the source of the data, look at it as something that must be done to ensure data reaches a destination unharmed.
Regex-wise, (abc)\s(abc) does not match a string of abc\ abc, because of the backslash. You've transformed your string from X, which matches, into something else (Y), and then asked the regex engine whether Y matches the regex. It's no more a match than abc+abc is a match, going off an assumption that "when URLs are escaped, spaces become pluses, so a plus and a space must mean the same thing to a regex": the regex engine will just look at the data and say "plus is not a whitespace character; no match". The regex engine won't look at your data and think "hey, if I just unescape this before I run it through the pattern matcher...", and it won't look at your data and think "it's a regex pattern". A regex pattern and the data passed to a matcher working from that pattern are very different things, and if you want your data to match a described pattern, don't alter the data after you've decided on the pattern.
Thus the fault is in transforming the string by running a character replacement (escaping) before asking for the match.
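To make the "escape for the destination" point concrete: Regex.Escape is meant for text that is about to become part of a pattern, not for the input being searched. A minimal sketch, where `userTerm` is a hypothetical user-supplied search term that does not appear in the question:

```csharp
using System;
using System.Text.RegularExpressions;

string input = "We know that 1+1=2.";   // data: searched as-is, never escaped
string userTerm = "1+1=2";              // hypothetical user-supplied search term

// Unescaped, "+" is a quantifier, so the pattern means "one or more 1s, then 1=2".
bool rawTerm = Regex.IsMatch(input, userTerm);

// Escaped, every character of the term is matched literally.
bool escapedTerm = Regex.IsMatch(input, Regex.Escape(userTerm));

Console.WriteLine($"{rawTerm} {escapedTerm}");

// Escaping the input instead changes the data, as in the question.
Console.WriteLine(Regex.Escape("abc abc"));
```

Here the escaping protects the pattern from the user's text, and the data being searched is left untouched.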

getting the correct regex to print out in c#

Below is a regex statement I have been working on for quite some time:
Match parsedRequestData = Regex.Match(requestData, @"^.*\[(.*)\]$");
What this is supposed to be doing is taking the email out of the email below:
2.3|[0246303@up.com]
For clarification, this email comes from a table in SQL Server. There are many emails that are formatted like this in there and the regex is supposed to be getting all of that from inside the brackets. However, it is matching the entirety of this line instead of what's inside of it. So my question is, is there something wrong with my regex statement or do I have something in my code I need to add?
Your regex is storing the email address in capture group 1. Try referencing group 1 like this:
parsedRequestData.Groups[1];
Code Sample:
string requestData = "2.3|[0246303@up.com]";
Match parsedRequestData = Regex.Match(requestData, @"^.*\[(.*)\]$");
if (parsedRequestData.Success)
{
Console.WriteLine(parsedRequestData.Groups[1]);
}
Results:
0246303@up.com
Your regex is OK. All you need is to use Groups[1]:
var email = Regex.Match("2.3|[0246303@up.com]", @"^.*\[(.*)\]$").Groups[1].Value;
However, it is matching the entirety of this line instead of whats inside of it.
Unless one uses named match captures, the match capture groups are indexed.
Match.Groups[0].Value is the whole match; it shows all the match captures and all the grouped matched text.
Match.Groups[1].Value through Match.Groups[N].Value are the match captures, numbered in the order in which their ( ) parentheses appear in the pattern. If there is only one ( ) there will be two indexed groups: 0 as mentioned above, and 1 for the captured item.
You only have one ( ) set so the data you want is found in match capture group 1. Group 0 has the non match capture items along with the match capture data.
If one names the match capture such as (?<MyNameHere> ) one can also access the match via Match.Groups["MyNameHere"].Value.
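The indexing rules above can be checked with a short sketch. The input line is a hypothetical one in the same "prefix|[address]" shape as the question's data, and the group name Email is only an illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample line in the same shape as the question's data.
string requestData = "7.1|[someone@example.com]";

// Same pattern as in the question, with the capture given a name.
Match m = Regex.Match(requestData, @"^.*\[(?<Email>.*)\]$");

Console.WriteLine(m.Groups[0].Value);       // whole match
Console.WriteLine(m.Groups["Email"].Value); // just the capture
Console.WriteLine(m.Groups.Count);          // group 0 plus the one capture
```

A named group still gets a number, so m.Groups[1] and m.Groups["Email"] refer to the same capture here.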
A side suggestion on your pattern, apart from the answer
Usage of * (zero or more) in patterns can be problematic in that it can significantly increase the time the parser takes, because it backtracks through scenarios that can never match.
If one knows there is text to be found, don't tell the parser zero items may happen when that is impossible; change it to + (one or more). That slight change can greatly affect the parsing operations, both in time and in number of steps.
Change ^.*\[(.*)\]$ to ^.+\[(.+)\]$.
To increase the efficiency of the pattern further, focus on the known characters [ and ] as anchors.
Pattern Restructure To Use Anchors
^[^[]+\[([^\]]+)[\s\]]+$
Why is this pattern better? Because we will look for "[" and "]" as anchors.
Let us break it down
^ - Beginning of the input (a hard anchor)
[^ ]+ This is a set notation where the ^ says NOT.
[^\[]+ So we want to capture all text + (one or more) that is NOT a [. This tells the pattern to match up to our anchor [ in the text. Note that we don't have to escape it for regex parser treats all characters in a set [ ] as a literal so [^[] is valid. (To be clear this is a match but don't capture text anchor so we will not find this text in an index above the 0 index; only in 0).
\[ Our literal anchor the "[" character.
([^\]]+) This is our match capture which says match this set where any character is valid but not an "]". Here we have to escape the ] because otherwise it would signify the end of our set.
[\s\]]+ we know the end of our text there will be spaces and the "]" character, so let us match (but not to capture) any combination of spaces and a ] before the end.
$ our final anchor, the end of the file/buffer indicator (or line if the right parser rule is set).
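The breakdown above can be exercised with a few inputs. This is a minimal sketch; the sample lines are hypothetical ones in the same shape as the question's data, chosen to show what the anchored pattern accepts and rejects.

```csharp
using System;
using System.Text.RegularExpressions;

string anchored = @"^[^[]+\[([^\]]+)[\s\]]+$";

// Hypothetical inputs in the same shape as the question's data.
string[] samples =
{
    "7.1|[someone@example.com]",
    "7.1|[someone@example.com] ",   // trailing space is absorbed by [\s\]]+
    "7.1|[]",                       // empty brackets: + requires at least one character
};

foreach (string s in samples)
{
    Match m = Regex.Match(s, anchored);
    Console.WriteLine(m.Success ? m.Groups[1].Value : "(no match)");
}
```

The third sample shows the effect of preferring + over *: a line with nothing between the brackets is rejected instead of yielding an empty capture.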

regex syntax stop search

How do I make Regex stop the search after "Target This"?
HeaderText="Target This" AnotherAttribute="Getting Picked Up"
This is what i've tried
var match = Regex.Match(string1, @"(?<=HeaderText=\").*(?=\")");
The quantifier * is greedy (eager), which means it will consume as many characters as it can while still getting a match. You want the lazy quantifier, *?.
As an aside, rather than using look-around expressions as you have done here, you may find it in general easier to use capturing groups:
var match = Regex.Match(string1, "HeaderText=\"(.*?)\"");
//                                             ^   ^ these parentheses make a capturing group
Now the match matches the whole thing, but match.Groups[1] is just the value in the quotes.
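The difference between the two quantifiers is easy to see on the question's own input. A minimal sketch, using the capturing-group form of the pattern:

```csharp
using System;
using System.Text.RegularExpressions;

string string1 = "HeaderText=\"Target This\" AnotherAttribute=\"Getting Picked Up\"";

// Greedy: .* runs to the last quote on the line.
string greedy = Regex.Match(string1, "HeaderText=\"(.*)\"").Groups[1].Value;

// Lazy: .*? stops at the first quote that lets the match succeed.
string lazy = Regex.Match(string1, "HeaderText=\"(.*?)\"").Groups[1].Value;

Console.WriteLine(greedy);
Console.WriteLine(lazy);
```

The greedy version swallows the second attribute as well, while the lazy one yields only Target This.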
Plain regex pattern
(?<=HeaderText=").*?(?=")
or as string
string pattern = "(?<=HeaderText=\").*?(?=\")";
or using a verbatim string
string pattern = @"(?<=HeaderText="").*?(?="")";
The trick is the question mark after .*. It means "as few as possible", making it stop after the first end-quotes it encounters.
Note that verbatim strings (introduced with @) do not recognize the backslash \ as escape character. Escape the double quotes by doubling them.
Note for others interested in regex: The search pattern used finds a position between a prefix and a suffix:
(?<=prefix)find(?=suffix)
Try this:
var match = Regex.Match(string1, "HeaderText=\"([^\"]+)");
var val = match.Groups[1].Value; //Target This
UPDATE
If the target itself may contain double quotes, change the regex to:
HeaderText=\"(.+?)\"\\s+\\w
Note: this is not the right way to do it. If it's XML, check out System.Xml; otherwise, use HtmlAgilityPack (see How to use HTML Agility pack).

Why does changing this regex class to .+ not provide any match?

If I use this
string showPattern = @"return new_lightox\(this\);"">[a-zA-Z0-9(\s),!\?\-:'&%]+</a>";
MatchCollection showMatches = Regex.Matches(pageSource, showPattern);
I get some matches, but I want to get rid of [a-zA-Z0-9(\s),!\?\-:'&%]+ and use any char .+
but if I do this I get no match at all.
What am I doing wrong?
By default "." does not match newlines, but the class \s does.
To let . match newline, turn on Singleline (DOTALL) mode, either using a flag in the function call (as Abel's answer shows), or using the inline modifier (?s), like this for the whole expression:
@"(?s)return new_lightox\(this\);"">.+</a>"
Or for just the specific part of it:
@"return new_lightox\(this\);"">(?s:.+)</a>"
It might be better to take that a step further and check before every character that no other <a> or </a> tag starts there:
@"return new_lightox\(this\);"">(?s:(?:(?!</?a).)+)</a>"
Which should prevent the closing </a> from belonging to a different link.
However, you need to be very wary here - it's not clear what you're doing overall, but regex is not a good tool for parsing HTML with, and can cause all sorts of problems. Look at using a HTML DOM parser instead, such as HtmlAgilityPack.
You're matching a tag, so you probably want something along these lines, instead of .+:
string showPattern = @"return new_lightox\(this\);"">[^<]+</a>";
The reason that the match doesn't hit is possibly that you are missing the Singleline flag and the closing tag is on the next line. In other words, this should work too:
// The Singleline option changes the dot (.) to match newlines too
MatchCollection showMatches = Regex.Matches(
    pageSource,
    showPattern,
    RegexOptions.Singleline);
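The two fixes suggested in this thread can be compared on a small input. This is a minimal sketch; `pageSource` here is a hypothetical fragment in which the link text spans two lines, since the real page is not shown in the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical page fragment: the link text is split across two lines.
string pageSource = "<a onclick=\"return new_lightox(this);\">Some\nShow</a>";

string dotPattern = @"return new_lightox\(this\);"">.+</a>";
string classPattern = @"return new_lightox\(this\);"">[^<]+</a>";

// By default the dot stops at the newline, so the closing tag is never reached.
bool dotDefault = Regex.IsMatch(pageSource, dotPattern);

// With Singleline the dot matches newlines too.
bool dotSingleline = Regex.IsMatch(pageSource, dotPattern, RegexOptions.Singleline);

// A negated class matches newlines without any option.
bool negatedClass = Regex.IsMatch(pageSource, classPattern);

Console.WriteLine($"{dotDefault} {dotSingleline} {negatedClass}");
```

The negated class also stops at the next <, which keeps the match inside a single link when a page contains several.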
