Regex hangs trying to find match

Regex hangs trying to find match - c#

I am trying to match an assignment string in VB code (as in I'm passing in text that is VB code into my program that's written in C#). The assignment string that I'm trying to match is something for example like
CustomClassInitializer(someParameter, anotherParameter, someOtherClassAsParameterWithInitialization()).SomeProperty = 7
and I realize that's rather complex, but it actually isn't far off from some of the real text I'm trying to match.
In order to do so I wrote a Regex. This Regex:
#"[\w,.]+\(([\w,.]*\(*,* *\)*)+ = "
which correctly matches. The problem is it becomes VERY slow (with timeouts), which I've researched and found is probably because of "backtracking". One of the suggested solutions to help with backtracking in general was to add "?>" to the regex, which I think would go in this position:
[\w,.]+\(?>([\w,.]*\(*,* *\)*)+ =
but this no longer matches properly.
I'm fairly new to Regex, so I imagine that there is a much better pattern. What is it please? Or how can I improve my times in general?
Helpful notes:
I'm only interested in position 0 of the string I'm searching for a
match in. My code is "if (isMatch && match.index == 0) { ... }. Can
I tell it to only check position 0 and if it's not a match move on?
The reason I use all the 0 or more things is the match could be as simple as CustomClass() = new CustomClass(), and as complicated as the above or perhaps a bit worse. I'm trying to get as many cases as possible.
This Regex is interested in "[\w,.]+(" and then "whatever may be inside the parentheses" (I tried to think of what all could be inside them based on the fact that it's valid VB code) until you get to the close parenthesis and then " = ". Perhaps I can use a wildcard for literally anything until it get's to ") = " in the string? - Like I said, fairly new to Regex.
Thanks in advance!

This seems to do what you want. Normally, I like to be more specific than .*, but it is working correctly. Note that I am using the Multi-line option.
^.*=\s*.+$
Here is a working example in RegExStorm.net example

Related

Find all if else and else if statements with no parentheses

I've got a bunch of code in c# that has here and there lines where an if or else statement without parentheses got some extra lines which result in false behavior. I am looking for a regex to find all possible places where this problem could occur so i can manually look for it.
To clarify what i mean. In code are a few places (i found until now) where the following code is wrong.
if (notInitialized)
Initialize();
AdditionalInitializationNotUseThisWhenAlreadyInitialized();
which should be
if (notInitialized) {
Initialize();
AdditionalInitializationNotUseThisWhenAlreadyInitialized();
}
I tried this if\s*\(.*\)([\n\r\s[^{]]*.*);* but it gives me not only the results i want. It has the if (notInitialized) { parts too. I have almost no experience in using regex.
How can i find all these cases without checking every if/else/else if in the code, just the ones without curly braces?

One problem you are facing is that the regex is matching as much as it can on the .* to find a pattern match. Therefore, and depending on the options used (e.g. . matches anything but \n, or anything), you'll get unsatisfactory results.
Another problem is that you'd need to match recursively, e.g. skip as many ) as there were nested '(' in the expression. Only a very few regex engines can do this; .NET fortunately can via "balancing groups" but it is tricky and highly advanced application of regex. Also, for that to work correctly, you'd have to also recognize string and character literals (in quotes) as to not count parens in these.
Edit This regex for .NET should pretty reliably find these if and else statements:
\b(if\s*\(((?:[^()"']|'(\\.|[^'\\])'|"(\\.|[^"\\])"|(?<parens>\()|(?<-parens>\)))*(?(parens)(?!))\))|else)\s*[^{\s]
While this shows how powerful regexes can be, it's very cryptic and the proper way to do this would really be with a real parser (such as Roslyn).

You can use this:
if \(.+?\)[^{]*?\n[^{]*?[^{]
If you dont have ifs in a format like this
if (notInitialized)
{
Initialize();
AdditionalInitializationNotUseThisWhenAlreadyInitialized();
}
this also works:
if \(.+?\)[^{]*?\n
It will detect, if a line with "if" has no { at the end. Its is also a little shorter.

Is there a better way to check if an entire string was matched? [duplicate]

This question already has answers here:
Match exact string
(3 answers)
Closed 3 years ago.
I'm parsing a text file line by line and for each line I have a special regex. However in one case a pattern is matching two lines. One that is a correct match and another line only partialy because a couple of values are optional.
Invalid match:
BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD
matches patial string (shouln't match this at all):
BNE1010/1000
Correct match (matches the entire string):
RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP
The regex for this is quite long and contains several optionl groups:
^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?
I think there is no need to study the entire regex becasue it's build dynamically from smaller patterns at runtime and all the parts work correctly. Also lots of combinations are tested with unit tests and they all work... as long as I try to parse ony the line that should be matched by the pattern.
Currently I'm checking if the entire string is matched by
match.Group[0].Value == line
but I find it's quite ugly. I know from JavaScript the regex engine provides an Index property where the regex engine stopped. So my idea was to compare the index with the length of the string. Unfortunatelly I wasn't able to find such a property in C#.
Another idea would be to modify the regex so that it matches only one line and no partial lines.
Example: https://regex101.com/r/dM5wU4/1
The example contains only two cases because there aren't actually any combinations that would change its behavior. I could remove some parameters but it wouldn't change anything.
EDIT:
I've edited my question. Sorry to every for not providing all the information at the first time. I won't ask any more questions when writing on the phone :) It wasn't a good idea. Hopefully it won't get closed now.
You asked whether I could simplify the regex. I would do it if I could and knew how. If it was easy I wouldn't have asked. The problem started as the regex ans string became bigger during development. Now they are at the production length and I can't actually make them shorter even for the sake of the quesion, sorry.
EDIT-2:
I found the reason why I couldn't find the inherited Index and Length properties of the Match class.
For some strange reason when selecting the Match class and pressing F1 Visual Studio opened the wrong help page (Match Properties) even though I'm not working with the Micro Framework. I didn't notice that but I was indeed wondering why there is very little information. Thx to #Jamiec for the correct link. I won't trust Visual Studio anymore when hitting F1.

Disclaimer: Im going to add this, but I doubt its the solution. If it's not this part will get deleted in short order
You can add a $ at the end of your regular expression. This stops your first example matching but continues to match the second example.
As you've not provided any more than 2 examples, its unclear if this actually solves all your cases or just that one specific false positive.
My question is whether it is possible to check if a regular expression matched the entire sting without checking the first group against the original line?
If you're not adverse to checking the entire match to the length of the string you can do that too:
var regex = new Regex(#"^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?");
var input1 = #"BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD";
var input2 = #"RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP";
var match1 = regex.Match(input1);
var match2 = regex.Match(input2);
Console.WriteLine(match1.Length == input1.Length); // False
Console.WriteLine(match2.Length == input2.Length); // True
Live example: http://rextester.com/NIBE6349

Unable to get a block of code into my regex match groups

So yeah, the title is pretty weird but I have no other idea how to describe my problem properly. Whatever... lets get to the problem.
Job to get done
My boss wants a function that read all functions of a python file and return a DataTable containing the found functions. This function should be written in IronPython (Python which actually uses C# libraries).
The Problem
I am relatively new to Python and I have no idea what this language is capable of, so I started to write my function and yeah it works pretty well, except one weird problem. I wrote a regular expression to find the functions and to test it I downloaded a RegEx Tester. The Regex Tester showed the results I wanted: Group 1 - The function name, Group 2 - The functions parameter and Group 3 - the content of the function.
For some magical reasons, it doesn't work when it goes to live testing. And with doesn't work I mean, Group 3 has actually no output. After testing the expression with another (online) RegEx Tester, it showed me, that Group 3 has actually not the content of the function, it only has a small part of it, starting with a newline/return character.
In my test cases, the results of Group 3 where all the same, starting with a newline/return character and ended with the functions return (e.g. return objDic).
Question: What the hell is going wrong there? I have no idea what is wrong on my RegEx.
The Regex
objRegex = Regex(r"(?i)def[\s]+([\w]+)\(([\, [\w]+)\)(?:[\:{1}]\s*)([\n].*(?!\ndef[\s]+))+")
The Data
def test_function(some_parameter):
try:
some_cool_code_goes_here()
return obj
except Exception as ex:
DetailsBox.Show(ex)
def another_cool_function(another_parameter):
try:
what_you_want()
return obj
except Exception as ex:
DetailsBox.Show(ex)
The Result
Match: def test_function(some_parameter):...
Position: ..
Length: ..
Group 1: test_function
Group 2: some_parameter
Group 3: (newline/return character)
return obj
But Group 3 should be:
try:
some_cool_code_goes_here()
return obj
except Exception as ex:
DetailsBox.Show(ex)
I hope you can help me :3 Thank you guys!

Although #Hamza said in his comment that you have several problems in your regex, I think they are more of uneeded complexity, the reason for not matching the body might be that you haven't let the . special meta-character match the new line so it is stopping at the first new line character after the first Try: statement.
To fix this you will need to let the . match new line characters and here is a stripped down version of your regex that works:
(?i)def\s+(\w+)\s*\(([\, \w]+)\)(?:\s*:\s*)(.+?)(?=def|$)

Thanks to HamZa for the quick help (and of course also thanks for all the other helpers), he actually solved the problem. There were just a few adjustments necessary (to make it work for C# :-)) but the main point comes from him, thanks a lot.
Solution for my problem:
Regex(r"(?is)def\s*(?<name>\w+)\s*\((?<parameter>[^)]+)\)\s*:\s*(?:\r?\n)+(?<body>.*?)(?=\r?\ndef|$)")

Using RegEx to read through a CSV file

I have a CSV file, with the following type of data:
0,'VT,C',0,
0,'C,VT',0,
0,'VT,H',0,
and I desire the following output
0
VT,C
0
0
C,VT
0
0
VT,H
0
Therefore splitting the string on the comma however ignoring the comma within quote marks. At the moment I'm using the following RegEx:
("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)"
however this gives me the result of:
0
VT
C
0
0
C
VT
0
0
VT
H
0
This show the RegEx is not reading the quote mark properly. Can anyone suggest some alterations that might help?

Usually when it comes to CSV parsing, people use specific libraries well suited for the programming language they are using to code their application.
Anyway if you are going to use a regular expression to make a really loose(!) parsing you may try using something like this:
'(?<value>[^']*?)'
It will match anything in between single quotes, and assuming the csv file is well formed, it will not miss a field. Of course it doesn't accept embedded quotes but it easily gets the job done. That's what I use when I need to get the job done really quickly. Please don't consider it a complete solution to your problem...it just works in special conditions when the requirements are what you described and the input is well formed.
[EDIT]
I was checking again your question and noticed you want to include also non quoted fields...well ok in that case my expression will not work at all. Anyway listen...if you think hard about your problem, you'll find that's something quite difficult to solve without ambiguity. Because you need fixed rules and if you allow quoted and not quoted fields, the parser will have hard time figuring out legit commas as separator/quoted.
Another expression to model such a solution may be:
('[^']+'|[^,]+),?
It will match both quoted/notquoted fields...anyway I'm not sure if it needs to assume the csv HAS to adhere to strict conditions. That will work much safer then a split strategy as far as I can tell ... you just need to collect all matches and print the matched_value + \r\n on your target string.

This regex is based of the fact you have 1 digit before and after your 'value'
Regex.Replace(input, #"(?:(?<=\d),|,(?=\d))", "\n");
You can test it out on RegexStorm

foreach(var m in Regex.Matches(s,"(('.*?')|[0-9])"))

I have manages to get the following method to read the file as required:
public List<string> SplitCSV(string input, List<string> line)
{
Regex csvSplit = new Regex("(([^,^\'])*(\'.*\')*([^,^\'])*)(,|$)", RegexOptions.Compiled);
foreach (Match match in csvSplit.Matches(input))
{
line.Add(match.Value.TrimStart(','));
}
return line;
}
Thanks for everyone help though.

Regex MatchCollection obj hangs\"Function evuluation timed out" after Regex.Matches

I'm kind of new too C#, and regular expression for that matter, but I've searched a couple of hours to find a solution too this problem so, hopefully this is easy for you guys:)
My application uses a regex to match email addresses in a given string,
then loops throu the matches.:
String EmailPattern = "\\w+([-+.]\\w+)*#\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*";
MatchCollection mcemail = Regex.Matches(rawHTML, EmailPattern);
foreach (Match memail in mcemail)
Works fine, but, when I downloaded the string from a certain page, http://www.sp.se/sv/index/services/quality/sidor/default.aspx, the MatchCollection(mcemail) object "hangs" the loop. When using a break point and accessing the object, I get "Function evuluation timed out" on everything(.Count etc).
Update
I've tried my pattern and other email patterns on the same string, everyone(regex desingers, python based web pages etc.) fails/timesout when trying too match this particular string.
How can I detect that the matchcollection obj is not "ready" to use?

If you can post the email that's causing the problem (perhaps anonymized in some way), that will give us more information, but I'm thinking the problem is this little guy right here:
([-.]\\w+)*\\.\\w+([-.]\\w+)*
To understand the problem, let's break that into groups:
([-.]\\w+)*
\\.\\w+
([-.]\\w+)*
The strings that will match \\.\\w+ are a subset of those that will match [-.]\\w+. So if part of your input looks like foo.bar.baz.blah.yadda.com, your regex engine has no way of knowing which group is supposed to match it. Does that make sense? So the first ([-.]\\w+)* could match .bar.baz.blah, then the \\.\\w+ could match .yadda, then the last ([-.]\\w+)* could match .com...
...OR the first clause could match .bar.baz, the second could match .blah, and the last could match .yadda.com. Since it doesn't know which one is right, it will keep trying different combinations. It should stop eventually, but that could still take a long time. This is called "catastrophic backtracking".
This issue is compounded by the fact that you're using capturing groups rather than non-capturing groups; i.e. ([-+.]\\w+) instead of (?:[-+.]\\w+). That causes the engine to try and separate and save whatever matches inside the parentheses for later reference. But as I explained above, it's ambiguous which group each substring belongs in.
You might consider replacing everything after the # with something like this:
\\w[-\\w]*\\.[-.\\w]+
That could use some refinement to make it more specific, but you get the general idea. Hope I explained all this well enough; grouping and backreferences are kind of tough to describe.
EDIT:
Looking back at your pattern, there's a deeper issue here, still related to the backtracking/ambiguity problem I mentioned. The clause \\w+([-.]\\w+)* is ambiguous all by itself. Splitting it into parts, we have:
\\w+
([-.]\\w+)*
Suppose you have a string like foobar. Where does the \\w+ end and the ([-.]\\w+)* begin? How many repetitions of ([-.]\\w+) are there? Any of the following could work as matches:
f(oobar)
foo(bar)
f(o)(oba)(r)
f(o)(o)(b)(a)(r)
foobar
etc...
The regex engine doesn't know which is important, so it will try them all. This is the same problem I pointed out above, but it means you have it in multiple places in your pattern.
Even worse, ([-.]\\w+)* is also ambiguous, because of the + after the \\w. How many groups are there in blah? I count 16 possible combinations: (blah), (b)(lah), (bl)(ah)...
The amount of different possible combinations is going to be huge, even for a relatively small input, so your engine is going to be in overdrive. I would definitely simplify it if I were you.

I just did a local test and it appears either the sheer document size or something in the ViewState causes the Regex match evaluation to time out. (Edit: I'm pretty sure it's the size, actually. Removing the ViewState just reduces the size significantly.)
An admittedly crude way to solve this would be something like this:
string[] rawHtmlLines = File.ReadAllLines(#"C:\default.aspx");
string filteredHtml = String.Join(Environment.NewLine,
rawHtmlLines.Where(line => !line.Contains("_VIEWSTATE")).ToArray());
string emailPattern = #"\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*";
var emailMatches = Regex.Matches(filteredHtml, emailPattern);
foreach (Match match in emailMatches)
{
//...
}
Overall I suspect the email pattern is just not well optimised (or intended) to filter out emails in a large string but just used as validation for user input. Generally it might be a good idea to limit the string you search in to just the parts you are actually interested in and keep it as small as possible - for example by leaving out the ViewState which is guaranteed to not contain any readable email addresses.
If performance is important, it's probably also a better idea to create the filtered HTML using a StringBuilder and IndexOf (etc.) instead of splitting lines and LINQing up the result :)
Edit:
To further minimize the length of the string the Regex needs to check you could only include lines that contain the # character to begin with, like so:
string filteredHtml = String.Join(Environment.NewLine,
rawHtmlLines.Where(line => line.IndexOf('#') >= 0 && !line.Contains("_VIEWSTATE")).ToArray());

From "Function evaluation timed out", I'm assuming you're doing this in the debugger. The debugger has some fairly quick timeouts with regard to how long a method takes. Not eveything happens quickly. I would suggest going the operation in code, storing the result, then viewing that result in the debugger (i.e. let the call to Matches run and put a breakpoint after it).
Now, with regard to detecting whether the string will make Matches take a long time; that's a bit of a black art. You basically have to perform some sort of input validation. Just because you got some value from the internet, doesn't mean that value will work well with Matches. The ultimate validation logic is up to you; but, starting with the length of rawHtmlLines might be useful. (i.e. if the lenght is 1000000 bytes, Matches might take a while) But, you have to decide what to do if the length is too long; e.g give an error to the user.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.