So yeah, the title is pretty weird but I have no other idea how to describe my problem properly. Whatever... lets get to the problem.
Job to get done
My boss wants a function that read all functions of a python file and return a DataTable containing the found functions. This function should be written in IronPython (Python which actually uses C# libraries).
The Problem
I am relatively new to Python and I have no idea what this language is capable of, so I started to write my function and yeah it works pretty well, except one weird problem. I wrote a regular expression to find the functions and to test it I downloaded a RegEx Tester. The Regex Tester showed the results I wanted: Group 1 - The function name, Group 2 - The functions parameter and Group 3 - the content of the function.
For some magical reasons, it doesn't work when it goes to live testing. And with doesn't work I mean, Group 3 has actually no output. After testing the expression with another (online) RegEx Tester, it showed me, that Group 3 has actually not the content of the function, it only has a small part of it, starting with a newline/return character.
In my test cases, the results of Group 3 where all the same, starting with a newline/return character and ended with the functions return (e.g. return objDic).
Question: What the hell is going wrong there? I have no idea what is wrong on my RegEx.
The Regex
objRegex = Regex(r"(?i)def[\s]+([\w]+)\(([\, [\w]+)\)(?:[\:{1}]\s*)([\n].*(?!\ndef[\s]+))+")
The Data
def test_function(some_parameter):
try:
some_cool_code_goes_here()
return obj
except Exception as ex:
DetailsBox.Show(ex)
def another_cool_function(another_parameter):
try:
what_you_want()
return obj
except Exception as ex:
DetailsBox.Show(ex)
The Result
Match: def test_function(some_parameter):...
Position: ..
Length: ..
Group 1: test_function
Group 2: some_parameter
Group 3: (newline/return character)
return obj
But Group 3 should be:
try:
some_cool_code_goes_here()
return obj
except Exception as ex:
DetailsBox.Show(ex)
I hope you can help me :3 Thank you guys!
Although #Hamza said in his comment that you have several problems in your regex, I think they are more of uneeded complexity, the reason for not matching the body might be that you haven't let the . special meta-character match the new line so it is stopping at the first new line character after the first Try: statement.
To fix this you will need to let the . match new line characters and here is a stripped down version of your regex that works:
(?i)def\s+(\w+)\s*\(([\, \w]+)\)(?:\s*:\s*)(.+?)(?=def|$)
Thanks to HamZa for the quick help (and of course also thanks for all the other helpers), he actually solved the problem. There were just a few adjustments necessary (to make it work for C# :-)) but the main point comes from him, thanks a lot.
Solution for my problem:
Regex(r"(?is)def\s*(?<name>\w+)\s*\((?<parameter>[^)]+)\)\s*:\s*(?:\r?\n)+(?<body>.*?)(?=\r?\ndef|$)")
Related
I am trying to match an assignment string in VB code (as in I'm passing in text that is VB code into my program that's written in C#). The assignment string that I'm trying to match is something for example like
CustomClassInitializer(someParameter, anotherParameter, someOtherClassAsParameterWithInitialization()).SomeProperty = 7
and I realize that's rather complex, but it actually isn't far off from some of the real text I'm trying to match.
In order to do so I wrote a Regex. This Regex:
#"[\w,.]+\(([\w,.]*\(*,* *\)*)+ = "
which correctly matches. The problem is it becomes VERY slow (with timeouts), which I've researched and found is probably because of "backtracking". One of the suggested solutions to help with backtracking in general was to add "?>" to the regex, which I think would go in this position:
[\w,.]+\(?>([\w,.]*\(*,* *\)*)+ =
but this no longer matches properly.
I'm fairly new to Regex, so I imagine that there is a much better pattern. What is it please? Or how can I improve my times in general?
Helpful notes:
I'm only interested in position 0 of the string I'm searching for a
match in. My code is "if (isMatch && match.index == 0) { ... }. Can
I tell it to only check position 0 and if it's not a match move on?
The reason I use all the 0 or more things is the match could be as simple as CustomClass() = new CustomClass(), and as complicated as the above or perhaps a bit worse. I'm trying to get as many cases as possible.
This Regex is interested in "[\w,.]+(" and then "whatever may be inside the parentheses" (I tried to think of what all could be inside them based on the fact that it's valid VB code) until you get to the close parenthesis and then " = ". Perhaps I can use a wildcard for literally anything until it get's to ") = " in the string? - Like I said, fairly new to Regex.
Thanks in advance!
This seems to do what you want. Normally, I like to be more specific than .*, but it is working correctly. Note that I am using the Multi-line option.
^.*=\s*.+$
Here is a working example in RegExStorm.net example
This question already has answers here:
Match exact string
(3 answers)
Closed 3 years ago.
I'm parsing a text file line by line and for each line I have a special regex. However in one case a pattern is matching two lines. One that is a correct match and another line only partialy because a couple of values are optional.
Invalid match:
BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD
matches patial string (shouln't match this at all):
BNE1010/1000
Correct match (matches the entire string):
RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP
The regex for this is quite long and contains several optionl groups:
^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?
I think there is no need to study the entire regex becasue it's build dynamically from smaller patterns at runtime and all the parts work correctly. Also lots of combinations are tested with unit tests and they all work... as long as I try to parse ony the line that should be matched by the pattern.
Currently I'm checking if the entire string is matched by
match.Group[0].Value == line
but I find it's quite ugly. I know from JavaScript the regex engine provides an Index property where the regex engine stopped. So my idea was to compare the index with the length of the string. Unfortunatelly I wasn't able to find such a property in C#.
Another idea would be to modify the regex so that it matches only one line and no partial lines.
Example: https://regex101.com/r/dM5wU4/1
The example contains only two cases because there aren't actually any combinations that would change its behavior. I could remove some parameters but it wouldn't change anything.
EDIT:
I've edited my question. Sorry to every for not providing all the information at the first time. I won't ask any more questions when writing on the phone :) It wasn't a good idea. Hopefully it won't get closed now.
You asked whether I could simplify the regex. I would do it if I could and knew how. If it was easy I wouldn't have asked. The problem started as the regex ans string became bigger during development. Now they are at the production length and I can't actually make them shorter even for the sake of the quesion, sorry.
EDIT-2:
I found the reason why I couldn't find the inherited Index and Length properties of the Match class.
For some strange reason when selecting the Match class and pressing F1 Visual Studio opened the wrong help page (Match Properties) even though I'm not working with the Micro Framework. I didn't notice that but I was indeed wondering why there is very little information. Thx to #Jamiec for the correct link. I won't trust Visual Studio anymore when hitting F1.
Disclaimer: Im going to add this, but I doubt its the solution. If it's not this part will get deleted in short order
You can add a $ at the end of your regular expression. This stops your first example matching but continues to match the second example.
As you've not provided any more than 2 examples, its unclear if this actually solves all your cases or just that one specific false positive.
My question is whether it is possible to check if a regular expression matched the entire sting without checking the first group against the original line?
If you're not adverse to checking the entire match to the length of the string you can do that too:
var regex = new Regex(#"^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?");
var input1 = #"BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD";
var input2 = #"RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP";
var match1 = regex.Match(input1);
var match2 = regex.Match(input2);
Console.WriteLine(match1.Length == input1.Length); // False
Console.WriteLine(match2.Length == input2.Length); // True
Live example: http://rextester.com/NIBE6349
i have text box for phone number .i need to validate it.my requiremants are
Take only numeric more than 10digits
Take symbols like (,),-,
can any one help for this.i tried
^[\d{10,14} +\s +\( +\)-]+$
but not working.
You may take a look at the following article which will help you build such expression.
You haven't said what is wrong with your regex (why it's not working as expected) but I'm guessing that the issue is it matches far more than it should. I.e it will match 1 or more of all the characters in your set (rather than just between 10 and 14).
I think you're mistake is that you have put way too much in your character set. You've got the + symbol in there 3 times and it looks like your trying to use quantifiers from within the set as well, which is not allowed. Character sets are the equivalent of single character alternations. So, [abc] is the equivalent of a|b|c.
I'm assuming that you want the input to be between 10 and 14 numbers while still allowing any number (zero or more) of the following characters:
+()-,
As some others have suggested, you could just put the chars you want in a set and then specify the quantifier after it like this: ^[0-9()-,+]{10,14}$. This will almost get you there. Only problem with it is that it will allow between 10 and 14 of any of these characters, so it would successfully match this:
,,,,,++()---
Which clearly you don't want (do you?)
So, in order to better solve this problem, you'll need to be more specific about what is allowed and where in the subject it is allowed. Because i don't know exactly what you want to match, i can't take you much further.
Hopefully the information I've provided here should be good enough to get you started, and if you have more questions... well that's what we're all here for right, so ask away.
To help you out with learning, below are a few resources you might find useful (this is a small subset of what's available, so do go ahead and search for yourself):
Testing tools
Rubular (ruby)
GSkinner Regex Testser
RegexHero (dotnet)
Helpful info
Regular-Expressions.Info
Codeproject 30 Minute Tutorial
I'm kind of new too C#, and regular expression for that matter, but I've searched a couple of hours to find a solution too this problem so, hopefully this is easy for you guys:)
My application uses a regex to match email addresses in a given string,
then loops throu the matches.:
String EmailPattern = "\\w+([-+.]\\w+)*#\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*";
MatchCollection mcemail = Regex.Matches(rawHTML, EmailPattern);
foreach (Match memail in mcemail)
Works fine, but, when I downloaded the string from a certain page, http://www.sp.se/sv/index/services/quality/sidor/default.aspx, the MatchCollection(mcemail) object "hangs" the loop. When using a break point and accessing the object, I get "Function evuluation timed out" on everything(.Count etc).
Update
I've tried my pattern and other email patterns on the same string, everyone(regex desingers, python based web pages etc.) fails/timesout when trying too match this particular string.
How can I detect that the matchcollection obj is not "ready" to use?
If you can post the email that's causing the problem (perhaps anonymized in some way), that will give us more information, but I'm thinking the problem is this little guy right here:
([-.]\\w+)*\\.\\w+([-.]\\w+)*
To understand the problem, let's break that into groups:
([-.]\\w+)*
\\.\\w+
([-.]\\w+)*
The strings that will match \\.\\w+ are a subset of those that will match [-.]\\w+. So if part of your input looks like foo.bar.baz.blah.yadda.com, your regex engine has no way of knowing which group is supposed to match it. Does that make sense? So the first ([-.]\\w+)* could match .bar.baz.blah, then the \\.\\w+ could match .yadda, then the last ([-.]\\w+)* could match .com...
...OR the first clause could match .bar.baz, the second could match .blah, and the last could match .yadda.com. Since it doesn't know which one is right, it will keep trying different combinations. It should stop eventually, but that could still take a long time. This is called "catastrophic backtracking".
This issue is compounded by the fact that you're using capturing groups rather than non-capturing groups; i.e. ([-+.]\\w+) instead of (?:[-+.]\\w+). That causes the engine to try and separate and save whatever matches inside the parentheses for later reference. But as I explained above, it's ambiguous which group each substring belongs in.
You might consider replacing everything after the # with something like this:
\\w[-\\w]*\\.[-.\\w]+
That could use some refinement to make it more specific, but you get the general idea. Hope I explained all this well enough; grouping and backreferences are kind of tough to describe.
EDIT:
Looking back at your pattern, there's a deeper issue here, still related to the backtracking/ambiguity problem I mentioned. The clause \\w+([-.]\\w+)* is ambiguous all by itself. Splitting it into parts, we have:
\\w+
([-.]\\w+)*
Suppose you have a string like foobar. Where does the \\w+ end and the ([-.]\\w+)* begin? How many repetitions of ([-.]\\w+) are there? Any of the following could work as matches:
f(oobar)
foo(bar)
f(o)(oba)(r)
f(o)(o)(b)(a)(r)
foobar
etc...
The regex engine doesn't know which is important, so it will try them all. This is the same problem I pointed out above, but it means you have it in multiple places in your pattern.
Even worse, ([-.]\\w+)* is also ambiguous, because of the + after the \\w. How many groups are there in blah? I count 16 possible combinations: (blah), (b)(lah), (bl)(ah)...
The amount of different possible combinations is going to be huge, even for a relatively small input, so your engine is going to be in overdrive. I would definitely simplify it if I were you.
I just did a local test and it appears either the sheer document size or something in the ViewState causes the Regex match evaluation to time out. (Edit: I'm pretty sure it's the size, actually. Removing the ViewState just reduces the size significantly.)
An admittedly crude way to solve this would be something like this:
string[] rawHtmlLines = File.ReadAllLines(#"C:\default.aspx");
string filteredHtml = String.Join(Environment.NewLine,
rawHtmlLines.Where(line => !line.Contains("_VIEWSTATE")).ToArray());
string emailPattern = #"\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*";
var emailMatches = Regex.Matches(filteredHtml, emailPattern);
foreach (Match match in emailMatches)
{
//...
}
Overall I suspect the email pattern is just not well optimised (or intended) to filter out emails in a large string but just used as validation for user input. Generally it might be a good idea to limit the string you search in to just the parts you are actually interested in and keep it as small as possible - for example by leaving out the ViewState which is guaranteed to not contain any readable email addresses.
If performance is important, it's probably also a better idea to create the filtered HTML using a StringBuilder and IndexOf (etc.) instead of splitting lines and LINQing up the result :)
Edit:
To further minimize the length of the string the Regex needs to check you could only include lines that contain the # character to begin with, like so:
string filteredHtml = String.Join(Environment.NewLine,
rawHtmlLines.Where(line => line.IndexOf('#') >= 0 && !line.Contains("_VIEWSTATE")).ToArray());
From "Function evaluation timed out", I'm assuming you're doing this in the debugger. The debugger has some fairly quick timeouts with regard to how long a method takes. Not eveything happens quickly. I would suggest going the operation in code, storing the result, then viewing that result in the debugger (i.e. let the call to Matches run and put a breakpoint after it).
Now, with regard to detecting whether the string will make Matches take a long time; that's a bit of a black art. You basically have to perform some sort of input validation. Just because you got some value from the internet, doesn't mean that value will work well with Matches. The ultimate validation logic is up to you; but, starting with the length of rawHtmlLines might be useful. (i.e. if the lenght is 1000000 bytes, Matches might take a while) But, you have to decide what to do if the length is too long; e.g give an error to the user.
Before I look into doing this myself, is there already a function out there that will do this:
I want to pass in a string containing text and RegEx markup and then it returns all possible matches it would look for in a string.
so the following passed to the method
abc|def|xyz
Would return 3 strings in an array or collection:
abc
def
xyz
Because that regex notation says to look for either abc, def or xyz.
I don't want this to search for the term in another string or anything like that, just return the possible matches it could make.
That's a simple example, anything that will do this for me, or shall I start writing the method myself?
With a simple regex as per your example it will work, but as soon as you start dealing with wild cards and repetition, it will have to generate almost an infinite amount of possible solutions, which in some cases may never even terminate.
No, it does not :)