I'm kind of new to C#, and to regular expressions for that matter, but I've searched for a couple of hours for a solution to this problem, so hopefully this is easy for you guys :)
My application uses a regex to match email addresses in a given string,
then loops through the matches:
String EmailPattern = "\\w+([-+.]\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*";
MatchCollection mcemail = Regex.Matches(rawHTML, EmailPattern);
foreach (Match memail in mcemail)
Works fine, but when I downloaded the string from a certain page, http://www.sp.se/sv/index/services/quality/sidor/default.aspx, the MatchCollection (mcemail) object "hangs" the loop. When I set a breakpoint and inspect the object, I get "Function evaluation timed out" on everything (.Count etc.).
Update
I've tried my pattern and other email patterns on the same string; every tool (regex designers, Python-based web pages, etc.) fails or times out when trying to match this particular string.
How can I detect that the MatchCollection object is not "ready" to use?
If you can post the email that's causing the problem (perhaps anonymized in some way), that will give us more information, but I'm thinking the problem is this little guy right here:
([-.]\\w+)*\\.\\w+([-.]\\w+)*
To understand the problem, let's break that into groups:
([-.]\\w+)*
\\.\\w+
([-.]\\w+)*
The strings that will match \\.\\w+ are a subset of those that will match [-.]\\w+. So if part of your input looks like foo.bar.baz.blah.yadda.com, your regex engine has no way of knowing which group is supposed to match it. Does that make sense? So the first ([-.]\\w+)* could match .bar.baz.blah, then the \\.\\w+ could match .yadda, then the last ([-.]\\w+)* could match .com...
...OR the first clause could match .bar.baz, the second could match .blah, and the last could match .yadda.com. Since it doesn't know which one is right, it will keep trying different combinations. It should stop eventually, but that could still take a long time. This is called "catastrophic backtracking".
This issue is compounded by the fact that you're using capturing groups rather than non-capturing groups; i.e. ([-+.]\\w+) instead of (?:[-+.]\\w+). That causes the engine to try and separate and save whatever matches inside the parentheses for later reference. But as I explained above, it's ambiguous which group each substring belongs in.
You might consider replacing everything after the @ with something like this:
\\w[-\\w]*\\.[-.\\w]+
That could use some refinement to make it more specific, but you get the general idea. Hope I explained all this well enough; grouping and backreferences are kind of tough to describe.
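For what it's worth, here is a rough sketch of how that replacement might be wired up in C#. The class and method names (EmailFinder, FindEmails) are made up for illustration, and the local part simply keeps the shape from the question with a non-capturing group; treat it as a starting point, not a finished email matcher.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original post.
public static class EmailFinder
{
    // Local part: the question's shape, but with a non-capturing group.
    // Domain part: the simplified replacement suggested above.
    private static readonly Regex EmailRegex = new Regex(
        @"\w+(?:[-+.]\w+)*@\w[-\w]*\.[-.\w]+");

    public static List<string> FindEmails(string text)
    {
        var found = new List<string>();
        foreach (Match m in EmailRegex.Matches(text))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

Note that the trailing [-.\w]+ will also swallow a full stop that directly follows an address, which is one of the refinements still needed.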
EDIT:
Looking back at your pattern, there's a deeper issue here, still related to the backtracking/ambiguity problem I mentioned. The clause \\w+([-.]\\w+)* is ambiguous all by itself. Splitting it into parts, we have:
\\w+
([-.]\\w+)*
Suppose you have a string like foobar. Where does the \\w+ end and the ([-.]\\w+)* begin? How many repetitions of ([-.]\\w+) are there? Any of the following could work as matches:
f(oobar)
foo(bar)
f(o)(oba)(r)
f(o)(o)(b)(a)(r)
foobar
etc...
The regex engine doesn't know which is important, so it will try them all. This is the same problem I pointed out above, but it means you have it in multiple places in your pattern.
Even worse, ([-.]\\w+)* is also ambiguous, because of the + after the \\w. How many groups are there in blah? I count 16 possible combinations: (blah), (b)(lah), (bl)(ah)...
The amount of different possible combinations is going to be huge, even for a relatively small input, so your engine is going to be in overdrive. I would definitely simplify it if I were you.
I just did a local test and it appears either the sheer document size or something in the ViewState causes the Regex match evaluation to time out. (Edit: I'm pretty sure it's the size, actually. Removing the ViewState just reduces the size significantly.)
An admittedly crude way to solve this would be something like this:
string[] rawHtmlLines = File.ReadAllLines(@"C:\default.aspx");
string filteredHtml = String.Join(Environment.NewLine,
rawHtmlLines.Where(line => !line.Contains("_VIEWSTATE")).ToArray());
string emailPattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
var emailMatches = Regex.Matches(filteredHtml, emailPattern);
foreach (Match match in emailMatches)
{
//...
}
Overall I suspect the email pattern is just not well optimised (or intended) to filter out emails in a large string but just used as validation for user input. Generally it might be a good idea to limit the string you search in to just the parts you are actually interested in and keep it as small as possible - for example by leaving out the ViewState which is guaranteed to not contain any readable email addresses.
If performance is important, it's probably also a better idea to create the filtered HTML using a StringBuilder and IndexOf (etc.) instead of splitting lines and LINQing up the result :)
Edit:
To further minimize the length of the string the Regex needs to check, you could include only the lines that contain the @ character to begin with, like so:
string filteredHtml = String.Join(Environment.NewLine,
rawHtmlLines.Where(line => line.IndexOf('@') >= 0 && !line.Contains("_VIEWSTATE")).ToArray());
From "Function evaluation timed out", I'm assuming you're doing this in the debugger. The debugger has some fairly short timeouts with regard to how long a method may take, and not everything happens quickly. I would suggest doing the operation in code, storing the result, then viewing that result in the debugger (i.e. let the call to Matches run and put a breakpoint after it).
Now, with regard to detecting whether the string will make Matches take a long time: that's a bit of a black art. You basically have to perform some sort of input validation. Just because you got some value from the internet doesn't mean that value will work well with Matches. The ultimate validation logic is up to you, but starting with the length of rawHTML might be useful (i.e. if the length is 1000000 bytes, Matches might take a while). But you have to decide what to do if the length is too long, e.g. give an error to the user.
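One more option, assuming you are on .NET Framework 4.5 or later: the Regex constructor accepts a match timeout, and the engine throws RegexMatchTimeoutException when a single match attempt exceeds it. A minimal sketch follows; the SafeMatcher and TryGetMatches names are hypothetical, and picking a sensible timeout value is left to you.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original post.
public static class SafeMatcher
{
    // Returns false if the regex engine gave up because of the timeout.
    public static bool TryGetMatches(string input, string pattern,
                                     TimeSpan timeout, out List<string> results)
    {
        results = new List<string>();
        var regex = new Regex(pattern, RegexOptions.None, timeout);
        try
        {
            // MatchCollection is evaluated lazily, so the timeout can fire
            // while enumerating, not only on the Matches call itself.
            foreach (Match m in regex.Matches(input))
            {
                results.Add(m.Value);
            }
            return true;
        }
        catch (RegexMatchTimeoutException)
        {
            // Discard partial results so callers never see half a list.
            results.Clear();
            return false;
        }
    }
}
```

This does not make a slow pattern fast; it only turns a hang into a failure you can handle, for example by skipping the page or reporting an error.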
Related
I am trying to match an assignment string in VB code (as in I'm passing in text that is VB code into my program that's written in C#). The assignment string that I'm trying to match is something for example like
CustomClassInitializer(someParameter, anotherParameter, someOtherClassAsParameterWithInitialization()).SomeProperty = 7
and I realize that's rather complex, but it actually isn't far off from some of the real text I'm trying to match.
In order to do so I wrote a Regex. This Regex:
@"[\w,.]+\(([\w,.]*\(*,* *\)*)+ = "
which correctly matches. The problem is it becomes VERY slow (with timeouts), which I've researched and found is probably because of "backtracking". One of the suggested solutions to help with backtracking in general was to add "?>" to the regex, which I think would go in this position:
[\w,.]+\(?>([\w,.]*\(*,* *\)*)+ =
but this no longer matches properly.
I'm fairly new to Regex, so I imagine that there is a much better pattern. What is it please? Or how can I improve my times in general?
Helpful notes:
I'm only interested in position 0 of the string I'm searching for a match in. My code is "if (isMatch && match.Index == 0) { ... }". Can I tell it to only check position 0 and, if it's not a match, move on?
The reason I use all the 0 or more things is the match could be as simple as CustomClass() = new CustomClass(), and as complicated as the above or perhaps a bit worse. I'm trying to get as many cases as possible.
This Regex is interested in "[\w,.]+(" and then "whatever may be inside the parentheses" (I tried to think of what all could be inside them based on the fact that it's valid VB code) until you get to the close parenthesis and then " = ". Perhaps I can use a wildcard for literally anything until it gets to ") = " in the string? Like I said, I'm fairly new to Regex.
Thanks in advance!
This seems to do what you want. Normally, I like to be more specific than .*, but it is working correctly. Note that I am using the Multi-line option.
^.*=\s*.+$
Here is a working example on RegExStorm.net.
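On the "only check position 0" part of the question: a leading ^ (without the Multiline option) anchors the pattern to the start of the string, so no separate Index check is needed. Below is a rough sketch under that assumption. The names AssignmentMatcher and StartsWithAssignment are invented here, and the [\w.]* after the closing parenthesis is my guess at how a trailing ".SomeProperty" should be allowed.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original post.
public static class AssignmentMatcher
{
    // ^ pins the match to position 0; ".*" stands in for
    // "whatever may be inside the parentheses".
    private static readonly Regex AssignmentAtStart = new Regex(
        @"^[\w,.]+\(.*\)[\w.]* = ");

    public static bool StartsWithAssignment(string line)
    {
        return AssignmentAtStart.IsMatch(line);
    }
}
```

Because there are no nested quantifiers, the backtracking here stays modest, though .* is loose enough that it will accept things that are not valid VB.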
I'm parsing a text file line by line, and for each line I have a special regex. However, in one case a pattern matches two lines: one is a correct match, and the other matches only partially because a couple of values are optional.
Invalid match:
BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD
matches a partial string (it shouldn't match this at all):
BNE1010/1000
Correct match (matches the entire string):
RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP
The regex for this is quite long and contains several optional groups:
^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?
I think there is no need to study the entire regex because it's built dynamically from smaller patterns at runtime and all the parts work correctly. Also, lots of combinations are covered by unit tests and they all work... as long as I try to parse only the line that should be matched by the pattern.
Currently I'm checking if the entire string is matched by
match.Groups[0].Value == line
but I find it quite ugly. I know from JavaScript that the regex engine provides an Index property indicating where the regex engine stopped. So my idea was to compare the index with the length of the string. Unfortunately I wasn't able to find such a property in C#.
Another idea would be to modify the regex so that it matches only one line and no partial lines.
Example: https://regex101.com/r/dM5wU4/1
The example contains only two cases because there aren't actually any combinations that would change its behavior. I could remove some parameters but it wouldn't change anything.
EDIT:
I've edited my question. Sorry to everyone for not providing all the information the first time. I won't ask any more questions when writing on the phone :) It wasn't a good idea. Hopefully it won't get closed now.
You asked whether I could simplify the regex. I would do it if I could and knew how. If it was easy I wouldn't have asked. The problem started as the regex and string became bigger during development. Now they are at the production length and I can't actually make them shorter even for the sake of the question, sorry.
EDIT-2:
I found the reason why I couldn't find the inherited Index and Length properties of the Match class.
For some strange reason, when selecting the Match class and pressing F1, Visual Studio opened the wrong help page (Match Properties) even though I'm not working with the Micro Framework. I didn't notice that, but I was indeed wondering why there was so little information. Thanks to @Jamiec for the correct link. I won't trust Visual Studio anymore when hitting F1.
Disclaimer: I'm going to add this, but I doubt it's the solution. If it's not, this part will get deleted in short order.
You can add a $ at the end of your regular expression. This stops your first example matching but continues to match the second example.
As you've not provided any more than 2 examples, it's unclear if this actually solves all your cases or just that one specific false positive.
My question is whether it is possible to check if a regular expression matched the entire string without checking the first group against the original line?
If you're not averse to comparing the length of the match to the length of the string, you can do that too:
var regex = new Regex(@"^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?");
var input1 = @"BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD";
var input2 = @"RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP";
var match1 = regex.Match(input1);
var match2 = regex.Match(input2);
Console.WriteLine(match1.Length == input1.Length); // False
Console.WriteLine(match2.Length == input2.Length); // True
Live example: http://rextester.com/NIBE6349
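If you do this check in several places, it can be wrapped in a small helper. This is only a sketch, and the names FullMatch and MatchesEntireString are made up; it uses the Index and Length properties that Match inherits from Capture.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original post.
public static class FullMatch
{
    // True only when the first match starts at 0 and spans the whole input.
    public static bool MatchesEntireString(Regex regex, string input)
    {
        Match m = regex.Match(input);
        return m.Success && m.Index == 0 && m.Length == input.Length;
    }
}
```

One caveat: this only inspects the first match the engine settles on. Anchoring the end of the pattern (with $ as mentioned above, or \z) lets the engine itself backtrack in search of a full-length match, which a length comparison afterwards cannot do.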
Is there a way to create a regex that will ensure that five out of eight characters are present, in order, within a given character range (20 chars, for example)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match, for example, "mshpeln" as misspelling. I do not want to do OCR. The OCR job has been done, but it has been done poorly (i.e. it originally said misspelling, but the OCR'd copy reads "mshpeln"). I do not know what the text that I will have to match against will be (i.e. I do not know that it is "mshpeln"; it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.
I think what you need is not a regex but a database of all valid words, plus creative use of functions like soundex() and/or levenshtein().
You can do this: create a table of all valid words (a dictionary) with columns like word and snd (computed as soundex(word)), and create indexes on both the word and snd columns.
For example, for the word mispeling you would fill snd with M214. SQLite provides soundex(), although only in builds compiled with the SQLITE_SOUNDEX option.
Now, when you get a new bad word, compute soundex() for it and look it up in your indexed table. For example, for the word mshpeln, soundex('mshpeln') = M214. There you go: this way you can get back the correct word.
But this would not look anything like regex - sorry.
To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is too large for 1 or 2 people to do easily, you might want to look into something like Amazon's Mechanical Turk, where you can outsource the work for pennies per solution.
This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense, collection of guaranteed to be unique elements) with all of the unique letters that are in misspelling - {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then, perform the operation of set intersection on this set and the set of the word you are matching against - this will get you letters that are contained by both sets. If this set has 5 or more characters left in it, you have a possible match here.
If the OCR can add in erroneous spaces, then consider two words at a time instead of single words, and so on, depending on your requirements.
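The three steps above can be sketched roughly like this in C#. The LetterOverlap class, the FindCandidates method and its parameters are names I made up for illustration. Note that, as described, it only counts shared distinct letters and ignores their order, so expect false positives.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not from the original post.
public static class LetterOverlap
{
    public static List<string> FindCandidates(string target, string text, int minShared)
    {
        // Step 1: the set of unique letters in the word we are looking for.
        var targetLetters = new HashSet<char>(target.ToLowerInvariant());
        var candidates = new List<string>();

        // Step 2: split the body of text into words on whitespace.
        var words = text.Split(new[] { ' ', '\t', '\r', '\n' },
                               StringSplitOptions.RemoveEmptyEntries);

        // Step 3: intersect each word's letter set with the target's.
        foreach (string word in words)
        {
            var shared = new HashSet<char>(word.ToLowerInvariant());
            shared.IntersectWith(targetLetters);
            if (shared.Count >= minShared)
            {
                candidates.Add(word);
            }
        }
        return candidates;
    }
}
```

With target "misspelling" and a threshold of 5, both "mshpeln" and "mispel" share six distinct letters with the target and would be reported as possible matches.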
I have no solution for this problem, in fact, here's exactly the opposite.
Correcting OCR errors is not programmatically possible, for two reasons:
You cannot quantify the error made by the OCR algorithm, as it can range between 0 and 100%.
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Let nello world be the first guess of "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that this word would have been similar if it was better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures
In my answer to this question, I mentioned that we used UpperCamelCase parsing to get a description of an enum constant not decorated with a Description attribute, but it was naive, and it didn't work in all cases. I revisited it, and this is what I came up with:
var result = Regex.Replace(camelCasedString,
@"(?<a>(?<!^)[A-Z][a-z])", @" ${a}");
result = Regex.Replace(result,
@"(?<a>[a-z])(?<b>[A-Z0-9])", @"${a} ${b}");
The first Replace looks for an uppercase letter, followed by a lowercase letter, EXCEPT where the uppercase letter is the start of the string (to avoid having to go back and trim), and adds a preceding space. It handles your basic UpperCamelCase identifiers, and leading all-upper acronyms like FDICInsured.
The second Replace looks for a lowercase letter followed by an uppercase letter or a number, and inserts a space between the two. This is to handle special but common cases of middle or trailing acronyms, or numbers in an identifier (except leading numbers, which are usually prohibited in C-style languages anyway).
Running some basic unit tests, the combination of these two correctly separated all of the following identifiers: NoDescription, HasLotsOfWords, AAANoDescription, ThisHasTheAcronymABCInTheMiddle, MyTrailingAcronymID, TheNumber3, IDo3Things, IAmAValueWithSingleLetterWords, and Basic (which didn't have any spaces added).
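For anyone who wants to try this quickly, here is how the two calls might be wrapped into a single helper. The FriendlyNames class and ToFriendlyString method are just illustrative names, not part of any library.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper around the two Replace calls shown above.
public static class FriendlyNames
{
    public static string ToFriendlyString(string camelCasedString)
    {
        // Pass 1: space before an uppercase+lowercase pair, except at the start.
        var result = Regex.Replace(camelCasedString,
            @"(?<a>(?<!^)[A-Z][a-z])", @" ${a}");
        // Pass 2: space between a lowercase letter and an uppercase letter or digit.
        return Regex.Replace(result,
            @"(?<a>[a-z])(?<b>[A-Z0-9])", @"${a} ${b}");
    }
}
```

For example, ToFriendlyString("TheNumber3") gives "The Number 3" and ToFriendlyString("FDICInsured") gives "FDIC Insured".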
So, I'm posting this first to share it with others who may find it useful, and second to ask two questions:
Anyone see a case that would follow common CamelCase-ish conventions, that WOULDN'T be correctly separated into a friendly string this way? I know it won't separate adjacent acronyms (FDICFCUAInsured), recapitalize "properly" camelCased acronyms like FdicInsured, or capitalize the first letter of a lowerCamelCased identifier (but that one's easy to add - result = Regex.Replace(result, "^[a-z]", m=>m.ToString().ToUpper());). Anything else?
Can anyone see a way to make this one statement, or more elegant? I was looking to combine the Replace calls, but as they do two different things to their matches it can't be done with these two strings. They could be combined into a method chain with a RegexReplace extension method on String, but can anyone think of better?
So while I agree with Hans Passant here, I have to say that I had to try my hand at making it one regex as an armchair regex user.
(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))
Is what I came up with. It seems to pass all the tests you put forward in the question.
So
var result = Regex.Replace(camelCasedString, @"(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))", @" ${a}");
Does it in one pass.
Not that this directly answers the question, but why not test by taking the standard C# API and converting each class name into a friendly name? It'd take some manual verification, but it'd give you a good list of standard names to test.
Let's say every case you come across works with this (you're asking us for examples that won't and then giving us some, so you don't even have a question left).
This still binds UI to programmatic identifiers in a way that will make both programming and UI changes brittle.
It still assumes your program will only be used in one language. Either your potential market is so small that just indexing an array of names would be scalable enough (e.g. a one-client bespoke or in-house project), or you are assuming you will never be successful enough to need to be available in other languages or other dialects of your first-chosen language.
Does "well, it'll work as long as we're a failure" sound like a passing grade in balancing designs?
Either code it to use resources, or else code it to pass the enum name blindly or use an array of names, as that at least will be modifiable afterwards.
I am calling a web service and all I get back is a giant blob of text. I am left to process it myself. Problem is not all lines are necessarily the same. They each have 2 or 3 sections to them and they are similar. Here are the most common examples
text1 [text2] /text3/
text1/test3
text1[text2]/text3
text1 [text2] /text /3 here/
I am not exactly sure how to approach this problem. I am not too good at doing anything advanced as far as manipulating strings.
I was thinking a regular expression might work, but I'm not too sure about that either. If I can get each of these 3 sections broken up, it is easier from there to do the rest. It's just that there doesn't seem to be any uniformity to the main 3 sections that I know how to work with.
EDIT: Thanks for mentioning that I didn't actually say what I wanted to do.
Basically, I want to split these 3 sections of text into their own separate strings, i.e. go from one single string to an array of 3 strings.
string[0] = text1
string[1] = text2
string[2] = text3
Here is some of the text I get back from a call as an example
スルホ基 [スルホき] /(n) sulfo group/
鋭いナイフ [するどいナイフ] /(n) sharp knife/
鋭い批判 [するどいひはん] /(n) sharp criticism/
スルナーイ /(n) (See ズルナ) (obsc) surnay (Anatolian woodwind instrument) (per:)/zurna/
スルピリン /(n) sulpyrine/
スルファミン /(n) sulfamine/
剃る [そる(P);する] /(v5r,vt) to shave/(P)/
As the first line for an example I want to pull it out into an array
string[0] = スルホ基
string[1] = [スルホき]
string[2] = /(n) sulfo group/
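Judging only from the sample lines above, one regex with an optional bracketed middle section might be enough. The sketch below is a guess at the format, not something taken from a spec for the service. The names EntryParser, Split and the group names term, reading and gloss are all invented, and it returns the parts without their surrounding brackets and slashes.

```csharp
using System.Text.RegularExpressions;

// Hypothetical parser based only on the sample lines in the question.
public static class EntryParser
{
    // term:    everything before the first '[' or '/', trailing spaces trimmed
    // reading: optional text inside [...]
    // gloss:   text after the first '/', minus one trailing '/'
    private static readonly Regex EntryRegex = new Regex(
        @"^(?<term>[^\[/]+?)\s*(?:\[(?<reading>[^\]]*)\])?\s*/(?<gloss>.*?)/?$");

    // Returns { term, reading, gloss }, or null if the line does not fit.
    // The reading is an empty string when the line has no [...] section.
    public static string[] Split(string line)
    {
        Match m = EntryRegex.Match(line);
        if (!m.Success) return null;
        return new[]
        {
            m.Groups["term"].Value,
            m.Groups["reading"].Value,
            m.Groups["gloss"].Value
        };
    }
}
```

Slashes inside the last section (as in "/text /3 here/") stay part of the gloss, since only the first slash is treated as the separator.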
Those examples seem a bit random; there has to be some kind of order. Isn't there a spec for the service? If not, I suggest posting more examples so that we can understand the rules.
Read up on some of the info here on finite state machines, and see if you can use some of the concepts on your input parsing problem.
If there is some order to the groups on each line, then maybe you can use a regex to separate the groups out.
Edit: after seeing your samples, you may get by with a regex, breaking on some of those specific delimiters. It will take maybe half an hour to test theory: pick yourself up a free regex tester, make yourself a regex that will isolate out just one of those groups, and pump a few sample lines through. If it performs reliably on the real data that you have, then expand it and see if you can also isolate out the other groups.
I should mention, though, that your regexes will break or just become a nightmare if there are any vagaries in your data (and frequently there are). So test long and hard before settling on them. If you find you start to have exceptions in your data, then you will need to choose some sort of parsing algorithm (the FSM I mentioned above is a pattern you can follow if you implement a parsing mechanism).
The most stupid answer is "Use regex", but more information is needed for a better one.