RegEx expressions with replacement and extracting values - c#

I want to take an expression like
(123456789..value > 2000) && (987654321.Value < 12)
extract the 123456789 and 987654321 (could be anything here)
and replace it with
ClassName.GetInfo("%s").GetValue() (as an example)
putting the 2 values in the place of the %s...
to be a resulting
(ClassName.GetInfo("123456789").GetValue() > 2000) && (ClassName.GetInfo("987654321").GetValue() < 12)
Can anyone give me a clue as to how to accomplish this?

A rather oversimplified example, but this should work.
Note that the following only allows alphanumeric characters, '-' or '_' in the place where you say "could be anything here". This is by necessity if you intend to recognize it with any form of parser, regex or otherwise. You need to either limit the characters that can be used as an identifier, or delimit the identifiers and allow for escaping the delimiter characters.
private static void Main()
{
    Regex pattern = new Regex(@"(?<Name>[\w\-_]+)\.+(?<Value>[\w\-_]+)");
    string sample = @"(123456789..value > 2000) && (987654321.Value < 12)";
    string result = pattern.Replace(sample,
        m =>
            String.Format(
                "ClassName.GetInfo(\"{0}\").Get{1}{2}()",
                m.Groups["Name"].Value,
                Char.ToUpper(m.Groups["Value"].Value[0]),
                m.Groups["Value"].Value.Substring(1))
        );
    Console.WriteLine(result);
}
The program outputs:
(ClassName.GetInfo("123456789").GetValue() > 2000) && (ClassName.GetInfo("987654321").GetValue() < 12)
There are two other rather odd behaviors in your example that are addressed above. The first is the use of multiple delimiters '..' in your example "(123456789..value". This seems like a possible mistake; if it is, just remove the '+' from this part of the expression: ")\.+(".
The second oddity is that your example just auto-magically corrects the character case of the first property from "value" to "Value". Although I mimic this magical behavior by ensuring the first character is upper-case, this is not a great solution. A better answer would be to use a case-insensitive dictionary and look up the proper case.
Hopefully that will get you started, but I have to be honest and say you have a VERY long road ahead of you. Parsing an expression language is never a trivial thing and should generally be avoided. If this is for internal use, just make them type in the full version. If this is for external use... well, I would re-think your objective. Perhaps building a graphical expression tree like SQL's QBE would be a better expenditure of your time and energy.
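As a rough sketch of the case-insensitive dictionary idea mentioned above: the property list here is hypothetical (only "Value" appears in the question), and unknown properties are simply left untouched, which is an assumption on my part rather than a requirement from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ExpressionRewriter
{
    // Hypothetical set of known properties; the real list depends on ClassName.
    private static readonly Dictionary<string, string> KnownProperties =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "value", "Value" },
            { "name", "Name" },
        };

    // Same shape as the pattern above: identifier, one or more dots, property.
    private static readonly Regex Pattern =
        new Regex(@"(?<Name>[\w\-]+)\.+(?<Value>[\w\-]+)");

    public static string Rewrite(string expression)
    {
        return Pattern.Replace(expression, m =>
        {
            string canonical;
            // Look up the properly cased name; leave unknown properties as they are.
            if (!KnownProperties.TryGetValue(m.Groups["Value"].Value, out canonical))
                return m.Value;
            return string.Format("ClassName.GetInfo(\"{0}\").Get{1}()",
                m.Groups["Name"].Value, canonical);
        });
    }
}
```

Calling ExpressionRewriter.Rewrite on the sample string from the question should give the same output as the program above, without relying on upper-casing the first character.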

Related

Regex hangs trying to find match

I am trying to match an assignment string in VB code (as in I'm passing in text that is VB code into my program that's written in C#). The assignment string that I'm trying to match is something for example like
CustomClassInitializer(someParameter, anotherParameter, someOtherClassAsParameterWithInitialization()).SomeProperty = 7
and I realize that's rather complex, but it actually isn't far off from some of the real text I'm trying to match.
In order to do so I wrote a Regex. This Regex:
@"[\w,.]+\(([\w,.]*\(*,* *\)*)+ = "
which correctly matches. The problem is it becomes VERY slow (with timeouts), which I've researched and found is probably because of "backtracking". One of the suggested solutions to help with backtracking in general was to add "?>" to the regex, which I think would go in this position:
[\w,.]+\(?>([\w,.]*\(*,* *\)*)+ =
but this no longer matches properly.
I'm fairly new to Regex, so I imagine that there is a much better pattern. What is it please? Or how can I improve my times in general?
Helpful notes:
I'm only interested in position 0 of the string I'm searching for a match in. My code is "if (isMatch && match.Index == 0) { ... }". Can I tell it to only check position 0 and, if it's not a match, move on?
The reason I use all the 0 or more things is the match could be as simple as CustomClass() = new CustomClass(), and as complicated as the above or perhaps a bit worse. I'm trying to get as many cases as possible.
This Regex is interested in "[\w,.]+(" and then "whatever may be inside the parentheses" (I tried to think of what all could be inside them based on the fact that it's valid VB code) until you get to the close parenthesis and then " = ". Perhaps I can use a wildcard for literally anything until it gets to ") = " in the string? Like I said, I'm fairly new to Regex.
Thanks in advance!
This seems to do what you want. Normally, I like to be more specific than .*, but it is working correctly. Note that I am using the Multi-line option.
^.*=\s*.+$
Here is a working example on RegExStorm.net.
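On the "only check position 0" part of the question, here is a minimal sketch, not a definitive fix. It assumes each call receives a single line of VB code, anchors the pattern with ^ so a failed attempt at position 0 ends the search, and passes a match timeout to the Regex constructor as a safety net. The simplified pattern and the 250 ms limit are my own choices for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AssignmentMatcher
{
    // ^ (without RegexOptions.Multiline) only matches at the start of the input,
    // so the engine does not retry the pattern at every later position.
    // The nested quantifiers of the original pattern are replaced by a single .*
    private static readonly Regex Assignment = new Regex(
        @"^[\w.]+\(.*\)[\w.]* = ",
        RegexOptions.None,
        TimeSpan.FromMilliseconds(250)); // arbitrary limit, chosen for this sketch

    public static bool StartsWithAssignment(string line)
    {
        try
        {
            return Assignment.IsMatch(line);
        }
        catch (RegexMatchTimeoutException)
        {
            // Treat a timed-out attempt as "no match" instead of hanging.
            return false;
        }
    }
}
```

This is looser than the original expression (anything may appear between the parentheses), so it may accept lines the original would have rejected.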

How is it possible to implement regular expressions (or the like) into WebMatrix razor C#?

I am trying to use regular expressions with my C# code, and I can find a few examples here on what those regular expressions might look like for my cause, but nowhere can I tell how I am supposed to (syntactically) implement that into my logic and code.
I have tried setting up an external .cs file and then calling the method, but the value returned cannot be cast or worked with in any way.
The code can be found here: Checking strings for a strong enough password
In any event, I just want to know how I can check (and test against) values in a password to make sure it is up to standards that I specify.
A better way than suggesting to use regular expressions (since information on how to incorporate them into my own specific logistical setup is both very sparse and very confusing)
...is suggesting how it can be done without them.
I know I could use
foreach(char c in myString)
and then test individually, character by character, but I was hoping there was a better way. That could either be regex, explained (that is, explained how to call this suggestion into action, not just posting a string of seemingly (but not) random characters and being told, hey, use that!), or not regex at all, but some kind of shorthand, perhaps.
Truth is I would use regular expressions, but every time I look them up I can't seem to find anything that is useable by me in WebMatrix.
I want to be able to test a password to be sure that it has both an uppercase and a lowercase letter. In addition I need to check for at least one number.
UPDATE:
Okay, allow me to rephrase my question, maybe I am being confusing...
How does regex work in webmatrix razor (C#). Please show how the regex actually works (I can't seem to find a clear, or even close to clear, answer on this on the web for webmatrix), then please show how it can be put (directly or indirectly) into if logic, on my cshtml page, so that I can actually check against it.
A Regular Expression (Regex), as you will find out, is a tool used in matching text against a pattern and even extracting or replacing matches in the source text.
To do this the Regex engine (which exists in the .Net framework inside the namespace System.Text.RegularExpressions) uses some established patterns that represent certain kinds of chars.
To use a Regex, you pass it the pattern against which a text will be tested and the text itself.
For instance, the following snippet tests if there are lowercase letters in the text:
using System.Text.RegularExpressions;
...
var Pattern = "[a-z]";
if (Regex.IsMatch(SomeText, Pattern)) {
//... there is at least one lower case char in the text
}
The pattern used above states that we are interested in a range of chars from lowercase "a" to lowercase "z".
A pattern to require at least one lowercase letter, one digit and one uppercase letter could probably be something like
@"[\d]+[a-z]+[A-Z]|[\d]+[A-Z]+[a-z]|[a-z]+[\d]+[A-Z]|[a-z]+[A-Z]+[\d]|[A-Z]+[\d]+[a-z]|[A-Z]+[a-z]+[\d]"
As you can see, a Regex pattern can be as simple as in the first example, but may escalate fast in complexity depending on the kind of information you want to check or extract.
In the Regex above, the items between square brackets are called character classes. The plus sign after an element is a quantifier, it indicates that the item may appear one or more times. Items separated by a vertical bar indicate alternatives. The "\d" pattern represents any digit from "0" to "9".
You don't really need a Regex to check the password strength in your application, but using them instead of the explicit tests you are currently making would (in my opinion) greatly improve your program's readability:
using System.Text.RegularExpressions;
...
if (!Regex.IsMatch(password, "[A-Z]")) {
    errorMessage = "Your password must contain at least one uppercase character.";
} else if (!Regex.IsMatch(password, "[a-z]")) {
    errorMessage = "Your password must contain at least one lowercase character.";
} else if (!Regex.IsMatch(password, @"[\d]")) {
    errorMessage = "Your password must contain at least one numerical character.";
}
...
If you're trying to do this test on the page before the information is sent to the server, you can check with javascript and regex. I think this way is best, since you eliminate additional trips to the server. It can pass the javascript validation, post the data to your site, and you can validate more on the server (in case the user has javascript off). Razor syntax helps in the page rendering, but won't help you when the user is doing stuff on the page.
Pattern: ((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15})
This is requiring at least 1 digit, 1 lowercase and 1 uppercase. It's also requiring the password to be between 8 and 15 characters long (inclusive). You can easily change the length requirements in the pattern. For example, if you wanted it to be at least 8 characters, but no max:
((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,})
To use this in javascript (I did use jQuery to wire up my click event in my test and get the password from the input):
var password = $('#password').val();
var pattern = /((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15})/;
var pass = pattern.test(password);
A Fiddle showing it in action: http://jsfiddle.net/gromer/chF4D/1/
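For the server-side half of that check, the same lookahead pattern can be used from C#. This is only a sketch: the class and method names are made up for illustration, and I have added ^ and $ anchors (not in the pattern above) so that the 15-character maximum is actually enforced on the whole string.

```csharp
using System.Text.RegularExpressions;

public static class PasswordPattern
{
    // Each (?=...) lookahead asserts one requirement without consuming text;
    // .{8,15} between ^ and $ then enforces the overall length.
    private static readonly Regex Strong =
        new Regex(@"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$");

    public static bool IsStrong(string password)
    {
        return Strong.IsMatch(password);
    }
}
```

In a cshtml page this could be called from ordinary if logic, for example if (!PasswordPattern.IsStrong(password)) { errorMessage = "..."; }.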
Well I know this isn't the way to properly do this, (the proper way is using regular expressions, which is what I was attempting to figure out) but if you are like me, and sick of asking/searching the same question over and over again, without getting any entirely useful feedback, this is what worked for me.
int Ucount = 0;
int Lcount = 0;
int Ncount = 0;
foreach(char c in password)
{
    if(c!='A' && c!='B' && c!='C' && c!='D' && c!='E' && c!='F' && c!='G' && c!='H' && c!='I' && c!='J' && c!='K' && c!='L' && c!='M' && c!='N' && c!='O' && c!='P' && c!='Q' && c!='R' && c!='S' && c!='T' && c!='U' && c!='V' && c!='W' && c!='X' && c!='Y' && c!='Z')
    {
        Ucount++;
        if(Ucount >= password.Length)
        {
            errorMessage = "Your password must contain at least one uppercase character.";
        }
    }
    if(c!='a' && c!='b' && c!='c' && c!='d' && c!='e' && c!='f' && c!='g' && c!='h' && c!='i' && c!='j' && c!='k' && c!='l' && c!='m' && c!='n' && c!='o' && c!='p' && c!='q' && c!='r' && c!='s' && c!='t' && c!='u' && c!='v' && c!='w' && c!='x' && c!='y' && c!='z')
    {
        Lcount++;
        if(Lcount >= password.Length)
        {
            errorMessage = "Your password must contain at least one lowercase character.";
        }
    }
    if(c!='0' && c!='1' && c!='2' && c!='3' && c!='4' && c!='5' && c!='6' && c!='7' && c!='8' && c!='9')
    {
        Ncount++;
        if(Ncount >= password.Length)
        {
            errorMessage = "Your password must contain at least one numerical character.";
        }
    }
}
Again, please note that the proper way to do this is to use regular expressions, but this code works fine for me, and I got to end the wild goose chase that was understanding regular expressions.
UPDATE:
Or, if you want to do it right, read Branco Medeiros's post above, he posted after I posted this, and his way is the right way to do it (provided you don't want to do it with JavaScript before it is sent to the server. If your app is like mine and not resource intensive enough to need JavaScript to do this, use Branco Medeiros's example above).
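For anyone who still wants the no-regex route but in shorthand, a sketch using LINQ's Any with the built-in char helpers might look like this. The class and method names are hypothetical, and note that char.IsUpper, char.IsLower and char.IsDigit also accept non-ASCII letters and digits, which the loop above does not.

```csharp
using System.Linq;

public static class PasswordRules
{
    // Returns an error message, or an empty string when all checks pass.
    public static string Validate(string password)
    {
        if (!password.Any(char.IsUpper))
            return "Your password must contain at least one uppercase character.";
        if (!password.Any(char.IsLower))
            return "Your password must contain at least one lowercase character.";
        if (!password.Any(char.IsDigit))
            return "Your password must contain at least one numerical character.";
        return string.Empty;
    }
}
```

Unlike the loop above, this stops at the first failed rule, which matches the if / else if chain in the regex answer.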

Using RegEx to read through a CSV file

I have a CSV file, with the following type of data:
0,'VT,C',0,
0,'C,VT',0,
0,'VT,H',0,
and I desire the following output
0
VT,C
0
0
C,VT
0
0
VT,H
0
Therefore splitting the string on the comma however ignoring the comma within quote marks. At the moment I'm using the following RegEx:
("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)"
however this gives me the result of:
0
VT
C
0
0
C
VT
0
0
VT
H
0
This shows that the RegEx is not reading the quote marks properly. Can anyone suggest some alterations that might help?
Usually when it comes to CSV parsing, people use specific libraries well suited for the programming language they are using to code their application.
Anyway if you are going to use a regular expression to make a really loose(!) parsing you may try using something like this:
'(?<value>[^']*?)'
It will match anything in between single quotes, and assuming the csv file is well formed, it will not miss a field. Of course it doesn't accept embedded quotes but it easily gets the job done. That's what I use when I need to get the job done really quickly. Please don't consider it a complete solution to your problem...it just works in special conditions when the requirements are what you described and the input is well formed.
[EDIT]
I was checking your question again and noticed you also want to include non-quoted fields. In that case my expression will not work at all. If you think hard about your problem, you'll find it is quite difficult to solve without ambiguity: you need fixed rules, and if you allow both quoted and unquoted fields, the parser will have a hard time telling separator commas from commas inside quoted fields.
Another expression to model such a solution may be:
('[^']+'|[^,]+),?
It will match both quoted and unquoted fields, though I'm not sure whether it needs to assume the CSV adheres to strict conditions. As far as I can tell, that will be much safer than a split strategy: you just need to collect all matches and append the matched value + \r\n to your target string.
This regex is based on the fact that you have 1 digit before and after your 'value'
Regex.Replace(input, @"(?:(?<=\d),|,(?=\d))", "\n");
You can test it out on RegexStorm
foreach(var m in Regex.Matches(s,"(('.*?')|[0-9])"))
I have managed to get the following method to read the file as required:
public List<string> SplitCSV(string input, List<string> line)
{
    Regex csvSplit = new Regex("(([^,^\'])*(\'.*\')*([^,^\'])*)(,|$)", RegexOptions.Compiled);
    foreach (Match match in csvSplit.Matches(input))
    {
        line.Add(match.Value.TrimStart(','));
    }
    return line;
}
Thanks for everyone's help though.
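For comparison, here is a sketch of the "collect all matches" approach suggested in the answers above, using named groups so the surrounding single quotes are dropped. It assumes well-formed input with no embedded quotes and silently skips empty fields; the class, method and group names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LooseCsv
{
    // Either a single-quoted field (commas allowed inside) or a run of
    // characters containing neither a comma nor a quote.
    private static readonly Regex Field =
        new Regex(@"'(?<quoted>[^']*)'|(?<plain>[^',]+)");

    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        foreach (Match m in Field.Matches(line))
        {
            // Exactly one of the two groups succeeds for each match.
            Group quoted = m.Groups["quoted"];
            fields.Add(quoted.Success ? quoted.Value : m.Groups["plain"].Value);
        }
        return fields;
    }
}
```

For the first sample line, 0,'VT,C',0, this should yield the three fields 0, VT,C and 0. A dedicated CSV library is still the safer choice for anything beyond data shaped like the question's.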

How to restore HTML brackets that have been replaced?

I'm working with a database that has content where the angled brackets have been replaced with the character ^.
e.g.
^b^some text^/b^
Can anyone please recommend a C# solution to convert the ^ character back to the appropriate bracket, so it can be displayed as HTML? I'm guessing some kind of regex will do the job...?
Thanks in advance
You can replace every n'th ^ character with > where n is even and < where n is odd.
var html = "^b^some text^/b^";
var n = 0;
var result = Regex.Replace(html, "\\^", m => ((n++ % 2) == 0) ? "<" : ">");
// result == "<b>some text</b>"
Note that this works only as long as the original HTML code contains a closing > character for every < character (<p<b>... is bad) and that there were no ^ characters in the original HTML code (<b>2^5</b> is bad).
A more complicated, but possibly safer solution would be to search for specific sets of characters, such as ^p, ^img, ^div, etc. and their counterparts, ^/p^, ^/div^, ^/img^, etc., and replacing each of them specifically.
Whether this is feasible though, depends on what tags exist in the data, and how big an effort you are willing to put in to do this securely. Do you know if there is a finite set of tags that have been used? Was the HTML generated, or is there a chance that someone has edited them manually, necessarily making the pattern-searching more complicated?
Maybe you could first do some analysis, for instance searching and listing the various instances where the character ^ occurs? How much data are we talking about, and is it static, or will it continue to grow (including the ^-problem)?
Tricky, to the point of being impossible to do perfectly automatically, unless you can make some very convenient assumptions about the original HTML (that it is a small subset of all possible HTML, that it was known to conform to certain predictable patterns). I think in the end there's going to have to be hand editing.
Having said that, and apologies for not including any actual C# code, here's how I'd consider approaching it.
Let's go after the problem incrementally, where we convert common patterns first. The goal being after every step to reduce the number of remaining ^ characters.
So first, regex-replace lots of very common literal patterns
^p^ -> <p>
^div^ -> <div>
^/div^ -> </div>
etc.
Next, replace patterns that contain optional text, like
^link[anything-except-^]^ -> <link[original-text]>
and on and on. My approach is to replace only expected patterns, and by doing that, avoid false matches. Then iterate with other patterns until there are no ^ chars left. This takes lots of inspection of data, and lots of patterns. It's brute force, not smart, but there you go.
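Since the description above comes without C# code, here is one way the incremental passes might be sketched. The tag lists are hypothetical placeholders; the real ones would have to come from inspecting the data. CountRemaining reports how many ^ characters are left for further patterns or hand editing.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class CaretTagRestorer
{
    // Pass 1: literal tags without attributes, opening or closing (hypothetical list).
    private static readonly Regex SimpleTag =
        new Regex(@"\^(/?(?:p|b|i|a|div|span))\^");

    // Pass 2: tags followed by attribute text that contains no ^ (hypothetical list).
    private static readonly Regex TagWithAttributes =
        new Regex(@"\^((?:a|img|link)\s[^^]*)\^");

    public static string Restore(string text)
    {
        text = SimpleTag.Replace(text, "<$1>");
        text = TagWithAttributes.Replace(text, "<$1>");
        return text;
    }

    // Number of ^ characters that no pattern has accounted for yet.
    public static int CountRemaining(string text)
    {
        return text.Count(c => c == '^');
    }
}
```

Because only expected patterns are replaced, a literal ^ in the content (such as 2^5) is left alone and shows up in the remaining count instead of being turned into a bracket.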

Is it possible to negate a regular expression search?

I'm building a lexical analysis engine in c#. For the most part it is done and works quite well. One of the features of my lexer is that it allows any user to input their own regular expressions. This allows the engine to lex all sort of fun and interesting things and output a tokenised file.
One of the issues I'm having is that I want the user to have everything contained in this tokenised file, i.e. the parts they are looking for and the parts they are not (partial highlighting would be a good example of this).
Based on the way my lexer highlights I found the best way to do this would be to negate the regular expressions given by the user.
So if the user wanted to lex a string for every occurrence of "T" the negated version would find everything except "T".
Now the above is easy to do but what if a user supplies 8 different expressions of a complex nature, is there a way to put all these expressions into one and negate the lot?
You could combine several RegEx's into 1 by using (pattern1)|(pattern2)|...
To negate it you just check for !IsMatch
var matches = Regex.Matches("aa bb cc dd", @"(?<token>a{2})|(?<token>d{2})");
would return in fact 2 tokens (note that I've used the same name twice.. that's ok)
Also explore Regex.Split. For instance:
var split = Regex.Split("aa bb cc dd", @"(?<token>aa bb)|(?:\s+)");
returns the words as tokens, except for "aa bb", which is returned as one token because I defined it so with (?<token>...).
You can also use the Index and Length properties to calculate the middle parts that have not been recognized by the Regex:
var matches = Regex.Matches("aa bb cc dd", @"(?<token>a{2})|(?<token>d{2})");
for (int i = 0; i < matches.Count; i++)
{
    var group = matches[i].Groups["token"];
    Console.WriteLine("Token={0}, Index={1}, Length={2}", group.Value, group.Index, group.Length);
}
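To make the Index and Length idea concrete, here is a sketch that walks the matches and emits the unmatched stretches in between, which is effectively the "negated" part the question asks for. The TOKEN: and SKIP: prefixes and the class name are just illustrative labels, not part of any existing lexer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GapTokenizer
{
    public static List<string> Tokenize(string input, Regex pattern)
    {
        var pieces = new List<string>();
        int position = 0; // end of the previous match
        foreach (Match m in pattern.Matches(input))
        {
            if (m.Length == 0)
                continue; // ignore empty matches so they do not produce empty tokens
            if (m.Index > position)
                pieces.Add("SKIP:" + input.Substring(position, m.Index - position));
            pieces.Add("TOKEN:" + m.Value);
            position = m.Index + m.Length;
        }
        // Whatever follows the last match is also unmatched text.
        if (position < input.Length)
            pieces.Add("SKIP:" + input.Substring(position));
        return pieces;
    }
}
```

With several user-supplied expressions, the pattern argument could be the combined (pattern1)|(pattern2)|... alternation described above, so nothing needs to be negated explicitly.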
