Extract tables and columns from SQL query using regular expression - c#

I am trying to create a regex for this task, but I really can't grasp regex beyond very simple cases :-(
The problem: I have this ("SQL like") query:
SELECT tcmcs003.*, tccom130.nama, tccom705.dsca, tcmcs052.dsca, tccom100.nama
FROM tcmcs003, tccom130,tccom705,tcmcs052,tccom100
WHERE tcmcs003.cadr REFERS TO tccom130
AND tcmcs003.casi REFERS TO tccom705
AND tcmcs003.cprj REFERS TO tcmcs052
AND tcmcs003.bpid REFERS TO tccom100
ORDER BY tcmcs003._index1
I want to "extract" all the table names and column names, and after that I want to simply add my characters to them...
For example replace:
SELECT tcmcs003.*, tccom130.nama
with:
SELECT tcmcs003XXX.*, tccom130XXX.namaYYY
Up to now, the "best" regex I have is this:
(?<gselect>SELECT\s+)*(?<tname>\w{5}\d{3})*(?<spaces>[\.\,\s])+(?<colname>\w{4})*
And replacement pattern:
${gselect}${tname}XXX${spaces}${colname}YYY
The output is really terrible :-(
SELECT tcmcs003.
m130
.nama
m705
.dsca
s052
.dsca
m100
.nama
FROM
s003
m130
,m705
,s052
,m100
WHER
s003
.cadr
REFE
m130
s003
How can I write the regex?
I want to repeatedly capture something like
[(any string)(table name)(\.a dot or not)(column name)(any string) ] (repeat N times)
EDIT
I am writing in C#
The pattern should be a bit more general than:
\b(tc(?:mcs|com)\d{3}XXX.\w+)\b
in the sense that table name is 5 characters (the first is always a t, followed by 4 random chars) followed by 3 random digits
table column is 4 random chars

Instead of trying to match the whole command, I'll simply match each table or column independently. Since table names contain digits, there is little chance the pattern will match anything else.
Match column names with:
\b(t\w{4}\d{3}\.\w{4})\b
Match table names with:
\b(t\w{4}\d{3})\b
Then, we can replace each with the desired value: "$1YYY" and "$1XXX" respectively. The patterns use these constructs:
\b Matches a word boundary (a word char on one side and not a word char on the other).
\w{4} Matches 4 word chars ([A-Za-z0-9_]).
\d{3} Matches 3 digits ([0-9]).
Code:
// Requires: using System; using System.Text.RegularExpressions;
string input = @"SELECT tcmcs003.*, tccom130.nama, tccom705.dsca, tcmcs052.dsca, tccom100.nama
FROM tcmcs003, tccom130,tccom705,tcmcs052,tccom100
WHERE tcmcs003.cadr REFERS TO tccom130
AND tcmcs003.casi REFERS TO tccom705
AND tcmcs003.cprj REFERS TO tcmcs052
AND tcmcs003.bpid REFERS TO tccom100
ORDER BY tcmcs003._index1";
string Pattern1 = @"\b(t\w{4}\d{3}\.\w{4})\b"; // table.column
string Pattern2 = @"\b(t\w{4}\d{3})\b";        // table only
Regex r1 = new Regex(Pattern1);
Regex r2 = new Regex(Pattern2);
string replacement1 = "YYY";
string replacement2 = "XXX";
// Run the column pass first: once XXX is appended to a table name,
// Pattern1 no longer matches the table.column pair.
string result = r1.Replace(input, "$1" + replacement1);
result = r2.Replace(result, "$1" + replacement2);
Console.WriteLine(result);
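The question also asks to "extract" the names rather than only rewrite them. A minimal sketch of that, reusing the same two patterns with Regex.Matches, could look like this. The class and method names (SqlNameExtractor, Tables, Columns) are hypothetical, and the patterns assume the naming rules stated in the question (t + 4 chars + 3 digits for tables, 4 chars for columns):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class SqlNameExtractor
{
    // Table: 't', four word characters, three digits.
    private static readonly Regex TableRegex = new Regex(@"\bt\w{4}\d{3}\b");

    // Column: a table name, a dot, then exactly four word characters (captured in group 1).
    private static readonly Regex ColumnRegex = new Regex(@"\bt\w{4}\d{3}\.(\w{4})\b");

    // Returns each distinct table name found in the query.
    public static List<string> Tables(string sql) =>
        TableRegex.Matches(sql).Cast<Match>().Select(m => m.Value).Distinct().ToList();

    // Returns each distinct column name, without its table prefix.
    public static List<string> Columns(string sql) =>
        ColumnRegex.Matches(sql).Cast<Match>().Select(m => m.Groups[1].Value).Distinct().ToList();
}
```

Note that tcmcs003.* and tcmcs003._index1 contribute a table name but no column name, because neither is followed by exactly four word characters.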

Related

Regex.Replace using regular expression as replacement

I am new to C# programming language and came across the following problem
I have a string " avenue 4 TH some more words". I want to remove space between 4 and TH. I have written a regex which helps in determining whether "4 TH" is available in a string or not.
[0-9]+\s(th|nd|st|rd)
string result = "avenue 4 TH some more words";
var match = Regex.IsMatch(result,"\\b" + item + "\\b",RegexOptions.IgnoreCase) ;
Console.WriteLine(match);//True
Is there anything in C# which will remove the space
something like Regex.Replace(result, "[0-9]+\\s(th|nd|st|rd)", "[0-9]+(th|nd|st|rd)",RegexOptions.IgnoreCase);
so that end result looks like
avenue 4TH some more words
You may use
var pattern = @"(?i)(\d+)\s*(th|[nr]d|st)\b";
var match = string.Concat(Regex.Match(result, pattern)?.Groups.Cast<Group>().Skip(1));
This yields 4TH.
The regex - (?i)(\d+)\s*(th|[nr]d|st)\b - matches 1 or more digits capturing the value into Group 1, then 0 or more whitespaces are matched with \s*, and then th, nd, rd or st as whole words (as \b is a word boundary) are captured into Group 2.
The Regex.Match(result, pattern)? part tries to match the pattern in the string. If there is a match, the match object's Groups property is accessed and all groups are cast to a Group list with Groups.Cast<Group>(). Since the first group is the whole match value, we .Skip(1) it.
The rest - the values of Group 1 and Group 2 - are concatenated with string.Concat.
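If the goal is the whole sentence with the space removed (avenue 4TH some more words) rather than just the 4TH fragment, the same pattern can be passed to Regex.Replace with the two groups written back to back. This is a sketch only; the OrdinalFixer class and RemoveOrdinalSpace method are hypothetical names:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper built on the pattern from the answer above.
public static class OrdinalFixer
{
    // Digits, optional whitespace, then an ordinal suffix as a whole word.
    private static readonly Regex OrdinalRegex =
        new Regex(@"(\d+)\s*(th|[nr]d|st)\b", RegexOptions.IgnoreCase);

    // Rewrites every "4 TH"-style occurrence as "4TH" and leaves the rest of the text alone.
    public static string RemoveOrdinalSpace(string text) =>
        OrdinalRegex.Replace(text, "$1$2");
}
```

In the replacement string, $1 and $2 refer to the captured digits and suffix, so only the whitespace between them is dropped.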

separate number from String containing spaces and hyphens in C#

I am developing a C# MVC application. I got an account name and its code from one field from the view but I have to segregate them for storing them in database. I have used Regular Expression and successfully separated the code from rest of the string. But in the string part I can only get the string before the space or hyphen. My Regex is:
string numberPart = Regex.Match(s, @"\d+").Value;
string alphaPart = Regex.Match(s, @"[a-zA-Z]+\s+").Value;
d.code = numberPart;
d.name = alphaPart;
"2103010001 - SALES - PACKING SERV - MUTTON ( 1F )"
this is my complete string from the view. When I used the above Regex for separating code and description, I get the following,
numberPart = 2103010001
alphaPart = SALES
What I want is:
numberPart = 2103010001
alphaPart = SALES - PACKING SERV - MUTTON ( 1F )
What would be the appropriate expression to do this?
For the second regex, you essentially want "everything after (and including) the first letter". Thus you can simply try
string alphaPart = Regex.Match(s, @"[a-zA-Z].*").Value;
If you want to be more specific, you can restrict the "after" part to just the characters you expect, maybe
string alphaPart = Regex.Match(s, @"[a-zA-Z][a-zA-Z0-9 ()-]*").Value;
but you still need the leading [a-zA-Z] because otherwise you'd match the number part too.
Just split on the first - character.
Regex.Split(input, @"(?<=^[^-]*?)\s*-\s*");
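Another option, sketched here under the assumption that the input always has the shape "digits - description", is to capture both parts with one pattern and two named groups. AccountParser and TryParse are hypothetical names, not from the answers above:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: splits "2103010001 - SALES - ..." into code and name in one match.
public static class AccountParser
{
    // Leading digits, the first hyphen with optional spaces around it, then the rest of the line.
    private static readonly Regex AccountRegex =
        new Regex(@"^\s*(?<code>\d+)\s*-\s*(?<name>.+)$");

    // Returns false (and empty strings) when the input does not have the assumed shape.
    public static bool TryParse(string s, out string code, out string name)
    {
        Match m = AccountRegex.Match(s);
        code = m.Success ? m.Groups["code"].Value : "";
        name = m.Success ? m.Groups["name"].Value.Trim() : "";
        return m.Success;
    }
}
```

Because only the first hyphen is consumed by the pattern, later hyphens stay inside the name part.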

Using Regex to extract part of a string from a HTML/text file

I have a C# regular expression to match author names in a text document that is written as:
"author":"AUTHOR'S NAME"
The regex is as follows:
new Regex("\"author\":\"[A-Za-z0-9]*\\s?[A-Za-z0-9]*")
This returns "author":"AUTHOR'S NAME. However, I don't want the quotation marks or the word Author before. I just want the name.
Could anyone help me get the expected value please?
Use regex groups to get a part of the string. ( ) acts as a capture group and can be accessed by the .Groups field.
.Groups[0] holds the whole match
.Groups[1] holds the first capture group (and so on)
string pattern = "\"author\":\"([A-Za-z0-9]*\\s?[A-Za-z0-9]*)\"";
var match = Regex.Match("\"author\":\"Name123\"", pattern);
string authorName = match.Groups[1].Value;
You can also use look-around approach to only get a match value:
var txt = "\"author\":\"AUTHOR'S NAME\"";
var rgx = new Regex(@"(?<=""author"":"")[^""]+(?="")");
var result = rgx.Match(txt).Value;
My regex yields 555,020 iterations per second speed with this input string, which should suffice.
result will be AUTHOR'S NAME.
(?<="author":") checks if we have "author":" before the match, [^"]+ looks safe since you only want to match alphanumerics and space between the quotes, and (?=") is checking the trailing quote.

Regex problems with equal sign?

In C# I'm trying to validate a string that looks like:
I#paramname='test'
or
O#paramname=2827
Here is my code:
string t1 = "I#parameter='test'";
string r = @"^([Ii]|[Oo])#\w=\w";
var re = new Regex(r);
If I take the "=\w" off the end of variable r I get True. If I add an "=\w" after the \w it's False. I want the characters between # and = to be able to be any alphanumeric value. Anything after the = sign can have alphanumeric and ' (single quotes). What am I doing wrong here? I have very rarely used regular expressions and can normally find an example, but this is a custom format, and even with cheatsheets I'm having issues.
^([Ii]|[Oo])#\w+=(?<q>'?)[\w\d]+\k<q>$
Regular expression:
^ start of line
([Ii]|[Oo]) either (I or i) or (O or o)
\w+ 1 or more word characters
= equals sign
(?<q>'?) capture 0 or 1 quotes in named group q
[\w\d]+ 1 or more word or digit characters
\k<q> repeat of what was captured in named group q
$ end of line
Use \w+ instead of \w to match one or more characters, or \w* to match zero or more.
Try this:
^([Ii]|[Oo])#\w+=\'*\w+\'*
If you are being a bit more strict with using paramname:
^([Ii]|[Oo])#paramname=[']?[\w]+[']?
You could try something like this:
Regex rx = new Regex( @"^([IO])#(\w+)=(.*)$" , RegexOptions.IgnoreCase ) ;
Match group 1 will give you the value of I or O (the parameter direction?)
Match group 2 will give you the name of the parameter
Match group 3 will give you the value of the parameter
You could be stricter about the 3rd group and match it as
(([^']+)|('(('')|([^']+))*'))
The first alternative matches one or more non-quote characters; the second alternative matches a quoted string literal in which any internal (embedded) quotes are escaped by doubling them, so it would match things like
'' (the empty string)
'foo bar'
'That''s All, Folks!'
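To see how the named backreference \k<q> from the first answer keeps the quotes balanced, here is a small sketch. ParamValidator and IsValid are hypothetical names, the separator is taken to be # exactly as it appears in the question, and the redundant \d inside [\w\d] is dropped since \w already covers digits:

```csharp
using System.Text.RegularExpressions;

// Hypothetical validator for strings such as I#paramname='test' or O#paramname=2827.
public static class ParamValidator
{
    // (?<q>'?) captures an optional opening quote; \k<q> then requires the same text
    // (a quote, or nothing) after the value, so an unbalanced quote is rejected.
    private static readonly Regex ParamRegex =
        new Regex(@"^[IO]#\w+=(?<q>'?)\w+\k<q>$", RegexOptions.IgnoreCase);

    public static bool IsValid(string s) => ParamRegex.IsMatch(s);
}
```

With this pattern a value is accepted either fully quoted or not quoted at all; a value with only a leading or only a trailing quote does not match.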

Get sub-strings from a string that are enclosed using some specified character

Suppose I have a string
Likes (20)
I want to fetch the sub-string enclosed in round brackets (in the above case it's 20) from this string. This sub-string can change dynamically at runtime. It might be any other number from 0 to infinity. To achieve this my idea is to use a for loop that traverses the whole string and then when a ( is present, it starts adding the characters to another character array and when ) is encountered, it stops adding the characters and returns the array. But I think this might have poor performance. I know very little about regular expressions, so is there a regular expression solution available or any function that can do that in an efficient way?
If you don't fancy using regex you could use Split:
string foo = "Likes (20)";
string[] arr = foo.Split(new char[]{ '(', ')' }, StringSplitOptions.None);
string count = arr[1];
Count = 20
This will work fine regardless of the number in the brackets ()
e.g:
Likes (242535345)
Will give:
242535345
Works also with pure string methods:
string result = "Likes (20)";
int index = result.IndexOf('(');
if (index >= 0)
{
result = result.Substring(index + 1); // take part behind (
index = result.IndexOf(')');
if (index >= 0)
result = result.Remove(index); // remove part from )
}
For a strict matching, you can do:
Regex reg = new Regex(@"^Likes \((\d+)\)$");
Match m = reg.Match(yourstring);
this way you'll have all you need in m.Groups[1].Value.
As suggested by I4V, assuming you have only that sequence of digits in the whole string, as in your example, you can use the simpler version:
var res = Regex.Match(str, @"\d+");
and in this case, you can get the value you are looking for with res.Value
EDIT
In case the value enclosed in brackets is not just numbers, you can just change the \d with something like [\w\d\s] if you want to allow in there alphabetic characters, digits and spaces.
Even with Linq:
var s = "Likes (20)";
var s1 = new string(s.SkipWhile(x => x != '(').Skip(1).TakeWhile(x => x != ')').ToArray());
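const string likes = "Likes (20)";
int open = likes.IndexOf('(');
int close = likes.IndexOf(')', open + 1);
// The length is the number of characters strictly between the two parentheses.
int likesCount = int.Parse(likes.Substring(open + 1, close - open - 1));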
Matching when the part in parentheses is supposed to be a number:
string inputstring = "Likes (20)";
Regex reg = new Regex(@"\((\d+)\)");
string num = reg.Match(inputstring).Groups[1].Value;
Explanation:
By definition regexp matches a substring, so unless you indicate otherwise the string you are looking for can occur at any place in your string.
\d stands for digits. It will match any single digit.
We want it to potentially be repeated several times, and we want at least one. The + sign is regexp for previous symbol or group repeated 1 or more times.
So \d+ will match one or more digits. It will match 20.
To ensure that we get the number that is in parentheses we say that it should be between ( and ). These are special characters in regexp so we need to escape them.
\(\d+\) would match (20), and we are almost there.
Since we want the part inside the parentheses, and not including the parentheses, we tell regexp that the digits part is a single group.
We do that by using parentheses in our regexp. \((\d+)\) will still match (20), but now it will note that 20 is a subgroup of this match and we can fetch it by Match.Groups[].
For any string in parentheses things get a little bit harder.
Regex reg = new Regex(@"\((.+)\)");
Would work for many strings. (the dot matches any character) But if the input is something like "This is an example(parantesis1)(parantesis2)", you would match (parantesis1)(parantesis2) with parantesis1)(parantesis2 as the captured subgroup. This is unlikely to be what you are after.
The solution can be to do the matching for "any character except a closing parenthesis":
Regex reg = new Regex(@"\(([^)]+)\)");
This will find (parantesis1) as the first match, with parantesis1 as .Groups[1].
It will still fail for nested parentheses, but since regular expressions are not the correct tool for nested parentheses I feel that this case is a bit out of scope.
If you know that the string always starts with "Likes " before the group then Saves solution is better.
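When a string may contain several parenthesized parts, the "anything except a closing parenthesis" pattern can be combined with Regex.Matches to collect all of them. The following is a sketch only; ParenExtractor and Extract are hypothetical names, and nested parentheses are not handled, as noted above:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the text inside every non-nested (...) pair.
public static class ParenExtractor
{
    // An opening parenthesis, one or more characters that are not ')', then ')'.
    private static readonly Regex ParenRegex = new Regex(@"\(([^)]+)\)");

    public static List<string> Extract(string text) =>
        ParenRegex.Matches(text).Cast<Match>().Select(m => m.Groups[1].Value).ToList();
}
```

For "Likes (20)" this returns a single element, which can then be passed to int.Parse if a number is expected.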
