regex approach for extracting strings surrounded with double quotes - c#

I have a search string that is passed in.
E.g.: "a+b",a, b, "C","d+e",a-b,d
I want to extract all substrings surrounded by double quotes ("").
For the sample above, the output should contain:
"a+b","C","d+e"
Is there a way to do this without looping?
I then also need the string without those values for further processing.
E.g.: a,b,a-b,d
Any suggestions on how to do this with minimal performance impact?
Thank you in advance for all your comments and suggestions.

You didn't say exactly how you want your output (do you need to keep the commas and extra whitespace? Is it comma delimited to begin with?). Let's assume that it is NOT comma delimited and you are just trying to remove the occurrences of "xyz":
string strRegex = @"""([^""])+""";
string strTargetString = @" ""a+b"",a, b, ""C"",""d+e"",a-b,d";
string strOutput = Regex.Replace(strTargetString, strRegex, x => "");
This removes all of the quoted items (leaving the extra commas and whitespace).
If you need each individual match instead, you might want to try:
var y = (from Match m in Regex.Matches(strTargetString, strRegex) select m.Value).ToList<string>();
y.ForEach(s => Console.WriteLine(s));
To get the list of items that are not surrounded by quotes, you could either reverse the regex pattern OR use the replace method from the first code sample and then split on the commas, trimming whitespace (again, assuming you are splitting on commas, which it sounds like you are).
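The two steps described above (collect the quoted matches, then replace, split and trim for the rest) can be sketched together as follows. This is a minimal sketch, not a tuned implementation: the class and method names are made up for illustration, and it assumes the input is comma delimited and that quoted items never contain an escaped quote.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, named here only for illustration.
public static class QuotedSplitter
{
    // A double quote, one or more non-quote characters, a closing double quote.
    private static readonly Regex Quoted = new Regex(@"""[^""]+""");

    // Returns the quoted items, surrounding quotes included.
    public static string[] ExtractQuoted(string input) =>
        Quoted.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();

    // Removes the quoted items, splits on commas, trims, and drops empty leftovers.
    public static string[] ExtractUnquoted(string input) =>
        Quoted.Replace(input, "")
              .Split(',')
              .Select(s => s.Trim())
              .Where(s => s.Length > 0)
              .ToArray();

    public static void Main()
    {
        string input = @"""a+b"",a, b, ""C"",""d+e"",a-b,d";
        Console.WriteLine(string.Join(",", ExtractQuoted(input)));   // "a+b","C","d+e"
        Console.WriteLine(string.Join(",", ExtractUnquoted(input))); // a,b,a-b,d
    }
}
```

The LINQ calls still iterate internally, so this avoids writing a loop rather than avoiding the work of one.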

First, add a comma to the end of your input:
"a+b",a, b, "C","d+e",a-b,d,
Then, use this regular expression:
((?<quoted>\".+?\")|(?<unquoted>.+?)),\s*
Now you have 2 problems. Kidding!
You'll have to find a way of extracting the matches without using a loop, but at least they are separated into quoted and unquoted strings by the named groups. You could use a lambda expression to pull the data out and join it, one each for quoted and unquoted, but that is just doing a loop behind the scenes and may add more overhead than a simple for loop. It sounds like you're trying to eke out performance here, so time and test each method to see what gives the best results.
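For reference, here is one way the named groups could be read back out. It is only a sketch using a plain foreach loop (the simple option the answer mentions); the class name, method name and tuple shape are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, named here only for illustration.
public static class GroupedSplitter
{
    // Pattern from the answer: a quoted or unquoted item, then a comma and optional whitespace.
    private static readonly Regex Item =
        new Regex(@"((?<quoted>"".+?"")|(?<unquoted>.+?)),\s*");

    public static (List<string> Quoted, List<string> Unquoted) Split(string input)
    {
        var quoted = new List<string>();
        var unquoted = new List<string>();

        // The pattern expects every item to end with a comma, so append one.
        foreach (Match m in Item.Matches(input + ","))
        {
            if (m.Groups["quoted"].Success)
                quoted.Add(m.Groups["quoted"].Value);
            else
                unquoted.Add(m.Groups["unquoted"].Value);
        }
        return (quoted, unquoted);
    }

    public static void Main()
    {
        var result = Split(@"""a+b"",a, b, ""C"",""d+e"",a-b,d");
        Console.WriteLine(string.Join(",", result.Quoted));   // "a+b","C","d+e"
        Console.WriteLine(string.Join(",", result.Unquoted)); // a,b,a-b,d
    }
}
```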

Related

Regex.Split returning whitespaces

I want to export a View as an HTML document to the user on my ASP.NET page. I want to give the option to get only a part of the view.
Because of that I want to split the output with Regex.Split(). I wrote a Regex that matches the part I want to cut out. After splitting I put the 2 output parts together again.
The problem is that I get a list of 3 parts, of which the second contains " ". How can I change the code so that the output contains only 2 strings?
My Code:
textParts = Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");
text = textParts[0] + textParts[1];
text contains HTML, CSS and jQuery Code. I wrote comments like <!--Graphic2--> around the blocks I want to cut out.
EDIT
I got it working now by using the Regex.Replace() Method. But I still don't know why Split isn't working how I expected.

You should consider parsing HTML with the proper tools, like HtmlAgilityPack.
The current question is about why Regex.Split returned 3 values. That is due to the presence of a capturing group in your pattern. Regex.Split returns the chunks between start/end of string and the matched chunks, and all captured substrings:
If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
So, Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->") matches the <!--Graphic2--> substring, then matches and captures into Group 1 any 0+ occurrences of any char, as many as possible, and then matches <!--EndDiscarded-->. These matches are removed and the substrings that are not matched are returned, but the last char captured into the repeated capturing group is also returned.
So, if you plan to use regex for this task, you should consider re-writing it to @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->" or @"<!--Graphic2-->[^<]*(?:<(?!!--EndDiscarded)[^<]*)*<!--EndDiscarded-->" that will be much more efficient, or even @"<!--Graphic2-->[^<]*(?:<(?!!--(?:EndDiscarded|Graphic2))[^<]*)*<!--EndDiscarded-->" that will ensure no nested Graphic2 comments are matched.
As you can see, the complexity of the regexps rises when you want your patterns to work more efficiently and safely. However, even these longer versions do not guarantee 100% safety.
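The effect of the capturing group is easy to reproduce with the "plum-pear" example from the documentation quote. The sketch below also applies the first suggested rewrite to a made-up HTML fragment; the class name, method name and sample markup are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, named here only for illustration.
public static class SplitDemo
{
    // Splits on the marked block without a capturing group, then rejoins the rest.
    public static string RemoveBlock(string text)
    {
        // (?s) lets '.' match newlines; .*? stops at the first end marker.
        string[] parts = Regex.Split(text, @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->");
        return string.Concat(parts);
    }

    public static void Main()
    {
        // With a capturing group the separator is returned as well.
        Console.WriteLine(Regex.Split("plum-pear", "(-)").Length); // 3
        // Without one, only the pieces around the separator come back.
        Console.WriteLine(Regex.Split("plum-pear", "-").Length);   // 2

        string html = "<p>keep</p><!--Graphic2-->\n<img/>\n<!--EndDiscarded--><p>also keep</p>";
        Console.WriteLine(RemoveBlock(html)); // <p>keep</p><p>also keep</p>
    }
}
```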

C# Trouble with Regex.Replace

Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (it replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worried that this will only work for single digits (\"1\" rather than \"11\"). That led me to think about using "*" or "+" rather than ".", but I foresaw the problem of this picking up all of the text in between the desired characters (which are dotted all over the place), whereas I obviously only want to replace the ones with numeric characters in between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)

Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str , "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.

This works
String input = @"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
//replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, @"\\""(\d+)\\""", match => match.Result("$1"));
//control string for expected result
String shouldBe = @"?\""width=1height=1border=0style=\""display:none;\"">');";
//prints true
Console.WriteLine(result.Equals(shouldBe).ToString());

Matching a substring of any length and characters using RegEx

I would like to be able to match and then extract all substrings in the following string using regex in c#:
"2012-05-15 00:49:02 192.168.100.10 POST /Microsoft-Server-ActiveSync/default.eas User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_ 443 redcloud\nikced 94.234.170.42 Apple-iPhone4C1/902.179 200 0 64 3140491"
Since it's a log file, the regex should be able to handle any line of a similar type.
In this case, the preferred output to a collection should be:
2012-05-15
00:49:02
192.168.100.10
/Microsoft-Server-ActiveSync/default.eas
User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_
443
redcloud\nikced
94.234.170.42
Apple-iPhone4C1/902.179
200
0
64
3140491
I'd appreciate any answer using C#, .NET and Regex to extract the above substrings into a collection (MatchCollection preferred). All log lines follow the same format and pattern.

Incredibly complex regex incoming:
logFile.Split(' ');

This will give you an array that you can iterate through to retrieve all of the "lines" which are separated by a space
string[] lines = log.Split(' ');

You don't need to use a Regex. You can simply use String.Split Method, and specify space as separator:
string [] substrings = line.Split(new Char [] {' '});
If you need to identify the kind of each part, then you should specify what you need to find, and a regex can be created for it.
Anyway, if you really want to use a Regex, do this:
Regex re = new Regex (@"(?:(?<s>[^ ]+)(?: |$))*");
This will give you all the captures in the "s" group when you call the Match method.
As the OP pointed out in a comment, the separator can be something other than a single space, so the possible separators should be included in the (?: |$) and the [^ ] parts of the expression. I.e. if space as well as tab are possible separators, replace those parts with (?: |\t|$) and [^ \t]. If you need to accept more than one of those characters in a row as a separator, add a + after the () group:
(?:(?<s>[^ \t]+)(?: |\t|$)+)*

The fastest and most obvious way is to use String.Split:
string[] substrings = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
But if you insist on a MatchCollection then this will do what you want:
MatchCollection substrings = Regex.Matches(line, @"\S+");
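The two options can be compared side by side. The sketch below runs both on a shortened version of the log line from the question (the shortening is mine, to keep the example readable); the class and method names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, named here only for illustration.
public static class LogTokenizer
{
    // Whitespace split: a null separator array means "split on any whitespace".
    public static string[] WithSplit(string line) =>
        line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

    // Regex variant: every run of non-whitespace characters is one match.
    public static string[] WithRegex(string line) =>
        Regex.Matches(line, @"\S+").Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // Shortened sample line, with a double space to show empty entries are dropped.
        string line = "2012-05-15 00:49:02  192.168.100.10 POST /Microsoft-Server-ActiveSync/default.eas 443 200";
        Console.WriteLine(WithSplit(line).Length); // 7
        Console.WriteLine(WithRegex(line).Length); // 7
    }
}
```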

Really, you just need to break this down into the parts.
First, the date. Will it always be in YYYY-MM-DD format? Could it be different based on region/culture settings?
(?<LogDate>\d\d\d\d-\d\d-\d\d)
Next, you have the time. Same thing:
(?<LogTime>\d\d:\d\d:\d\d)
Next, I'm assuming this is the web method that was actually called? Not entirely sure, since you haven't really explained how the data is laid out. However, I'm assuming it's going to be either POST or GET, so that's what we're going to do next...
(?<LogMethod>POST|GET)
Just do this for every part of the log line you're interested in, and you'll be set. For example:
(?<LogDate>\d\d\d\d-\d\d-\d\d) (?<LogTime>\d\d:\d\d:\d\d) (?<LogMethod>POST|GET)...
If you want to anchor to the start/end of the line, be sure to use ^ and $ respectively. When you get the Matches, you can get the values from each group by indexing the Groups property with the named group (such as match.Groups["LogMethod"].Value). Good luck!
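Put together, the named groups might be used as in the sketch below. In the sample line from the question an address sits between the time and the method, so this sketch adds a group for it; the Address and Path group names, like the class and method names, are assumptions made for the example and not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named here only for illustration.
public static class LogLineParser
{
    // LogDate, LogTime and LogMethod follow the answer; Address and Path are
    // hypothetical extra groups so the pattern fits the sample line.
    private static readonly Regex Line = new Regex(
        @"^(?<LogDate>\d{4}-\d{2}-\d{2}) (?<LogTime>\d{2}:\d{2}:\d{2}) " +
        @"(?<Address>\S+) (?<LogMethod>POST|GET) (?<Path>\S+)");

    public static Match Parse(string line) => Line.Match(line);

    public static void Main()
    {
        Match m = Parse("2012-05-15 00:49:02 192.168.100.10 POST /Microsoft-Server-ActiveSync/default.eas 443");
        if (m.Success)
        {
            Console.WriteLine(m.Groups["LogDate"].Value);   // 2012-05-15
            Console.WriteLine(m.Groups["LogMethod"].Value); // POST
        }
    }
}
```

The remaining fields could be added the same way, one named group per field.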

Regex to match multiple strings with positive look behind

So I have been trying to combine the answers of these two questions:
C# split string but keep split chars\seperators
Regex to match multiple strings
Essentially I'd like to be able to split a string around certain strings and have the splitting strings in the output array of Regex.Split() as well. Here is what I have tried so far:
// ** I'd also like to have UNION ALL but not sure how to add that
private const string CompoundSelectRegEx = @"(?<=[\b(UNION|INTERSECT|EXCEPT)\b])";
string sql = "SELECT TOP 5 * FROM Persons UNION SELECT TOP 5 * FROM Persons INTERSECT SELECT TOP 5 * FROM Persons EXCEPT SELECT TOP 5 * FROM Persons";
string[] strings = Regex.Split(sql, CompoundSelectRegEx);
The problem is that it starts matching individual characters like E and U so I get an incorrect array of strings.
I'd also like to match around UNION ALL, but since that's not just a single word but a string, I wasn't sure how to add it to the above regex, so if someone could point me in the right direction there as well that would be great!
Thanks!

If you want to split on those words and include them in the results simply alternate on them and place them in a group. There's no need for look-arounds. This pattern should fit your needs:
string pattern = @"\b(UNION(?:\sALL)?|INTERSECT|EXCEPT)\b";
The (?:\sALL)? makes the word ALL optionally matched. The (?:...) part means match but don't capture the specified pattern. The trailing ? at the end of the group makes it optional. If you want to trim the results you could add a \s* at the end of the pattern.
Be aware that this might work for simple SQL statements, but once you start dealing with nested queries the above approach will probably break down. At that point a regex might not be the best solution and you should develop a parser instead.
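As a rough illustration of the pattern in use, the sketch below passes it to Regex.Split with \s* added on both sides so the pieces come back trimmed. The leading \s* is my addition, the answer only mentions the trailing one. The class name, method name and sample SQL are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named here only for illustration.
public static class SqlSplitter
{
    // The capturing group makes Regex.Split return the keywords as separate elements;
    // the surrounding \s* swallows the whitespace around each keyword.
    public static string[] SplitCompound(string sql) =>
        Regex.Split(sql, @"\s*\b(UNION(?:\sALL)?|INTERSECT|EXCEPT)\b\s*");

    public static void Main()
    {
        string sql = "SELECT 1 UNION ALL SELECT 2 EXCEPT SELECT 3";
        foreach (string part in SplitCompound(sql))
            Console.WriteLine("[" + part + "]");
        // [SELECT 1]
        // [UNION ALL]
        // [SELECT 2]
        // [EXCEPT]
        // [SELECT 3]
    }
}
```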

Parsing commas and quotemarks in degenerate CSV files with Regular Expressions

I need to parse string inputs where the columns are separated by commas and any field that contains a comma in the data is wrapped in quotes (comma separated, quoted text identifiers). For this project I need to remove the quotes and any commas that occur between pairs of quotes. Basically, I need to remove commas and quotes that are contained in fields while preserving the commas that are used to separate the fields. Here's a little code I put together that handles the simple scenario:
// Sample input 1: This works and covers 99% of the records that I need to parse.
string str1 = "an_email_address@somewhere.com,2010/03/27 12:2:02,,some_first_name,some_last_name,,\"This Address Works, Suite 200\",Some City,TN,09876-5432,9795551212x123,XYZ";
str1 = Regex.Replace(str1, "\"([^\"^,]*),([^\"^,]*)\"", "$1$2");
Console.WriteLine(str1);
// Outputs: an_email_address@somewhere.com,2010/03/27 12:2:02,,some_first_name,some_last_name,,This Address Works Suite 200,Some City,TN,09876-5432,9795551212x123,XYZ
Although this code works for most of my records, it doesn't work when a field contains more than one comma. What I would like to do is modify the code so that it removes each instance of a comma contained within the column, no matter how many commas there are in the field. I don't want to hard code handling only 2 commas, or 3 commas, or 25 commas. The code should just remove all the commas in the field. Below is an example of what my code doesn't handle properly.
// Sample input 2: This doesn't work since there is more than 1 comma between the quotes.
string str2 = "an_email_address@somewhere.com,2010/03/27 12:2:02,,some_first_name,some_last_name,,\"i,l,k,e, c,o,m,m,a,s, i,n ,m,y, f,i,e,l,d\",Some City,TN,09876-5432,9795551212x123,XYZ";
str2 = Regex.Replace(str2, "\"([^\"^,]*),([^\"^,]*)\"", "$1$2");
Console.WriteLine(str2);
// Desired output: an_email_address@somewhere.com,2010/03/27 12:2:02,,some_first_name,some_last_name,,i like commas in my field,Some City,TN,09876-5432,9795551212x123,XYZ
How can I accomplish this with regular expressions?

Matching quotes and regular expressions don't go hand in hand, and you are probably better off using a CSV parser, as Michael Madsen suggested.
However, if you know the quotes only occur as you expect, you can do something like the following:
str2 = Regex.Replace(str2, "\"[^\"]*\"",
match => match.Value.Trim('\"').Replace(",", ""));

Here's a pure regex version:
str2 = Regex.Replace(str2,
@"""|,(?=(?>[^""]*""[^""]*(?:""[^""]*""[^""]*)*)$)",
String.Empty);
It matches any quotation mark, or a comma if it's followed by an odd number of quotation marks, and replaces it with nothing.
I would only go this route if I absolutely had to, for example if I were working with a framework that only let me specify the regex and the replacement string. Otherwise, I would either go with @Kobi's approach (because it's so much more readable) or use a dedicated CSV processor. They're not hard to find.
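To check that the evaluator approach and the pure regex approach agree, both can be run on the same line. This sketch uses a short made-up record rather than the full sample from the question, and assumes quotes are always balanced and never escaped; the class and method names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named here only for illustration.
public static class CsvCleaner
{
    // Evaluator approach: for each quoted field, drop the quotes and the inner commas.
    public static string WithEvaluator(string line) =>
        Regex.Replace(line, "\"[^\"]*\"",
            m => m.Value.Trim('"').Replace(",", ""));

    // Pure regex approach: remove every quote, and every comma that is followed
    // by an odd number of quotes (which means the comma sits inside a quoted field).
    public static string WithLookahead(string line) =>
        Regex.Replace(line,
            @"""|,(?=(?>[^""]*""[^""]*(?:""[^""]*""[^""]*)*)$)",
            string.Empty);

    public static void Main()
    {
        string line = "a,\"x,y,z\",b";
        Console.WriteLine(WithEvaluator(line)); // a,xyz,b
        Console.WriteLine(WithLookahead(line)); // a,xyz,b
    }
}
```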
