Regular expression to remove whitespace around a comma, except when quoted - c#

I have a CSV file that has rows resembling this:
1, 4, 2, "PUBLIC, JOHN Q" ,ACTIVE , 1332
I am looking for a regular expression replacement that will match against these rows and spit out something resembling this:
1,4,2,"PUBLIC, JOHN Q",ACTIVE,1332
I thought this would be rather easy: I made the expression ([ \t]+,) and replaced it with ,. I made a complement expression (,[ \t]+) with a replacement of , and I thought I had achieved a good means of right-trimming and left-trimming strings.
...but then I noticed that my "PUBLIC, JOHN Q" was now "PUBLIC,JOHN Q" which isn't what I wanted. (Note the space following the comma is now gone).
What would be the appropriate expression to trim the white space before and after a comma, but leave quoted text untouched?
UPDATE
To clarify, I am using an application to handle the file. This application allows me to define multiple regular expression replacements; it does not provide a parsing capability. While this may not be the ideal mechanism for this, it would sure beat making another application for this one file.

If the engine used by your tool is the C# regular expression engine, then you can try the following expression:
(?<!,\s*"(?:[^\\"]|\\")*)\s+(?!(?:[^\\"]|\\")*"\s*,)
replace with empty string.
The other answers assume the quotes are balanced and use counting to determine whether a space is part of a quoted value.
My expression looks for all spaces that are not part of a quoted value.

Something like this might do the job:
(?<!(^[^"]*"[^"]*(("[^"]*){2})*))[\t ]*,[ \t]*
Which matches [\t ]*,[ \t]*, only when not preceded by an odd number of quotes.
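A minimal sketch of how this expression might be applied from C#, using the sample row from the question. The class and method names (QuoteAwareTrim, TrimAroundCommas) are made up for illustration, and the sketch assumes each row is processed on its own, since the pattern anchors on the start of the string:

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteAwareTrim
{
    // A comma plus any surrounding spaces/tabs, unless an odd number of
    // double quotes precedes it (meaning we are inside a quoted value).
    // Relies on .NET's support for variable-length look-behind.
    private static readonly Regex CommaWithPadding = new Regex(
        @"(?<!(^[^""]*""[^""]*((""[^""]*){2})*))[\t ]*,[ \t]*");

    // Hypothetical helper: collapses " , " to "," outside quoted values.
    public static string TrimAroundCommas(string row)
    {
        return CommaWithPadding.Replace(row, ",");
    }

    public static void Main()
    {
        string row = "1, 4, 2, \"PUBLIC, JOHN Q\" ,ACTIVE , 1332";
        Console.WriteLine(TrimAroundCommas(row));
    }
}
```

For the sample row this should leave the comma and space inside "PUBLIC, JOHN Q" untouched while tightening every other comma.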

Going with a CSV library or parsing the file yourself would be much easier, and IMO should be the preferred option here.
But if you really insist on a regex, you can use this one:
"\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)"
And replace it with the empty string, "".
This regex matches one or more whitespace characters that are followed by an even number of quotes. This will of course work only if your quotes are balanced.
(?x) # Ignore Whitespace
\s+ # One or more whitespace characters
(?= # Followed by
( # A group - This group captures even number of quotes
[^\"]* # Zero or more non-quote characters
\" # A quote
[^\"]* # Zero or more non-quote characters
\" # A quote
)* # Zero or more repetition of previous group
[^\"]* # Zero or more non-quote characters
$ # Till the end
) # Look-ahead end
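As a rough illustration, the same look-ahead could be used from C# like this. The names OutsideQuotes and Strip are invented for the sketch, and it assumes one row per call with balanced quotes:

```csharp
using System;
using System.Text.RegularExpressions;

public static class OutsideQuotes
{
    // Whitespace that is followed by an even number of double quotes
    // up to the end of the string, i.e. whitespace outside quoted values.
    private static readonly Regex WhitespaceOutsideQuotes = new Regex(
        @"\s+(?=(?:[^""]*""[^""]*"")*[^""]*$)");

    // Hypothetical helper: deletes every whitespace run outside quotes.
    public static string Strip(string row)
    {
        return WhitespaceOutsideQuotes.Replace(row, "");
    }

    public static void Main()
    {
        string row = "1, 4, 2, \"PUBLIC, JOHN Q\" ,ACTIVE , 1332";
        Console.WriteLine(Strip(row));
    }
}
```

Note that this removes all whitespace outside quotes, not only the whitespace next to a comma, so an unquoted field such as JOHN SMITH would become JOHNSMITH.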

string format(string val)
{
if (val.StartsWith("\"")) val = " " + val;
string[] vals = val.Split('\"');
for (int i = 0; i < vals.Length; i += 2) vals[i] = vals[i].Replace(" ", "").Replace("\t", "");
return string.Join("\"", vals); // re-insert the quotes that Split removed
}
This works as long as every quoted string is properly closed.

Forget the regex (See Bart's comment on the question, regular expressions aren't suitable for CSV).
public static string ReduceSpaces( string input )
{
char[] a = input.ToCharArray();
int placeComma = 0, placeOther = 0;
bool inQuotes = false;
bool followedComma = true;
foreach( char c in a ) {
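inQuotes ^= (c == '\"'); // toggle on every double quote
if (c == ' ' && !inQuotes) {
if (!followedComma)
a[placeOther++] = c;
}
else if (c == ',' && !inQuotes) { // commas inside quotes are ordinary characters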
a[placeComma++] = c;
placeOther = placeComma;
followedComma = true;
}
else {
a[placeOther++] = c;
placeComma = placeOther;
followedComma = false;
}
}
return new String(a, 0, placeComma);
}
Demo: http://ideone.com/NEKm09


Regex for alphanumeric, at least 1 number and special chars

I am trying to find a regex which will give me the following validation:
The string should contain at least 1 digit and at least 1 special character. Alphanumeric characters are allowed.
I tried the following but this fails:
@"^[a-zA-Z0-9@#$%&*+\-_(),+':;?.,!\[\]\s\\/]+$]"
I tried "password1$" but that failed
I also tried "Password1!" but that also failed.
Any ideas?
UPDATE
Need the solution to work with C# - currently the suggestions posted as of Oct 22 2013 do not appear to work.
Try this:
Regex rxPassword = new Regex( @"
^ # start-of-line, followed by
(?=.*[0-9]) # a look-ahead requiring at least one decimal digit
(?=.*[!@#]) # a look-ahead requiring at least one of the special characters
(?=.*[a-zA-Z]) # a look-ahead requiring at least one upper- or lower-case letter
[a-zA-Z0-9!@#]+ # a sequence of one or more characters drawn from the set consisting of ASCII letters, digits or the punctuation characters ! @ and #
$ # followed by end-of-line
" , RegexOptions.IgnorePatternWhitespace ) ;
The construct (?=regular-expression) is a zero-width positive look-ahead assertion.
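One common way to express "at least one of each" requirements is a set of zero-width look-ahead assertions anchored at the start of the string. Here is a small sketch of that idea; the allowed special characters (! @ # $ %) and the names PasswordCheck and IsValid are assumptions for illustration, not the asker's full rule set:

```csharp
using System;
using System.Text.RegularExpressions;

public static class PasswordCheck
{
    // Assumed rule set: only letters, digits and ! @ # $ % are allowed,
    // with at least one digit and at least one special character.
    private static readonly Regex Rule = new Regex(
        @"^(?=.*[0-9])(?=.*[!@#$%])[a-zA-Z0-9!@#$%]+$");

    public static bool IsValid(string candidate)
    {
        return candidate != null && Rule.IsMatch(candidate);
    }

    public static void Main()
    {
        foreach (var s in new[] { "password1$", "Password1!", "password" })
            Console.WriteLine("{0}: {1}", s, IsValid(s));
    }
}
```

Each look-ahead scans the whole input without consuming it, so the order of the digit and the special character does not matter; the final character class then rejects anything outside the allowed set.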
Sometimes it's a lot simpler to do things one step at a time. The static constructor builds the escaped character class characters from a simple list of allowed special characters. The built-in Regex.Escape method doesn't work here.
public static class PasswordValidator {
private const string ALLOWED_SPECIAL_CHARS = @"@#$%&*+_()':;?.,![]\-";
private static string ESCAPED_SPECIAL_CHARS;
static PasswordValidator() {
var escapedChars = new List<char>();
foreach (char c in ALLOWED_SPECIAL_CHARS) {
if (c == '[' || c == ']' || c == '\\' || c == '-')
escapedChars.AddRange(new[] { '\\', c });
else
escapedChars.Add(c);
}
ESCAPED_SPECIAL_CHARS = new string(escapedChars.ToArray());
}
public static bool IsValidPassword(string input) {
// Length requirement?
if (input.Length < 8) return false;
// First just check for a digit
if (!Regex.IsMatch(input, @"\d")) return false;
// Then check for special character
if (!Regex.IsMatch(input, "[" + ESCAPED_SPECIAL_CHARS + "]")) return false;
// Require a letter?
if (!Regex.IsMatch(input, "[a-zA-Z]")) return false;
// DON'T allow anything else:
if (Regex.IsMatch(input, @"[^a-zA-Z\d" + ESCAPED_SPECIAL_CHARS + "]")) return false;
return true;
}
}
This may work. There are two possible cases: the digit comes before the special character, or the digit comes after it. You should use DOTALL mode (RegexOptions.Singleline in .NET), in which the dot matches every character.
^((.*?[0-9].*?[@#$%&*+\-_(),+':;?.,!\[\]\s\\/].*)|(.*?[@#$%&*+\-_(),+':;?.,!\[\]\s\\/].*?[0-9].*))$
This worked for me:
@"(?=^[!@#$%\^&*()_-+=[{]};:<>|./?a-zA-Z\d]{8,}$)(?=([!@#$%\^&*()_-+=[{]};:<>|./?a-zA-Z\d]\W+){1,})(?=[^0-9][0-9])[!@#$%\^&*()_-+=[{]};:<>|./?a-zA-Z\d]*$"
alphanumeric, at least 1 numeric, and special character with a min length of 8
This should do the work
(?:(?=.*[0-9]+)(?=.*[a-zA-Z]+)(?=.*[@#$%&*+\-_(),+':;?.,!\[\]\s\\/]+))+
Tested with JavaScript; I'm not sure about C#, so it may need some small adjustments.
It uses positive lookahead assertions to find the required elements of the password.
EDIT
Regular expression is designed to test if there are matches. Since all the patterns are lookahead, no real characters get captured and matches are empty, but if the expression "match", then the password is valid.
But since the question is about C# (sorry, I don't know C#; I'm just improvising and adapting samples):
string input = "password1!";
string pattern = @"^(?:(?=.*[0-9]+)(?=.*[a-zA-Z]+)(?=.*[@#$%&*+\-_(),+':;?.,!\[\]\s\\/]+))+.*$";
Regex rgx = new Regex(pattern, RegexOptions.None);
MatchCollection matches = rgx.Matches(input);
if (matches.Count > 0) {
Console.WriteLine("{0} ({1} matches):", input, matches.Count);
foreach (Match match in matches)
Console.WriteLine(" " + match.Value);
}
With a start-of-line anchor added at the front and .*$ added at the end, the expression matches if the password is valid, and the match value will be the password (I guess).

Do not match opening and closing parenthesis when a character sequence appears in middle

Got an interesting problem here for everyone to consider:
I am trying to parse and tokenize strings delimited by a "/" character, but only when the "/" is not between parentheses.
For instance:
Root/Branch1/branch2/leaf
Should be tokenized as: "Root", "Branch1", "branch2", "leaf"
Root/Branch1(subbranch1/subbranch2)/leaf
Should be tokenized as: "Root", "Branch1(subbranch1/subbranch2)", "leaf"
Root(branch1/branch2) text (branch3/branch4) text/Root(branch1/branch2)/Leaf
Should be tokenized as: "Root(branch1/branch2) text (branch3/branch4) text", "Root(branch1/branch2)", "Leaf".
I came up with the following expression which works great for all cases except ONE!
([^/()]*\((?<=\().*(?=\))\)[^/()]*)|([^/()]+)
The only case where this does not work is the following test condition:
Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf
This should be tokenized as: "Root(branch1/branch2)", "SubRoot", "SubRoot(branch3/branch4)", "Leaf"
The result I get instead consists of only one group that matches the whole line so it is not tokenizing it at all:
"Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf"
What is happening here is that because Regex is greedy it is matching the left most opening parenthesis "(" with the last closing parenthesis ")" instead of just knowing to stop at its appropriate delimiter.
Can any of you Regex gurus out there help me figure out how to add a small Regex piece to my existing expression to handle this additional case?
Root(branch1/branch2) Test (branch3/branch4)/SubRoot/SubRoot(branch5/branch6)/Leaf
Should be tokenized into groups as:
"Root(branch1/branch2) Test (branch3/branch4)"
"SubRoot"
"SubRoot(branch5/branch6)"
"Leaf"
List<string> Tokenize(string strInput)
{
var sb = new StringBuilder();
var tokens = new List<string>();
bool inParens = false;
foreach (var c in strInput)
{
if (inParens)
{
// Inside parentheses: keep every character, including '/'.
sb.Append(c);
if (c == ')')
inParens = false;
}
else if (c == '(')
{
// Keep the opening parenthesis as part of the token.
sb.Append(c);
inParens = true;
}
else if (c == '/')
{
// A '/' outside parentheses ends the current token.
tokens.Add(sb.ToString());
sb.Length = 0;
}
else
sb.Append(c);
}
if (sb.Length > 0)
tokens.Add(sb.ToString());
return tokens;
}
That's untested but it should work. (and will almost certainly be much faster than the regex)
Different approach, trying to avoid costly look-around assertions...
/(\(.+?\)|[^\/(]+)+/
With some comments...
/
( # group things to be captured
\(.+?\) # 1 or more of anything in (escaped) brackets, un-greedily
| # or ...
[^\/(]+ # 1 or more, not slash, and not open bracket characters
)+ # repeat until done...
/
The following uses balanced groups to capture each matching item with Regex.Matches, ensuring the closing / isn't matched when the brackets before it haven't balanced:
(?<=^|/)((?<br>\()|(?<-br>\))|[^()])*?(?(br)(?!))(?=$|/)
Bizarrely, this seems to perform similarly to Billy Moon's much simpler answer, even though this is overengineered (supporting multiple, possibly nested sets of brackets per token).
The following does something similar, but splits the string with Regex.Split (linebreaks added for clarity):
(?<=^(?(brb)(?!))(?:(?<-brb>\()|(?<brb>\))|[^()])*)
/
(?=(?:(?<bra>\()|(?<-bra>\))|[^()])*(?(bra)(?!))$)
This matches "any / where any brackets between the start of the string and the / are balanced, and any bracket between the / and the end of the string are balanced".
Note that in the lookbehind, the brb captures appear in reverse order from before - this is because a lookbehind apparently works right-to-left. (Thanks to Kobi for the answer that taught me this.)
This is much slower than the match version, but I wanted to work out how to do it anyway.
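A sketch of how the match-based expression could be driven from C# with Regex.Matches, run against two test strings from the question. The wrapper names (PathTokenizer, Tokenize) are invented for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PathTokenizer
{
    // Balancing group "br" is pushed on '(' and popped on ')'.
    // (?(br)(?!)) rejects a token end while any '(' is still open.
    private static readonly Regex Token = new Regex(
        @"(?<=^|/)((?<br>\()|(?<-br>\))|[^()])*?(?(br)(?!))(?=$|/)");

    // Hypothetical helper: returns each matched token in order.
    public static List<string> Tokenize(string path)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(path))
            tokens.Add(m.Value);
        return tokens;
    }

    public static void Main()
    {
        var inputs = new[]
        {
            "Root/Branch1/branch2/leaf",
            "Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf"
        };
        foreach (var input in inputs)
            Console.WriteLine(string.Join(" | ", Tokenize(input)));
    }
}
```

Because the quantifier is lazy, each token stops at the first "/" where the bracket stack is empty, which is what keeps "SubRoot" from being swallowed into its neighbours.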

C# Regex To Escape Certain Characters

How can I escape certain characters in a string with a C# Regex?
This is a test for % and ' thing? -> This is a test for \% and \' thing?
resultString = Regex.Replace(subjectString,
@"(?<! # Match a position before which there is no
(?<!\\) # odd number of backslashes
\\ # (it's odd if there is one backslash,
(?:\\\\)* # followed by an even number of backslashes)
)
(?=[%']) # and which is followed by a % or a '",
@"\", RegexOptions.IgnorePatternWhitespace);
However, if you're trying to protect yourself against malevolent SQL queries, regex is not the right way to go.
var escapedString = Regex.Replace(input, @"[%']", @"\$0");
This is pretty much all you need. Inside the square brackets, you should put every character you wish to escape with a backslash, which may include the backslash character itself.
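If the substitution syntax feels error-prone, a match evaluator is another way to write the same replacement. This is only a sketch; the names Escaper and EscapeSpecials are hypothetical, and the character class holds just the two characters from the question:

```csharp
using System;
using System.Text.RegularExpressions;

public static class Escaper
{
    // Prefixes every % or ' with a backslash. The lambda receives each
    // match and returns its replacement text, so no $-substitution
    // syntax is involved.
    public static string EscapeSpecials(string input)
    {
        return Regex.Replace(input, @"[%']", m => "\\" + m.Value);
    }

    public static void Main()
    {
        Console.WriteLine(EscapeSpecials("This is a test for % and ' thing?"));
    }
}
```

To escape further characters, add them to the character class; a backslash itself would need to be written as \\ inside the verbatim pattern.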
I don't think this could be done with regex in good fashion, but you can simply run a for loop:
var specialChars = new char[]{'%',....};
var stream = "";
for (int i=0;i<myStr.Length;i++)
{
if (specialChars.Contains(myStr[i]))
{
stream+= '\\';
}
stream += myStr[i];
}
Note: you can use a StringBuilder to avoid creating too many intermediate strings.

How to parse a comma delimited string when comma and parenthesis exists in field

I have this string in C#
adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO
I want to use a RegEx to parse it to get the following:
adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO
In addition to the above example, I tested with the following, but am still unable to parse it correctly.
"%exc.uns: 8 hours let # = ABC, DEF", "exc_it = 1 day" , " summ=graffe ", " a,b,(c,d)"
The new text will be in one string
string mystr = @"""%exc.uns: 8 hours let # = ABC, DEF"", ""exc_it = 1 day"" , "" summ=graffe "", "" a,b,(c,d)""";
string str = "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
var resultStrings = new List<string>();
int? firstIndex = null;
int scopeLevel = 0;
for (int i = 0; i < str.Length; i++)
{
if (str[i] == ',' && scopeLevel == 0)
{
resultStrings.Add(str.Substring(firstIndex.GetValueOrDefault(), i - firstIndex.GetValueOrDefault()));
firstIndex = i + 1;
}
else if (str[i] == '(') scopeLevel++;
else if (str[i] == ')') scopeLevel--;
}
resultStrings.Add(str.Substring(firstIndex.GetValueOrDefault()));
Even faster:
([^,]*\x28[^\x29]*\x29|[^,]+)
That should do the trick. Basically, look for either a "function thumbprint" or anything without a comma.
adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO
                  ^                   ^  ^      ^                  ^
The Carets symbolize where the grouping stops.
Just this regex:
[^,()]+(\([^()]*\))?
A test example:
var s= "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
Regex regex = new Regex(@"[^,()]+(\([^()]*\))?");
var matches = regex.Matches(s)
.Cast<Match>()
.Select(m => m.Value);
returns
adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO
If you simply must use Regex, then you can split the string on the following:
, # match a comma
(?= # that is followed by
(?: # either
[^\(\)]* # no parens at all
| # or
(?: #
[^\(\)]* # ...
\( # (
[^\(\)]* # stuff in parens
\) # )
[^\(\)]* # ...
)+ # any number of times
)$ # until the end of the string
)
It breaks your input into the following:
adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO
You can also use .NET's balanced grouping constructs to create a version that works with nested parens, but you're probably just as well off with one of the non-Regex solutions.
Another way to implement what Snowbear was doing:
public static string[] SplitNest(this string s, char src, string nest, string trg)
{
int scope = 0;
if (trg == null || nest == null) return null;
if (trg.Length == 0 || nest.Length < 2) return null;
if (trg.IndexOf(src) >= 0) return null;
if (nest.IndexOf(src) >= 0) return null;
for (int i = 0; i < s.Length; i++)
{
if (s[i] == src && scope == 0)
{
s = s.Remove(i, 1).Insert(i, trg);
}
else if (s[i] == nest[0]) scope++;
else if (s[i] == nest[1]) scope--;
}
return s.Split(new[] { trg }, StringSplitOptions.None); // the string[] overload also exists on .NET Framework
}
The idea is to replace any non-nested delimiter with another delimiter that you can then use with an ordinary string.Split(). You can also choose what type of bracket to use - (), <>, [], or even something weird like \/, ][, or `'. For your purposes you would use
string str = "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
string[] result = str.SplitNest(',',"()","~");
The function would first turn your string into
adj_con(CL2,1,3,0)~adj_cont(CL1,1,3,0)~NG~ NG/CL~ 5 value of CL(JK)~ HO
then split on the ~, ignoring the nested commas.
Assuming non nested, matching parentheses, you can easily match the tokens you want instead of splitting the string:
MatchCollection matches = Regex.Matches(data, @"(?:[^(),]|\([^)]*\))+");
var s = "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
var result = string.Join("\n", Regex.Split(s, @"(?<=\)),|,\s"));
The pattern matches for ) and excludes it from the match then matches ,
or
matches , followed by a space.
result =
adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO
The TextFieldParser (msdn) class seems to have the functionality built-in:
TextFieldParser Class: - Provides methods and properties for parsing structured text files.
Parsing a text file with the TextFieldParser is similar to iterating over a text file, while the ReadFields method to extract fields of text is similar to splitting the strings.
The TextFieldParser can parse two types of files: delimited or fixed-width. Some properties, such as Delimiters and HasFieldsEnclosedInQuotes are meaningful only when working with delimited files, while the FieldWidths property is meaningful only when working with fixed-width files.
See the article which helped me find that
Here's a stronger option, which parses the whole text, including nested parentheses:
string pattern = @"
\A
(?>
(?<Token>
(?:
[^,()] # Regular character
|
(?<Paren> \( ) # Opening paren - push to stack
|
(?<-Paren> \) ) # Closing paren - pop
|
(?(Paren),) # If inside parentheses, match comma.
)*?
)
(?(Paren)(?!)) # If we are not inside parentheses,
(?:,|\Z) # match a comma or the end
)*? # lazy just to avoid an extra empty match at the end,
# though it removes a last empty token.
\Z
";
Match match = Regex.Match(data, pattern, RegexOptions.IgnorePatternWhitespace);
You can get all matches by iterating over match.Groups["Token"].Captures.

.Net Removing all the first 0 of a string

I got the following :
01.05.03
I need to convert that to 1.5.3
The problem is I cannot only trim the 0 because if I got :
01.05.10
I need to convert that to 1.5.10
So, what's the best way to solve that problem? Regex? If so, is there any regex example that does that?
Expanding on the answer of @FrustratedWithFormsDesigner:
string Strip0s(string s)
{
return string.Join<int>(".", from x in s.Split('.') select int.Parse(x));
}
Regex-replace
(?<=^|\.)0+
with the empty string. The regex is:
(?<= # begin positive look-behind (i.e. "a position preceded by")
^|\. # the start of the string or a literal dot †
) # end positive look-behind
0+ # one or more "0" characters
† note that not all regex flavors support variable-length look-behind, but .NET does.
If you expect this kind of input: "00.03.03" and want to keep the leading zero in this case (like "0.3.3"), use this expression instead:
(?<=^|\.)0+(?=\d)
and again replace with the empty string.
From the comments (thanks Kobi): There is a more concise expression that does not require look-behind and is equivalent to my second suggestion:
\b0+(?=\d)
which is
\b # a word boundary (a position between a word char and a non-word char)
0+ # one or more "0" characters
(?=\d) # positive look-ahead: a position that's followed by a digit
This works because the 0 happens to be a word character, so word boundaries can be used to find the first 0 in a row. It is a more compatible expression, because many regex flavors do not support variable-length look-behind, and some (like JavaScript engines before ES2018) support no look-behind at all.
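For reference, a small sketch of the word-boundary expression in use from C#. The wrapper names (VersionCleaner, StripLeadingZeros) are made up for this example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class VersionCleaner
{
    // Removes zeros that start a run of digits, as long as at least one
    // digit remains after them (so "00" becomes "0", not "").
    public static string StripLeadingZeros(string version)
    {
        return Regex.Replace(version, @"\b0+(?=\d)", "");
    }

    public static void Main()
    {
        foreach (var v in new[] { "01.05.03", "01.05.10", "00.03.03" })
            Console.WriteLine("{0} -> {1}", v, StripLeadingZeros(v));
    }
}
```

The look-ahead is what protects the last zero of an all-zero part: the match backs off by one character so that a digit is still left to satisfy (?=\d).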
You could split the string on ., then trim the leading 0s on the results of the split, then merge them back together.
I don't know of a way to do this in a single operation, but you could write a function that hides this and makes it look like a single operation. ;)
UPDATE:
I didn't even think of the other guy's regex. Yeah, that will probably do it in a single operation.
Here's another way you could do what FrustratedWithFormsDesigner suggests:
string s = "01.05.10";
string s2 = string.Join(
".",
s.Split('.')
.Select(str => str.TrimStart('0'))
.ToArray()
);
This is almost the same as dtb's answer, but doesn't require that the substrings be valid integers (it would also work with, e.g., "000A.007.0HHIMARK").
UPDATE: If you'd want any strings consisting of all 0s in the input string to be output as a single 0, you could use this:
string s2 = string.Join(
".",
s.Split('.')
.Select(str => TrimLeadingZeros(str))
.ToArray()
);
public static string TrimLeadingZeros(string text) {
int number;
if (int.TryParse(text, out number))
return number.ToString();
else
return text.TrimStart('0');
}
Example input/output:
00.00.000A.007.0HHIMARK // input
0.0.A.7.HHIMARK // output
There's also the old-school way which probably has better performance characteristics than most other solutions mentioned. Something like:
static public string NormalizeVersionString(string versionString)
{
if(versionString == null)
throw new ArgumentNullException("versionString");
bool insideNumber = false;
StringBuilder sb = new StringBuilder(versionString.Length);
foreach(char c in versionString)
{
if(c == '.')
{
sb.Append('.');
insideNumber = false;
}
else if(c >= '1' && c <= '9')
{
sb.Append(c);
insideNumber = true;
}
else if(c == '0')
{
if(insideNumber)
sb.Append('0');
}
}
return sb.ToString();
}
string s = "01.05.10";
string newS = s.Replace(".0", ".");
newS = newS.StartsWith("0") ? newS.Substring(1, newS.Length - 1) : newS;
Console.WriteLine(newS);
NOTE: You will have to thoroughly check all possible input combinations.
This looks like it is a date format. If so, I would use date-processing code:
DateTime time = DateTime.Parse("01.02.03");
String newFormat = time.ToString("d.M.yy");
or even better
String newFormat = time.ToShortDateString();
which will respect your and your clients' culture settings.
If this data is not a date then don't use this :)
I had a similar requirement to parse a string with street addresses, where some of the house numbers had leading zeroes that I needed to remove while keeping the rest of the text intact. I slightly edited the accepted answer to meet my requirements; maybe someone will find it useful. It does basically the same as the accepted answer, except that I check whether each part of the string can be parsed as an integer and fall back to the original string value when it cannot:
string Strip0s(string s)
{
int outputValue;
return
string.Join(" ",
from x in s.Split(new[] { ' ' })
select int.TryParse(x, out outputValue) ? outputValue.ToString() : x);
}
Input: "Islands Brygge 34 B 07 TV"
Output: "Islands Brygge 34 B 7 TV"
