Using Regular Expressions for Pattern Finding with Replace - c#

I have a string in the following format in a comma delimited file:
someText, "Text with, delimiter", moreText, "Text Again"
What I need to do is create a method that will look through the string, and will replace any commas inside of quoted text with a dollar sign ($).
After the method, the string will be:
someText, "Text with$ delimiter", moreText, "Text Again"
I'm not very good with regex, but I would like to know how I can use regular expressions to search for a pattern (a comma between quotes) and then replace that comma with the dollar sign.

Personally, I'd avoid regexes here - assuming that there aren't nested quote marks, this is quite simple to write up as a for-loop, which I think will be more efficient:
var inQuotes = false;
var sb = new StringBuilder(someText.Length);
for (var i = 0; i < someText.Length; ++i)
{
if (someText[i] == '"')
{
inQuotes = !inQuotes;
}
if (inQuotes && someText[i] == ',')
{
sb.Append('$');
}
else
{
sb.Append(someText[i]);
}
}

This type of problem is where Regex fails, do this instead:
var sb = new StringBuilder(str);
var insideQuotes = false;
for (var i = 0; i < sb.Length; i++)
{
switch (sb[i])
{
case '"':
insideQuotes = !insideQuotes;
break;
case ',':
if (insideQuotes)
sb.Replace(',', '$', i, 1);
break;
}
}
str = sb.ToString();
You can also use a CSV parser to parse the string and write it again with replaced columns.

Here's how to do it with Regex.Replace:
string output = Regex.Replace(
input,
"\".*?\"",
m => m.ToString().Replace(',', '$'));
Of course, if you want to ignore escaped double quotes it gets more complicated. Especially when the escape character can itself be escaped.
Assuming the escape character is \, then when trying to match the double quotes, you'll want to match only quotation marks which are preceded by an even number of escape characters (including zero). The following pattern will do that for you:
string pattern = @"(?<=((^|[^\\])(\\\\){0,}))"".*?(?<=([^\\](\\\\){0,}))""";
At this point, you might prefer to abandon regular expressions ;)
UPDATE:
In reply to your comment, it is easy to make the operation configurable for different quotation marks, delimiters and placeholders.
string quote = "\"";
string delimiter = ",";
string placeholder = "$";
string output = Regex.Replace(
input,
Regex.Escape(quote) + ".*?" + Regex.Escape(quote), // escape in case the quote is a regex metacharacter
m => m.ToString().Replace(delimiter, placeholder));

If you'd like to go the regex route here's what you're looking for:
var result = Regex.Replace( text, "(\"[^,]*),([^,]*\")", "$1$$$2" );
The problem with regex in this case is that it won't catch "this, has, two commas".
See it working at http://refiddle.com/1ab

Can you give this a try: "[\w ],[\w ]" (double quotes included)?
And be careful with the replacement because direct replacement will remove the whole string enclosed in the double quotes.
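As a minimal sketch of that caution, the snippet below wraps the text on either side of the comma in capture groups so the replacement can put it back instead of discarding it. It assumes the intended pattern has `*` quantifiers on both character classes (an assumption, since the pattern above is written without them); `QuotedCommaDemo` and `ReplaceQuotedComma` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuotedCommaDemo
{
    // Replaces a single comma inside a quoted run of word characters and
    // spaces with '$'. The quantifiers (*) are an assumption about the
    // intended pattern; they are not in the pattern as posted.
    public static string ReplaceQuotedComma(string input)
    {
        // Group 1: opening quote plus the text before the comma.
        // Group 2: the text after the comma plus the closing quote.
        // In the replacement, "$$" is a literal dollar sign.
        return Regex.Replace(input, "(\"[\\w ]*),([\\w ]*\")", "$1$$$2");
    }
}
```

Calling `QuotedCommaDemo.ReplaceQuotedComma` on the sample line from the question should leave the unquoted commas alone and turn `"Text with, delimiter"` into `"Text with$ delimiter"`. A quoted field with two or more commas, or with punctuation other than spaces, is not matched by this pattern.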

Related

Replace specific repeating characters from a string

I have a string like "aaa\\\\\\\\test.txt".
How do I replace all the repeating \\ characters by a single \\?
I have tried
pPath = new Regex("\\{2,}").Replace(pPath, Path.DirectorySeparatorChar.ToString());
which matches on http://regexstorm.net/tester but doesn't seem to do the trick in my program.
I'm running this on Windows so the Path.DirectorySeparatorChar is a \\.
Use new Regex(@"\\{2,}") and the rest the same.
You need to actually leave the backslash escaped in your regular expression, so you need to produce a string with two backslashes in it. The two equivalent techniques to produce the correct C# string literal are @"\\{2,}" or "\\\\{2,}"
Both of those string literals are the string \\{2,}, which is the correct regular expression. Your regular expression calls for one backslash character occurring two or more times, and you have to escape the backslash character. At the risk of being pedantic, if you wanted to replace two a characters, you would use the regular expression a{2,} and if you want to replace two \ characters, you would use the regular expression \\{2,} because \\ is the regular expression that matches a single \. Clear as mud? :)
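To make the escaping concrete, here is a small sketch using the verbatim literal. The class and method names are hypothetical, and the replacement is hard-coded to a single backslash rather than Path.DirectorySeparatorChar so that it behaves the same on any platform.

```csharp
using System;
using System.Text.RegularExpressions;

static class BackslashDemo
{
    // Collapses every run of two or more backslashes into one backslash.
    public static string CollapseBackslashes(string path)
    {
        // The verbatim literal @"\\{2,}" hands the regex engine \\{2,}:
        // an escaped backslash repeated two or more times.
        // In a .NET replacement string only '$' is special, so @"\" is
        // a literal single backslash.
        return Regex.Replace(path, @"\\{2,}", @"\");
    }
}
```

With the non-verbatim literal "\\{2,}" the engine would instead receive \{2,}, which is an escaped brace followed by literal text, and that is why the original attempt never matched.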
Not being a demi-god at regex, I would use StringBuilder and do something like this:
string txt = "";
int count = 0;
StringBuilder bldr = new StringBuilder();
foreach(char c in txt)
{
if (c == '\\') // a backslash char literal must itself be escaped
{
count++;
if (count < 3)
{
bldr.Append(c);
}
}
else
{
count = 0;
bldr.Append(c);
}
}
string result = bldr.ToString();

Regular expression to remove whitespace around a comma, except when quoted

I have a CSV file that has rows resembling this:
1, 4, 2, "PUBLIC, JOHN Q" ,ACTIVE , 1332
I am looking for a regular expression replacement that will match against these rows and spit out something resembling this:
1,4,2,"PUBLIC, JOHN Q",ACTIVE,1332
I thought this would be rather easy: I made the expression ([ \t]+,) and replaced it with ,. I made a complement expression (,[ \t]+) with a replacement of , and I thought I had achieved a good means of right-trimming and left-trimming strings.
...but then I noticed that my "PUBLIC, JOHN Q" was now "PUBLIC,JOHN Q" which isn't what I wanted. (Note the space following the comma is now gone).
What would be the appropriate expression to trim the white space before and after a comma, but leave quoted text untouched?
UPDATE
To clarify, I am using an application to handle the file. This application allows me to define multiple regular expression replacements; it does not provide a parsing capability. While this may not be the ideal mechanism for this, it would sure beat making another application for this one file.
If the engine used by your tool is the C# regular expression engine, then you can try the following expression:
(?<!,\s*"(?:[^\\"]|\\")*)\s+(?!(?:[^\\"]|\\")*"\s*,)
replace with empty string.
The other answers assume the quotes are balanced and use counting to determine whether the space is part of a quoted value or not.
My expression looks for all spaces that are not part of a quoted value.
RegexHero Demo
Something like this might do the job:
(?<!(^[^"]*"[^"]*(("[^"]*){2})*))[\t ]*,[ \t]*
Which matches [\t ]*,[ \t]*, only when not preceded by an odd number of quotes.
Going with a CSV library or parsing the file yourself would be much easier, and IMO should be the preferable option here.
But if you really insist on a regex, you can use this one:
"\\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)"
And replace it with empty string - ""
This regex matches one or more whitespace characters that are followed by an even number of quotes. This will of course work only if you have balanced quotes.
(?x) # Ignore Whitespace
\s+ # One or more whitespace characters
(?= # Followed by
( # A group - This group captures even number of quotes
[^\"]* # Zero or more non-quote characters
\" # A quote
[^\"]* # Zero or more non-quote characters
\" # A quote
)* # Zero or more repetition of previous group
[^\"]* # Zero or more non-quote characters
$ # Till the end
) # Look-ahead end
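As a rough illustration, the look-ahead pattern above can be applied from C# as follows. This is a sketch under the same balanced-quotes assumption; `CsvWhitespaceDemo` and `StripUnquotedWhitespace` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class CsvWhitespaceDemo
{
    // Whitespace followed by an even number of quotes up to the end of the
    // input is outside any quoted value (assuming balanced quotes).
    private static readonly Regex UnquotedWhitespace =
        new Regex(@"\s+(?=([^""]*""[^""]*"")*[^""]*$)");

    // Removes every run of whitespace that lies outside quoted values.
    public static string StripUnquotedWhitespace(string row)
    {
        return UnquotedWhitespace.Replace(row, "");
    }
}
```

Note that this removes all whitespace outside quotes, not only the whitespace next to a comma, so an unquoted field such as `ACTIVE USER` would become `ACTIVEUSER`. For the sample row in the question the result should be `1,4,2,"PUBLIC, JOHN Q",ACTIVE,1332`.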
string format(string val)
{
if (val.StartsWith("\"")) val = " " + val;
string[] vals = val.Split('\"');
for (int i = 0; i < vals.Length; i += 2) vals[i] = vals[i].Replace(" ", "").Replace("\t", "");
return string.Join("\t", vals);
}
This will work if you have properly closed quoted strings in between
Forget the regex (See Bart's comment on the question, regular expressions aren't suitable for CSV).
public static string ReduceSpaces( string input )
{
char[] a = input.ToCharArray();
int placeComma = 0, placeOther = 0;
bool inQuotes = false;
bool followedComma = true;
foreach( char c in a ) {
inQuotes ^= (c == '\"');
if (c == ' ') {
if (!followedComma)
a[placeOther++] = c;
}
else if (c == ',') {
a[placeComma++] = c;
placeOther = placeComma;
followedComma = true;
}
else {
a[placeOther++] = c;
placeComma = placeOther;
followedComma = false;
}
}
return new String(a, 0, placeComma);
}
Demo: http://ideone.com/NEKm09

Replace char in a string

How can I change
XXX#YYY.ZZZ into XXX_YYY_ZZZ?
One way I know is to use the string.Replace(char, char) method,
but I want to replace both "#" and ".", and that method replaces just one character.
One more case: what if I have XX.X#YYY.ZZZ?
I still want the output to look like XX.X_YYY_ZZZ.
Is this possible? Any suggestions are welcome, thanks.
So, if I'm understanding correctly, you want to replace # with _, and . with _, but only if . comes after #? If there is a guaranteed # (assuming you're dealing with e-mail addresses?):
string e = "XX.X#YYY.ZZZ";
e = e.Substring(0, e.IndexOf('#')) + "_" + e.Substring(e.IndexOf('#')+1).Replace('.', '_');
Here's a complete regex solution that covers both your cases. The key to your second case is to match dots after the # symbol by using a positive look-behind.
string[] inputs = { "XXX#YYY.ZZZ", "XX.X#YYY.ZZZ" };
string pattern = @"#|(?<=#.*?)\.";
foreach (var input in inputs)
{
string result = Regex.Replace(input, pattern, "_");
Console.WriteLine("Original: " + input);
Console.WriteLine("Modified: " + result);
Console.WriteLine();
}
That said, this is simple enough to accomplish with a couple of string Replace calls. Which approach is more efficient depends on the text size and the number of replacements the code will make, so test it.
You can use the Regex.Replace method:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.replace(v=VS.90).aspx
You can use the following extension method to do your replacement without creating too many temporary strings (as occurs with Substring and Replace) or incurring regex overhead. It skips to the # symbol, and then iterates through the remaining characters to perform the replacement.
public static string CustomReplace(this string s)
{
var sb = new StringBuilder(s);
for (int i = Math.Max(0, s.IndexOf('#')); i < sb.Length; i++)
if (sb[i] == '#' || sb[i] == '.')
sb[i] = '_';
return sb.ToString();
}
You can chain Replace calls (note that this replaces every dot, including any before the #):
var newstring = "XX.X#YYY.ZZZ".Replace("#","_").Replace(".","_");
Create an array with characters you want to have replaced, loop through array and do the replace based off the index.
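A minimal sketch of that idea, with hypothetical names (`MultiReplaceDemo`, `ReplaceAll`), might look like this:

```csharp
using System;

static class MultiReplaceDemo
{
    // Replaces every occurrence of each character in targets with replacement.
    public static string ReplaceAll(string input, char[] targets, char replacement)
    {
        foreach (char target in targets)
        {
            // string.Replace(char, char) returns a new string each time.
            input = input.Replace(target, replacement);
        }
        return input;
    }
}
```

For example, `MultiReplaceDemo.ReplaceAll("XXX#YYY.ZZZ", new[] { '#', '.' }, '_')` should give `XXX_YYY_ZZZ`. Like the chained Replace approach, it replaces every dot, so it does not cover the XX.X#YYY.ZZZ case where dots before the # must be kept.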
Assuming the data format is like XX.X#YYY.ZZZ, here is another alternative with String.Split(char separator):
string[] tmp = "XX.X#YYY.ZZZ".Split('#');
string newstr = tmp[0] + "_" + tmp[1].Replace(".", "_");

How to use regex in C#, specifically getting the data inside double quotes?

For example, we just want to get the data inside double quotes from a list of strings:
String result = string.Empty;
List<string> strings = new List<string>();
strings.Add("string=\"string\"");
strings.Add("stringwithspace=\"string with space\"");
strings.Add("untrimmedstring=\" untrimmed string\"");
strings.Add("anotheruntrimmedstring=\"here it is \"");
strings.Add("number=\"13\"");
strings.Add("blank=\"\"");
strings.Add("whitespace=\" \"");
StringBuilder z = new StringBuilder();
foreach(string x in strings){
if(x has a string inside the doublequote){ //do the regex here
z.Append(x + "\n");
}
}
and the result would be:
string="string"
stringwithspace="string with space"
untrimmedstring="untrimmed string"
anotheruntrimmedstring="here it is"
number="13"
Any way we could do this?
Or are there some other easier and more optimized ways than using regex?
Thanks a lot.
If the string inside the double quotes is guaranteed not to contain a double-quote character, you can use the following regex to capture the trimmed string inside the double quotes.
^(\w+)="\s*([^"]*?)\s*"$
$2 or matches[2] will contain the trimmed string - make sure its length is not zero and replace the whole string with $1="$2" and append it (I don't speak c#)
This regex assumes that
The part before = is a single word consisting of only alphanumeric characters and underscores.
The string inside the double quotes is free of double quotes.
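Since that answer stops short of C#, here is a hedged sketch of how the approach could be written. It uses a lazy quantifier for the value so that trailing whitespace falls outside the capture, and it skips entries whose trimmed value is empty. `QuotedValueDemo` and `FilterAndTrim` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class QuotedValueDemo
{
    // Group 1: the key (word characters only).
    // Group 2: the value without surrounding whitespace; the lazy [^"]*?
    // lets the following \s* absorb any trailing whitespace.
    private static readonly Regex KeyValue =
        new Regex(@"^(\w+)=""\s*([^""]*?)\s*""$");

    // Returns key="trimmed value" lines, one per input line that has a
    // non-empty value once trimmed.
    public static string FilterAndTrim(IEnumerable<string> lines)
    {
        var sb = new StringBuilder();
        foreach (string line in lines)
        {
            Match m = KeyValue.Match(line);
            if (m.Success && m.Groups[2].Length > 0)
            {
                sb.Append(m.Groups[1].Value)
                  .Append("=\"")
                  .Append(m.Groups[2].Value)
                  .Append("\"\n");
            }
        }
        return sb.ToString();
    }
}
```

Passing the list from the question to `QuotedValueDemo.FilterAndTrim` should drop the `blank` and `whitespace` entries and trim the rest, under the same assumptions listed above (word-only keys, no embedded double quotes).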
This should help; replace [] with "". See: How to extract the contents of square brackets in a string of text in c# using Regex
var regex = new Regex(@"\""([^\""]*)\""");
foreach(string x in strings)
{
var matches = regex.Matches(x);
if (matches.Count > 0)
{
var matchedValue = matches[0].Groups[1].Value;
z.Append(matchedValue);
}
}

Strip double quotes from a string in .NET

I'm trying to match on some inconsistently formatted HTML and need to strip out some double quotes.
Current:
<input type="hidden">
The Goal:
<input type=hidden>
This is wrong because I'm not escaping it properly:
s = s.Replace(""","");
This is wrong because there is no blank character (to my knowledge):
s = s.Replace('"', '');
What is the syntax / escape character combination for replacing double quotes with an empty string?
I think your first line would actually work but I think you need four quotation marks for a string containing a single one (in VB at least):
s = s.Replace("""", "")
for C# you'd have to escape the quotation mark using a backslash:
s = s.Replace("\"", "");
I didn't see my thoughts repeated already, so I will suggest that you look at string.Trim in the Microsoft documentation for C#. You can pass a character to be trimmed instead of simply trimming whitespace:
string withQuotes = "\"hello\"";
string withOutQuotes = withQuotes.Trim('"');
This should result in withOutQuotes being "hello" instead of ""hello"". Note that Trim only removes quotes from the ends of the string, not from the middle.
s = s.Replace("\"", "");
You need to use the \ to escape the double quote character in a string.
You can use either of these:
s = s.Replace(@"""","");
s = s.Replace("\"","");
...but I do get curious as to why you would want to do that? I thought it was good practice to keep attribute values quoted?
s = s.Replace("\"",string.Empty);
c#: "\"", thus s.Replace("\"", "")
vb/vbs/vb.net: "" thus s.Replace("""", "")
If you only want to strip the quotes from the ends of the string (not the middle), and there is a chance that there can be spaces at either end of the string (i.e. parsing a CSV format file where there is a space after the commas), then you need to call the Trim function twice...for example:
string myStr = " \"sometext\""; //(notice the leading space)
myStr = myStr.Trim('"'); //(would leave the first quote: "sometext)
myStr = myStr.Trim().Trim('"'); //(would get what you want: sometext)
You have to escape the double quote with a backslash.
s = s.Replace("\"","");
s = s.Replace(@"""", "");
This worked for me
//Sentence has quotes
string nameSentence = "Take my name \"Wesley\" out of quotes";
//Get the index before the quotes
int begin = nameSentence.LastIndexOf("name") + "name".Length;
//Get the index after the quotes
int end = nameSentence.LastIndexOf("out");
//Get the part of the string with its quotes
string name = nameSentence.Substring(begin, end - begin);
//Remove its quotes
string newName = name.Replace("\"", "");
//Replace new name (without quotes) within original sentence
string updatedNameSentence = nameSentence.Replace(name, newName);
//Returns "Take my name Wesley out of quotes"
return updatedNameSentence;
s = s.Replace( """", "" )
Two quotes next to each other will function as the intended " character when inside a string.
If you would like to remove a single character, I guess it's easier to simply iterate over the characters, skip that character, and return the rest. I use this when custom parsing vCard JSON,
as it's bad JSON with "quoted" text identifiers.
Add the below method to a class containing your extension methods.
public static string Remove(this string text, char character)
{
var sb = new StringBuilder();
foreach (char c in text)
{
if (c != character)
sb.Append(c);
}
return sb.ToString();
}
you can then use this extension method:
var text= myString.Remove('"');
