Regex.Replace with large strings and backslashes - c#

I have written a utility which opens a text-based file, loads it as a string and performs a find/replace using Regex.Replace.
It does this on many files: the user points it at a folder and enters a find string and a replace string, and every file in the folder that contains the find string has it replaced.
This works great until I try it with a backslash, where it falls down.
Quite simply:
newFileContent = Regex.Replace(fileContent, @findString, @replaceString, RegexOptions.IgnoreCase);
fileContent = the contents of a text-based file. It will contain carriage returns.
findString = user-entered string to find
replaceString = user-entered string to replace the found string with
I've tried adding some logic to counteract the backslash, as below, but this fails with "Illegal \ at end of pattern".
if (culture.CompareInfo.IndexOf(findString, @"\") >= 0)
{
Regex.Replace(findString, @"\", @"\\");
}
What do I need to do to successfully handle backslashes so they can be part of the find / replace logic?
Entire code block below.
//open reader
using (var reader = new StreamReader(f,Encoding.Default))
{
//read file
var fileContent = reader.ReadToEnd();
Globals.AppendTextToLine(string.Format(" replacing string"));
//culture find replace
var culture = new CultureInfo("en-gb", false);
//ensure nothing has changed
if (culture.CompareInfo.IndexOf(fileContent, findString, CompareOptions.IgnoreCase) >= 0)
{
//if find or replace string contains backslashes
if (culture.CompareInfo.IndexOf(findString, @"\") >= 0)
{
Regex.Replace(findString, @"\", @"\\");
}
//perform replace in new string
if (MainWindow.Main.chkIgnoreCase.IsChecked != null && (bool) MainWindow.Main.chkIgnoreCase.IsChecked)
newFileContent = Regex.Replace(fileContent, @findString, @replaceString, RegexOptions.IgnoreCase);
else
newFileContent = Regex.Replace(fileContent, @findString, @replaceString);
result[i].Result = true;
Globals.AppendTextToLine(string.Format(" success!"));
}
else
{
Globals.AppendTextToLine(string.Format(" failure!!"));
break;
}
}

You should be using Regex.Escape when you pass the user-input into the Replace method.
Escapes a minimal set of characters (\, *, +, ?, |, {, [, (,), ^, $, .,
#, and white space) by replacing them with their escape codes. This instructs the regular expression engine to interpret these characters
literally rather than as metacharacters.
For example:
newFileContent = Regex.Replace(fileContent,
Regex.Escape(findString),
replaceString,
RegexOptions.IgnoreCase);
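A minimal sketch of that literal find/replace, using a hypothetical helper named LiteralReplace (not part of the original utility). Besides escaping the pattern, it doubles any $ in the replacement text, because $ is the one character Regex.Replace treats specially in a replacement string.

```csharp
using System;
using System.Text.RegularExpressions;

static class LiteralReplaceDemo
{
    // Hypothetical helper: treats both user inputs as plain text.
    public static string LiteralReplace(string content, string find, string replace, bool ignoreCase)
    {
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        string pattern = Regex.Escape(find);             // \ . * etc. become literal
        string replacement = replace.Replace("$", "$$"); // "$$" yields one literal $
        return Regex.Replace(content, pattern, replacement, options);
    }

    static void Main()
    {
        string content = @"open C:\Temp\log.txt";
        // Backslashes in the find string no longer break the pattern.
        Console.WriteLine(LiteralReplace(content, @"c:\temp", @"D:\Data", true));
    }
}
```

Note that Regex.Replace returns a new string; the result has to be assigned, as it is in the question's main replace call.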

Your fundamental issue is that you're letting your user enter an arbitrary regexp, and so it's interpreted as a regexp.
Either your goal is just to replace literal strings, in which case use String.Replace, or you want to allow the user to enter a regexp, in which case just accept that the user will need to escape special characters with \.
Since \ is a regexp escape character (as well as a C# one, but you seem to be dealing with that with @), "\" is an illegal regexp: there is nothing after it to escape.
If you really want a regexp to replace every \ with \\, then it's:
Regex.Replace(findString, @"\\", @"\\"); // the pattern \\ matches one backslash; the replacement is not a regexp, so its two backslashes are inserted literally
But you've still got [].?* etc. to worry about.
My strong advice is a checkbox: the user selects whether they are entering a regexp or a string literal, and you call Regex.Replace or String.Replace accordingly.
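The two modes could be sketched roughly as follows. The helper name Apply and the useRegex flag (standing in for the checkbox) are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class FindReplaceModes
{
    // Hypothetical helper; useRegex stands in for the suggested checkbox.
    public static string Apply(string content, string find, string replace, bool useRegex)
    {
        if (useRegex)
        {
            // Regex mode: the user must escape metacharacters themselves.
            return Regex.Replace(content, find, replace);
        }
        // Literal mode: no character has a special meaning.
        return content.Replace(find, replace);
    }

    static void Main()
    {
        string content = @"a.b\c axb";
        Console.WriteLine(Apply(content, "a.b", "Z", false)); // literal: only "a.b" is replaced
        Console.WriteLine(Apply(content, "a.b", "Z", true));  // regex: "." also matches the "x" in "axb"
    }
}
```

This two-argument string.Replace is case-sensitive, so a case-insensitive literal search would still need Regex.Escape combined with RegexOptions.IgnoreCase.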

Related

Replace specific repeating characters from a string

I have a string like "aaa\\\\\\\\test.txt".
How do I replace all the repeating \\ characters by a single \\?
I have tried
pPath = new Regex("\\{2,}").Replace(pPath, Path.DirectorySeparatorChar.ToString());
which matches on http://regexstorm.net/tester but doesn't seem to do the trick in my program.
I'm running this on Windows so the Path.DirectorySeparatorChar is a \\.
Use new Regex(@"\\{2,}") and keep the rest the same.
You need to leave the backslash escaped in your regular expression, so you need to produce a string with two backslashes in it. The two equivalent ways to write the correct C# string literal are @"\\{2,}" and "\\\\{2,}".
Both of those string literals are the string \\{2,}, which is the correct regular expression. Your regular expression calls for one backslash character occurring two or more times, and you have to escape the backslash character. At the risk of being pedantic, if you wanted to replace two a characters, you would use the regular expression a{2,}, and if you want to replace two \ characters, you would use the regular expression \\{2,}, because \\ is the regular expression that matches a single \. Clear as mud? :)
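A short sketch of the point above, with Collapse as a hypothetical helper name. It also shows why the attempt in the question failed: the literal "\\{2,}" reaches the regex engine as \{2,}, which is an escaped brace followed by plain characters, so it searches for the literal text {2,}.

```csharp
using System;
using System.Text.RegularExpressions;

static class CollapseBackslashes
{
    // Hypothetical helper: collapses each run of two or more backslashes to one.
    public static string Collapse(string path)
    {
        // Pattern text is \\{2,}: one escaped backslash, repeated at least twice.
        // The replacement is not a regex, so a single backslash is inserted as-is.
        return Regex.Replace(path, @"\\{2,}", @"\");
    }

    static void Main()
    {
        // Both literals produce the same pattern text.
        Console.WriteLine(@"\\{2,}" == "\\\\{2,}");
        // Four backslashes in the input become one.
        Console.WriteLine(Collapse(@"aaa\\\\test.txt"));
    }
}
```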
Not being a demi-god at regex, I would use StringBuilder and do something like this:
string txt = "";
int count = 0;
StringBuilder bldr = new StringBuilder();
foreach(char c in txt)
{
if (c == '\\') // a backslash must be escaped inside a char literal
{
count++;
if (count < 2) // keep only the first backslash of each run
{
bldr.Append(c);
}
}
else
{
count = 0;
bldr.Append(c);
}
}
string result = bldr.ToString();

C# How may i replace \" with "

I am trying to replace \" in a string with ". How may I do that?
I've tried using Replace but I could not find a way to do it.
Ex:
string line = "This is a \"sample\" "
string replaced = "This is a "sample" ".
Thanks.
Because quotes are used to start and end strings (they are a type of control character), you can't put a bare quote in the middle of a string: it would terminate the string.
string replaced = "This is a "sample" ";
/*
You can see from the syntax highlighting (red) that the string is being
detected as <This is a > and <sample> is black meaning it is detected as
code (and will cause a syntax error)
*/
In order to put a quote in the middle of the string we escape it (escaping means to treat it as a character literal instead of a control character) using the escape character, which in C# is backslash.
string line = "This is a \"sample\"";
Console.WriteLine(line);
// Output: This is a "sample"
string literalLine = @"This is a ""sample""";
Console.WriteLine(literalLine);
// Output: This is a "sample"
The @ symbol in C# marks a verbatim string literal (escape sequences are ignored). Quotes still start and end the string, however, so to put a quote in a verbatim string you write two of them, "" (that's how the language is designed).
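As a small check of the explanation above, the sketch below (class and method names are illustrative) compares the two literal forms and confirms that the resulting string contains no backslash at all, which is why there is nothing to replace.

```csharp
using System;

static class QuoteLiterals
{
    // True when the escaped and verbatim literals hold the same text.
    public static bool SameText()
    {
        string escaped = "This is a \"sample\"";
        string verbatim = @"This is a ""sample""";
        return escaped == verbatim;
    }

    // True only if the runtime string actually contains a backslash.
    public static bool HasBackslash()
    {
        return "This is a \"sample\"".IndexOf('\\') >= 0;
    }

    static void Main()
    {
        Console.WriteLine(SameText());     // True
        Console.WriteLine(HasBackslash()); // False
    }
}
```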
Case 1: If the value within the variable line is actually This is a \"sample\", then you could do line.Replace("\\\"", "\"").
If not:
\" is an escape sequence. It appears as \" in the source code, but the compiled string contains just " rather than \".
The reason for escaping quotes is because the compiler cannot identify whether the quote is within another quote or not. Let's see your example:
"This is a "sample" "
Is this This is a as one string, then an unknown token sample, then another quote? Or is This is a "sample" all within one string? We can guess from the context, but the compiler cannot. Hence we use an escape sequence to tell the compiler "I used this double quote character as a character, not as the opening or closing of a string literal."
See Also: https://en.wikipedia.org/wiki/Escape_sequences_in_C
You may try something like this:
String str = "This is a \"sample\" ";
Console.WriteLine("Original string: {0}", str);
Console.WriteLine("Replaced: {0}", str.Replace('\"', '"'));
Desired output : This is a sample
Given string : "This is a \"sample\""
The problem: you have escape characters protecting the double quotes from being interpreted. The \ escape character is an instruction to use a quotation mark literally instead of using it to end the string. This means the actual string value is This is a "sample" when served as output.
The answer removing the \ may work, but it makes for smelly code because removing an escape character in this way can make it unclear what you intend and prevents you from escaping any character.
Removing the " might work, though it prevents use of any quotes and some IDEs might leave the escape character behind to ruin your day.
We want one specific target, the quotes around "sample".
string sample = "This is a \"sample\"";
List<string> sampleArray = sample.Split(' ').ToList(); // samplearray is now split into ["This", "is", "a", "\"sample\""]
var x = sampleArray.FirstOrDefault(t => t == "\"sample\""); //isolate our needed value
if (x != null) //prevent a null reference in case something went wrong and samplearray wasnt as expected
{
var index = sampleArray.IndexOf(x); //get the location of the value we just picked
x = x.Replace("\"", string.Empty); //replace chars
sampleArray[index] = x; //assign new value to the list
}
return String.Join(" ", sampleArray); //return the string joined together with spaces.
Try this:
string line = "This is a \"sample\" ";
string replaced = line.Replace(@"\", "");

Regular expression to remove whitespace around a comma, except when quoted

I have a CSV file that has rows resembling this:
1, 4, 2, "PUBLIC, JOHN Q" ,ACTIVE , 1332
I am looking for a regular expression replacement that will match against these rows and spit out something resembling this:
1,4,2,"PUBLIC, JOHN Q",ACTIVE,1332
I thought this would be rather easy: I made the expression ([ \t]+,) and replaced it with ,. I made a complement expression (,[ \t]+) with a replacement of , and I thought I had achieved a good means of right-trimming and left-trimming strings.
...but then I noticed that my "PUBLIC, JOHN Q" was now "PUBLIC,JOHN Q" which isn't what I wanted. (Note the space following the comma is now gone).
What would be the appropriate expression to trim the white space before and after a comma, but leave quoted text untouched?
UPDATE
To clarify, I am using an application to handle the file. This application allows me to define multiple regular expression replacements; it does not provide a parsing capability. While this may not be the ideal mechanism for this, it would sure beat making another application for this one file.
If the engine used by your tool is the C# regular expression engine, then you can try the following expression:
(?<!,\s*"(?:[^\\"]|\\")*)\s+(?!(?:[^\\"]|\\")*"\s*,)
replace with empty string.
The other answers assume the quotes are balanced and use counting to determine whether a space is part of a quoted value.
My expression looks for all spaces that are not part of a quoted value.
RegexHero Demo
Something like this might do the job:
(?<!(^[^"]*"[^"]*(("[^"]*){2})*))[\t ]*,[ \t]*
Which matches [\t ]*,[ \t]*, only when not preceded by an odd number of quotes.
Going with a CSV library or parsing the file yourself would be much easier, and IMO should be the preferred option here.
But if you really insist on a regex, you can use this one:
"\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)"
And replace it with empty string - ""
This regex matches one or more whitespace characters that are followed by an even number of quotes. This will of course work only if your quotes are balanced.
(?x) # Ignore Whitespace
\s+ # One or more whitespace characters
(?= # Followed by
( # A group - This group captures even number of quotes
[^\"]* # Zero or more non-quote characters
\" # A quote
[^\"]* # Zero or more non-quote characters
\" # A quote
)* # Zero or more repetition of previous group
[^\"]* # Zero or more non-quote characters
$ # Till the end
) # Look-ahead end
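A rough C# sketch of applying this pattern to the sample row from the question, assuming the tool's engine behaves like .NET. TrimOutsideQuotes is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

static class CsvWhitespace
{
    // Hypothetical helper: removes whitespace that is followed by an even
    // number of quotes, i.e. whitespace that lies outside any quoted value.
    public static string TrimOutsideQuotes(string row)
    {
        // Verbatim literal: "" stands for one quote character in the pattern.
        return Regex.Replace(row, @"\s+(?=([^""]*""[^""]*"")*[^""]*$)", "");
    }

    static void Main()
    {
        string row = "1, 4, 2, \"PUBLIC, JOHN Q\" ,ACTIVE , 1332";
        Console.WriteLine(TrimOutsideQuotes(row));
        // prints: 1,4,2,"PUBLIC, JOHN Q",ACTIVE,1332
    }
}
```

One consequence of this approach is that all whitespace outside quotes is removed, not just whitespace next to commas, so an unquoted value such as JOHN Q would become JOHNQ.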
string format(string val)
{
if (val.StartsWith("\"")) val = " " + val;
string[] vals = val.Split('\"');
for (int i = 0; i < vals.Length; i += 2) vals[i] = vals[i].Replace(" ", "").Replace("\t", "");
return string.Join("\"", vals); // rejoin with the quote character that Split removed
}
This will work if you have properly closed quoted strings in between
Forget the regex (See Bart's comment on the question, regular expressions aren't suitable for CSV).
public static string ReduceSpaces( string input )
{
char[] a = input.ToCharArray();
int placeComma = 0, placeOther = 0;
bool inQuotes = false;
bool followedComma = true;
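foreach( char c in a ) {
inQuotes ^= (c == '\"');
if (c == ' ' && !inQuotes) { // spaces inside quotes are kept as-is
if (!followedComma)
a[placeOther++] = c;
}
else if (c == ',' && !inQuotes) { // commas inside quotes are ordinary text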
a[placeComma++] = c;
placeOther = placeComma;
followedComma = true;
}
else {
a[placeOther++] = c;
placeComma = placeOther;
followedComma = false;
}
}
return new String(a, 0, placeComma);
}
Demo: http://ideone.com/NEKm09

String Manipulation using C#

In C# we can check a string with the string.Contains() method, e.g.:
string test = "Microsoft";
if (test.Contains("i"))
test = test.Replace("i","a");
This is fine. But what if the string I want to replace contains a " symbol?
I want to achieve this:
"<html><head>
I want to remove the " symbol present in check so that the result would be:
<html><head>
The " character can also be replaced, just like any other:
test = test.Replace("\"","");
Also, note that you don't have to test if the character exists : your test.Contains("i") could be removed since the .Replace() method won't do anything (no replace, no error thrown) if the character doesn't exist inside the string.
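A tiny sketch of that point, with StripQuotes as an illustrative name: Replace can be called unconditionally, and it returns the input text unchanged when there is nothing to replace.

```csharp
using System;

static class ReplaceWithoutCheck
{
    // Hypothetical helper: removes every double quote, with no Contains test first.
    public static string StripQuotes(string s)
    {
        return s.Replace("\"", "");
    }

    static void Main()
    {
        Console.WriteLine(StripQuotes("\"<html><head>")); // quote removed
        Console.WriteLine(StripQuotes("Microsoft"));      // no quote present: text unchanged
    }
}
```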
To include a quote symbol in a string, you need to escape it using a backslash. In your example, you want to use something like this:
if (test.Contains("\""))
There are two ways to include a '"' character in a string literal. All the answers so far have used the c-style way:
var quotation = "Parting is such sweet sorrow";
var howSweetIsIt = quotation + " that I shall say \"good-night\" till it be morrow.";
In some contexts (especially for users experienced with Visual Basic), the verbatim string literal may be easier to read. A verbatim string literal begins with an @ sign, and the only character that requires escaping is the quotation mark -- all other characters are included verbatim (hence the name). Significantly, the method of escaping the quotation mark is different: rather than preceding it with a backslash, it must be doubled:
var howSweetIsIt = quotation + @" that I shall say ""good-night"" till it be morrow.";
string SymbolString = "Micro\"so\"ft";
The string above uses the escape character \ to insert " between the characters.
string Result = SymbolString.Replace("\"", string.Empty);
The Replace call above replaces each " character with an empty string.
This is what you try to achieve?
if (check.Contains("\""))
output = check.Replace("\"", "");
output = check.Replace("\"", "");
Just remember to use "\"" for the quote sign as the backslash is an escape character.
if (str.Contains("\""))
{
str = str.Replace("\"", "");
}

C# Regex To Escape Certain Characters

How can I escape certain characters in a string with a C# Regex?
This is a test for % and ' thing? -> This is a test for \% and \' thing?
resultString = Regex.Replace(subjectString,
@"(?<! # Match a position before which there is no
(?<!\\) # odd number of backslashes
\\ # (it's odd if there is one backslash,
(?:\\\\)* # followed by an even number of backslashes)
)
(?=[%']) # and which is followed by a % or a '",
@"\", RegexOptions.IgnorePatternWhitespace);
However, if you're trying to protect yourself against malevolent SQL queries, regex is not the right way to go.
var escapedString = Regex.Replace(input, @"[%']", @"\$0"); // $0 is the whole match; the pattern has no capture group for $1 to refer to
This is pretty much all you need. Inside the square brackets, you should put every character you wish to escape with a backslash, which may include the backslash character itself.
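A small sketch of the character-class approach applied to the example sentence from the question. It uses $0 (the whole match) in the replacement, and the helper name Escape is illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeWithRegex
{
    // Hypothetical helper: prefixes each % or ' with a backslash.
    public static string Escape(string input)
    {
        // $0 inserts the matched character; the backslash before it is literal
        // because only $ is special in a .NET replacement string.
        return Regex.Replace(input, @"[%']", @"\$0");
    }

    static void Main()
    {
        Console.WriteLine(Escape("This is a test for % and ' thing?"));
        // prints: This is a test for \% and \' thing?
    }
}
```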
I don't think this could be done with regex in good fashion, but you can simply run a for loop:
var specialChars = new char[]{'%',....};
var stream = "";
for (int i=0;i<myStr.Length;i++)
{
if (specialChars.Contains(myStr[i]))
{
stream+= '\\';
}
stream += myStr[i];
}
(1) You can use a StringBuilder to avoid creating a new string on every iteration.
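The same loop written with a StringBuilder might look like the sketch below. The method name and the particular set of special characters passed in are assumptions for the example.

```csharp
using System;
using System.Text;

static class EscapeWithLoop
{
    // Hypothetical helper: prefixes each listed character with a backslash.
    public static string Escape(string input, char[] specialChars)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Array.IndexOf returns -1 when c is not one of the special characters.
            if (Array.IndexOf(specialChars, c) >= 0)
            {
                sb.Append('\\');
            }
            sb.Append(c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        char[] special = { '%', '\'' }; // example set only
        Console.WriteLine(Escape("This is a test for % and ' thing?", special));
    }
}
```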
