Replace Unicode character "�" with a space - c#

I'm a doing an massive uploading of information from a .csv file and I need replace this character non ASCII "�" for a normal space, " ".
The character "�" corresponds to "\uFFFD" for C, C++, and Java, which it seems that it is called REPLACEMENT CHARACTER. There are others, such as spaces type like U+FEFF, U+205F, U+200B, U+180E, and U+202F in the C# official documentation.
I'm trying do the replace this way:
public string Errors = "";
public void test(){
string textFromCsvCell = "";
string validCharacters = "^[0-9A-Za-z().:%-/ ]+$";
textFromCsvCell = "This is my text from csv file"; //All spaces aren't normal space " "
string cleaned = textFromCsvCell.Replace("\uFFFD", "\"")
if (Regex.IsMatch(cleaned, validCharacters ))
//All code for insert
else
Errors=cleaned;
//print Errors
}
The test method shows me this text:
"This is my�texto from csv file"
I try some solutions too:
Trying solution 1: Using Trim
Regex.Replace(value.Trim(), #"[^\S\r\n]+", " ");
Try solution 2: Using Replace
System.Text.RegularExpressions.Regex.Replace(str, #"\s+", " ");
Try solution 3: Using Trim
String.Trim(new char[]{'\uFEFF', '\u200B'});
Try solution 4: Add [\S\r\n] to validCharacters
string validCharacters = "^[\S\r\n0-9A-Za-z().:%-/ ]+$";
Nothing works.
How can I replace it?
Sources:
Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD)
Trying to replace all white space with a single space
Strip the byte order mark from string in C#
Remove extra whitespaces, but keep new lines using a regular expression in C#
EDITED
This is the original string:
"SYSTEM OF MONITORING CONTINUES OF GLUCOSE"
in 0x... notation
SYSTEM OF0xA0MONITORING CONTINUES OF GLUCOSE
Solution
Go to the Unicode code converter. Look at the conversions and do the replace.
In my case, I do a simple replace:
string value = "SYSTEM OF MONITORING CONTINUES OF GLUCOSE";
//value contains non-breaking whitespace
//value is "SYSTEM OF�MONITORING CONTINUES OF GLUCOSE"
string cleaned = "";
string pattern = #"[^\u0000-\u007F]+";
string replacement = " ";
Regex rgx = new Regex(pattern);
cleaned = rgx.Replace(value, replacement);
if (Regex.IsMatch(cleaned,"^[0-9A-Za-z().:<>%-/ ]+$"){
//all code for insert
else
//Error messages
This expression represents all possible spaces: space, tab, page break, line break and carriage return
[ \f\n\r\t\v​\u00a0\u1680​\u180e\u2000​\u2001\u2002​\u2003\u2004​\u2005\u2006​\u2007\u2008​\u2009\u200a​\u2028\u2029​​\u202f\u205f​\u3000]
References
Regular expressions (MDN)

Using String.Replace:
Use a simple String.Replace().
I've assumed that the only characters you want to remove are the ones you've mentioned in the question: � and you want to replace them by a normal space.
string text = "imp�ortant";
string cleaned = text.Replace('\u00ef', ' ')
.Replace('\u00bf', ' ')
.Replace('\u00bd', ' ');
// Returns 'imp ortant'
Or using Regex.Replace:
string cleaned = Regex.Replace(text, "[\u00ef\u00bf\u00bd]", " ");
// Returns 'imp ortant'
Try it out: Dotnet Fiddle

Define a range of ASCII characters, and replace anything that is not within that range.
We want to find only Unicode characters, so we will match on a Unicode character and replace.
Regex.Replace("This is my te\uFFFDxt from csv file", #"[^\u0000-\u007F]+", " ")
The above pattern will match anything that is not ^ in the set [ ] of this range \u0000-\u007F (ASCII characters (everything past \u007F is Unicode)) and replace it with a space.
Result
This is my te xt from csv file
You can adjust the range provided \u0000-\u007F as needed to expand the range of allowed characters to suit your needs.

If you just want ASCII then try the following:
var ascii = new ASCIIEncoding();
byte[] encodedBytes = ascii.GetBytes(text);
var cleaned = ascii.GetString(encodedBytes).Replace("?", " ");

Related

How can I remove the spaces that appear between the words even after splitting the string? [duplicate]

I have the following input:
string txt = " i am a string "
I want to remove space from start of starting and end from a string.
The result should be: "i am a string"
How can I do this in c#?
String.Trim
Removes all leading and trailing white-space characters from the current String object.
Usage:
txt = txt.Trim();
If this isn't working then it highly likely that the "spaces" aren't spaces but some other non printing or white space character, possibly tabs. In this case you need to use the String.Trim method which takes an array of characters:
char[] charsToTrim = { ' ', '\t' };
string result = txt.Trim(charsToTrim);
Source
You can add to this list as and when you come across more space like characters that are in your input data. Storing this list of characters in your database or configuration file would also mean that you don't have to rebuild your application each time you come across a new character to check for.
NOTE
As of .NET 4 .Trim() removes any character that Char.IsWhiteSpace returns true for so it should work for most cases you come across. Given this, it's probably not a good idea to replace this call with the one that takes a list of characters you have to maintain.
It would be better to call the default .Trim() and then call the method with your list of characters.
You can use:
String.TrimStart - Removes all leading occurrences of a set of characters specified in an array from the current String object.
String.TrimEnd - Removes all trailing occurrences of a set of characters specified in an array from the current String object.
String.Trim - combination of the two functions above
Usage:
string txt = " i am a string ";
char[] charsToTrim = { ' ' };
txt = txt.Trim(charsToTrim)); // txt = "i am a string"
EDIT:
txt = txt.Replace(" ", ""); // txt = "iamastring"
I really don't understand some of the hoops the other answers are jumping through.
var myString = " this is my String ";
var newstring = myString.Trim(); // results in "this is my String"
var noSpaceString = myString.Replace(" ", ""); // results in "thisismyString";
It's not rocket science.
txt = txt.Trim();
Or you can split your string to string array, splitting by space and then add every item of string array to empty string.
May be this is not the best and fastest method, but you can try, if other answer aren't what you whant.
text.Trim() is to be used
string txt = " i am a string ";
txt = txt.Trim();
Use the Trim method.
static void Main()
{
// A.
// Example strings with multiple whitespaces.
string s1 = "He saw a cute\tdog.";
string s2 = "There\n\twas another sentence.";
// B.
// Create the Regex.
Regex r = new Regex(#"\s+");
// C.
// Strip multiple spaces.
string s3 = r.Replace(s1, #" ");
Console.WriteLine(s3);
// D.
// Strip multiple spaces.
string s4 = r.Replace(s2, #" ");
Console.WriteLine(s4);
Console.ReadLine();
}
OUTPUT:
He saw a cute dog.
There was another sentence.
He saw a cute dog.
You Can Use
string txt = " i am a string ";
txt = txt.TrimStart().TrimEnd();
Output is "i am a string"

Find and replace the string in paragraph

I want to empty the value between the hyphn for example need to clear the data in between the range of hyphen prefix and suffix then make it has empty string.
string templateContent = "Template content -macro- -UnitDetails- -testEmail- sending Successfully";
Output
templateContent = "Template content sending Successfully";
templateContent = Regex.Replace(templateContent, #"-\w*-\s?", string.Empty).TrimEnd(' ');
#"-\w*-\s" - is regex pattern for '-Word- '
- - pattern for -
\w - word character.
* - zero or any occurrences of \w
\s - pattern for whitespace character
? - marks \s as optional
TrimEnd(' ') - to remove trailing space if there was a pattern at end of the string
There are many ways to do this, however given your example the following should work
var split = templateContent
.Split(' ')
.Where(x => !x.StartsWith("-") && !x.EndsWith("-"));
var result = string.Join(" ",split);
Console.WriteLine(result);
Output
Template content sending Successfully
Full Demo Here
Note : I personally think regex is better suited to this
You can use regex for this
string regExp = "(-[a-zA-Z]*-)";
string tmp = Regex.Replace(templateContent , regExp, "");
string finalStr = Regex.Replace(tmp, " {2,}", " ");
var resultWithSpaces = Regex.Replace(templateContent, #"-\S+-", string.Empty);
This regular expression looks for two hyphens surrounding one or more characters that are not white space.
It will leave the spaces that were around the removed word. To get rid of those you can do another Regex to replace multiple spaces with a single space.
var result = Regex.Replace(resultWithSpaces, #"\s+", " ");

How to compare characters in C#

I'm trying to compare two characters in C#. The "==" operator does not work for strings, you have to use the .Equals() method. In the following code example I want to read each character in the input string, and output another string without spaces.
string inputName, outputName = null;
// read input name from file
foreach (char indexChar in inputName)
{
if (!indexChar.Equals(" "))
outputName += indexChar;
}
This does not work, the comparison always equals false, even when the input name has embedded spaces. I also tried using the overload method Equals(string, string), which did not work either. I'm assuming C# treats char variables as a string of length 1. Microsoft's documentation doesn't seem to mention comparing characters. Does anyone have a better method for comparing characters in a string?
" " is a string of length one; a char and a string never match; you want ' ', the space character:
if (indexChar != ' ')
However, if you're just trying to remove all spaces, it is probably easier to just do:
var outputName = inputName.Replace(" ", "");
This avoids allocating lots of intermediate strings.
Note also that the space character isn't the only whitespace character in unicode. If you need to deal with all whitespace characters, a regex may be a better option:
var outputName = Regex.Replace(inputName, #"\s", "");
You can use .CompareTo(char) to compare characters.
Example :
if('Z'.CompareTo('Z') == 0)
Console.WriteLine("Same character !");
Thanks for all the great suggestions. inputName.CompareTo(" ") is not the way to go for this example, you would still have to have a loop. I ended up using:
var outputName = Regex.Replace(inputName, #"\s", "")
which works, and it's only one line of code!

C# How may i replace \" with "

I am trying to replace \" in a string with ", how may i do that?
I've tried using replace but i could not find a way to do it.
Ex:
string line = "This is a \"sample\" "
string replaced = "This is a "sample" ".
Thanks.
Because quotes are used to start and end strings (they are a type of control character), you can't have a quote in the middle of a string because it would terminate the string
string replaced = "This is a "sample" ";
/*
You can see from the syntax highlighting (red) that the string is being
detected as <This is a > and <sample> is black meaning it is detected as
code (and will cause a syntax error)
*/
In order to put a quote in the middle of the string we escape it (escaping means to treat it as a character literal instead of a control character) using the escape character, which in C# is backslash.
string line = "This is a \"sample\"";
Console.WriteLine(line);
// Output: This is a "sample"
string literalLine = #"This is a ""sample""";
Console.WriteLine(literalLine);
// Output: This is a "sample"
The # symbol in C# means I want this to be a literal string (ignore control characters), however quotes still start and end strings so in order to print a quote in a literal string you write two of them "" (that's how the language is designed)
Case 1: If the value within the variable line is actually This is a \"sample\", then you could do line.Replace("\\\"", "\"").
If not:
\" is an escape sequence. it shows up as \" in the code, however when it compiles it would show up as " instead of the original \".
The reason for escaping quotes is because the compiler cannot identify whether the quote is within another quote or not. Let's see your example:
"This is a "sample" "
is this This is a as one group, then an unknown token sample, then another quote ? or This is a "sample" all within a quote? We can take a guess by looking at the context, but compiler cannot. Hence, we use escape sequence to tell the compiler "I used this double quote character as a character, not the closing/opening of a string literal."
See Also: https://en.wikipedia.org/wiki/Escape_sequences_in_C
You may try something like this:
String str = "This is a \"sample\" ";
Console.WriteLine("Original string: {0}", str);
Console.WriteLine("Replaced: {0}", str.Replace('\"', '"'));
Desired output : This is a sample
Given string : "This is a \"sample\""
The problem: you have escape characters protecting the double quotes from being interpreted. the \ escape character is an instruction to use a quotation mark literally instead of using it to indicate a break in the string. This means the actual string value is "This is a "sample"" when served as output.
The answer removing the \ may work, but it makes for smelly code because removing an escape character in this way can make it unclear what you intend and prevents you from escaping any character.
Removing the " might work, though it prevents use of any quotes and some IDEs might leave the escape character behind to ruin your day.
We want one specific target, the quotes around "sample".
string sample = "This is a \"sample\"";
List<string> sampleArray = sample.Split(' ').ToList(); // samplearray is now split into ["This", "is", "a", "\"sample\""]
var x = sampleArray.FirstOrDefault(t => t == "\"sample\""); //isolate our needed value
if (x != null) //prevent a null reference in case something went wrong and samplearray wasnt as expected
{
var index = sampleArray.IndexOf(x); //get the location of the value we just picked
x = x.Replace("\"", string.Empty); //replace chars
sampleArray[index] = x; //assign new value to the list
}
return String.Join(" ", sampleArray); //return the string joined together with spaces.
Try this:
string line="This is a \"sample\" " ;
replaced =line.Replace(#"\", "");

Replace a part of string containing Password

Slightly similar to this question, I want to replace argv contents:
string argv = "-help=none\n-URL=(default)\n-password=look\n-uname=Khanna\n-p=100";
to this:
"-help=none\n-URL=(default)\n-password=********\n-uname=Khanna\n-p=100"
I have tried very basic string find and search operations (using IndexOf, SubString etc.). I am looking for more elegant solution so as to replace this part of string:
-password=AnyPassword
to:
-password=*******
And keep other part of string intact. I am looking if String.Replace or Regex replace may help.
What I've tried (not much of error-checks):
var pwd_index = argv.IndexOf("--password=");
string converted;
if (pwd_index >= 0)
{
var leftPart = argv.Substring(0, pwd_index);
var pwdStr = argv.Substring(pwd_index);
var rightPart = pwdStr.Substring(pwdStr.IndexOf("\n") + 1);
converted = leftPart + "--password=********\n" + rightPart;
}
else
converted = argv;
Console.WriteLine(converted);
Solution
Similar to Rubens Farias' solution but a little bit more elegant:
string argv = "-help=none\n-URL=(default)\n-password=\n-uname=Khanna\n-p=100";
string result = Regex.Replace(argv, #"(password=)[^\n]*", "$1********");
It matches password= literally, stores it in capture group $1 and the keeps matching until a \n is reached.
This yields a constant number of *'s, though. But telling how much characters a password has, might already convey too much information to hackers, anyway.
Working example: https://dotnetfiddle.net/xOFCyG
Regular expression breakdown
( // Store the following match in capture group $1.
password= // Match "password=" literally.
)
[ // Match one from a set of characters.
^ // Negate a set of characters (i.e., match anything not
// contained in the following set).
\n // The character set: consists only of the new line character.
]
* // Match the previously matched character 0 to n times.
This code replaces the password value by several "*" characters:
string argv = "-help=none\n-URL=(default)\n-password=look\n-uname=Khanna\n-p=100";
string result = Regex.Replace(argv, #"(password=)([\s\S]*?\n)",
match => match.Groups[1].Value + new String('*', match.Groups[2].Value.Length - 1) + "\n");
You can also remove the new String() part and replace it by a string constant

Categories