Remove all non-ASCII characters from string

I have a C# routine that imports data from a CSV file, matches it against a database and then rewrites it to a file. The source file seems to have a few non-ASCII characters that are fouling up the processing routine.
I already have a static method that I run each input field through but it performs basic checks like removing commas and quotes. Does anybody know how I could add functionality that removes non-ASCII characters too?

Here is a simple way to check whether a string is pure ASCII:
public static bool IsASCII(this string value)
{
// ASCII encoding replaces non-ascii with question marks, so we use UTF8 to see if multi-byte sequences are there
return Encoding.UTF8.GetByteCount(value) == value.Length;
}
source: http://snipplr.com/view/35806/

string sOut = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));

Do it all at once
public string ReturnCleanASCII(string s)
{
StringBuilder sb = new StringBuilder(s.Length);
foreach(char c in s)
{
if((int)c > 127) // you probably don't want 127 either
continue;
if((int)c < 32) // I bet you don't want control characters
continue;
if(c == ',')
continue;
if(c == '"')
continue;
sb.Append(c);
}
return sb.ToString();
}

If you wanted to test a specific character, you could use
if ((int)myChar <= 127)
Just getting the ASCII encoding of the string will not tell you that a specific character was non-ASCII to begin with (if you care about that). See MSDN.
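If you do care which characters were non-ASCII, the per-character test above can be applied across the whole string. The following is a minimal sketch; the class and method names (AsciiInspector, FindNonAsciiIndexes) are made up for illustration.

```csharp
using System.Collections.Generic;

public static class AsciiInspector
{
    // Returns the index of every char whose value is above 127 (outside ASCII).
    public static List<int> FindNonAsciiIndexes(string input)
    {
        var indexes = new List<int>();
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] > 127)
                indexes.Add(i);
        }
        return indexes;
    }
}
```

An empty list means the string was already pure ASCII, so the caller can skip any cleanup step.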

Here's an improvement upon the accepted answer:
string fallbackStr = "";
Encoding enc = Encoding.GetEncoding(Encoding.ASCII.CodePage,
new EncoderReplacementFallback(fallbackStr),
new DecoderReplacementFallback(fallbackStr));
string cleanStr = enc.GetString(enc.GetBytes(inputStr));
This method will replace unknown characters with the value of fallbackStr, or if fallbackStr is empty, leave them out entirely. (Note that enc can be defined outside the scope of a function.)
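As a rough sketch of how this could be packaged, the encoding can live in a static field so it is built only once. The names AsciiCleaner, StripNonAscii and ReplaceNonAscii are assumptions for illustration. The second method shows the plain Encoding.ASCII round trip for comparison, which substitutes a question mark instead of dropping the character.

```csharp
using System.Text;

public static class AsciiCleaner
{
    // Built once and reused; an empty replacement string drops unencodable characters.
    private static readonly Encoding DroppingAscii = Encoding.GetEncoding(
        Encoding.ASCII.CodePage,
        new EncoderReplacementFallback(string.Empty),
        new DecoderReplacementFallback(string.Empty));

    // Removes every character outside the ASCII range (0-127).
    public static string StripNonAscii(string input)
    {
        return DroppingAscii.GetString(DroppingAscii.GetBytes(input));
    }

    // Plain ASCII round trip: non-ASCII characters become '?' instead of disappearing.
    public static string ReplaceNonAscii(string input)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(input));
    }
}
```

Note that neither method touches control characters below 32; those are valid ASCII and pass through unchanged.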

It sounds kind of strange that it's accepted to drop the non-ASCII.
Also I always recommend the excellent FileHelpers library for parsing CSV-files.

strText = Regex.Replace(strText, @"[^\u0020-\u007E]", string.Empty);

public string RunCharacterCheckASCII(string s)
{
string str = s;
bool is_find = false;
char ch;
int ich = 0;
try
{
char[] schar = str.ToCharArray();
for (int i = 0; i < schar.Length; i++)
{
ch = schar[i];
ich = (int)ch;
if (ich > 127) // not ascii or extended ascii
{
is_find = true;
schar[i] = '?';
}
}
if (is_find)
str = new string(schar);
}
catch (Exception ex)
{
}
return str;
}

Related

remove string between "|" and "," in stringbuilder in C#

I use VS2019 in Windows7.
I want to remove string between "|" and "," in a StringBuilder.
That is , I want to convert StringBuilder from
"578.552|0,37.986|317,38.451|356,23"
to
"578.552,37.986,38.451,23"
I have tried Substring but failed, what other method I could use to achieve this?

If you have a huge StringBuilder, so that converting it into a String and applying a regular expression is not an option,
you can try implementing a finite state machine (FSM):
StringBuilder source = new StringBuilder("578.552|0,37.986|317,38.451|356,23");
int state = 0; // 0 - keep character, 1 - discard character
int index = 0;
for (int i = 0; i < source.Length; ++i) {
char c = source[i];
if (state == 0)
if (c == '|')
state = 1;
else
source[index++] = c;
else if (c == ',') {
state = 0;
source[index++] = c;
}
}
source.Length = index;

StringBuilder isn't really set up for much by way of inspection and mutation in the middle. It would be pretty easy to do once you have a string (probably via a Regex), but with a StringBuilder? Not so much. In reality, StringBuilder is mostly intended for forwards-only append, so the answer would be:
if you didn't want those characters, why did you add them?
Maybe just use the string version here; then:
var s = "578.552|0,37.986|317,38.451|356,23";
var t = Regex.Replace(s, @"\|.*?(?=,)", ""); // 578.552,37.986,38.451,23
The regex translation here is "pipe (\|), non-greedy anything (.*?), followed by a comma where the following comma isn't part of the match ((?=,))".

If you don't know much about Regex patterns, you can write your own custom method to filter out the data; it's always instructive and good practice:
public static String RemoveDelimitedSubstrings(
this StringBuilder s,
char startDelimitter,
char endDelimitter,
char newDelimitter)
{
var buffer = new StringBuilder(s.Length);
var ignore = false;
for (var i = 0; i < s.Length; i++)
{
var currentChar = s[i];
if (currentChar == startDelimitter && !ignore)
{
ignore = true;
}
else if (currentChar == endDelimitter && ignore)
{
ignore = false;
buffer.Append(newDelimitter);
}
else if (!ignore)
buffer.Append(currentChar);
}
return buffer.ToString();
}
And you'd obviously use it like this:
var buffer = new StringBuilder("578.552|0,37.986|317,38.451|356,23");
var filteredBuffer = buffer.RemoveDelimitedSubstrings('|', ',', ',');

How to unescape a sequence that includes \u and \U?

I have some strings in a .resx file that include sequences like this:
\u26A0 warning
So I use the following code to unescape them:
str = Regex.Unescape(str);
This works well with \u, and the result shows the corresponding emoji.
But the Regex.Unescape(...) method does not work when the input string includes \U, like this:
\U0001F4D8 book
It returns this error:
Error: Unrecognized escape sequence \U
My questions:
Is there another method in the .NET Framework to unescape sequences that include \u and \U?
If there is no built-in method, how can I write a helper method to do it manually?
Edit:
When I read the string from the resx file it has a double backslash, and I need to convert these Unicode sequences to their characters.

Indeed, according to source code of Regex.Unescape, RegexParser.ScanCharEscape, \U is not handled.
Instead, you could consider a manual conversion with the help of char.ConvertFromUtf32:
string converted = char.ConvertFromUtf32(int.Parse("0001F4D8", NumberStyles.HexNumber));
This is a draft implementation. (The annoying complexity comes from an attempt to distinguish \U and \\U.)
static string Unescape(string str)
{
StringBuilder builder = new StringBuilder();
int startIndex = 0;
while(true)
{
int index = IndexOfBackslashU(str, startIndex);
if (index == -1)
return builder.Append(Regex.Unescape(str.Substring(startIndex))).ToString();
builder.Append(Regex.Unescape(str.Substring(startIndex, index - startIndex)));
string number = str.Substring(index + 2, 8);
builder.Append(char.ConvertFromUtf32(int.Parse(number, NumberStyles.HexNumber)));
startIndex = index + 10;
}
}
static int IndexOfBackslashU(string str, int startIndex)
{
while (true)
{
int index = str.IndexOf(@"\U", startIndex);
if (index == -1)
return index;
bool evenNumberOfPreviousBackslashes = true;
for (int k = index-1; k >= 0 && str[k] == '\\'; k--)
evenNumberOfPreviousBackslashes = !evenNumberOfPreviousBackslashes;
if (evenNumberOfPreviousBackslashes)
return index;
startIndex = index + 2;
}
}
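One detail of char.ConvertFromUtf32 that is easy to miss: for code points above U+FFFF it returns a string of two chars (a UTF-16 surrogate pair), which is why it returns a string rather than a char. The sketch below isolates that conversion step; CodePointDemo and FromHex are hypothetical names used only for illustration.

```csharp
using System.Globalization;

public static class CodePointDemo
{
    // Parses a hex code point such as "0001F4D8" and returns it as a UTF-16 string.
    // char.ConvertFromUtf32 throws ArgumentOutOfRangeException for values that are
    // not valid Unicode code points (above 0x10FFFF or inside the surrogate range).
    public static string FromHex(string hex)
    {
        int codePoint = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return char.ConvertFromUtf32(codePoint);
    }
}
```

For "26A0" the result is a single char, while for "0001F4D8" the result has Length 2, so code that counts or indexes characters after unescaping should not assume one char per symbol.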

I wrote this method, and it solved the problem:
public static string UnescapeIt(string str)
{
var regex = new Regex(@"(?<!\\)(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})", RegexOptions.Compiled);
return regex.Replace(str,
m =>
{
if (m.Value.IndexOf("\\U", StringComparison.Ordinal) > -1)
return char.ConvertFromUtf32(int.Parse(m.Value.Replace("\\U", ""), NumberStyles.HexNumber));
return Regex.Unescape(m.Value);
});
}
It unescapes \u sequences and converts \U sequences to the corresponding characters, so the emojis display correctly.
Use:
str = UnescapeIt(str);
Update:
I changed the regex from
\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}
to
(?<!\\)(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})
Now the match fails if there is a backslash before \u or \U.

C# replace different characters in a string

Everyone knows how to replace a character in a string with:
string text = "Hello World!";
text = text.Replace("H","J");
but what I need is to replace multiple characters in a string,
something like:
string text = textBox1.Text;
text = text.Replace("a","b");
text = text.Replace("b","a");
Now the result is aa, but if the user types ab I want the result to be ba.

There are multiple ways to do this.
Using a loop
char[] temp = input.ToCharArray();
for (int index = 0; index < temp.Length; index++)
switch (temp[index])
{
case 'a':
temp[index] = 'b';
break;
case 'b':
temp[index] = 'a';
break;
}
string output = new string(temp);
This will simply copy the string to a character array, fix each character by itself, then convert the array back into a string. No risk of getting any of the characters confused with any of the others.
Using a regular expression
You can exploit this overload of Regex.Replace:
public static string Replace(
string input,
string pattern,
MatchEvaluator evaluator
)
This takes a delegate that will be called for each match, and return the final result. The delegate is responsible for returning what each match should be replaced with.
string output = Regex.Replace(input, ".", ma =>
{
if (ma.Value == "a")
return "b";
if (ma.Value == "b")
return "a";
return ma.Value;
});

For your particular requirement, I would suggest something like the following:
string input = "abcba";
string outPut = String.Join("", input.ToCharArray()
.Select(x => x == 'a' ? 'b' : (x == 'b' ? 'a' : x))
.ToArray());
The output string will be bacab for this particular input.

Do not call String.Replace multiple times on the same string! It creates a new string every time (and has to scan the whole string every time), causing memory pressure and wasted processor time if used a lot.
What you could do:
Create a new char array with the same length as the input string. Iterate over all chars of the input strings. For every char, check whether it should be replaced. If it should be replaced, write the replacement into the char array you created earlier, otherwise write the original char into that array. Then create a new string using that char array.
string inputString = "aabbccdd";
char[] chars = new char[inputString.Length];
for (int i = 0; i < inputString.Length; i++)
{
if (inputString[i] == 'a')
{
chars[i] = 'b';
}
else if (inputString[i] == 'b')
{
chars[i] = 'a';
}
else
{
chars[i] = inputString[i];
}
}
string outputString = new string(chars);
Consider using a switch when intending to replace a lot of different characters.
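A minimal sketch of the switch variant follows. The mapping (a and b swapped, c and d swapped) and the names CharSwapper, Map and Swap are invented for illustration; substitute your own cases.

```csharp
public static class CharSwapper
{
    // Hypothetical mapping: swaps a<->b and c<->d, leaves every other character alone.
    private static char Map(char c)
    {
        switch (c)
        {
            case 'a': return 'b';
            case 'b': return 'a';
            case 'c': return 'd';
            case 'd': return 'c';
            default: return c;
        }
    }

    // Maps each character of the input exactly once, reading only from the original
    // string, so an earlier replacement can never be undone by a later one.
    public static string Swap(string input)
    {
        char[] chars = new char[input.Length];
        for (int i = 0; i < input.Length; i++)
            chars[i] = Map(input[i]);
        return new string(chars);
    }
}
```

Keeping the mapping in its own method means new cases can be added without touching the loop.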

You should use a StringBuilder when you are building a string piece by piece in a loop like this, so I suggest the following solution:
StringBuilder sb = new StringBuilder(text.Length);
foreach(char c in text)
{
// Swap 'a' and 'b'; copy every other character unchanged.
sb.Append(c == 'a' ? 'b' : c == 'b' ? 'a' : c);
}
var result = sb.ToString();

Email address splitting

So I have a string that I need to split by semicolons.
Email address: "one@tw;,.'o"@hotmail.com;"some;thing"@example.com
Both of the email addresses are valid.
So I want to have a List<string> of the following:
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
But the way I am currently splitting the addresses is not working:
var addresses = emailAddressString.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
.Select(x => x.Trim()).ToList();
Because of the multiple ; characters I end up with invalid email addresses.
I have tried a few different ways, even going down working out if the string contains quotes and then finding the index of the ; characters and working it out that way, but it's a real pain.
Does anyone have any better suggestions?

Assuming that double-quotes are not allowed, except for the opening and closing quotes ahead of the "at" sign @, you can use this regular expression to capture e-mail addresses:
((?:[^@"]+|"[^"]*")@[^;]+)(?:;|$)
The idea is to capture either an unquoted [^@"]+ or a quoted "[^"]*" part prior to @, and then capture everything up to semicolon ; or the end anchor $.
Demo of the regex.
var input = "\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world";
var mm = Regex.Matches(input, "((?:[^@\"]+|\"[^\"]*\")@[^;]+)(?:;|$)");
foreach (Match m in mm) {
Console.WriteLine(m.Groups[1].Value);
}
This code prints
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world
Demo 1.
If you would like to allow escaped double-quotes inside double-quotes, you could use a more complex expression:
((?:(?:[^@\"]|(?<=\\)\")+|\"([^\"]|(?<=\\)\")*\")@[^;]+)(?:;|$)
Everything else remains the same.
Demo 2.

I obviously started writing my anti regex method at around the same time as juharr (Another answer). I thought that since I already have it written I would submit it.
public static IEnumerable<string> SplitEmailsByDelimiter(string input, char delimiter)
{
var startIndex = 0;
var delimiterIndex = 0;
while (delimiterIndex >= 0)
{
delimiterIndex = input.IndexOf(delimiter, startIndex);
string substring = input;
if (delimiterIndex > 0)
{
substring = input.Substring(0, delimiterIndex);
}
if (!substring.Contains("\"") || substring.IndexOf("\"") != substring.LastIndexOf("\""))
{
yield return substring;
input = input.Substring(delimiterIndex + 1);
startIndex = 0;
}
else
{
startIndex = delimiterIndex + 1;
}
}
}
Then the following
var input = "blah@blah.com;\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world;asdasd@asd.co.uk;";
foreach (var email in SplitEmailsByDelimiter(input, ';'))
{
Console.WriteLine(email);
}
Would give this output
blah@blah.com
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world
asdasd@asd.co.uk

You can also do this without using regular expressions. The following extension method will allow you to specify a delimiter character and a character to begin and end escape sequences. Note it does not validate that all escape sequences are closed.
public static IEnumerable<string> SpecialSplit(
this string str, char delimiter, char beginEndEscape)
{
int beginIndex = 0;
int length = 0;
bool escaped = false;
foreach (char c in str)
{
if (c == beginEndEscape)
{
escaped = !escaped;
}
if (!escaped && c == delimiter)
{
yield return str.Substring(beginIndex, length);
beginIndex += length + 1;
length = 0;
continue;
}
length++;
}
yield return str.Substring(beginIndex, length);
}
Then the following
var input = "\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world;\"D;D@blah;blah.com\"";
foreach (var address in input.SpecialSplit(';', '"'))
Console.WriteLine(address);
Will give this output
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world
"D;D@blah;blah.com"
Here's the version that works with an additional single escape character. It assumes that two consecutive escape characters should become one single escape character. The escape character can escape the beginEndEscape character, so that it will not trigger the beginning or end of an escape sequence, and it can also escape the delimiter. Anything else that comes after the escape character is left as is, with the escape character removed.
public static IEnumerable<string> SpecialSplit(
this string str, char delimiter, char beginEndEscape, char singleEscape)
{
StringBuilder builder = new StringBuilder();
bool escapedSequence = false;
bool previousEscapeChar = false;
foreach (char c in str)
{
if (c == singleEscape && !previousEscapeChar)
{
previousEscapeChar = true;
continue;
}
if (c == beginEndEscape && !previousEscapeChar)
{
escapedSequence = !escapedSequence;
}
if (!escapedSequence && !previousEscapeChar && c == delimiter)
{
yield return builder.ToString();
builder.Clear();
continue;
}
builder.Append(c);
previousEscapeChar = false;
}
yield return builder.ToString();
}
Finally, you should probably add null checking for the string that is passed in. Note also that both versions return a sequence containing one empty string if you pass in an empty string.
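One way to add the null check is sketched below for the two-parameter version. Because a method containing yield return does not run any of its body until the sequence is enumerated, the check is placed in a non-iterator wrapper so it fires at the call site. The names SpecialSplitChecked and SplitIterator are assumptions for illustration; the splitting logic is the same as above.

```csharp
using System;
using System.Collections.Generic;

public static class SplitExtensions
{
    // Public entry point: validates eagerly, then hands off to the lazy iterator.
    public static IEnumerable<string> SpecialSplitChecked(
        this string str, char delimiter, char beginEndEscape)
    {
        if (str == null)
            throw new ArgumentNullException(nameof(str));
        return SplitIterator(str, delimiter, beginEndEscape);
    }

    // Same logic as the two-parameter SpecialSplit: delimiters inside an
    // escaped (quoted) run are treated as ordinary characters.
    private static IEnumerable<string> SplitIterator(
        string str, char delimiter, char beginEndEscape)
    {
        int beginIndex = 0;
        int length = 0;
        bool escaped = false;
        foreach (char c in str)
        {
            if (c == beginEndEscape)
                escaped = !escaped;
            if (!escaped && c == delimiter)
            {
                yield return str.Substring(beginIndex, length);
                beginIndex += length + 1;
                length = 0;
                continue;
            }
            length++;
        }
        yield return str.Substring(beginIndex, length);
    }
}
```

If an empty input should produce no items at all, the wrapper is also a natural place to return an empty sequence before calling the iterator.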

How to encode custom HTTP headers in C#

Is there a class similar to HttpUtility to encode the content of a custom header? Ideally I would like to keep the content readable.

You can use the HttpEncoder.HeaderNameValueEncode Method in the .NET Framework 4.0 and above.
For previous versions of the .NET Framework, you can roll your own encoder, using the logic noted on the HttpEncoder.HeaderNameValueEncode reference page:
All characters whose Unicode value is less than ASCII character 32,
except ASCII character 9, are URL-encoded into a format of %NN where
the N characters represent hexadecimal values.
ASCII character 9 (the horizontal tab character) is not URL-encoded.
ASCII character 127 is encoded as %7F.
All other characters are not encoded.
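A roll-your-own encoder that follows the four rules quoted above could look roughly like this. It is a sketch, not the framework's code: the names HeaderEncoding and EncodeHeaderValue are made up, and the uppercase hex digits are an assumption chosen to match the "%7F" in the quoted rule.

```csharp
using System.Text;

public static class HeaderEncoding
{
    // Percent-encodes control characters (except tab) and DEL (127);
    // every other character is copied through unchanged.
    public static string EncodeHeaderValue(string value)
    {
        if (string.IsNullOrEmpty(value))
            return value;

        var sb = new StringBuilder(value.Length);
        foreach (char ch in value)
        {
            if ((ch < 32 && ch != 9) || ch == 127)
                sb.Append('%').Append(((int)ch).ToString("X2")); // assumption: uppercase hex
            else
                sb.Append(ch);
        }
        return sb.ToString();
    }
}
```

Because carriage returns and line feeds are encoded, a value passed through this method cannot introduce an extra header line. Literal percent signs are left alone under these rules, so the output cannot be decoded unambiguously.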
Update:
As OliverBock points out, the HttpEncoder.HeaderNameValueEncode method is protected and internal. I went to the open-source Mono project and found Mono's implementation:
void HeaderNameValueEncode (string headerName, string headerValue, out string encodedHeaderName, out string encodedHeaderValue)
{
if (String.IsNullOrEmpty (headerName))
encodedHeaderName = headerName;
else
encodedHeaderName = EncodeHeaderString (headerName);
if (String.IsNullOrEmpty (headerValue))
encodedHeaderValue = headerValue;
else
encodedHeaderValue = EncodeHeaderString (headerValue);
}
static void StringBuilderAppend (string s, ref StringBuilder sb)
{
if (sb == null)
sb = new StringBuilder (s);
else
sb.Append (s);
}
static string EncodeHeaderString (string input)
{
StringBuilder sb = null;
for (int i = 0; i < input.Length; i++) {
char ch = input [i];
if ((ch < 32 && ch != 9) || ch == 127)
StringBuilderAppend (String.Format ("%{0:x2}", (int)ch), ref sb);
}
if (sb != null)
return sb.ToString ();
return input;
}
Just FYI, the source is here: https://github.com/mono/mono/blob/master/mcs/class/System.Web/System.Web.Util/HttpEncoder.cs

What helped for me was Uri.EscapeDataString(headervalue).

This does the same job as HeaderNameValueEncode(), but will also encode % characters so the header can be reliably decoded later.
static string EncodeHeaderValue(string value)
{
return Regex.Replace(value, @"[\u0000-\u0008\u000a-\u001f%\u007f]", (m) => "%"+((int)m.Value[0]).ToString("x2"));
}
static string DecodeHeaderValue(string encoded)
{
return Regex.Replace(encoded, @"%([0-9a-f]{2})", (m) => new String((char)Convert.ToInt32(m.Groups[1].Value, 16), 1), RegexOptions.IgnoreCase);
}

Maybe this one?
The UrlEncode function.

Sorry, this is off the top of my head, but your request object should have a headers object you can add to,
i.e. request.headers.add("blah");
That's not spot on, but it should point you in the right direction.
