Replace all non-ASCII characters with a code - C#

Is it possible in C# to replace all non-ASCII characters in a string with a code? I have an application that prints to a Zebra label printer using ZPL. It needs every non-ASCII character converted to the hex values of its UTF-8 bytes, each with a leading underscore. For example, if the user wants to print µ (micro symbol), I have to do
text = text.Replace("µ", "_c2_b5"); // c2 b5 is the UTF-8 encoding of µ
Example: "Helloµ±" should become "Hello_c2_b5_c2_b1".

This will help:
var source = "Helloµ±";
var sb = new StringBuilder();
foreach (char c in source)
{
if (c == '_')
{
// special case: Replace _ With _5f
sb.Append("_5f");
}
else if (c < 32 || c > 126) // 127 (DEL) is a control character, not printable
{
// handle non-ascii by using hex representation of bytes
// TODO: check whether "surrogate pairs" are handled correctly (if required)
var ba = Encoding.UTF8.GetBytes(new[] { c });
foreach (byte b in ba)
{
sb.AppendFormat("_{0:x2}", b);
}
}
else
{
// in printable ASCII range, so just copy
sb.Append(c);
}
}
Console.WriteLine(sb.ToString());
This results in "Hello_c2_b5_c2_b1"
It is up to you to wrap this in a nice method.
Late addition: The first two tests can be combined, as _ just has to be replaced by its byte representation, to avoid confusion about what an _ means in the result:
if (c == '_' || c < 32 || c > 126)
{
var ba = Encoding.UTF8.GetBytes(new[] { c });
foreach (byte b in ba)
{
sb.AppendFormat("_{0:x2}", b);
}
}
else
{
sb.Append(c);
}
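The TODO about surrogate pairs can be sidestepped by encoding the whole string to UTF-8 first and then escaping byte by byte. The following is only a sketch of that idea, not a tested ZPL routine: the names ZplText and EscapeNonAscii are made up for illustration, and it assumes that DEL (127) should be escaped along with the other control characters.

```csharp
using System;
using System.Text;

static class ZplText
{
    // Hypothetical helper: escapes '_' and every byte outside printable ASCII as _xx.
    public static string EscapeNonAscii(string source)
    {
        var sb = new StringBuilder(source.Length);
        // Encoding the full string lets the encoder combine a surrogate pair
        // into one 4-byte UTF-8 sequence instead of two replacement characters.
        foreach (byte b in Encoding.UTF8.GetBytes(source))
        {
            if (b == (byte)'_' || b < 32 || b > 126)
                sb.AppendFormat("_{0:x2}", b);   // assumption: DEL (127) is escaped too
            else
                sb.Append((char)b);              // printable ASCII, copy as-is
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(EscapeNonAscii("Helloµ±"));     // Hello_c2_b5_c2_b1
        Console.WriteLine(EscapeNonAscii("a_b"));         // a_5fb
        Console.WriteLine(EscapeNonAscii("\U0001F600"));  // _f0_9f_98_80
    }
}
```

Because each UTF-8 byte below 128 is the ASCII character itself, the cast back to char is safe for the bytes that are copied through.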

You can try this, although it turns each non-ASCII character into '?' rather than a hex code:
var bytes = System.Text.Encoding.ASCII.GetBytes("søme string");
string result = System.Text.Encoding.UTF8.GetString(bytes);

Here is an example that removes all non-ASCII characters from a string:
string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);

Related

Remove unwanted characters from a huge file

EDIT : Here's my current code (21233664 chars)
string str = myInput.Text;
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_' || c==' ')
{
sb.Append(c);
}
}
output.Text = sb.ToString();
Let's say I have a huge text file which contains special characters and normal expressions with underscores.
Here are a few examples of the strings that I'm looking for :
super_test
test
another_super_test
As you can see, only lower case letters are allowed with underscores.
Now, if I have those strings in a text file that looks like this :
> §> ˜;# ®> l? super_test D>ÿÿÿÿ “G? tI> €[> €? È
The problem I'm facing is that some lonely letters are still saved. In the example given above, the output would be :
l super_test t
To get rid of those characters, I must go through the whole file again, but here's my question: how can I know whether a letter is lonely or not?
I'm not sure I understand the possibilities with regex, so if anyone can give me a hint I'd really appreciate it.
You clearly need a regular expression. A simple one would be [a-z_]{2,}, which takes all strings of lowercase a to z letters and underscore that are at least 2 characters long.
Just be careful when you are parsing the big file. Being huge, I imagine you use some sort of buffers. You need to make sure you don't get half of a word in one buffer and the other in the next.
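As a rough sketch of the regex approach, the following collects every match of [a-z_]{2,} and joins them with spaces. The class and method names are hypothetical, and the whole input is assumed to fit in one string (no buffering).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class WordExtractor
{
    // Hypothetical helper: keeps runs of two or more lowercase letters/underscores.
    public static string ExtractWords(string input)
    {
        var words = Regex.Matches(input, "[a-z_]{2,}")
                         .Cast<Match>()
                         .Select(m => m.Value);
        return string.Join(" ", words);
    }

    static void Main()
    {
        string input = "> §> ˜;# ®> l? super_test D>ÿÿÿÿ “G? tI> €[> €? È";
        Console.WriteLine(ExtractWords(input)); // super_test
    }
}
```

Note that the pattern also matches lowercase runs inside mixed-case tokens (for example, "Test" would yield "est"), so it may need word-boundary checks depending on the data.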
You can't treat the space just like the other acceptable characters. In addition to being acceptable, the space also serves as a delimiter for your lonesome characters. (This might be a problem with the proposed regular expressions as well; I couldn't say for sure.) Anyway, this does what (I think) you want:
string str = "> §> ˜;# ®> l? super_test D>ÿÿÿÿ “G? tI> €[> €? È";
StringBuilder sb = new StringBuilder();
char? firstLetterOfWord = null;
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_')
{
int length = sb.Length;
if (firstLetterOfWord != null)
{
// c is the second character of a word
sb.Append(firstLetterOfWord);
sb.Append(c);
firstLetterOfWord = null;
}
else if (length == 0 || sb[length - 1] == ' ')
{
// c is the first character of a word; save for next iteration
firstLetterOfWord = c;
}
else
{
// c is part of a word; we're not first, and prev != space
sb.Append(c);
}
}
else if (c == ' ')
{
// If you want to eliminate multiple spaces in a row,
// this is the place to do so
sb.Append(' ');
firstLetterOfWord = null;
}
else
{
firstLetterOfWord = null;
}
}
Console.WriteLine(sb.ToString());
It works with singletons and full words at both start and end of string.
If your input contains something like one#two, the output will run together (onetwo with no intervening space). Assuming that's not what you want, and also assuming that you have no need for multiple spaces in a row:
StringBuilder sb = new StringBuilder();
bool previousWasSpace = true;
char? firstLetterOfWord = null;
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_')
{
if (firstLetterOfWord != null)
{
sb.Append(firstLetterOfWord).Append(c);
firstLetterOfWord = null;
previousWasSpace = false;
}
else if (previousWasSpace)
{
firstLetterOfWord = c;
}
else
{
sb.Append(c);
}
}
else
{
firstLetterOfWord = null;
if (!previousWasSpace)
{
sb.Append(' ');
previousWasSpace = true;
}
}
}
Console.WriteLine(sb.ToString());

Remove a char from a String

I'm looking for a method that removes certain characters from a string.
For example, I have " 3*X^4" and I want to remove the characters '*' and '^', so that the string becomes "3X4".
Maybe:
string s = Regex.Replace(input, "[*^]", "");
var s = "3*X^4";
var simplified = s.Replace("*", "").Replace("^", "");
// simplified is now "3X4"
Try this. It removes all special characters from a string:
public static string RemoveSpecialCharacters(string str)
{
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
|| c == '.' || c == '_')
{
sb.Append(c);
}
}
return sb.ToString();
}
Another solution would be extracting the unwanted characters manually - this might be slightly more performant than repeatedly calling string.Replace especially for larger numbers of unwanted characters:
StringBuilder result = new StringBuilder(input.Length);
foreach (char ch in input) {
switch (ch) {
case '*':
case '^':
break;
default:
result.Append(ch);
break;
}
}
string s = result.ToString();
Or maybe extracting is the wrong word: Rather, you copy all characters except for those that you don't want.
Try String.Replace(oldValue, newValue):
string S = "3*X^4";
string str = S.Replace("*","").Replace("^","");

Remove non printable characters C# multilanguage

I have a multi-language ASP.NET application in C#. I have to create a zip file and use some items from the database to construct the file name, stripping out special characters. However, if the language is German, for example, my trimming algorithm also removes German characters such as umlauts.
Could someone provide me with a language-adaptable trimming algorithm?
Here is my code:
private string RemoveSpecialCharacters(string str)
{
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_' || c == ' ' || c == '+')
{
sb.Append(c);
}
}
return sb.ToString();
}
thanks
Assuming you mean the name of the ZIP file, instead of the names inside the ZIP file, you probably want to check if the character is valid for a filename, which will allow you to use more than just letters or digits:
char[] invalid = System.IO.Path.GetInvalidFileNameChars();
string s = "abcöü*/";
var newstr = new String(s.Where(c => !invalid.Contains(c)).ToArray());
Alternatively, keep only letters and digits, which includes letters such as umlauts:
string s = "abcöü*/";
var newstr = new String( s.Where(Char.IsLetterOrDigit).ToArray() );
A more versatile variant that will mangle the string less is:
public static string RemoveDiacritics(this string s)
{
// decompose accented characters into a base character plus combining marks
IEnumerable<char> chars = s.Normalize(NormalizationForm.FormD);
// remove all non-ASCII characters – i.e. the accents
return new string(chars.Where(c => c < 0x7f && !char.IsControl(c)).ToArray());
}
This should remove most problematic characters while still preserving most of the text. (If you're creating filenames, you might also want to replace newlines and tabs with the space character.)
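To see what this variant keeps and what it drops, here is a small self-contained sketch that repeats the same logic as a plain static method. The class name and the sample file name are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

static class DiacriticsDemo
{
    // Same logic as RemoveDiacritics above, written as a plain static method.
    public static string RemoveDiacritics(string s)
    {
        // FormD turns "ü" into "u" followed by a combining diaeresis (U+0308).
        string decomposed = s.Normalize(NormalizationForm.FormD);
        // Dropping everything outside printable ASCII removes the combining marks.
        return new string(decomposed.Where(c => c < 0x7f && !char.IsControl(c)).ToArray());
    }

    static void Main()
    {
        Console.WriteLine(RemoveDiacritics("Müller Straße.zip")); // Muller Strae.zip
    }
}
```

The umlaut is reduced to its base letter, but ß has no decomposition and is dropped entirely, so some mangling remains.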
One-liner, assuming ASCII where non-printable are essentially all chars before the space:
var safeString = new string(str.Select(c=>c<' '?'_':c).ToArray());

How to encode custom HTTP headers in C#

Is there a class similar to HttpUtility to encode the content of a custom header? Ideally I would like to keep the content readable.
You can use the HttpEncoder.HeaderNameValueEncode Method in the .NET Framework 4.0 and above.
For previous versions of the .NET Framework, you can roll your own encoder, using the logic noted on the HttpEncoder.HeaderNameValueEncode reference page:
All characters whose Unicode value is less than ASCII character 32,
except ASCII character 9, are URL-encoded into a format of %NN where
the N characters represent hexadecimal values.
ASCII character 9 (the horizontal tab character) is not URL-encoded.
ASCII character 127 is encoded as %7F.
All other characters are not encoded.
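The rules quoted above can be sketched as a small stand-alone encoder. This is an illustrative reading of those four rules, not the framework's own code, and the names HeaderEncoding and Encode are hypothetical.

```csharp
using System;
using System.Text;

static class HeaderEncoding
{
    // Hypothetical helper implementing the quoted rules.
    public static string Encode(string value)
    {
        if (string.IsNullOrEmpty(value))
            return value;

        var sb = new StringBuilder(value.Length);
        foreach (char ch in value)
        {
            // Control characters other than tab, plus DEL, become %NN.
            if ((ch < 32 && ch != 9) || ch == 127)
                sb.AppendFormat("%{0:x2}", (int)ch);
            else
                sb.Append(ch); // everything else is copied unchanged
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Prints line1%0d%0aline2, then a tab, then end
        Console.WriteLine(Encode("line1\r\nline2\tend"));
    }
}
```

Since '%' itself is left alone under these rules, the result cannot always be decoded unambiguously.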
Update:
As OliverBock points out, the HttpEncoder.HeaderNameValueEncode method is protected and internal. I looked in the open source Mono project and found Mono's implementation:
void HeaderNameValueEncode (string headerName, string headerValue, out string encodedHeaderName, out string encodedHeaderValue)
{
if (String.IsNullOrEmpty (headerName))
encodedHeaderName = headerName;
else
encodedHeaderName = EncodeHeaderString (headerName);
if (String.IsNullOrEmpty (headerValue))
encodedHeaderValue = headerValue;
else
encodedHeaderValue = EncodeHeaderString (headerValue);
}
static void StringBuilderAppend (string s, ref StringBuilder sb)
{
if (sb == null)
sb = new StringBuilder (s);
else
sb.Append (s);
}
static string EncodeHeaderString (string input)
{
StringBuilder sb = null;
for (int i = 0; i < input.Length; i++) {
char ch = input [i];
if ((ch < 32 && ch != 9) || ch == 127)
StringBuilderAppend (String.Format ("%{0:x2}", (int)ch), ref sb);
}
if (sb != null)
return sb.ToString ();
return input;
}
For reference, the source is here: https://github.com/mono/mono/blob/master/mcs/class/System.Web/System.Web.Util/HttpEncoder.cs
Uri.EscapeDataString(headervalue) worked for me.
This does the same job as HeaderNameValueEncode(), but will also encode % characters so the header can be reliably decoded later.
static string EncodeHeaderValue(string value)
{
return Regex.Replace(value, @"[\u0000-\u0008\u000a-\u001f%\u007f]", (m) => "%"+((int)m.Value[0]).ToString("x2"));
}
static string DecodeHeaderValue(string encoded)
{
return Regex.Replace(encoded, @"%([0-9a-f]{2})", (m) => new String((char)Convert.ToInt32(m.Groups[1].Value, 16), 1), RegexOptions.IgnoreCase);
}
Maybe this one?
UrlEncode function
Sorry, this is off the top of my head, but your request object should have a headers collection you can add to,
i.e. request.headers.add("blah");
That's not exact, but it should point you in the right direction.

Remove all non-ASCII characters from string

I have a C# routine that imports data from a CSV file, matches it against a database and then rewrites it to a file. The source file seems to have a few non-ASCII characters that are fouling up the processing routine.
I already have a static method that I run each input field through but it performs basic checks like removing commas and quotes. Does anybody know how I could add functionality that removes non-ASCII characters too?
Here is a simple solution:
public static bool IsASCII(this string value)
{
// ASCII encoding replaces non-ascii with question marks, so we use UTF8 to see if multi-byte sequences are there
return Encoding.UTF8.GetByteCount(value) == value.Length;
}
source: http://snipplr.com/view/35806/
string sOut = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
Do it all at once
public string ReturnCleanASCII(string s)
{
StringBuilder sb = new StringBuilder(s.Length);
foreach(char c in s)
{
if((int)c > 127) // you probably don't want 127 either
continue;
if((int)c < 32) // I bet you don't want control characters
continue;
if(c == ',')
continue;
if(c == '"')
continue;
sb.Append(c);
}
return sb.ToString();
}
If you wanted to test a specific character, you could use
if ((int)myChar <= 127)
Just getting the ASCII encoding of the string will not tell you that a specific character was non-ASCII to begin with (if you care about that). See MSDN.
Here's an improvement upon the accepted answer:
string fallbackStr = "";
Encoding enc = Encoding.GetEncoding(Encoding.ASCII.CodePage,
new EncoderReplacementFallback(fallbackStr),
new DecoderReplacementFallback(fallbackStr));
string cleanStr = enc.GetString(enc.GetBytes(inputStr));
This method will replace unknown characters with the value of fallbackStr, or if fallbackStr is empty, leave them out entirely. (Note that enc can be defined outside the scope of a function.)
It seems strange to me that simply dropping the non-ASCII characters is acceptable.
Also I always recommend the excellent FileHelpers library for parsing CSV-files.
strText = Regex.Replace(strText, @"[^\u0020-\u007E]", string.Empty);
public string RunCharacterCheckASCII(string s)
{
// Nothing to check for a null string
if (s == null)
return s;
char[] chars = s.ToCharArray();
bool found = false;
for (int i = 0; i < chars.Length; i++)
{
if (chars[i] > 127) // outside the ASCII range
{
// Replace the non-ASCII character with a question mark
found = true;
chars[i] = '?';
}
}
// Only build a new string if something was replaced
return found ? new string(chars) : s;
}
