Remove non printable characters C# multilanguage - c#

I have a multi-language application in ASP.NET (C#). I need to create a zip file and build its file name from items in the database, so I strip special characters from the name. However, if the language is German, for example, my trimming algorithm also removes German characters such as umlauts.
Could someone suggest a trimming algorithm that adapts to the language?
Here is my code:
private string RemoveSpecialCharacters(string str)
{
    StringBuilder sb = new StringBuilder();
    foreach (char c in str)
    {
        // keep only ASCII letters, digits and a few punctuation characters
        if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_' || c == ' ' || c == '+')
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}
thanks

Assuming you mean the name of the ZIP file, instead of the names inside the ZIP file, you probably want to check if the character is valid for a filename, which will allow you to use more than just letters or digits:
char[] invalid = System.IO.Path.GetInvalidFileNameChars();
string s = "abcöü*/";
var newstr = new String(s.Where(c => !invalid.Contains(c)).ToArray());

string s = "abcöü*/";
var newstr = new String( s.Where(Char.IsLetterOrDigit).ToArray() );

A more versatile variant that will mangle the string less is:
public static string RemoveDiacritics(this string s)
{
    // decompose accented characters into a base letter followed by combining marks
    IEnumerable<char> chars = s.Normalize(NormalizationForm.FormD);
    // keep only printable ASCII; this drops the combining marks (the accents)
    // along with any other non-ASCII character
    return new string(chars.Where(c => c < 0x7f && !char.IsControl(c)).ToArray());
}
This should remove most problematic characters while still preserving most of the text. (If you're creating filenames, you might also want to replace newlines and tabs with the space character.)
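As a rough sketch of the same idea that mangles the text a little less: instead of dropping every non-ASCII character after decomposition, you can drop only the combining marks, so letters that have no decomposition (such as ß) survive. The class and method names here are made up for illustration, and the result is not checked against invalid file name characters.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

public static class DiacriticsSketch
{
    // Hypothetical helper, not part of the answer above.
    public static string StripCombiningMarks(string s)
    {
        // FormD splits e.g. "ö" into "o" followed by U+0308 (combining diaeresis).
        string decomposed = s.Normalize(NormalizationForm.FormD);

        // Drop only the combining marks; every other character is kept.
        var kept = decomposed.Where(c =>
            CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);

        // Recompose whatever is left into its usual composed form.
        return new string(kept.ToArray()).Normalize(NormalizationForm.FormC);
    }
}
```

You would probably still want to filter the result with Path.GetInvalidFileNameChars() before using it as a file name.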

One-liner, assuming ASCII where non-printable are essentially all chars before the space:
var safeString = new string(str.Select(c=>c<' '?'_':c).ToArray());

Related

Replace all non ascii characters with a code

Is it possible in C# to replace all non-ASCII characters in a string with a code? I have an application that prints to a Zebra label printer using ZPL. It needs every non-ASCII character written as its UTF-8 bytes in hex, each byte with a leading underscore. For example, if the user wants to print µ (micro symbol) I have to do
text = text.Replace("µ", "_c2_b5"); // c2 b5 is the UTF-8 encoding of µ
Example: "Helloµ±" should become "Hello_c2_b5_c2_b1".
This will help:
var source = "Helloµ±";
var sb = new StringBuilder();
foreach (char c in source)
{
if (c == '_')
{
// special case: Replace _ With _5f
sb.Append("_5f");
}
else if (c < 32 || c > 126)
{
// handle non-ascii by using hex representation of bytes
// TODO: check whether "surrogate pairs" are handled correctly (if required)
var ba = Encoding.UTF8.GetBytes(new[] { c });
foreach (byte b in ba)
{
sb.AppendFormat("_{0:x2}", b);
}
}
else
{
// in printable ASCII range, so just copy
sb.Append(c);
}
}
Console.WriteLine(sb.ToString());
This results in "Hello_c2_b5_c2_b1"
It is up to you to wrap this in a nice method.
Late addition: The first two tests can be combined, as _ just has to be replaced by its byte representation, to avoid confusion about what an _ means in the result:
if (c == '_' || c < 32 || c > 126)
{
var ba = Encoding.UTF8.GetBytes(new[] { c });
foreach (byte b in ba)
{
sb.AppendFormat("_{0:x2}", b);
}
}
else
{
sb.Append(c);
}
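Regarding the surrogate-pair TODO: the loop above encodes one char at a time, so a character outside the Basic Multilingual Plane (two chars in a .NET string) is not encoded as a unit. One possible way around that, sketched here with made-up names, is to encode the whole string to UTF-8 first and then walk the bytes. This sketch also treats DEL (127) as non-printable, which is an assumption about what the printer expects.

```csharp
using System.Text;

public static class ZplEscapeSketch
{
    // Hypothetical helper; the name and the DEL handling are assumptions.
    public static string EscapeNonAscii(string source)
    {
        var sb = new StringBuilder();

        // Encoding the whole string lets the encoder combine surrogate pairs
        // into a single four-byte UTF-8 sequence.
        foreach (byte b in Encoding.UTF8.GetBytes(source))
        {
            if (b == (byte)'_' || b < 32 || b > 126)
            {
                // underscore, control characters and every byte of a
                // multi-byte sequence become _xx
                sb.AppendFormat("_{0:x2}", b);
            }
            else
            {
                // printable ASCII is copied unchanged
                sb.Append((char)b);
            }
        }
        return sb.ToString();
    }
}
```

This works byte by byte because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never be mistaken for an ASCII character.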
You can try this. Note that it does not produce the codes asked for: the ASCII encoder replaces every non-ASCII character with '?'.
var bytes = System.Text.Encoding.ASCII.GetBytes("søme string");
string result = System.Text.Encoding.UTF8.GetString(bytes);
Here is an example that removes all non-ASCII characters from a string:
string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);

Remove unwanted characters from a huge file

EDIT : Here's my current code (21233664 chars)
string str = myInput.Text;
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_' || c==' ')
{
sb.Append(c);
}
}
output.Text = sb.ToString();
Let's say I have a huge text file which contains special characters and normal expressions with underscores.
Here are a few examples of the strings that I'm looking for :
super_test
test
another_super_test
As you can see, only lowercase letters and underscores are allowed.
Now, suppose those strings sit in a text file that looks like this:
> §> ˜;# ®> l? super_test D>ÿÿÿÿ “G? tI> €[> €? È
The problem I'm facing is that some lonely letters are still saved. In the example given above, the output would be :
l super_test t
To get rid of those characters I would have to go through the whole file again, but here's my question: how can I know whether a letter is lonely or not?
I'm not sure I understand the possibilities with regex, so if anyone can give me a hint I'd really appreciate it.
You clearly need a regular expression. A simple one would be [a-z_]{2,}, which matches every run of lowercase letters a to z and underscores that is at least 2 characters long.
Just be careful when you are parsing the big file. Since it is huge, I imagine you read it through some sort of buffer. You need to make sure you don't get half of a word in one buffer and the other half in the next.
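A minimal sketch of how that pattern could be applied to a string that is already in memory (the class and method names are invented for this example; reading the file in chunks is left out):

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class WordExtractorSketch
{
    // Runs of lowercase letters/underscores, at least two characters long.
    private static readonly Regex Word = new Regex("[a-z_]{2,}");

    // Hypothetical helper: returns the matches joined by single spaces.
    public static string ExtractWords(string text)
    {
        return string.Join(" ",
            Word.Matches(text).Cast<Match>().Select(m => m.Value));
    }
}
```

Because single letters never match, the lonely "l" and "t" from the example disappear without a second pass. Note that a lowercase run glued to other characters (for example the "est" in "Test") would still be matched.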
You can't treat the space just like the other acceptable characters. In addition to being acceptable, the space also serves as a delimiter for your lonesome characters. (This might be a problem with the proposed regular expressions as well; I couldn't say for sure.) Anyway, this does what (I think) you want:
string str = "> §> ˜;# ®> l? super_test D>ÿÿÿÿ “G? tI> €[> €? È";
StringBuilder sb = new StringBuilder();
char? firstLetterOfWord = null;
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_')
{
int length = sb.Length;
if (firstLetterOfWord != null)
{
// c is the second character of a word
sb.Append(firstLetterOfWord);
sb.Append(c);
firstLetterOfWord = null;
}
else if (length == 0 || sb[length - 1] == ' ')
{
// c is the first character of a word; save for next iteration
firstLetterOfWord = c;
}
else
{
// c is part of a word; we're not first, and prev != space
sb.Append(c);
}
}
else if (c == ' ')
{
// If you want to eliminate multiple spaces in a row,
// this is the place to do so
sb.Append(' ');
firstLetterOfWord = null;
}
else
{
firstLetterOfWord = null;
}
}
Console.WriteLine(sb.ToString());
It works with singletons and full words at both start and end of string.
If your input contains something like one#two, the output will run together (onetwo with no intervening space). Assuming that's not what you want, and also assuming that you have no need for multiple spaces in a row:
StringBuilder sb = new StringBuilder();
bool previousWasSpace = true;
char? firstLetterOfWord = null;
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_')
{
if (firstLetterOfWord != null)
{
sb.Append(firstLetterOfWord).Append(c);
firstLetterOfWord = null;
previousWasSpace = false;
}
else if (previousWasSpace)
{
firstLetterOfWord = c;
}
else
{
sb.Append(c);
}
}
else
{
firstLetterOfWord = null;
if (!previousWasSpace)
{
sb.Append(' ');
previousWasSpace = true;
}
}
}
Console.WriteLine(sb.ToString());

Remove characters before reaching a set of wanted characters in a string

I have this string coming from a virtual serial port with an ID/Passport reader:
b\0OU0IDBGR9247884874<<<<<<<<<<<<<<<|8601130M1709193BGR8601138634<3|IVANOV<
I have replaced the "\r" with the "|" character.
What I would like to do is always remove everything before the first occurrence of one of these character combinations in the string:
"ID", "I<", "P<", "V<" or "VI".
So far I have tried the following to remove the characters, without success:
public static string RemoveSpecialCharacters(string str)
{
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_' || c == '<' )
{
sb.Append(c);
}
}
return sb.ToString();
}
Remove the characters before reaching the first combination of characters in the string which might be: "ID","I<","P<" & "V<" "VI".
I think this will do what you want.
string s = "b\0OU0IDBGR9247884874<<<<<<<<<<<<<<<|8601130M1709193BGR8601138634<3|IVANOV<";
s = string.Join("", Regex.Split(s, "(ID|I<|P<|V<|VI)").Skip(1));
First it splits the string on the character combinations you wanted (while preserving them in the resulting array, because the pattern is wrapped in a capturing group) and then discards the first member of the array (which is everything before the first combination). Then it joins the rest of the array back together.
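An alternative sketch of the same idea, without splitting and re-joining: find the first match and take the substring from there. The names are invented for the example, and returning the input unchanged when nothing matches is an assumption (the Split version above would return an empty string in that case).

```csharp
using System.Text.RegularExpressions;

public static class PrefixTrimSketch
{
    // The combinations listed in the question.
    private static readonly Regex Start = new Regex("ID|I<|P<|V<|VI");

    // Hypothetical helper: drops everything before the first combination.
    public static string TrimToStart(string s)
    {
        Match m = Start.Match(s);
        // Assumption: leave the string untouched if no combination is found.
        return m.Success ? s.Substring(m.Index) : s;
    }
}
```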

Removing all non letter characters from a string in C#

I want to remove all non-letter characters from a string. By that I mean anything that isn't a letter of the alphabet or an apostrophe. This is the code I have.
public static string RemoveBadChars(string word)
{
char[] chars = new char[word.Length];
for (int i = 0; i < word.Length; i++)
{
char c = word[i];
if ((int)c >= 65 && (int)c <= 90)
{
chars[i] = c;
}
else if ((int)c >= 97 && (int)c <= 122)
{
chars[i] = c;
}
else if ((int)c == 44)
{
chars[i] = c;
}
}
word = new string(chars);
return word;
}
It's close, but doesn't quite work. The problem is this:
[in]: "(the"
[out]: " the"
It gives me a space there instead of the "(". I want to remove the character entirely.
The Char class has a method that could help out. Use Char.IsLetter() to detect valid letters (and an additional check for the apostrophe), then pass the result to the string constructor:
var input = "(the;':";
var result = new string(input.Where(c => Char.IsLetter(c) || c == '\'').ToArray());
Output:
the'
You should use Regular Expression (Regex) instead.
public static string RemoveBadChars(string word)
{
Regex reg = new Regex("[^a-zA-Z']");
return reg.Replace(word, string.Empty);
}
If you don't want to replace spaces:
Regex reg = new Regex("[^a-zA-Z' ]");
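A regular expression would be better, as this is pretty inefficient, but to answer your question: the problem with your code is that you use i to index the output array as well, so every skipped character leaves an empty slot. Use a separate index for the output and build the string from only the characters you actually wrote. (Also note that character code 44 is a comma; the apostrophe is 39.) So, something like this:
public static string RemoveBadChars(string word)
{
    char[] chars = new char[word.Length];
    int myindex = 0; // next free position in the output array
    for (int i = 0; i < word.Length; i++)
    {
        char c = word[i];
        if ((int)c >= 65 && (int)c <= 90) // A-Z
        {
            chars[myindex] = c;
            myindex++;
        }
        else if ((int)c >= 97 && (int)c <= 122) // a-z
        {
            chars[myindex] = c;
            myindex++;
        }
        else if ((int)c == 39) // apostrophe
        {
            chars[myindex] = c;
            myindex++;
        }
    }
    // use only the filled part of the array, otherwise the result ends in '\0' characters
    return new string(chars, 0, myindex);
}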
private static Regex badChars = new Regex("[^A-Za-z']");
public static string RemoveBadChars(string word)
{
return badChars.Replace(word, "");
}
This creates a Regular Expression that consists of a character class (enclosed in square brackets) that looks for anything that is not (the leading ^ inside the character class) A-Z, a-z, or '. It then defines a function that replaces anything that matches the expression with an empty string.
This also works if removing non-word characters is enough. Note that \W keeps digits and underscores as well as letters, and that it removes apostrophes:
public static string RemoveNoneLetterChars(string word)
{
    Regex reg = new Regex(@"\W");
    return reg.Replace(word, " "); // or return reg.Replace(word, String.Empty);
}
word.Aggregate(new StringBuilder(word.Length), (acc, c) => acc.Append(Char.IsLetter(c) ? c.ToString() : "")).ToString();
Or you can substitute whatever function in place of IsLetter.

Remove a char from a String

I'm looking for a method that removes certain characters from a string.
For example, I have "3*X^4" and I want to remove the characters '*' and '^' so that the string becomes "3X4".
Maybe:
string s = Regex.Replace(input, "[*^]", "");
var s = "3*X^4";
var simplified = s.Replace("*", "").Replace("^", "");
// simplified is now "3X4"
Try this. It removes every character that is not an ASCII letter, a digit, '.' or '_':
public static string RemoveSpecialCharacters(string str)
{
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
|| c == '.' || c == '_')
{
sb.Append(c);
}
}
return sb.ToString();
}
Another solution would be extracting the unwanted characters manually - this might be slightly more performant than repeatedly calling string.Replace especially for larger numbers of unwanted characters:
StringBuilder result = new StringBuilder(input.Length);
foreach (char ch in input) {
switch (ch) {
case '*':
case '^':
break;
default:
result.Append(ch);
break;
}
}
string s = result.ToString();
Or maybe extracting is the wrong word: Rather, you copy all characters except for those that you don't want.
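If the set of unwanted characters is not known at compile time, the same copy-what-you-keep loop can be sketched with a HashSet instead of a switch. The helper name and signature are made up for this example.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CharFilterSketch
{
    // Hypothetical helper: copies every character that is not in 'unwanted'.
    public static string RemoveChars(string input, params char[] unwanted)
    {
        var blocked = new HashSet<char>(unwanted);
        var result = new StringBuilder(input.Length);
        foreach (char ch in input)
        {
            if (!blocked.Contains(ch))
            {
                result.Append(ch);
            }
        }
        return result.ToString();
    }
}
```

For the example from the question, CharFilterSketch.RemoveChars("3*X^4", '*', '^') gives "3X4".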
Try this: String.Replace(oldValue, newValue)
string S = "3*X^4";
string str = S.Replace("*","").Replace("^","");
