Replace in a string all characters outside the set Windows-1252 - c#

Having to maintain old programs written in VB6, I find myself having this issue.
I need an efficient way to search a string for all characters outside the Windows-1252 set and replace them with "_". I can do this in C#.
So far I have done this by building a string containing all Windows-1252 characters and checking against it. Is there a faster way?
I may have to do this for a few million records in a text file.
string chars1252 = "!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶•¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿŸžœ›š™˜—–•”“’‘ŽŽ‹Š‰vˆ‡†…„ƒ‚€ ";
//Replace all characters not in the string above...

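Have you tried normalizing the string? Note that string.Normalize() performs Unicode normalization; it does not by itself remove characters outside the Windows-1252 character set. With NormalizationForm.FormD, accented characters are decomposed into a base letter plus combining marks, which can make a later filtering step easier. https://learn.microsoft.com/de-de/dotnet/api/system.string.normalize?view=net-7.0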
string inputString = "Some input string";
string outputString = inputString.Normalize(NormalizationForm.FormD);
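Alternatively, you can loop over the characters of the string and use the StringBuilder class to replace the ones outside the allowed range. Note that the test c <= '\u00FF' checks for Latin-1 (ISO-8859-1), which only approximates Windows-1252: characters such as € (U+20AC) or ™ (U+2122) are part of Windows-1252 but lie above U+00FF, so they are replaced too.
string inputString = "Some input string";
StringBuilder sb = new StringBuilder(inputString.Length);
foreach (char c in inputString)
{
    if (c <= '\u00FF')
    {
        sb.Append(c);
    }
    else
    {
        sb.Append('_'); // replace instead of dropping, as the question asks
    }
}
string outputString = sb.ToString();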
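If you want to keep an explicit list of allowed characters, as the question does, a hash-set lookup avoids scanning the whole list for every character. The following is a minimal sketch, not a benchmarked solution; AllowedCharFilter and ReplaceDisallowed are hypothetical names, and the allowed set is whatever you pass in.

```csharp
using System.Collections.Generic;
using System.Text;

public static class AllowedCharFilter
{
    // Hypothetical helper: replaces every char that is not in `allowed`
    // with `replacement`, in a single pass over the input.
    public static string ReplaceDisallowed(string input, HashSet<char> allowed, char replacement = '_')
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            sb.Append(allowed.Contains(c) ? c : replacement);
        }
        return sb.ToString();
    }
}
```

Build the set once from the string of allowed characters (a string is an IEnumerable<char>, so new HashSet<char>(allowedString) works) and reuse it for every record. Characters outside the Basic Multilingual Plane are stored as two chars, so each of them becomes two replacement characters.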

The Encoding class can achieve this, most likely very efficiently. When converting to and from the encoding, a replacement character can be specified.
using System;
using System.Text;
public class Program
{
public static void Main()
{
// For .NET core only:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var text = "abc絵de😂fgh";
text = Win1252Safe(text);
Console.WriteLine(text);
}
private static Encoding Win1252R = Encoding.GetEncoding(1252,
new EncoderReplacementFallback("_"),
new DecoderReplacementFallback("_"));
public static string Win1252Safe(string text) {
var bytes = Win1252R.GetBytes(text);
return Win1252R.GetString(bytes);
}
}
Output
abc_de__fgh
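For a file with millions of records, the same replacement fallback can be applied while writing, so each line is processed once and the file is never held in memory as a whole. This is a rough sketch, assuming the input file is UTF-8 and the output should be written in the target code page; FileSanitizer and SanitizeFile are hypothetical names.

```csharp
using System.IO;
using System.Text;

public static class FileSanitizer
{
    // Hypothetical helper: copies inputPath to outputPath line by line.
    // `target` is expected to carry an EncoderReplacementFallback, e.g.
    // Encoding.GetEncoding(1252, new EncoderReplacementFallback("_"),
    //                      new DecoderReplacementFallback("_")),
    // so characters it cannot encode are written as "_".
    public static void SanitizeFile(string inputPath, string outputPath, Encoding target)
    {
        using (var writer = new StreamWriter(outputPath, false, target))
        {
            // File.ReadLines reads lazily; UTF-8 input is an assumption.
            foreach (string line in File.ReadLines(inputPath, Encoding.UTF8))
            {
                writer.WriteLine(line);
            }
        }
    }
}
```

The output file ends up encoded in the target code page, which is probably what a VB6 consumer expects; if you need the result as a .NET string instead, the round trip shown above is the simpler choice.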

Related

How can I extract all non-alphanumeric characters from an input string using Regular Expressions?

Objective: To get all the non-alphanumeric characters even though they are not contiguous.
Setup: I have a textbox on an ASP.Net page that calls a C# code behind method on TextChanged. This event handler runs the textbox input against a Regex pattern.
Problem: I cannot create the proper Regex pattern that extracts all the non-alphanumeric characters.
This is the string input: string titleString = @"%2@#$%^&";
These are the C# Regex Patterns that I have tried:
string titlePattern = @"(\b[^A-Za-z0-9]+)"; results with @#$%^& (Note: if I use the input string %2#35%^&, the above pattern identifies the # sign and then the %^&, but never the leading % sign.)
string titlePattern = @"(\A[^A-Za-z0-9]+)"; results with %
string titlePattern = @"(\b\W[^A-Za-z0-9]+)"; results with @#$%^&
Side Note: I am also running this in a MS Visual Studio Console Application with a foreach loop in an effort to get all invalid characters into a collection and I also test the input and pattern using the web site: http://regexstorm.net/tester
Use the replace method with your selection string.
EDIT: After a closer reading I see that you wanted the opposite string. Here's both.
using System;
using System.Text.RegularExpressions;
namespace ConsoleApp
{
class Program
{
static void Main(string[] args)
{
string Source = @"H)*e/.?l\l{}*o ][W!~`##""or^-_=+ld!";
string Trash = @"[^a-zA-Z0-9]";
string InvertedTrash = @"[a-zA-Z0-9]";
Output(Source, Trash);
Console.WriteLine($"{System.Environment.NewLine}Opposite Day!{System.Environment.NewLine}");
Output(Source, InvertedTrash);
Console.ReadKey();
}
static string TakeOutTheTrash(string Source, string Trash)
{
return (new Regex(Trash)).Replace(Source, string.Empty);
}
static void Output(string Source, string Trash)
{
string Sanitized = TakeOutTheTrash(Source, Trash);
Console.WriteLine($"Started with: {Source}");
Console.WriteLine($"Ended with: {Sanitized}");
}
}
}

Can't decode UTF-8 umlaut in C# [duplicate]

How do I decode the string 'Sch\u00f6nen' (@"Sch\u00f6nen") in C#? I've tried HttpUtility, but it doesn't give me the result I need, which is "Schönen".
Regex.Unescape did the trick:
System.Text.RegularExpressions.Regex.Unescape(@"Sch\u00f6nen");
Note that you need to be careful when testing your variants or writing unit tests: "Sch\u00f6nen" is already "Schönen". You need @ in front of the string literal to treat \u00f6 as literal characters in the string.
If you landed on this question because you see "Sch\u00f6nen" (or similar \uXXXX values in a string constant), it is not an encoding. It is a way to represent Unicode characters as escape sequences, similar to how a string literal represents a new line as \n and a carriage return as \r.
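The difference between the two kinds of literal is easy to check. Here is a small sketch (EscapeDemo is a hypothetical name); keep in mind that Regex.Unescape also interprets other backslash escapes such as \n and \t, so it is only a good fit when the input follows the same escape conventions.

```csharp
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Regular literal: the compiler has already turned \u00f6 into 'ö'.
    public const string Compiled = "Sch\u00f6nen";

    // Verbatim literal: the backslash, the 'u' and the four hex digits
    // are stored as six separate characters.
    public const string Raw = @"Sch\u00f6nen";

    // Turns the escape sequences in `text` into the characters they name.
    public static string Decode(string text)
    {
        return Regex.Unescape(text);
    }
}
```

Here Compiled has 7 characters and Raw has 12; Decode(Raw) gives the same string as Compiled.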
I don't think you have to decode.
string unicodestring = "Sch\u00f6nen";
Console.WriteLine(unicodestring);
Schönen was outputted.
I wrote some code that converts \uXXXX escape sequences to the actual characters. (The best answer in this topic works fine and is less complex.)
string stringWithUnicodeSymbols = @"{""id"": 10440119, ""photo"": 10945418, ""first_name"": ""\u0415\u0432\u0433\u0435\u043d\u0438\u0439""}";
var splitted = Regex.Split(stringWithUnicodeSymbols, @"\\u([a-fA-F\d]{4})");
string outString = "";
foreach (var s in splitted)
{
try
{
if (s.Length == 4)
{
var decoded = ((char) Convert.ToUInt16(s, 16)).ToString();
outString += decoded;
}
else
{
outString += s;
}
}
catch (Exception e)
{
outString += s;
}
}

How to convert a regular string to an ASCII hexadecimal string in C#?

I was recently working on a project where I needed to convert a regular string of numbers into ASCII hexadecimal and store the hex in a string.
So I had something like
string random_string = "4000124273218347581"
and I wanted to convert it into a hexadecimal string in the form
string hex_string = "34303030313234323733323138333437353831"
This might seem like an oddly specific task but it's one I encountered and, when I tried to find out how to perform it, I couldn't find any answers online.
Anyway, I figured it out and created a class to make things tidier in my code.
In case anyone else needs to convert a regular string into a hexadecimal string I'll be posting an answer in a moment which will contain my solution.
(I'm fairly new to stackoverflow so I hope that doing this is okay)
=========================================
Turns out I can't answer my question myself within the first 8 hours of asking due to not having a high enough reputation.
So I'm sticking my answer here instead:
Okay, so here's my solution:
I created a class called StringToHex in the namespace
public class StringToHex
{
private string localstring;
private char[] char_array;
private StringBuilder outputstring = new StringBuilder();
private int value;
public StringToHex(string text)
{
localstring = text;
}
public string ToAscii()
{
/* Convert text into an array of characters */
char_array = localstring.ToCharArray();
foreach (char letter in char_array)
{
/* Get the integral value of the character */
value = Convert.ToInt32(letter);
/* Convert the decimal value to a hexadecimal value in string form */
string hex = String.Format("{0:X}", value);
/* Append hexadecimal version of the char to the string outputstring*/
outputstring.Append(Convert.ToString(hex));
}
return outputstring.ToString();
}
}
And to use it you need to do something of the form:
/* Convert string to hexadecimal */
StringToHex an_instance_of_stringtohex = new StringToHex(string_to_convert);
string converted_string = an_instance_of_stringtohex.ToAscii();
If it's working properly, the converted string should be twice the length of the original string (because each character is represented by two hex digits).
Now, as someone's already pointed out, you can find an article doing something similar here:
http://www.c-sharpcorner.com/UploadFile/Joshy_geo/HexConverter10282006021521AM/HexConverter.aspx
But I didn't find it much help for my specific task and I'd like to think that my solution is more elegant ;)
This works as long as the character codes in the string are not greater than 255 (0xFF); it needs using System.Linq:
string hex_string =
    String.Concat(random_string.Select(c => ((int)c).ToString("X2")));
Note: This also works for character codes below 16 (0x10), e.g. it will produce the hex codes "0D0A" from the line break characters "\r\n", not "DA".
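If the input may contain characters other than plain ASCII, converting through an explicit encoding makes the byte values unambiguous. This is a minimal sketch, assuming ASCII is the intended encoding (Encoding.ASCII encodes anything outside it as '?'); HexText, ToHex and FromHex are hypothetical names.

```csharp
using System;
using System.Text;

public static class HexText
{
    // Encodes the text as ASCII bytes and writes each byte as two
    // uppercase hex digits.
    public static string ToHex(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        var sb = new StringBuilder(bytes.Length * 2);
        foreach (byte b in bytes)
        {
            sb.Append(b.ToString("X2"));
        }
        return sb.ToString();
    }

    // Reverse direction: reads two hex digits per byte and decodes as ASCII.
    // Assumes `hex` has an even length and contains only hex digits.
    public static string FromHex(string hex)
    {
        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        }
        return Encoding.ASCII.GetString(bytes);
    }
}
```

With the string from the question, ToHex("4000124273218347581") gives "34303030313234323733323138333437353831", and FromHex turns it back into the original digits.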
You need to read the following article:
http://www.c-sharpcorner.com/UploadFile/Joshy_geo/HexConverter10282006021521AM/HexConverter.aspx
Its main function converts hex data back into ASCII text:
public string Data_Hex_Asc(ref string Data)
{
string Data1 = "";
string sData = "";
while (Data.Length > 0)
//first take two hex value using substring.
//then convert Hex value into ascii.
//then convert ascii value into character.
{
Data1 = System.Convert.ToChar(System.Convert.ToUInt32(Data.Substring(0, 2), 16)).ToString();
sData = sData + Data1;
Data = Data.Substring(2, Data.Length - 2);
}
return sData;
}
see if this what you are looking for.

Replace Unicode escape sequences in a string [duplicate]

We have one text file which has the following text
"\u5b89\u5fbd\u5b5f\u5143"
When we read the file content in C# .NET it shows like:
"\\u5b89\\u5fbd\\u5b5f\\u5143"
Our decoder method is
public string Decoder(string value)
{
Encoding enc = new UTF8Encoding();
byte[] bytes = enc.GetBytes(value);
return enc.GetString(bytes);
}
When I pass a hard coded value,
string Output=Decoder("\u5b89\u5fbd\u5b5f\u5143");
it works well, but when we use a variable value it is not working.
When we use the string this is what we get from the text file:
value=(text file content)
string Output=Decoder(value);
It returns the wrong output.
How can I fix this?
Use the below code. This unescapes any escaped characters from the input string
Regex.Unescape(value);
You could use a regular expression to parse the file:
// Needs using System.Globalization and System.Text.RegularExpressions.
private static Regex _regex = new Regex(@"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);
public string Decoder(string value)
{
return _regex.Replace(
value,
m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
);
}
And then:
string data = Decoder(File.ReadAllText("test.txt"));
So your file contains the verbatim string
\u5b89\u5fbd\u5b5f\u5143
in ASCII and not the string represented by those four Unicode codepoints in some given encoding?
As it happens, I just wrote some code in C# that can parse strings in this format for a JSON parser project -- here's a variant that only handles \uXXXX escapes:
private static string ReadSlashedString(TextReader reader) {
var sb = new StringBuilder(32);
bool q = false;
while (true) {
int chrR = reader.Read();
if (chrR == -1) break;
var chr = (char) chrR;
if (!q) {
if (chr == '\\') {
q = true;
continue;
}
sb.Append(chr);
}
else {
switch (chr) {
case 'u':
case 'U':
var hexb = new char[4];
reader.Read(hexb, 0, 4);
chr = (char) Convert.ToInt32(new string(hexb), 16);
sb.Append(chr);
break;
default:
throw new Exception("Invalid backslash escape (\\ + charcode " + (int) chr + ")");
}
q = false;
}
}
return sb.ToString();
}
And you could use it like:
var str = ReadSlashedString(new StringReader("\\u5b89\\u5fbd\\u5b5f\\u5143"));
(or using a StreamReader to read from a file).
Darin Dimitrov's regexp-utilizing answer is probably faster, but I happened to have this code at hand. :)
UTF8Encoding (or any other encoding) won't translate escape sequences like \u5b89 into the corresponding character.
The reason it works when you pass a string constant is that the C# compiler interprets the escape sequences and translates them into the corresponding characters before the decoder is ever called (in fact, before the program is even executed).
You have to write code that recognizes the escape sequences and converts them into the corresponding characters.
When you are reading "\u5b89\u5fbd\u5b5f\u5143" you get exactly what you read. The debugger escapes your strings before displaying them. The double backslashes in the string are actually single backslashes that have been escaped.
When you pass your hardcoded value, you are not actually passing in what you see on the screen. You are passing in four Unicode characters, since the C# string is unescaped by the compiler.
Darin already posted a way to unescape Unicode characters from the file, so I won't repeat it.
I think this will give you some idea.
string str = "ivandro\u0020";
str = str.Trim();
If you try to print the string, you will notice that the space, which is \u0020, is removed.

Is there a way of making strings file-path safe in c#?

My program will take arbitrary strings from the internet and use them for file names. Is there a simple way to remove the bad characters from these strings or do I need to write a custom function for this?
Ugh, I hate it when people try to guess at which characters are valid. Besides being completely non-portable (always thinking about Mono), both of the earlier comments missed more than 25 invalid characters.
foreach (var c in Path.GetInvalidFileNameChars())
{
fileName = fileName.Replace(c, '-');
}
Or in VB:
'Clean just a filename
Dim filename As String = "salmnas dlajhdla kjha;dmas'lkasn"
For Each c In IO.Path.GetInvalidFileNameChars
filename = filename.Replace(c, "")
Next
'See also IO.Path.GetInvalidPathChars
To strip invalid characters:
static readonly char[] invalidFileNameChars = Path.GetInvalidFileNameChars();
// Builds a string out of valid chars
var validFilename = new string(filename.Where(ch => !invalidFileNameChars.Contains(ch)).ToArray());
To replace invalid characters:
static readonly char[] invalidFileNameChars = Path.GetInvalidFileNameChars();
// Builds a string out of valid chars and an _ for invalid ones
var validFilename = new string(filename.Select(ch => invalidFileNameChars.Contains(ch) ? '_' : ch).ToArray());
To replace invalid characters (and avoid potential name conflict like Hell* vs Hell$):
static readonly IList<char> invalidFileNameChars = Path.GetInvalidFileNameChars();
// Builds a string out of valid chars and replaces invalid chars with a unique letter (Moves the Char into the letter range of unicode, starting at "A")
var validFilename = new string(filename.Select(ch => invalidFileNameChars.Contains(ch) ? Convert.ToChar(invalidFileNameChars.IndexOf(ch) + 65) : ch).ToArray());
This question has been asked many times before and, as pointed out many times before, IO.Path.GetInvalidFileNameChars is not adequate.
First, there are many names like PRN and CON that are reserved and not allowed for filenames. There are other names not allowed only at the root folder. Names that end in a period are also not allowed.
Second, there are a variety of length limitations. Read the full list for NTFS here.
Third, you can attach to filesystems that have other limitations. For example, ISO 9660 filenames cannot start with "-" but can contain it.
Fourth, what do you do if two processes "arbitrarily" pick the same name?
In general, using externally-generated names for file names is a bad idea. I suggest generating your own private file names and storing human-readable names internally.
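The reserved-name and trailing-period rules mentioned above can be checked separately from the invalid-character list. This is a rough sketch covering only the classic Windows device names (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9); FileNameRules and its methods are hypothetical names, and length limits or other file systems are not handled.

```csharp
using System;
using System.Collections.Generic;

public static class FileNameRules
{
    private static readonly HashSet<string> Reserved = BuildReserved();

    private static HashSet<string> BuildReserved()
    {
        // Case-insensitive, because "con" and "CON" are the same device.
        var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "CON", "PRN", "AUX", "NUL"
        };
        for (int i = 1; i <= 9; i++)
        {
            set.Add("COM" + i);
            set.Add("LPT" + i);
        }
        return set;
    }

    // True when the part before the first '.' is a reserved device name,
    // so "con.txt" is rejected as well as "CON".
    public static bool IsReservedDeviceName(string fileName)
    {
        string stem = fileName.Split('.')[0].TrimEnd(' ');
        return Reserved.Contains(stem);
    }

    // Names ending in a period (or a space) are also rejected.
    public static bool EndsWithPeriodOrSpace(string fileName)
    {
        return fileName.EndsWith(".") || fileName.EndsWith(" ");
    }
}
```

A check like this only complements the advice above: generating your own private file names sidesteps the whole problem.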
I agree with Grauenwolf and would highly recommend the Path.GetInvalidFileNameChars()
Here's my C# contribution:
string file = @"38?/.\}[+=n a882 a.a*/|n^%$ ad#(-))";
Array.ForEach(Path.GetInvalidFileNameChars(),
c => file = file.Replace(c.ToString(), String.Empty));
p.s. -- this is more cryptic than it should be -- I was trying to be concise.
Here's my version:
static string GetSafeFileName(string name, char replace = '_') {
char[] invalids = Path.GetInvalidFileNameChars();
return new string(name.Select(c => invalids.Contains(c) ? replace : c).ToArray());
}
I'm not sure how the result of GetInvalidFileNameChars is calculated, but the "Get" suggests it's non-trivial, so I cache the results. Further, this only traverses the input string once instead of multiple times, like the solutions above that iterate over the set of invalid chars, replacing them in the source string one at a time. Also, I like the Where-based solutions, but I prefer to replace invalid chars instead of removing them. Finally, my replacement is exactly one character to avoid converting characters to strings as I iterate over the string.
I say all that w/o doing the profiling -- this one just "felt" nice to me. : )
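The caching idea can be taken one step further by storing the invalid characters in a static HashSet<char>, so each lookup is a hash probe rather than a scan of the array. This is a sketch only and has not been profiled either; SafeFileNames and Sanitize are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class SafeFileNames
{
    // Built once and reused for every call.
    private static readonly HashSet<char> Invalid =
        new HashSet<char>(Path.GetInvalidFileNameChars());

    // Replaces each invalid file name character with `replacement`
    // in a single pass over the input.
    public static string Sanitize(string name, char replacement = '_')
    {
        var sb = new StringBuilder(name.Length);
        foreach (char c in name)
        {
            sb.Append(Invalid.Contains(c) ? replacement : c);
        }
        return sb.ToString();
    }
}
```

The set of invalid characters depends on the platform the code runs on, so the same input can give different results on Windows and on Unix-like systems.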
Here's the function that I am using now (thanks jcollum for the C# example):
public static string MakeSafeFilename(string filename, char replaceChar)
{
foreach (char c in System.IO.Path.GetInvalidFileNameChars())
{
filename = filename.Replace(c, replaceChar);
}
return filename;
}
I just put this in a "Helpers" class for convenience.
If you want to quickly strip out all special characters which is sometimes more user readable for file names this works nicely:
string myCrazyName = "q`w^e!r#t#y$u%i^o&p*a(s)d_f-g+h=j{k}l|z:x\"c<v>b?n[m]q\\w;e'r,t.y/u";
string safeName = Regex.Replace(
myCrazyName,
@"\W", /* Matches any non-word character; for ASCII input, equivalent to '[^A-Za-z0-9_]' */
"",
RegexOptions.IgnoreCase);
// safeName == "qwertyuiopasd_fghjklzxcvbnmqwertyu"
Here's what I just added to ClipFlair's (http://github.com/Zoomicon/ClipFlair) StringExtensions static class (Utils.Silverlight project), based on info gathered from the links to related stackoverflow questions posted by Dour High Arch above:
public static string ReplaceInvalidFileNameChars(this string s, string replacement = "")
{
return Regex.Replace(s,
"[" + Regex.Escape(new String(System.IO.Path.GetInvalidFileNameChars())) + "]", //GetInvalidPathChars() would miss characters such as ':' and '*'
replacement, //can even use a replacement string of any length
RegexOptions.IgnoreCase);
//not using System.IO.Path.InvalidPathChars (deprecated insecure API)
}
static class Utils
{
public static string MakeFileSystemSafe(this string s)
{
return new string(s.Where(IsFileSystemSafe).ToArray());
}
public static bool IsFileSystemSafe(char c)
{
return !Path.GetInvalidFileNameChars().Contains(c);
}
}
Why not convert the string to a Base64 equivalent? Standard Base64 output can contain '/', which is not valid in a file name, so swap it for '_' (which the Base64 alphabet never produces):
string UnsafeFileName = "salmnas dlajhdla kjha;dmas'lkasn";
string SafeFileName = Convert.ToBase64String(Encoding.UTF8.GetBytes(UnsafeFileName)).Replace('/', '_');
If you want to convert it back so you can read it:
UnsafeFileName = Encoding.UTF8.GetString(Convert.FromBase64String(SafeFileName.Replace('_', '/')));
I used this to save PNG files with a unique name from a random description.
private void textBoxFileName_KeyPress(object sender, KeyPressEventArgs e)
{
e.Handled = CheckFileNameSafeCharacters(e);
}
/// <summary>
/// This is a good function for making sure that a user who is naming a file uses proper characters
/// </summary>
/// <param name="e"></param>
/// <returns></returns>
internal static bool CheckFileNameSafeCharacters(System.Windows.Forms.KeyPressEventArgs e)
{
if (e.KeyChar == (char)24 ||
e.KeyChar == (char)3 ||
e.KeyChar == (char)22 ||
e.KeyChar == (char)26 ||
e.KeyChar == (char)25)//Control-X, C, V, Z and Y (compared as char; Equals(24) would box an int and never match)
return false;
if (e.KeyChar.Equals('\b'))//backspace
return false;
char[] charArray = Path.GetInvalidFileNameChars();
if (charArray.Contains(e.KeyChar))
return true;//Stop the character from being entered into the control since it is not valid in a file name
else
return false;
}
From my older projects, I've found this solution, which has been working perfectly for over 2 years. I replace illegal chars with "!" and then collapse any doubled "!!" into one; use your own replacement char.
public string GetSafeFilename(string filename)
{
string res = string.Join("!", filename.Split(Path.GetInvalidFileNameChars()));
while (res.IndexOf("!!") >= 0)
res = res.Replace("!!", "!");
return res;
}
I find using this to be quick and easy to understand:
<Extension()>
Public Function MakeSafeFileName(FileName As String) As String
Return FileName.Where(Function(x) Not IO.Path.GetInvalidFileNameChars.Contains(x)).ToArray
End Function
This works because a string is IEnumerable as a char array and there is a string constructor that takes a char array.
Many answers suggest using Path.GetInvalidFileNameChars(), which seems like a bad solution to me. I encourage you to use whitelisting instead of blacklisting, because hackers will always eventually find a way to bypass a blacklist.
Here is an example of code you could use :
string whitelist = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.";
foreach (char c in filename)
{
if (!whitelist.Contains(c))
{
filename = filename.Replace(c, '-');
}
}
