C# string.Replace for removing �? - c#

I am converting hex to a UTF-8 string using the line below.
var obInstruction = Encoding.UTF8.GetString(ob.Bits);
In the result I get � between every character, as shown in the picture below.
What is �?
So I added a Replace call and changed the line to
var obInstruction = Encoding.UTF8.GetString(ob.Bits).Replace("�", ""); but � won't go away.
When I tried to replace other characters, Replace worked fine, but not for �.
What is � and how can I remove it?
In Power Query, Text.Clean removes such strange characters, but I am not sure how to do this in C#.
Edit: decoding with UTF32 instead gives empty boxes.

This should do the job:
var input = new byte[5]; // placeholder for the bytes to decode (ob.Bits in the question)
var encoding = Encoding.GetEncoding(
    Encoding.ASCII.WebName, // GetEncoding expects the registered name ("us-ascii")
    new EncoderReplacementFallback(""),
    new DecoderReplacementFallback(""));
var converted = Encoding.Convert(Encoding.UTF8, encoding, input);
var output = encoding.GetString(converted);
This replaces every non-ASCII character with an empty string, which effectively removes it.
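As for what � is: it is U+FFFD, the Unicode REPLACEMENT CHARACTER, which .NET decoders substitute for byte sequences that are not valid in the chosen encoding. The sketch below assumes that this is what is happening with the question's data; the sample bytes and the helper name DecodeDroppingInvalid are made up for illustration. If � shows up between every character, the bytes may not be UTF-8 at all, and picking the right encoding would be a better fix than stripping.

```csharp
using System;
using System.Text;

public static class ReplacementCharDemo
{
    // Decodes UTF-8 bytes and drops any byte sequence that is not valid UTF-8,
    // instead of substituting U+FFFD for it.
    public static string DecodeDroppingInvalid(byte[] bytes)
    {
        var lenient = Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback(""),
            new DecoderReplacementFallback(""));
        return lenient.GetString(bytes);
    }

    public static void Main()
    {
        // Hypothetical input: the byte 0xFF can never appear in valid UTF-8.
        byte[] bytes = { 0x41, 0xFF, 0x42 };

        string decoded = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(decoded.Contains("\uFFFD"));    // True
        Console.WriteLine(decoded.Replace("\uFFFD", "")); // AB
        Console.WriteLine(DecodeDroppingInvalid(bytes));  // AB
    }
}
```

Writing the character as the escape "\uFFFD" avoids depending on how the source file itself is encoded, which is one possible reason a literal "�" in the code fails to match.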

Related

Convert string which contains ascii mixed with text [duplicate]

I have a string of Unicode escape sequences in a text file, and I want to display the real characters.
For example:
\u8ba1\u7b97\u673a\u2022\u7f51\u7edc\u2022\u6280\u672f\u7c7b
When I read this string from the text file using StreamReader.ReadLine(), the \ is escaped to '\\', giving "\\u8ba1", which is not what I want.
It displays the escaped text exactly as it appears in the file. What I want is to display the real characters.
How can I change "\\u8ba1" to "\u8ba1" in the resulting string?
Or should I use another reader to read the string?
If you have a string like
var input1 = "\u8ba1\u7b97\u673a\u2022\u7f51\u7edc\u2022\u6280\u672f\u7c7b";
// input1 == "计算机•网络•技术类"
you don't need to unescape anything. It's just the string literal that contains the escape sequences, not the string itself.
If you have a string like
var input2 = @"\u8ba1\u7b97\u673a\u2022\u7f51\u7edc\u2022\u6280\u672f\u7c7b";
you can unescape it using the following regex:
var result = Regex.Replace(
    input2,
    @"\\[Uu]([0-9A-Fa-f]{4})",
    m => char.ToString(
        (char)ushort.Parse(m.Groups[1].Value, NumberStyles.AllowHexSpecifier)));
// result == "计算机•网络•技术类"
This question came up as the first result when googling, but I thought there should be a simpler way. This is what I ended up using:
using System.Text.RegularExpressions;
//...
var str = "Ingl\\u00e9s";
var converted = Regex.Unescape(str);
Console.WriteLine($"{converted} {str != converted}"); // Inglés True

Decode a Javascript hex literal in C#

I have the following string:
string s = @"a=q\x26T=1";
I want to unescape this to:
"a=q&T=1"
How do I do this in C# other than just replacing the characters? There are various other escaped characters, so I'm not sure what encoding to use.
This works:
var decodedString = Regex.Unescape(@"source=s_q\x26hl=en");
but this works even better:
var regex = new Regex(@"\\x([a-fA-F0-9]{2})");
json = regex.Replace(json, match => char.ConvertFromUtf32(
    int.Parse(match.Groups[1].Value, System.Globalization.NumberStyles.HexNumber)));

How to detect SUB character and remove it from a text file in C#?

I am writing a program to process special text files. Some of these text files end with a SUB character (the substitute character, probably 0x1A). How do I detect this character and remove it from the text file using C#?
If it's really 0x1A in the binary data, and if you're reading it as an ASCII or UTF-8 file, it should end up as U+001A when read in .NET. So you may be able to write something like:
string text = File.ReadAllText("file.txt");
text = text.Replace("\u001a", "");
File.WriteAllText("file.txt", text);
Note that the "\u001a" part is a string consisting of a single character: \uxxxx is an escape sequence for a single UTF-16 code unit with the given value expressed in hex.
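Since the question says the files end with SUB, removing only the trailing occurrences may be enough and leaves the rest of the text untouched. A minimal sketch, assuming the character really is U+001A once decoded; the class and method names are made up for illustration:

```csharp
using System;

public static class SubTrimmer
{
    private const char Sub = '\u001a';

    // Removes SUB characters from the end of the text only.
    public static string TrimTrailingSub(string text) => text.TrimEnd(Sub);

    public static void Main()
    {
        string cleaned = TrimTrailingSub("last line\u001a");
        Console.WriteLine(cleaned);        // last line
        Console.WriteLine(cleaned.Length); // 9
    }
}
```

Unlike Replace, this keeps any SUB character that appears in the middle of the text, which may or may not be what you want.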
The easiest answer would probably be a Regex:
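public static class StringExtensions // extension methods must live in a static class
{
    public static string RemoveAll(this string input, char toRemove)
    {
        // produces a pattern like "\u001a+" which matches any run of one or more
        // of the character with that hex value; four digits are always emitted,
        // so low values such as 0x09 also yield a valid escape
        var pattern = @"\u" + ((int)toRemove).ToString("x4") + "+";
        return Regex.Replace(input, pattern, String.Empty);
    }
}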
//usage
var cleanString = dirtyString.RemoveAll((char)0x1a);
Yes, you could just pass in the int, but that requires knowing the integer value of the character. Using a char as the parameter lets you specify a literal or a char variable with less muck.
C# has a method to detect control characters (including SUB).
See msdn : https://msdn.microsoft.com/en-us/library/9s05w2k9(v=vs.110).aspx
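One method that fits this description is char.IsControl; whether it is the one the link above points to is an assumption. A minimal sketch that drops control characters while keeping ordinary line breaks and tabs (the helper name and the decision to keep those three characters are choices made for this illustration):

```csharp
using System;
using System.Linq;

public static class ControlCharFilter
{
    // Removes every control character (SUB included) except CR, LF and TAB.
    public static string RemoveControlChars(string text) =>
        new string(text
            .Where(c => !char.IsControl(c) || c == '\r' || c == '\n' || c == '\t')
            .ToArray());

    public static void Main()
    {
        Console.WriteLine(char.IsControl('\u001a'));         // True
        Console.WriteLine(RemoveControlChars("data\u001a")); // data
    }
}
```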
You could also try something like this; it should work:
using (FileStream f = File.OpenRead("path\\file")) // your file name + extension
using (StreamReader sr = new StreamReader(f))
{
    string text = sr.ReadToEnd();
    text = text.Replace("\u001a", string.Empty);
    // text now holds the content without SUB characters; write it back if needed
}

C# UTF7Encoding for first bracket ' { '

While reading bytes from a file containing UTF7 encoded characters, the opening brace '{' is supposed to be encoded as 123 (0x007B), but that is not happening. All other characters are encoded correctly, but not '{'. The code I am using is given below.
StreamReader _HistoryLocation = new StreamReader("abc.txt");
String _ftpInformation = _HistoryLocation.ReadLine();
UTF7Encoding utf7 = new UTF7Encoding();
Byte[] encodedBytes = utf7.GetBytes(_ftpInformation);
What might be the problem?
As per RFC 2152, which you reference, '{' and similar characters are only optionally encoded directly; an encoder may instead write them in the modified Base64 form.
Notice that UTF7Encoding has an overloaded constructor with an allowOptionals flag that makes it encode the RFC 2152 optional characters directly.
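A minimal sketch of the difference the flag makes, assuming a runtime where UTF7Encoding is still usable (it is marked obsolete on .NET 5 and later, hence the pragma); the wrapper name EncodeUtf7 is made up for illustration:

```csharp
using System;
using System.Text;

#pragma warning disable SYSLIB0001 // UTF7Encoding is marked obsolete on .NET 5 and later

public static class Utf7OptionalsDemo
{
    // Encodes text as UTF-7, with or without direct encoding of the optional characters.
    public static byte[] EncodeUtf7(string text, bool allowOptionals) =>
        new UTF7Encoding(allowOptionals).GetBytes(text);

    public static void Main()
    {
        // Default behaviour: '{' is written as a '+'-prefixed Base64 sequence.
        Console.WriteLine(Encoding.ASCII.GetString(EncodeUtf7("{", false)));

        // With allowOptionals: '{' is written directly as the single byte 123.
        Console.WriteLine(Encoding.ASCII.GetString(EncodeUtf7("{", true))); // {
    }
}
```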

string replace for special character

I have a string like the following: \0\0\0\0\0\0\0\0. I would like to remove the \ symbols from the string.
Could anybody tell me how I can replace or remove those backslashes from that string?
I have used string replace with the @ symbol, e.g. string.Replace(@"\",""), and have also used string.Trim('\0') and string.TrimEnd('\0').
Tell me how I can remove those special characters from the string.
Vinay
If you tried s.Replace(@"\", "") and it didn't yield the expected result, it means that there is no \ character in your actual string. The backslash is only what you see in the Visual Studio debugger. The actual string probably contains the null character (U+0000). To remove it you could:
string s = Encoding.UTF8.GetString(new byte[] { 0, 0, 0, 0 });
s = s.Trim('\0');
Notice that because of the strings being immutable in .NET you need to reassign the string to the result of the Trim method as it doesn't modify the original string.
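The distinction between the two cases can be sketched as follows. In a regular literal, "\0" is a single null character; in a verbatim literal, @"\0" is a backslash followed by the digit 0. The helper name RemoveNulls is made up for illustration:

```csharp
using System;

public static class NullCharDemo
{
    // Removes every U+0000 character, wherever it appears in the text.
    public static string RemoveNulls(string text) => text.Replace("\0", "");

    public static void Main()
    {
        string escaped = "\0\0";    // two U+0000 characters, no backslash at all
        string verbatim = @"\0\0";  // four characters: \ 0 \ 0

        Console.WriteLine(escaped.Length);                // 2
        Console.WriteLine(verbatim.Length);               // 4
        Console.WriteLine(verbatim.Replace(@"\", ""));    // 00
        Console.WriteLine(RemoveNulls("ab\0\0").Length);  // 2
    }
}
```

Trim('\0') only strips nulls at the start and end, so Replace is the safer choice when they can also occur in the middle.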
Maybe String.Replace("\\","")
Try this
var str = @"\0\0\0\0\0\0\0\0";
str = str.Replace(@"\", ""); // strings are immutable, so assign the result
This works for me without issues:
string s1 = @"\0\0\0\0\0\0\0\0";
string s2 = s1.Replace("\\", "");
Console.WriteLine(s2);
Output:
00000000
