.txt to String with Encoding - C#

The goal is simple. Grab some French text containing special characters from a .txt file and stick it into the variable "content". All is working well except that all instances of the character "à" are being interpreted as "À". Did I pick the wrong encoding (UTF7) or what?
Any help would be much appreciated.
// Using an encoding to ensure special characters work
Encoding enc = Encoding.UTF7;
// Make sure we get all the information about special chars
byte[] fileBytes = System.IO.File.ReadAllBytes(path);
// Convert the new byte[] into a char[] and then into a string.
char[] fileChars = new char[enc.GetCharCount(fileBytes, 0, fileBytes.Length)];
enc.GetChars(fileBytes, 0, fileBytes.Length, fileChars, 0);
string fileString = new string(fileChars);
// Insert the resulting encoded string "fileString" into "content"
content = fileString;

Your code is correct apart from the encoding; find the right one and plug it in. Nobody uses UTF-7, so that is almost certainly not it.
Maybe it is a non-Unicode one. Try Encoding.Default; empirically, that one often helps in Germany.
Also, just use File.ReadAllText with that encoding. It does everything your code above does.
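For example, a minimal sketch of both suggestions combined, assuming the file was actually saved in a Windows ANSI code page such as Windows-1252 (common for French text):
// One explicit encoding guess; swap in whatever matches the file.
Encoding enc = Encoding.GetEncoding(1252); // or Encoding.Default
string content = System.IO.File.ReadAllText(path, enc);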

Related

How to fix corrupt Japanese character encoding

I have the following string that I know is supposed to be displayed as Japanese text:
25“ú‚¨“¾‚ȃAƒ‹ƒeƒBƒƒbƒgƒRƒXƒZƒbƒg‹L”O
Is there any way to decode and re-encode the text so it displays properly? I already tried using Shift-JIS, but it did not produce a readable string.
string main = "25“ú‚¨“¾‚ȃAƒ‹ƒeƒBƒƒbƒgƒRƒXƒZƒbƒg‹L”O.zip";
byte[] mainBytes = System.Text.Encoding.GetEncoding("shift-jis").GetBytes(main);
string jpn = System.Text.Encoding.GetEncoding("shift-jis").GetString(mainBytes);
Thanks!
I think the original is Shift-JIS, but you didn't show how you tried. So here is my attempt to re-code it:
string s1 = "25“ú‚¨“¾‚ȃAƒ‹ƒeƒBƒƒbƒgƒRƒXƒZƒbƒg‹L”O";
byte[] bs = Encoding.GetEncoding(1252).GetBytes(s1);
string s2 = Encoding.GetEncoding(932).GetString(bs);
And s2 is now "25日お得なアルティャbトコスセット記念", that looks a lot more like Japanese.
What I assume is that a byte array representing Shift-JIS-encoded text was read using a different encoding, maybe Windows-1252. So first I get back the original byte array, and then I use the proper encoding to get the correct text.
A few notes about my code:
1252 is the numeric ID for Windows-1252, the encoding most often used by mistake. But this is just a guess; you can try other encodings and see if the result makes more sense.
932 is the numeric ID for Shift-JIS (you can also use the string name). This is also a guess, but a likely one.
Take into account that decoding with the wrong encoding is not generally a reversible procedure, so some characters may be lost in the translation.
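If this comes up often, the repair can be wrapped in a small helper. A sketch (the name FixMojibake is made up here; note that on .NET Core / .NET 5+ the code pages 1252 and 932 are not built in, so you would also need the System.Text.Encoding.CodePages package):
using System.Text;

static string FixMojibake(string garbled)
{
    // Undo the wrong decode (Windows-1252), then redo it as Shift-JIS.
    byte[] originalBytes = Encoding.GetEncoding(1252).GetBytes(garbled);
    return Encoding.GetEncoding(932).GetString(originalBytes);
}

// On .NET Core / .NET 5+, call this once at startup:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);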

What's going wrong with C#'s string formatter?

I'm getting the following behavior from C#'s string encoder:
(test case screenshot omitted; the code in the answers below reconstructs it)
poundFromBytes should be "£", but instead it's "?".
It's as if it's trying to encode the byte array using ASCII instead of UTF-8.
Is this a bug in Windows 7 / C#'s string encoder, or am I missing something?
My real issue here is that I get the same problem when I use File.ReadAllText on an ANSI text file, and I get a related issue in a third party library.
EDIT
I found my problem: I was running under the assumption that UTF-8 was backwards compatible with ANSI, but it's actually only backwards compatible with ASCII. Cheers anyway; at least next time I'll know to make sure my test case doesn't have unrelated problems of its own.
The single-byte representation of the pound sign is not valid UTF-8.
Use Encoding.GetBytes instead:
byte[] poundBytes = Encoding.GetEncoding("UTF-8").GetBytes(sPound);
The correct block of code should read something like:
var testChar = '£';
var bytes = Encoding.UTF32.GetBytes(new []{testChar});
string testConvert = Encoding.UTF32.GetString(bytes, 0, bytes.Length);
As others have said, you need to use a UTF encoding to get the bytes for a character. Incidentally, characters in .NET are UTF-16 code units by default (see: http://msdn.microsoft.com/en-us/library/x9h8tsay.aspx).
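To make that concrete, a small sketch (my example, not the original poster's) showing that a char holds a single UTF-16 code unit while the UTF-8 encoding of the same character can span several bytes:
char pound = '£';
Console.WriteLine((int)pound); // 163, the UTF-16 code unit U+00A3
byte[] utf8 = Encoding.UTF8.GetBytes(pound.ToString());
Console.WriteLine(BitConverter.ToString(utf8)); // C2-A3: two bytes in UTF-8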
If you want to use an Encoding's GetString() method, you should probably also use its corresponding GetBytes() method:
static void Main(string[] args)
{
    char cPound = '£';
    byte bPound = (byte)cPound; // not really valid: truncates to the low byte
    string sPound = "" + cPound;
    byte[] poundBytes = Encoding.UTF8.GetBytes(sPound);
    string poundFromBytes = Encoding.UTF8.GetString(poundBytes);
    Console.WriteLine(poundFromBytes);
    Console.ReadKey(true);
}
Check out the documentation here: http://msdn.microsoft.com/en-us/library/ds4kkd55(v=vs.110).aspx. As mentioned in the comments, you can't just cast your char to a byte. I'll edit with a more succinct answer, but I want to avoid just copy/pasting what MSDN has.
char[] pound = new char[] { '£' };
byte[] poundAsBytes = Encoding.UTF8.GetBytes(pound);
Also, why is everyone using GetEncoding with a hard-coded argument rather than accessing Encoding.UTF8 directly?
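In other words, these two lines produce the same bytes, and the second avoids the magic string:
byte[] a = Encoding.GetEncoding("UTF-8").GetBytes("£");
byte[] b = Encoding.UTF8.GetBytes("£"); // same bytes: C2-A3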

German character ß encoding in Livelink using C#

I have a folder name that contains German special characters such as äÄéöÖüß. The following screenshot shows the contents of the LiveLink server.
I want to extract the folder from the Livelink server using C#.
value is obtained from the LL server:
var bytes = new List<byte>(value.Length);
foreach (var c in value)
{
    bytes.Add((byte)c); // keeps only the low byte of each char
}
var result = Encoding.UTF8.GetString(bytes.ToArray());
Finally, the result is äÄéöÖü�x, where ß comes out as the box character '�x'. All the other characters in the folder name are decoded properly; only ß fails.
I am just wondering why the same code works for all the other German special characters but not for ß.
Could anybody help fix this problem in C#?
Thanks in advance.
Go to the admin panel of the server (Livelink/livelink.exe?func=admin.sysvars), set Character Set: UTF-8, and change the code section as follows:
byte[] bytes = Encoding.Default.GetBytes(value);
var retValue = Encoding.UTF8.GetString(bytes);
It works fine.
You guessed your encoding to be UTF-8, and it obviously is not. You will need to find out what encoding the byte stream really represents and use that instead. We cannot help you with that; you will have to ask the sender of said bytes.
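A plausible explanation for why only ß breaks (my own reasoning, not confirmed by the poster): ß is 0xC3 0x9F in UTF-8, and byte 0x9F decodes under Windows-1252 to 'Ÿ' (U+0178). The (byte)c cast keeps only the low byte, turning U+0178 into 0x78, the letter 'x', so the UTF-8 decoder then sees the invalid pair 0xC3 0x78 and emits '�' followed by 'x'. Letting the original code page map the characters back avoids the truncation:
// Sketch, assuming value was originally UTF-8 bytes mis-decoded as Windows-1252:
byte[] raw = Encoding.GetEncoding(1252).GetBytes(value); // maps 'Ÿ' back to 0x9F
var result = Encoding.UTF8.GetString(raw);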

How do I read and write smart quotes (and other silly characters) in C#

I'm writing a program that reads all the text in a file into a string, loops over that string looking at the characters, and then appends the characters back to another string using a StringBuilder. The issue I'm having is that when it's written back out, special characters such as “ and ” come out looking like � characters instead. I don't need to do a conversion; I just want it written back out the way I read it in:
StringBuilder sb = new StringBuilder();
string text = File.ReadAllText(filePath);
for (int i = 0; i < text.Length; ++i) {
    if (text[i] != '{') { // looking for opening curly brace
        sb.Append(text[i]);
        continue;
    }
    // Do stuff
}
File.WriteAllText(destinationFile, sb.ToString());
I tried using different encodings (UTF-8, UTF-16, ASCII), but then it came out even worse; I started getting question-mark symbols and Chinese characters (yes, a bit of a shotgun approach, but I was just experimenting).
I did read this article: http://www.joelonsoftware.com/articles/Unicode.html
...but it didn't really explain why I was seeing what I saw, unless the reader in C# starts cutting off bits when it hits weird characters like that. Thanks in advance for any help!
TL;DR: that is definitely not UTF-8, and you are not even using UTF-8 to read the resulting file. Read as Windows-1252 and write as Windows-1252 (if you are going to view the resulting file with the same tool).
Well, let's first just say that there is no way a file made by a regular user will be in UTF-8. Not all programs in Windows even support it (Excel, Notepad...), let alone have it as their default encoding (even most developer tools don't default to UTF-8, which drives me insane). Since a lot of developers don't understand that such a thing as an encoding even exists, what chance do regular users have of saving their files in a UTF-8-hostile environment?
This is where your problems start. According to the documentation, the overload you are using, File.ReadAllText(filePath), can only detect UTF-8 or UTF-32.
Indeed, simply reading a file normally encoded in Windows-1252 that contains "a”a" results in the string "a�a", where � is the Unicode replacement character (read the Wikipedia section on it; it describes exactly the situation you are in!) used to replace invalid bytes. When the replacement character is then encoded as UTF-8 and interpreted as Windows-1252, you will see ï»½, because the bytes for � in UTF-8 are 0xEF, 0xBF, 0xBD, which are the bytes for ï»½ in Windows-1252.
So read it as Windows-1252 and you're half-way there:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
String result = File.ReadAllText(@"C:\myfile.txt", windows1252);
Console.WriteLine(result); //Correctly prints "a”a" now
Because you saw ï»½, the tool you are viewing the newly made file with is also using Windows-1252. So if the goal is to have the file show the correct characters in that tool, you must encode the output as Windows-1252:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
File.WriteAllText(@"C:\myFile", sb.ToString(), windows1252);
Chances are the text will be UTF-8.
File.ReadAllText(filePath, Encoding.UTF8)
coupled with
File.WriteAllText(destinationFile, sb.ToString(), Encoding.UTF8)
should cover dealing with the Unicode characters. If you do one or the other you're going to get garbage output; it's both or nothing.
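If you are unsure which case you have, one common trick (a sketch of mine, from neither answer above) is a strict UTF-8 decode with a fallback to Windows-1252 when it throws:
static string ReadTextAuto(string path)
{
    byte[] bytes = File.ReadAllBytes(path);
    try
    {
        // Strict UTF-8: throws on invalid sequences instead of inserting U+FFFD.
        return new UTF8Encoding(false, throwOnInvalidBytes: true).GetString(bytes);
    }
    catch (DecoderFallbackException)
    {
        // Not valid UTF-8; fall back to the usual ANSI suspect.
        return Encoding.GetEncoding(1252).GetString(bytes);
    }
}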

How to convert Hebrew (Unicode) to ASCII in C#?

I have to create a text file in which there are numbers and Hebrew letters encoded as ASCII.
This is the file-creation method, which is triggered on a button click:
protected void ToFile(object sender, EventArgs e)
{
filename = Transactions.generateDateYMDHMS();
string path = string.Format("{0}{1}.001", Server.MapPath("~/transactions/"), filename);
StreamWriter sw = new StreamWriter(path, false, Encoding.ASCII);
sw.WriteLine("hello");
sw.WriteLine(Transactions.convertUTF8ASCII("שלום"));
sw.WriteLine("bye");
sw.Close();
}
As you can see, I use the Transactions.convertUTF8ASCII() static method to convert the (presumably Unicode) .NET string to an ASCII representation of it. I use it on the Hebrew word 'shalom' and get back '????' instead of the result I need.
Here is the method.
public static string convertUTF8ASCII(string initialString)
{
byte[] unicodeBytes = Encoding.Unicode.GetBytes(initialString);
byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, unicodeBytes);
return Encoding.ASCII.GetString(asciiBytes);
}
Instead of getting the initial word encoded to ASCII, I get '????' in the file I create; even when I run the debugger I get the same result.
What am I doing wrong?
You can't simply translate arbitrary Unicode characters to ASCII. The best the converter can do is discard the unsupportable characters, hence '????'. Obviously the basic 7-bit characters will work, but not much else. I'm curious as to what the expected result is.
If you need this for transfer (rather than representation) you might consider base-64 encoding of the underlying UTF8 bytes.
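For the transfer route, a minimal round-trip sketch (mine, not the original poster's code):
string encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes("שלום"));
string decoded = Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
// encoded contains only ASCII characters; decoded is "שלום" again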
Do you perhaps mean ANSI, not ASCII?
ASCII doesn't define any Hebrew characters. There are, however, some ANSI code pages which do, such as "windows-1255".
In which case, you may want to consider looking at:
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx
In short, where you have:
Encoding.ASCII
You would replace it with:
Encoding.GetEncoding(1255)
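Applied to the code above (assuming whatever consumes the .001 file expects the windows-1255 code page):
StreamWriter sw = new StreamWriter(path, false, Encoding.GetEncoding(1255));
sw.WriteLine("שלום"); // the writer itself encodes to windows-1255; no manual conversion needed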
Are you perhaps asking about transliteration (as in "Romanization") instead of encoding conversion, if you really are talking about ASCII?
I just faced the same issue when the original XML file was in ASCII encoding.
As Userx suggested, use Encoding.GetEncoding(1255):
XDocument.Parse(System.IO.File.ReadAllText(xmlPath, Encoding.GetEncoding(1255)));
Now my XDocument can read Hebrew even though the XML file was saved as ASCII.
