Keep umlauts in text file with ASCII encoding - c#

My code needs to save strings such as "Günther" with System.IO.File.WriteAllText(filePath, "Günther", Encoding.ASCII); but the file comes out as "G?nther". I did some research but still can't figure out how to solve this problem. It seems there is no way, because ASCII is only 7-bit, yet I need the text file in ASCII and with the umlaut "ü".
Is there any way to do this?

As you said: there is no umlaut in ASCII.
If it's not possible to change the file to UTF-8, the only workable approach I can think of is to replace the "ü" in the string with a transliteration, e.g. "ue".
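A minimal sketch of that replacement (the helper name, the file path, and the mapping table are my own, covering just the German umlauts and ß; extend it to whatever characters your data contains):

using System.IO;
using System.Text;

string filePath = @"c:\temp\out.txt"; // assumed path for the example

// "Günther" is then written out as "Guenther".
File.WriteAllText(filePath, ReplaceUmlauts("Günther"), Encoding.ASCII);

// Hypothetical helper: transliterate German umlauts before writing ASCII.
static string ReplaceUmlauts(string input)
{
    return input
        .Replace("ä", "ae").Replace("ö", "oe").Replace("ü", "ue")
        .Replace("Ä", "Ae").Replace("Ö", "Oe").Replace("Ü", "Ue")
        .Replace("ß", "ss");
}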

Related

How do I extract UTF-8 strings out of a JSON file using LitJSON, as JsonData does not seem to convert?

I've tried many methods to extract some strings out of a JSON file using LitJson in Unity.
I've tried encoding conversions all over, and getting byte arrays and passing them around, but nothing seems to work.
I went to the very start of where I create the JsonData object and tried to run the following test:
public JsonData CreateJSONDataObject()
{
    Debug.Assert(pathName != null, "No JSON Data path name set. Please set before commencing read.");
    string jsonString = File.ReadAllText(Application.dataPath + pathName, System.Text.Encoding.UTF8);
    JsonData jsonDataObject = JsonMapper.ToObject(jsonString);
    Debug.Log("Test compatibility: ë | " + jsonDataObject["Roots"][2]["name"]);
    return jsonDataObject;
}
I made sure my jsonString is read as UTF-8; however, the output shows this:
Test compatibility: ë | W�den
I've tried many other methods, but since this already ensures the encoding is right when creating the JsonData object, I can't see what I am doing wrong; I just don't know enough about JSON.
Thank you in advance.
This type of problem occurs when a text file is written with one encoding and read using a different one. I was able to reproduce your problem with the following program, which removes the JSON serialization from the equation entirely:
string file = @"c:\temp\test.txt";
string text = "Wöden";
File.WriteAllText(file, text, Encoding.Default);
string text2 = File.ReadAllText(file, Encoding.UTF8);
Debug.WriteLine(text2);
Since you are reading with UTF-8 and it is not working, the real question is, what encoding was used to write the file originally? You should be using the same encoding to read it back. I suspect that the file was originally created using either Windows-1252 or iso-8859-1 instead of UTF-8. Try using one of those when you read the file, e.g.:
string jsonString = File.ReadAllText(Application.dataPath + pathName,
    Encoding.GetEncoding("Windows-1252"));
You said in the comments that your JSON file was not created programmatically, but was "written by hand", meaning you used Notepad or some other text editor to make the file. If that is so, then that explains how you got into this situation. When you save the file, you should have the option to choose an encoding. For Notepad at least, the default encoding is "ANSI", which most likely maps to Windows-1252 (Western European), but depends on your locale. If you are in the Baltic region, for example, it would be Windows-1257 (Baltic). In any case, "ANSI" is not UTF-8. If you want to save the file in UTF-8 encoding, you have to specifically choose that option. Whatever option you use to save the file, that is the encoding you need to use to read it the next time, whether it is with a text editor or with code. Using the wrong encoding to read the file is what causes the corruption.
To change the encoding of a file, you first have to read it in using the same encoding that it was saved in originally, and then you can write it back out using a different encoding. You can do that with your text editor, simply by re-saving the file with a different encoding, or you can do that programmatically:
string text = File.ReadAllText(file, originalEncoding);
File.WriteAllText(file, text, newEncoding);
The key is knowing which encoding was used originally, and therein lies the rub. For legacy encodings (such as Windows-12xx) there is no way to tell because there is no marker in the file which identifies it. Unicode encodings (e.g. UTF-8, UTF-16), on the other hand, do write out a marker at the beginning of the file, called a BOM, or byte-order mark, which can be detected programmatically. That, coupled with the fact that Unicode encodings can represent all characters, is why they are much preferred over legacy encodings.
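As a rough illustration, BOM sniffing can be done by hand (a simplified sketch covering only UTF-8 and UTF-16; note that a file without a BOM can still be perfectly valid UTF-8):

using System.IO;
using System.Text;

// Returns the encoding indicated by the file's BOM, or null if none is found.
static Encoding DetectBomEncoding(string path)
{
    byte[] bom = new byte[3];
    using (FileStream fs = File.OpenRead(path))
    {
        fs.Read(bom, 0, 3);
    }
    if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF) return Encoding.UTF8;          // UTF-8
    if (bom[0] == 0xFF && bom[1] == 0xFE) return Encoding.Unicode;                         // UTF-16 LE
    if (bom[0] == 0xFE && bom[1] == 0xFF) return Encoding.BigEndianUnicode;                // UTF-16 BE
    return null; // no BOM: could be BOM-less UTF-8 or a legacy encoding
}

In practice you rarely need to write this yourself: StreamReader performs the same detection when constructed with detectEncodingFromByteOrderMarks set to true.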
For more information, I highly recommend reading What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

How to decode a utf string in c#

I have been trying to decode the following string:
CrÃ©dit
in c# using the following code:
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
string msg = iso.GetString(utf8.GetBytes(@"CrÃ©dit"));
which is yielding:
CrÃƒÂ©dit
I looked at http://jeppesn.dk/utf-8.html online, and this is correct UTF-8; it should yield:
Crédit
Can someone please point out where I am going wrong?
Thanks
It should be the other way around, and Windows-1252, not ISO-8859-1. Depending on context, people usually mean Windows-1252 when they say Latin-1 or ISO-8859-1, but actually using ISO-8859-1 will fail when there are characters like € because it was a mislabeling in the first place. Even browsers use Windows-1252 when ISO-8859-1 is specified as encoding.
Encoding w1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
string msg = utf8.GetString(w1252.GetBytes(@"CrÃ©dit"));
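Put together as a runnable snippet (the mojibaked literal stands in for whatever your real input is):

using System;
using System.Text;

class Program
{
    static void Main()
    {
        // UTF-8 bytes that were wrongly decoded as Windows-1252 at some point.
        string garbled = "CrÃ©dit";

        // On .NET Core/5+, code page 1252 needs the System.Text.Encoding.CodePages
        // package and Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
        Encoding w1252 = Encoding.GetEncoding(1252);

        // Reverse the bad decode to recover the raw bytes, then decode correctly.
        byte[] rawBytes = w1252.GetBytes(garbled);
        string fixedText = Encoding.UTF8.GetString(rawBytes);

        Console.WriteLine(fixedText); // Crédit
    }
}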
You're trying to do something that doesn't make sense, basically. You should almost never¹ be interpreting the output of one encoding as the input to another encoding. It's like saying, "Suppose I save this image as a gif... then load that file using a jpeg loader... what does it look like?"
I suspect that if you use:
// Just an example: don't actually do this.
string msg = utf8.GetString(iso.GetBytes(@"CrÃ©dit"));
... it will do what you want, but you shouldn't be doing this at all.
Now, what is your real input (in what form) and what are you trying to achieve?
¹ If you're doing so, it's usually because someone else has already done the wrong thing, or there's a configuration problem somewhere. If you find yourself doing this, you should think very carefully about whether you should really be doing it, or whether you're just working around a different problem which should be tackled differently.

Error reading from a file - Encoding issue

I'm reading a CSV file that was created in MS Excel. When I open it in Notepad it looks OK, but when I change the encoding from ANSI to UTF-8 in Notepad++, a few non-printing characters show up.
Specifically 0xFF (hex value).
In my C# app this character is causing an issue when reading the file, so is there a way I can do a String.Replace('\xFF', ' ') on it?
Update
I found this link on SO; as it turns out, it is the answer to my question but not my problem.
Link
Instead of String.Replace, specify the encoding while reading the file.
Example
File.ReadAllText("test.csv", System.Text.Encoding.UTF8)
I'd guess your character escape is wrong. Try this:
string foo = "foo\xff";
foo = foo.Replace('\xff', ' '); // strings are immutable, so reassign the result

Default C# String encoding

I am having some issues with the default string encoding in C#. I need to read strings from certain files/packets. However, these strings include characters from the 128–255 range ("extended ASCII"), and all of these characters show up as question marks instead of the proper characters. For example, a string could come out as "S?meStr?n?" if it contained extended ASCII characters.
Now, is there any way to change the default encoding for my application? I know that in Java you can define the default character set from the command line.
There's no one single "extended ASCII" encoding. There are lots of different 8-bit encodings which are compatible with ASCII for the bottom 128 values.
You need to find out what encoding your files actually use, and specify that when reading the data with StreamReader (or whatever else you're using). For example, you may want encoding Windows-1252:
Encoding encoding = Encoding.GetEncoding(1252);
.NET strings are always sequences of UTF-16 code points. You can't change that, and you shouldn't try. (That's true in Java as well, and you really shouldn't use the platform default encoding when calling getBytes() etc unless that's what you really, really mean.)
An Encoding can be specified in at least one overload of functions for reading text - for example, ReadAllText(string, Encoding).
So if you know a file is encoded using Windows-1252, then you can specify it like so:
string contents = File.ReadAllText(someFilePath, Encoding.GetEncoding(1252));
Of course, doing this requires knowing ahead of time which code page is being used.
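For stream-based reading the same idea applies; a sketch (adjust the code page and file name to whatever your files actually use):

using System.IO;
using System.Text;

// Read line by line with an explicit legacy encoding instead of the default.
Encoding windows1252 = Encoding.GetEncoding(1252);
using (StreamReader reader = new StreamReader("input.txt", windows1252))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Each line is now a correctly decoded UTF-16 .NET string.
    }
}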

Question about Encodings: How can I output from HtmlAgilityPack to a StringWriter and keep the encoding?

I am reading HTML in with HtmlAgilityPack, editing it, then outputting it to a StreamWriter. The HtmlAgilityPack encoding is Latin-1, and the StreamWriter's is UnicodeEncoding.
I am losing some characters in the conversion, and I do not want to be.
I don't seem to be able to change the encoding of a StreamWriter. What is the best way around this problem?
If the web page is really Latin-1 (ISO-8859-1), it can't have any curly quotes in it; Latin-1 has no mappings for those characters. If you can see curly quotes when you open the page in your browser, they could be in the form of HTML entities (&ldquo; and &rdquo;, or &#8220; and &#8221;). But I suspect the page's encoding is really windows-1252 despite what the headers and embedded declarations say.
windows-1252 is identical to Latin-1 except that it replaces the control characters in the \x80..\x9F range (decimal 128..159) with more useful (or at least prettier) printing characters. If HtmlAgilityPack is taking the page at its word and decoding it as ISO-8859-1, it will convert \x93 to the control character \u0093, which will look like garbage if you can get it to display at all. The browser, meanwhile, will convert it to \u201C, the Unicode code point for the Left Double Quotation Mark.
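You can see the difference directly (a small sketch; byte 0x93 is the curly quote in question):

using System;
using System.Text;

byte[] data = { 0x93 };
string latin1 = Encoding.GetEncoding("ISO-8859-1").GetString(data); // "\u0093", an invisible control character
string w1252 = Encoding.GetEncoding(1252).GetString(data);          // "\u201C", a left double quotation mark
Console.WriteLine((int)latin1[0]); // 147
Console.WriteLine(w1252);          // “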
I'm not familiar with HtmlAgilityPack and I can't find any docs for it, but I would try to force it to use windows-1252. For example, you could create a windows-1252 (or "ANSI") StreamReader and have HAP use that.
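Something along these lines, assuming the HtmlDocument version you have can load from a TextReader (check yours):

using System.IO;
using System.Text;
using HtmlAgilityPack;

// Decode the page as Windows-1252 ourselves, then hand the reader to HAP,
// so it never gets the chance to pick the wrong encoding.
HtmlDocument doc = new HtmlDocument();
using (StreamReader reader = new StreamReader("page.html", Encoding.GetEncoding(1252)))
{
    doc.Load(reader);
}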
At a guess: write to a Stream (not a string). If you write to a string (including StringWriter/StringBuilder), you are implicitly using .NET's UTF-16 string representation.
If you just want to tweak the reported encoding (but use a string), then look at Jon's answer here.
It is not clear which end you're losing characters at. In any case, a mere encoding mismatch isn't by itself an issue - you're still supposed to get the correct characters. If a Unicode StreamWriter writes out garbled characters, it means that it had received garbage on input in the first place. Which probably means that HtmlAgilityPack got encoding for your page wrong. If it has an option of setting the encoding manually, you might want to do just that.
It may also be that you have an HTML page which has a wrong encoding declaration in it. E.g. it might be a UTF-8 file which contains a <meta> element declaring it as Latin-1. Where do you get the text from? Do you download it straight from the Web, or do you have it in a text file - and if it's the latter, how do you create that file? If you did it manually via Notepad, or in code via StreamWriter, then you might have a UTF-8 file.
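If the input side checks out, it is worth making the output side explicit as well (a sketch; note that StreamWriter defaults to UTF-8 without a BOM, so naming the encoding removes any guesswork):

using System.IO;
using System.Text;

string html = "<p>\u201Ccurly quotes\u201D</p>"; // stand-in for the edited document text

// new UTF8Encoding(true) writes a BOM so later readers can detect the encoding.
using (StreamWriter writer = new StreamWriter("output.html", false, new UTF8Encoding(true)))
{
    writer.Write(html);
}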
