Error reading a from file - Encoding issue

I'm reading a CSV file that was created in MS Excel. When I open it in Notepad it looks OK, but when I change the encoding from ANSI to UTF-8 in Notepad++, a few non-printing characters turn up.
Specifically xFF (hex value).
In my C# app this character causes a problem when reading the file, so is there a way I can do a String.Replace('xFF', ' '); on it?
Update
I found this link on SO; as it turns out, it is the answer to my question but not to my problem.
Link

Instead of using String.Replace, specify the encoding when reading the file.
Example:
File.ReadAllText("test.csv", System.Text.Encoding.UTF8);
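To see why the choice of encoding matters here, the following is a minimal sketch, assuming the CSV was saved in a single-byte ANSI code page. It writes a small sample file and decodes it twice. The class name, the helper `HasUndecodableBytes` and the temp file name are made up for illustration, and ISO-8859-1 stands in for whatever code page Excel actually used.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper: true when decoding produced U+FFFD,
    // the replacement character .NET substitutes for undecodable bytes.
    public static bool HasUndecodableBytes(string path, Encoding encoding)
    {
        return File.ReadAllText(path, encoding).IndexOf('\uFFFD') >= 0;
    }

    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "encoding-demo.csv");

        // The bytes for "name;" followed by 0xFF. In ISO-8859-1 the byte 0xFF
        // is the letter U+00FF, but 0xFF never occurs in valid UTF-8.
        File.WriteAllBytes(path, new byte[] { 0x6E, 0x61, 0x6D, 0x65, 0x3B, 0xFF });

        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Console.WriteLine(HasUndecodableBytes(path, Encoding.UTF8)); // True
        Console.WriteLine(HasUndecodableBytes(path, latin1));        // False

        File.Delete(path);
    }
}
```

If the UTF-8 read reports undecodable bytes and the single-byte read does not, the file is probably not UTF-8, and passing the matching code page to File.ReadAllText is likely a better fix than replacing characters afterwards.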

I guess your Unicode escape is wrong. Try this:
string foo = "foo\xff";
foo = foo.Replace('\xff', ' '); // strings are immutable, so keep the returned value

Related

Keep umlauts in text file with ASCII encoding

My code has to save a string such as "Günther" with System.IO.File.WriteAllText(filePath, "Günther", Encoding.ASCII); but it comes out as G?nther. I did some research but can't figure out how to solve this problem. It seems there is no way, because ASCII is only 7-bit. But I need the text file in ASCII and with the umlaut "ü".
Is there any way to do this?

As you said, there is no umlaut in ASCII.
If it's not possible to change the file to UTF-8, the only way I can think of is to replace the "ü" in the string with, for example, "ue".
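The replacement idea could look roughly like this. It is only a sketch: the helper name `ToAsciiGerman` and the temp file name are invented for illustration, and the mapping covers just the German umlauts and ß.

```csharp
using System;
using System.IO;
using System.Text;

static class UmlautDemo
{
    // Hypothetical helper: replaces German umlauts and ß with ASCII digraphs.
    public static string ToAsciiGerman(string text)
    {
        return new StringBuilder(text)
            .Replace("ä", "ae").Replace("ö", "oe").Replace("ü", "ue")
            .Replace("Ä", "Ae").Replace("Ö", "Oe").Replace("Ü", "Ue")
            .Replace("ß", "ss")
            .ToString();
    }

    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "guenther-demo.txt");

        // After the replacement every character fits into 7-bit ASCII,
        // so Encoding.ASCII no longer has to substitute '?'.
        File.WriteAllText(path, ToAsciiGerman("Günther"), Encoding.ASCII);
        Console.WriteLine(File.ReadAllText(path, Encoding.ASCII)); // prints "Guenther"

        File.Delete(path);
    }
}
```

The conversion loses information: "ue" cannot be turned back into "ü" reliably, so this only helps when the consumer of the file accepts the transliterated spelling.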

Can not read turkish characters from text file to string array

I am trying to do some sentence processing in Turkish, and I am using a text file as a database. But I cannot read Turkish characters from the text file, so I cannot process the data correctly.
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt");
textBox1.Text = Tempdatabase[5];
Output:

It's probably an encoding issue. Try using one of the Turkish code page identifiers.
var Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("iso-8859-9"));

You can experiment with different Encoding values until the output looks right, but a setting that works for this file may not work for others.
By default, .NET strings are Unicode, and text files are read as UTF-8 unless a byte order mark says otherwise. So unless you really need something else, you should try this instead:
Open your text file in Notepad (or any other editor) and save it as a UTF-8 file. You should then get the expected results without any changes to your code, because File.ReadAllLines without an encoding argument decodes the file as UTF-8. This default behavior should be preferred.
This also applies to .html files inside Visual Studio, if you notice that they are displayed incorrectly (for example, parsed as ASCII).

The file contains text in a Turkish-specific character set, not Unicode. If you don't specify an encoding, .NET assumes UTF-8 when reading a text file. You have two possible solutions:
Either convert the text file to Unicode (for example UTF-8) using an external text editor.
Or specify the character set to read with, for example:
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.Default);
This will use the local character set of the Windows system (on .NET Framework; on .NET Core and later, Encoding.Default is UTF-8).
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("Windows-1254"));
This will use the Turkish character set defined by Microsoft.
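One caveat if you try this on .NET Core or .NET 5+: code pages such as 1254 are not available there until a provider is registered. The sketch below assumes such a runtime; on .NET Framework the registration line should not be needed. The class name, the helper `ReadTurkishLines` and the temp file are illustrative only.

```csharp
using System;
using System.IO;
using System.Text;

static class TurkishDemo
{
    // Hypothetical helper: reads all lines of a file as Windows-1254.
    public static string[] ReadTurkishLines(string path)
    {
        // Makes legacy code pages (including 1254) available on
        // .NET Core / .NET 5+. Registering more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return File.ReadAllLines(path, Encoding.GetEncoding(1254));
    }

    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "dialogs-demo.txt");

        // The Turkish letters ş, ğ, ı as Windows-1254 bytes: 0xFE, 0xF0, 0xFD.
        File.WriteAllBytes(path, new byte[] { 0xFE, 0xF0, 0xFD });

        string first = ReadTurkishLines(path)[0];
        // Compare against escapes so the check does not depend on console fonts.
        Console.WriteLine(first == "\u015F\u011F\u0131"); // True

        File.Delete(path);
    }
}
```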

C# Reading files and encoding issue

I've searched everywhere for this answer so hopefully it's not a duplicate. I decided I'm just finally going to ask it here.
I have a file named Program1.exe. When I drag that file into Notepad or Notepad++ I get all kinds of random symbols and then some readable text. However, when I try to read this file in C#, I either get inaccurate results, or just a big MZ. I've tried all supported encodings in C#. How can Notepad-style programs read a file like this when I simply can't? I tried converting the bytes to a string and it doesn't work. I tried reading it directly line by line and it doesn't work. I've even tried binary and it doesn't work.
Thanks for the help! :)

Reading a binary file as text is a peculiar thing to do, but it is possible. Any of the 8-bit encodings will do it just fine. For example, the code below opens and reads an executable and outputs it to the console.
const string fname = @"C:\mystuff\program.exe";
using (var sw = new StreamReader(fname, Encoding.GetEncoding("windows-1252")))
{
    var s = sw.ReadToEnd();
    s = s.Replace('\x0', ' '); // replace NUL bytes with spaces
    Console.WriteLine(s);
}
The result is very similar to what you'll see in Notepad or Notepad++. The "funny symbols" will differ based on how your console is configured, but you get the idea.
By the way, if you examine the string in the debugger, you're going to see something quite different. Those funny symbols are encoded as C# character escapes. For example, nul bytes (value 0) will display as \0 in the debugger, as NUL in Notepad++, and as spaces on the console or in Notepad. Newlines show up as \r in the debugger, etc.
As I said, reading a binary file as text is pretty peculiar. Unless you're just looking to see if there's human-readable data in the file, I can't imagine why you'd want to do this.
Update
I suspect the reason that all you see in the Windows Forms TextBox is "MZ" is that the Windows textbox control (which is what the TextBox ultimately uses) treats the NUL character as a string terminator, so it won't display anything after the first NUL. And the first thing after the "MZ" is a NUL (shown as \0 in the debugger). You'll have to replace the NUL characters in the string with spaces. I edited the code example above to show how you'd do that.

The exe is a binary file and if you try to read it as a text file you'll get the effect that you are describing. Try using something like a FileStream instead that does not care about the structure of the file but treats it just as a series of bytes.
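A rough sketch of the byte-oriented approach, assuming all you want is to inspect the start of the file. The class and helper names are made up, and the path is expected as a command-line argument.

```csharp
using System;
using System.IO;

static class BinaryDemo
{
    // Hypothetical helper: true when the data starts with the ASCII bytes 'M' 'Z'.
    public static bool StartsWithMz(byte[] data)
    {
        return data.Length >= 2 && data[0] == 0x4D && data[1] == 0x5A;
    }

    static void Main(string[] args)
    {
        // Assumption: the file to inspect is passed as the first argument.
        if (args.Length == 0)
        {
            Console.WriteLine("usage: BinaryDemo <file>");
            return;
        }

        var header = new byte[16];
        int read;
        using (var fs = new FileStream(args[0], FileMode.Open, FileAccess.Read))
        {
            // No decoding happens here: NUL bytes are just values in the array.
            read = fs.Read(header, 0, header.Length);
        }

        // Hex dump of the bytes that were actually read, e.g. "4D-5A-90-00-...".
        Console.WriteLine(BitConverter.ToString(header, 0, read));
        Console.WriteLine(StartsWithMz(header) ? "MZ header found" : "no MZ header");
    }
}
```

Because the data stays a byte array, nothing is lost or cut off at the first NUL, which is what happens once the content is pushed through a text control.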

System.IO.StreamWrite and Spanish Characters

I need to write the following string to a text file:
SEGS,AUS1,1,0,0,712205,584,8659094,2,NUÑEZ FELIX ARTURO,584
When I use:
using (System.IO.StreamWriter sw = new System.IO.StreamWriter(@fileSobrantes, true)) {
sw.WriteLine("SEGS,AUS1,1,0,0,712205,584,8659094,2,NUÑEZ FELIX ARTURO,584");
}
I get this in the file
SEGS,AUS1,1,0,0,712205,584,8659094,2,NUÃ‘EZ FELIX ARTURO,584
I tried the Encoding parameter with ASCII, Unicode and all the UTF variants, and it does not work.
new System.IO.StreamWriter(@fileSobrantes, true, Encoding.UTF32)

You can't easily (and accurately) represent what you get in the file without giving a hex dump. What are you trying to use to read the file? My guess is that if you try Encoding.Default that will work for you, but it's hard to say for sure without knowing what you're trying to use to read it.
The other alternative is that your source string is incorrect. If you've really got it as a string literal in your source code, are you sure you've got Visual Studio set up to interpret it correctly?
See my unicode debugging page for suggested techniques.
EDIT: By the way, why are you prefixing fileSobrantes with @? For identifiers you only need to do that if they're keywords. You may be getting confused with verbatim string literals - but this isn't a string literal, it's a variable name.
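To make the hex dump suggestion concrete, here is a small sketch that writes the problematic name with an explicit encoding and prints the bytes that end up on disk. The helper `WriteAndDump` and the temp file name are invented for this example.

```csharp
using System;
using System.IO;
using System.Text;

static class WriterDemo
{
    // Hypothetical helper: writes the text with the given encoding
    // and returns the raw bytes that ended up in the file.
    public static byte[] WriteAndDump(string path, string text, Encoding encoding)
    {
        using (var sw = new StreamWriter(path, false, encoding))
        {
            sw.Write(text);
        }
        return File.ReadAllBytes(path);
    }

    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "sobrantes-demo.txt");

        // UTF-8 without a byte order mark; "\u00D1" is the letter Ñ, written
        // as an escape so the source file's own encoding cannot interfere.
        byte[] bytes = WriteAndDump(path, "NU\u00D1EZ", new UTF8Encoding(false));
        Console.WriteLine(BitConverter.ToString(bytes)); // prints "4E-55-C3-91-45-5A"

        File.Delete(path);
    }
}
```

The two bytes C3 91 are the correct UTF-8 form of Ñ. A viewer that assumes Windows-1252 displays exactly those bytes as "Ã‘", so output like the one in the question may mean the file is fine and the program reading it is using the wrong encoding.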

How to get correctly-encoded HTML from the clipboard?

Has anyone noticed that if you retrieve HTML from the clipboard, it gets the encoding wrong and injects weird characters?
For example, executing a command like this:
string s = (string) Clipboard.GetData(DataFormats.Html);
Results in stuff like:
<FONT size=-2>Â Â <A href="/advanced_search?hl=en">Advanced
Search</A><BR>Â Â Preferences<BR>Â Â <A
href="/language_tools?hl=en">Language
Tools</A></FONT>
Not sure how MarkDown will process this, but there are weird characters in the resulting markup above.
It appears that the bug is with the .NET framework. What do you think is the best way to get correctly-encoded HTML from the clipboard?

In this case it is not as visible as it was in mine. Today I tried to copy data from the clipboard that contained a few Unicode characters. The data I got looked as if I had read a UTF-8 encoded file using Windows-1250 (the local encoding on my Windows machine).
It seems your case is the same. Save the HTML data in Windows-1252 (or Windows-1250; both work), remembering to put a non-breaking space (0xA0) after the Â character rather than a standard space. Then open this file as UTF-8 and you will see what should have been there.
For another project I wrote a function that fixes data with a corrupted encoding.
In this case a simple conversion should be sufficient:
byte[] data = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(data);
My original function is a little bit more complex and contains tests to ensure that data are not corrupted...
public static bool FixMisencodedUTF8(ref string text, Encoding encoding)
{
    if (string.IsNullOrEmpty(text))
        return false;
    byte[] data = encoding.GetBytes(text);
    // there should not be any character outside source encoding
    string newStr = encoding.GetString(data);
    if (!string.Equals(text, newStr)) // if there is any character "outside"
        return false; // leave, the input is in a different encoding
    if (IsValidUtf8(data) == 0) // test data to be valid UTF-8 byte sequence
        return false; // if not, can not convert to UTF-8
    text = Encoding.UTF8.GetString(data);
    return true;
}
I know that this is not the best (or a correct) solution, but I did not find any other way to fix the input.
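The IsValidUtf8 helper is not shown in the answer. One possible stand-in, sketched here under my own assumptions, is to let a strict UTF8Encoding do the validation: it throws when the bytes are not valid UTF-8. The name `TryRepair` is made up, and ISO-8859-1 is used instead of Encoding.Default so the example behaves the same on every runtime. It therefore cannot round-trip characters that exist only in Windows-1252, such as €.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Hypothetical helper: undoes "UTF-8 bytes decoded as Latin-1".
    // Returns false and leaves the text untouched when that is not what happened.
    public static bool TryRepair(string text, out string repaired)
    {
        repaired = text;
        if (string.IsNullOrEmpty(text))
            return false;

        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] data = latin1.GetBytes(text);
        if (latin1.GetString(data) != text)
            return false; // contains characters outside Latin-1

        try
        {
            // throwOnInvalidBytes: true makes invalid sequences raise an exception
            // instead of silently becoming U+FFFD.
            repaired = new UTF8Encoding(false, true).GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false; // the bytes are not valid UTF-8
        }
    }

    static void Main()
    {
        // "Â" plus a no-break space is what the UTF-8 bytes C2 A0 look like
        // after being decoded as Latin-1.
        bool ok = TryRepair("\u00C2\u00A0Preferences", out string fixedText);
        Console.WriteLine(ok);                               // True
        Console.WriteLine(fixedText == "\u00A0Preferences"); // True
    }
}
```

Note that pure ASCII input also counts as valid UTF-8 here, so the method returns true with the text unchanged in that case.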
EDIT (July 20, 2017):
It seems that Microsoft has already found this error and it now works correctly. I'm not sure whether the problem exists only in some framework versions, but I know for sure that the application now uses a different framework version than it did when I wrote the answer (now 4.5; previously 2.0).
(Now all my code fails when parsing the data. There is a further problem: determining the correct behaviour for the application both with the fix already applied and without it.)

You have to interpret the data as UTF-8. See MS Office hyperlinks change code page?.

The DataFormats.Html specification states that the data is encoded in UTF-8. But there's a bug in .NET Framework 4 and lower: it actually reads the UTF-8 data as Windows-1252.
You get a lot of wrong encodings, leading to funny/bad characters such as
'Å','â€¹','Å’','Å½','Å¡','Å“','Å¾','Å¸','Â','Â¡','Â¢','Â£','Â¤','Â¥','Â¦','Â§','Â¨','Â©'
Full explanation here
Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters
Solution: create a translation dictionary and search and replace.

I don't know what your original source document is, but be aware that Word and Outlook provide several versions of the clipboard in different encodings. One is usually Windows-1252 and another is UTF-8. Possibly you're grabbing the UTF-8 encoded version by default, when you're expecting the Windows-1252 (Latin-1 + Smart Quotes)? Non-ASCII characters would show up as multiple odd Latin-1 accented characters. Most "Smart Quotes" are not in the Latin-1 set and are often three bytes in UTF-8.
Can you specify which encoding you want the clipboard contents in?

Try this:
System.Windows.Forms.Clipboard.GetText(System.Windows.Forms.TextDataFormat.Html);
