I have a string that contains a czech character.
the string is "0bálka"
This string then gets used to create a file, and its part of the folder and filename.
but when I save the file, in windows explorer I get funny block characters in the name of the file.
Any idea how I can save this file with the valid czech characters preserved?
Two guesses what might be going on here:
The font used in Explorer doesn't cover that specific character. Unlikely, since á is hardly a special character.
The more likely variant would be that your source file encoding in C# is non-Unicode, gets interpreted as some random codepage (probably CP 1252 or 1251) and the resulting character you use is something entirely different than what you wrote in the source file. And then the font issue appears.
You can save your source file with a specific encoding in Visual Studio by clicking "Save as ..." in the File menu, then click the little arrow at the right of the save button and then select "Save with encoding". You should then pick a value such as "Unicode (UTF-8 with signature) - Codepage 65001" from the list.
Your windows should have czech language to be able to show the correct file name, this is not related to c#.
Don't worry, this is to do with how windows is displaying the filename, it may still have the correct name if you change your windows regional settings.
However you need to be careful what encoding you use when you save text to files.
You can learn some general background on what might be happening here:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Related
Well, when using IO.File.ReadAllText(path) or ReadAllText(path, System.Text.Encoding.UTF8) to read a text file which is saved in ANSI encoding, non-latin characters aren't displayed correctly.
So, I decided to use Encoding.Default. It worked just fine, but I see recommendations against using it everywhere (like here and here) because it "will only guarantee that all UTF-7 character sets will be read correctly". Also Microsoft
says:
Gets an encoding for the operating system's current ANSI code page.
However, it seems to me that it can recognize a file with any encoding. I tested that on a file that contains Chinese, Japanese, and Arabic characters -the file is saved in utf8 encoding-, and I was able to display the file correctly.
Code used:
Dim loadedText As String = IO.File.ReadAllText(path, System.Text.Encoding.Default)
MessageBox.Show(loadedText, "utf8")
Output:
So my question in points:
Is there something I'm missing here?
Why is it not recommended to use Encoding.Default when reading a file?
I know that a file with ANSI encoding would be displayed incorrectly if the default system encoding/system locale is changed, which is something I don't care about in my current case. But..
Is there even another way to prevent this from happening?
Side note: Please don't mind me using the c# tag. Although my code is in VB, any answer with C# code is welcomed.
File.ReadAllText actually tries to auto-detect the encoding. If the encoding cannot be determined from a BOM, then the encoding argument is used to decode the file.
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.
If you used Encoding.UTF8 to write the file, then it would include a BOM. Your Encoding.Default is likely being ignored.
Using Encoding.Default is not recommended because it is operating system's ANSI code page, which is limited to given code page's character set. In other words, text file created in Notepad (ANSI encoding) in Czech Windows will be displayed incorrectly in English Windows. For this reason, everything should be saved and opened in UTF-8 encoding.
Saved in ANSI and opened in Unicode may not work
Saved in Unicode and opened in ANSI will not work
Saved in ANSI and opened in another ANSI may not work
I receive a file from a supplier that I download per SFTP. Our systems are all working on Windows.
When I open the File in Notepad++ the status bar says "UNIX" and "UTF-8"
The special characters aren't displayed correctly.
I tried to convert the file to the different formats Notepad++ allows but no one converted the char 'OSC' to the german letter 'ä'. Is this a known Unix-Windows-thing? My google-foo obviously isn't good enough.
Which kind of conversion should I try to display the file correctly?
How can I achieve the same programmatically in C#?
It is common on windows that a file's encoding doesn't match what the editor or even its xml header say it is. People are sloppy. Maybe it's really UTF-16, or the unstandard windows extended ascii thing which I think is probably cp-1252. (It's not common on *nix since we all usually just use utf-8, no need for others... not saying *nix users are much less sloppy)
To figure out which encoding it is, I would make a copy of the file, then delete the bits that are not a problem (leaving Mägenwil as the entire file) and then save, and use the linux command "file" which will tell what the right encoding is (reliable only for small files... it doesn't read the whole file; maybe notepad++ will do the exact same thing). The reason for deleting the other bits is that it might be a mix of UTF-8 which the editor has used for detection, plus something else.
I would try the iconv command in linux to test. For example:
iconv -f UTF-16 -t UTF-8 -o outfile infile
And any encoding conversion should be possible in C# or any featureful language, as long as you know how it was mutilated so you can reverse it. And if you find that it is part utf-8 and part something else, then remember not to convert the whole file, but only the important parts.
I am attempting to create a UDL file programmatically in C#. In my program, I want to show the user the Data Link properties window but with my own default values for the connection string. I initially thought to do the following:
string[] lines = new string[]
{
"[oledb]",
"; Everything after this line is an OLE DB initstring",
"Provider=SQLOLEDB.1;Persist Security Info=False"
};
File.WriteAllLines("Test.udl", lines);
Process p = Process.Start("Test.udl");
p.WaitForExit();
However, I get this error when trying to open the file:
File cannot be opened. Ensure it is a valid Data Link file.
This is strange because I created an empty file, named it something.udl, opened it, clicked OK, and then opened the contents of the file which was:
[oledb]
; Everything after this line is an OLE DB initstring
Provider=SQLOLEDB.1;Persist Security Info=False
But there was a newline character at the end of the connection string. I used KDiff to compare the this file and the file I created in my program and it said the "Files are equal text but the they are not binary equal" or something to that effect.
I believe it has to do with how the File.WriteAllLines method writes the strings. So I attempted to use different encodings with the method but with no success. Any ideas on where I am going wrong?
I am using this MSDN link as a reference about UDL files. Its also interesting to note that if I open a new text file and past in all of the lines in my lines array, I arrive at the same error.
All you need to do is use the Unicode encoding:
File.WriteAllLines("Test.udl", lines, Encoding.Unicode);
When creating the file in a plain-text editor, use the UTF-16 Little Endian encoding and include a Byte Order Mark (since Microsoft started on the Intel platform they consider that the "default" when they talk about UTF-16).
When using a program, make sure to use that particular encoding as well, programming languages might still default to a legacy codepage or use UTF-8, in which case opening the UDL file will trigger the error shown in the question.
I have been trying to figure the difference for quite sometime now. The issue is with a file that is in ANSI encoding has japanese characters like: ‚È‚‚Æ‚à1‚‚ÌINCREMENTs‚ª•K—v‚Å‚·. It equivalent in shift-jis is 少なくとも1つのINCREMENT行が必要です. which is expected to be in japanese.
I need to display these characters after reading from file(in ANSI) on a webpage. There are some other files in UTF-8 displaying characters right not seeing this. I am finding it difficult to figure out whats the difference and how do I change encoding to do right things here..
I use c# for reading this file and displaying it, I also need to write the string back into file if its modified on web. Any encoding and decoding schemas here?
As far as code pages are concerned, "ANSI" (and Encoding.Default in .NET) basically just means "the non-Unicode codepage used by this system" - exactly what codepage that is, depends on how the system is configured, but on a Western European system, it's likely to be Windows-1252.
For the system where that text comes from, then "ANSI" would appear to mean Shift-JIS - so unless your system has the same code page, you'll need to tell your code to read the text as Shift-JIS.
Assuming you're reading the file with a StreamReader, there are various constructors that take an Encoding, so just grab a Shift-JIS encoding with Encoding.GetEncoding("shift_jis") or Encoding.GetEncoding(932) and use it to construct your StreamReader.
I have an application in C# using .net 3.5. Using this application I save an file and zip it using vjlib library and while opening the file I unzip it. However when I try to give file name as Japanese when saving it in English OS machine the file while opening it the application it not able to understand Japanese character. It due to some Windows Language pack etc.
This problem is most likely induced by the application that created the .zip file. File names are encoded in 8 bit characters inside the file. The ZIP specification says that the name should be encoded either in code page 437 or in utf-8. Code page 437 is the original IBM PC character set, an encoding that doesn't support any Japanese characters. It is not unusual for an app to just use its own 8-bit encoding, not untypically determined by the default system code page.
The library you use is the .NET runtime support library for JScript. Not sure it support specifying a different encoding, it is hard to find docs for it these days since it has been deprecated for so long. Consider, say, dotnetzip. Its ZipFile class has an AlternateEncoding property you can initialize from Encoding.GetEncoding(). You still need to find out what encoding was used, knowing where the file came from is important help to make the right guess. One common code page for Japanese is 932.