I recently added a .gitattributes file to a C# repository with the following settings:
* text=auto
*.cs text diff=csharp
I renormalized the repository following these instructions from GitHub, and it seemed to work OK.
The problem I have is that when I check out some files (not all of them), I see lots of weird characters mixed in with the actual code. It seems to happen when Git runs the files through the LF→CRLF conversion specified by the .gitattributes file above.
According to Notepad++ the files that get messed up are using UCS-2 Little Endian or UCS-2 Big Endian encoding. The files that seem to work OK are either ANSI or UTF-8 encoded.
For reference my git version is 1.8.0.msysgit.0 and my OS is Windows 8.
Any ideas how I can fix this? Would changing the encoding of the files be enough?
This happens when the files use an encoding where every character is two bytes, such as UCS-2/UTF-16.
In big-endian UTF-16, CRLF is encoded as the bytes \0\r\0\n (in little-endian, \r\0\n\0).
Git assumes a single-byte encoding, sees the lone \n byte, and turns that into \0\r\0\r\n.
The extra byte makes everything after it one byte off, causing every other line to be full of Chinese (because the \0 becomes the low-order byte rather than the high-order byte).
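A quick way to see those byte sequences (a sketch; in .NET, Encoding.Unicode is UTF-16 little-endian and Encoding.BigEndianUnicode is the big-endian variant):

using System;
using System.Text;

// Big-endian UTF-16: CRLF is 00-0D-00-0A, i.e. \0\r\0\n as described above.
Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("\r\n"))); // 00-0D-00-0A
// Little-endian UTF-16: the same CRLF is 0D-00-0A-00.
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes("\r\n")));          // 0D-00-0A-00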
You can convert the files to UTF-8 using this LINQPad script (adjust the extension list for your repository):
const string path = @"C:\...";

foreach (var file in Directory.EnumerateFiles(path, "*", SearchOption.AllDirectories))
{
    if (!new[] { ".html", ".js" }.Contains(Path.GetExtension(file)))
        continue;

    // ReadAllLines detects the UTF-16 BOM and decodes correctly;
    // WriteAllText then re-encodes the text as UTF-8 with a BOM.
    File.WriteAllText(file, String.Join("\r\n", File.ReadAllLines(file)),
                      new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));
    file.Dump(); // Dump() is LINQPad-specific
}
This will not fix files that are already broken; you can repair those by replacing \r\n with \n in a hex editor. I don't have a LINQPad script for that (there is no simple Replace() method for byte arrays).
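For what it's worth, a minimal sketch of that byte-level repair, assuming the only damage is the stray \r byte git inserted before each bare \n byte (in uncorrupted UTF-16 text the bytes 0x0D 0x0A essentially never sit next to each other, so collapsing each 0x0D 0x0A pair back to 0x0A undoes the conversion; the path is hypothetical, and you should try this on a copy first):

using System;
using System.Collections.Generic;
using System.IO;

const string path = @"C:\repo\Broken.cs"; // hypothetical path to a corrupted file
byte[] input = File.ReadAllBytes(path);
var output = new List<byte>(input.Length);

for (int i = 0; i < input.Length; i++)
{
    // Skip the 0x0D that git inserted in front of a bare 0x0A byte.
    if (input[i] == 0x0D && i + 1 < input.Length && input[i + 1] == 0x0A)
        continue;
    output.Add(input[i]);
}

File.WriteAllBytes(path, output.ToArray());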
To fix this, either convert the encoding of the files (UTF-8 should be fine) or disable the automatic line-ending conversion (set git config core.autocrlf false and remove the text settings from the .gitattributes you have).
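If some files genuinely need to stay UTF-16, another option (a sketch; the pattern is hypothetical, match it to your own files) is to exempt just those files from conversion in .gitattributes:
*.generated.cs -text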
I've got a file that looks OK in Notepad (and Notepad++) but when I try to read it with a C# program, the dash shows up as a replacement character (�) instead. After some trial and error, I can reproduce the error as follows:
File.WriteAllBytes("C:\\Misc\\CharTest\\wtf.txt", new byte[] { 150 });
var readFile = File.ReadAllText("C:\\Misc\\CharTest\\wtf.txt");
Console.WriteLine(readFile);
Now, if you go and look in the wtf.txt file using Notepad, you'll see a dash... but I don't get it. I know that's not a "real" Unicode value, so that's probably the root of the issue, but I don't get why it looks fine in Notepad and not when I read the file in. And how do I get my program to read it as a dash?
As an aside, a VB6 program I'm trying to rewrite in C# also reads it as a dash.
The File.ReadAllText(string) overload defaults to UTF8 encoding, in which a standalone byte with value 150 is invalid.
Specify the actual encoding of the file, for example:
var encoding = Encoding.GetEncoding(1252);
string content = File.ReadAllText(fileName, encoding);
I used the Windows-1252 encoding, which has a dash at codepoint 150.
Edit: Notepad displays the file correctly because, for non-Unicode files, the Windows-1252 codepage is the default for western regional settings. So you can likely also use Encoding.Default to get the correct result, but keep in mind that Encoding.Default can return different code pages under different regional settings.
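Putting the question's repro together with the fix (a small demonstration; byte 150 is the en dash in Windows-1252):

using System;
using System.IO;
using System.Text;

const string path = @"C:\Misc\CharTest\wtf.txt";
File.WriteAllBytes(path, new byte[] { 150 });

Console.WriteLine(File.ReadAllText(path));                             // prints "�" (byte 150 is invalid in UTF-8)
Console.WriteLine(File.ReadAllText(path, Encoding.GetEncoding(1252))); // prints "–" (the dash, U+2013)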
You are writing raw bytes to a text file, and then you are reading those same bytes back and interpreting them as characters.
When you write bytes, you don't have to care about an encoding, but you do have to pick one in order to read those very same bytes back as characters.
Notepad++ seems to interpret the byte using the ANSI code page and therefore prints the dash.
File.ReadAllText reads the bytes in the specified encoding; you did not specify one, so it defaults to UTF-8, where a lone byte 150 is not a valid entry.
So I read the Spolsky article twice, this question too, and tried a lot. Now I'm here.
I created a tarball of a directory structure on a Linux machine with locale ISO-8859-1 and untarred it on Windows with 7-Zip. As a result, the filenames are scrambled when I view them in Windows Explorer (and in my C# program, too): where I expect to see a German umlaut ü, there's a ³ instead. No wonder, because the filenames were written to the tar file using the ISO-8859-1 codepage, and Windows obviously does not know about this.
I want to fix this by renaming the files to their correct names. So I think I have to tell the program: "read the filename, think of it as ISO-8859-1, and return every character as a UTF-16 character."
My code to find the correct filename:
void Main()
{
    string[] files = Directory.GetFiles(@"C:\test", @"*", SearchOption.AllDirectories);
    var e1 = Encoding.GetEncoding("ISO-8859-1");
    var e2 = Encoding.GetEncoding("UTF-16");
    foreach (var f in files)
    {
        Console.WriteLine($"Source: {f}");
        var source = e1.GetBytes(f);
        var dest = Encoding.Convert(e1, e2, source);
        Console.WriteLine($"Result: {e2.GetString(dest)}");
    }
}
Result (nothing happened):
Source: C:\test\Brief-mrl³.odt
Result: C:\test\Brief-mrl³.odt
expected Result:
Source: C:\test\Brief-mrl³.odt
Result: C:\test\Brief-mrlü.odt
When I exchange e1 and e2 I get weird results. My brain hurts. What am I not getting?
Edit: I know that the mistake was made earlier, but now I have wrong filenames on the Windows machine that I need to correct. However, it might not be solvable via the Encoding class. I found this blog post, and the author states:
It turns out, this isn't a problem with the encoding at all, but the same character address meaning different things to different character sets.
In conclusion, he wrote a method to replace the characters between 130 and 173 with specific, different characters. This does not look straightforward to me, but is it possible that this is the only way? Can anyone comment on this, please?
After some more reading I got the solution myself. This excellent article helped. The point is: Once a wrong encoding was used, you can only guess (or have to know) what went wrong exactly. If you know, you can revert the whole thing in code.
void Main()
{
    // We get the source string e.g. by reading file names from a directory. We see a "³"
    // where we expect a German umlaut "ü". The cause can be a poorly configured smb share
    // on a Linux server or other problems.
    string source = "M³nch";

    // We are in a .NET program, so the source string (here in the
    // program) is Unicode in UTF-16 encoding, i.e. the codepoints
    // M, ³, n, c and h are encoded in UTF-16.
    byte[] bytesFromSource = Encoding.Unicode.GetBytes(source);

    // The source encoding is UTF-16, hence we get two bytes per character.
    // We accidentally worked with the OEM850 codepage, so we now have to look up the bytes
    // of the codepoints on the OEM850 codepage: we convert bytesFromSource to the wrong codepage.
    byte[] bytesInWrongCodepage = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(850), bytesFromSource);

    // Here's the trick: although we converted to OEM850, we now assume the bytes are
    // codepage ISO-8859-1 and convert them from ISO-8859-1 to Unicode.
    byte[] bytesFromCorrectCodepage = Encoding.Convert(Encoding.GetEncoding("ISO-8859-1"), Encoding.Unicode, bytesInWrongCodepage);

    // And finally we get the right characters.
    string result = Encoding.Unicode.GetString(bytesFromCorrectCodepage);
    Console.WriteLine(result); // Münch
}
CAVEAT: Do not run this method over its results. This is likely to produce non-printable characters or other mayhem.
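To actually rename the files on disk, a minimal sketch that applies the same round trip to each file name (the C:\test path is the question's; per the caveat, run it only once, and try it on a copy first):

using System;
using System.IO;
using System.Text;

var oem850 = Encoding.GetEncoding(850);
var latin1 = Encoding.GetEncoding("ISO-8859-1");

foreach (var path in Directory.GetFiles(@"C:\test", "*", SearchOption.AllDirectories))
{
    string name = Path.GetFileName(path);
    // Re-encode the scrambled name as OEM850 bytes, then read those bytes as ISO-8859-1.
    string fixedName = latin1.GetString(oem850.GetBytes(name));
    if (fixedName != name)
        File.Move(path, Path.Combine(Path.GetDirectoryName(path), fixedName));
}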
I am trying to extract a zip with multiple files. Some files have the "§" character in their names ("abc(§7)abc.txt").
When unpacking with
System.IO.Compression.ZipFile.ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName);
however, the '§' character is translated into 'õ' (Latin small letter o with tilde).
I have already tried to change the encoding, but there seems to be only ASCII or UTF-8 (the default):
System.IO.Compression.ZipFile.ExtractToDirectory(sourceArchiveFileName, destinationDirectoryName, Encoding entryNameEncoding);
Is someone able to show me the mistake?
Windows doesn't behave nicely with Unicode file names inside zips.
Using encoding 850 as the entryNameEncoding solves the problem:
System.IO.Compression.ZipFile.ExtractToDirectory(sourceArchiveFileName, destinationDirectoryName, Encoding.GetEncoding(850));
It looks like this got fixed in .NET Framework 4.8, but I can't test that right now.
Sources:
https://devblogs.microsoft.com/oldnewthing/20180515-00/?p=98755
http://archives.miloush.net/michkap/archive/2012/01/04/10252916.html
I am trying to create a flat file for a legacy system, and they mandate that the data be presented in the text encoding of an MS-DOS .txt file (Text Document - MS-DOS Format, CP_OEM). I am a bit confused about files generated using the UTF8Encoding class in C# (.NET 4.0 framework); I think that produces a file in the default txt encoding (Encoding: CP_ACP).
I think the encoding names CP_ACP, Windows and ANSI refer to the same thing; the Windows default is ANSI, and it will omit any Unicode character information.
If I use UTF8Encoding class in C# library to create a text file(as below), is it going to be in the MS DOS txt file format?
byte[] title = new UTF8Encoding(true).GetBytes("New Text File");
As per the answer supplied, it is evident that UTF-8 is NOT equivalent to the MS-DOS txt format, and one should use the Encoding.GetEncoding(850) method to get the right encoding.
I read the following posts to check on my information but nothing conclusive yet.
https://blogs.msdn.microsoft.com/oldnewthing/20120220-00?p=8273
https://blog.mh-nexus.de/2015/01/character-encoding-confusion
https://blogs.msdn.microsoft.com/oldnewthing/20090115-00?p=19483
Finally, the conclusion is to go with Encoding.GetEncoding(850) when creating the byte array to be converted back to the actual file (note: I am using a byte array so I can leverage existing middleware).
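So the question's one-liner, adjusted to that conclusion (a sketch; the title text is the question's own):

using System.Text;

byte[] title = Encoding.GetEncoding(850).GetBytes("New Text File");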
You can use the File.ReadXY(String, Encoding) and File.WriteXY(String, String[], Encoding) methods, where XY is either AllLines, Lines (reading only) or AllText, working with string[], IEnumerable<string> and string respectively.
MS-DOS uses different code pages. Probably code page 850 "Western European / Latin-1" or code page 437 "OEM-US / OEM / PC-8 / DOS Latin US" (as @HansPassant suggests) will be okay. If you are not sure which code page you need, create example files containing letters like ä, ö, ü, é, è, ê, ç, à or Greek letters with the legacy system and see whether they work. If you don't use such letters or other special characters, then the code page is not very critical.
File.WriteAllText(path, "Hello World", Encoding.GetEncoding(850));
The character codes from 0 to 127 (7-bit) are the same for all MS-DOS code pages, for ANSI and UTF-8. UTF files are sometimes introduced with a BOM (byte order mark).
MS-DOS knows only 8-bit characters. The codes 128 to 255 differ for the different national code pages.
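A quick check of how the same letters map to bytes in different code pages (a sketch; 'ä' differs between them, plain ASCII 'A' does not):

using System;
using System.Text;

// 'ä' is 0x84 (132) in code page 850 but 0xE4 (228) in code page 1252 (Windows ANSI).
Console.WriteLine(Encoding.GetEncoding(850).GetBytes("ä")[0]);  // 132
Console.WriteLine(Encoding.GetEncoding(1252).GetBytes("ä")[0]); // 228
// 'A' is in the 7-bit ASCII range, so it is 0x41 (65) everywhere.
Console.WriteLine(Encoding.GetEncoding(850).GetBytes("A")[0]);  // 65
Console.WriteLine(Encoding.GetEncoding(1252).GetBytes("A")[0]); // 65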
See: File Class, Encoding Class and Wikipedia: Code Page.
I've searched everywhere for this answer so hopefully it's not a duplicate. I decided I'm just finally going to ask it here.
I have a file named Program1.exe. When I drag that file into Notepad or Notepad++, I get all kinds of random symbols and then some readable text. However, when I try to read this file in C#, I either get inaccurate results or just a big MZ. I've tried all supported encodings in C#. How can notepad programs read a file like this while I simply can't? I try to convert bytes to a string and it doesn't work. I try to read it directly line by line and it doesn't work. I've even tried binary and it doesn't work.
Thanks for the help! :)
Reading a binary file as text is a peculiar thing to do, but it is possible. Any of the 8-bit encodings will do it just fine. For example, the code below opens and reads an executable and outputs it to the console.
const string fname = @"C:\mystuff\program.exe";
using (var sw = new StreamReader(fname, Encoding.GetEncoding("windows-1252")))
{
    var s = sw.ReadToEnd();
    s = s.Replace('\x0', ' '); // replace NUL bytes with spaces
    Console.WriteLine(s);
}
The result is very similar to what you'll see in Notepad or Notepad++. The "funny symbols" will differ based on how your console is configured, but you get the idea.
By the way, if you examine the string in the debugger, you're going to see something quite different. Those funny symbols are encoded as C# character escapes. For example, nul bytes (value 0) will display as \0 in the debugger, as NUL in Notepad++, and as spaces on the console or in Notepad. Newlines show up as \r in the debugger, etc.
As I said, reading a binary file as text is pretty peculiar. Unless you're just looking to see if there's human-readable data in the file, I can't imagine why you'd want to do this.
Update
I suspect the reason that all you see in the Windows Forms TextBox is "MZ" is that the Windows textbox control (which is what the TextBox ultimately uses) treats the NUL character as a string terminator, so it won't display anything after the first NUL. And the first thing after the "MZ" is a NUL (it shows as "\0" in the debugger). You'll have to replace the 0's in the string with spaces. I edited the code example above to show how you'd do that.
The exe is a binary file, and if you try to read it as a text file you'll get the effect that you are describing. Try using something like a FileStream instead, which does not care about the structure of the file but treats it just as a series of bytes.
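A minimal sketch of that approach (the path is assumed; it reads the first few bytes and checks for the DOS "MZ" header mentioned in the question):

using System;
using System.IO;

const string fname = @"C:\mystuff\program.exe"; // hypothetical path
using (var fs = new FileStream(fname, FileMode.Open, FileAccess.Read))
{
    var buffer = new byte[16];
    int read = fs.Read(buffer, 0, buffer.Length);

    // Every DOS/Windows executable starts with the bytes 'M' (0x4D) and 'Z' (0x5A).
    Console.WriteLine(read >= 2 && buffer[0] == (byte)'M' && buffer[1] == (byte)'Z'
        ? "Looks like an executable"
        : "No MZ header");

    Console.WriteLine(BitConverter.ToString(buffer, 0, read)); // e.g. 4D-5A-90-00-...
}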