I have a weird situation that I don't understand relating to newlines '\n' that I am sending to a file. The newlines do not seem to be treated according to the NewLine properties of TextWriter and Environment. This code demonstrates:
String baseDir = Environment.GetEnvironmentVariable("USERPROFILE") + '\\';
String fileName = baseDir + "deleteme5.txt";
FileInfo fi = new FileInfo(fileName);
fi.Delete();
FileStream fw = new FileStream(fileName, FileMode.CreateNew, FileAccess.Write);
StreamWriter sw = new StreamWriter(fw);
Console.WriteLine(Environment.NewLine.Length);
Console.WriteLine(sw.NewLine.Length);
sw.Write("1\uf0f1\n2\ue0e1\n3\ud0d1\n");
sw.Flush();
sw.Close();
When I run this the console output is
2
2
When I look at my file in hex mode I get:
00000000h: 31 EF 83 B1 0A 32 EE 83 A1 0A 33 ED 83 91 0A ;
1.2.3탑.
Clearly, the API says two characters and when you look in the file there is only one character. Now when I look to the description of the Write method in TextWriter it indicates that the Write method does not substitute 0A with the NewLine property. Well, if the Write method doesn't take that into account, what is the use of having not one but two NewLine properties? What are these things for?
Programmers have a very long history of not agreeing how text should be encoded when it is written to a file. ASCII and Unicode helped level the Tower of Babel to some degree. But what characters denote the ending of a line was never agreed upon.
Windows uses the carriage return + line feed control codes. "\r\n" in C# code.
Unix flavors use just a single line feed control code, '\n' in C# code.
Apple historically used just a single carriage return control code, '\r' in C# code.
.NET needed to be compatible with all these incompatible choices, so it added the Environment.NewLine property, which holds the default line ending sequence for your operating system. Note that you can run .NET code on Unix and Apple machines with Mono or Silverlight.
The abstract TextWriter class needs to know what sequence to use since it writes text files. So it has a NewLine property, whose default is the same as Environment.NewLine. You almost always use that default, but you might want to change it if you need to create a text file that's read by a program on another operating system.
The mistake you made in your program is that you hard-coded the line terminator. You used '\n' in your string. This completely bypasses the .NET properties, you'll only ever see the single line feed control code in the text file. 0x0A is a line feed. Your console output displays "2" since that just displays the string length of the NewLine property. Which is 2 on Windows for "\r\n".
The simplest way to use the .NET property is to use WriteLine() instead of Write():
sw.WriteLine("1\uf0f1");
sw.WriteLine("2\ue0e1");
sw.WriteLine("3\ud0d1");
Which makes your code nicely readable as well, it isn't any slower at runtime. If you want to keep the one-liner then you could use composite formatting:
sw.Write("1\uf0f1{0}2\ue0e1{0}3\ud0d1{0}", Environment.NewLine);
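The difference between Write and WriteLine can be checked without touching the file system. The following is only a minimal sketch: the class and method names (NewLineDemo, WriteBothWays) are made up for illustration, and a StringWriter stands in for the StreamWriter from the question since both inherit NewLine from TextWriter.

```csharp
using System;
using System.IO;

public static class NewLineDemo
{
    // Writes "A" with a hard-coded '\n' and "B" with WriteLine into an
    // in-memory writer whose NewLine is set explicitly, so the result does
    // not depend on the operating system the code runs on.
    public static string WriteBothWays(string newLine)
    {
        using (var writer = new StringWriter())
        {
            writer.NewLine = newLine;  // consulted only by the WriteLine methods
            writer.Write("A\n");       // the hard-coded line feed passes through untouched
            writer.WriteLine("B");     // the terminator comes from writer.NewLine
            return writer.ToString();
        }
    }
}
```

Calling NewLineDemo.WriteBothWays("\r\n") yields "A\nB\r\n": only the line written with WriteLine picks up the configured terminator.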
If you choose to generate the line breaks yourself by sending \n to the StreamWriter, there is no way the framework is going to interfere with that. If you want the framework to honor the NewLine property, use the WriteLine method of the writer and set the NewLine property of the writer.
Adapt your code like so:
sw.NewLine = Environment.NewLine; // StreamWriter uses \r\n by default
sw.WriteLine("1\uf0f1");
sw.WriteLine("2\ue0e1");
sw.WriteLine("3\ud0d1");
Or have a Custom StreamWriter that overrides the Write method:
public class MyStreamWriter : StreamWriter
{
    public MyStreamWriter(Stream s) : base(s)
    {
    }

    // Only the Write(string) overload is intercepted here.
    public override void Write(string s)
    {
        base.Write(s.Replace("\n", Environment.NewLine));
    }
}
Or if you only have one line that you want to handle:
sw.Write("1\uf0f1\n2\ue0e1\n3\ud0d1\n".Replace("\n", Environment.NewLine));
If you use it implicitly, as in calling a WriteLine method, or explicitly, as in Write(String.Concat("Hello", Environment.NewLine)), you get the end-of-line character(s) defined for your environment. If you don't use it and use, say, '\n' or even '$', then you are saying that no matter what environment you're in, lines will end the way you say. If you want to compare behaviour, write a bit of code and run it under Windows and Linux (Mono).
All newlines escaped as \n in a string are single-character ASCII newlines (0x0A) (not Windows newlines 0D0A) and output to streams in writers as 0x0A unless the programmer takes some explicit step to convert these within the string to the format 0D0A.
The TextWriter.NewLine property is used only by methods like WriteLine, and controls the formatting of the implicit newline that is appended as part of the invocation.
The distinction between Environment.NewLine and TextWriter.NewLine is that Environment.NewLine is read-only, only meant to be queried by programmers. (This is different from Java, for instance, where you can change the "system-wide" newline formatting default with System.setProperty("line.separator", x).)
In C# you can modify the format of the implicit newline when writing using TextWriter.NewLine, which is initialized to Environment.NewLine. When using TextReader methods that read lines, there is no TextReader.NewLine property. The implicit newline behavior for readers is to break at any 0x0A, 0x0D, or 0D0A.
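The reader-side behavior can be illustrated with a StringReader, which exposes the same ReadLine method as other TextReader implementations. This is a small sketch; ReadLineDemo and SplitLines are hypothetical names chosen for the example.

```csharp
using System.Collections.Generic;
using System.IO;

public static class ReadLineDemo
{
    // Splits text into lines using TextReader.ReadLine, which treats
    // "\n", "\r" and "\r\n" each as a single line terminator.
    public static List<string> SplitLines(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}
```

For example, ReadLineDemo.SplitLines("a\nb\rc\r\nd") returns the four lines a, b, c and d, regardless of which terminator separated them.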
As pointed out by rene the original problem could be resolved by writing:
sw.Write("1\uf0f1\n2\ue0e1\n3\ud0d1\n".Replace("\n", Environment.NewLine));
Related
Here is a code that writes the string to a file
System.IO.File.WriteAllText("test.txt", "P ");
It's basically the character 'P' followed by a total of 513 space characters.
When I open the file in Notepad++, it appears to be fine. However, when I open in windows Notepad, all I see is garbled characters.
If instead of 513 space character, I add 514 or 512, it opens fine in Notepad.
What am I missing?
What you are missing is that Notepad is guessing, and it is not because your length is specifically 513 spaces ... it is because it is an even number of bytes and the file size is >= 100 total bytes. Try 511 or 515 spaces ... or 99 ... you'll see the same misinterpretation of your file contents. With an odd number of bytes, Notepad can assume that your file is not any of the double-byte encodings, because those would all result in 2 bytes per character = even number of total bytes in the file. If you give the file a few more low-order ASCII characters at the beginning (e.g., "PICKLE" + spaces), Notepad does a much better job of understanding that it should treat the content as single-byte chars.
The suggested approach of including Encoding.UTF8 is the easiest fix ... it will write a BOM to the beginning of the file which tells Notepad (and Notepad++) what the format of the data is, so that it doesn't have to resort to this guessing behavior (you can see the difference between your original approach and the BOM approach by opening both in Notepad++, then look in the bottom-right corner of the app. With the BOM, it will tell you the encoding is UTF-8-BOM ... without it, it will just say UTF-8).
I should also say that the contents of your file are not 'wrong', per se... the weird format is purely due to Notepad's "guessing" algorithm. So unless it's a requirement that people use Notepad to read your file with 1 letter and a large, odd number of spaces ... maybe just don't sweat it. If you do change to writing the file with Encoding.UTF8, then you do need to ensure that any other system that reads your file knows how to honor the BOM, because it is a real change to the contents of your file. If you cannot verify that all consumers of your file can/will handle the BOM, then it may be safer to just understand that Notepad happens to make a bad guess for your specific use case, and leave the raw contents exactly how you want them.
You can verify the physical difference in your file with the BOM by doing a binary read and then converting them to a string (you can't "see" the change with ReadAllText, because it honors & strips the BOM):
byte[] contents = System.IO.File.ReadAllBytes("test.txt");
Console.WriteLine(Encoding.ASCII.GetString(contents));
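To see exactly which bytes the BOM adds, the output can be captured in memory instead of a file. The sketch below is an illustration only: BomDemo and Encode are made-up names, and a StreamWriter over a MemoryStream is used as a stand-in for File.WriteAllText.

```csharp
using System.IO;
using System.Text;

public static class BomDemo
{
    // Encodes text as UTF-8 through a StreamWriter, with or without the
    // byte order mark, and returns the raw bytes that would land in the file.
    public static byte[] Encode(string text, bool withBom)
    {
        using (var stream = new MemoryStream())
        {
            // UTF8Encoding(true) emits the preamble EF BB BF; UTF8Encoding(false) does not.
            using (var writer = new StreamWriter(stream, new UTF8Encoding(withBom)))
            {
                writer.Write(text);
            }
            // ToArray is still valid after the writer has closed the stream.
            return stream.ToArray();
        }
    }
}
```

BomDemo.Encode("P", true) produces the bytes EF BB BF 50, while BomDemo.Encode("P", false) produces just 50. The three extra leading bytes are the real change to the file contents that every consumer has to tolerate.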
Try passing in a different encoding:
i. System.IO.File.WriteAllText(filename , stringVariable, Encoding.UTF8);
ii. System.IO.File.WriteAllText(filename , stringVariable, Encoding.UTF32);
iii. etc.
Also You could try using another way to build your string, to make it be easier to read, change and count, instead of tapping the space bar 513 times;
i. Use the string constructor (like @Tigran suggested)
var result = "P" + new String(' ', 513);
ii. Use the stringBuilder
var stringBuilder = new StringBuilder();
stringBuilder.Append("P");
for (var i = 1; i <= 513; i++) { stringBuilder.Append(" "); }
iii. Or both
public string AppendSpacesToString(string stringValue, int numberOfSpaces)
{
var stringBuilder = new StringBuilder();
stringBuilder.Append(stringValue);
stringBuilder.Append(new String(' ', numberOfSpaces));
return stringBuilder.ToString();
}
I'm writing a program that reads all the text in a file into a string, loops over that string looking at the characters, and then appends the characters back to another string using a StringBuilder. The issue I'm having is when it's written back out, the special characters such as “ and ”, come out looking like ï¿½ characters instead. I don't need to do a conversion, I just want it written back out the way I read it in:
StringBuilder sb = new StringBuilder();
string text = File.ReadAllText(filePath);
for (int i = 0; i < text.Length; ++i) {
if (text[i] != '{') { // looking for opening curly brace
sb.Append(text[i]);
continue;
}
// Do stuff
}
File.WriteAllText(destinationFile, sb.ToString());
I tried using different Encodings (UTF-8, UTF-16, ASCII), but then it just came out even worse; I started getting question mark symbols and Chinese characters (yes, a bit of a shotgun approach, but I was just experimenting).
I did read this article: http://www.joelonsoftware.com/articles/Unicode.html
...but it didn't really explain why I was seeing what I saw, unless in C#, the reader starts cutting off bits when it hits weird characters like that. Thanks in advance for any help!
TL;DR that is definitely not UTF-8 and you are not even using UTF-8 to read the resulting file. Read as Windows1252, write as Windows1252 (If you are going to use the same viewing method to view the resulting file)
Well let's first just say that there is no way a file made by a regular user will be in UTF-8. Not all programs in windows even support it (excel, notepad..), let alone have it as default encoding (even most developer tools don't default to utf-8, which drives me insane). Since a lot of developers don't understand that such a thing as encoding even exists, then what chances do regular users have of saving their files in an utf-8 hostile environment?
This is where your problems first start. According to documentation, the overload you are using File.ReadAllText(filePath); can only detect UTF-8 or UTF-32.
Indeed, simply reading a file encoded normally in Windows-1252 that contains "a”a" results in a string "a�a", where � is the Unicode replacement character (read the Wikipedia section, it describes exactly the situation you are in!) used to replace invalid bytes. When the replacement character is again encoded as UTF-8, and interpreted as Windows-1252, you will see ï¿½ because the bytes for � in UTF-8 are 0xEF, 0xBF, 0xBD, which are the bytes for ï¿½ in Windows-1252.
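The two steps of that corruption can be reproduced in memory. This is a sketch under one stated assumption: ISO-8859-1 is used in place of Windows-1252 for the final misreading, because it is available on every .NET runtime without registering a code page provider, and it maps the bytes 0xEF, 0xBF, 0xBD to the same three characters. MojibakeDemo and its method names are hypothetical.

```csharp
using System.Text;

public static class MojibakeDemo
{
    // Step 1: decode bytes as UTF-8. A byte sequence that is not valid
    // UTF-8 (such as a lone 0x94, the Windows-1252 right double quote)
    // is replaced by U+FFFD.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Step 2: encode the text as UTF-8 and then misread those bytes with a
    // single-byte encoding (ISO-8859-1 here, as a stand-in for Windows-1252).
    public static string ReencodeAndMisread(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8);
    }
}
```

MojibakeDemo.DecodeAsUtf8(new byte[] { 0x61, 0x94, 0x61 }) gives "a\uFFFDa", and MojibakeDemo.ReencodeAndMisread("\uFFFD") gives the three characters U+00EF, U+00BF, U+00BD. The original quote character is already lost after the first step, which is why the fix has to happen at read time.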
So read it as Windows-1252 and you're half-way there:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
String result = File.ReadAllText(@"C:\myfile.txt", windows1252);
Console.WriteLine(result); //Correctly prints "a”a" now
Because you saw ï¿½, the tool you are viewing the newly made file with is also using Windows-1252. So if the goal is to have the file show correct characters in that tool, you must encode the output as Windows-1252:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
File.WriteAllText(@"C:\myFile", sb.ToString(), windows1252);
Chances are the text will be UTF8.
File.ReadAllText(filePath, Encoding.UTF8)
coupled with
File.WriteAllText(destinationFile, sb.ToString(), Encoding.UTF8)
should cover off dealing with the Unicode characters. If you do one or the other you're going to get garbage output, both or nothing.
I have a very simple console application that creates a text file. Below is a recap of the code:
StreamWriter writer = File.CreateText("c:\\temp.txt");
foreach (blah...)
{
writer.Write(body.ToString() + "\n");
writer.Flush();
}
writer.Close();
The client is claiming there are carriage returns at the end of each line. Where are these carriage returns coming from?
Update: After opening in VS binary editor and Notepad++, there were no occurrences of 0d 0a. I'm going to go back to the client.
Open the file in the Visual Studio binary editor (File > Open > File, click the down-arrow on the Open button, choose Open With... and pick Binary Editor), and look for 0D bytes. If none are present, then either:
your client can't tell the difference between a line feed and a carriage return, or
your transmission method is modifying the file en route. Is there any FTP binary/ASCII mismatch going on?
If there are 0D bytes, then they are present in your body variable.
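If a binary editor is not at hand, the same check can be done in code. The following is a rough sketch; LineEndingCounter and its methods are names invented for this example, and the path passed in is whatever file you want to inspect.

```csharp
using System.IO;

public static class LineEndingCounter
{
    // Counts how often a given byte value occurs in a buffer.
    public static int CountByte(byte[] data, byte value)
    {
        int count = 0;
        foreach (byte b in data)
        {
            if (b == value)
            {
                count++;
            }
        }
        return count;
    }

    // Counts carriage return bytes (0x0D) in the raw contents of a file.
    public static int CountCarriageReturns(string path)
    {
        return CountByte(File.ReadAllBytes(path), 0x0D);
    }
}
```

A result of zero from LineEndingCounter.CountCarriageReturns on the file as it leaves your machine, combined with a non-zero result on the client's copy, would point at the transfer step rather than at your code.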
I tested your code.
The code you posted does not have any carriage returns (0D) only new lines (0A). Something else is creating the carriage returns or the client does not know what a carriage return really is.
In your code you put a line feed (\n).
Your customer is talking about a carriage return (\r). Maybe your customer is mistaking a line feed for a carriage return?
The "\n" at the end of each write call
EDIT: I know this is a new line, not a carriage return but I bet any money the client is getting confused between the two and it's actually this that is causing the problem
Does the client distinguish between a CR and LF? Is the flush() necessary? Are you overloading the buffer if you don't flush?
Unless you have a massive amount of text you might find more use out of creating a StringBuilder to format the text exactly as you want it with \n, \r, \t or whatever and then pumping that directly into a StreamWriter.
If each body string's first character was '\r', it would explain what you're seeing.
Have you checked whether body is also ending with characters you don't want printed? This is the other potential problem source.
I'm writing a utility that takes in a .resx file and creates a javascript object containing properties for all the name/value pairs in the .resx file. This is all well and good, until one of the values in the .resx is
This dealer accepts electronic orders.
/r/nClick to order {0} from this dealer.
I'm adding the name/value pairs to the js object like this:
streamWriter.Write(string.Format("\n{0} : \"{1}\"", kvp.Key, kvp.Value));
When kvp.Value = "This dealer accepts electronic orders./r/nClick to order {0} from this dealer."
This causes StreamWriter.Write() to actually place a newline in between 'orders.' and 'Click', which naturally screws up my javascript output.
I've tried different things with @ and without using string.Format, but I've had no luck. Any suggestions?
Edit: This application is run during build to get some javascript files deployed later, so at no point is it accessible to / run by anyone but the app developers. So while I obviously need a way to escape characters here, XSS as such is not really a concern.
Your problem has already happened by the time you get to this code. String.Format will not "expand" literal \n and \r in the substituted strings ({0} etc) into newline and CR, so it must have happened at some earlier point, possibly while reading the .resx file.
You have two possible solutions. One, as you discovered in the comments to DonaldRay's answer, is to explicitly reverse this replacement, and replace literal newlines with the two characters \n:
kvp.Value.Replace("\r", // <-- replaced by the C# compiler with a literal CR character
"\\r"); // <-- "\\" replaced by the C# compiler with a single "\",
// leaving the two-char string "\r"
You will need to do the same for every character that could appear in your strings. \n and \r are the most common, and then \t (tab); that's probably enough for most dev tools.
string formatted = kvp.Value.Replace("\r", "\\r")
.Replace("\n", "\\n")
.Replace("\t", "\\t");
Alternatively, you could look upstream at the .resx file reading code, and try to find and remove the part that's explicitly expanding these character sequences. This would be a better general solution, if it's possible.
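The chained Replace calls can also be folded into a single pass that handles the backslash and the double quote as well, since both would break a double-quoted JavaScript string literal. This is only a sketch of that idea: JsStringEscaper and Escape are hypothetical names, and the set of characters covered is an assumption about what appears in the .resx values, not a complete JavaScript escaper.

```csharp
using System.Text;

public static class JsStringEscaper
{
    // Turns control characters, backslashes and double quotes into their
    // escape sequences so the value can sit inside a double-quoted
    // JavaScript string literal on one line.
    public static string Escape(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            switch (c)
            {
                case '\\': sb.Append("\\\\"); break; // backslash becomes two backslashes
                case '"':  sb.Append("\\\""); break; // quote becomes backslash + quote
                case '\r': sb.Append("\\r");  break; // CR becomes the two characters \r
                case '\n': sb.Append("\\n");  break; // LF becomes the two characters \n
                case '\t': sb.Append("\\t");  break; // tab becomes the two characters \t
                default:   sb.Append(c);      break;
            }
        }
        return sb.ToString();
    }
}
```

Processing each character once avoids the ordering problem that chained Replace calls have, where escaping backslashes after the other replacements would double the backslashes that were just inserted.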
You need to escape the strings, using Microsoft's Anti-XSS Library.
Just escape the backslashes.
kvp.Value = kvp.Value.Replace(@"\", @"\\");
You may need to do this when you are reading from the resx file.
I'm using the textwriter to write data to a text file but if the line exceeds 1024 characters a line break is inserted and this is a problem for me. Any suggestions on how to work round this or increase the character limit?
textWriter.WriteLine(strOutput);
Many thanks
Use Write, not WriteLine
Well you're using TextWriter.WriteLine(string), which appends a line terminator (\r\n by default on Windows) after strOutput. As the docs say:
Writes a string followed by a line terminator to the text stream.
(Emphasis mine.) That has nothing to do with 1024 characters though - my guess is that that's how you're reading it in (e.g. with a buffer of 1024 characters).
To avoid the extra line break, just use
textWriter.Write(strOutput);
EDIT: You say in the comment that you need a line break after "the full line has been written out" - but it sounds like strOutput isn't always the same line.
I suspect the easiest way of accomplishing what you want is to separate the "copying" side out from the "line break" side. Use Write for all the text you want to copy, and then just call
textWriter.WriteLine();
when you want a line break. If this doesn't help, I think we're going to need more context - please provide a code sample of exactly what you're doing.
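As a rough illustration of separating the copying from the line break, the helper below writes any number of pieces with Write and terminates the line exactly once. ChunkedLineWriter and WriteAsOneLine are invented names, and the chunks parameter is an assumption about how the pieces of a line arrive in your program.

```csharp
using System.Collections.Generic;
using System.IO;

public static class ChunkedLineWriter
{
    // Writes every chunk with Write (no terminator), then ends the line
    // once with the parameterless WriteLine.
    public static void WriteAsOneLine(TextWriter writer, IEnumerable<string> chunks)
    {
        foreach (string chunk in chunks)
        {
            writer.Write(chunk);
        }
        writer.WriteLine();
    }
}
```

With this shape, a line of any length stays on one line in the output no matter how many pieces it was assembled from.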
I wrote a sample app that writes and reads a 1025-character string. The size never changes, although if I open it with notepad.exe (Windows) I can see the extra character on the second line. This seems like a Notepad limitation. Here is my sample code:
static void Main(string[] args)
{
    using (TextWriter streamWriter = new StreamWriter("lineLimit.txt"))
    {
        String s = String.Empty;
        for (int i = 0; i < 1025; i++)
        {
            s += i.ToString().Substring(0, 1);
        }
        streamWriter.Write(s);
        streamWriter.Close();
    }
    using (TextReader streamReader = new StreamReader("lineLimit.txt"))
    {
        String s = streamReader.ReadToEnd();
        streamReader.Close();
        Console.Out.Write(s.Length);
    }
}
If you need to add the line breaks at the end of your output, just append them.
textWriter.Write(strOutput +"\r\n");