Text parsing a tab-delimited file - C#

I have a method that reads a file. This file has roughly 30000 lines. However, when I read it into an array I get a random length for my array; I have seen it as low as 6000.
I used both
string[] lines = System.IO.File.ReadAllLines(@"C:\out\qqqqq.txt");
and
System.IO.StreamReader file = new System.IO.StreamReader(@"C:\out\qqqqq.txt");
(and used a counter.)
But I get the same result. I can see in Excel that these counts are too small.

If the line endings in the file are inconsistent (sometimes \n, sometimes \r\n and sometimes \r) then you could try reading the entire file as a string and splitting it yourself:
string file = System.IO.File.ReadAllText(@"C:\out\qqqqq.txt");
var lines = file.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
For large files this is inefficient, because it needs to read the entire file into memory; with StreamReader you can read the file line by line as you process it. If performance is an issue, you could write a simple tool that first normalizes the line endings.
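If you do stick with the StreamReader approach, a minimal counting loop such as the one below (a sketch; the path comes from the question and the tab split is only illustrative) lets you process and count lines in a single pass:
int count = 0;
using (var file = new System.IO.StreamReader(@"C:\out\qqqqq.txt"))
{
string line;
while ((line = file.ReadLine()) != null)
{
count++;
var fields = line.Split('\t'); // process the tab-delimited fields here
}
}
Console.WriteLine(count);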

Related

Write text to file in C# with 513 space characters

Here is code that writes a string to a file:
System.IO.File.WriteAllText("test.txt", "P ");
It's basically the character 'P' followed by a total of 513 space characters.
When I open the file in Notepad++, it appears to be fine. However, when I open it in Windows Notepad, all I see is garbled characters.
If instead of 513 space characters I use 514 or 512, it opens fine in Notepad.
What am I missing?
What you are missing is that Notepad is guessing, and it is not because your length is specifically 513 spaces ... it is because it is an even number of bytes and the file size is >= 100 total bytes. Try 511 or 515 spaces ... or 99 ... you'll see the same misinterpretation of your file contents. With an odd number of bytes, Notepad can assume that your file is not any of the double-byte encodings, because those would all result in 2 bytes per character = even number of total bytes in the file. If you give the file a few more low-order ASCII characters at the beginning (e.g., "PICKLE" + spaces), Notepad does a much better job of understanding that it should treat the content as single-byte chars.
The suggested approach of including Encoding.UTF8 is the easiest fix ... it will write a BOM to the beginning of the file which tells Notepad (and Notepad++) what the format of the data is, so that it doesn't have to resort to this guessing behavior (you can see the difference between your original approach and the BOM approach by opening both in Notepad++, then look in the bottom-right corner of the app. With the BOM, it will tell you the encoding is UTF-8-BOM ... without it, it will just say UTF-8).
I should also say that the contents of your file are not 'wrong', per se... the weird format is purely due to Notepad's "guessing" algorithm. So unless it's a requirement that people use Notepad to read your file with 1 letter and a large, odd number of spaces ... maybe just don't sweat it. If you do change to writing the file with Encoding.UTF8, then you do need to ensure that any other system that reads your file knows how to honor the BOM, because it is a real change to the contents of your file. If you cannot verify that all consumers of your file can/will handle the BOM, then it may be safer to just understand that Notepad happens to make a bad guess for your specific use case, and leave the raw contents exactly how you want them.
You can verify the physical difference in your file with the BOM by doing a binary read and then converting them to a string (you can't "see" the change with ReadAllText, because it honors & strips the BOM):
byte[] contents = System.IO.File.ReadAllBytes("test.txt");
Console.WriteLine(Encoding.ASCII.GetString(contents));
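For example, something like this (a sketch; "test.txt" and the 513 spaces come from the question) shows the three-byte UTF-8 BOM at the start of the file once Encoding.UTF8 is used:
System.IO.File.WriteAllText("test.txt", "P" + new string(' ', 513), Encoding.UTF8);
byte[] withBom = System.IO.File.ReadAllBytes("test.txt");
// Prints "EF-BB-BF-50-20-20-20-20": the BOM, then 'P', then the first spaces.
Console.WriteLine(BitConverter.ToString(withBom, 0, 8));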
Try passing in a different encoding:
i. System.IO.File.WriteAllText(filename , stringVariable, Encoding.UTF8);
ii. System.IO.File.WriteAllText(filename , stringVariable, Encoding.UTF32);
iii. etc.
Also, you could build your string another way, to make it easier to read, change, and count, instead of tapping the space bar 513 times:
i. Use the string constructor (as @Tigran suggested)
var result = "P" + new String(' ', 513);
ii. Use a StringBuilder
var stringBuilder = new StringBuilder();
stringBuilder.Append("P");
for (var i = 1; i <= 513; i++) { stringBuilder.Append(" "); }
iii. Or both
public string AppendSpacesToString(string stringValue, int numberOfSpaces)
{
var stringBuilder = new StringBuilder();
stringBuilder.Append(stringValue);
stringBuilder.Append(new String(' ', numberOfSpaces));
return stringBuilder.ToString();
}
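For example, combining the helper with the encoding suggestion above (a hypothetical call; the file name is the one from the question):
var result = AppendSpacesToString("P", 513);
System.IO.File.WriteAllText("test.txt", result, Encoding.UTF8);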

CSV file double-spacing lines

I am having trouble with OpenOffice Calc opening a CSV file that I create using StreamWriter in C#. When it opens, it has empty lines between every line that should be there (double-spaced). There seems to be some kind of doubling of the carriage returns. When I open it in Notepad it reads correctly. When I changed the program to write integers instead of strings, the problem went away. It seems to be adding a return at the end of each string, and then the formatting adds another return that I'm not seeing.
Output looks like this...
1...

2...

3...
Output should look like this...
1...
2...
3...
Here is the ForEach loop I use to write the List to file...
using (StreamWriter sw = new StreamWriter(@"c:\andy\Arduino StreamWriter.csv", false, Encoding.UTF8))
{
foreach (string element in SerialPortString)
{
sw.WriteLine(element);
}
}
There is only one field of data per line, so there are no delimiters, just new lines. I tried formatting so that it would write with quotes around each field hoping that would eliminate confusion for the CSV format, but I wasn't able to figure that out either.
Any help would be appreciated.
Thanks.
Change
sw.WriteLine(element);
to
sw.WriteLine(element.Trim());
or maybe
sw.WriteLine(element.TrimEnd());
Trim the element first. That will remove any line feeds or other whitespace characters around the 'edges' of the string. Then the StreamWriter's CRLFs will be the only newlines present.
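Applied to the loop from the question, that looks something like this (a sketch; the path and SerialPortString are taken from the question):
using (StreamWriter sw = new StreamWriter(@"c:\andy\Arduino StreamWriter.csv", false, Encoding.UTF8))
{
foreach (string element in SerialPortString)
{
// TrimEnd strips the trailing '\r' (or other whitespace) left over from the serial read,
// so the CRLF that WriteLine appends is the only newline written per record.
sw.WriteLine(element.TrimEnd());
}
}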

Output streamreader differs from textline

I have to read the lines of a .log text file.
When examining the output during the run, it shows the text with stray '\0' (null) characters in between the chars.
I have tried several reading methods (including reading into a byte[]), but this didn't solve the situation. I can't figure out why it does this.
This is the last form of reading I have tried:
string[] fileLinesRaw = System.IO.File.ReadAllLines(filePath);
Text line to be read vs. actual reader output: (screenshots in the original post)
Interleaved null characters like that indicate that the file is encoded with a scheme that uses 2+ bytes to encode characters, such as UTF-16.
ReadAllLines uses UTF-8 encoding by default, so pass the file's actual encoding explicitly:
string[] fileLinesRaw = System.IO.File.ReadAllLines(filePath, Encoding.Unicode);
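Alternatively, StreamReader can detect the encoding from a byte order mark, if the file starts with one (a sketch, not part of the original answer):
using (var reader = new System.IO.StreamReader(filePath, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
{
string line;
while ((line = reader.ReadLine()) != null)
{
// each line now comes through without interleaved '\0' characters
}
}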

Reading a text file from Unity3d

I have an error in a script which reads from a text file outside the program.
The error is
FormatException: Input string was not in the correct format
It's obvious what's wrong, but I just don't understand why it can't read it properly.
My code:
using (FileStream fs = new FileStream(@"D:\Program Files (x86)\Steam\SteamApps\common\blabla...to my file.txt))
{
byte[] b = new byte[1024];
UTF8Encoding temp = new UTF8Encoding(true);
while (fs.Read(b, 0, b.Length) > 0)
{
//Debug.Log(temp.GetString(b));
var converToInt = int.Parse(temp.GetString(b));
externalAmount = converToInt;
}
fs.Close();
}
The text file has 4 lines of values.
Each line represents an object in a game. All I am trying to do is read these values. Unfortunately I get the error, which I can't explain.
So how can I read new lines without getting the error?
the text file looks like this
12
5
6
0
4 lines, no more, all values on a separate line.
There's no closing " on your new FileStream(" ...);, but I'm going to assume that's an issue from copy-pasting your code to Stack Overflow.
The error you're getting is likely because you're trying to parse whitespace to int, which won't work; the input string (" " in this case) was not in the correct format (int).
Split your lines on spaces (Split(' ')) and parse every item in the resulting array.
A couple of problems:
Problem 1
fs.Read(b, 0, b.Length) may read one byte, or all of them. The normal way to read a text file like this is to use StreamReader instead of FileStream. StreamReader has a convenience constructor for opening a file that works the same way, but it can read line by line and is much more convenient. Here's the documentation and an excellent example: https://msdn.microsoft.com/en-us/library/f2ke0fzy(v=vs.110).aspx
If you insist on reading directly from a filestream, you will either need to
Parse your string outside the loop so you can be certain you've read the whole file into your byte buffer (b), or
Parse the new content byte by byte until you find a particular separator (for example a space or a newline) and then parse everything in your buffer and reset the buffer.
Problem 2
Most likely your buffer already contains everything in the file. Your file is so small that the FileStream object is probably reading the whole thing in a single shot, even though that's not guaranteed.
Since your string buffer contains ALL the characters in the file you are effectively trying to parse "12\n5\n6\n0" as an integer and the parser is choking on the newline characters. Since newlines are non-numeric, it has no idea how to interpret them.
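A sketch of the StreamReader approach from Problem 1, where filePath stands in for the question's truncated path and externalAmount is the field from the question:
using (var reader = new System.IO.StreamReader(filePath))
{
string line;
while ((line = reader.ReadLine()) != null)
{
// each line holds a single value such as "12", so int.Parse never sees newline characters
externalAmount = int.Parse(line.Trim());
}
}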

Why does StreamReader.ReadLine() return a value for a one line file with no newline?

I want to append two text files together.
I have one file with a carriage return line feed at the end. Observe file A which is 28 bytes.
this is a line in the file\n
then I have another file which is the same thing without the new line. Observe file B which is 26 bytes.
this is a line in the file
I want to append the same file to itself (file A to A, and file B to B) and compare the byte counts.
However, when using StreamReader.ReadLine() on file B, I get a value returned, but MSDN says:
A line is defined as a sequence of characters followed by a line feed ("\n"), a carriage return ("\r") or a carriage return immediately followed by a line feed ("\r\n"). The string that is returned does not contain the terminating carriage return or line feed. The returned value is null if the end of the input stream is reached.
However, there is no CRLF in the file.
How can I safely append these files without adding an extra line break at the end? For example, StreamWriter.WriteLine() will put an extra line break on file A when I don't want it to. What would be an ideal approach?
You'll only get null if you call ReadLine at the end of the stream. Otherwise, you'll get all data up until either a line terminator (CR, LF, or CRLF) or the end of the stream.
If you're trying to do a byte-for-byte duplication (and comparison), you're better off reading either characters (using StreamReader/StreamWriter as you're doing now) or bytes (using just the Stream class) with the normal Read and Write methods rather than ReadLine and WriteLine.
You could also just read the entire contents of the file using ReadToEnd, then write it by calling Write (not WriteLine), though this isn't practical if the file is large.
string data;
using(StreamReader reader = new StreamReader(path))
{
data = reader.ReadToEnd();
}
using(StreamWriter writer = new StreamWriter(path, true))
{
writer.Write(data);
}
StreamReader and StreamWriter (which derive from TextReader and TextWriter) are not suitable for situations requiring an exact copy of binary data. They are high-level abstractions of a file, which consists of bytes, not text or lines. In fact, not only could you wind up with a different number of newlines, but depending on the environment you might write out a line terminator other than the expected CR/LF.
You should instead just copy from one stream to another. This is quite easy actually.
var bytes = File.ReadAllBytes(pathIn);
var stream = File.Open(pathOut, FileMode.Append);
stream.Write(bytes, 0, bytes.Length);
stream.Close();
If the size of the file is potentially large, you should open both the input and output file at the same time and use a fixed-sized buffer to copy a block at a time.
const int BLOCK_SIZE = 4096; // example buffer size
using (var streamIn = File.Open(pathIn, FileMode.Open, FileAccess.Read))
using (var streamOut = File.Open(pathOut, FileMode.Append)) {
var bytes = new byte[BLOCK_SIZE];
int count;
while ((count = streamIn.Read(bytes, 0, bytes.Length)) > 0) {
streamOut.Write(bytes, 0, count);
}
}
Also worth noting is that the above code could be replaced by Stream.CopyTo which is new in .NET 4.
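For example (a sketch of the CopyTo alternative, .NET 4 and later):
using (var streamIn = File.OpenRead(pathIn))
using (var streamOut = File.Open(pathOut, FileMode.Append))
{
streamIn.CopyTo(streamOut); // copies the input in fixed-size blocks internally
}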
You can use StreamWriter.Write instead of WriteLine to avoid the extra CRLF.
As to the ReadLine docs, I believe the problem is a poorly worded explanation. You certainly wouldn't want the last bytes of a file discarded just because there is no final line terminator.
Well, it really depends on the reasons for your implementation (why are you reading it line by line and writing it back line by line?). You could just use StreamWriter.Write(string) to output all the text you have stored; the WriteLine() methods are named as such because they append a newline.
TextWriter.WriteLine Method (String)
Writes a string followed by a line terminator to the text stream.
