C# - Read bytes from file from a specific string - c#

I'm trying to parse a crg-file in C#. The file is mixed with plain text and binary data. The first section of the file contains plain text while the rest of the file is binary (lots of floats), here's an example:
$
$ROAD_CRG
reference_line_start_u = 100
reference_line_end_u = 120
$
$KD_DEFINITION
#:KRBI
U:reference line u,m,730.000,0.010
D:reference line phi,rad
D:long section 1,m
D:long section 2,m
D:long section 3,m
...
$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
�#z����RA����\�l
...
I know I can read bytes starting at a specific offset but how do I find out which byte to start from? The last row before the binary section will always contain at least four dollar signs "$$$$". Here's what I've got so far:
using var fs = new FileStream(#"crg_sample.crg", FileMode.Open, FileAccess.Read);
var startByte = ??; // How to find out where to start?
using (BinaryReader reader = new BinaryReader(fs))
{
reader.BaseStream.Seek(startByte, SeekOrigin.Begin);
var f = reader.ReadSingle();
Debug.WriteLine(f);
}

When you have a mixture of text data and binary data, you need to treat everything as binary. This means you should be using raw Stream access, or something similar, and using binary APIs to look through the text data (often looking for cr/lf/crlf at bytes as sentinels, although it sounds like in your case you could just look for the $$$$ using binary APIs, then decode the entire block before, and scan forwards). When you think you have an entire line, then you can use Encoding to parse each line - the most convenient API being encoding.GetString(). When you've finished looking through the text data as binary, then you can continue parsing the binary data, again using the binary API. I would usually recommend against BinaryReader here too, because frankly it doesn't gain you much over more direct API. The other problem you might want to think about is CPU endianness, but assuming that isn't a problem: BitConverter.ToSingle() may be your friend.
If the data is modest in size, you may find it easiest to use byte[] for the data; either via File.ReadAllBytes, or by renting an oversized byte[] from the array-pool, and loading it from a FileStream. The Stream API is awkward for this kind of scenario, because once you've looked at data: it has gone - so you need to maintain your own back-buffers. The pipelines API is ideal for this, when dealing with large data, but is an advanced topic.

UPDATE: This code may not work as expected. Please review the valuable information in the comments.
using (var fs = new FileStream(#"crg_sample.crg", FileMode.Open, FileAccess.Read))
{
using (StreamReader sr = new StreamReader(fs, Encoding.ASCII, true, 1, true))
{
var line = sr.ReadLine();
while (!string.IsNullOrWhiteSpace(line) && !line.Contains("$$$$"))
{
line = sr.ReadLine();
}
}
using (BinaryReader reader = new BinaryReader(fs))
{
// TODO: Start reading the binary data
}
}

Solution
I know this is far from the most optimized solution but in my case it did the trick and since the plain text section of the file was known to be fairly small this didn't cause any noticable performance issues. Here's the code:
using var fileStream = new FileStream(#"crg_sample.crg", FileMode.Open, FileAccess.Read);
using var reader = new BinaryReader(fileStream);
var newLine = '\n';
var markerString = "$$$$";
var currentString = "";
var foundMarker = false;
var foundNewLine = false;
while (!foundNewLine)
{
var c = reader.ReadChar();
if (!foundMarker)
{
currentString += c;
if (currentString.Length > markerString.Length)
currentString = currentString.Substring(1);
if (currentString == markerString)
foundMarker = true;
}
else
{
if (c == newLine)
foundNewLine = true;
}
}
if (foundNewLine)
{
// Read binary
}
Note: If you're dealing with larger or more complex files you should probably take a look at Mark Gravell's answer and the comment sections.

Related

Why is my filestream not writing correctly

I am trying to modify a file-stream inline as the file has the potential to be very large and I don't want to load it into memory. The piece of information I'm editing will always be the same length so in theory I can just swap the content out using a stream reader but it doesn't seem to be writing to the correct place
I have created a section of code that using a stream reader will read line by line until it finds a regex match and will then attempt to swap the bytes out with the edited line. The code is as follows:
private void UpdateFile(string newValue, string path, string pattern)
{
var regex = new Regex(pattern, RegexOptions.IgnoreCase);
int index = 0;
string line = "";
using (var fileStream = File.OpenRead(path))
using (var streamReader = new StreamReader(fileStream, Encoding.Default, true, 128))
{
while ((line = streamReader.ReadLine()) != null)
{
if (regex.Match(line).Success)
{
break;
}
index += Encoding.Default.GetBytes(line).Length;
}
}
if (line != null)
{
using (Stream stream = File.Open(path, FileMode.Open))
{
stream.Position = index + 1;
var newLine = regex.Replace(line, newValue);
var oldBytes = Encoding.Default.GetBytes(line);
var newBytes = Encoding.Default.GetBytes("\n" + newLine);
stream.Write(newBytes, 0, newBytes.Length);
}
}
}
The code almost works as expected, it inserts the updated line but it always does it a little early, just how early varies slightly based on the file I'm editing. I expect it is something to do with the way I am managing the stream position but I don't know the correct way to approach this.
Unfortunately the exact files I'm working on are under NDA.
The structure is as follows though:
A file will have an unkown amount of data followed by a line of a known format, for example:
Description: ABCDEF
I know the portion that follows "Description: " will always be 6 characters, so I do a replace on the line to replace with, for example, UVWXYZ.
The problem is that for example if a file read as
'...
UNIMPORTANT UNKNOWN DATA
DESCRIPTION: ABCDEF
MORE DATA
...'
it will come out as something like
'...
UNIMPORTANT UNKNOWN DDESCRIPTION: UVWXYZDEF
MORE DATA
...'
I think the problem here is that you are not considering the line feed ("\n") for each line you are getting and therefore your index is incorrectly setting the position of your stream. Try the following code:
private void UpdateFile(string newValue, string path, string pattern)
{
var regex = new Regex(pattern, RegexOptions.IgnoreCase);
int index = 0;
string line = "";
using (var fileStream = File.OpenRead(path))
using (var streamReader = new StreamReader(fileStream, Encoding.Default, true, 128))
{
while ((line = streamReader.ReadLine()) != null)
{
if (regex.Match(line).Success)
{
break;
}
index += Encoding.ASCII.GetBytes(line + "\n").Length;
}
}
if (line != null)
{
using (Stream stream = File.Open(path, FileMode.Open))
{
stream.Position = index;
var newBytes = Encoding.Default.GetBytes(regex.Replace(line + "\n", newValue));
stream.Write(newBytes, 0, newBytes.Length);
}
}
}
In your example, you are "off" by 4 Characters. Not quite the common "off by one error", but close. But maybe a different pattern would help the most?
Programms nowadays rarely work "on the file" like that. There is just too much to go wrong, all the way to a power loss mid-process. Instead they:
create a empty new file at the same location. Often temporary named and hidden.
write the output to the new file
Once you are done and eveyrthing is good - all the caches are flushed and everything is on the disk (done by Stream.Close() or Dispose()) - just replace the old file with the new file using the OS move operation.
The advantage is that it is impossible to have data-loss. Even if the computer looses power mid-operation, at tops the temporary file is messed up. You still got the orignal file and yoou can just delte the temporary file and restart the work from scratch if you need too. Indeed recovery only makes sense in rare cases (Word Processors)
The replacement of old file by new file is done with a move order. If they are on the same partition, that is literally just a rename operation in the Filesytem. And as modern FS are basically designed like a topline, robust relational Databases there is no danger in this.
You can find that pattern in everything from your Word Porcessor of choice, to backup programms, the download manager of Firefox (as you might be overriding a file that was there befroe) and even zipping programms. Everytime you got a long writing phase and want to minimize the danger, it is to go to pattern.
And as you can work entirely in memory without having to deal with moving around the read/write head, it will get around your issue too.
Edit: I made some source code for it from memory/documentation. Might contain syntax errors
string sourcepath; //containts the source file path, set by other code
string temppath; //containts teh path of the tempfile. Should be in the same folder, and thus same partiion
//Open both Streams, can use a single using for this
//The supression of any Buffering on the output should be optional and will be detrimental to performance
using(var sourceStream = File.OpenRead(sourcepath),
outStream = File.Create(temppath, 0, FileOptions.WriteThrough )){
string line = "";
//itterte over the input
while((line = streamReader.ReadLine()) != null){
//do processing on line here
outStream.Write(line);
}
}
//replace the files. Pretty sure it will just overwrite without asking
File.Move(temppath, sourcepath);

Reading a complete text file using for loop

I am trying to read a text file using a for loop that runs for a 100 times.
StreamReader reader = new StreamReader("client.txt");
for (int i=0;i<=100;i++)
{
reader.readline();
}
Now this works fine if the text file has 100 lines but not if lets say 700. So I want the loop to run for 100 times but read "1%" of the file in each run.How would i do that?
If file size is not too large you can:
string[] lines = File.ReadAllLines("client.txt");
or
string text = File.ReadAllText("client.txt");
Reading 1% at a time is a bit tricky, I'd go with the approach of reading line by line:
var filename = "client.txt";
var info = new FileInfo(filename);
var text = new StringBuilder();
using (var stream = new FileStream(filename, FileMode.Open))
using (var reader = new StreamReader(stream))
{
while (!reader.EndOfStream)
{
text.AppendLine(reader.ReadLine());
var progress = Convert.ToDouble(stream.Position) * 100 / info.Length;
Console.WriteLine(progress);
}
}
var result = text.ToString();
But please notice, the progress will not be very accurate because StreamReader.ReadLine (and equivalently ReadLineAsync) will often read more than just a single line - it basically reads into a buffer and then interprets that buffer. That's much more efficient than reading a single byte at a time, but it does mean that the stream will have advanced further than it strictly speaking needs to.

Why does FileStream.Position increment in multiples of 1024?

I have a text file that I want to read line by line and record the position in the text file as I go. After reading any line of the file the program can exit, and I need to resume reading the file at the next line when it resumes.
Here is some sample code:
using (FileStream fileStream = new FileStream("Sample.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
fileStream.Seek(GetLastPositionInFile(), SeekOrigin.Begin);
using (StreamReader streamReader = new StreamReader(fileStream))
{
while (!streamReader.EndOfStream)
{
string line = streamReader.ReadLine();
DoSomethingInteresting(line);
SaveLastPositionInFile(fileStream.Position);
if (CheckSomeCondition())
{
break;
}
}
}
}
When I run this code, the value of fileStream.Position does not change after reading each line, it only advances after reading a couple of lines. When it does change, it increases in multiples of 1024. Now I assume that there is some buffering going on under the covers, but how can I record the exact position in the file?
It's not FileStream that's responsible - it's StreamReader. It's reading 1K at a time for efficiency.
Keeping track of the effective position of the stream as far as the StreamReader is concerned is tricky... particularly as ReadLine will discard the line ending, so you can't accurately reconstruct the original data (it could have ended with "\n" or "\r\n"). It would be nice if StreamReader exposed something to make this easier (I'm pretty sure it could do so without too much difficulty) but I don't think there's anything in the current API to help you :(
By the way, I would suggest that instead of using EndOfStream, you keep reading until ReadLine returns null. It just feels simpler to me:
string line;
while ((line = reader.ReadLine()) != null)
{
// Process the line
}
I would agree with Stefan M., it is probably the buffering which is causing the Position to be incorrect. If it is just the number of characters that you have read that you want to track than I suggest you do it yourself, as in:
using(FileStream fileStream = new FileStream("Sample.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
fileStream.Seek(GetLastPositionInFile(), SeekOrigin.Begin);
/**Int32 position = 0;**/
using(StreamReader streamReader = new StreamReader(fileStream))
{
while(!streamReader.EndOfStream)
{
string line = streamReader.ReadLine();
/**position += line.Length;**/
DoSomethingInteresting(line);
/**SaveLastPositionInFile(position);**/
if(CheckSomeCondition())
{
break;
}
}
}
}
Provide that your file is not too big, why not read the whole thing in big chuncks and then manipulate the string - probably faster than the stop and go i/o.
For example,
//load entire file
StreamReader srFile = new StreamReader(strFileName);
StringBuilder sbFileContents = new StringBuilder();
char[] acBuffer = new char[32768];
while (srFile.ReadBlock(acBuffer, 0, acBuffer.Length)
> 0)
{
sbFileContents.Append(acBuffer);
acBuffer = new char[32768];
}
srFile.Close();

.NET: Reading/writing binary string

EDIT: Don't answer; I've found the solution on my own.
I have some code that does this:
using (var stream = new FileStream(args[1], FileMode.Create))
{
using (var writer = new BinaryWriter(stream))
{
writer.Write(ip.Iso3166CountryCode);
...
}
}
Iso3166CountryCode is a string with two characters ("US").
When I try to read "US" from the file:
// line is a byte[] from the file with the first 1024 bytes
UnicodeEncoding.Default.GetString(line.Take(2).ToArray());
I don't get "US" back, I get some odd ASCII characters back. How do I read the two country-code characters from this binary file?
EDIT: NEVER MIND. I changed writer.Write(ip.Iso3166CountryCode) to writer.Write(UnicodeEncoding.Default.GetBytes(ip.Iso3166CountryCode)) and it works.
Try changing writer.Write(ip.Iso3166CountryCode) to writer.Write(UnicodeEncoding.Default.GetBytes(ip.Iso3166CountryCode)), that should work! :)

Bytes consumed by StreamReader

Is there a way to know how many bytes of a stream have been used by StreamReader?
I have a project where we need to read a file that has a text header followed by the start of the binary data. My initial attempt to read this file was something like this:
private int _dataOffset;
void ReadHeader(string path)
{
using (FileStream stream = File.OpenRead(path))
{
StreamReader textReader = new StreamReader(stream);
do
{
string line = textReader.ReadLine();
handleHeaderLine(line);
} while(line != "DATA") // Yes, they used "DATA" to mark the end of the header
_dataOffset = stream.Position;
}
}
private byte[] ReadDataFrame(string path, int frameNum)
{
using (FileStream stream = File.OpenRead(path))
{
stream.Seek(_dataOffset + frameNum * cbFrame, SeekOrigin.Begin);
byte[] data = new byte[cbFrame];
stream.Read(data, 0, cbFrame);
return data;
}
return null;
}
The problem is that when I set _dataOffset to stream.Position, I get the position that the StreamReader has read to, not the end of the header. As soon as I thought about it this made sense, but I still need to be able to know where the end of the header is and I'm not sure if there's a way to do it and still take advantage of StreamReader.
You can find out how many bytes the StreamReader has actually returned (as opposed to read from the stream) in a number of ways, none of them too straightforward I'm afraid.
Get the result of textReader.CurrentEncoding.GetByteCount(totalLengthOfAllTextRead) and then seek to this position in the stream.
Use some reflection hackery to retrieve the value of the private variable of the StreamReader object that corresponds to the current byte position within the internal buffer (different from that with the stream - usually behind, but no more than equal to of course). Judging by .NET Reflector, the this variable seems to be named bytePos.
Don't bother using a StreamReader at all but instead implement your custom ReadLine function built on top of the Stream or BinaryReader even (BinaryReader is guaranteed never to read further ahead than what you request). This custom function must read from the stream char by char, so you'd actually have to use the low-level Decoder object (unless the encoding is ASCII/ANSI, in which case things are a bit simpler due to single-byte encoding).
Option 1 is going to be the least efficient I would imagine (since you're effectively re-encoding text you just decoded), and option 3 the hardest to implement, though perhaps the most elegant. I'd probably recommend against using the ugly reflection hack (option 2), even though it's looks tempting, being the most direct solution and only taking a couple of lines. (To be quite honest, the StreamReader class really ought to expose this variable via a public property, but alas it does not.) So in the end, it's up to you, but either method 1 or 3 should do the job nicely enough...
Hope that helps.
So the data is utf8 (the default encoding for StreamReader). This is a multibyte encoding, so IndexOf would be inadvisable. You could:
Encoding.UTF8.GetByteCount(string)
on your data so far, adding 1 or 2 bytes for the missing line ending.
If you're needing to count bytes, I'd go with the BinaryReader. You can take the results and cast them about as needed, but I find its idea of its current position to be more reliable (in that since it reads in binary, its immune to character-set problems).
So your last line contains 'DATA' + an unknown amount of data bytes. You could extract the position by using IndexOf() with your last read line. Then readjust the stream.Position.
But I am not sure if you should use ReadLine() at all in this case. Maybe it would be better to read byte by byte until you reach the 'DATA' mark.
The line breaks are easily identifiable without needing to decode the stream first (except for some encodings rarely used for text files like EBCDIC, UTF-16, UTF-32), so you can just read each line as bytes and then decode the entire line:
using (FileStream stream = File.OpenRead(path)) {
List<byte> buffer = new List<byte>();
bool hasCr = false;
bool done = false;
while (!done) {
int b = stream.ReadByte();
if (b == -1) throw new IOException("End of file reached in header.");
if (b == 13) {
hasCr = true;
} else if (b == 10 && hasCr) {
string line = Encoding.UTF8.GetString(buffer.ToArray(), 0, buffer.Count);
if (line == "DATA") {
done = true;
} else {
HandleHeaderLine(line);
}
buffer.Clear();
hasCr = false;
} else {
if (hasCr) buffer.Add(13);
hasCr = false;
buffer.Add((byte)b);
}
}
_dataOffset = stream.Position;
}
Instead of closing the stream and open it again, you could of course just keep on reading the data.

Categories