Very strange behavior
If I create a BufferedStream on top of a file and then seek to an offset, I get a block of bytes back.
If I move the debugger back to the seek and re-seek, I get an extra two characters.
I've triple-checked this.
Can there possibly be a bug in this class?
If I re-seek to the same position I expect to get the same data back. The file has not changed, I open it in read-only mode, and I seek relative to an explicit origin.
Reproduction:
bufferedStream.Seek(100, SeekOrigin.Begin);
bufferedStream.Read(buffer, 0, 100);
is different to what you get from here
bufferedStream.Seek(100, SeekOrigin.Begin);
bufferedStream.Read(buffer, 0, 100);
First off, it is hard to know whether they are the same without checking the return value from Read - is it possible the two calls are simply returning different-sized chunks? (That is perfectly valid; it is your job to loop over Read until you have enough data, or hit EOF.)
However, I wonder if a BOM is involved here - especially if you are sitting a text reader on top of this stream. Simply put, a reader expects the BOM at the start, so it may well hide it the first time through. But if you rewind the stream while using the same reader/decoder, it won't be expecting a BOM, so it will try to report it as character data (or throw an error, depending on the configuration).
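The "loop over Read until you have enough data" advice above can be sketched like this (a minimal sketch; nothing here is specific to BufferedStream, and the helper name is my own):

```csharp
using System;
using System.IO;

static byte[] ReadExactly(Stream stream, int count)
{
    // Read repeatedly until we have `count` bytes or the stream ends:
    // a single Read call is allowed to return fewer bytes than requested.
    byte[] buffer = new byte[count];
    int total = 0;
    while (total < count)
    {
        int read = stream.Read(buffer, total, count - total);
        if (read == 0)
            throw new EndOfStreamException($"Expected {count} bytes, got {total}.");
        total += read;
    }
    return buffer;
}
```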
Related
I am wondering whether it's good practice to use EndOfStreamException to detect the end of a BinaryReader stream? I don't want to use the BaseStream.Length property or PeekChar as proposed here
C# checking for binary reader end of file
because I have to load it into memory (maybe because it's from a Zip file) and enable some flags. Instead, this is what I'm doing:
using (ZipArchive zipArchive = ZipFile.OpenRead(Filename))
using (BinaryReader fStream = new BinaryReader(zipArchive.Entries[0].Open()))
{
    while (true)
    {
        try
        {
            fStream.ReadInt32();
        }
        catch (EndOfStreamException ex)
        {
            Log.Debug("End of Binary Stream");
            break;
        }
    }
}
That approach is fine. If you know you have a seekable stream, you can compare its length to the number of bytes read instead. Note that FileStream.Length does not load the whole stream into memory.
The exception-based approach, though, is appropriate for arbitrary streams.
And don't worry about the cost of using exceptions in this case: streams imply I/O, and I/O is orders of magnitude slower than exception handling.
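For the seekable case, the length comparison might look like this (a sketch; `path` is a placeholder for your own file):

```csharp
using System.Collections.Generic;
using System.IO;

string path = "values.bin"; // placeholder: any file of packed Int32 values

// Seekable-stream variant: compare Position to Length instead of
// catching EndOfStreamException. Only valid when the stream can seek.
var values = new List<int>();
using (var reader = new BinaryReader(File.OpenRead(path)))
{
    while (reader.BaseStream.Position < reader.BaseStream.Length)
        values.Add(reader.ReadInt32());
}
```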
I would argue that 'best practice' is to make the number of values known in advance, for example by prefixing the stream with the count. That would allow you to write it like
var length = fStream.ReadInt32();
for(var i = 0; i < length-1; i++){
fStream.ReadInt32(); // Skip all values except last
}
return fStream.ReadInt32(); // Last value
First of all, this reduces the need for exception handling: if you reach the end of the stream before the last item, you know the stream was incorrectly saved and have a chance of handling that, instead of silently accepting the last available value. I also find it helpful to have as few exceptions as possible, so you can run your debugger with "break when thrown" and have some confidence that thrown exceptions indicate actual problems. It also allows you to store your values as part of some larger data structure.
If you cannot change the input format you can still get the uncompressed length of the entry from ZipArchiveEntry.Length. Just divide by sizeof(int) to get the number of values.
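That could look something like this (a sketch reusing the names from the snippet above; Filename and the entry index come from the question):

```csharp
using System.IO;
using System.IO.Compression;

using (ZipArchive zipArchive = ZipFile.OpenRead(Filename))
{
    ZipArchiveEntry entry = zipArchive.Entries[0];
    // Length is the uncompressed size, known without decompressing the entry.
    long valueCount = entry.Length / sizeof(int);
    using (var reader = new BinaryReader(entry.Open()))
    {
        for (long i = 0; i < valueCount; i++)
            reader.ReadInt32(); // process each value here
    }
}
```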
In most cases I would also argue for using a serialization library to save data. This tend to make it much easier to change the format of the data in the future.
I am trying to read the data stored in an ICMT tag on a WAV file generated by a noise monitoring device.
The RIFF parsing code all seems to work fine, except for the fact that the ICMT tag seems to have data after the declared size. As luck would have it, it's the timestamp, which is the one absolutely critical piece of info for my application.
SYN is hex 16, which gives a size of 22, which is up to and including the NUL before the timestamp. The monitor documentation is no help; it says that the tag includes the time, but their example also has the same issue.
It is the last tag in the enclosing list, and the size of the list does include it - does that mean it doesn't need a chunk ID? I'm struggling to find decent RIFF docs, but I can't find anything that suggests that's the case; also I can't see how it'd be possible to determine that it was the last chunk and so know to read it with no chunk ID.
Alternatively, the ICMT comment chunk is the last thing in the file - is that a special case? Can I just get the time by reading everything from the end of the declared length ICMT to the end of the file and assume that will always work?
The current parser behaviour is that it's being read after the channel / dB information as a chunk ID + size, and then complaining that there was not enough data left in the file to fulfil the request.
No, it would still need its own ID. And no, being the last thing in the file is not a special case either. What you're showing here is malformed.
Your current parser errors out correctly, as the next thing to be expected is again a 4-byte ID followed by 4 bytes for the length. The potential ID _10: is unknown and would be skipped, but interpreting 51:4 as a DWORD length of course asks for trouble.
The device is the culprit. Do you have other INFO fields that use NUL bytes? If not, I assume the device is naive enough to consider a NUL the end of a string, despite itself producing strings with multiple NULs.
Since I have encountered countless files that don't stick to the standards, I can only say your parser is too naive as well: it knows how long the encapsulating list is, and thus could easily detect field lengths that no longer fit - and could ignore garbage like that. Or, in your case, offer the very specific option "append to last field".
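The "detect field lengths that would not fit" check could be sketched like this (the helper and its parameter names are hypothetical; `listEnd` is the absolute offset where the enclosing LIST chunk ends):

```csharp
// A chunk declared inside a LIST must fit entirely within that LIST;
// if it doesn't, the declared size (or the stream) is garbage.
static bool ChunkFits(long chunkStart, uint declaredSize, long listEnd)
{
    // 8 bytes of header (4-byte ID + 4-byte size), then the declared payload.
    long chunkEnd = chunkStart + 8 + declaredSize;
    return chunkEnd <= listEnd;
}
```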
FileStream.Read() returns the number of bytes read, but... is there any situation, other than having reached the end of file, in which it will read fewer bytes than requested and not throw an exception?
The documentation says:
The Read method returns zero only after reaching the end of the stream. Otherwise, Read always reads at least one byte from the stream before returning. If no data is available from the stream upon a call to Read, the method will block until at least one byte of data can be returned. An implementation is free to return fewer bytes than requested even if the end of the stream has not been reached.
But this doesn't quite explain in what situations data would be unavailable and cause the method to block until it can read again. I mean, shouldn't most situations where data is unavailable force an exception?
What are real situations where comparing the number of bytes read against the number of expected bytes could differ (assuming that we're already checking for end of file when we mention number of bytes expected)?
EDIT: A bit more information, reason why I'm asking this is because I've come across a bit of code where the developer pretty much did something like this:
bytesExpected = (remainingBytesInFile > 94208) ? 94208 : remainingBytesInFile;
while (bytesRead < bytesExpected)
{
    bytesRead += fileStream.Read(buffer, bytesRead, bytesExpected - bytesRead);
}
Now, I can't see any advantage to having this while loop at all. I'd expect it to throw an exception if it can't read the number of bytes expected (bearing in mind it's already taking into account that there are that many bytes left to read).
What reason could one possibly have for something like this? I'm sure I'm missing something.
The documentation is for Stream.Read, from which FileStream is derived. Since FileStream is a stream, it should obey the stream contract. Not all streams do, but unless you have a very good reason, you should stick to that.
In a typical file stream, you'll only get a return value smaller than count when you reach the end of file (and it's a pretty simple way of checking for the end of file).
However, in a NetworkStream, for example, you keep reading in a loop until the method returns zero - signalling the end of stream. The same works for file streams - you know you're at the end of the file when Read returns zero.
Most importantly, FileStream isn't just for what you'd consider files - it's also for pseudo-files like standard input/output pipes and COM ports, for example (try opening a file stream on PRN, for example). In that case, you're not reading a file with a fixed length, and the behaviour is the same as with NetworkStream.
Finally, don't forget that FileStream isn't sealed. It's perfectly fine for you to implement a virtualized file system, for example - and it's perfectly fine if your virtualized file system doesn't support seeking, or checking the length of file.
EDIT:
To address your edit: this is exactly how you're supposed to read any stream, and there is nothing wrong with it. If there's nothing else to read in a stream, the Read method will simply return 0, and you know the stream is over. The only thing is, the developer seems to be trying to fill the buffer completely, one buffer at a time - this only makes sense if you explicitly need to partition the file into 94208-byte chunks and pass each byte[] somewhere for further processing.
If that's not the case, you don't really need to fill the full buffer - you just keep reading (and probably writing on some other side) until Read returns 0. And indeed, by default, FileStream will always fill the whole buffer unless it's built around a pipe handle - but since that's a possibility, you shouldn't rely on the "real file" behaviour. So as long as you need those byte[] chunks for something non-stream (e.g. parsing messages), this is entirely fine. If you're only using the stream as an actual stream, streaming the data somewhere else, there's no point, really - you only need one while loop to read the file.
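For the streaming-only case, that single loop looks like this (a sketch; the helper name is my own, and the 94208 buffer size is just taken from the question):

```csharp
using System.IO;

// Forward whatever Read returns until it returns 0 - no attempt
// to fill the buffer completely on each iteration.
static void Pump(Stream source, Stream destination)
{
    byte[] buffer = new byte[94208];
    int read;
    while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        destination.Write(buffer, 0, read);
}
```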
Your expectations would only apply when the stream is reading data from a zero-latency source. Other I/O sources can be slow, which is why the Read method will not always be able to return immediately. That doesn't mean there is an error (so no exception), just that it has to wait for data to arrive.
Examples: network stream, file stream on slow disk, etc.
(UPDATE, HDD example) To give an example specific to files (since your case is FileStream, although Read is defined on Stream, so all implementations should fulfill the requirements): mechanical hard drives go to "sleep" when not active (especially on battery-powered devices, read: laptops). Spinning up can take a second or so. That is not an IOException, but your read would have to wait a second before any data arrives.
The simple answer is that on a FileStream it probably never happens.
However, keep in mind that the Read method is inherited from Stream, which serves as the base for many other streams, like NetworkStream; in that case you may not be able to read as many bytes as you requested, simply because they haven't been received from the network yet.
So, as the documentation says, it all depends on the implementation of the specific type of stream - FileStream, NetworkStream, etc.
Is there a class that lets you read lines by line number in C#?
I know about StreamReader and TextFieldParser but AFAIK those don't have this functionality. For example, if I know that line number 34572 in my text file contains certain data, it would be nice to not have to call StreamReader.ReadLine() 34572 times.
Unless the file has a precise and pre-determined format, for instance with every line having the same length, there is no way to seek within a text file.
In order to find the ith line, you must find the first i-1 line ends. And if you do not know anything about where those line ends could be, it follows that you must read the entire file up until the ith line.
This is not a problem with C# - this is a problem with line terminators. There's no way to skip to line 34572, because you don't know where it starts - the only thing you know is that it starts after you find 34571 \r\ns. If you need this functionality, you don't want to be using text files at all :)
A simple (but still slow) way would be to use File.ReadLines(...):
var line = File.ReadLines(fileName).Skip(34571).FirstOrDefault();
The best way, however, would be to know the actual byte offset of the line. If you remember the offset instead of the line number, you can simply seek in the stream and avoid reading the unnecessary data. Then you'd just continue reading the line as usual:
streamReader.BaseStream.Seek(offset, SeekOrigin.Begin);
var line = streamReader.ReadLine();
This is useful if the file is append-only (e.g. a log file) and you can afford to remember bookmarks. It will only work if the file isn't modified in front of the bookmark, though.
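One caveat with that snippet: StreamReader buffers its input, so after seeking the underlying stream you should also discard the reader's buffer, or ReadLine may return stale data:

```csharp
streamReader.BaseStream.Seek(offset, SeekOrigin.Begin);
streamReader.DiscardBufferedData(); // drop bytes buffered before the seek
var line = streamReader.ReadLine();
```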
All in all, there are three options:
Add indexing - have a table that contains all (or some) of the line-start offsets
Have a fixed line length - this allows you to seek predictably without an index; however, this will not work well with unicode, so it's pretty much useless these days
Parse the file - this pretty much amounts to reading the file line by line, the only optimisation being that you don't actually need to allocate the strings - a simple reusable byte buffer would do.
There's a reason why text formats aren't preferred when performance is important - when you work with user-editable general text formats, the third option is the only option. Thus, reading from a JSON, XML, text log file etc. will always mean reading up to at least the line you want.
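The first option (indexing) can be sketched as a single pass that records where each line starts (a sketch; assumes '\n' line endings, though '\r\n' also works, since the offset just after the '\n' is still the start of the next line):

```csharp
using System.Collections.Generic;
using System.IO;

// offsets[i] is the byte offset where line i starts.
static List<long> IndexLineOffsets(Stream stream)
{
    var offsets = new List<long> { 0 }; // line 0 starts at offset 0
    int b;
    while ((b = stream.ReadByte()) != -1)
    {
        if (b == '\n')
            offsets.Add(stream.Position);
    }
    return offsets;
}
```

With such an index, reading line i becomes a Seek to offsets[i] followed by a single ReadLine.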
This is probably not your case, but if you knew the length of each line you could calculate the starting byte of the line you are looking for and go like below:
FileStream fs = new FileStream(fullFileName, FileMode.Open, FileAccess.Read);
fs.Seek(startByte, SeekOrigin.Begin);
for (long offset = startByte; offset < fs.Length; offset++)
{
    char value = (char)fs.ReadByte();
    .
    .
    .
    // To determine the end of each line you can use the conditions below
    if (value == 0xA) // '\n'
    {
        if (offset == fs.Length - 1)
            continue;
    }
    else if (value == 0xD) // '\r'
    {
        if (offset == fs.Length - 2)
            continue;
    }
}
Is there a library that I can use to perform binary search in a very big text file (can be 10GB).
The file is a sort of a log file - every row starts with a date and time. Therefore rows are ordered.
I started to write the pseudo-code for how to do it, but I gave up since it might seem condescending. You probably know how to write a binary search; it's really not complicated.
You won't find it in a library, for two reasons:
It's not really "binary search" - the line sizes differ, so you need to adapt the algorithm (e.g. seek to the middle of the file, then look for the next newline and consider that to be the "middle").
Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message).
In summary - I think your need is too specific, and too simple to implement in custom code, for someone to bother writing a library :)
As the line lengths are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter e.g. carriage return or line feed.
The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.
When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.
You should find that you will reach your target pretty quickly.
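The procedure described above might be sketched like this (a sketch, not a hardened implementation: it assumes '\n'-terminated, single-byte-encoded lines and a file sorted by the compared key; `compare` receives a full line and returns negative/zero/positive for a line before/at/after the target, and the result is the offset of the first line at or after the target):

```csharp
using System;
using System.IO;
using System.Text;

static long FindLine(Stream stream, Func<string, int> compare)
{
    long lo = 0, hi = stream.Length;
    while (lo < hi)
    {
        long mid = lo + (hi - lo) / 2;
        // Back up from the midpoint to the start of the line we landed in.
        long lineStart = ScanBackToLineStart(stream, mid);
        string line = ReadLineAt(stream, lineStart);
        if (compare(line) < 0)
            lo = lineStart + Encoding.UTF8.GetByteCount(line) + 1; // skip past this line
        else
            hi = lineStart;
    }
    return lo;
}

static long ScanBackToLineStart(Stream stream, long pos)
{
    // Walk backwards byte by byte until we cross a '\n' (or hit offset 0).
    while (pos > 0)
    {
        stream.Position = pos - 1;
        if (stream.ReadByte() == '\n') break;
        pos--;
    }
    return pos;
}

static string ReadLineAt(Stream stream, long offset)
{
    stream.Position = offset;
    var sb = new StringBuilder();
    int b;
    while ((b = stream.ReadByte()) != -1 && b != '\n')
        sb.Append((char)b);
    return sb.ToString();
}
```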
You would need to be able to stream the file, but you would also need random access. I'm not sure how you accomplish this short of a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream of the object and use the Seek method to move around in the file, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are the same number of bytes. Otherwise, you would jump in and out of the middle of lines.
Something like
byte[] buffer = new byte[lineLength];
stream.Seek(lineLength * searchPosition, SeekOrigin.Begin);
stream.Read(buffer, 0, lineLength);
string line = Encoding.Default.GetString(buffer);
This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line feed in the file. That really depends upon how long the lines of text are on average: given 1000 bytes per line you'd be looking at around (10,000,000,000 / 1000 * 8) = 80 MB. Big, but possible.
So try this:
Scan the file and store the ordinal offset of each line-feed in a List
Binary search the List with a custom comparer that seeks to the file offset and reads the data.
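Step 2 could be sketched like this (a sketch; the comparer class and `readLineAt` delegate are hypothetical names, and `index` is the List of line-start offsets from step 1):

```csharp
using System;
using System.Collections.Generic;

// List<T>.BinarySearch probes elements of the list (line-start offsets);
// the comparer reads the line at each probed offset and compares it to
// the target key, ignoring the dummy search item.
sealed class LineAtOffsetComparer : IComparer<long>
{
    private readonly Func<long, string> _readLineAt;
    private readonly string _target;

    public LineAtOffsetComparer(Func<long, string> readLineAt, string target)
    {
        _readLineAt = readLineAt;
        _target = target;
    }

    public int Compare(long element, long ignored)
    {
        return string.CompareOrdinal(_readLineAt(element), _target);
    }
}

// Usage sketch:
// int i = index.BinarySearch(0, index.Count, 0L,
//                            new LineAtOffsetComparer(ReadLineAt, "2013-05-01"));
// i >= 0 is the index of the matching line; a negative result is the
// bitwise complement of the insertion point, as usual for BinarySearch.
```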
If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating an "index" file:
Scan the initial file and take the datetime parts plus their positions in the original file (this is why it has to be pretty static), and encode them somehow - for example: Unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) + line position (zero-filled 10 digits). That way you will have an index file with consistent-length "lines".
Perform binary search on that file (you may need to be a bit creative in order to achieve range search) and get the relevant location(s) in the original file.
Read directly from the original file starting at the given location / read the given range.
You've got range search with O(log(n)) run time :) (and you've created primitive DB functionality).
Needless to say, if the data file is updated "too" frequently, or you don't run "enough" queries against the index file, you may end up spending more time creating the index file than you save on querying.
Btw, working with this index file doesn't require the data file to be sorted. But since log files tend to be append-only, and sorted, you may speed the whole thing up by creating an index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file - that way you can perform the binary search directly on the data file (using the index file to determine the seek positions in the original file), and when lines are appended to the log file you can simply append their EOL positions to the index file.
The List<T> class has a BinarySearch method.
http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx