How to import and read large binary file data in C#?

I have a large binary file that contains different data types. I can access single records in the file, but I am not sure how to loop over the binary values and load them into a memory stream byte by byte.
I have been using BinaryReader:
BinaryReader binReader = new BinaryReader(File.Open(fileName, FileMode.Open));
Encoding ascii = Encoding.ASCII;
string authorName = binReader.ReadString();
Console.WriteLine(authorName);
Console.ReadLine();
But this won't work, since I have a large file with different data types.
Simply put, I need to read the file byte by byte and then interpret those bytes as a string or whatever else they represent.
I would appreciate any thoughts that can help.

This will very much depend on what format the file is in. Each byte in the file might represent different things, or it might just represent values from a large array, or some mix of the two.
You need to know what the format looks like to be able to read it, since binary files are not self-descriptive. Reading a simple object might look like
var authorName = binReader.ReadString();
var publishDate = DateTime.FromBinary(binReader.ReadInt64());
...
If you have a list of items it is common to use a length prefix. Something like
var numItems = binReader.ReadInt32();
for (int i = 0; i < numItems; i++)
{
    var title = binReader.ReadString();
    ...
}
You would then typically create one or more objects from the data that can be used in the rest of the application. I.e.
new Bibliography(authorName, publishDate, books);
If this is a format you do not control, I hope you have a detailed specification. Otherwise this is kind of a lost cause for anything but the kludgiest solutions.
If there is more data than can fit in memory you need some kind of streaming mechanism. I.e. read one item, do some processing of the item, save the result, read the next item, etc.
If you do control the format, I would suggest alternatives that are easier to manage. I have used protobuf-net, and I find it quite easy to use, but there are other alternatives. The common way to use these kinds of libraries is to create a class for the data and add attributes to the fields that should be stored. The library can manage serialization/deserialization automatically, and usually handles things like inheritance and changes to the format in an easy way.
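For illustration, a minimal sketch of the attribute-based approach with protobuf-net (the class shape and property names are invented for this example):

using System;
using System.Collections.Generic;
using ProtoBuf;

[ProtoContract]
public class Bibliography
{
    [ProtoMember(1)]
    public string AuthorName { get; set; }

    [ProtoMember(2)]
    public DateTime PublishDate { get; set; }

    [ProtoMember(3)]
    public List<string> Books { get; set; }
}

// Serialization and deserialization are then one-liners:
// Serializer.Serialize(stream, bibliography);
// var restored = Serializer.Deserialize<Bibliography>(stream);

Because the field numbers (1, 2, 3) identify the data on the wire, properties can be renamed or added later without breaking previously written files.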

Here's a simple bit of code that shows the most basic way of doing it.
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
namespace binary_read
{
    class Program
    {
        private static readonly int bufferSize = 1024;

        static async Task Main(string[] args)
        {
            var bytesRead = 0;
            var totalBytes = 0;
            using (var stream = File.OpenRead(args.First()))
            {
                do
                {
                    var buffer = new byte[bufferSize];
                    bytesRead = await stream.ReadAsync(buffer, 0, bufferSize);
                    totalBytes += bytesRead;
                    // Process the buffer here; only the first bytesRead bytes
                    // are valid, and that may be fewer than bufferSize.
                } while (bytesRead > 0);
                Console.WriteLine($"Processed {totalBytes} bytes.");
            }
        }
    }
}
The main bit to take note of is within the using block.
Firstly, when working with files/streams/sockets it's best to use using if possible, to deterministically clean up after yourself.
Then it's really just a matter of calling Read/ReadAsync on the stream if you're just after the raw data. However, there are various 'readers' that provide an abstraction to make working with certain formats easier.
So if you know that you're going to be reading ints, doubles and strings, then you can use the BinaryReader and its ReadInt16/ReadInt32/ReadInt64, ReadDouble and ReadString methods.
If you're reading into a struct, then you can read the properties in a loop as suggested by @JonasH above, sketched below. Or use the method in this answer.
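A minimal sketch of that loop, assuming a hypothetical record layout of [int id][double value][string name] repeated until the end of the file (ReadString assumes the string was written with BinaryWriter's length-prefixed Write(string)):

using System.Collections.Generic;
using System.IO;

using (var stream = File.OpenRead(fileName))
using (var reader = new BinaryReader(stream))
{
    var records = new List<(int Id, double Value, string Name)>();
    while (stream.Position < stream.Length)
    {
        // The order and types here must match the file format exactly.
        var id = reader.ReadInt32();
        var value = reader.ReadDouble();
        var name = reader.ReadString();
        records.Add((id, value, name));
    }
}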

Related

Reading alphanumeric data from text file

I am using the code below to read binary data from a text file and divide it into small chunks. I want to do the same with a text file containing alphanumeric data, which obviously does not work with the binary reader. Which reader would be best to achieve that (stream, string or text), and how do I implement it in the following code?
public static IEnumerable<IEnumerable<byte>> ReadByChunk(int chunkSize)
{
    IEnumerable<byte> result;
    int startingByte = 0;
    do
    {
        result = ReadBytes(startingByte, chunkSize);
        startingByte += chunkSize;
        yield return result;
    } while (result.Any());
}

public static IEnumerable<byte> ReadBytes(int startingByte, int byteToRead)
{
    byte[] result;
    using (FileStream stream = File.Open(@"C:\Users\file.txt", FileMode.Open, FileAccess.Read, FileShare.Read))
    using (BinaryReader reader = new BinaryReader(stream))
    {
        int bytesToRead = Math.Max(Math.Min(byteToRead, (int)reader.BaseStream.Length - startingByte), 0);
        reader.BaseStream.Seek(startingByte, SeekOrigin.Begin);
        result = reader.ReadBytes(bytesToRead);
    }
    return result;
}
I can only help you get the general process figured out:
String/text is the second-worst data format to read, write or process. It should be reserved exclusively for output towards and input from the user. It has some serious issues as a storage and retrieval format.
If you have to transmit, store or retrieve something as text, make sure you use a fixed encoding and a fixed culture format (usually invariant) at all endpoints; a sketch follows the quote below. You do not want to run into issues with those two.
The worst data format is raw binary. But there is a special 0th place for raw binary that you have to interpret into text, to then process further. To quote the most important parts of what I linked on encodings:
It does not make sense to have a string without knowing what encoding it uses. [...]
If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.
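A minimal sketch of the 'fixed encoding, invariant culture' advice, assuming a hypothetical UTF-8 text file containing one number per line (the path is made up):

using System;
using System.Globalization;
using System.IO;
using System.Text;

using (var reader = new StreamReader(@"C:\Users\file.txt", Encoding.UTF8))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // An explicit culture makes "1.5" parse the same way on every
        // machine, regardless of the local decimal separator.
        if (double.TryParse(line, NumberStyles.Float, CultureInfo.InvariantCulture, out var value))
        {
            Console.WriteLine(value);
        }
    }
}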

How to access individual items in serialized array?

I want to store an array of timestamps in a binary flat file. One of my requirements is that I can access individual timestamps later on, for efficient query purposes, without having to read and deserialize the entire array first. (I use a binary search algorithm that finds the file positions of a start and end timestamp, which in turn determine which bytes between them are read and deserialized, because the entire binary file can be multiple gigabytes in size.)
Obviously, the simple but slow way is to use BitConverter.GetBytes(timestamp) to convert each timestamp to bytes and to then store them in the file. I can then access each item individually in the file and use my custom binary search algorithm to find the timestamp that matches with the desired timestamp.
However, I found that BinaryFormatter is incredibly efficient (multiple times faster than protobuf-net and any other serializer I tried) regarding serialization/deserialization of value-type arrays. Hence I attempted to serialize an array of timestamps into binary form. However, that apparently prevents me from accessing individual timestamps in the file without first deserializing the entire array.
Is there a way to still access individual items in binary form after having serialized an entire array of items via BinaryFormatter?
Here is some code snippet that demonstrates what I mean:
var sampleArray = new int[5] { 1,2,3,4,5};
var serializedSingleValueArray = sampleArray.SelectMany(x => BitConverter.GetBytes(x)).ToArray();
var serializedArrayofSingleValues = Serializers.BinarySerializeToArray(sampleArray);
var deserializesToCorrectValue = BitConverter.ToInt32(serializedSingleValueArray, 0); //value = 1 (ok)
var wrongDeserialization = BitConverter.ToInt32(serializedArrayofSingleValues, 0); //value = 256 (???)
Here is the serialization function (Formatter is presumably a BinaryFormatter instance defined elsewhere in the class):
public static byte[] BinarySerializeToArray(object toSerialize)
{
    using (var stream = new MemoryStream())
    {
        Formatter.Serialize(stream, toSerialize);
        return stream.ToArray();
    }
}
Edit: I do not need to concern myself with memory consumption or file sizes, as those are currently far from being the bottleneck. It is the speed of serialization and deserialization that is the bottleneck for me, with multi-gigabyte binary files and hence very large arrays of primitives.
If your problem is just "how to convert an array of structs to byte[]", you have other options than BitConverter. BitConverter is for single values; the Buffer class is for arrays.
double[] d = new double[100];
d[4] = 1235;
d[8] = 5678;
byte[] b = new byte[800];
Buffer.BlockCopy(d, 0, b, 0, d.Length*sizeof(double));
// just to test it works
double[] d1 = new double[100];
Buffer.BlockCopy(b, 0, d1, 0, d.Length * sizeof(double));
This does a byte-level copy without converting anything and without iterating over items.
You can write this byte array directly to your stream (not a StreamWriter, not a Formatter):
stream.Write(b, 0, 800);
That's definitely the fastest way to write to a file, although it involves a complete copy; any other conceivable method will also read an item and store it somewhere before it goes to the file.
If this is the only thing you write to your file, you don't need to store the array length in the file; you can use the file length for this.
To read the double at index 100 in the file:
file.Seek(100 * sizeof(double), SeekOrigin.Begin);
byte[] tmp = new byte[8];
file.Read(tmp, 0, 8);
double value = BitConverter.ToDouble(tmp, 0);
Here, for single value, you can use BitConverter.
This is the solution for .NET Framework, C# <= 7.0.
For .NET Standard/.NET Core, C# 8.0, you have more options with Span<T>, which gives you access to the underlying memory without copying the data, as sketched below.
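A minimal sketch of the Span<T> route, assuming .NET Core 2.1+ (MemoryMarshal.AsBytes reinterprets the array's memory as bytes, and Stream has a Span-based Write overload; the file name is made up):

using System;
using System.IO;
using System.Runtime.InteropServices;

double[] d = new double[100];
d[4] = 1235;
d[8] = 5678;

// View the same memory as bytes; no allocation, no copy.
ReadOnlySpan<byte> bytes = MemoryMarshal.AsBytes(d.AsSpan());

using (var stream = File.OpenWrite("data.bin"))
{
    stream.Write(bytes);
}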
BitConverter is not a "slow" version; it's just a way to convert everything to a byte[] sequence. This is actually not costly, it's just interpreting the memory differently.
Compute the position in the file, load 8 bytes, convert them to a DateTime, and you are done.
You should do this only with simply structured files, and with simply structured files you don't need a binary formatter. Just load/save your one array to one file. This way you can be sure your file positions can be computed.
So in other words: save your array yourself, date by date, and then you can also load it date by date (see the sketch below).
Writing with one processing style and reading with another is always a bad idea.
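A minimal sketch of that idea for the timestamps in the question, assuming each DateTime is stored as its 8-byte Ticks value (file name and sample data are invented):

using System;
using System.IO;

DateTime[] timestamps = { DateTime.UtcNow, DateTime.UtcNow.AddSeconds(1) };

// Writing: one fixed-size 8-byte record per timestamp.
using (var writer = new BinaryWriter(File.Create("timestamps.bin")))
{
    foreach (DateTime ts in timestamps)
        writer.Write(ts.Ticks);
}

// Reading the timestamp at index i, without touching the rest of the file:
int i = 1;
using (var reader = new BinaryReader(File.OpenRead("timestamps.bin")))
{
    reader.BaseStream.Seek(i * sizeof(long), SeekOrigin.Begin);
    DateTime value = new DateTime(reader.ReadInt64());
}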

Why does the following trivial protobuf-net code not work correctly?

I am trying to figure out how to work with protobuf-net. The example below is trivial on purpose; it does not reflect a real-life application.
Please, observe:
var ms = new MemoryStream();
Serializer.Serialize(ms, 1234);
Serializer.Serialize(ms, 5678);
ms.Position = 0;
var n1 = Serializer.Deserialize<int>(ms);
var n2 = Serializer.Deserialize<int>(ms);
Debug.WriteLine(n1);
Debug.WriteLine(n2);
It outputs:
5678
0
What is wrong?
I am using protobuf-net 2.4.0
As said in the comments, protobuf doesn't have a concept of 'packets' the way you'd expect. A stream is always fully read on deserialization, and previously read values are overridden when they are seen again in the stream. Therefore, writing multiple messages to a single stream won't work the way you do it. Unfortunately, there is no easy fix: you have to implement packet-splitting yourself.
One way to do it is to prefix each message with its length and split the input accordingly (see the sketch below). This works well for bigger messages, but your message consists of only a single int, which would scale very badly. I'd really suggest not using protobuf at all in this scenario; instead you could take a look at their int-serialization routine (some dynamic-length encoding, if I remember correctly) and serialize and deserialize single ints yourself.
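For the length-prefix route specifically, protobuf-net ships helpers so you don't have to hand-roll the framing yourself; a minimal sketch (the field number 1 is an arbitrary choice):

using System.IO;
using ProtoBuf;

var ms = new MemoryStream();
Serializer.SerializeWithLengthPrefix(ms, 1234, PrefixStyle.Base128, 1);
Serializer.SerializeWithLengthPrefix(ms, 5678, PrefixStyle.Base128, 1);
ms.Position = 0;

var n1 = Serializer.DeserializeWithLengthPrefix<int>(ms, PrefixStyle.Base128, 1); // 1234
var n2 = Serializer.DeserializeWithLengthPrefix<int>(ms, PrefixStyle.Base128, 1); // 5678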
You can use BitConverter to convert each integer to a byte array and then write the arrays to a MemoryStream.
The following link can help:
https://www.c-sharpcorner.com/uploadfile/mahesh/convert-integer-to-byte-array-in-C-Sharp/
Then, once you have the array of byte arrays, you can do this:
var memoryStream = new MemoryStream();
for (int temp = 0; temp < byteArray.Length; temp++)
    memoryStream.Write(byteArray[temp], 0, byteArray[temp].Length);
Finally, hand the MemoryStream to protobuf.

Data from byte array

I'm trying to read the bytes in the stream at each frame.
I want to be able to read the position and the timestamp information that is stored in a file I have created.
The stream is a stream of recorded skeleton data, in an encoded binary format.
Stream recordStream;
byte[] results;
using (FileStream SourceStream = File.Open(@".....\Stream01.recorded", FileMode.Open))
{
    if (SourceStream.CanRead)
    {
        results = new byte[recordStream.Length];
        SourceStream.Read(results, 0, (int)recordStream.Length);
    }
}
The file should be read, and the Read method should read the current sequence of bytes before advancing the position in the stream.
Is there a way to pull out the data (position and timestamp) I want from the bytes read and save it in separate variables before it advances?
Could using a BinaryReader give me the ability to do this?
BinaryReader br1 = new BinaryReader(recordStream);
I have saved the file as .recorded. I have also saved it as .txt to see what is contained in the file, but since it is encoded, it is not understandable.
Update:
I tried running the code with breakpoints to see if it enters the function with my BinaryReader, and it crashes with "ArgumentException was unhandled: Stream was not readable" on the BinaryReader declaration and initialization:
BinaryReader br1 = new BinaryReader(recordStream);
The file type was .recorded.
You did not provide any information about the format of the data you are trying to read.
However, using the BinaryReader is exactly what you need to do.
It exposes methods to read data from the stream and convert them to various types.
Consider the following example:
var filename = "pathtoyourfile";
using (var stream = File.Open(filename, FileMode.Open))
using (var reader = new BinaryReader(stream))
{
    var x = reader.ReadByte();
    var y = reader.ReadInt16();
    var z = reader.ReadBytes(10);
}
It really depends on the format of your data though.
Update
Even though I feel I've already provided all the information you need, let's use your data.
You say each record in your data starts with
[long: timestamp][int: framenumber]
using (var stream = File.Open(filename, FileMode.Open))
using (var reader = new BinaryReader(stream))
{
    var timestamp = reader.ReadInt64();
    var frameNumber = reader.ReadInt32();
    // At this point you have the timestamp and the frame number.
    // You can now do whatever you want with them and decide whether or not
    // to continue; after that, you just continue reading.
}
How you continue reading depends on the format of the remaining part of the records.
If all fields in a record have a specific length, then you either (depending on the choice you made knowing the values of the timestamp and the frame number) continue reading all the fields for that record, OR you simply advance to a position in the stream that contains the next record. For example, if each record is 100 bytes long and you want to skip this record after you got the first two fields:
stream.Seek(88, SeekOrigin.Current);
// 88 here because the first two fields take 12 bytes -> (100 - (8 + 4))
If the records have a variable length, the solution is similar, but you'll have to take into account the length of the various fields (which should be defined by length fields preceding the variable-length fields), as in the sketch below.
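A minimal sketch of skipping one variable-length field, assuming a hypothetical layout where a 4-byte length immediately precedes the payload (reader and stream are the ones from the example above):

var payloadLength = reader.ReadInt32();         // the length field written before the payload
stream.Seek(payloadLength, SeekOrigin.Current); // skip the payload without reading it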
As for knowing whether the first 8 bytes really do represent a timestamp, there's no real way of knowing for sure... remember, in the end the stream just contains a series of individual bytes that have no meaning whatsoever except for the meaning given to them by your file format. Either you have to revise the file format, or you could try checking whether the value of 'timestamp' in the example above even makes sense.
Is this a file format you have defined yourself? If so, perhaps you are making it too complicated and might want to look at solutions such as Google Protocol Buffers or Apache Thrift.
If this is still not what you are looking for, you will have to redefine your question.
Based on your comments:
You need to know the exact definition of the entire file. You create a struct based on this file format:
[StructLayout(LayoutKind.Explicit)] // required for FieldOffset to take effect
struct YourFileFormat
{
    [FieldOffset(0)]
    public long Timestamp;

    [FieldOffset(8)]
    public int FrameNumber;

    [FieldOffset(12)]
    //.. etc..
}
Then, using a BinaryReader, you can either read each field individually for each frame:
// assume br is an instantiated BinaryReader..
YourFileFormat file = new YourFileFormat();
file.Timestamp = br.ReadInt64();
file.FrameNumber = br.ReadInt32();
// etc..
Or, you can read the entire file in and have the marshalling classes copy everything into the struct for you:
byte[] fileContent = br.ReadBytes(Marshal.SizeOf(typeof(YourFileFormat))); // sizeof(YourFileFormat) would require an unsafe context
GCHandle gcHandle = GCHandle.Alloc(fileContent, GCHandleType.Pinned); // or pin it via the "fixed" keyword in an unsafe context
file = (YourFileFormat)Marshal.PtrToStructure(gcHandle.AddrOfPinnedObject(), typeof(YourFileFormat));
gcHandle.Free();
However, this assumes you'll know the exact size of the file. With this method, though, each frame (assuming you know how many there are) can be a fixed-size array within this struct for that to work.
Bottom line: unless you know the size of what you want to skip, you can't hope to get the data you require from the file.
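As an aside, on .NET Core 2.1+ there is a copy-free alternative to the pinning dance above; a minimal sketch, assuming the same YourFileFormat struct and BinaryReader br:

using System.Runtime.InteropServices;

byte[] fileContent = br.ReadBytes(Marshal.SizeOf<YourFileFormat>());
// MemoryMarshal.Read reinterprets the bytes as the struct directly.
YourFileFormat file = MemoryMarshal.Read<YourFileFormat>(fileContent);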

Most efficient way to compare a memorystream to a file C# .NET

I have a MemoryStream containing the bytes of a PNG-encoded image, and want to check if there is an exact duplicate of that image data in a directory on disk. The first obvious step is to only look for files that match the exact length, but after this I'd like to know what's the most efficient way to compare the memory against the files. I'm not very experienced working with streams.
I had a couple thoughts on the matter:
First, if I could get a hash code for the file, it would (presumably) be more efficient to compare hash codes rather than every byte of the image. Similarly, I could compare just some of the bytes of the image, giving a "close-enough" answer.
And then of course I could just compare the entire stream, but I don't know how quick that would be.
What's the best way to compare a MemoryStream to a file? Byte-by-byte in a for-loop?
Another solution:
private static bool CompareMemoryStreams(MemoryStream ms1, MemoryStream ms2)
{
    if (ms1.Length != ms2.Length)
        return false;

    ms1.Position = 0;
    ms2.Position = 0;

    var msArray1 = ms1.ToArray();
    var msArray2 = ms2.ToArray();

    return msArray1.SequenceEqual(msArray2);
}
Firstly, getting a hashcode of the two streams won't help - to calculate hashcodes, you'd need to read the entire contents and perform some simple calculation while reading. If you compare the files byte-by-byte or using buffers, then you can stop earlier (after you find the first two bytes/blocks that don't match).
However, this approach would make sense if you needed to compare the MemoryStream against multiple files, because then you'd need to loop through the MemoryStream just once (to calculate the hashcode) and then loop through all the files.
In any case, you'll have to write code to read the entire file. As you mentioned, this can be done either byte-by-byte or using buffers. Reading data into a buffer is a good idea, because it may be a more efficient operation when reading from an HDD (e.g. reading a 1kB buffer). Moreover, you could use the asynchronous BeginRead method if you need to process multiple files in parallel.
Summary:
If you need to compare multiple files, use hashcodes (see the sketch after this list)
To read/compare the content of a single file:
Read 1kB of data into a buffer from both streams
See if there is a difference (if yes, quit)
Continue looping
Implement the above steps asynchronously using BeginRead if you need to process multiple files in parallel.
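A minimal sketch of the multiple-file case, assuming memoryStream is the in-memory PNG from the question and a hypothetical directory path; the cheap length check runs first, and the MemoryStream is hashed only once:

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static byte[] HashOf(Stream s)
{
    using (var md5 = MD5.Create())
        return md5.ComputeHash(s);
}

memoryStream.Position = 0; // hash from the start of the PNG data
var targetHash = HashOf(memoryStream);
var targetLength = memoryStream.Length;

var duplicates = Directory.EnumerateFiles(@"C:\images")
    .Where(path => new FileInfo(path).Length == targetLength) // cheap length filter first
    .Where(path =>
    {
        using (var fs = File.OpenRead(path))
            return HashOf(fs).SequenceEqual(targetHash);
    })
    .ToList();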
Firstly, getting a hashcode of the two streams won't help - to calculate hashcodes, you'd need to read the entire contents and perform some simple calculation while reading.
I'm not sure if I misunderstood it or if this simply isn't true. Here's an example of hash calculation using streams:
private static byte[] ComputeHash(Stream data)
{
    using HashAlgorithm algorithm = MD5.Create();
    byte[] bytes = algorithm.ComputeHash(data);
    data.Seek(0, SeekOrigin.Begin); // I'll use this trick so the caller won't end up with the stream in an unexpected position
    return bytes;
}
I've measured this code with BenchmarkDotNet and it allocated only 384 bytes on a 900 MB file. Needless to say how inefficient loading the whole file into memory would be in this case.
However, this part is true:
It's important to be aware of the (unlikely) possibility of hash collisions. Byte comparison would be necessary to avoid this issue.
So in case the hashes don't match, you have to perform additional checks in order to be sure that the files are 100% different. In such a case, the following is a great approach.
As you mentioned, this can be done either byte-by-byte or using buffers. Reading data into a buffer is a good idea, because it may be a more efficient operation when reading from an HDD (e.g. reading a 1kB buffer).
Recently I had to perform such checks, so I'll post the results of this exercise as two utility methods:
private bool AreStreamsEqual(Stream stream, Stream other)
{
    const int bufferSize = 2048;
    if (other.Length != stream.Length)
    {
        return false;
    }

    byte[] buffer = new byte[bufferSize];
    byte[] otherBuffer = new byte[bufferSize];
    int bytesRead;
    while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Compare only the bytes actually read; the tail of the last buffer
        // may hold stale data from the previous iteration. This assumes Read
        // fills the buffer except at the end of the stream, which holds for
        // FileStream and MemoryStream in practice.
        var otherBytesRead = other.Read(otherBuffer, 0, bytesRead);
        if (otherBytesRead != bytesRead || !otherBuffer.Take(bytesRead).SequenceEqual(buffer.Take(bytesRead)))
        {
            stream.Seek(0, SeekOrigin.Begin);
            other.Seek(0, SeekOrigin.Begin);
            return false;
        }
    }

    stream.Seek(0, SeekOrigin.Begin);
    other.Seek(0, SeekOrigin.Begin);
    return true;
}
private bool IsStreamEqualToByteArray(byte[] contents, Stream stream)
{
    const int bufferSize = 2048;
    var chunkIndex = 0;
    if (contents.Length != stream.Length)
    {
        return false;
    }

    byte[] buffer = new byte[bufferSize];
    int bytesRead;
    while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        var contentsBuffer = contents
            .Skip(chunkIndex * bufferSize)
            .Take(bytesRead)
            .ToArray();
        if (!contentsBuffer.SequenceEqual(buffer.Take(bytesRead)))
        {
            stream.Seek(0, SeekOrigin.Begin);
            return false;
        }
        chunkIndex++; // move on to the next chunk of the byte array
    }

    stream.Seek(0, SeekOrigin.Begin);
    return true;
}
We've open sourced a library to deal with this at NeoSmart Technologies, because we've had to compare opaque Stream objects for bytewise equality one time too many. It's available on NuGet as StreamCompare and you can read about its advantages over existing approaches in the official release announcement.
Usage is very straightforward:
var stream1 = ...;
var stream2 = ...;
var scompare = new StreamCompare();
var areEqual = await scompare.AreEqualAsync(stream1, stream2);
It's written to abstract away as many of the gotchas and performance pitfalls as possible, and contains a number of optimizations to speed up comparisons (and to minimize memory usage). There's also a file comparison wrapper, FileCompare, included in the package that can be used to compare two files by path.
StreamCompare is released under the MIT license and runs on .NET Standard 1.3 and above. NuGet packages for .NET Standard 1.3, .NET Standard 2.0, .NET Core 2.2, and .NET Core 3.0 are available. Full documentation is in the README file.
Using Stream alone we don't get the result; each file has a unique identity, such as its last modified date and so on. So each file is different, and this information is included in the stream.
