Which is faster: reading binary data or plain text data? - c#

I have some data whose exact structure I know. It has to be written to a file second by second.
The structs contain double fields with different names, and the same number of structs has to be written to the file every second.
The thing is:
Which is a better approach when it comes to reading the data back?
1- Convert the structs to bytes, then write them while indexing the byte that marks the end of each second
2- Write CSV data and index the byte that marks the end of each second
The data is requested from the file on a random basis.
So in both cases I will set the position of the FileStream to the byte of the second.
In the first case I will use the following for each struct in that second to get the whole data:
_filestream.Read(buffer, 0, buffer.Length);
GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
oReturn = Marshal.PtrToStructure(handle.AddrOfPinnedObject(), _oType);
handle.Free(); // release the pinned handle once the struct has been copied out
The previous approach is applied X times because there are around 100 structs every second.
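For what it's worth, if every second really does hold the same number of fixed-size structs, the byte offset of a second can be computed directly rather than indexed. A minimal sketch of that idea, assuming a hypothetical DataRecord struct and the 100-records-per-second figure from above:

using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical fixed-layout struct; the real structs have different double fields.
[StructLayout(LayoutKind.Sequential)]
struct DataRecord { public double A, B, C; }

static class SecondReader
{
    const int RecordsPerSecond = 100; // figure taken from the question

    // Reads every record belonging to one second with a single Read call.
    public static DataRecord[] ReadSecond(FileStream fs, long second)
    {
        int recordSize = Marshal.SizeOf(typeof(DataRecord));
        byte[] buffer = new byte[recordSize * RecordsPerSecond];

        fs.Seek(second * buffer.Length, SeekOrigin.Begin);
        fs.Read(buffer, 0, buffer.Length);

        var result = new DataRecord[RecordsPerSecond];
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            for (int i = 0; i < RecordsPerSecond; i++)
            {
                IntPtr p = IntPtr.Add(handle.AddrOfPinnedObject(), i * recordSize);
                result[i] = (DataRecord)Marshal.PtrToStructure(p, typeof(DataRecord));
            }
        }
        finally
        {
            handle.Free(); // always release the pinned handle
        }
        return result;
    }
}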
In the second case I will use string.Split(',') and then fill in the data accordingly, since I know the exact order of my data:
file.Read(buffer, 0, buffer.Length);
string val = System.Text.Encoding.ASCII.GetString(buffer);
string[] row = val.Split(',');
Edit
Using the profiler does not show a difference, but I cannot simulate the exact real-life scenario because the file size might get really huge. I am looking for theoretical information for now.

Related

How to access individual items in serialized array?

I want to store an array of timestamps in a binary flat file. One of my requirements is that I can access individual timestamps later on for efficient queries without having to read and deserialize the entire array first (I use a binary search algorithm that finds the file positions of a start timestamp and an end timestamp, which in turn determine which bytes are read and deserialized between those two timestamps, because the entire binary file can be multiple gigabytes in size).
Obviously, the simple but slow way is to use BitConverter.GetBytes(timestamp) to convert each timestamp to bytes and to then store them in the file. I can then access each item individually in the file and use my custom binary search algorithm to find the timestamp that matches with the desired timestamp.
However, I found that BinaryFormatter is incredibly efficient (multiple times faster than protobuf-net and any other serializer I tried) regarding serialization/deserialization of value-type arrays. Hence I attempted to serialize an array of timestamps into binary form. However, apparently that will prevent me from accessing individual timestamps in the file without having to deserialize the entire array first.
Is there a way to still access individual items in binary form after having serialized an entire array of items via BinaryFormatter?
Here is some code snippet that demonstrates what I mean:
var sampleArray = new int[5] { 1,2,3,4,5};
var serializedSingleValueArray = sampleArray.SelectMany(x => BitConverter.GetBytes(x)).ToArray();
var serializedArrayofSingleValues = Serializers.BinarySerializeToArray(sampleArray);
var deserializesToCorrectValue = BitConverter.ToInt32(serializedSingleValueArray, 0); //value = 1 (ok)
var wrongDeserialization = BitConverter.ToInt32(serializedArrayofSingleValues, 0); //value = 256 (???)
Here the serialization function:
public static byte[] BinarySerializeToArray(object toSerialize)
{
    using (var stream = new MemoryStream())
    {
        Formatter.Serialize(stream, toSerialize);
        return stream.ToArray();
    }
}
Edit: I do not need to concern myself with efficient memory consumption or file sizes, as those are currently far from being the bottleneck. It is the speed of serialization and deserialization that is the bottleneck for me with multi-gigabyte binary files, and hence very large arrays of primitives.
If your problem is just "how to convert an array of structs to byte[]", you have other options than BitConverter. BitConverter is for single values; the Buffer class is for arrays.
double[] d = new double[100];
d[4] = 1235;
d[8] = 5678;
byte[] b = new byte[800];
Buffer.BlockCopy(d, 0, b, 0, d.Length*sizeof(double));
// just to test it works
double[] d1 = new double[100];
Buffer.BlockCopy(b, 0, d1, 0, d.Length * sizeof(double));
This does a byte-level copy without converting anything and without iterating over items.
You can write this byte array directly to your stream (not a StreamWriter, not a Formatter):
stream.Write(b, 0, 800);
That's definitely the fastest way to write to a file. It does involve a complete copy, but probably any other conceivable method will also read each item and store it somewhere before it goes to the file.
If this is the only thing you write to your file, you don't need to store the array length in the file; you can use the file length for that.
To read the double at index 100 in the file:
file.Seek(100 * sizeof(double), SeekOrigin.Begin);
byte[] tmp = new byte[8];
file.Read(tmp, 0, 8);
double value = BitConverter.ToDouble(tmp, 0);
Here, for a single value, you can use BitConverter.
This is the solution for .NET Framework, C# <= 7.0.
For .NET Standard/.NET Core, C# 8.0 you have more options with Span<T>, which gives you access to the internal memory without copying the data.
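As an illustration only (not part of the original answer), a minimal sketch of the Span<T> route using MemoryMarshal, assuming .NET Core or a runtime with System.Memory available:

using System;
using System.Runtime.InteropServices;

class SpanExample
{
    static void Main()
    {
        double[] d = new double[100];
        d[4] = 1235;

        // Reinterpret the double[] as bytes without copying anything.
        Span<byte> asBytes = MemoryMarshal.AsBytes(d.AsSpan());

        // ...and back again: view the same bytes as doubles, still no copy.
        Span<double> asDoubles = MemoryMarshal.Cast<byte, double>(asBytes);
        Console.WriteLine(asDoubles[4]); // prints 1235
    }
}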
BitConverter is not a "slow" option; it's just a way to convert a single value to a byte[] sequence. This is not actually costly; it just interprets the memory differently.
Compute the position in the file, load 8 bytes, convert them to a DateTime, and you are done.
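To make that concrete, here is a hedged sketch of the lookup, assuming each timestamp was written as its raw Ticks value (an Int64); if DateTime.ToBinary was used when writing, read back with DateTime.FromBinary instead:

using System;
using System.IO;

static class TimestampFile
{
    // Reads the timestamp stored at a given index in a flat file of Int64 tick values.
    public static DateTime ReadAt(FileStream fs, long index)
    {
        byte[] tmp = new byte[sizeof(long)];
        fs.Seek(index * sizeof(long), SeekOrigin.Begin);
        fs.Read(tmp, 0, tmp.Length);

        long ticks = BitConverter.ToInt64(tmp, 0);
        return new DateTime(ticks); // assumes the writer stored DateTime.Ticks
    }
}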
You should only do this with simply structured files, and with simply structured files you don't need a binary formatter. Just load/save your one array to one file. This way you can be sure your file positions can be computed.
In other words: save your array yourself, date by date, and then you can also load it date by date.
Writing with one processing style and reading with another is always a bad idea.

Split "Fixed width" files based on the value of a byte range

I have multiple files coming from Mainframe systems, basically EBCDIC data. Some of these files have data from multiple modules appended in one single file. For example, let's say I have a file CISA, which has data from multiple sub-modules. All these modules have a row length of 1000 bytes but different data structures. So to read these files I need to use a different layout for each, and to do that I need to split the parent file into multiple files based on a key value specified at a given location, let's say byte range 20-23.
For the first row, the value of byte range 20-23 may be 0001, and for the next row 0002, so I need to split this file into multiple files based on the value of that byte range.
In my current implementation in C#, what I have done is read the data using a byte stream, one row at a time. I've used a DataTable with two columns: the first column stores the file name, generated from the byte range (20-23) value, and the second column stores the bytes I just read.
I keep doing this so that once the entire file is read, I have a DataTable that gives me a list of file names and the byte data for each file. I then loop through the DataTable and write each row out based on the file name stored in the first column.
This solution works all right, but the performance is really slow because of the high I/O involved in writing out the DataTable. So is there an option with which I can skip writing the data row by row and instead save each entire partition in one shot?
Firstly, I'd completely forget about DataTable here - that seems a terrible idea. How big are the files? If they're small: just load all the data (File.ReadAllBytes) and use an ArraySegment<byte> for each (maybe a List<ArraySegment<byte>>) - or if you're OK using preview bits: this would be a great use of Span<byte> (similar to ArraySegment<byte>, but more ... just more).
If the file is large, I'd look at MemoryMappedFile here; seems a great fit.
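Not the answerer's code, but one way to drop the DataTable entirely: stream each 1000-byte row straight to an output FileStream kept open per key. A sketch under the question's assumptions (row length 1000, key at bytes 20-23; the naming is illustrative):

using System;
using System.Collections.Generic;
using System.IO;

class FixedWidthSplitter
{
    const int RowLength = 1000; // from the question
    const int KeyOffset = 20;   // key lives at bytes 20-23
    const int KeyLength = 4;

    static void Split(string inputPath)
    {
        var writers = new Dictionary<string, FileStream>();
        byte[] row = new byte[RowLength];

        using (FileStream input = File.OpenRead(inputPath))
        {
            while (input.Read(row, 0, RowLength) == RowLength)
            {
                // Derive the output name from the raw key bytes (hex here, since the data is EBCDIC).
                string key = BitConverter.ToString(row, KeyOffset, KeyLength);

                FileStream output;
                if (!writers.TryGetValue(key, out output))
                {
                    output = File.Create(inputPath + "." + key + ".part");
                    writers.Add(key, output);
                }
                output.Write(row, 0, RowLength); // FileStream buffers these writes
            }
        }

        foreach (FileStream w in writers.Values)
            w.Dispose();
    }
}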

Processing 1000's of parameters a second in C# by quickly converting data types from byte to other types

I have asked this question over the last 2 years and am still looking for a good way of doing this. What I am doing is as follows:
I have a WPF/C# application which has been developed over the last 3 years. It takes a real-time stream of bytes over a UDP port. Each record set is 1000 bytes. I am getting 100 of these byte records per second. I am reading the data and processing it for display in various formats. These logical records are sub-commutated.
The first 300 bytes are the same in each logical record and contain a mixture of Byte, Int16, UInt16, Int32 and UInt32 values. About 70% of these values are eventually multiplied by a least significant bit (LSB) to create a Double. These parameters are always the same.
The second 300 bytes are another mixture of Byte, Int16, UInt16, Int32 and UInt32 values. Again, about 70% of these values are multiplied by an LSB to create a Double. These parameters are again always the same.
The last segment is 400 bytes and sub-commutated. This means that the last part of the record contains 1 of 20 different logical record formats. I call them Type01...Type20 data. There is an identifier byte which tells me which one it is. These again contain Byte, Int, UInt data values which need to be converted.
I am currently using hundreds of function calls to process this data. Each function call takes the 1000-byte array as a parameter and an offset (index) into the byte array where the parameter starts. It then uses the BitConverter.ToXXX call to convert the bytes to the correct data type, and then, if necessary, multiplies by an LSB to create the final data value and returns it.
I am trying to streamline this processing because the data stream are changing based on the source. For instance one of the new data sources (feeds) changes about 20 parameters in the first 300 bytes, about 24 parameters in the second 300 bytes and several in the last sub-commutated 400 bytes records.
I would like to build a data dictionary where the dictionary contains the logical record number (type of data), offset into the record, LSB of data, type of data to be converted to (Int16, UInt32, etc) and finally output type (Int32, Double, etc). Maybe also include the BitConverter function to use and "cast it dynamically"?
This appears to be an exercise in using template (generic) classes and possibly delegates, but I do not know how to do this. I would appreciate some code as an example.
The data is also recorded, so playback may run at 2x, 4x, 8x, 16x speeds. Now, before someone comments on how you can look at thousands of parameters at those speeds, it is not as hard as one may think. Some types of data, such as a green background for good and red for bad, or plotting map positions (LAT/LON) over time, lend themselves very well to fast playback to find interesting events. So it is possible.
Thanks in advance for any help.
I am not sure others have an idea of what I am trying to do so I thought I would post a small segment of source code to see if anyone can improve on it.
Like I said above, the data comes in byte streams. Once it is read into a Byte array it looks like the following:
Byte[] InputBuffer = { 0x01, 0x00, 0x4F, 0xEB, 0x06, 0x00, 0x17, 0x00,
0x00, 0x00, ... };
The first 2 bytes are a ushort which equals 1. This is the record type for this particular record. This number can range from 1 to 20.
The next 4 bytes are a uint which equals 453,455. This value is the number of tenths of a second; in this case it corresponds to 12:35:45.5. To arrive at this I make the following call to this subroutine:
labelTimeDisplay.Content = TimeField(InputBuffer, 2, .1).ToString();
public Double TimeField(Byte[] InputBuffer, Int32 Offset, Double lsb)
{
return BitConverter.ToUInt32(InputBuffer, Offset) * lsb;
}
The next data field is the software version, in this case 23
labelSoftwareVersion.Content = SoftwareVersion(InputBuffer, 6).ToString();
public UInt16 SoftwareVersion(Byte[] InputBuffer, Int32 Offset)
{
return BitConverter.ToUInt16(InputBuffer, Offset);
}
The next data field is the System Status Word, another UInt16.
Built-In-Test status bits are passed to other routines if any of the 16 bits are set to logic 1.
UInt16 CheckStatus = SystemStatus(InputBuffer, 8);
public UInt16 SystemStatus(Byte[] InputBuffer, Int32 Offset)
{
return BitConverter.ToUInt16(InputBuffer, Offset);
}
I literally have over a thousand individual subroutines to process the data stored in the array of bytes. The array of bytes is always a fixed length of 1000 bytes. The first 6 bytes are always the same: identifier and time. After that the parameters are different for every frame.
I have some major modifications coming to the software which will redefine many of the parameters for the next software version. I still have to support the old software versions, so the software just gets more complicated. My goal is to find a way to process the data using a dictionary lookup. That way I can just create the dictionary and read the dictionary to know how to process the data. Maybe use loops to load the data into a collection and then bind it to the display fields.
Something like this:
public class ParameterDefinition
{
    public String ParameterNumber;
    public String ParameterName;
    public Int32 Offset;
    public Double Lsb;
    public Type ReturnDataType;
    public Type BaseDataType;

    // Constructor added so the array initializer below compiles
    public ParameterDefinition(String number, String name, Int32 offset, Double lsb, Type returnType, Type baseType)
    { ParameterNumber = number; ParameterName = name; Offset = offset; Lsb = lsb; ReturnDataType = returnType; BaseDataType = baseType; }
}
private ParameterDefinition[] parms = new ParameterDefinition[]
{
    new ParameterDefinition("0000", "RecordID", 0, 0.0, typeof(UInt16), typeof(UInt16)),
    new ParameterDefinition("0001", "Time",     2, 0.1, typeof(Double), typeof(UInt32)),
    new ParameterDefinition("0002", "SW ID",    6, 0.0, typeof(UInt16), typeof(UInt16)),
    new ParameterDefinition("0003", "Status",   8, 0.0, typeof(UInt16), typeof(UInt16)),
    // Lots more parameters
};
My bottom-line problem is getting the parameter definitions to cast or select the right functions. I cannot find a way to link the "dictionary" to the actual data outputs.
Thanks for any help
Using a data dictionary to represent the data structure is fine, as long as you don't walk the dictionary for each individual record. Instead, use Reflection Emit or Expression trees to build a delegate that you can call many many times.
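Not code from the answer, but a minimal sketch of the expression-tree idea: compile one extractor delegate per parameter definition up front, then call it for every record. The method name, offset and LSB here would come from something like the ParameterDefinition table above:

using System;
using System.Linq.Expressions;
using System.Reflection;

static class ExtractorBuilder
{
    // Builds a compiled Func that does BitConverter.ToXxx(buffer, offset) * lsb,
    // so the reflection/lookup cost is paid once rather than per record.
    public static Func<byte[], double> Build(string bitConverterMethod, int offset, double lsb)
    {
        MethodInfo convert = typeof(BitConverter).GetMethod(
            bitConverterMethod, new[] { typeof(byte[]), typeof(int) });

        ParameterExpression buffer = Expression.Parameter(typeof(byte[]), "buffer");

        // BitConverter.ToXxx(buffer, offset)
        Expression raw = Expression.Call(convert, buffer, Expression.Constant(offset));

        // (double)raw * lsb
        // Note: lsb == 0.0 in the table above means "no scaling"; a real
        // implementation would special-case that instead of multiplying by zero.
        Expression scaled = Expression.Multiply(
            Expression.Convert(raw, typeof(double)),
            Expression.Constant(lsb));

        return Expression.Lambda<Func<byte[], double>>(scaled, buffer).Compile();
    }
}

// Usage: build once per parameter, reuse for every record.
// Func<byte[], double> time = ExtractorBuilder.Build("ToUInt32", 2, 0.1);
// double seconds = time(InputBuffer); // 45345.5 for the sample record above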
It sounds like you are manually deserializing a byte stream, where the bytes represent various data types. That problem has been solved before.
Try defining a class that represents the first 600 bytes and deserialize it using a Protocol Buffer serializer (one implementation is by SO's own Marc Gravell, and there is a different implementation by top SO contributor Jon Skeet).
Protocol buffers are a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols and data storage. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format.
Source, as well as a 3rd implementation I have not personally used.
For the last 400 bytes, create appropriate class definitions for the appropriate formats, and again use protocol buffers to deserialize into an appropriate class.
For the final touch-ups (e.g. converting values to doubles) you can either post-process the classes, or just have a getter that returns the appropriate final number.
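As a small illustration of the getter idea (the attributes are protobuf-net's; the field, its LSB and the class name are purely hypothetical):

using ProtoBuf;

[ProtoContract]
public class HeaderRecord
{
    [ProtoMember(1)]
    public ushort RawAltitude; // hypothetical raw counts field

    // Final engineering value: raw counts multiplied by an illustrative LSB.
    public double Altitude
    {
        get { return RawAltitude * 0.25; }
    }
}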

loop for reading different data types & sizes off very large byte array from file

I have a raw byte stream stored in a file (rawbytes.txt) that I need to parse and output to a CSV-style text file.
The input of raw bytes (when read as characters/long/int etc.) looks something like this:
A2401028475764B241102847576511001200C...
Parsed it should look like:
OutputA.txt
(Field1,Field2,Field3) - heading
A,240,1028475764
OutputB.txt
(Field1,Field2,Field3,Field4,Field5) - heading
B,241,1028475765,1100,1200
OutputC.txt
C,...//and so on
Essentially, it's a hex-dump-style input of bytes that is continuous without any line terminators or gaps between data that needs to be parsed. The data, as seen above, consists of different data types one after the other.
Here's a snippet of my code - because there are no commas within any field, and no need arises to use "" (i.e. a CSV wrapper), I'm simply using TextWriter to create the CSV-style text file as follows:
if (File.Exists(fileName))
{
    using (BinaryReader reader = new BinaryReader(File.Open(fileName, FileMode.Open)))
    {
        inputCharIdentifier = reader.ReadChar();
        switch (inputCharIdentifier)
        {
            case 'A':
                field1 = reader.ReadUInt64();
                field2 = reader.ReadUInt64();
                field3 = reader.ReadChars(10);
                string strtmp = new string(field3);
                //and so on
                using (TextWriter writer = File.AppendText("outputA.txt"))
                {
                    writer.WriteLine(field1 + "," + field2 + "," + strtmp);
                }
                break;
            case 'B':
                //code...
My question is simple - how do I use a loop to read through the entire file? Generally, it exceeds 1 GB (which rules out File.ReadAllBytes and the methods suggested at Best way to read a large file into a byte array in C#?) - I considered using a while loop, but PeekChar is not suitable here. Also, cases A, B and so on have different-sized input - in other words, A might be 40 bytes total, while B is 50 bytes. So the use of a fixed-size buffer, say inputBuf[1000], or [50] for instance - if they were all the same size - wouldn't work well either, AFAIK.
Any suggestions? I'm relatively new to C# (2 months) so please be gentle.
You could read the file byte by byte, appending to a currentBlock byte array until you find the start of the next block. If a byte identifies a new block, you can then parse currentBlock using your case trick and start a new currentBlock with the character just read.
This approach works even if the id of the next block is longer than 1 byte - in that case you just parse currentBlock[0, currentBlock.Length - lenOfCurrentIdInBytes] - in other words, you read a little too much, but you then parse only what is needed and use what is left as the base for the next currentBlock.
If you want more speed you can read the file in chunks of X bytes, but apply the same logic.
You said "The issue is that the data is not 100% kosher - i.e. there are situations where I need to separately deal with the possibility that the character I expect to identify each block is not in the right place", but building a currentBlock should still work. The code surely will have some complications, maybe something like nextBlock, but I'm guessing here without knowing what incorrect data you have to deal with.
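A rough sketch of that accumulation loop, reading in chunks as suggested; the single-character identifiers and buffer size are illustrative, and ParseBlock stands in for the question's existing switch:

using System;
using System.Collections.Generic;
using System.IO;

class BlockScanner
{
    // Splits the raw stream into blocks that each start with an identifier byte.
    static void Scan(string path)
    {
        var currentBlock = new List<byte>();
        byte[] chunk = new byte[64 * 1024]; // read in chunks rather than byte by byte

        using (FileStream stream = File.OpenRead(path))
        {
            int read;
            while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    byte b = chunk[i];
                    bool isIdentifier = b == 'A' || b == 'B' || b == 'C'; // illustrative ids

                    if (isIdentifier && currentBlock.Count > 0)
                    {
                        ParseBlock(currentBlock.ToArray()); // your existing switch/case logic
                        currentBlock.Clear();
                    }
                    currentBlock.Add(b);
                }
            }
            if (currentBlock.Count > 0)
                ParseBlock(currentBlock.ToArray()); // last block in the file
        }
    }

    static void ParseBlock(byte[] block)
    {
        // placeholder: dispatch on block[0] the same way the question's switch does
    }
}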

Reading\Writing Structured Binary File

I want to read/write a binary file which has the following structure:
The file is composed of "RECORDS". Each "RECORD" has the following structure:
I will use the first record as an example.
START byte: 0x5A (always 1 byte, fixed value 0x5A)
LENGTH bytes: 0x00 0x16 (always 2 bytes; the value can range from "0x00 0x02" to "0xFF 0xFF")
CONTENT: the number of bytes indicated by the decimal value of the LENGTH field minus 2. In this case the LENGTH field value is 22 (0x00 0x16 converted to decimal), therefore the CONTENT will contain 20 (22 - 2) bytes.
My goal is to read each record one by one, and write it to an output file.
Currently I have a read function and a write function (some pseudocode):
private void Read(BinaryReader binaryReader, BinaryWriter binaryWriter)
{
    byte START = 0x5A;
    int decimalLength = 0;
    byte[] content = null;
    byte[] length = new byte[2];
    while (binaryReader.PeekChar() != -1)
    {
        //Check the first byte, which should be equal to 0x5A
        if (binaryReader.ReadByte() != START)
        {
            throw new Exception("0x5A Expected");
        }
        //Extract the length field value
        length = binaryReader.ReadBytes(2);
        //Convert the length field to decimal
        decimalLength = GetLength(length);
        //Extract the content field value
        content = binaryReader.ReadBytes(decimalLength - 2);
        //DO WORK
        //modifying the content
        //Writing the record
        Write(binaryWriter, content, length, START);
    }
}
private void Write(BinaryWriter binaryWriter, byte[] content, byte[] length, byte START)
{
    binaryWriter.Write(START);
    binaryWriter.Write(length);
    binaryWriter.Write(content);
}
This way is actually working.
However, since I am dealing with very large files, I find it does not perform well, because I read and write 3 times for each record. I would actually like to read big chunks of data instead of small numbers of bytes, and maybe work in memory, but my experience with streams stops at BinaryReader and BinaryWriter. Thanks in advance.
FileStream is already buffered, so I'd expect it to work pretty well. You could always create a BufferedStream around the original stream to add extra buffering if you really need to, but I doubt it would make a significant difference.
You say it's "not performing at all" - how fast is it working? How sure are you that the IO is where your time is going? Have you performed any profiling of the code?
I might also suggest that you read 3 (or 6?) bytes initially, instead of doing 2 separate reads. Put the initial bytes in a small array, check the 0x5A check byte, then the 2-byte length indicator, then the 3-byte AFP op-code, THEN read the remainder of the AFP record.
It's a small difference, but it gets rid of one of your read calls.
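A small sketch of that tweak against the framing described in the question (1 START byte plus a big-endian 2-byte length), reading the header in one call and the content in another:

private void ReadBuffered(BinaryReader binaryReader, BinaryWriter binaryWriter)
{
    while (true)
    {
        byte[] header = binaryReader.ReadBytes(3); // START byte + 2 length bytes in one call
        if (header.Length < 3)
            break;                                 // end of stream
        if (header[0] != 0x5A)
            throw new Exception("0x5A Expected");

        int contentLength = ((header[1] << 8) | header[2]) - 2; // big-endian length minus the 2 length bytes
        byte[] content = binaryReader.ReadBytes(contentLength);

        // DO WORK on content here, then echo the record
        binaryWriter.Write(header);
        binaryWriter.Write(content);
    }
}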
I'm no Jon Skeet, but I did work at one of the biggest print & mail shops in the country for quite a while, and we did mostly AFP output :-)
(usually in C, though)
