Better/faster way to fill a big array in C#

I have three *.dat files (346 KB, 725 KB, 1762 KB) that are filled with a JSON string of "big" int arrays.
Each time my object is created (which happens several times) I take those three files and use JsonConvert.DeserializeObject to deserialize the arrays into the object.
I thought about using binary files instead of a JSON string, or could I even save these arrays directly? I don't need to use these files; it's just the location where the data is currently saved. I would gladly switch to anything faster.
What are the different ways to speed up the initialization of these objects?

The fastest way is to manually serialize the data.
An easy way to do this is by creating a FileStream and then wrapping it in a BinaryWriter/BinaryReader.
You then have access to functions to write the basic data types (numbers, string, char, byte[] and char[]).
An easy way to write an int[] (unnecessary if it's a fixed size) is to prepend the length of the array as an int/long (depending on the size; unsigned doesn't really give any advantage, since arrays use signed data types for their length storage), and then write all the ints.
Two ways to write all the ints would be:
1. Simply loop over the entire array.
2. Convert it into a byte[] and write it using BinaryWriter.Write(byte[])
Here is how you can implement them both:
// Writing
BinaryWriter writer = new BinaryWriter(new FileStream(...));
int[] intArr = new int[1000];
writer.Write(intArr.Length);
for (int i = 0; i < intArr.Length; i++)
    writer.Write(intArr[i]);

// Reading
BinaryReader reader = new BinaryReader(new FileStream(...));
int[] intArr = new int[reader.ReadInt32()];
for (int i = 0; i < intArr.Length; i++)
    intArr[i] = reader.ReadInt32();
// Writing, method 2
BinaryWriter writer = new BinaryWriter(new FileStream(...));
int[] intArr = new int[1000];
byte[] byteArr = new byte[intArr.Length * sizeof(int)];
Buffer.BlockCopy(intArr, 0, byteArr, 0, intArr.Length * sizeof(int));
writer.Write(intArr.Length);
writer.Write(byteArr);
// Reading, method 2
BinaryReader reader = new BinaryReader(new FileStream(...));
int[] intArr = new int[reader.ReadInt32()];
byte[] byteArr = reader.ReadBytes(intArr.Length * sizeof(int));
Buffer.BlockCopy(byteArr, 0, intArr, 0, byteArr.Length);
I decided to put this all to the test: with an array of 10,000 integers, I ran each test 10,000 times.
Method 1 consumed on average 888,200 ns on my system (about 0.89 ms), while method 2 consumed on average only 568,600 ns (about 0.57 ms).
Both times include the work the garbage collector has to do.
Obviously method 2 is faster than method 1, though possibly less readable.
Another reason why method 1 can be better than method 2 is that method 2 requires twice as much free RAM as the data you're going to write (the original int[] plus the byte[] converted from it), which matters when dealing with limited RAM or extremely large files (talking about 512 MB+). If that is the case, you can always make a hybrid solution, for example writing away 128 MB at a time.
Note that method 1 also requires this extra space, but because it's split into one operation per item of the int[], the memory can be released a lot earlier.
Something like this will write an int[] 128 MB at a time:
const int WRITECOUNT = 32 * 1024 * 1024; // ints per chunk: 32M ints = 128MB
int[] intArr = new int[140 * 1024 * 1024]; // 140M ints = 560MB
for (int i = 0; i < intArr.Length; i++)
    intArr[i] = i;
byte[] byteArr = new byte[WRITECOUNT * sizeof(int)]; // 128MB chunk buffer
int dataDone = 0;
using (Stream fileStream = new FileStream("data.dat", FileMode.Create))
using (BinaryWriter writer = new BinaryWriter(fileStream))
{
    while (dataDone < intArr.Length)
    {
        int dataToWrite = intArr.Length - dataDone;
        if (dataToWrite > WRITECOUNT) dataToWrite = WRITECOUNT;
        // Buffer.BlockCopy offsets are in bytes, so scale the int offset by sizeof(int)
        Buffer.BlockCopy(intArr, dataDone * sizeof(int), byteArr, 0, dataToWrite * sizeof(int));
        // only write the bytes actually filled; the last chunk may be smaller than the buffer
        writer.Write(byteArr, 0, dataToWrite * sizeof(int));
        dataDone += dataToWrite;
    }
}
Note that this is just for writing; reading works a bit differently too :P.
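For completeness, here is a minimal sketch of the matching chunked reader (assuming the same "data.dat" file and chunk size as above, and that the file contains nothing but raw ints):
const int READCOUNT = 32 * 1024 * 1024; // ints per chunk, same as WRITECOUNT
byte[] byteArr = new byte[READCOUNT * sizeof(int)]; // 128MB chunk buffer
using (Stream fileStream = new FileStream("data.dat", FileMode.Open))
{
    int[] intArr = new int[(int)(fileStream.Length / sizeof(int))];
    int dataDone = 0;
    while (dataDone < intArr.Length)
    {
        int dataToRead = intArr.Length - dataDone;
        if (dataToRead > READCOUNT) dataToRead = READCOUNT;
        int chunkBytes = dataToRead * sizeof(int);
        // Stream.Read may return fewer bytes than requested, so loop until the chunk is full
        int bytesRead = 0;
        while (bytesRead < chunkBytes)
        {
            int n = fileStream.Read(byteArr, bytesRead, chunkBytes - bytesRead);
            if (n == 0) throw new EndOfStreamException();
            bytesRead += n;
        }
        Buffer.BlockCopy(byteArr, 0, intArr, dataDone * sizeof(int), chunkBytes);
        dataDone += dataToRead;
    }
}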
I hope this gives you some more insight into dealing with very large data files :).

If you've just got a bunch of integers, then using JSON will indeed be pretty inefficient in terms of parsing. You can use BinaryReader and BinaryWriter to write binary files efficiently... but it's not clear to me why you need to read the file every time you create an object anyway. Why can't each new object keep a reference to the original array, which has been read once? Or if they need to mutate the data, you could keep one "canonical source" and just copy that array in memory each time you create an object.
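As a minimal sketch of that idea (MyObject, LoadData and the file name are made up for illustration, and it assumes Newtonsoft.Json for the existing JSON format):
public class MyObject
{
    // read and deserialized exactly once, shared by every instance
    private static readonly int[] CanonicalData = LoadData();

    private readonly int[] data;

    public MyObject()
    {
        // clone only if instances need to mutate the data;
        // otherwise just keep a reference to CanonicalData
        data = (int[])CanonicalData.Clone();
    }

    private static int[] LoadData()
    {
        return JsonConvert.DeserializeObject<int[]>(File.ReadAllText("data1.dat"));
    }
}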

The fastest way to create a byte array from an array of integers is to use Buffer.BlockCopy:
byte[] result = new byte[a.Length * sizeof(int) + sizeof(int)];
// store the array length in the first 4 bytes so it can be deserialized later
Buffer.BlockCopy(BitConverter.GetBytes(a.Length), 0, result, 0, sizeof(int));
Buffer.BlockCopy(a, 0, result, sizeof(int), a.Length * sizeof(int));
// write result to a FileStream or wherever
If you store the size of the array in the first four bytes like this, you can use it again to deserialize. Make sure everything fits into memory, but looking at your file sizes it should.
var buffer = File.ReadAllBytes(@"...");
int size = BitConverter.ToInt32(buffer, 0);
var result = new int[size];
// skip the 4-byte length prefix, then block-copy the payload
Buffer.BlockCopy(buffer, sizeof(int), result, 0, size * sizeof(int));
Binary is not human-readable, but it is definitely faster than JSON.

Related

C# - Stream.Read offset is working incorrectly

I have test code like below. I am reading from a stream, offsetting by 2 positions, and then taking the next 2 bytes. I would expect the result to be an array with 2 elements. This does not work though: the offset seems to be ignored, and a full-sized array is always returned, with only the offset blocks having values. But this means my result array is still very large; it just has a lot of unwanted zeroes.
How can I rework the code below so that file.Read() returns only an array of 2 bytes instead of 10 when length = 2 and offset = 2? In a real-world scenario I am dealing with large files (>2 GB), so filtering the result array is not an option.
Edit: As the issue is unclear: the code below requires me to always define an output array that is the same size as the stream. Instead I would like the output to be of size length (in the example below I would like var buffer = new byte[2], but that throws an exception, because file.Read ignores offset and length and always returns 10 elements, with only 2 of them read and the rest dummy zeroes).
private byte[] GetFilePart(int length, int offset)
{
    // build some dummy content
    var content = new byte[10];
    for (int i = 0; i < 10; i++)
    {
        content[i] = 1;
    }
    // read the data from content
    var buffer = new byte[10];
    using (Stream file = new MemoryStream(content))
    {
        file.Read(buffer, offset, length);
    }
    return buffer;
}
Looks like it's working properly to me; maybe your confusion would clear up a bit if you initialized your content array with something like:
for (int i = 1; i <= 10; i++)
{
    content[i - 1] = (byte)i; // the cast is needed when assigning an int to a byte element
}
then each byte would have a different value, making it easier to see which bytes end up where.
offset relates to where in buffer the Stream will write the bytes (it reads from the start of content); it does not relate to which bytes are read out of content.
Imagine Read as being called WriteBytesInto(byte[] whatBuffer, int whereToStartWriting, int howManyBytesToWrite): you provide the buffer it will write into and tell it where to start and how many bytes to write.
If you did this, having initialized content to incrementing numbers:
file.Read(buffer, 2, 3); // read 3 bytes from stream and write to buffer at index 2
file.Read(buffer, 0, 2); // read 2 bytes from stream and write to buffer at index 0
Your buffer would end up looking like:
4,5,1,2,3,0,0,0,0,0
The 1,2,3 having been written first, then the 4,5 written next
If you want to skip two bytes from the stream (i.e. read the 3rd and 4th bytes from content), Seek() the stream or set its Position (or, as canton7 advises in the comments, if the stream is not seekable, read and discard some bytes).
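For a non-seekable stream, the read-and-discard approach might look like this (a minimal sketch, reusing the file stream from the question's code and skipping 2 bytes):
var discard = new byte[2]; // throwaway buffer for the bytes we want to skip
int skipped = 0;
while (skipped < discard.Length)
{
    int n = file.Read(discard, skipped, discard.Length - skipped);
    if (n == 0) break; // end of stream
    skipped += n;
}
// now read the 2 bytes we actually want
var buf = new byte[2];
file.Read(buf, 0, buf.Length);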
How can I rework the code below so that file.Read() returns only an array of 2 bytes instead of 10 when length = 2 and offset = 2?
Well, file.Read doesn't return an array at all; it modifies an array you give it. If you want a 2-byte array, give it a 2-byte array:
byte[] buf = new byte[2];
file.Read(buf, 0, buf.Length);
If you want to open a file, skip the first 7 bytes and then read the 8th and 9th bytes into your length-of-2 byte array, then:
byte[] buf = new byte[2];
file.Position = 7; // absolute skip to the 8th byte
file.Read(buf, 0, buf.Length);
For more on seeking in streams see Stream.Seek(0, SeekOrigin.Begin) or Position = 0

How to read binary array directly from the stream?

Here by "directly" I mean without temporary byte[] array.
The problem is, for example I have array of ints or doubles on the disk, so currently I create two arrays -- byte array and int array (in case of ints). The former is just for reading, the latter is the actual output.
Since Stream can read only to byte array I read it to the first array, than copy all the data to the second. It works, but it really hurts me (I am not talking about performance here).
So, how to read the array without temporary array? Using C# unsafe context is fine for me.
So far I tried two approaches: I looked if it is possible to create an array reusing allocated memory and second which looked more promising -- I could get pointer to the result/second array and in unsafe context I could cast it to byte* pointer. Considering my needs it is 100% safe and valid, however byte* pointer is not a byte[] array in C# world and I cannot find the way to cast pointer to array.
Code:
void ReadStuff(Stream stream, double[] data)
{
    var dataBytes = new byte[data.Length * sizeof(double)];
    stream.Read(dataBytes, 0, dataBytes.Length);
    Buffer.BlockCopy(dataBytes, 0, data, 0, dataBytes.Length);
    // ...
}
There is no way I know of to copy data directly from a stream to a typed array. But you can process the data in chunks, limiting your memory overhead to a fixed amount. The memory will be copied twice, but this is unavoidable as far as I know.
For example:
// requires using System.Runtime.InteropServices; for Marshal.SizeOf
public static void ReadArrayDataChunked(BinaryReader binaryReader, Array target, Type type, int bufferSize = 4096)
{
    var buffer = new byte[bufferSize];
    var tSize = Marshal.SizeOf(type);
    var remainingBytes = target.Length * tSize;
    var targetPosition = 0;
    while (remainingBytes > 0)
    {
        var toRead = Math.Min(remainingBytes, buffer.Length);
        var bytesRead = binaryReader.Read(buffer, 0, toRead);
        if (bytesRead == 0)
            throw new EndOfStreamException(); // avoid looping forever on a truncated stream
        Buffer.BlockCopy(buffer, 0, target, targetPosition, bytesRead);
        targetPosition += bytesRead;
        remainingBytes -= bytesRead;
    }
}
Note that this only works for primitive types due to the BlockCopy, but this helps improve copy performance. You will need to read other types item by item.
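Usage might look like this (a sketch; "data.bin" and the element count are assumptions for illustration):
var data = new double[1000];
using (var stream = File.OpenRead("data.bin"))
using (var reader = new BinaryReader(stream))
{
    ReadArrayDataChunked(reader, data, typeof(double));
}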

Does converting between byte[] and MemoryStream cause overhead?

I want to know if there's overhead when converting between byte arrays and streams (specifically MemoryStream, when using MemoryStream.ToArray() and MemoryStream(byte[])). I assume it temporarily doubles memory usage.
For example, I read as a stream, convert to bytes, and then convert to a stream again.
But getting rid of that byte conversion would require a bit of a rewrite. I don't want to waste time rewriting it if it doesn't make a difference.
So, yes: you are correct in assuming that ToArray duplicates the memory in the stream.
If you do not want to do this (for efficiency reasons), you can modify the bytes directly in the stream. Take a look at this:
// create some bytes: 0,1,2,3,4,5,6,7...
var originalBytes = Enumerable.Range(0, 256).Select(Convert.ToByte).ToArray();
// note: publiclyVisible must be true, or GetBuffer()/TryGetBuffer() below will fail;
// ms references the originalBytes array, it does not duplicate it
using (var ms = new MemoryStream(originalBytes, 0, originalBytes.Length, writable: true, publiclyVisible: true))
{
    // var duplicatedBytes = ms.ToArray(); // copy of the originalBytes array

    // If you don't want to duplicate the bytes but want to
    // modify the buffer directly, you could do this:
    var bufRef = ms.GetBuffer();
    for (var i = 0; i < bufRef.Length; ++i)
    {
        bufRef[i] = Convert.ToByte(bufRef[i] ^ 0x55);
    }

    // or this:
    /*
    ms.TryGetBuffer(out var buf);
    for (var i = 0; i < buf.Count; ++i)
    {
        buf[i] = Convert.ToByte(buf[i] ^ 0x55);
    }*/

    // or this:
    /*
    for (var i = 0; i < ms.Length; ++i)
    {
        ms.Position = i;
        var b = ms.ReadByte();
        ms.Position = i;
        ms.WriteByte(Convert.ToByte(b ^ 0x55));
    }*/
}
// originalBytes will now be 85,84,87,86...
ETA:
Edited to add in Blindy's examples. Thanks! -- Totally forgot about GetBuffer and had no idea about TryGetBuffer
Does MemoryStream(byte[]) cause a memory copy?
No, it's a non-resizable stream, and as such no copy is necessary.
Does MemoryStream.ToArray() cause a memory copy?
Yes, by design it creates a copy of the active buffer. This is to cover the resizable case, where the buffer used by the stream is not the same buffer that was initially provided due to reallocations to increase/decrease its size.
Alternatives to MemoryStream.ToArray() that don't cause memory copy?
Sure, you have MemoryStream.TryGetBuffer(out ArraySegment<byte> buffer), which returns a segment pointing to the internal buffer, whether or not the stream is resizable. If it's non-resizable, it's a segment into your original array. (Note that it returns false, without exposing the buffer, if the stream was constructed over a byte[] without making the buffer publicly visible.)
You also have MemoryStream.GetBuffer, which returns the entire internal buffer. Note that in the resizable case, this will be a lot larger than the actual used stream space, and you'll have to adjust for that in code.
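For example, a minimal sketch of that adjustment (outputStream is a hypothetical destination):
byte[] raw = ms.GetBuffer(); // the whole internal buffer, possibly larger than the data
int valid = (int)ms.Length;  // only this many bytes are real stream content
outputStream.Write(raw, 0, valid);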
And lastly, you don't always actually need a byte array, sometimes you just need to write it to another stream (a file, a socket, a compression stream, an Http response, etc). For this, you have MemoryStream.CopyTo[Async], which also doesn't perform any copies.
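For instance, dumping a MemoryStream straight into a file without materializing an array (a sketch; the file name is an assumption):
using (var ms = new MemoryStream())
{
    // ... fill ms somehow ...
    ms.Position = 0; // CopyTo copies from the current position onward
    using (var file = File.Create("output.bin"))
    {
        ms.CopyTo(file); // streams the bytes; no ToArray-style full copy
    }
}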

Best way to copy bits in terms of performance, in C#

What's the fastest way to copy bits from an int to a byte array in C#?
I have a couple of ints and I need to copy (sometimes all and sometimes only some of) the bits serially into a byte[]...
I need the process to be as efficient as possible, e.g. avoiding creating a new byte array in the process, as I understand BitConverter does.
One way to avoid creating a new byte[] array on each call is to create a BinaryWriter on top of a MemoryStream, write your integers into it, and then harvest all the results at once by accessing the MemoryStream's buffer:
var buf = new byte[400];
using (var ms = new MemoryStream(buf))
using (var bw = new BinaryWriter(ms)) {
    for (int i = 0; i != 100; i++) {
        bw.Write(2 * i + 3);
    }
}
// At this point buf contains the bytes of 100 ints
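One design constraint to note: a MemoryStream constructed over an existing byte[] is fixed-size, so buf must be at least as large as everything you intend to write (here exactly 400 bytes for 100 four-byte ints); writing past the end throws a NotSupportedException.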

How to Convert Primitive[] to byte[]

For serialization of a primitive array, I'm wondering how to convert a Primitive[] to its corresponding byte[] (i.e. an int[128] to a byte[512], or a ushort[] to a byte[]...).
The destination can be a MemoryStream, a network message, a file, anything.
The goal is performance (serialization & deserialization time): being able to write a byte[] to a stream in one shot instead of looping through all the values, or allocating via some converter.
Some solutions already explored:
Regular Loop to write/read
// array = any int[];
myStreamWriter.WriteInt32(array.Length);
for (int i = 0; i < array.Length; ++i)
    myStreamWriter.WriteInt32(array[i]);
This solution works for serialization and deserialization, and is something like 100 times faster than using standard System.Runtime.Serialization combined with a BinaryFormatter to serialize/deserialize a single int or a couple of them.
But this solution becomes slower once array.Length exceeds roughly 200-300 values (for Int32).
Cast?
It seems C# can't directly cast an int[] to a byte[], or a bool[] to a byte[].
BitConverter.GetBytes()
This solution works, but it allocates a new byte[] on each iteration of the loop through my int[]. Performance is of course horrible.
Marshal.Copy
Yep, this solution works too, but it has the same problem as the BitConverter one.
C++ hack
Because a direct cast is not allowed in C#, I tried a C++ hack after seeing in memory that the array length is stored 4 bytes before the array data starts:
ARRAYCAST_API void Cast(int* input, unsigned char** output)
{
    // get the address of the input (this is a pointer to the data)
    int* count = input;
    // the size of the buffer is located just before the data (4 bytes before, as this is an int)
    count--;
    // multiply the number of elements by 4, as an int is 4 bytes
    *count = *count * 4;
    // set the address of the byte array
    *output = (unsigned char*)(input);
}
and the C# that calls it:
byte[] arrayB = null;
int[] arrayI = new int[128];
for (int i = 0; i < 128; ++i)
    arrayI[i] = i;
// delegate call
fptr(arrayI, out arrayB);
I successfully retrieve my int[128] in C++, switch the array length, and assign the right address to my 'output' variable, but C# only sees a byte[1] on return. It seems I can't hack a managed variable that easily.
So I'm really starting to think that all these casts I want to achieve (int[] -> byte[], bool[] -> byte[], double[] -> byte[]...) are just impossible in C# without allocating/copying...
What am I missing?
How about using Buffer.BlockCopy?
// serialize
var intArray = new[] { 1, 2, 3, 4, 5, 6, 7, 8 };
var byteArray = new byte[intArray.Length * 4];
Buffer.BlockCopy(intArray, 0, byteArray, 0, byteArray.Length);
// deserialize and test
var intArray2 = new int[byteArray.Length / 4];
Buffer.BlockCopy(byteArray, 0, intArray2, 0, byteArray.Length);
Console.WriteLine(intArray.SequenceEqual(intArray2)); // true
Note that this still allocates the byte[] and copies the data behind the scenes. I'm fairly sure that this is unavoidable in managed code, and BlockCopy is probably about as good as it gets for this.
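And to write that byte[] to a stream in one shot, as asked, something like this might do (a sketch; the file name is an assumption, and the length prefix mirrors the loop version above):
using (var stream = File.Create("ints.bin"))
using (var writer = new BinaryWriter(stream))
{
    writer.Write(intArray.Length); // length prefix, as in the loop version
    writer.Write(byteArray);       // the whole payload in one call
}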
