Dealing with very large Lists on x86 - c#

I need to work with large lists of floats, but I am hitting memory limits on x86 systems. I do not know the final length, so I need to use an expandable type. On x64 systems, I can use <gcAllowVeryLargeObjects>.
My current data type:
List<RawData> param1 = new List<RawData>();
List<RawData> param2 = new List<RawData>();
List<RawData> param3 = new List<RawData>();
public class RawData
{
    public string name;
    public List<float> data;
}
The length of the paramN lists is low (currently 50 or lower), but data can be 10m+. When the length is 50, I am hitting memory limits (OutOfMemoryException) at just above 1m data points, and when the length is 25, I hit the limit at just above 2m data points. (If my calculations are right, that is exactly 200MB, plus the size of name, plus overhead). What can I use to increase this limit?
Edit: I tried using List<List<float>>, with a max inner-list size of 1 << 17 (131072), which increased the limit somewhat, but still not as far as I want.
Edit2: I tried reducing the chunk size in the List<List<float>> to 8192, and I got OOM at ~2.3m elements, with Task Manager reading ~1.4GB for the process. It looks like I need to reduce memory usage between the data source and the storage, or trigger GC more often. I was able to gather 10m data points in an x64 process on a PC with 4GB RAM; IIRC the process never went over 3GB.
Edit3: I condensed my code down to just the parts that handle the data. http://pastebin.com/maYckk84
Edit4: I had a look in dotMemory, and found that my data structure does take up ~1GB with the settings I was testing (50ch * 3 params * 2m events = 300,000,000 float elements). I guess I will need to limit it on x86, or figure out how to write this format to disk as the data arrives.

First of all, on x86 systems the memory limit is 2GB, not 200MB. I presume your problem is trickier than that: you have aggressive LOH (Large Object Heap) fragmentation.
The CLR uses different heaps for small and large objects. An object is large if its size is more than 85,000 bytes. The LOH is a fractious thing: it is not eager to return unused memory to the OS, and it is very poor at defragmentation.
.NET's List<T> is an implementation of the ArrayList data structure: it stores elements in an array of fixed size, and when that array fills up, a new array with double the size is created. That continuous growth of the backing array with your amount of data is a "starvation" scenario for the LOH.
So, you have to use a tailor-made data structure to suit your needs, e.g. a list of chunks, with each chunk small enough not to land in the LOH. Here is a small prototype:
public class ChunkedList
{
    private const int ChunkSize = 8000;

    private readonly List<float[]> _chunks = new List<float[]>();
    private int _count = 0;

    public void Add(float item)
    {
        int chunk = _count / ChunkSize;
        int ind = _count % ChunkSize;
        if (ind == 0)
        {
            _chunks.Add(new float[ChunkSize]);
        }
        _chunks[chunk][ind] = item;
        _count++;
    }

    public float this[int index]
    {
        get
        {
            if (index < 0 || index >= _count) throw new IndexOutOfRangeException();
            int chunk = index / ChunkSize;
            int ind = index % ChunkSize;
            return _chunks[chunk][ind];
        }
        set
        {
            if (index < 0 || index >= _count) throw new IndexOutOfRangeException();
            int chunk = index / ChunkSize;
            int ind = index % ChunkSize;
            _chunks[chunk][ind] = value;
        }
    }

    // other code you require
}
With ChunkSize = 8000, every chunk takes only 32,000 bytes, so it will not land in the LOH. _chunks will only reach the LOH once there are about 16,000 chunks in the collection, which is more than 128 million elements (about 500 MB of data).
UPD I've performed some stress tests for the sample above. The OS is x64, the solution platform is x86, and ChunkSize is 20000.
First:
var list = new ChunkedList();
for (int i = 0; ; i++)
{
    list.Add(0.1f);
}
OutOfMemoryException is raised at ~324,000,000 elements
Second:
public class RawData
{
    public string Name;
    public ChunkedList Data = new ChunkedList();
}
var list = new List<RawData>();
for (int i = 0; ; i++)
{
    var raw = new RawData { Name = "Test" + i };
    for (int j = 0; j < 20 * 1000 * 1000; j++)
    {
        raw.Data.Add(0.1f);
    }
    list.Add(raw);
}
OutOfMemoryException is raised at i = 17, j ~ 12,000,000: 17 RawData instances were successfully created, with 20 million data points each, about 352 million data points in total.

Related

What defines the capacity of a memory stream

I was calculating the size of an object (a List being populated) using the following code:
long myObjectSize = 0;
System.IO.MemoryStream memoryStreamObject = new System.IO.MemoryStream();
System.Runtime.Serialization.Formatters.Binary.BinaryFormatter binaryBuffer = new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
binaryBuffer.Serialize(memoryStreamObject, myListObject);
myObjectSize = memoryStreamObject.Position;
At the initial point, the capacity of memoryStreamObject was 1024.
Later (after adding more elements to the list) it was shown as 2048.
And it seems to keep increasing as the stream content grows. So what is the purpose of Capacity in this scenario?
This is caused by the internal implementation of MemoryStream. The Capacity property is the size of the internal buffer. This makes sense if the MemoryStream is created with a fixed-size buffer. But in your case the MemoryStream can grow, and the actual implementation doubles the size of the buffer whenever the buffer is too small.
Code of MemoryStream
private bool EnsureCapacity(int value)
{
    if (value < 0)
    {
        throw new IOException(Environment.GetResourceString("IO.IO_StreamTooLong"));
    }
    if (value > this._capacity)
    {
        int num = value;
        if (num < 256)
        {
            num = 256;
        }
        if (num < this._capacity * 2)
        {
            num = this._capacity * 2;
        }
        if (this._capacity * 2 > 2147483591)
        {
            num = ((value > 2147483591) ? value : 2147483591);
        }
        this.Capacity = num;
        return true;
    }
    return false;
}
And somewhere in Write
int num = this._position + count;
// snip
if (num > this._capacity && this.EnsureCapacity(num))
The purpose of the capacity for a memory stream and for lists is that the underlying data structure is really an array, and arrays cannot be dynamically resized.
So to start with you use an array with a small(ish) size, but once you add enough data so that the array is no longer big enough you need to create a new array, copy over all the data from the old array to the new array, and then switch to using the new array from now on.
This create+copy takes time; the bigger the array, the longer it takes. As such, if you resized the array to be just big enough every time, you would effectively do this on every write to the memory stream or every element added to the list.
Instead you have a capacity, saying "you can use up to this value before having to resize" to reduce the number of create+copy cycles you have to perform.
For instance, if you were to write one byte at a time to this array and not have this capacity concept, every extra byte would mean one full create+copy cycle of the entire array. Instead, with the last screenshot in your question, you can write one byte at a time 520 more times before this create+copy cycle has to be performed.
So this is a performance optimization.
An additional bonus is that repeatedly allocating slightly larger memory blocks will eventually fragment memory so that you would risk getting "out of memory" exceptions, reducing the number of such allocations also helps to stave off this scenario.
A typical method to calculate this capacity is by just doubling it every time.
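The doubling is easy to observe directly. A minimal sketch; the capacities follow the decompiled EnsureCapacity shown above (minimum 256, then doubling):

```csharp
using System;
using System.IO;

class CapacityDemo
{
    static void Main()
    {
        var ms = new MemoryStream();

        ms.Write(new byte[1], 0, 1);     // 1 byte stored...
        Console.WriteLine(ms.Capacity);  // ...but Capacity jumps to the 256-byte minimum

        ms.Write(new byte[256], 0, 256); // 257 bytes stored: the buffer doubles
        Console.WriteLine(ms.Capacity);  // 512

        ms.Write(new byte[256], 0, 256); // 513 bytes stored: doubles again
        Console.WriteLine(ms.Capacity);  // 1024
    }
}
```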

Create random ints with minimum and maximum from Random.NextBytes()

Title pretty much says it all. I know I could use Random.NextInt(), of course, but I want to know if there's a way to turn unbounded random data into bounded data without statistical bias. (This means no (RandomInt() % (maximum - minimum)) + minimum.) Surely there is a method like that, one that doesn't introduce bias into the data it outputs?
If you assume that the bits are randomly distributed, I would suggest:
Generate enough bytes to get a number within the range (e.g. 1 byte to get a number in the range 0-100, 2 bytes to get a number in the range 0-30000 etc).
Use only enough bits from those bytes to cover the range you need. So for example, if you're generating numbers in the range 0-100, take the bottom 7 bits of the byte you've generated
Interpret the bits you've got as a number in the range [0, 2^n), where n is the number of bits
Check whether the number is in your desired range. It should be at least half the time (on average)
If so, use it. If not, repeat the above steps until a number is in the right range.
The use of just the required number of bits is key to making this efficient - you'll throw away up to half the number of bytes you generate, but no more than that, assuming a good distribution. (And if you are generating numbers in a nicely binary range, you won't need to throw anything away.)
Implementation left as an exercise to the reader :)
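A sketch of the steps above; the method name and exact shape are mine, not from the answer (it returns a value in [minValue, maxValue) via rejection sampling):

```csharp
using System;

static class UnbiasedRandom
{
    // Rejection sampling: take just enough random bits to cover the range,
    // and retry whenever the candidate falls outside it.
    public static int NextIntUnbiased(Random rng, int minValue, int maxValue)
    {
        if (minValue >= maxValue)
            throw new ArgumentOutOfRangeException(nameof(maxValue));

        uint range = (uint)((long)maxValue - minValue);
        int bits = 0;                       // smallest n with range <= 2^n
        for (uint r = range - 1; r != 0; r >>= 1)
            bits++;

        var buffer = new byte[4];
        while (true)
        {
            rng.NextBytes(buffer);
            uint candidate = BitConverter.ToUInt32(buffer, 0);
            if (bits < 32)
                candidate &= (1u << bits) - 1;   // keep only the bottom 'bits' bits
            if (candidate < range)               // accept; otherwise retry
                return (int)(minValue + (long)candidate);
        }
    }
}
```

Because the mask keeps only as many bits as the range needs, at most half of the candidates are rejected on average, matching the efficiency argument above.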
You could try with something like:
public static int MyNextInt(Random rnd, int minValue, int maxValue)
{
    var buffer = new byte[4];
    rnd.NextBytes(buffer);
    uint num = BitConverter.ToUInt32(buffer, 0);
    // The +1 is to exclude the maxValue in the case that
    // minValue == int.MinValue, maxValue == int.MaxValue
    double dbl = num * 1.0 / ((long)uint.MaxValue + 1);
    long range = (long)maxValue - minValue;
    int result = (int)(dbl * range) + minValue;
    return result;
}
Totally untested... I can't guarantee that the results are truly pseudo-random... but the idea of creating a double (dbl) is the same one used by the Random class. I just use uint.MaxValue as the base instead of int.MaxValue, so I don't have to check for negative values from the buffer.
I propose a generator of random integers, based on NextBytes.
This method discards only 9.62% of bits on average over the word-size range for positive Int32s, thanks to using Int64 as the representation for bit manipulation.
Maximum bit loss occurs at a word size of 22 bits: 20 of the 64 bits are lost in the byte-range conversion, for a bit efficiency of 68.75%.
Also, 25% of values are lost to clipping the unbound range at the maximum value.
Be careful to use Take(N) on the returned IEnumerable, because it is otherwise an infinite generator.
I'm using a buffer of 512 long values, so it generates 4096 random bytes at once. If you just need a few integers, change the buffer size from 512 to a more suitable value, down to 1.
public static class RandomExtensions
{
    public static IEnumerable<int> GetRandomIntegers(this Random r, int max)
    {
        if (max < 1)
            throw new ArgumentOutOfRangeException("max", max, "Must be a positive value.");

        const int longWordsTotal = 512;
        const int bufferSize = longWordsTotal * 8;
        var buffer = new byte[bufferSize];
        var wordSize = (int)Math.Log(max, 2) + 1;
        while (true)
        {
            r.NextBytes(buffer);
            for (var longWordIndex = 0; longWordIndex < longWordsTotal; longWordIndex++)
            {
                // The offset is in bytes, so step by 8 per 64-bit word.
                ulong longWord = BitConverter.ToUInt64(buffer, longWordIndex * 8);
                var lastStartBit = 64 - wordSize;
                for (var startBit = 0; startBit <= lastStartBit; startBit += wordSize)
                {
                    var mask = ((1UL << wordSize) - 1) << startBit;
                    var unboundValue = (int)((mask & longWord) >> startBit);
                    if (unboundValue <= max)
                        yield return unboundValue;
                }
            }
        }
    }
}

BitConverter.GetBytes in place

I need to get UInt16 and UInt64 values as Byte[]. At the moment I am using BitConverter.GetBytes, but that method gives me a new array instance each time.
I would like a method that lets me "copy" those values into already-existing arrays, something like:
.ToBytes(UInt64 value, Byte[] array, Int32 offset);
.ToBytes(UInt16 value, Byte[] array, Int32 offset);
I have been taking a look at the .NET source code with ILSpy, but I am not very sure how this code works or how I can safely modify it to fit my requirements:
public unsafe static byte[] GetBytes(long value)
{
    byte[] array = new byte[8];
    fixed (byte* ptr = array)
    {
        *(long*)ptr = value;
    }
    return array;
}
Which would be the right way of accomplishing this?
Update: I cannot use unsafe code, and it should not create new array instances.
You can do it like this:
static unsafe void ToBytes(ulong value, byte[] array, int offset)
{
    fixed (byte* ptr = &array[offset])
        *(ulong*)ptr = value;
}
Usage:
byte[] array = new byte[9];
ToBytes(0x1122334455667788, array, 1);
You can set offset only in bytes.
If you want a managed way to do it:
static void ToBytes(ulong value, byte[] array, int offset)
{
    byte[] valueBytes = BitConverter.GetBytes(value);
    Array.Copy(valueBytes, 0, array, offset, valueBytes.Length);
}
Or you can fill in the values yourself:
static void ToBytes(ulong value, byte[] array, int offset)
{
    for (int i = 0; i < 8; i++)
    {
        array[offset + i] = (byte)value;
        value >>= 8;
    }
}
Now that .NET has added Span<T> support for working with arrays, unmanaged memory, etc. without excess allocations, they've also added System.Buffers.Binary.BinaryPrimitives.
This works as you would want, e.g. WriteUInt64BigEndian has this signature:
public static void WriteUInt64BigEndian (Span<byte> destination, ulong value);
which avoids allocating.
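For example, a minimal sketch (assuming .NET Core 2.1+ or the System.Memory package):

```csharp
using System;
using System.Buffers.Binary;

class BinaryPrimitivesDemo
{
    static void Main()
    {
        byte[] array = new byte[10];
        // Writes the value into the existing array at offset 1; no new array is allocated.
        BinaryPrimitives.WriteUInt64BigEndian(array.AsSpan(1, 8), 0x1122334455667788UL);
        // Big-endian: the most significant byte lands first.
        Console.WriteLine(array[1].ToString("X2")); // prints "11"
    }
}
```

Little-endian variants (WriteUInt64LittleEndian etc.) and the matching Read methods exist as well, so you pick the byte order explicitly instead of inheriting the machine's.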
You say you want to avoid creating new arrays and you cannot use unsafe. Either use Ulugbek Umirov's answer with cached arrays (be careful with threading issues) or:
static void ToBytes(ulong value, byte[] array, int offset)
{
    unchecked
    {
        array[offset + 0] = (byte)(value >> (8 * 7));
        array[offset + 1] = (byte)(value >> (8 * 6));
        array[offset + 2] = (byte)(value >> (8 * 5));
        array[offset + 3] = (byte)(value >> (8 * 4));
        //...
    }
}
It seems that you wish to avoid, for some reason, creating any temporary new arrays. And you also want to avoid unsafe code.
You could pin the object and then copy to the array.
public static void ToBytes(ulong value, byte[] array, int offset)
{
    GCHandle handle = GCHandle.Alloc(value, GCHandleType.Pinned);
    try
    {
        Marshal.Copy(handle.AddrOfPinnedObject(), array, offset, 8);
    }
    finally
    {
        handle.Free();
    }
}
BinaryWriter can be a good solution.
var writer = new BinaryWriter(new MemoryStream(yourbuffer, youroffset, yourbuffer.Length-youroffset));
writer.Write(someuint64);
It's useful when you need to convert a lot of data into a buffer continuously
var writer = new BinaryWriter(new MemoryStream(yourbuffer));
foreach(var value in yourints){
writer.Write(value);
}
or when you just want to write to a file, that's the best case to use BinaryWriter.
var writer = new BinaryWriter(yourFileStream);
foreach(var value in yourints){
writer.Write(value);
}
For anyone else who stumbles across this:
If big-endian support is NOT required and unsafe code is allowed, I've written a library specifically to minimize GC allocations during serialization, for single-threaded serializers such as serializing across TCP sockets.
Note: it supports little-endian only, e.g. x86/x64 (big-endian support may be added some day).
https://github.com/tcwicks/ChillX/tree/master/src/ChillX.Serialization
In the process I also rewrote BitConverter so that it works similarly to BinaryPrimitives, but with a lot of extras, such as support for arrays as well.
https://github.com/tcwicks/ChillX/blob/master/src/ChillX.Serialization/BitConverterExtended.cs
Random rnd = new Random();
RentedBuffer<byte> buffer = RentedBuffer<byte>.Shared.Rent(BitConverterExtended.SizeOfUInt64
    + (20 * BitConverterExtended.SizeOfUInt16)
    + (20 * BitConverterExtended.SizeOfTimeSpan)
    + (10 * BitConverterExtended.SizeOfSingle));
UInt64 exampleLong = long.MaxValue;
int startIndex = 0;
startIndex += BitConverterExtended.GetBytes(exampleLong, buffer.BufferSpan, startIndex);
UInt16[] shortArray = new UInt16[20];
for (int I = 0; I < shortArray.Length; I++) { shortArray[I] = (ushort)rnd.Next(0, UInt16.MaxValue); }
// When using reflection / expression trees the CLR cannot distinguish between UInt16 and Int16, or UInt64 and Int64, etc...
// Therefore the UInt methods are renamed.
startIndex += BitConverterExtended.GetBytesUShortArray(shortArray, buffer.BufferSpan, startIndex);
TimeSpan[] timespanArray = new TimeSpan[20];
for (int I = 0; I < timespanArray.Length; I++) { timespanArray[I] = TimeSpan.FromSeconds(rnd.Next(0, int.MaxValue)); }
startIndex += BitConverterExtended.GetBytes(timespanArray, buffer.BufferSpan, startIndex);
float[] floatArray = new float[10];
for (int I = 0; I < floatArray.Length; I++) { floatArray[I] = MathF.PI * rnd.Next(short.MinValue, short.MaxValue); }
startIndex += BitConverterExtended.GetBytes(floatArray, buffer.BufferSpan, startIndex);
//Do stuff with buffer and then
buffer.Return();
//Or
buffer = null;
//and let RentedBufferContract do this automatically
The underlying problem is actually twofold. Even if you use BinaryPrimitives, we still have the issue of the byte[] buffers that we are writing to / reading from. The issue is further compounded if we have lots of array fields (T[], for example int[]).
Using the standard BitConverter leads to situations like 80% of time spent in GC.
BinaryPrimitives is much better, but we still need to manage the byte[] buffers. If we create one byte[] buffer for each int in an array of 1000 ints, that is 1000 byte[4] arrays to garbage collect, which means renting a buffer big enough for all 1000 ints (byte[4000]) from ArrayPool instead, but then we have to manage returning these buffers.
Further complicating it is the fact that we have to track the length of each buffer actually used, since ArrayPool will return arrays larger than requested: System.Buffers.ArrayPool.Shared.Rent(4000) will usually return a byte[4096] or maybe even a byte[8192].
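A quick illustration of that over-allocation behavior with the standard ArrayPool (a minimal sketch):

```csharp
using System;
using System.Buffers;

class RentDemo
{
    static void Main()
    {
        // Rent only guarantees a MINIMUM length; Shared typically hands back
        // the next power-of-two bucket, so asking for 4000 usually yields 4096.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(4000);
        Console.WriteLine(buffer.Length >= 4000); // True
        // The caller must remember how much of it was actually used,
        // and must return it when done.
        ArrayPool<byte>.Shared.Return(buffer);
    }
}
```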
This serialization code:
uses wrappers around ArrayPool in order to make its usage transparent:
https://github.com/tcwicks/ChillX/blob/master/src/ChillX.Core/Structures/ManagedPool.cs
for managing the renting and returning of ArrayPool buffers:
https://github.com/tcwicks/ChillX/blob/master/src/ChillX.Core/Structures/RentedBuffer.cs
with overloads on the + operator to automatically assign from a Span;
and for preventing memory leaks and managing use-and-forget style guaranteed return of rented buffers:
https://github.com/tcwicks/ChillX/blob/master/src/ChillX.Core/Structures/RentedBufferContract.cs
Example:
//Assume ArrayProperty_Long is a long[] array.
RentedBuffer<long> ClonedArray = RentedBuffer<long>.Shared.Rent(ArrayProperty_Long.Length);
ClonedArray += ArrayProperty_Long;
ClonedArray will be returned to ArrayPool.Shared automatically when it goes out of scope (handled by RentedBufferContract), or you can call ClonedArray.Return();
p.s. Any feedback is much appreciated.
The following benchmarks compare performance using rented buffers and this serializer implementation versus MessagePack without rented buffers. MessagePack, if micro-benchmarked on its own, is a faster serializer; the performance difference here is purely due to reducing GC collection overhead by pooling / renting buffers.
----------------------------------------------------------------------------------
Benchmarking ChillX Serializer: Test Object: Data class with 31 properties / fields of different types including multiple arrays of different types
Num Reps: 50000 - Array Size: 256
----------------------------------------------------------------------------------
Entity Size bytes: 20,678 - Count: 00,050,000 - Threads: 01 - Entities Per Second: 00,025,523 - Mbps: 503.32 - Time: 00:00:01.9589880
Entity Size bytes: 20,678 - Count: 00,050,000 - Threads: 01 - Entities Per Second: 00,026,129 - Mbps: 515.26 - Time: 00:00:01.9135886
Entity Size bytes: 20,678 - Count: 00,050,000 - Threads: 01 - Entities Per Second: 00,026,291 - Mbps: 518.46 - Time: 00:00:01.9018027
----------------------------------------------------------------------------------
Benchmarking MessagePack Serializer: Test Object: Data class with 31 properties / fields of different types including multiple arrays of different types
Num Reps: 50000 - Array Size: 256
----------------------------------------------------------------------------------
Entity Size bytes: 19,811 - Count: 00,050,000 - Threads: 01 - Entities Per Second: 00,012,386 - Mbps: 234.01 - Time: 00:00:04.0668261
Entity Size bytes: 19,811 - Count: 00,050,000 - Threads: 01 - Entities Per Second: 00,012,285 - Mbps: 232.10 - Time: 00:00:04.0523329
Entity Size bytes: 19,811 - Count: 00,050,000 - Threads: 01 - Entities Per Second: 00,012,642 - Mbps: 238.84 - Time: 00:00:03.9811276

Memory alignment of classes in c#?

(btw. This refers to 32 bit OS)
SOME UPDATES:
This is definitely an alignment issue
Sometimes the alignment (for whatever reason?) is so bad that access to the double is more than 50x slower than its fastest access.
Running the code on a 64-bit machine cuts down the issue, but I think it was still alternating between two timings (and I could get similar results by changing the double to a float on a 32-bit machine)
Running the code under Mono exhibits no issue -- Microsoft, any chance you can copy something from those Novell guys???
Is there a way to memory align the allocation of classes in c#?
The following demonstrates (I think!) the badness of not having doubles aligned correctly. It does some simple math on a double stored in a class, timing each run, with 5 timed runs on each variable before allocating a new one and starting over.
Basically the results look like you either have a fast, medium or slow memory position (on my ancient processor, these end up at around 40, 80 or 120 ms per run).
I have tried playing with StructLayoutAttribute, but have had no joy; maybe something else is going on?
class Sample
{
    class Variable { public double Value; }

    static void Main()
    {
        const int COUNT = 10000000;
        while (true)
        {
            var x = new Variable();
            for (int inner = 0; inner < 5; ++inner)
            {
                // move allocation here to allocate more often so more probably to get 50x slowdown problem
                var stopwatch = Stopwatch.StartNew();
                var total = 0.0;
                for (int i = 1; i <= COUNT; ++i)
                {
                    x.Value = i;
                    total += x.Value;
                }
                if (Math.Abs(total - 50000005000000.0) > 1)
                    throw new ApplicationException(total.ToString());
                Console.Write("{0}, ", stopwatch.ElapsedMilliseconds);
            }
            Console.WriteLine();
        }
    }
}
So I see lots of web pages about the alignment of structs for interop, but what about the alignment of classes?
(Or are my assumptions wrong, and there is another issue with the above?)
Thanks,
Paul.
An interesting look at the gears that run the machine. I have a bit of a problem explaining why there are multiple distinct values (I got 4) when a double can only be aligned in two ways. I think alignment to the CPU cache line plays a role as well, although that only adds up to 3 possible timings.
Well, there's nothing you can do about it; the CLR only promises alignment for 4-byte values, so that atomic updates on 32-bit machines are guaranteed. This is not just an issue with managed code; C/C++ has this problem too. Looks like the chip makers need to solve this one.
If it is critical, then you could allocate unmanaged memory with Marshal.AllocCoTaskMem() and use an unsafe pointer that you can align just right. It's the same kind of thing you'd have to do when allocating memory for code that uses SIMD instructions, which require a 16-byte alignment. Consider it a desperation move, though.
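A minimal sketch of that approach: over-allocate by alignment - 1 bytes, then round the pointer up (the helper below is mine, not CLR code; the rounding math is the standard power-of-two alignment trick):

```csharp
using System;
using System.Runtime.InteropServices;

static class AlignedAlloc
{
    // Allocates 'size' bytes aligned to 'alignment' (a power of two).
    // Keep the raw pointer around: it is what must be passed to FreeCoTaskMem.
    public static (IntPtr raw, IntPtr aligned) Alloc(int size, int alignment)
    {
        IntPtr raw = Marshal.AllocCoTaskMem(size + alignment - 1);
        long rounded = (raw.ToInt64() + alignment - 1) & ~(long)(alignment - 1);
        return (raw, new IntPtr(rounded));
    }
}
```

Reading or writing doubles through the aligned pointer then needs unsafe code or the Marshal class, and of course the GC knows nothing about this memory, so you must free it yourself.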
To prove the concept of misalignment of objects on the heap in .NET, you can run the following code; you'll see that it now always runs fast. Please don't shoot me, it's just a PoC, but if you are really concerned about performance you might consider using it ;)
public static class AlignedNew
{
    public static T New<T>() where T : new()
    {
        LinkedList<T> candidates = new LinkedList<T>();
        IntPtr pointer = IntPtr.Zero;
        bool continue_ = true;
        int size = Marshal.SizeOf(typeof(T)) % 8;
        while (continue_)
        {
            if (size == 0)
            {
                object gap = new object();
            }
            candidates.AddLast(new T());
            GCHandle handle = GCHandle.Alloc(candidates.Last.Value, GCHandleType.Pinned);
            pointer = handle.AddrOfPinnedObject();
            continue_ = (pointer.ToInt64() % 8) != 0 || (pointer.ToInt64() % 64) == 24;
            handle.Free();
            if (!continue_)
                return candidates.Last.Value;
        }
        return default(T);
    }
}

class Program
{
    [StructLayoutAttribute(LayoutKind.Sequential)]
    public class Variable
    {
        public double Value;
    }

    static void Main()
    {
        const int COUNT = 10000000;
        while (true)
        {
            var x = AlignedNew.New<Variable>();
            for (int inner = 0; inner < 5; ++inner)
            {
                var stopwatch = Stopwatch.StartNew();
                var total = 0.0;
                for (int i = 1; i <= COUNT; ++i)
                {
                    x.Value = i;
                    total += x.Value;
                }
                if (Math.Abs(total - 50000005000000.0) > 1)
                    throw new ApplicationException(total.ToString());
                Console.Write("{0}, ", stopwatch.ElapsedMilliseconds);
            }
            Console.WriteLine();
        }
    }
}
Maybe the StructLayoutAttribute is what you are looking for?
Using a struct instead of a class makes the time constant. Also consider using StructLayoutAttribute; it helps to specify the exact memory layout of a structure. For classes, I do not think you have any guarantees about how they are laid out in memory.
It will be correctly aligned, otherwise you'd get alignment exceptions on x64. I don't know what your snippet shows, but I wouldn't conclude anything about alignment from it.
You don't have any control over how .NET lays out your class in memory.
As others have said the StructLayoutAttribute can be used to force a specific memory layout for a struct BUT note that the purpose of this is for C/C++ interop, not for trying to fine-tune the performance of your .NET app.
If you're worried about memory alignment issues then C# is probably the wrong choice of language.
EDIT - Broke out WinDbg and looked at the heap while running the code above on 32-bit Vista and .NET 2.0.
Note: I don't get the variation in timings shown above.
0:003> !dumpheap -type Sample+Variable
Address MT Size
01dc2fec 003f3c48 16
01dc54a4 003f3c48 16
01dc58b0 003f3c48 16
01dc5cbc 003f3c48 16
01dc60c8 003f3c48 16
01dc64d4 003f3c48 16
01dc68e0 003f3c48 16
01dc6cd8 003f3c48 16
01dc70e4 003f3c48 16
01dc74f0 003f3c48 16
01dc78e4 003f3c48 16
01dc7cf0 003f3c48 16
01dc80fc 003f3c48 16
01dc8508 003f3c48 16
01dc8914 003f3c48 16
01dc8d20 003f3c48 16
01dc912c 003f3c48 16
01dc9538 003f3c48 16
total 18 objects
Statistics:
MT Count TotalSize Class Name
003f3c48 18 288 TestConsoleApplication.Sample+Variable
Total 18 objects
0:003> !do 01dc9538
Name: TestConsoleApplication.Sample+Variable
MethodTable: 003f3c48
EEClass: 003f15d0
Size: 16(0x10) bytes
(D:\testcode\TestConsoleApplication\bin\Debug\TestConsoleApplication.exe)
Fields:
MT Field Offset Type VT Attr Value Name
6f5746e4 4000001 4 System.Double 1 instance 1655149.000000 Value
This seems to me to show that the classes' allocation addresses are aligned, unless I'm reading this wrong?

What is the equivalent of memset in C#?

I need to fill a byte[] with a single non-zero value. How can I do this in C# without looping through each byte in the array?
Update: The comments seem to have split this into two questions -
Is there a Framework method to fill a byte[] that might be akin to memset
What is the most efficient way to do it when we are dealing with a very large array?
I totally agree that using a simple loop works just fine, as Eric and others have pointed out. The point of the question was to see if I could learn something new about C# :) I think Juliet's method for a Parallel operation should be even faster than a simple loop.
Benchmarks:
Thanks to Mikael Svenson: http://techmikael.blogspot.com/2009/12/filling-array-with-default-value.html
It turns out the simple for loop is the way to go unless you want to use unsafe code.
Apologies for not being clearer in my original post. Eric and Mark are both correct in their comments; need to have more focused questions for sure. Thanks for everyone's suggestions and responses.
You could use Enumerable.Repeat:
byte[] a = Enumerable.Repeat((byte)10, 100).ToArray();
The first parameter is the element you want repeated, and the second parameter is the number of times to repeat it.
This is OK for small arrays, but you should use the looping method if you are dealing with very large arrays and performance is a concern.
Actually, there is a little-known IL instruction called Initblk (English version) which does exactly that. So, let's use it in a way that doesn't require "unsafe" code. Here's the helper class:
public static class Util
{
    static Util()
    {
        var dynamicMethod = new DynamicMethod("Memset", MethodAttributes.Public | MethodAttributes.Static, CallingConventions.Standard,
            null, new[] { typeof(IntPtr), typeof(byte), typeof(int) }, typeof(Util), true);
        var generator = dynamicMethod.GetILGenerator();
        generator.Emit(OpCodes.Ldarg_0);
        generator.Emit(OpCodes.Ldarg_1);
        generator.Emit(OpCodes.Ldarg_2);
        generator.Emit(OpCodes.Initblk);
        generator.Emit(OpCodes.Ret);
        MemsetDelegate = (Action<IntPtr, byte, int>)dynamicMethod.CreateDelegate(typeof(Action<IntPtr, byte, int>));
    }

    public static void Memset(byte[] array, byte what, int length)
    {
        var gcHandle = GCHandle.Alloc(array, GCHandleType.Pinned);
        MemsetDelegate(gcHandle.AddrOfPinnedObject(), what, length);
        gcHandle.Free();
    }

    public static void ForMemset(byte[] array, byte what, int length)
    {
        for (var i = 0; i < length; i++)
        {
            array[i] = what;
        }
    }

    private static Action<IntPtr, byte, int> MemsetDelegate;
}
And what is the performance? Here are my results for Windows/.NET and Linux/Mono (different PCs).
Mono/for: 00:00:01.1356610
Mono/initblk: 00:00:00.2385835
.NET/for: 00:00:01.7463579
.NET/initblk: 00:00:00.5953503
So it's worth considering. Note that the resulting IL will not be verifiable.
Building on Lucero's answer, here is a faster version. It doubles the number of bytes copied with Buffer.BlockCopy on every iteration. Interestingly, it outperforms Lucero's by a factor of 10 for relatively small arrays (1000); for larger arrays (1000000) the difference is not that large, but it is always faster. The good thing about it is that it performs well even down to small arrays: it becomes faster than the naive approach at around length = 100, and for a one-million-element byte array it was 43 times faster.
(Tested on an Intel i7, .NET 2.0.)
public static void MemSet(byte[] array, byte value) {
    if (array == null) {
        throw new ArgumentNullException("array");
    }
    int block = 32, index = 0;
    int length = Math.Min(block, array.Length);

    // Fill the initial part of the array
    while (index < length) {
        array[index++] = value;
    }

    length = array.Length;
    while (index < length) {
        Buffer.BlockCopy(array, 0, array, index, Math.Min(block, length - index));
        index += block;
        block *= 2;
    }
}
A little bit late, but the following approach might be a good compromise without resorting to unsafe code. Basically, it initializes the beginning of the array using a conventional loop and then switches to Buffer.BlockCopy(), which should be as fast as you can get with a managed call.
public static void MemSet(byte[] array, byte value) {
    if (array == null) {
        throw new ArgumentNullException("array");
    }
    const int blockSize = 4096; // bigger may be better to a certain extent
    int index = 0;
    int length = Math.Min(blockSize, array.Length);
    while (index < length) {
        array[index++] = value;
    }
    length = array.Length;
    while (index < length) {
        Buffer.BlockCopy(array, 0, array, index, Math.Min(blockSize, length - index));
        index += blockSize;
    }
}
Looks like System.Runtime.CompilerServices.Unsafe.InitBlock now does the same thing as the OpCodes.Initblk instruction that Konrad's answer mentions (he also mentioned a source link).
The code to fill in the array is as follows:
byte[] a = new byte[N];
byte valueToFill = 255;
System.Runtime.CompilerServices.Unsafe.InitBlock(ref a[0], valueToFill, (uint) a.Length);
This simple implementation uses successive doubling, and performs quite well (about 3-4 times faster than the naive version according to my benchmarks):
public static void Memset<T>(T[] array, T elem)
{
    int length = array.Length;
    if (length == 0) return;
    array[0] = elem;
    int count;
    for (count = 1; count <= length / 2; count *= 2)
        Array.Copy(array, 0, array, count, count);
    Array.Copy(array, 0, array, count, length - count);
}
Edit: upon reading the other answers, it seems I'm not the only one with this idea. Still, I'm leaving this here, since it's a bit cleaner and it performs on par with the others.
If performance is critical, you could consider using unsafe code and working directly with a pointer to the array.
Another option could be importing memset from msvcrt.dll and using that. However, the overhead of invoking it might easily be larger than the gain in speed.
Or use P/Invoke way:
[DllImport("msvcrt.dll",
EntryPoint = "memset",
CallingConvention = CallingConvention.Cdecl,
SetLastError = false)]
public static extern IntPtr MemSet(IntPtr dest, int c, int count);
static void Main(string[] args)
{
byte[] arr = new byte[3];
GCHandle gch = GCHandle.Alloc(arr, GCHandleType.Pinned);
try {
MemSet(gch.AddrOfPinnedObject(), 0x7, arr.Length);
} finally {
gch.Free(); // always release the pinned handle
}
}
With the advent of Span<T> (which is dotnet core only, but it is the future of dotnet) you have yet another way of solving this problem:
var array = new byte[100];
var span = new Span<byte>(array);
span.Fill(255);
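Since Fill is defined on Span&lt;T&gt;, the same call also works on a sub-range via Slice. A minimal sketch (buffer size and fill values are arbitrary):

```csharp
using System;

byte[] buffer = new byte[100];
var span = new Span<byte>(buffer);
span.Fill(255);             // fill the whole buffer with 255
span.Slice(10, 20).Fill(0); // fill only elements 10..29 with 0
```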
If performance is absolutely critical, then Enumerable.Repeat(n, m).ToArray() will be too slow for your needs. You might be able to crank out faster performance using PLINQ or Task Parallel Library:
using System.Threading.Tasks;
// ...
byte initialValue = 20;
byte[] data = new byte[size];
Parallel.For(0, size, index => data[index] = initialValue);
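A hedged variant of the idea above: a per-index delegate carries significant invocation overhead, so partitioning the array into ranges and filling each range with a plain loop is usually much faster. A sketch (the 1 MB size is arbitrary):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

byte[] data = new byte[1 << 20];
byte initialValue = 20;
// Partitioner.Create(0, n) yields (fromInclusive, toExclusive) ranges,
// so each task runs a tight sequential loop over its own chunk.
Parallel.ForEach(Partitioner.Create(0, data.Length), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        data[i] = initialValue;
});
```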
All answers are writing single bytes only - what if you want to fill a byte array with words? Or floats? I find use for that now and then. So after having written similar code to 'memset' in a non-generic way a few times and arriving at this page to find good code for single bytes, I went about writing the method below.
I think PInvoke and C++/CLI each have their drawbacks. And why not have the runtime 'PInvoke' for you into mscorxxx? Array.Copy and Buffer.BlockCopy are native code certainly. BlockCopy isn't even 'safe' - you can copy a long halfway over another, or over a DateTime as long as they're in arrays.
At least I wouldn't file a new C++ project for things like this - it's almost certainly a waste of time.
So here's basically an extended version of the solutions presented by Lucero and TowerOfBricks that can be used to memset longs, ints, etc as well as single bytes.
public static class MemsetExtensions
{
static void MemsetPrivate(this byte[] buffer, byte[] value, int offset, int length) {
var shift = 0;
for (; shift < 32; shift++)
if (value.Length == 1 << shift)
break;
if (shift == 32 || value.Length != 1 << shift)
throw new ArgumentException(
"The source array must have a length that is a power of two and be shorter than 4GB.", "value");
int remainder;
int count = Math.DivRem(length, value.Length, out remainder);
var si = 0;
var di = offset;
int cx;
if (count < 1)
cx = remainder;
else
cx = value.Length;
Buffer.BlockCopy(value, si, buffer, di, cx);
if (cx == remainder)
return;
var cachetrash = Math.Max(12, shift); // 1 << 12 == 4096
si = di;
di += cx;
var dx = offset + length;
// doubling up to 1 << cachetrash bytes i.e. 2^12 or value.Length whichever is larger
for (var al = shift; al <= cachetrash && di + (cx = 1 << al) < dx; al++) {
Buffer.BlockCopy(buffer, si, buffer, di, cx);
di += cx;
}
// cx bytes as long as it fits
for (; di + cx <= dx; di += cx)
Buffer.BlockCopy(buffer, si, buffer, di, cx);
// tail part if less than cx bytes
if (di < dx)
Buffer.BlockCopy(buffer, si, buffer, di, dx - di);
}
}
Having this you can simply add short methods to take the value type you need to memset with and call the private method, e.g. just find replace ulong in this method:
public static void Memset(this byte[] buffer, ulong value, int offset, int count) {
var sourceArray = BitConverter.GetBytes(value);
MemsetPrivate(buffer, sourceArray, offset, sizeof(ulong) * count);
}
Or go silly and do it with any type of struct (although the MemsetPrivate above only works for structs that marshal to a size that is a power of two):
public static void Memset<T>(this byte[] buffer, T value, int offset, int count) where T : struct {
var size = Marshal.SizeOf<T>();
var ptr = Marshal.AllocHGlobal(size);
var sourceArray = new byte[size];
try {
Marshal.StructureToPtr<T>(value, ptr, false);
Marshal.Copy(ptr, sourceArray, 0, size);
} finally {
Marshal.FreeHGlobal(ptr);
}
MemsetPrivate(buffer, sourceArray, offset, count * size);
}
I changed the initblk version mentioned before to take ulongs so I could compare performance with my code, and it silently fails: the code runs, but the resulting buffer contains only the least significant byte of each ulong.
Nevertheless, I compared the performance of writing as big a buffer as possible with for, initblk and my memset method. The times are total milliseconds over 100 repetitions, writing 8-byte ulongs as many times as fit in the buffer length. The for version is manually loop-unrolled for the 8 bytes of a single ulong.
Buffer Len  #repeat  For (ms)   Initblk (ms)  Memset (ms)
0x00000008  100        0.0032        0.0107       0.0052
0x00000010  100        0.0037        0.0102       0.0039
0x00000020  100        0.0032        0.0106       0.0050
0x00000040  100        0.0053        0.0121       0.0106
0x00000080  100        0.0097        0.0121       0.0091
0x00000100  100        0.0179        0.0122       0.0102
0x00000200  100        0.0384        0.0123       0.0126
0x00000400  100        0.0789        0.0130       0.0189
0x00000800  100        0.1357        0.0153       0.0170
0x00001000  100        0.2811        0.0167       0.0221
0x00002000  100        0.5519        0.0278       0.0274
0x00004000  100        1.1100        0.0329       0.0383
0x00008000  100        2.2332        0.0827       0.0864
0x00010000  100        4.4407        0.1551       0.1602
0x00020000  100        9.1331        0.2768       0.3044
0x00040000  100       18.2497        0.5500       0.5901
0x00080000  100       35.8650        1.1236       1.5762
0x00100000  100       71.6806        2.2836       3.2323
0x00200000  100       77.8086        2.1991       3.0144
0x00400000  100      131.2923        4.7837       6.8505
0x00800000  100      263.2917       16.1354      33.3719
I excluded the first call every time, since both initblk and memset take a first-call hit of, I believe, about 0.22 ms. Slightly surprisingly, my code is faster at filling short buffers than initblk, given that it has half a page of setup code.
If anybody feels like optimizing this, go ahead really. It's possible.
I tested several of the approaches described in the different answers; see the sources in the C# test class.
You could do it when you initialize the array but I don't think that's what you are asking for:
byte[] myBytes = new byte[5] { 1, 1, 1, 1, 1};
.NET Core has a built-in Array.Fill() function, but sadly .NET Framework is missing it. .NET Core has two variations: fill the entire array and fill a portion of the array starting at an index.
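The two .NET Core overloads mentioned above can be used like this (a minimal sketch; the array size and fill values are arbitrary):

```csharp
using System;

int[] a = new int[8];
Array.Fill(a, 7);        // fill the entire array with 7
Array.Fill(a, 0, 2, 3);  // fill 3 elements starting at index 2 with 0
// a is now { 7, 7, 0, 0, 0, 7, 7, 7 }
```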
Building on the ideas above, here is a more generic Fill function that fills an entire array of several data types. It was the fastest function when benchmarked against the other methods discussed in this post.
This function, along with a version that fills a portion of an array, is available in an open source and free NuGet package (HPCsharp on nuget.org). Also included is a slightly faster version of Fill using SIMD/SSE instructions that performs only memory writes, whereas BlockCopy-based methods perform both memory reads and writes.
public static void FillUsingBlockCopy<T>(this T[] array, T value) where T : struct
{
int numBytesInItem = 0;
if (typeof(T) == typeof(byte) || typeof(T) == typeof(sbyte))
numBytesInItem = 1;
else if (typeof(T) == typeof(ushort) || typeof(T) == typeof(short))
numBytesInItem = 2;
else if (typeof(T) == typeof(uint) || typeof(T) == typeof(int))
numBytesInItem = 4;
else if (typeof(T) == typeof(ulong) || typeof(T) == typeof(long))
numBytesInItem = 8;
else
throw new ArgumentException(string.Format("Type '{0}' is unsupported.", typeof(T).ToString()));
int block = 32, index = 0;
int endIndex = Math.Min(block, array.Length);
while (index < endIndex) // Fill the initial block
array[index++] = value;
endIndex = array.Length;
for (; index < endIndex; index += block, block *= 2)
{
int actualBlockSize = Math.Min(block, endIndex - index);
Buffer.BlockCopy(array, 0, array, index * numBytesInItem, actualBlockSize * numBytesInItem);
}
}
Most of the answers here are for byte arrays, but if you want to fill a float array (or an array of any other struct) you have to multiply the indices by the size of your element, because Buffer.BlockCopy works in bytes.
This code works for float values:
public static void MemSet(float[] array, float value) {
if (array == null) {
throw new ArgumentNullException("array");
}
int block = 32, index = 0;
int length = Math.Min(block, array.Length);
//Fill the initial array
while (index < length) {
array[index++] = value;
}
length = array.Length;
while (index < length) {
Buffer.BlockCopy(array, 0, array, index * sizeof(float), Math.Min(block, length-index)* sizeof(float));
index += block;
block *= 2;
}
}
The Array class has a static method called Clear. I'm willing to bet that Clear is faster than any code you can write in C#. Note, however, that Clear can only reset elements to their default value (zero for bytes), so it does not help with arbitrary fill values.
You can try the following code; it works (note that Array.Clear only resets elements to zero):
byte[] test = new byte[65536];
Array.Clear(test,0,test.Length);