Strange data corruption with Buffer.BlockCopy C#

BACKGROUND
I am writing a C# program which collects some information by data acquisition. It's quite complex so I won't detail it all here, but the data acquisition is instigated continuously and then, on an asynchronous thread, my program periodically visits the acquisition buffer and takes 100 samples from it. I then look inside the 100 samples for a trigger condition which I am interested in. If I see the trigger condition I collect a bunch of samples from a pre-trigger buffer, a bunch more from a post-trigger buffer, and assemble it all together into one 200-element array.
In my asynchronous thread I assemble my 200-element array (of type double) using the Buffer.BlockCopy method. The only specific reason I chose to use this method is that I need to be careful about how much data processing I do in my asynchronous thread; if I do too much I can end up over-filling the acquisition buffer because I am not visiting it often enough. Since Buffer.BlockCopy is much more efficient at pushing data from a source array into a destination array than a big 'for loop', that's the sole reason I decided to use it.
THE QUESTION
When I call the Buffer.BlockCopy method I do this:
Buffer.BlockCopy(newData, 0, myPulse, numSamplesfromPreTrigBuf, trigLocation * sizeof(double));
Where:
newData is a double[] array containing new data (100 elements, with typical values like 0.0034 and 6.4342, ranging from 0 to 7).
myPulse is the destination array. It is instantiated with 200 elements.
numSamplesfromPreTrigBuf is an offset that I want to apply in this particular instance of the copy.
trigLocation is the number of elements I want to copy in this particular instance.
The copy occurs without error, but the data written into myPulse is all screwed up; numbers such as -2.05E-289 and 5.72E+250. Either tiny numbers or massive numbers. These numbers do not occur in my source array.
I have resolved the issue simply by using Array.Copy() instead, with no other source-code modification except for removing the need to calculate the number of elements to copy by multiplying by sizeof(double). But I did spend two hours trying to debug the Buffer.BlockCopy() method with absolutely no idea why the copy is garbage.
Would anybody have an idea, from my example usage of Buffer.BlockCopy (which I believe is the correct usage), how garbage data might be copied across?

I assume your offset is wrong - it's also a byte-offset, so you need to multiply it by sizeof(double), just like with the length.
Be careful about using BlockCopy and similar methods - you lose some of the safety of .NET. Unlike outright unsafe methods, it does check array bounds, but you can still produce some pretty weird results (and I assume you could e.g. produce invalid references - a big problem EDIT: fortunately, BlockCopy only works on primitive typed arrays).
Also, BlockCopy isn't thread-safe, so you want to synchronize access to the shared buffer, if you're accessing it from more than one thread at a time.
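To make the byte-offset point concrete, here is a minimal sketch (with stand-in values for the poster's variables; the exact garbage printed by the buggy call will vary) contrasting the buggy call with the corrected one:

```csharp
using System;

class BlockCopyOffsetDemo
{
    static void Main()
    {
        // Hypothetical stand-ins for the poster's buffers.
        double[] newData = { 0.0034, 6.4342, 1.25, 3.5 };
        double[] myPulse = new double[200];
        int numSamplesfromPreTrigBuf = 10; // intended ELEMENT offset into myPulse
        int trigLocation = 4;              // number of elements to copy

        // BUG: an element offset is passed where a BYTE offset is expected.
        // The copy lands at byte 10, i.e. mid-way through myPulse[1], so
        // elements 1..5 end up holding partially-overwritten garbage doubles.
        Buffer.BlockCopy(newData, 0, myPulse, numSamplesfromPreTrigBuf,
                         trigLocation * sizeof(double));
        Console.WriteLine(myPulse[1]); // some tiny or huge nonsense value

        // FIX: scale the destination offset to bytes, just like the count.
        Array.Clear(myPulse, 0, myPulse.Length);
        Buffer.BlockCopy(newData, 0, myPulse,
                         numSamplesfromPreTrigBuf * sizeof(double),
                         trigLocation * sizeof(double));
        Console.WriteLine(myPulse[10]); // 0.0034
    }
}
```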

Indeed, Buffer.BlockCopy allows the source Array and destination Array to have different element types, so long as each element type is primitive. Either way, as you can see from the mscorlib.dll source code for ../vm/comutilnative.cpp, the copy is just a direct imaging operation which never interprets the copied bytes in any way (i.e., as 'logical' or 'numeric' values). It basically calls the C-language classic, memmove. So don't expect this:
var rgb = new byte[] { 1, 2, 3 };
var rgl = new long[3];
Buffer.BlockCopy(rgb, 0, rgl, 0, 3); // likely ERROR: integers never widened or narrowed
// INTENTION?: rgl = { 1, 2, 3 }
// RESULT: rgl = { 0x0000000000030201, 0, 0 }
Now given that Buffer.BlockCopy takes just a single count argument, allowing differently-sized element types introduces the fundamental semantic ambiguity of whether that single count argument would be counted in terms of source or destination elements. Solutions to this might include:
Add a second count argument so you'd have one each for src and dst; (no...)
Arbitrarily select src vs. dst for expressing count--and document the choice; (no...)
Always express count in bytes, since element size "1" is the (only) common denominator that's suitable to arbitrary different-sized element types. (yes)
Since (1.) is complex (possibly adding even more confusion), and the arbitrary symmetry-breaking of (2.) is poorly self-documenting, the choice taken here was (3.), meaning the count argument must always be specified in bytes.
Because the situation for the srcOffset and dstOffset arguments isn't as critical (on account of there being independent arguments for each 'offset', whereby each could be indexed relative to its respective Array; spoiler alert: ...they aren't), it's less widely mentioned that these arguments are also always expressed in bytes. From the documentation (emphasis added):
Buffer.BlockCopy (https://learn.microsoft.com/en-us/dotnet/api/system.buffer.blockcopy), Parameters:
src Array The source buffer.
srcOffset Int32 The zero-based byte offset into src.
dst Array The destination buffer.
dstOffset Int32 The zero-based byte offset into dst.
count Int32 The number of bytes to copy.
The fact that the srcOffset and dstOffset are byte-offsets leads to the strange situations under discussion on this page. For one thing, it entails that the copy of the first and/or last element can be partial:
var rgb = new byte[] { 0xFF, 1, 2, 3, 4, 5, 6, 7, 8, 0xFF };
var rgl = new long[10];
Buffer.BlockCopy(rgb, 1, rgl, 1, 8); // likely ERROR: does not target rgl[1], but
// rather *parts of* both rgl[0] and rgl[1]
// INTENTION? (see above) rgl = { 0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 0L } ✘
// INTENTION? (little-endian) rgl = { 0L, 0x0807060504030201L, 0L, ... } ✘
// INTENTION? (big-endian) rgl = { 0L, 0x0102030405060708L, 0L, ... } ✘
// ACTUAL RESULT: rgl = { 0x0706050403020100L, 8L, 0L, 0L, ... } ?
// ^^-- this byte not copied (see text)
Here, we see that instead of (perhaps) copying something into rgl[1], the element at index 1, and then (maybe) continuing on from there, the copy targeted byte-offset 1 within the first element rgl[0], and led to a partial copy--and surely unintended corruption--of that element. Specifically, byte 0 of rgl[0]--the least-significant-byte of a little-endian long--was not copied.
Continuing with the example, the long value at index 1 is also incompletely written, this time storing value '8' into its least-significant-byte, notably without affecting its other (upper) 7 bytes.
Because I didn't craft my example well enough to explicitly show it, let me be clear about this last point: for these partially-copied long values, the parts that are not copied are not zeroed out as might normally be expected from a proper long store of a byte value. So for the discussion of Buffer.BlockCopy, "partially-copied" means that the un-copied bytes of any multi-byte primitive (e.g. long) value are retained unaltered from before the operation, and thus become "merged" into the new value in some endianness-dependent--and thus likely (and hopefully) unintentional--manner.
To "fix" the example code, the offset supplied for each Array must be pre-multiplied by its respective element size to convert it to a byte offset. This will "correct" the above code to the only sensible operation Buffer.​BlockCopy might reasonably perform here, namely a little-endian copy between (one or more) source and (one or more) destination elements, taking care to ensure that no element is partially- or incompletely-copied, respective to its size.
Buffer.BlockCopy(rgb, 1 * sizeof(byte), rgl, 1 * sizeof(long), 8); // CORRECTED (?)
// CORRECT RESULT: rgl = { 0L, 0x0807060504030201L, 0L, ... } ✔
// repaired code shows a proper little-endian store of eight consecutive bytes from a
// byte[] into exactly one complete element of a long[].
In the fixed example, 8 complete byte elements are copied to 1 complete long element. For simplicity, this is a "many-to-1" copy, but you can imagine more elaborate scenarios as well (not shown). In fact, with respect to element count from source-to-destination, a single call to Buffer.BlockCopy can deploy any of five operational patterns: { nop, 1-to-1, 1-to-many, many-to-1, many-to-many }.
The code also illustrates how concerns of endianness are implicated by Buffer.BlockCopy accepting arrays of differently-sized elements. Indeed, the repaired example now inherently carries a correctness dependency on the endianness of the CPU on which it happens to be running. Combine this with the scarcity of realistic use-cases and the very real, error-prone partial-copying hazard discussed above, and the conclusion is that mixing source/destination element sizes within a single call to Buffer.BlockCopy, while allowed by the API, should be avoided. In any case, use mixed element sizes with special caution, if at all.
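As a footnote to the five patterns listed above, a "1-to-many" copy can be sketched like this (a minimal example; the printed byte order assumes a little-endian CPU):

```csharp
using System;

class OneToManyDemo
{
    static void Main()
    {
        // One 8-byte long element fans out into eight byte elements.
        long[] src = { 0x0807060504030201L };
        byte[] dst = new byte[8];

        Buffer.BlockCopy(src, 0, dst, 0, sizeof(long));

        // On a little-endian CPU this prints 01-02-03-04-05-06-07-08.
        Console.WriteLine(BitConverter.ToString(dst));
    }
}
```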

Related

Does primitive array expects integer as index

Should primitive array content be accessed by int for best performance?
Here's an example
int[] arr = new int[] { 1, 2, 3, 4, 5 };
The array is only 5 elements in length, so the index doesn't have to be an int; a short or byte would do, and using a byte instead of an int would save 3 otherwise-useless bytes. Of course, that only works if I know the array won't grow beyond 255 elements.
byte index = 1;
int value = arr[index];
But does this work as well as it sounds?
I'm worried about how this is executed at a lower level: does the index get cast to int, or are there other operations that would actually slow the whole process down instead of optimizing it?
In C and C++, arr[index] is formally equivalent to *(arr + index). Your concerns about casting should be answerable in terms of the simpler question about what the machine will do when it needs to add an integer offset to a pointer.
I think it's safe to say that on most modern machines, when you add a "byte" to a pointer, it's going to use the same instruction as it would if you added a 32-bit integer to a pointer. And indeed it's still going to represent that byte using the machine word size, padded with some unused space. So this isn't going to make using the array faster.
Your optimization might make a difference if you need to store millions of these indices in a table; then using byte instead of int would use 4 times less memory and take less time to move that memory around. If the array you are indexing is huge, and the index needs to be larger than the machine word size, then that's a different consideration. But I think it's safe to say that in most normal situations this optimization doesn't really make sense, and size_t is probably the most appropriate generic type for array indices, all things being equal (since it corresponds exactly to the machine word size on the majority of architectures).
does index gets casted to int or other operations which would actually slow down the whole process instead of this optimizing it
No, but
that would save useless 3 byte memory allocation
You don't gain anything by saving 3 bytes.
Only if you are storing a huge array of those indices then the amount of space you would save might make it a worthwhile investment.
Otherwise stick with a plain int, it's the processor's native word size and thus the fastest.
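As a quick C# sanity check (a minimal sketch), the snippet below shows that a byte index compiles and works; the language simply widens the byte to int before the element address is computed, so there is nothing to gain at the access site:

```csharp
using System;

class ByteIndexDemo
{
    static void Main()
    {
        int[] arr = { 1, 2, 3, 4, 5 };

        byte index = 1;
        // The byte is implicitly converted (widened) to int before the
        // element address is computed; the JIT emits the same address
        // arithmetic as for a plain int index.
        int value = arr[index];
        Console.WriteLine(value); // 2
    }
}
```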

Processing 1000's of parameters a second in C# by quickly converting data types from byte to other types

I have asked this question over the last 2 years and am still looking for a good way of doing this. What I am doing is as follows:
I have a WPF/C# application which has been developed over the last 3 years. It takes a real time stream of bytes over a UDP port. Each record set is 1000 bytes. I am getting 100 of these byte records per second. I am reading the data and processing it for display in various formats. These logical records are sub-commutated.
The first 300 bytes are the same in each logical record and contain a mixture of Byte, Int16, UInt16, Int32 and UInt32 values. About 70% of these values are eventually multiplied by a least-significant-bit (LSB) scale factor to create a Double. These parameters are always the same.
The second 300 bytes are another mixture of Byte, Int16, UInt16, Int32 and UInt32 values. Again about 70% of these values are multiplied by an LSB to create a Double. These parameters are again always the same.
The last segment is 400 bytes and sub-commutated. This means that the last part of the record contains 1 of 20 different logical record formats. I call them Type01...Type20 data. There is an identifier byte which tells me which one it is. These again contain Byte, Int, UInt data values which need to be converted.
I am currently using hundreds of function calls to process this data. Each function call takes the 1000 byte array as a parameter, an offset (index) into the byte array to where the parameter starts. It then uses the BitConverter.ToXXX call to convert the bytes to the correct data type, and then if necessary multiply by an LSB to create the final data value and return it.
I am trying to streamline this processing because the data stream are changing based on the source. For instance one of the new data sources (feeds) changes about 20 parameters in the first 300 bytes, about 24 parameters in the second 300 bytes and several in the last sub-commutated 400 bytes records.
I would like to build a data dictionary where the dictionary contains the logical record number (type of data), offset into the record, LSB of data, type of data to be converted to (Int16, UInt32, etc) and finally output type (Int32, Double, etc). Maybe also include the BitConverter function to use and "cast it dynamically"?
This appears to be an exercise in using template classes and possibly delegates, but I do not know how to do this. I would appreciate some code as an example.
The data is also recorded so playback may run at 2x, 4x, 8x, 16x speeds. Now before someone comments on how you can look at thousands of parameters at those speeds, it is not as hard as one may think. Some types of data such as green background for good, red for bad; or plotting map positions (LAT/LON) over time lend themselves very well for fast playback to find interesting events. So it is possible.
Thanks in advance for any help.
I am not sure others have an idea of what I am trying to do so I thought I would post a small segment of source code to see if anyone can improve on it.
Like I said above, the data comes in byte streams. Once it is read in a Byte Array it looks like the following:
Byte[] InputBuffer = { 0x01, 0x00, 0x4F, 0xEB, 0x06, 0x00, 0x17, 0x00,
0x00, 0x00, ... };
The first 2 bytes are an ushort which equals 1. This is the record type for this particular record. This number can range from 1 to 20.
The next 4 bytes are a uint which equals 453,455. This value is the number of tenths of a second; in this case it represents 12:35:45.5. To arrive at this I would make the following call to the following subroutine:
labelTimeDisplay.Content = TimeField(InputBuffer, 2, .1).ToString();
public Double TimeField(Byte[] InputBuffer, Int32 Offset, Double lsb)
{
    return BitConverter.ToUInt32(InputBuffer, Offset) * lsb;
}
The next data field is the software version, in this case 23
labelSoftwareVersion.Content = SoftwareVersion(InputBuffer, 6).ToString();
public UInt16 SoftwareVersion(Byte[] InputBuffer, Int32 Offset)
{
    return BitConverter.ToUInt16(InputBuffer, Offset);
}
The next data field is the System Status Word another UInt16.
Built-In-Test status bits are passed to other routines if any of the 16 bits are set to logic 1.
UInt16 CheckStatus = SystemStatus(InputBuffer, 8);
public UInt16 SystemStatus(Byte[] InputBuffer, Int32 Offset)
{
    return BitConverter.ToUInt16(InputBuffer, Offset);
}
I literally have over a thousand individual subroutines to process the data stored in the byte array. The byte array is always a fixed length of 1000 bytes. The first 6 bytes are always the same, identifier and time. After that the parameters are different for every frame.
I have some major modifications coming the software which will redefine many of the parameters for the next software version. I still have to support the old software versions so the software just gets more complicated. My goal is to find a way to process the data using a dictionary lookup. That way I can just create the dictionary and read the dictionary to know how to process the data. Maybe use loops to load the data into a collection and then bind it to the display fields.
Something like this:
public class ParameterDefinition
{
    String ParameterNumber;
    String ParameterName;
    Int32 Offset;
    Double lsb;
    Type ReturnDataType;
    Type BaseDataType;
}

private ParameterDefinition[] parms = new ParameterDefinition[]
{
    new ParameterDefinition("0000", "RecordID", 0, 0.0, typeof(UInt16), typeof(UInt16)),
    new ParameterDefinition("0001", "Time",     2, 0.1, typeof(Double), typeof(UInt32)),
    new ParameterDefinition("0002", "SW ID",    6, 0.0, typeof(UInt16), typeof(UInt16)),
    new ParameterDefinition("0003", "Status",   8, 0.0, typeof(UInt16), typeof(UInt16)),
    // Lots more parameters
};
My bottom-line problem is getting the parameter definitions to cast or select the right functions. I cannot find a way to link the "dictionary" to actual data outputs.
Thanks for any help
Using a data dictionary to represent the data structure is fine, as long as you don't walk the dictionary for each individual record. Instead, use Reflection Emit or Expression trees to build a delegate that you can call many many times.
It sounds like you are manually deserializing a byte stream, where the bytes represent various data types. That problem has been solved before.
Try defining a class that represents the first 600 bytes and deserializing it using a Protocol Buffer serializer (one implementation is by SO's own Marc Gravell, and there is a different implementation by top SO contributor Jon Skeet).
Protocol buffers are language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols and data storage. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format.
Source, as well as a 3rd implementation I have not personally used.
For the last 400 bytes, create appropriate class definitions for the appropriate formats, and again use protocol buffers to deserialize an appropriate class.
For the final touch-ups (e.g. converting values to doubles) you can either post-process the classes, or just have a getter that returns the appropriate final number.
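To make the expression-tree suggestion concrete, here is a hedged sketch (all names are hypothetical; only the offsets, LSB idea, and BitConverter calls come from the question). It compiles one Func<byte[], double> per parameter definition up front, so the per-record hot path runs compiled code and never touches reflection:

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

// Hypothetical helper: builds one compiled accessor per parameter.
class ParameterCompiler
{
    public static Func<byte[], double> Compile(int offset, double lsb, string bitConverterMethod)
    {
        // e.g. bitConverterMethod = "ToUInt32" or "ToUInt16"
        MethodInfo convert = typeof(BitConverter).GetMethod(
            bitConverterMethod, new[] { typeof(byte[]), typeof(int) });

        ParameterExpression buffer = Expression.Parameter(typeof(byte[]), "buffer");

        // BitConverter.ToXxx(buffer, offset)
        Expression raw = Expression.Call(convert, buffer, Expression.Constant(offset));

        // (double)raw * lsb; an lsb of 0 in the question's tables seems to
        // mean "no scaling", so treat it as 1.0 here (an assumption).
        double scale = lsb == 0.0 ? 1.0 : lsb;
        Expression scaled = Expression.Multiply(
            Expression.Convert(raw, typeof(double)),
            Expression.Constant(scale));

        return Expression.Lambda<Func<byte[], double>>(scaled, buffer).Compile();
    }
}

class Demo
{
    static void Main()
    {
        // First 8 bytes of the example record from the question.
        byte[] record = { 0x01, 0x00, 0x4F, 0xEB, 0x06, 0x00, 0x17, 0x00 };

        var time = ParameterCompiler.Compile(2, 0.1, "ToUInt32");
        var swVersion = ParameterCompiler.Compile(6, 0.0, "ToUInt16");

        Console.WriteLine(time(record));      // 45345.5 (seconds)
        Console.WriteLine(swVersion(record)); // 23
    }
}
```

The delegates can be built once from the data dictionary at startup (one per ParameterDefinition) and stored alongside it, so switching feeds only means loading a different dictionary.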

How to build struct with variable 1 to 4 bytes in one value?

What I try to do:
I want to store very much data in RAM. For faster access and less memory footprint I need to use an array of struct values:
MyStruct[] myStructArray = new MyStruct[10000000000];
Now I want to store unsigned integer values of one, two, three or four bytes in MyStruct, but it should use the least possible amount of memory: when I store a value in one byte it should only use one byte, and so on.
I could implement this with classes, but that is inappropriate here because the pointer to the object would need 8 bytes on a 64bit system. So it would be better to store just 4 bytes for every array entry. But I want to only store/use one/two/three byte when needed. So I can't use some of the fancy classes.
I also can't use one array with one bytes, one array with two bytes and so on, because I need the special order of the values. And the values are very mixed, so storing an additional reference when to switch to the other array would not help.
Is what I want possible, or is the only way to just store an array of 4-byte uints regardless, even though one or two bytes would suffice about 60% of the time and three bytes about 25% of the time?
This is not possible. How would the CLR process the following expression?
myStructArray[100000]
If the elements are of variable size, the CLR cannot know the address of the 100000th element. Therefore array elements are of fixed size, always.
If you don't require O(1) access, you can implement variable-length elements on top of a byte[] and search the array yourself.
You could split the list into 1000 sublists, which are packed individually. That way you get O(n/2000) search performance on average. Maybe that is good enough in practice.
A "packed" array can only be searched in O(n/2) on average. But if your partial arrays are 1/1000th the size, it becomes O(n/2000). You can pick the partial array in O(1) because they all would be of the same size.
Also, you can adjust the number of partial arrays so that they are individually about 1k elements in size. At that point the overhead of the array object and reference to it vanish. That would give you O(1000/2 + 1) lookup performance which I think is quite an improvement over O(n/2). It is a constant-time lookup (with a big constant).
You could get close to what you want if you are willing to sacrifice some additional CPU time and waste an additional 2 or 4 bits per stored value.
You could just use a byte[] and combine it with a BitArray collection. In the byte[] you would sequentially store one, two, three or four bytes per value, and in the BitArray denote, either in binary form (pairs of two bits) or by simply setting a bit to 1, that a new set of bytes has just started (or ended, however you implement it) in your data array.
However you could get something like this in memory:
byte[] --> [byte][byte][byte][byte][byte][byte][byte]...
BitArray --> 1001101...
Which means you have 3 byte, 1 byte, 2 bytes etc. values stored in your byte array.
Or you could alternatively encode your BitArray as binary pairs to make it even smaller. This means you would waste somewhere between 1.0625 and 1.25 bytes per actual data byte.
It depends on your actual data (your MyStruct) whether this will suffice. If you need to distinguish which values in your struct those bytes really correspond to, you could spend some additional bits in the BitArray.
Update to your O(1) requirement:
Use another index structure which would store one index for each N elements, for example 1000. You could then for example access item with index 234241 as
indexStore[234241/1000]
which gives you index of element 234000, then you just calculate the exact index of element 234241 by examining those few hundred elements in BitArray.
O(const) is achieved this way; the constant can be controlled with the density of the main index. Of course, you trade space for time.
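A minimal sketch of the start-bit scheme described above (names and layout are illustrative; a List<bool> stands in for the BitArray, and the lookup below walks the bits linearly rather than using the coarse index):

```csharp
using System;
using System.Collections.Generic;

// Illustrative packed list: values occupy 1-4 bytes in a byte store,
// and a parallel bit list marks the first byte of each value with a 1.
class PackedUIntList
{
    private readonly List<byte> data = new List<byte>();
    private readonly List<bool> startBits = new List<bool>();

    public void Add(uint value)
    {
        bool first = true;
        do
        {
            data.Add((byte)(value & 0xFF));
            startBits.Add(first); // 1 marks the first byte of a new value
            first = false;
            value >>= 8;
        } while (value != 0);
    }

    public uint Get(int index)
    {
        // Walk the start bits to find the index-th value. O(n) here;
        // the coarse index structure described above makes this O(const).
        int i = 0, seen = -1;
        while (seen < index) { if (startBits[i++]) seen++; }
        int pos = i - 1;

        uint result = 0;
        int shift = 0;
        do
        {
            result |= (uint)data[pos] << shift;
            shift += 8;
            pos++;
        } while (pos < data.Count && !startBits[pos]);
        return result;
    }
}

class Demo
{
    static void Main()
    {
        var list = new PackedUIntList();
        list.Add(5);     // stored in 1 byte
        list.Add(70000); // stored in 3 bytes
        list.Add(260);   // stored in 2 bytes
        Console.WriteLine(list.Get(1)); // 70000
    }
}
```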
You can't do it.
If the data isn't sorted, and there is nothing more you can say about it, then you are not going to be able to do what you want.
Simple scenario:
array[3]
It should point to some memory address. But how would you know the sizes of array[0] through array[2]? To store that information in an O(1) fashion, you would waste MORE memory than you wanted to save in the first place.
You are thinking out of the box, and that's great. But, my guess is that this is the wrong box that you are trying to get out of. If your data is really random, and you want direct access to every array member, you'll have to use MAXIMUM width that is needed for your number for every number. Sorry.
I had a similar situation, with numbers smaller than 32 bits that I needed to store. But they were all fixed-width, so I was able to solve it with a custom container and some bit shifting.
HOPE:
http://www.dcc.uchile.cl/~gnavarro/ps/spire09.3.pdf
Maybe you can read it, and you'll be able not only to have 8, 16, 24, 32 bit per number, but ANY number size...
I'd almost start looking at some variant of short-word encoding like a PkZip program.
Or even RLE encoding.
Or try to understand the usage of your data better. Like, if these are all vectors or something, then there are certain combinations that are disallowed like, -1,-1,-1 is basically meaningless to a financial graphing application, as it denotes data outsides the graphable range. If you can find some oddities about your data, you may be able to reduce the size by having different structures for different needs.

Array of shortened integers

Just to avoid reinventing the wheel, I am asking here...
I have an application with lots of arrays, and it is running out of memory.
So the thought is to compress the List<int> to something else, that would have same interface (IList<T> for example), but instead of int I could use shorter integers.
For example, if my value range is 0 - 1.000.000 I need only log2(1.000.000) ≈ 20 bits. So instead of storing 32 bits, I can trim the excess and reduce memory requirements by 12/32 = 37.5%.
Do you know of an implementation of such array. c++ and java would be also OK, since I could easily convert them to c#.
Additional requirements (since everyone is starting to getting me OUT of the idea):
integers in the list ARE unique
they have no special property so they aren't compressible in any other way then reducing the bit count
if the value range is one million for example, lists would be from 2 to 1000 elements in size, but there will be plenty of them, so no BitSets
new data container should behave like re-sizable array (regarding method O()-ness)
EDIT:
Please don't tell me NOT to do it. The requirement for this is well thought-over, and it is the ONLY option that is left.
Also, 1M of value range and 20 bit for it is ONLY AN EXAMPLE. I have cases with all different ranges and integer sizes.
Also, I could have even shorter integers, for example 7 bit integers, then packing would be
00000001
11111122
22222333
33334444
444.....
for first 4 elements, packed into 5 bytes.
Almost done coding it - will be posted soon...
Since you can only allocate memory in byte quanta, you are essentially asking if/how you can fit the integers in 3 bytes instead of 4 (but see 3. below). This is not a good idea.
1. Since there is no 3-byte sized integer type, you would need to use something else (e.g. an opaque 3-byte buffer) in its place. This would require that you wrap all access to the contents of the list in code that performs the conversion so that you can still put "ints" in and pull "ints" out.
2. Depending on both the architecture and the memory allocator, requesting 3-byte chunks might not affect the memory footprint of your program at all (it might simply litter your heap with unusable 1-byte "holes").
3. Reimplementing the list from scratch to work with an opaque byte array as its backing store would avoid the two previous issues (and it can also let you squeeze every last bit of memory instead of just whole bytes), but it's a tall order and quite prone to error.
You might want instead to try something like:
Not keeping all this data in memory at the same time. At 4 bytes per int, you'd need to reach hundreds of millions of integers before memory runs out. Why do you need all of them at the same time?
Compressing the dataset by not storing duplicates if possible. There are bound to be a few of them if you are up to hundreds of millions.
Changing your data structure so that it stores differences between successive values (deltas), if that is possible. This might not be very hard to achieve, but you can only realistically expect something in the ballpark of a 50% improvement (which may not be enough), and it will totally destroy your ability to index into the list in constant time.
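The delta idea above can be sketched as follows (illustrative only, and it assumes the values are stored in sorted order so that the deltas stay small):

```csharp
using System;
using System.Linq;

class DeltaDemo
{
    static void Main()
    {
        // Sorted unique values: deltas are small even when values are large,
        // so they could be stored in fewer bytes each.
        int[] values = { 1000003, 1000007, 1000019, 1000033 };

        int[] deltas = new int[values.Length];
        deltas[0] = values[0];
        for (int i = 1; i < values.Length; i++)
            deltas[i] = values[i] - values[i - 1];

        Console.WriteLine(string.Join(", ", deltas)); // 1000003, 4, 12, 14

        // Reconstruction is a running sum, which is why random access
        // degrades to O(n).
        int[] restored = new int[deltas.Length];
        int sum = 0;
        for (int i = 0; i < deltas.Length; i++)
            restored[i] = sum += deltas[i];
        Console.WriteLine(restored.SequenceEqual(values)); // True
    }
}
```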
One option that will get you from 32 bits to 24 bits is to create a custom struct that stores an integer inside of 3 bytes:
public struct Entry {
    byte b1; // low
    byte b2; // middle
    byte b3; // high

    public void Set(int x) {
        b1 = (byte)x;
        b2 = (byte)(x >> 8);
        b3 = (byte)(x >> 16);
    }

    public int Get() {
        return (b3 << 16) | (b2 << 8) | b1;
    }
}
You can then just create a List<Entry>.
var list = new List<Entry>();
var e = new Entry();
e.Set(12312);
list.Add(e);
Console.WriteLine(list[0].Get()); // outputs 12312
This reminds me of base64 and similar kinds of binary-to-text encoding.
They take 8 bit bytes and do a bunch of bit-fiddling to pack them into 4-, 5-, or 6-bit printable characters.
This also reminds me of the Zork Standard Code for Information Interchange (ZSCII), which packs 3 letters into 2 bytes, where each letter occupies 5 bits.
It sounds like you want to take a bunch of 10- or 20-bit integers and pack them into a buffer of 8-bit bytes.
The source code is available for many libraries that handle a packed array of single bits.
Perhaps you could
(a) download that source code and modify the source (starting from some BitArray or other packed encoding), recompiling to create a new library that handles packing and unpacking 10- or 20-bit integers rather than single bits.
It may take less programming and testing time to
(b) write a library that, from the outside, appears to act just like (a), but internally it breaks up 20-bit integers into 20 separate bits, then stores them using an (unmodified) BitArray class.
Edit: Given that your integers are unique you could do the following: store unique integers until the number of integers you're storing is half the maximum number. Then switch to storing the integers you don't have. This will reduce the storage space by 50%.
Might be worth exploring other simplification techniques before trying to use 20-bit ints.
How do you treat duplicate integers? If you have lots of duplicates you could reduce the storage size by storing the integers in a Dictionary<int, int> where keys are unique integers and values are the corresponding counts. Note this assumes you don't care about the order of your integers.
Are your integers all unique? Perhaps you're storing lots of unique integers in the range 0 to 100 mil. In this case you could try storing the integers you don't have. Then when determining if you have an integer i just ask if it's not in your collection.

How can I determine the sizeof for an unmanaged array in C#?

I'm trying to optimize some code where I have a large number of arrays containing structs of different sizes, but based on the same interface. In certain cases the structs are larger and hold more data, other times they are small structs, and other times I would prefer to keep null as a value to save memory.
My first question is: is it a good idea to do something like this? I previously had an array of my full data struct, but when testing a mixed approach it looked like I could save a lot of memory. Are there any other downsides?
I've been trying out different things, and it seems to work quite well when making an array of a common interface, but I'm not sure I'm checking the size of the array correctly.
To simplify the example quite a bit: here I'm adding different values to an array, but I'm unable to determine the size using the traditional Marshal.SizeOf method. Would it be correct to simply iterate through the collection and sum the size of each value in it?
IComparable[] myCollection = new IComparable[1000];
myCollection[0] = null;
myCollection[1] = (int)1;
myCollection[2] = "helloo world";
myCollection[3] = long.MaxValue;
System.Runtime.InteropServices.Marshal.SizeOf(myCollection);
The last line will throw this exception:
Type 'System.IComparable[]' cannot be marshaled as an unmanaged structure; no meaningful size or offset can be computed.
Excuse the long post:
Is this an optimal and usable solution?
How can I determine the size of my array?
I may be wrong but it looks to me like your IComparable[] array is a managed array? If so then you can use this code to get the length
int arrayLength = myCollection.Length;
If you are doing platform interop between C# and C++ then the answer to your question headline "Can I find the length of an unmanaged array" is no, it's not possible. Function signatures with arrays in C/C++ tend to follow this pattern:
void doSomeWorkOnArrayUnmanaged(int* myUnmanagedArray, int length)
{
    // Do work ...
}
In .NET the array itself is a type which carries some basic information, such as its size and its runtime type. Therefore we can use this:
void DoSomeWorkOnManagedArray(int[] myManagedArray)
{
    int length = myManagedArray.Length;
    // Do work ...
}
Whenever using platform invoke to interop between C# and C++ you will need to pass the length of the array to the receiving function, as well as pin the array (but that's a different topic).
Does this answer your question? If not, then please can you clarify
Optimality always depends on your requirements. If you really need to store many elements of different classes/structs, your solution is completely viable.
However, I guess your expectations of the data structure might be misleading: array elements are by definition all of the same size. This is even true in your case: your array doesn't store the elements themselves but references (pointers) to them. The elements are allocated somewhere on the VM heap. So your data structure actually goes like this: it is an array of 1000 pointers, each pointer pointing to some data. The size of each particular element may of course vary.
This leads to the next question: The size of your array. What are you intending to do with the size? Do you need to know how many bytes to allocate when you serialize your data to some persistent storage? This depends on the serialization format... Or do you need just a rough estimate on how much memory your structure is consuming? In the latter case you need to consider the array itself and the size of each particular element. The array which you gave in your example consumes approximately 1000 times the size of a reference (should be 4 bytes on a 32 bit machine and 8 bytes on a 64 bit machine). To compute the sizes of each element, you can indeed iterate over the array and sum up the size of the particular elements. Please be aware that this is only an estimate: The virtual machine adds some memory management overhead which is hard to determine exactly...
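A rough sketch of the "iterate and sum" estimate described above (all per-type sizes here are assumptions; boxing, object headers, and string overhead mean the CLR's real footprint is larger):

```csharp
using System;

class SizeEstimateDemo
{
    static void Main()
    {
        IComparable[] myCollection = new IComparable[1000];
        myCollection[1] = (int)1;
        myCollection[2] = "helloo world";
        myCollection[3] = long.MaxValue;

        // Start with the reference slots of the array itself.
        long estimate = (long)myCollection.Length * IntPtr.Size;

        // Add a guessed payload size per element. This deliberately
        // ignores object headers and boxing overhead.
        foreach (IComparable item in myCollection)
        {
            if (item is int) estimate += sizeof(int);
            else if (item is long) estimate += sizeof(long);
            else if (item is string s) estimate += sizeof(char) * s.Length;
        }

        Console.WriteLine(estimate + " bytes (rough lower bound)");
    }
}
```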
