Best Way to Represent Large Byte Array

Best Way to Represent Large Byte Array - c#

I'm looking for the most efficient way to store and manage a large byte array in memory. I will have the need to both insert and delete bytes from any position within the array.
At first, I was thinking that a regular array was best.
byte[] buffer = new byte[ArraySize];
This would allow me to access any byte within the array. I can also resize the array. However, there doesn't appear to be any built-in support for shifting or moving items within the array.
One option is to have a loop to move items one by one but that sounds horribly inefficient in C#. Another option is to create a new array and copy bytes over to the correct position, but that requires copying all data in the array.
Is there no better option?

Actually, I just found the Buffer Class, which appears ideal for what I need.
It looks like the BlockCopy method will block copy a bunch of items and supports copying within the same array, and even correctly handles overlapping items.

I think the best option in this case is a hybrid between a regular array and a list. This would only be necessary with megabyte sized arrays though.
So you could do something like this:
List<byte[]> buffer;
And have each element of the list just a chunk of the data(say 64K or something small and manageable)
It'd require quite a bit of custom code, but would definitely be the fastest option when having to shift data around in a large array.
Also, if you're doing a lot more shifting of bytes than anything else, LinkedList<T> may work better (but it's famously bad for everything but a specific set of cases)
To clarify why this is more correct than an array, consider inserting 1 byte to the beginning of an array. You must allocate another array (double memory consumption) and then copy every byte to the new array after inserting the new byte, and then free the old array (possible heap corruption depending on size)
Consider now this method with lists.
If you have to insert a lot of bytes, you'll probably want to insert at the beginning of the buffer list. This is an O(n) operation, so your ending efficiency for this operation is O(n/CHUNK_SIZE)
Or, if you just need to insert a single byte, you can just get the first element of the list and copy the array as normal. Then, the speed is O(CHUNK_SIZE), which isn't horrible, especially if n in comparison is very large (megabytes of data)

Related

Is C# List<char[]> allocated in contiguous memory?

If I declare a List of char arrays, are they allocated in contiguous memory, or does .NET create a linked list instead?
If it's not contiguous, is there a way I can declare a contiguous list of char arrays? The size of the char arrays is know ahead of time and is fixed (they are all the same size).

Yes, but not in the way that you want. List<T> guarantees that its elements are stored contiguously.
Arrays are a reference type, so the references are stored cotiguously as List<T> guarantees. However, the arrays themselves are allocated separately and where they are stored has nothing to do with the list. It is only concerned with its elements, the references.
If you require that then you should simply use one large array and maintain boundary data.
EDIT: Per your comment:
The inner arrays are always 9 chars.
So, in this case, cache coherency may be an issue because the sub-arrays are so small. You'll be jumping around a lot in memory getting from one array to the next, and I'll just take you on your word about the performance sensitivity of this code.
Just use a multi-dimensional if you can. This of course assumes you know the size or that you can impose a maximum size on it.
Is it possible to trade some memory to reduce complexity/time and just set a max size for N? Using a multi-dimensional array (but don't use the latter) is the only way you can guarantee contiguous allocation.
EDIT 2:
Trying to keep the answer in sync with the comments. You say that the max size of the first dimension is 9! and, as before, the size of the second dimension is 9.
Allocate it all up front. You're trading some memory for time. 9! * 9 * 2 / 1024 / 1024 == ~6.22MB.
As you say, the List may grow to that size anyway, so worst case you waste a few MB of memory. I don't think it's going to be an issue unless you plan on running this code in a toaster oven. Just allocate the buffer as one array up front and you're good.

List functions as a dynamic array, not a linked list, but this is beside the point. No memory will be allocated for the char[]s until they themselves are instantiated. The List is merely responsible for holding references to char[]s, of which it will contain none when first created.
If it's not contiguous, is there a way I can declare a contiguous list of char arrays? The size of the char arrays is know ahead of time and is fixed (they are all the same size).
No, but you could instantiate a 2-dimensional array of chars, if you also know how many char arrays there would have been:
char[,] array = new char[x, y];

Resizing big arrays

I have an array which size is like 2 GB (filled with audio samples). Now I want to apply a filter for that array. This filter is generating like 50% more samples than input source. So now I need to create new array which size is 3 GB. Now I gave 5 GB of memory used. But if this filter can operate only at that source array and only need some more space in this array.
Question: can I allocate a memory in C# that can be resized w/o creating a second memory block, then removing that first one?
I just thought, If memory in PC's is divided into 4 kB pages (or more), so why C# cannot (?) use that good feature?

If your filter can work in-place just allocate 50% more space at the beginning. All you need to know is the actual length of the original sample.
If that code doesn't work always and you don't want to consume more memory beforehand, you can allocate half of the original array (the extension array) and check which part your access relates to:
byte[] myOriginalArray = new byte[2GB]; // previously allocated
byte[] myExtensionArray = new byte[1GB]; // 50% of the original
for(... my processing code of the array ...)
{
byte value = read(index);
... process the index and the value here
store(index, value);
}
byte read(int index)
{
if(index < 2GB) return myOriginalArray[index];
return myExtensionArray[index - 2GB];
}
void store(int index, byte value)
{
if(index < 2GB) myOriginalArray[index] = value;
myExtensionArray[index - 2GB] = value;
}
You add index check and subtraction overhead for each access to the array. That could also be made smarter for certain cases. For instance for the portion you do not need to access extension you can use your faster loop and for the part where you need to write to extension part you can use the slower version (two consecutive loops).

Question: can I allocate a memory in C# that can be resized w/o creating a second memory block, then removing that first one?
No, you cannot resize an array in .NET. If you want to increase the size of an array you will have to create a new and bigger array and copy all the data from the existing array to the new array.
To get around this problem you could provide your own "array" implementation based on allocating smaller chunks of memory but presenting it as one big buffer of data. An example of this is StringBuilder that is based on an implementation of chunks of characters, each chunk being a separate Char[] array.
Another option is to use P/Invoke to get access to low level memory management functions like VirtualAlloc that allows you to reserve pages of memory in advance. You need to do this in a 64 bit process because the virtual address space of a 32 bit process is only 4 GB. You probably also need to work with unsafe code and pointers.

How to build struct with variable 1 to 4 bytes in one value?

What I try to do:
I want to store very much data in RAM. For faster access and less memory footprint I need to use an array of struct values:
MyStruct[] myStructArray = new MyStruct[10000000000];
Now I want to store unsigned integer values with one, two, three or four bytes in MyStruct. But it should only use the less possible memory amount. When I store a value it one byte it should only use one byte and so on.
I could implement this with classes, but that is inappropriate here because the pointer to the object would need 8 bytes on a 64bit system. So it would be better to store just 4 bytes for every array entry. But I want to only store/use one/two/three byte when needed. So I can't use some of the fancy classes.
I also can't use one array with one bytes, one array with two bytes and so on, because I need the special order of the values. And the values are very mixed, so storing an additional reference when to switch to the other array would not help.
Is it possible what want or is the only way to just store an array of 4 byte uints regardless I only need to store one byte, two byte in about 60% of the time and three bytes in about 25% of the time?

This is not possible. How would the CLR process the following expression?
myStructArray[100000]
If the elements are of variable size, the CLR cannot know the address of the 100000th element. Therefore array elements are of fixed size, always.
If you don't require O(1) access, you can implement variable-length elements on top of a byte[] and search the array yourself.
You could split the list into 1000 sublists, which are packed individually. That way you get O(n/2000) search performance on average. Maybe that is good enough in practice.
A "packed" array can only be searched in O(n/2) on average. But if your partial arrays are 1/1000th the size, it becomes O(n/2000). You can pick the partial array in O(1) because they all would be of the same size.
Also, you can adjust the number of partial arrays so that they are individually about 1k elements in size. At that point the overhead of the array object and reference to it vanish. That would give you O(1000/2 + 1) lookup performance which I think is quite an improvement over O(n/2). It is a constant-time lookup (with a big constant).

You could get close to that what you want if you are willing to sacrifice some additional CPU time and waste additional 2 or 4 bits per one stored value.
You could just use byte byte[] and combine it with BitArray collection. In byte[] you would then just sequentially store one, two, three or four bytes and in BitArray denote in binary form (pairs of two bits) or just put a bit to value 1 to denote a new set of bytes have just started (or ended, however you implement it) in your data array.
However you could get something like this in memory:
byte[] --> [byte][byte][byte][byte][byte][byte][byte]...
BitArray --> 1001101...
Which means you have 3 byte, 1 byte, 2 bytes etc. values stored in your byte array.
Or you could alternatively encode your bitarray as binary pairs to make it even smaller. This means you would vaste somewhere between 1.0625 and 1.25 bytes per your actual data byte.
It depends on your actual data (your MyStruct) if this will suffice. If you need to distinguish to which values in your struct those bytes really corresponds, you could waste some additional bits in BitArray.
Update to your O(1) requirement:
Use another index structure which would store one index for each N elements, for example 1000. You could then for example access item with index 234241 as
indexStore[234241/1000]
which gives you index of element 234000, then you just calculate the exact index of element 234241 by examining those few hundred elements in BitArray.
O(const) is acheieved this way, const can be controlled with density of main index, of course you trade time for space.

You can't do it.
If the data isn't sorted, and there is nothing more you can say about it, then you are not going to be able to do what you want.
Simple scenario:
array[3]
Should point to some memory address. But, how would you know what are dimensions of array[0]-array[2]? To store that information in an O(1) fashion, you would only waste MORE memory than you want to save in the first place.
You are thinking out of the box, and that's great. But, my guess is that this is the wrong box that you are trying to get out of. If your data is really random, and you want direct access to every array member, you'll have to use MAXIMUM width that is needed for your number for every number. Sorry.
I had one similar situation, with having numbers of length smaller than 32 bits that I needed to store. But they were all fixed width, so I was able to solve that, with custom container and some bit shifting.
HOPE:
http://www.dcc.uchile.cl/~gnavarro/ps/spire09.3.pdf
Maybe you can read it, and you'll be able not only to have 8, 16, 24, 32 bit per number, but ANY number size...

I'd almost start looking at some variant of short-word encoding like a PkZip program.
Or even RLE encoding.
Or try to understand the usage of your data better. Like, if these are all vectors or something, then there are certain combinations that are disallowed like, -1,-1,-1 is basically meaningless to a financial graphing application, as it denotes data outsides the graphable range. If you can find some oddities about your data, you may be able to reduce the size by having different structures for different needs.

How can I determine the sizeof for an unmanaged array in C#?

I'm trying to optimize some code where I have a large number of arrays containing structs of different size, but based on the same interface. In certain cases the structs are larger and hold more data, othertimes they are small structs, and othertimes I would prefer to keep null as a value to save memory.
My first question is. Is it a good idea to do something like this? I've previously had an array of my full data struct, but when testing mixing it up I would virtually be able to save lots of memory. Are there any other downsides?
I've been trying out different things, and it seams to work quite well when making an array of a common interface, but I'm not sure I'm checking the size of the array correctly.
To simplified the example quite a bit. But here I'm adding different structs to an array. But I'm unable to determine the size using the traditional Marshal.SizeOf method. Would it be correct to simply iterate through the collection and count the sizeof for each value in the collection?
IComparable[] myCollection = new IComparable[1000];
myCollection[0] = null;
myCollection[1] = (int)1;
myCollection[2] = "helloo world";
myCollection[3] = long.MaxValue;
System.Runtime.InteropServices.Marshal.SizeOf(myCollection);
The last line will throw this exception:
Type 'System.IComparable[]' cannot be marshaled as an unmanaged structure; no meaningful size or offset can be computed.
Excuse the long post:
Is this an optimal and usable solution?
How can I determine the size
of my array?

I may be wrong but it looks to me like your IComparable[] array is a managed array? If so then you can use this code to get the length
int arrayLength = myCollection.Length;
If you are doing platform interop between C# and C++ then the answer to your question headline "Can I find the length of an unmanaged array" is no, its not possible. Function signatures with arrays in C++/C tend to follow the following pattern
void doSomeWorkOnArrayUnmanaged(int * myUnmanagedArray, int length)
{
// Do work ...
}
In .NET the array itself is a type which has some basic information, such as its size, its runtime type etc... Therefore we can use this
void DoSomeWorkOnManagedArray(int [] myManagedArray)
{
int length = myManagedArray.Length;
// Do work ...
}
Whenever using platform invoke to interop between C# and C++ you will need to pass the length of the array to the receiving function, as well as pin the array (but that's a different topic).
Does this answer your question? If not, then please can you clarify

Optimality always depends on your requirements. If you really need to store many elements of different classes/structs, your solution is completely viable.
However, I guess your expectations on the data structure might be misleading: Array elements are per definition all of the same size. This is even true in your case: Your array doesn't store the elements themselves but references (pointers) to them. The elements are allocated somewhere on the VM heap. So your data structure actually goes like this: It is an array of 1000 pointers, each pointer pointing to some data. The size of each particular element may of course vary.
This leads to the next question: The size of your array. What are you intending to do with the size? Do you need to know how many bytes to allocate when you serialize your data to some persistent storage? This depends on the serialization format... Or do you need just a rough estimate on how much memory your structure is consuming? In the latter case you need to consider the array itself and the size of each particular element. The array which you gave in your example consumes approximately 1000 times the size of a reference (should be 4 bytes on a 32 bit machine and 8 bytes on a 64 bit machine). To compute the sizes of each element, you can indeed iterate over the array and sum up the size of the particular elements. Please be aware that this is only an estimate: The virtual machine adds some memory management overhead which is hard to determine exactly...

C# Increasing an array by one element at the end

In my program I have a bunch of growing arrays where a new element is grown one by one to the end of the array. I identified Lists to be a speed bottleneck in a critical part of my program due to their slow access time in comparison with an array - switching to an array increased performance tremendously to an acceptable level. So to grow the array i'm using Array.Resize. This works well as my implementation restricts the array size to approximately 20 elements, so the O(N) performance of Array.Resize is bounded.
But it would be better if there was a way to just increase an array by one element at the end without having to use Array.Resize; which I believe does a copy of the old array to the newly sized array.
So my question is, is there a more efficiant method for adding one element to the end of an array without using List or Array.Resize?

A List has constant time access just like an array. For 'growing arrays' you really should be using List.
When you know that you may be adding elements to an array backed structure, you don't want to add one new size at a time. Usually it is best to grow an array by doubling it's size when it fills up.

As has been previously mentioned, List<T> is what you are looking for. If you know the initial size of the list, you can supply an initial capacity to the constructor, which will increase your performance for your initial allocations:
List<int> values = new List<int>(5);
values.Add(1);
values.Add(2);
values.Add(3);
values.Add(4);
values.Add(5);

List's allocate 4 elements to begin with (unless you specify a capacity when you construct it) and then grow every 4 elements.
Why don't you try a similar thing with Array? I.e. create it as having 4 elements, then when you insert the fifth element, first grow the array by another 4 elements.

There is no way to resize an array, so the only way to get a larger array is to use Array.Resize to create a new array.
Why not just create the arrays to have 20 elements from start (or whatever capacity you need at most), and use a variable to keep track of how many elements are used in the array? That way you never have to resize any arrays.

Growing an array AFAIK means that a new array is allocated, the existing content being copied to the new instance. I doubt that this should be faster than using List...?

it's much faster to resize an array in chunks (like 10) and store this as a seperate variable e.g capacity and then only resize the array when the capacity is reached. This is how a list works but if you prefer to use arrays then you should look into resizing them in larger chunks especially if you have a large number of Array.Resize calls

I think that every method, that wants to use array, will not be ever optimized because an array is a static structure so I think it's better to use dynamic structures like List or others.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.