I need a collection type for received bytes in my socket application (which deals with ~5k concurrent connections).
I tried using a List<byte>, but since it keeps everything in one internal array and I receive lots of data, it can cause OutOfMemoryExceptions.
So I need a collection that:
Keeps the data in smaller blocks, like an unrolled linked list.
Provides fast lookup (preferably IList<T>), because after each receive operation I look for a delimiter that marks the end of the message.
What I use right now is a Stream: I supply a MemoryStream for operations that don't involve too much data, and a FileStream over a temporary file for operations that involve serious amounts of data.
A MemoryStream is no different from a List<T> in this respect, though, and I'd prefer not to use files as buffers.
So...
What collection or approach do you recommend?
It appears that you are using an inappropriate architecture for a network application. You should buffer only the data that is actually required; here you are using a list to buffer everything until the required amount of data has been received.
I would recommend checking for the delimiter in each chunk of data as it arrives, and if it is present, pushing in only the data up to the delimiter. Once a complete message is ready, pull it out of the list, use it, and discard the list. Appending everything to the list is not a good approach and will certainly consume a lot of memory.
Ideally, you should have a protocol that tells you, before the data itself arrives, how much data you are going to receive. That way you can be sure when the required data has been received and you don't have to rely on the delimiter at all.
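For illustration, here is a rough sketch of the length-prefixed approach; the 4-byte little-endian length header and the method names are my own assumptions, not part of any particular protocol:

    using System;
    using System.IO;
    using System.Net.Sockets;

    // Sketch: read a 4-byte length header, then read exactly that many payload bytes.
    static byte[] ReceiveMessage(Socket socket)
    {
        byte[] header = ReceiveExact(socket, 4);
        int length = BitConverter.ToInt32(header, 0);
        return ReceiveExact(socket, length);
    }

    static byte[] ReceiveExact(Socket socket, int count)
    {
        byte[] buffer = new byte[count];
        int received = 0;
        while (received < count)
        {
            int read = socket.Receive(buffer, received, count - received, SocketFlags.None);
            if (read == 0)
                throw new EndOfStreamException("Connection closed before the full message arrived.");
            received += read;
        }
        return buffer;
    }

This way the receive side never has to grow an unbounded list; it only ever allocates a buffer of the size the sender announced.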
A possible quick and dirty solution:
At the start of the program, allocate a buffer large enough for the largest amount of data you will receive. Use a separate 'count' field to keep track of how much data is currently in use.
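A minimal sketch of that idea (the maximum message size and the class name are assumptions):

    using System;

    // Sketch: one buffer per connection, allocated once, plus a count of the bytes in use.
    class ReceiveBuffer
    {
        private const int MaxMessageSize = 64 * 1024;    // assumed upper bound

        private readonly byte[] _buffer = new byte[MaxMessageSize];
        private int _count;                              // bytes of _buffer currently in use

        public void Append(byte[] data, int offset, int length)
        {
            if (_count + length > _buffer.Length)
                throw new InvalidOperationException("Message exceeds the preallocated buffer.");
            Buffer.BlockCopy(data, offset, _buffer, _count, length);
            _count += length;
        }

        public void Reset() => _count = 0;
    }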
(I don't really like this solution; I'd use files or find some way of working with the data in blocks, but it might work for you).
Related
Okay, so I'm trying to make a multiplayer game with self-built netcode in Unity3D using C#.
The thing is, since I'm using raw TCP I need to convert everything to a byte[], and I've gotten sick of using Array.Copy, since I reserve a few bytes of every message sent over the network as a sort of message identifier that I can use to interpret the data I receive.
So my question is: to make this code friendlier to work with, is it a terrible idea to use a List<byte> instead of a byte array, and once I've prepared the message to be sent, just call .ToArray() on that list?
Would this be terrible for performance?
As a general rule when dealing with sockets: you should usually be working with oversized arrays (ArrayPool<byte>.Shared can be useful here), and then use the Send overloads that accept either byte[], int, int (offset + count) or ArraySegment<byte>, so you aren't constantly re-copying things and can re-use buffers. However, frankly, you may also want to look into "pipelines"; this is the newer IO API that Kestrel uses, and it deals with all the buffer management for you, so you don't have to. In the core framework, pipelines is currently mostly server-focused, but this is hopefully being improved in .NET 5 as part of "Bedrock"; in the meantime, client-side pipelines wrappers are available in Pipelines.Sockets.Unofficial (on NuGet).
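For example, a rough sketch of the rent/send/return pattern; the 1-byte message identifier and the method name are just assumptions for illustration:

    using System;
    using System.Buffers;
    using System.Net.Sockets;

    // Sketch: rent an oversized buffer, write the identifier plus payload into it,
    // send only the used portion, then hand the buffer back to the pool.
    static void SendMessage(Socket socket, byte messageId, byte[] payload, int payloadLength)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(payloadLength + 1);
        try
        {
            buffer[0] = messageId;                                   // assumed 1-byte identifier
            Buffer.BlockCopy(payload, 0, buffer, 1, payloadLength);  // payload after the header
            socket.Send(buffer, 0, payloadLength + 1, SocketFlags.None);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }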
To be explicit: no, constantly calling ToArray() on a List<byte> is not a good way of making buffers more convenient; that will probably work, but could contribute to GC stalls, plus it is unnecessary overhead in your socket code.
I wonder what the semantic difference is between Read and Load (in C#). I don't see a difference when comparing e.g.
System.IO.MemoryStream.Read()
System.Console.Read()
System.IO.StreamReader.Read()
System.IO.File.ReadAllText()
vs
System.Xml.XmlDocument.Load()
System.Xml.Linq.XDocument.Load()
System.Reflection.Assembly.Load()
Since I want consistent naming across my program, which deals a lot with simply getting files from persistent storage as well as higher-level functions that also initialize, cross-reference, and error-check, I kindly ask for your input.
In your examples, "Read" generally refers to reading a portion of the data. Whether this is for the purpose of limiting the amount of data that needs to be stored and/or handled in a given operation, or because the data itself is not immediately available in its entirety (e.g. Console.Read() or reading from a network stream), the fundamental behavior is the same: data is processed in pieces smaller than the entire set of data that can or will be processed.
There is the exception ReadAllText(), which does in fact read all of the data at once. But that's in a type where all the other methods that perform similarly also use the word "Read". Using "Read" in that context keeps the API consistent, and failing to use "Load" doesn't significantly hinder comprehension of the API (especially since the method name also explicitly states "All Text"…no one should be surprised to see all of the text read in that case, right? :) ).
In your examples that use "Load", they consume all of the data at once, and turn it into something else, e.g. an XML DOM or an assembly. This is a distinctly different kind of operation from just reading data and at most doing minimal processing on it (e.g. decoding some text format). In contrast to "Read" operations, "Load" will always consume all of the data, rather than allowing the option of reading just a portion at a time.
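To make the distinction concrete, a small sketch (the file names are placeholders):

    using System.IO;
    using System.Xml.Linq;

    // "Read": consume the data a piece at a time from an open stream.
    using (FileStream stream = File.OpenRead("data.bin"))
    {
        byte[] buffer = new byte[4096];
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // only the bytesRead bytes currently in 'buffer' are processed here
        }
    }

    // "Load": consume everything at once and build a higher-level object from it.
    XDocument doc = XDocument.Load("data.xml");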
Read APIs are about:
Reading data in smaller units, such as bytes or characters
Having a cursor that is mostly forward-only, like a DataReader
Staying connected to the source and reading from it
Working fine for all kinds of data, though possibly a costly option if a live connection is maintained over a long period
Needing consistent connectivity throughout the process
ADO.NET's connected architecture is an example
Load APIs, on the other hand, are about:
Loading all the data into memory in one go
Opening the connection, reading everything, and closing it, with no live connection maintained afterwards
Letting you apply logic to the data and move forward/backward through it in memory
Working fine for smaller sets, but possibly struggling with larger sets due to memory and network requirements
Letting the data, once loaded, be worked on at your convenience over a period of time, since it is disconnected data
ADO.NET's disconnected architecture, DataSet, DataTable, and IEnumerable are valid examples
I am playing around with the idea of using Dokan, the user-mode file system helper library, to create a file system around the new OneDrive API that has recently come out. I realise that there are existing tools to do this but I want to try and make an optimised version around read-only media streaming from the service for use with Plex.
I want to optimise the experience by having some kind of buffer, somewhere between 20 and 50 megabytes, to smooth things out when the internet connection being used to access the OneDrive API is variable, and I also want to download chunks of data from OneDrive in parallel to maximise the bandwidth of the connection.
All this means that when Dokan requests bytes from the file residing on OneDrive, it will essentially be reading from this buffer whilst I simultaneously write into it, potentially with gaps in the data where the parallel segments haven't completed in order, and I am unsure of the most efficient way of doing this!
I figured the simplest way would be to allocate a large byte array, start filling it up, and then, as Dokan reads data from it, do a Buffer.BlockCopy to shift the remaining data down, essentially discarding what has been read and leaving room at the end for more data to be downloaded and filled in. The process would try to keep the buffer at its maximum size as the whole file is streamed through, so that to the consumer it appears to be a local file.
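Roughly, what I have in mind is something like this (just a sketch, the names are made up, and it ignores the out-of-order gaps for now):

    using System;

    class SlidingBuffer
    {
        private readonly byte[] _buffer;
        private int _count;                    // bytes currently held

        public SlidingBuffer(int capacity) => _buffer = new byte[capacity];

        // The downloader appends newly arrived bytes at the end.
        public void Append(byte[] data, int length)
        {
            Buffer.BlockCopy(data, 0, _buffer, _count, length);
            _count += length;
        }

        // A Dokan read request: copy bytes out, then shift the remainder to the front.
        public int Read(byte[] destination, int length)
        {
            int toCopy = Math.Min(length, _count);
            Buffer.BlockCopy(_buffer, 0, destination, 0, toCopy);

            // Array.Copy handles overlapping ranges, so this compacts in place.
            Array.Copy(_buffer, toCopy, _buffer, 0, _count - toCopy);
            _count -= toCopy;
            return toCopy;
        }
    }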
Would continuously BlockCopying a large byte array like this be a terrible way to do it, and is there an established pattern for achieving something like this more easily in .NET? I have been researching for a few hours now but can't find any examples that are trying to do the same thing I am, even though I would have thought it would be quite common!
Any suggestions or examples you can think of would be very gratefully received, thank you!
I need to be able to look up this data quickly and need access to all of it. Unfortunately, I also need to conserve memory (several of these will cause OutOfMemoryExceptions):
short[,,] data = new short[8000,8000,2];
I have attempted the following:
Tried a jagged array - same memory problems.
Tried breaking it into smaller arrays - still get memory issues.
Is the only resolution to map this data efficiently using a memory-mapped file, or is there some other way to do this?
How about a database? After all, they are made for this.
I'd suggest you take a look at a NoSQL database. Depending on your needs, there are also in-memory databases (which obviously could suffer from the same out-of-memory problem) and databases that can be copy-deployed or linked into your application.
I wouldn't want to mess with the storage details manually, and memory-mapping files is what some databases (at least MongoDB) are doing internally. So essentially, you'd be rolling your own DB, and writing a database is not trivial -- even if you narrow down the use case.
Redis or Membase sound like suitable alternatives for your problem. As far as I can see, both are able to manage the RAM utilization for you, that is, read data from the disk as needed and cache data in RAM for fast access. Of course, your access patterns will play a role here.
Keep in mind that a lot of effort went into building these DBs. According to Wikipedia, Zynga is using Membase and Redis is sponsored by VMWare.
Are you sure you need access to all of it all of the time? ...or could you load a portion of it, do your processing, then move on to the next?
Could you get away with using mip-mapping or LoD representations if it's just height data? Both of those could allow you to hold lower resolutions until you need to load up specific chunks of the higher resolution data.
How much free memory do you have on your machine? What operating system are you using? Is it 64 bit?
If you're doing memory / processing intensive operations, have you considered implementing those parts in C++ where you have greater control over such things?
It's difficult to help you much further without knowing some more specifics of your system and what you're actually doing with your data...
I wouldn't recommend a traditional relational database if you're doing numeric calculations with this data. I suspect what you're running into here isn't the size of the data itself, but rather a known problem in .NET called large object heap fragmentation. If you're running into problems after allocating these buffers frequently (even though they should be getting garbage collected), this is likely your culprit. Your best solution is to keep as many buffers as you need pre-allocated and re-use them, to prevent the reallocation and subsequent fragmentation.
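As a rough sketch of what I mean by re-using buffers (the class and method names are hypothetical):

    using System;

    // Sketch: allocate the large buffer once and reuse it across runs, instead of
    // newing it up each time; repeated large allocations fragment the large object heap.
    class GridProcessor
    {
        private readonly short[,,] _data = new short[8000, 8000, 2];   // allocated exactly once

        public void ProcessNextBatch()
        {
            Array.Clear(_data, 0, _data.Length);   // reset in place rather than reallocating
            // ... fill and work with _data ...
        }
    }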
How are you interacting with this large multidimensional array? Are you using recursion? If so, make sure your recursive methods are passing parameters by reference rather than by value.
On a side note, do you need 100% of this data accessible at the same time? The best way to deal with large volumes of data is usually via a stream or some kind of reader object. Try to deal with the data in segments. I've got a few processes that deal with gigs' worth of data, and they can process it in a small amount of memory because of how I stream it in via a SqlDataReader.
TL;DR: look at how you pass data between your function calls (ref), and maybe use streaming patterns to deal with the data in smaller chunks.
Hope that helps!
Elements of a short[] (or short[,,]) are stored as 16-bit values; it's individual short fields and locals that the CLR widens to 32 bits. So at 8000 x 8000 x 2 elements of 2 bytes each, roughly 256 MB, the array above is already about as compact as a purely in-memory representation gets. What you can do then is:
Use a 64-bit machine. Then you can allocate a lot of memory and the OS will take care of paging the data to disk for you if you run out of RAM (make sure you have a large enough page file). In that case you can address up to 8 terabytes of data (if you have a large enough disk).
Read parts of the data from disk as you need them, either manually using file IO or using memory-mapped files.
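For the memory-mapped route, a rough sketch (the file name and the flattened indexing scheme are assumptions):

    using System.IO;
    using System.IO.MemoryMappedFiles;

    // Sketch: back the 8000 x 8000 x 2 grid of shorts with a memory-mapped file so the
    // OS pages data in and out instead of the whole grid living in managed memory.
    long totalBytes = 8000L * 8000L * 2L * sizeof(short);   // roughly 256 MB

    using (var mmf = MemoryMappedFile.CreateFromFile("grid.bin", FileMode.OpenOrCreate, null, totalBytes))
    using (var accessor = mmf.CreateViewAccessor(0, totalBytes))
    {
        // flatten (x, y, z) manually: offset = ((x * 8000 + y) * 2 + z) * sizeof(short)
        long offset = ((4000L * 8000 + 4000) * 2 + 1) * sizeof(short);
        accessor.Write(offset, (short)123);
        short value = accessor.ReadInt16(offset);
    }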
There are some text files (records) which I need to access using C#.NET. The problem is that those files are larger than 1 GB (the minimum size is 1 GB).
What do I need to do?
What are the factors I need to concentrate on?
Can someone give me an idea of how to overcome this situation?
EDIT:
Thanks for the fast responses. Yes, they are fixed-length records. These text files come from a local company (their last month's transaction records).
Is it possible to access these files like normal text files (using a normal file stream)?
and
How about the memory management?
Expanding on CasperOne's answer
Simply put, there is no way to reliably put a 100 GB file into memory at one time. On a 32-bit machine there is simply not enough addressing space. On a 64-bit machine there is enough addressing space, but in the time it would take to actually get the file into memory, your user will have killed your process out of frustration.
The trick is to process the file incrementally. The base System.IO.Stream class is designed to process a variable (and possibly infinite) stream of data in distinct quantities. It has several Read methods that only progress down the stream by a specific number of bytes. You will need to use these methods to divide up the stream.
I can't give more information because your scenario is not specific enough. Can you give us more details, your record delimiters, or some sample lines from the file?
Update
If they are fixed-length records, then System.IO.Stream will work just fine. You can use File.Open() to get access to the underlying Stream object. Stream.Read has an overload that takes the number of bytes to read from the file. Since these are fixed-length records, this should work well for your scenario.
As long as you don't call ReadAllText() and instead use the Stream.Read() methods that take explicit byte arrays, memory won't be an issue. The underlying Stream class will take care not to put the entire file into memory (unless, of course, you ask it to :) ).
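A quick sketch of that (the record size and file name are assumptions):

    using System.IO;

    const int RecordSize = 128;                    // assumed fixed record length
    byte[] record = new byte[RecordSize];

    using (FileStream stream = File.Open("transactions.dat", FileMode.Open, FileAccess.Read))
    {
        while (true)
        {
            // Stream.Read may return fewer bytes than requested, so fill each record in a loop.
            int filled = 0;
            while (filled < RecordSize)
            {
                int read = stream.Read(record, filled, RecordSize - filled);
                if (read == 0) break;              // end of file
                filled += read;
            }
            if (filled < RecordSize) break;        // no more complete records

            // process the single record currently held in 'record'
        }
    }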
You aren't specifically listing the problems you need to overcome. A file can be 100GB and you can have no problems processing it.
If you have to process the file as a whole then that is going to require some creative coding, but if you can simply process sections of the file at a time, then it is relatively easy to move to the location in the file you need to start from, process the data you need to process in chunks, and then close the file.
More information here would certainly be helpful.
What are the main problems you are having at the moment? The big thing to remember is to think in terms of streams - i.e. keep the minimum amount of data in memory that you can. LINQ is excellent at working with sequences (although there are some buffering operations you need to avoid, such as OrderBy).
For example, here's a way of handling simple records from a large file efficiently (note the iterator block).
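A simplified sketch of the idea (assuming one record per line; the iterator block means only the current line is ever held in memory):

    using System.Collections.Generic;
    using System.IO;

    static IEnumerable<string> ReadRecords(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;   // hand back one record at a time, lazily
            }
        }
    }

    // Usage: the file is only read as the sequence is enumerated.
    // foreach (string record in ReadRecords("huge-file.txt")) { /* process record */ }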
For performing multiple aggregates/analysis over large data from files, consider Push LINQ in MiscUtil.
Can you add more context to the problems you are thinking of?
Expanding on JaredPar's answer.
If the file is a binary file (i.e. ints stored as 4 bytes, fixed-length strings, etc.) you can use the BinaryReader class. It is easier than pulling out n bytes and then trying to interrogate them yourself.
Also note that the Read method on System.IO.Stream does not block until the requested number of bytes has been read: if you ask for 100 bytes it may return fewer than that, yet still not have reached the end of the file.
The BinaryReader.ReadBytes method, by contrast, will block until it has read the requested number of bytes or reached the end of the file, whichever comes first.
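For example (the record layout here, a 4-byte id followed by a fixed 32-byte field, is purely an assumption):

    using System.IO;

    // Sketch: BinaryReader pulls typed values straight out of the stream.
    using (var reader = new BinaryReader(File.OpenRead("records.bin")))
    {
        while (reader.BaseStream.Position < reader.BaseStream.Length)
        {
            int id = reader.ReadInt32();          // 4-byte int field
            byte[] name = reader.ReadBytes(32);   // fixed-length 32-byte field
            // ReadBytes blocks until 32 bytes are read or the end of the stream is reached
        }
    }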
Nice collaboration lads :)
Hey guys, I realize this post hasn't been touched in a while, but I just wanted to post a site that has a solution to your problem.
http://thedeveloperpage.wordpress.com/c-articles/using-file-streams-to-write-any-size-file-introduction/
Hope it helps!
-CJ