Complications when adding and removing objects in a large collection in C#


A couple of days ago I was asked the following question. I have a collection that holds historical data, that is, a very large number of objects. The collection is exposed to many threads or clients: some may be iterating over it while others add to it or remove from it. Modifying a collection while it is being iterated normally throws an InvalidOperationException ("Collection was modified; enumeration operation may not execute") in C#.
Now I need to design a data structure or collection in C# that meets the following requirements:
The collection cannot be copied to a different object, because it is very large and copying it would waste a lot of memory.
If a user adds an item while the collection is being iterated, the new item should be added without throwing an exception, and the running iteration should also reach it at the end.
If a user removes an item while the collection is being iterated, an exception should be thrown.
Adding, removing and iterating should be thread safe, with no race conditions.

If you really insist on using a collection and not a database, none of the regular .NET lists will be a good option. In that case you could create your own list type that is optimised for your situation.
Which techniques you could use depends on other details (can you work with paging, do you need to access items by index, etc.).
One idea is to keep one mutable list and create an immutable copy only when it changes. All the clients then use the latest immutable copy.
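A minimal sketch of the snapshot idea above, assuming the System.Collections.Immutable package is available. ImmutableList<T> shares structure between versions, so a change does not duplicate the whole collection. SnapshotList<T> and its members are hypothetical names. Note that this does not meet every requirement in the question: a running iteration keeps the snapshot it started with, so it does not see later additions, and a removal during iteration does not throw.

```csharp
using System;
using System.Collections.Immutable;
using System.Threading;

// Hypothetical wrapper: writers atomically swap in a new immutable version,
// readers iterate whichever version they grabbed.
public sealed class SnapshotList<T>
{
    private ImmutableList<T> _snapshot = ImmutableList<T>.Empty;

    public void Add(T item)
    {
        // Retries the transformation until the swap succeeds without interference.
        ImmutableInterlocked.Update(ref _snapshot, list => list.Add(item));
    }

    public void Remove(T item)
    {
        ImmutableInterlocked.Update(ref _snapshot, list => list.Remove(item));
    }

    // Readers enumerate this value; it never changes after it has been read.
    public ImmutableList<T> Snapshot
    {
        get { return Volatile.Read(ref _snapshot); }
    }
}
```

A reader would write foreach (var item in list.Snapshot) { ... }; calling Add or Remove inside that loop, or from another thread, does not invalidate the enumeration.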

Related

Safeguarding against user error when saving a list of information

I have a private List<Experience> experiences; that tracks generic experiences and experience specific information. I am using Json Serialize and Deserialize to save and load my list. When you start the application the List populates itself with the current saved information automatically and when a new experience is added to the list it saves the new list to file.
A concern that is popping into my head that I would like to get ahead of is, there is nothing that would stop the user from at any point doing something like experiences = new List<Experience>(); and then adding new experiences to it. Saving this would result in a loss of all previous data as right now the file is overwritten with each save. In an ideal world, this wouldn't happen, but I would like to figure out how to better structure my code to guard against it. Essentially I want to disallow removing items from the List or setting the list to a new list after the list has already been populated from load.
I have toyed with the idea of just appending the newest addition to the file, but I also want to cover the case where you change properties of an existing item in the List, and given that the list will never be all that large of a file, I thought overwriting would be the simplest approach as the cost isn't a concern.
Any help in figuring out the best approach is greatly appreciated.
Edit* Looked into the repository pattern https://www.infoworld.com/article/3107186/application-development/how-to-implement-the-repository-design-pattern-in-c.html and this seems like a potential approach.

I'm making an assumption that your user in this case is a code-level consumer of your API and that they'll be using the results inside the same memory stack, which is making you concerned about reference mutation.
In this situation, I'd return a copy of the list rather than the list itself on read-operations, and on writes allow only add and remove as maccettura recommends in the comments. You could keep the references to the items in the list intact if you want the consumer to be able to mutate them, but I'd think carefully about whether that's appropriate for your use case and consider instead requiring the consumer to call an update function (which could be the same as your add function a-la HTTP PUT).

Sometimes, when you want to signal that your collection should not be modified, exposing it as an IEnumerable instead of a List may be enough, but if you are writing a serious API, something like the repository pattern seems to be a good solution.
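As a rough sketch of the suggestions in these answers: the list stays private and readonly, callers only see a read-only view, and the only way to change the saved data is through Add. ExperienceRepository, the Name property and the use of System.Text.Json are assumptions made for illustration; the question does not say which JSON library is in use or what Experience looks like.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical stand-in for the Experience type from the question.
public sealed class Experience
{
    public string Name { get; set; } = "";
}

public sealed class ExperienceRepository
{
    // readonly: the list instance can never be replaced after construction.
    private readonly List<Experience> _items;
    private readonly string _path;

    public ExperienceRepository(string path)
    {
        _path = path;
        _items = File.Exists(path)
            ? JsonSerializer.Deserialize<List<Experience>>(File.ReadAllText(path)) ?? new List<Experience>()
            : new List<Experience>();
    }

    // Read-only wrapper: callers cannot add, remove or cast back to List<Experience>.
    public IReadOnlyList<Experience> Items
    {
        get { return _items.AsReadOnly(); }
    }

    public void Add(Experience item)
    {
        _items.Add(item);
        // Overwrites the whole file, as described in the question.
        File.WriteAllText(_path, JsonSerializer.Serialize(_items));
    }
}
```

The Experience objects themselves are still mutable through Items; whether that is acceptable, or whether changes should go through an explicit update method, is the design question raised in the first answer.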

Most efficient way, Tags or List<GameObject>?

In my game I can iterate either over a list of game objects or over the objects found by tag, and I would like to know which way is the most efficient.
Does using tags save memory, or does Unity need a lot of resources to do a search by tag?
public List<City> _Citys = new List<City>();
or
foreach(GameObject go in GameObject.FindGameObjectsWithTag("City"))

You're better off using a List of City objects and a standard for loop to iterate over them. The List simply holds references to the City objects, so the impact on memory should be minimal. You could also use a GameObject[] array instead of a List (an array is what FindGameObjectsWithTag returns).
It's better for performance to use a populated List/Array rather than searching by Tags and of course you're explicitly pointing to an object rather than using 'magic' strings -- if you change the tag name later on then the FindGameObjectsWithTag method will silently break, as it will no longer find any objects.
Also, avoid using a foreach loop in Unity, as this unfortunately creates a lot of garbage (the garbage collector in Unity isn't great, so it's best to create as little garbage as possible). Instead, just use a standard for loop:
Replace the “foreach” loops with simple “for” loops. For some reason, every iteration of every “foreach” loop generated 24 Bytes of garbage memory. A simple loop iterating 10 times left 240 Bytes of memory ready to be collected which was just unacceptable
EDIT: As mentioned in pid's answer - measure. You can use the built-in Unity profiler to inspect memory usage: http://docs.unity3d.com/Manual/ProfilerMemory.html

Per Microsoft's C# API rules, verbs such as Find* or Count* denote active code while terms such as Length stand for actual values that require no code execution.
Whether the Unity3D folks respected those guidelines is a matter of debate, but from the name of the method I can already tell that it has a cost and should not be taken too lightly.
On the other side, your question is about performance, not correctness. Both ways are correct per se, but one is supposed to have better performance.
So, the main rule of refactoring for performance is: MEASURE.
It depends on memory allocation and garbage collection, it is impossible to tell which really is faster without measuring.
So the best advice I could give you is pretty general. Whenever you feel the need to enhance performance of code you have to actually measure what you are about to improve, before and after.

Your code examples are two distinctly different things. One instantiates a list, and the other enumerates the array returned from a function call.
I assume you mean the difference between iterating over your declared list and iterating over the return value of GameObject.FindGameObjectsWithTag(), in which case:
Storing a List as a member variable in your class, populating it once and then iterating over it several times is more efficient than iterating over the result of GameObject.FindGameObjectsWithTag several times.
This is because you keep your List and your references to the objects in your list at all times without having to repopulate it.
GameObject.FindGameObjectsWithTag will search your entire object hierarchy and compile an array of all the objects it finds that match your search criteria. This is done every time you call the function, so there is additional overhead even if the number of objects it finds is the same, as it still searches your hierarchy.
To be honest, you could just cache the results of GameObject.FindGameObjectsWithTag in a List, provided the set of objects returned will not change (that is, you are not instantiating or destroying any of those objects).
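A minimal sketch of that caching approach, assuming a tag named "City" is defined in the project and that no tagged objects are created or destroyed after startup. CityTracker is a hypothetical component name, and the script only runs inside Unity.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Hypothetical component; attach it to any GameObject in the scene.
public class CityTracker : MonoBehaviour
{
    private readonly List<GameObject> _cities = new List<GameObject>();

    private void Start()
    {
        // Search the hierarchy once instead of on every frame.
        _cities.AddRange(GameObject.FindGameObjectsWithTag("City"));
    }

    private void Update()
    {
        // Plain for loop over the cached references.
        for (int i = 0; i < _cities.Count; i++)
        {
            GameObject city = _cities[i];
            // ... per-city work goes here ...
        }
    }
}
```

If cities can appear or disappear at runtime, the cache has to be updated at those points (or rebuilt), which is the caveat mentioned in the answer above.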

.NET Collection Classes

A group of related data, such as a list of parts, can be handled using either arrays (an array of parts) or a collection. I understand that with arrays, insertion, deletion and some other operations have a performance impact compared with collections. Does this mean that arrays are not used internally by the collections? If so, what data structure is used for collections like List, Collection, etc.?
How are the collections handled internally?

List<T> uses an internal array. Removing/inserting items near the beginning of the list will be more expensive than doing the same near the end of the list, since all the elements after that position need to be shifted in one direction. Also, once you try to add an item when the internal array is full, a new, bigger array will be constructed, the contents copied, and the old array discarded.
The Collection<T> class, when used with the parameterless constructor, uses a List<T> internally. So performance-wise they will be identical, with the exception of overhead caused by wrapping. (Essentially one more level of indirection, which is going to be negligible in most scenarios.)
LinkedList<T> is, as its name implies, a linked list. This will sacrifice iteration speed for insertion/removal speed. Since iterating means following a reference from each node to the next, this is going to take more work overall. Aside from the pointer traversal, two nodes may not be allocated anywhere near each other, reducing the effectiveness of CPU caches.
However, the amount of time required to insert or remove a node is constant, since it requires the same number of operations no matter the state of the list. (This does not take into account any work that must be done to actually locate the item to remove, or to traverse the list to find the insertion point!)
If your primary concern with your collection is testing if something is in the collection, you might consider a HashSet<T> instead. Addition of items to the set will be relatively fast, somewhere between insertion into a list and a linked list. Removal of items will again be relatively fast. But the real gain is in lookup time -- testing if a HashSet<T> contains an item does not require iterating the entire list. On average it will perform faster than any list or linked list structure.
However, a HashSet<T> cannot contain equivalent items. If part of your requirements is that two items that are considered equal (by an Object.Equals(Object) override, or by implementing IEquatable<T>) coexist independently in the collection, then you simply cannot use a HashSet<T>. Also, HashSet<T> does not guarantee insertion order, so you also can't use a HashSet<T> if maintaining some sort of ordering is important.
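The trade-offs described in this answer can be seen in a short example. This is only an illustration with made-up values, written as top-level statements (C# 9 or later); it shows behaviour, not timings.

```csharp
using System;
using System.Collections.Generic;

// List<T>: array-backed, so inserting at the front shifts every existing element.
var list = new List<int> { 1, 2, 3 };
list.Insert(0, 0);                      // list is now 0, 1, 2, 3

// LinkedList<T>: finding the node is a linear walk, the insertion itself is constant time.
var linked = new LinkedList<int>(new[] { 1, 2, 3 });
var node = linked.Find(2);
if (node != null)
{
    linked.AddAfter(node, 99);          // linked is now 1, 2, 99, 3
}

// HashSet<T>: fast membership test, but equal items are stored only once.
var set = new HashSet<string>();
bool firstAdd = set.Add("part-1");      // true
bool secondAdd = set.Add("part-1");     // false, an equal item is already present
bool found = set.Contains("part-1");    // true, no scan of all items needed

Console.WriteLine($"{string.Join(",", list)} | {string.Join(",", linked)} | {firstAdd} {secondAdd} {found}");
// prints: 0,1,2,3 | 1,2,99,3 | True False True
```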

There are two basic ways to implement a simple collection:
contiguous array
linked list
Contiguous arrays have performance disadvantages for the operations you mentioned because the memory space of the collection is either preallocated or allocated based on the contents of the collection. Thus deletion or insertion requires moving many array elements to keep the entire collection contiguous and in the proper order.
Linked lists remove these issues because the items in the collection do not need to be stored in memory contiguously. Instead each element contains a reference to one or more of the other elements. Thus, when an insertion is made, the item in question is created anywhere in memory and only the references on one or two of the elements already in the collection need to be modified.
For example:
LinkedList<object> c = new LinkedList<object>(); // a linked list
object[] a = new object[] { }; // a contiguous array
This is simplified of course. LinkedList<T> is implemented as a doubly linked list with some extra bookkeeping, but that is the basic structure.

I think that some collection classes might use arrays internally as well as linked lists or something similar. The benefit of using collections from the System.Collections namespace instead of arrays, is that you do not need to spend any extra time writing code to perform update operations.
Arrays will always be more lightweight, and if you know some very good search algorithms, then you might even be able to use them more efficiently, but most of the time you can avoid reinventing the wheel by using classes from System.Collections. These classes are meant to help the programmer avoid writing code that has already been written and tuned hundreds of times, so it is unlikely that you'll get a significant performance boost by manipulating arrays yourself.
When you need a static collection that doesn't require much adding, removing or editing, then perhaps it is a good time to use an array, since they don't require the extra memory that collections do.

Is there a LinkedList collection that supports dictionary type operations

I was recently profiling an application trying to work out why certain operations were extremely slow. One of the classes in my application is a collection based on LinkedList. Here's a basic outline, showing just a couple of methods and some fluff removed:
public class LinkInfoCollection : PropertyNotificationObject, IEnumerable<LinkInfo>
{
    private LinkedList<LinkInfo> _items;

    public LinkInfoCollection()
    {
        _items = new LinkedList<LinkInfo>();
    }

    public void Add(LinkInfo item)
    {
        _items.AddLast(item);
    }

    public LinkInfo this[Guid id]
    { get { return _items.SingleOrDefault(i => i.Id == id); } }
}
The collection is used to store hyperlinks (represented by the LinkInfo class) in a single list. However, each hyperlink also has a list of hyperlinks which point to it, and a list of hyperlinks which it points to. Basically, it's a navigation map of a website. As this means you can have infinite recursion when links point back to each other, I implemented this as a linked list. As I understand it, this means that for every hyperlink, no matter how many times it is referenced by another hyperlink, there is only ever one copy of the object.
The ID property in the above example is a GUID.
With that long winded description out the way, my problem is simple - according to the profiler, when constructing this map for a fairly small website, the indexer referred to above is called no less than 27906 times. Which is an extraordinary amount. I still need to work out if it's really necessary to be called that many times, but at the same time, I would like to know if there's a more efficient way of doing the indexer as this is the primary bottleneck identified by the profiler (also assuming it isn't lying!). I still needed the linked list behaviour as I certainly don't want more than one copy of these hyperlinks floating around killing my memory, but I also do need to be able to access them by a unique key.
Does anyone have any advice to offer on improving the performance of this indexer? I also have another indexer which uses a URI rather than a GUID, but this is less problematic as building the incoming/outgoing links is done by GUID.
Thanks;
Richard Moss

You should use a Dictionary<Guid, LinkInfo>.

You don't need to use LinkedList in order to have only one copy of each LinkInfo in memory. Remember that LinkInfo is a managed reference type, and so you can place it in any collection, and it'll just be a reference to the object that gets placed in the list, not a copy of the object itself.
That said, I'd implement the LinkInfo class as containing two lists of Guids: one for the things this links to, one for the things linking to this. I'd have just one Dictionary<Guid, LinkInfo> to store all the links. Dictionary is a very fast lookup, I think that'll help with your performance.
The fact that this[] is getting called 27,000 times doesn't seem like a big deal to me, but what's making it show up in your profiler is probably the SingleOrDefault call on the LinkedList. Linked lists are best for situations where you need fast insertions & removals, particularly in the middle of the list. For quick lookups, which is probably more important here, let the Dictionary do its work with hash tables.
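A sketch of what that could look like. The LinkInfo shown here is a simplified stand-in with hypothetical property names (OutgoingLinks, IncomingLinks), and the PropertyNotificationObject base class from the question is left out because its definition is not shown.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Simplified stand-in for the LinkInfo class from the question.
public sealed class LinkInfo
{
    public Guid Id { get; } = Guid.NewGuid();
    public List<Guid> OutgoingLinks { get; } = new List<Guid>();   // things this links to
    public List<Guid> IncomingLinks { get; } = new List<Guid>();   // things linking to this
}

public sealed class LinkInfoCollection : IEnumerable<LinkInfo>
{
    private readonly Dictionary<Guid, LinkInfo> _items = new Dictionary<Guid, LinkInfo>();

    public void Add(LinkInfo item)
    {
        _items.Add(item.Id, item);   // throws ArgumentException on a duplicate Id
    }

    // Hash lookup instead of scanning every node; returns null when the id is unknown.
    public LinkInfo this[Guid id]
    {
        get { return _items.TryGetValue(id, out var item) ? item : null; }
    }

    public IEnumerator<LinkInfo> GetEnumerator() { return _items.Values.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

One difference from the LinkedList version: Dictionary<TKey, TValue> does not promise to enumerate items in insertion order, so if that order matters it has to be tracked separately.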

Should I worry about releasing resources in this case?

Let's say I have a class Collection which holds a list of Items.
public class Collection
{
    private List<Item> MyList;
    //...
}
I have several instances of this Collection class which all have different MyLists but share some Items.
For example: There are 10 Items, Collection1 references Items 1-4, Collection2 has Items 2-8 and Collection3 4,7,8 and 10 on its List.
I implemented this as follows: I have one global List which holds any Items available. Before I create a new Collection I check if there are already Items I need in this list -- if not I create the Item and add it to the global List (and to the Collection of course).
The problem I see is that those Items will never be released - even if all Collections are gone, the memory they consume is still not freed because the global list still references them.
Is this something I need to worry about? If so, what should I do? I thought of adding a counter to the global list to see when an Item is not needed anymore and remove its reference.
Edit:
It is in fact a design problem, I think. I will discard the idea of a global list and instead loop through all Collections and see if they have the needed Item already.

If the global list needs references to the items then you can't realistically free them. Do you actually need references to the items in the global list? When should you logically be able to remove items from the global list?
You could consider using weak references in the global list, and periodically pruning the WeakReference values themselves if their referents have been collected.
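A minimal sketch of the weak-reference approach, assuming each Item can be identified by an integer id (the question does not say how items are identified). ItemRegistry, GetOrCreate and Prune are hypothetical names, and the class is not thread safe.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Item class from the question.
public sealed class Item
{
    public Item(int id) { Id = id; }
    public int Id { get; }
}

public sealed class ItemRegistry
{
    // The registry holds only weak references, so it does not keep Items alive.
    private readonly Dictionary<int, WeakReference<Item>> _items =
        new Dictionary<int, WeakReference<Item>>();

    public Item GetOrCreate(int id)
    {
        if (_items.TryGetValue(id, out var weak) && weak.TryGetTarget(out var existing))
        {
            return existing;   // some Collection still references this Item
        }

        var created = new Item(id);
        _items[id] = new WeakReference<Item>(created);
        return created;
    }

    // Drops entries whose Item has already been collected; returns how many were removed.
    public int Prune()
    {
        var dead = _items.Where(pair => !pair.Value.TryGetTarget(out _))
                         .Select(pair => pair.Key)
                         .ToList();
        foreach (var key in dead)
        {
            _items.Remove(key);
        }
        return dead.Count;
    }
}
```

Prune could be called after a Collection is discarded, or on a timer; without it the Items are still freed, but the small WeakReference entries accumulate.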

It looks like a bit of a design problem, do you really need the global list?
Apart from the weak references that Jon mentions, you could also periodically rebuild the global list (for example after deleting a collection) or only build it dynamically when you need it and release it again.
You'll have to decide which method is most appropriate, we don't have enough context here.
