Maintaining a sorted list - C#

I need to store a collection of nodes:
class Node
{
    int Value;
    // other info
}
I have three requirements:
1. Need to be able to efficiently retrieve the node with the lowest Value in the collection
2. Need to be able to efficiently insert a node into the collection
3. Two nodes can have the same Value
I thought the best collection to use for this would be some sort of sorted list. That way requirement #1 is satisfied efficiently by just taking the first element from the sorted list. Requirement #2 is satisfied efficiently by inserting a new node in the right place in the list.
But the SortedList collection in .NET is like SortedDictionary: it requires the key being sorted on to be unique, which violates requirement #3.
There appears to be no collection in .NET that satisfies these requirements, mainly because the self-sorting collections that do exist require the keys being sorted on to be unique. What is the reason for this? I assume it cannot be an oversight. What am I not grasping here? I can find similar questions about this, but they usually involve someone suggesting SortedList, followed by the realization that it doesn't work, and then the conversation fades out without a standard solution. At least if someone would say "There is no collection in C# for this task, you need to hack something together", that would be an answer.
Is it acceptable to use a regular List&lt;Node&gt; and re-sort the list whenever a new node is added? That seems less efficient than inserting the node in the right place to begin with. Perhaps that is what I should do: manually iterate over the list until I find the place to insert the new node myself?

If all you need is to efficiently insert and quickly retrieve the item with the lowest value, then you don't need a sorted list; you need a heap. Check out A Generic Binary Heap Class.
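If you want to roll your own, here is a minimal binary min-heap sketch (my own, not the linked article's class) that meets all three requirements: O(log n) insert, O(log n) extract-min, and duplicate Values are fine. It assumes the question's Node.Value is made public:

public class MinHeap
{
    private readonly List<Node> _items = new List<Node>();

    public int Count { get { return _items.Count; } }

    public void Insert(Node node)
    {
        // Add at the end, then sift up until the parent is no larger.
        _items.Add(node);
        int i = _items.Count - 1;
        while (i > 0)
        {
            int parent = (i - 1) / 2;
            if (_items[parent].Value <= _items[i].Value) break;
            Swap(parent, i);
            i = parent;
        }
    }

    public Node ExtractMin()
    {
        if (_items.Count == 0) throw new InvalidOperationException("Heap is empty");

        // The minimum is always at the root (index 0).
        Node min = _items[0];
        _items[0] = _items[_items.Count - 1];
        _items.RemoveAt(_items.Count - 1);

        // Sift the relocated node down until both children are no smaller.
        int i = 0;
        while (true)
        {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < _items.Count && _items[left].Value < _items[smallest].Value) smallest = left;
            if (right < _items.Count && _items[right].Value < _items[smallest].Value) smallest = right;
            if (smallest == i) break;
            Swap(smallest, i);
            i = smallest;
        }
        return min;
    }

    private void Swap(int a, int b)
    {
        Node tmp = _items[a]; _items[a] = _items[b]; _items[b] = tmp;
    }
}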

Make your list key unique by appending the object id or another unique identifier: IDs 4 and 5, both having value "1", become "1_4" and "1_5", which can be added to the sorted list without trouble and will sort as expected.
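A sketch of the same idea using a value tuple as the composite key instead of string concatenation (string keys would sort "10_x" before "2_x"; tuples compare numerically). It assumes a recent C# with value tuples, and that Node exposes a unique Id alongside a public Value:

var list = new SortedList<(int Value, int Id), Node>();

// Tuples compare by Value first, then by Id, so duplicate Values no longer collide.
list.Add((node.Value, node.Id), node);

// Requirement #1: the node with the lowest Value is always the first entry.
Node lowest = list.Values[0];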

You could use a SortedList<int, List<NodeInfo>>, where you'll put the Value in the key and all the other properties in the value:
public class NodeList : SortedList<int, List<NodeInfo>>
{
    public void Add(int key, NodeInfo info)
    {
        // ContainsKey is a binary search over the sorted keys (O(log n)).
        if (this.ContainsKey(key))
        {
            this[key].Add(info);
        }
        else
        {
            this.Add(key, new List<NodeInfo>() { info });
        }
    }

    public NodeInfo FirstNode()
    {
        if (this.Count == 0)
            return null;
        // First() requires "using System.Linq;"; entries are ordered by key.
        return this.First().Value.First();
    }
}

public class NodeInfo
{
    public string Info { get; set; }
    // TODO: add other members
}
Here's some sample usage:
var list = new NodeList();

// adding
list.Add(3, new NodeInfo() { Info = "some info 3" });

// inserting
for (int i = 0; i < 100000; i++)
{
    list.Add(1, new NodeInfo() { Info = "some info 1" });
    list.Add(2, new NodeInfo() { Info = "some info 2" });
    list.Add(1, new NodeInfo() { Info = "some info 1.1" });
}

// retrieving the first item
var firstNodeInfo = list.FirstNode();

// retrieving an item
var someNodeInfo = list[2].First();

In my opinion, it is acceptable to use a normal list and re-sort it after every insert. Sorting is pretty efficient in .NET. See this thread: String sorting performance degradation in VS2010 vs. VS2008.
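If re-sorting after every insert ever becomes a bottleneck, a middle ground is to keep the list sorted and insert at the right spot with List&lt;T&gt;.BinarySearch. A sketch, where nodeComparer is a hypothetical IComparer&lt;Node&gt; that compares Values:

// BinarySearch returns the match's index, or the bitwise complement of
// the insertion point if no match is found. Ties are fine: any index
// within a run of equal Values keeps the list sorted.
int index = list.BinarySearch(newNode, nodeComparer);
if (index < 0)
    index = ~index;
list.Insert(index, newNode); // O(log n) to find, O(n) to shift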

You can use OrderedMultiDictionary in Wintellect's Power Collections for .NET. That's exactly what you are looking for.

Related

list property containing two types (string/int)

I need to have a property that will be an array that can hold both ints and strings.
If I set the property to an array of ints, searching through the array will be fast; at odd times the property will also contain strings, for which the search will be slow.
Is there any way, other than the following, to have a list that contains native types?
two properties, one for ints and one for strings
use List<object>
UPDATE:
The use case is as follows. I have a database field [ReferenceNumber] that holds the values (integers and strings) and another field [SourceID] (used for other things) which can be used to determine whether a record holds an int or a string.
I will be fetching collections of these records based on the source id; depending on what the source is, the list will contain either integers or strings. Then I will go through this collection looking for certain reference numbers and add the ones that are not already present. I will be pre-fetching a lot of records instead of hitting the database over and over.
So, for example, if I get a list for SourceID = 1, the values are ints, and when searching I want the underlying list to be int so the search will be fast. If the SourceID is 2, the values are strings; that case is rare, and a performance hit on searching through strings is okay because there are not many of those records.
I will go through this collection looking for certain reference numbers and add the ones that are not already present.
It sounds to me like you don't need a List<>, but rather a HashSet<>. Simply use a HashSet<object>, and Add() all the items, and the collection will ignore duplicate items. It will be super-fast, regardless of whether you're dealing with ints or strings.
On my computer, the following code shows that it takes about 50 milliseconds to populate an initial 400,000 unique strings in the hashset, and about 2 milliseconds to add an additional 10,000 random strings:
var sw = new Stopwatch();
var initial = Enumerable.Range(1, 400000).Select(i => i.ToString()).ToList();

sw.Start();
var set = new HashSet<object>(initial);
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);

var random = new Random();
var additional = Enumerable.Range(1, 10000).Select(i => random.Next(1000000).ToString()).ToList();

sw.Restart();
foreach (var item in additional)
{
    set.Add(item);
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
One caveat, in case it's important: HashSet<> does not guarantee that enumeration preserves insertion order, so don't rely on ordering.
The only other thing I would suggest is a custom object that implements IComparable:

class Multitype : IComparable
{
    public int? Number { get; set; }
    public string Words { get; set; }

    public int CompareTo(object obj)
    {
        Multitype other = obj as Multitype;
        if (other == null)
            throw new ArgumentException("Object is not a Multitype");

        if (Number != null && other.Number != null)
        {
            // Both sides hold numbers: compare numerically.
            return Number.Value.CompareTo(other.Number.Value);
        }
        else
        {
            // Otherwise fall back to comparing string representations.
            return string.Compare(Words ?? Number.ToString(),
                                  other.Words ?? other.Number.ToString());
        }
    }
}
There will be some extra comparison steps between numbers, but not as many as with string parsing.
Are you storing a ton of data? Does the performance difference really matter?
It's possible to use generics if you implement them on the class. I'm not sure this solves your problem; I'd be interested to hear a real-world example of a property that can have different types.
class Foo<T>
{
    public List<T> GenericList { get; set; }

    public Foo()
    {
        this.GenericList = new List<T>();
    }
}
If by "use List" you mean the object primitive or provided System.Object, that is an option, but I think it would behoove you to make your own wrapper object -- IntString or similar -- that would handle everything for you. It should implement IComparable, as the other gal mentioned.
You can increase the efficiency of sorting your object in collections by writing a CompareTo method that does exactly what you need it to. Writing a good CompareTo method is a whole can of worms in itself, so you should probably start a new question for that, if that's what you want.
If you're looking for a property that is strongly typed as a List<int> or List<string> at instantiation, but can change afterwards, then you want an interface. IList exists but won't help you, since it must also be strongly typed upon declaration. You would probably have to make something like an IIntStringList that can only be backed by a List<int> or a List<string>.
Sorry this answer doesn't have that many details (I need to leave the office now), but I hope I've set you on the right track.

How can I compare two C# collections and issue Add, Delete commands to make them equal?

I have two ICollection collections:
public partial class ObjectiveDetail
{
    public int ObjectiveDetailId { get; set; }
    public int Number { get; set; }
    public string Text { get; set; }
}

var _objDetail1 = ...; // contains a list of ObjectiveDetails from my database.
var _objDetail2 = ...; // contains a list of ObjectiveDetails from the web front end.
How can I iterate through these and issue an Add, Delete or Update to synchronize the database with the latest from the web front end?
If there is a record present in the first list but not the second then I would like to:
_uow.ObjectiveDetails.Delete(_objectiveDetail);
If there is a record present in the second list but not the first then I would like to:
_uow.ObjectiveDetails.Add(_objectiveDetail);
If there is a record (same ObjectiveDetailId) in the first and second then I need to see if they are the same and if not issue an:
_uow.ObjectiveDetails.Update(_objectiveDetail);
I was thinking to do this with some kind of:
foreach (var _objectiveDetail in _objectiveDetails) {}
but then I think I might need to have two of these and I am also wondering if there is a better way. Does anyone have any suggestions as to how I could do this?
The following code is one possible solution (taking _objDetail1 as the database list and _objDetail2 as the front-end list, as in the question):

// In both lists but with different contents: take the front-end version.
var toBeUpdated =
    _objDetail2.Where(
        a => _objDetail1.Any(
            b => (b.ObjectiveDetailId == a.ObjectiveDetailId) &&
                 (b.Number != a.Number || !b.Text.Equals(a.Text))));

// In the front-end list but not the database: add.
var toBeAdded =
    _objDetail2.Where(a => _objDetail1.All(
        b => b.ObjectiveDetailId != a.ObjectiveDetailId));

// In the database but no longer in the front-end list: delete.
var toBeDeleted =
    _objDetail1.Where(a => _objDetail2.All(
        b => b.ObjectiveDetailId != a.ObjectiveDetailId));
The rest is simple code to Add, Update and Delete those three result sets against the database.
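For instance, the apply step might look something like this, using the question's unit-of-work:

foreach (var detail in toBeAdded)
    _uow.ObjectiveDetails.Add(detail);

foreach (var detail in toBeDeleted)
    _uow.ObjectiveDetails.Delete(detail);

foreach (var detail in toBeUpdated)
    _uow.ObjectiveDetails.Update(detail);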
It looks like you just want the two lists to be copies of one another. You could implement a Copy method and replace the outdated collection; if you implement ICollection you will need to implement CopyTo. You could also add a version field to the container so you know whether you need to update it.
If you don't want to do it this way, and you want to go through the elements and update them, check whether you can store a state (modified, deleted, added) in each object; this will help with the comparison.
foreach (var _objectiveDetail in _objectiveDetails) {} but then I think I might need to have two of these and I am also wondering if there is a better way. Does anyone have any suggestions as to how I could do this?
Instead of looping through the whole collection, use a LINQ query:

var query = from _objectiveDetail in _objectiveDetails
            where (condition)
            select ... ;

Update: it's pointless to iterate through the whole collection if you want to update/delete/add something from the web end. Humans are a bit slower than computers, aren't they? Do it one record at a time. In fact, I don't understand the idea of the two collections; what are they for? If you still want it: use an event to run the query, select the updated/deleted/added record, and do the appropriate operation on it.

Why is dictionary so much faster than list?

I am testing the speed of getting data from a Dictionary vs. a List.
I've used this code to test:
internal class Program
{
    private static void Main(string[] args)
    {
        var stopwatch = new Stopwatch();
        List<Grade> grades = Grade.GetData().ToList();
        List<Student> students = Student.GetStudents().ToList();

        stopwatch.Start();
        foreach (Student student in students)
        {
            student.Grade = grades.Single(x => x.StudentId == student.Id).Value;
        }
        stopwatch.Stop();
        Console.WriteLine("Using list {0}", stopwatch.Elapsed);
        stopwatch.Reset();

        students = Student.GetStudents().ToList();
        stopwatch.Start();
        Dictionary<Guid, string> dic = Grade.GetData().ToDictionary(x => x.StudentId, x => x.Value);
        foreach (Student student in students)
        {
            student.Grade = dic[student.Id];
        }
        stopwatch.Stop();
        Console.WriteLine("Using dictionary {0}", stopwatch.Elapsed);
        Console.ReadKey();
    }
}

public class GuidHelper
{
    public static List<Guid> ListOfIds = new List<Guid>();

    static GuidHelper()
    {
        for (int i = 0; i < 10000; i++)
        {
            ListOfIds.Add(Guid.NewGuid());
        }
    }
}

public class Grade
{
    public Guid StudentId { get; set; }
    public string Value { get; set; }

    public static IEnumerable<Grade> GetData()
    {
        for (int i = 0; i < 10000; i++)
        {
            yield return new Grade
            {
                StudentId = GuidHelper.ListOfIds[i],
                Value = "Value " + i
            };
        }
    }
}

public class Student
{
    public Guid Id { get; set; }
    public string Name { get; set; }
    public string Grade { get; set; }

    public static IEnumerable<Student> GetStudents()
    {
        for (int i = 0; i < 10000; i++)
        {
            yield return new Student
            {
                Id = GuidHelper.ListOfIds[i],
                Name = "Name " + i
            };
        }
    }
}
There are lists of students and grades in memory; they have StudentId in common.
In the first approach I tried to find the Grade of each student using LINQ on a list, which takes nearly 7 seconds on my machine. In the second approach I first converted the list into a dictionary and then looked up students' grades by key, which takes less than a second.
When you do this:
student.Grade = grades.Single(x => x.StudentId == student.Id).Value;
As written, it has to enumerate the List until it finds the entry with the correct StudentId (does entry 0 match the lambda? No... does entry 1 match the lambda? No... etc.). This is O(n). Since you do it once for every student, it is O(n^2).
However when you do this:
student.Grade = dic[student.Id];
If you want to find a certain element by key in a dictionary, it can instantly jump to where it is in the dictionary - this is O(1). O(n) for doing it for every student. (If you want to know how this is done - Dictionary runs a mathematical operation on the key, which turns it into a value that is a place inside the dictionary, which is the same place it put it when it was inserted)
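As a rough illustration of that mathematical operation (a sketch, not Dictionary&lt;TKey, TValue&gt;'s actual internals), the key's hash code is reduced to a bucket index, and the cost of computing it does not grow with the number of entries:

// Hypothetical sketch: hash the key, clear the sign bit, and map the
// result onto one of the available buckets.
static int BucketIndex(Guid key, int bucketCount)
{
    return (key.GetHashCode() & 0x7FFFFFFF) % bucketCount;
}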
So, dictionary is faster because you used a better algorithm.
The reason is that a dictionary is a lookup, while a list is an iteration.
A Dictionary uses a hash lookup, while your list requires walking through its elements from the beginning to the result each time.
To put it another way: the list will be faster than the dictionary on the first item, because there's nothing to look up; it's the first item, boom, it's done. But the second time, the list has to look through the first item, then the second item. The third time through, it has to look through the first item, then the second item, then the third item, and so on.
So each iteration the lookup takes more and more time; the larger the list, the longer it takes. The dictionary, by contrast, has a more or less fixed lookup time (it also increases as the dictionary gets larger, but at a much slower pace, so by comparison it's almost fixed).
When using a Dictionary you are using a key to retrieve your information, which enables it to find the item efficiently; with a List, your Single LINQ expression has no option other than to look through the entire list for the wanted item.
Dictionary uses hashing to search for the data. Each item in the dictionary is stored in a bucket of items that share the same hash. It's a lot quicker.
Try sorting your list; it will be a bit quicker then.
A dictionary uses a hash table; it is a great data structure as it maps an input to a corresponding output almost instantaneously. It has a complexity of O(1), as already pointed out, which means more or less immediate retrieval.
The cons: for the sake of performance you need lots of space in advance (depending on the implementation, be it separate chaining or linear/quadratic probing, you may need at least as much space as you plan to store, probably double in the probing case), and you need a good hashing algorithm that maps your input ("John Smith") uniquely to a corresponding output such as a position in an array (hash_array[34521]).
Also, listing the entries in a sorted order is a problem. If I may quote Wikipedia: "Listing all n entries in some specific order generally requires a separate sorting step, whose cost is proportional to log(n) per entry."
Have a read on linear probing and separate chaining for some gorier details :)
Dictionary is based on a hash table, which is a rather efficient algorithm for looking things up. In a list you have to go element by element in order to find something.
It's all a matter of data organization...
When it comes to lookup of data, a keyed collection is always faster than a non-keyed collection. This is because a non-keyed collection will have to enumerate its elements to find what you are looking for. While in a keyed collection you can just access the element directly via the key.
These are some nice articles for comparing list to dictionary.
Here. And this one.
From MSDN: the Dictionary documentation mentions close to O(1), but I think it depends on the types involved.
The Dictionary(TKey,TValue) generic class provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.
Note:
The speed of retrieval depends on the quality of the hashing algorithm of the type specified for TKey.
List<T> does not implement a hash lookup, so it is sequential and the performance is O(n). It also depends on the types involved, and boxing/unboxing needs to be considered.

How to efficiently sort list of nodes with next and previous node references?

I have a collection of items in random order, each conforming to the following data structures:
// NOTE: Was "Vertex" in the comments...
public class Item
{
    public string Data { get; set; }
}

public class Node
{
    public Item Next { get; set; }
    public Item Previous { get; set; }
}
Example:
var a = new Item();
var b = new Item();
var c = new Item();
var d = new Item();
var nodeA = new Node() { Previous = null };
var nodeB = new Node() { Previous = a };
nodeA.Next = b;
var nodeC = new Node() { Previous = b };
nodeB.Next = c;
var nodeD = new Node() { Previous = c, Next = null };
nodeC.Next = d;
// This would be input to the sort method (completely random order).
var items = new []{ nodeC, nodeA, nodeD, nodeB };
// Execute sort
// Result: nodeA, nodeB, nodeC, nodeD.
Obviously an O(n^2) solution is possible. However, I would like to sort these into the correct order in less than O(n^2). Is this possible?
Looking at it... assuming you aren't using circular lists, couldn't you just iterate through your random-order array until you find the starting node (the one with .Previous == null) and then return the node? I mean, one of the advantages of a linked list is that you don't have to store references to all the nodes in a separate data structure, just have them each connected to each other. (Well, depending on how the language implementation you're using does reference counting and garbage collection, if it does them at all.)
But basically, unless you have an immediate need after the operation to access an element a certain distance from the starting node, I'd recommend just immediately returning the starting node when encountered and then lazily assigning to an array of the proper size as you use each successive node. In fact, even if you create a new array and assign to it, wouldn't the worst case still just be O(n), with n being the number of nodes? O(n) to find the starting node, and then another O(n) to iterate through the list, assigning each node to the corresponding index in an array of the same size as your input array.
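Here is a minimal sketch of that linear approach. It assumes a conventional doubly linked node whose Next/Previous reference other nodes (a simplification of the question's Node/Item split):

public class LinkedNode
{
    public Item Data { get; set; }
    public LinkedNode Next { get; set; }
    public LinkedNode Previous { get; set; }
}

public static LinkedNode[] SortByLinks(IList<LinkedNode> nodes)
{
    // O(n): find the starting node, i.e. the one with no predecessor.
    LinkedNode head = null;
    foreach (var n in nodes)
        if (n.Previous == null) { head = n; break; }

    // O(n): walk forward, assigning each node to its index in the result.
    var result = new LinkedNode[nodes.Count];
    int i = 0;
    for (var current = head; current != null; current = current.Next)
        result[i++] = current;
    return result;
}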
Based on your updated question, it might be a good idea for you to implement a temporary set of linked lists. As you initially iterate through the list, you'd check the Next and Previous elements of each node, and then store the Nexts and Previouses in Dictionary-esque objects (I'm not sure what .NET object would be best suited for that) as keys, with linked-list nodes wrapped around the existing Nodes referencing the Items being the values. That way you'd build up the links as you go along without any actual sorting, and would ultimately just iterate through your temporary list, assigning the Nodes wrapped by the listnodes to a new array to return.
This should be better than O(n^2) due to dictionary accesses generally being constant-time on average (though worst-case asymptotic behavior is still O(n)), I believe.
I think merge sort can work. Something like...
merge_sort_list(list, chain_length)
1.  if chain_length > 1 then
2.      merge_sort_list(list, chain_length/2)
3.      middle_node = step_into_by(list, chain_length/2)
4.      merge_sort_list(middle_node, chain_length - chain_length/2)
5.      merge_list_halves(list, middle_node, chain_length)

merge_list_halves(list, middle, chain_length)
1.  ... you get the idea
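A concrete C# rendering of that pseudocode, assuming a conventional singly linked node with a sortable key (the question's Node/Item split is simplified away here):

public class ChainNode
{
    public int Value;
    public ChainNode Next;
}

static ChainNode MergeSort(ChainNode head)
{
    if (head == null || head.Next == null) return head;

    // Split: fast moves two steps per slow step, so slow stops at the
    // end of the first half.
    ChainNode slow = head, fast = head.Next;
    while (fast != null && fast.Next != null)
    {
        slow = slow.Next;
        fast = fast.Next.Next;
    }
    ChainNode second = slow.Next;
    slow.Next = null;

    return MergeHalves(MergeSort(head), MergeSort(second));
}

static ChainNode MergeHalves(ChainNode a, ChainNode b)
{
    // Standard merge: repeatedly take the smaller head of the two chains.
    ChainNode dummy = new ChainNode();
    ChainNode tail = dummy;
    while (a != null && b != null)
    {
        if (a.Value <= b.Value) { tail.Next = a; a = a.Next; }
        else { tail.Next = b; b = b.Next; }
        tail = tail.Next;
    }
    tail.Next = (a != null) ? a : b;
    return dummy.Next;
}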
Merge Sort comes to mind... I think this should be applicable here and it performs (worst case) in O(n log n).
Merge sort is often the best choice for sorting a linked list: in this situation it is relatively easy to implement a merge sort in such a way that it requires only Θ(1) extra space, and the slow random-access performance of a linked list makes some other algorithms (such as quicksort) perform poorly, and others (such as heapsort) completely impossible.
Edit: after re-reading your question, it doesn't make much sense anymore. Basically you want to sort so that the items are in the order given by the list? That is doable in linear O(n) time: first you travel backwards to the first item in the list (via references) and then you just yield each item forward. Or are your Next/Previous not references?

Removing duplicate string from List (.NET 2.0!)

I'm having issues finding the most efficient way to remove duplicates from a list of strings (List<string>).
My current implementation is a dual foreach loop checking that the instance count of each object is only 1, and otherwise removing the second occurrence.
I know there are MANY other questions out there, but the best solutions all require above .NET 2.0, which is the current build environment I'm working in. (GM and Chrysler are very resistant to change... :) )
This limits the possible solutions by not allowing any LINQ or HashSets.
The code I'm using is Visual C++, but a C# solution will work just fine as well.
Thanks!
This probably isn't what you're looking for, but if you have control over it, the most efficient way would be to not add the duplicates in the first place...
Do you have control over this? If so, all you need to do is a myList.Contains(currentItem) call before you add the item, and you're set.
You could do the following.
List<string> list = GetTheList();
Dictionary<string, object> map = new Dictionary<string, object>();
int i = 0;
while (i < list.Count)
{
    string current = list[i];
    if (map.ContainsKey(current))
    {
        list.RemoveAt(i);
    }
    else
    {
        i++;
        map.Add(current, null);
    }
}
This has the overhead of building a Dictionary<TKey,TValue> object, which duplicates the list of unique values in the list. But it's fairly efficient speed-wise.
I'm no Comp Sci PhD, but I'd imagine using a dictionary with the items in your list as the keys would be fast.
Since a dictionary doesn't allow duplicate keys, you'd have only unique strings at the end of the iteration.
Just remember, when providing a custom class, to override the Equals() method in order for Contains() to function as required (and override GetHashCode() alongside it if the class will ever go into a hash-based collection). Example, with a hypothetical Key field standing in for your real members:

List<CustomClass> clz = new List<CustomClass>();

public class CustomClass
{
    public string Key { get; set; } // hypothetical identifying field

    // List<T>.Contains uses this override via EqualityComparer<T>.Default.
    public override bool Equals(object obj)
    {
        CustomClass other = obj as CustomClass;
        return other != null && Key == other.Key;
    }
}
If you're going the route of "just don't add duplicates", then checking List.Contains before adding an item works, but it's O(n^2) where n is the number of strings you want to add. It's no different from your current solution using two nested loops.
You'll have better luck using a hash set to store items you've already added, and since you're using .NET 2.0, a Dictionary can substitute for a hash set:
static List<T> RemoveDuplicates<T>(List<T> input)
{
    List<T> result = new List<T>(input.Count);
    Dictionary<T, object> hashSet = new Dictionary<T, object>();

    foreach (T s in input)
    {
        if (!hashSet.ContainsKey(s))
        {
            result.Add(s);
            hashSet.Add(s, null);
        }
    }
    return result;
}
This runs in O(n) and uses O(2n) space; it will generally work very well for up to 100K items. Actual performance depends on the average length of the strings. If you really need maximum performance, you can exploit more powerful data structures like tries to make inserts even faster.
