I am testing the speed of getting data from a Dictionary vs. a List.
I used this code to test:
internal class Program
{
    private static void Main(string[] args)
    {
        var stopwatch = new Stopwatch();
        List<Grade> grades = Grade.GetData().ToList();
        List<Student> students = Student.GetStudents().ToList();

        stopwatch.Start();
        foreach (Student student in students)
        {
            student.Grade = grades.Single(x => x.StudentId == student.Id).Value;
        }
        stopwatch.Stop();
        Console.WriteLine("Using list {0}", stopwatch.Elapsed);
        stopwatch.Reset();

        students = Student.GetStudents().ToList();
        stopwatch.Start();
        Dictionary<Guid, string> dic = Grade.GetData().ToDictionary(x => x.StudentId, x => x.Value);
        foreach (Student student in students)
        {
            student.Grade = dic[student.Id];
        }
        stopwatch.Stop();
        Console.WriteLine("Using dictionary {0}", stopwatch.Elapsed);
        Console.ReadKey();
    }
}

public class GuidHelper
{
    public static List<Guid> ListOfIds = new List<Guid>();

    static GuidHelper()
    {
        for (int i = 0; i < 10000; i++)
        {
            ListOfIds.Add(Guid.NewGuid());
        }
    }
}

public class Grade
{
    public Guid StudentId { get; set; }
    public string Value { get; set; }

    public static IEnumerable<Grade> GetData()
    {
        for (int i = 0; i < 10000; i++)
        {
            yield return new Grade
            {
                StudentId = GuidHelper.ListOfIds[i],
                Value = "Value " + i
            };
        }
    }
}

public class Student
{
    public Guid Id { get; set; }
    public string Name { get; set; }
    public string Grade { get; set; }

    public static IEnumerable<Student> GetStudents()
    {
        for (int i = 0; i < 10000; i++)
        {
            yield return new Student
            {
                Id = GuidHelper.ListOfIds[i],
                Name = "Name " + i
            };
        }
    }
}
There is a list of students and a list of grades in memory; they have StudentId in common.
In the first approach I find the Grade of each student using LINQ on the list, which takes nearly 7 seconds on my machine. In the second approach I first convert the list into a dictionary and then look up each student's grade in the dictionary by key, which takes less than a second.
When you do this:
student.Grade = grades.Single(x => x.StudentId == student.Id).Value;
As written, Single has to enumerate the list until it finds the entry with the correct StudentId (does entry 0 match the lambda? No... does entry 1 match the lambda? No... and so on) - and because Single must also verify that no second entry matches, it ends up walking the entire list. This is O(n). Since you do it once for every student, it is O(n^2).
However when you do this:
student.Grade = dic[student.Id];
If you want to find a certain element by key in a dictionary, it can jump straight to where the element is stored - this is O(1), so doing it for every student is O(n) overall. (If you want to know how this is done: Dictionary runs a mathematical operation on the key, which turns it into a value that is a place inside the dictionary - the same place it put the entry when it was inserted.)
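To make that parenthetical concrete, here is a rough sketch of how a key's hash code can be mapped to a position (a simplification, not the actual Dictionary internals; the bucket count here is made up):

    int bucketCount = 7;                         // simplified; the real class grows its bucket array as needed
    Guid key = Guid.NewGuid();
    int hash = key.GetHashCode() & 0x7FFFFFFF;   // strip the sign bit
    int bucket = hash % bucketCount;             // the slot where the entry would live
    Console.WriteLine("Key {0} maps to bucket {1}", key, bucket);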
So, dictionary is faster because you used a better algorithm.
The reason is that a dictionary is a lookup, while a list is an iteration.
Dictionary uses a hash lookup, while your list requires walking through its elements from the beginning until it finds the result, each time.
To put it another way: the list will be faster than the dictionary on the first item, because there's nothing to look up - it's the first item, boom, it's done. But the second time the list has to look through the first item, then the second item. The third time through it has to look through the first item, then the second item, then the third item, and so on.
So each iteration the lookup takes more and more time. The larger the list, the longer it takes. The dictionary, on the other hand, has a more or less fixed lookup time (it also increases as the dictionary gets larger, but at a much slower pace, so by comparison it's almost fixed).
When using a Dictionary you are using a key to retrieve your information, which enables it to find the item more efficiently; with the List you are using the Single LINQ expression, which, since it is a list, has no option other than to look through the entire list for the wanted item.
Dictionary uses hashing to search for the data. Each item in the dictionary is stored in buckets of items that contain the same hash. It's a lot quicker.
Try sorting your list; it will be a bit quicker then.
A dictionary uses a hash table, it is a great data structure as it maps an input to a corresponding output almost instantaneously, it has a complexity of O(1) as already pointed out which means more or less immediate retrieval.
The downside is that, for the sake of performance, you need lots of space in advance (depending on the implementation - separate chaining or linear/quadratic probing - you may need at least as much space as you're planning to store, probably double in the latter case), and you need a good hashing algorithm that maps your input ("John Smith") uniquely to a corresponding output, such as a position in an array (hash_array[34521]).
Also listing the entries in a sorted order is a problem. If I may quote Wikipedia:
Listing all n entries in some specific order generally requires a
separate sorting step, whose cost is proportional to log(n) per entry.
Have a read on linear probing and separate chaining for some gorier details :)
Dictionary is based on a hash table which is a rather efficient algorithm to look up things. In a list you have to go element by element in order to find something.
It's all a matter of data organization...
When it comes to looking up data, a keyed collection is always faster than a non-keyed collection. This is because a non-keyed collection has to enumerate its elements to find what you are looking for, while in a keyed collection you can access the element directly via the key.
From MSDN: the Dictionary documentation mentions close to O(1), but I think it also depends on the types involved.
The Dictionary(TKey,TValue) generic class provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.
Note:
The speed of retrieval depends on the quality of the hashing algorithm of the type specified for TKey.
List<T> does not implement a hash lookup, so searching it is sequential and the performance is O(n). Performance also depends on the types involved, and boxing/unboxing needs to be considered.
I have a complex type as:
class Row : IEquatable<Row>
{
    public Type Type1 { get; }
    public Type Type2 { get; }
    public int dummy;

    public override int GetHashCode()
    {
        var type1HashCode = Type1.GetHashCode();
        // djb2 hash
        unchecked
        {
            return ((type1HashCode << 5) + type1HashCode) ^ Type2.GetHashCode();
        }
    }

    // Equals method is also overridden
}
I have a HashSet<Row> and I want to merge it with another HashSet using two different strategies. First I want to merge and keep the duplicates from the main HashSet; for that I tried main.UnionWith(second). Now I want to merge main with second (with the result in main) and keep the duplicates from the second one. How can I do that? (It's performance-critical code.)
My code:
var main = new HashSet<Row>()
{
    new Row(typeof(int), typeof(long))
    {
        dummy = 10
    }
};

var second = new HashSet<Row>()
{
    new Row(typeof(int), typeof(long))
    {
        dummy = 20
    }
};

// Merge here.
Trace.Write(main.First().dummy); // I want 20
I expect main.First().dummy to be 20.
The second strategy can be implemented by calling main.ExceptWith(second) first and then main.UnionWith(second), as in the first strategy.
Since the UnionWith is basically a shortcut for
foreach (var element in second)
    main.Add(element);
and ExceptWith - a shortcut for
foreach (var element in second)
    main.Remove(element);
the second strategy can also be implemented with a single loop:
foreach (var element in second)
{
    main.Remove(element);
    main.Add(element);
}
But I think the performance gain would be negligible compared to the ExceptWith + UnionWith approach.
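For example, applied to the main and second sets from the question, the second strategy would look like this sketch:

    // Keep the duplicates from `second`: remove the equal rows from `main` first,
    // then add everything from `second`.
    main.ExceptWith(second);
    main.UnionWith(second);

    Trace.Write(main.First().dummy); // now prints 20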
If I'm reading this correctly, you want to keep duplicated values after merging. In this scenario, HashSet is the wrong data structure for your objective.
From the MSDN documentation for HashSet(T):
A HashSet collection is not sorted and cannot contain duplicate elements. If order or element duplication is more important than performance for your application, consider using the List class together with the Sort method.
I need to have a property that will be an array that can hold both ints and strings.
If I set the property to an array of ints it should hold ints, so that when I am searching through the array the search will be fast; at odd times this property will also contain strings, for which the search will be slow.
Is there any way, other than the following, to have a list that contains native types?
two properties one for ints and one for strings
use List< object >
UPDATE:
The use-case is as follows. I have a database field [ReferenceNumber] that holds the values (integers and strings) and another field [SourceID] (used for other things) which can be used to determine whether a record holds an int or a string.
I will be fetching collections of these records based on the source id; of course depending on what the source is, the list will be either integers or strings. Then I will go through this collection looking for certain reference numbers; if they already exist I will not add them, and if they don't I will add them. I will be pre-fetching a lot of records instead of hitting the database over and over.
So for example, if I get a list for SourceID = 1, that means the values are ints, and when searching I want the underlying list to be int so the search will be fast. If SourceID is, say, 2, the values are strings; those are rare, so it's okay if the search is slow, because there are not that many of those records and a performance hit when searching through strings is acceptable.
I will go through this collection looking for certain reference numbers; if they already exist I will not add them, and if they don't I will add them.
It sounds to me like you don't need a List<>, but rather a HashSet<>. Simply use a HashSet<object>, and Add() all the items, and the collection will ignore duplicate items. It will be super-fast, regardless of whether you're dealing with ints or strings.
On my computer, the following code shows that it takes about 50 milliseconds to populate an initial 400,000 unique strings in the hashset, and about 2 milliseconds to add an additional 10,000 random strings:
var sw = new Stopwatch();
var initial = Enumerable.Range(1, 400000).Select(i => i.ToString()).ToList();

sw.Start();
var set = new HashSet<object>(initial);
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);

var random = new Random();
var additional = Enumerable.Range(1, 10000).Select(i => random.Next(1000000).ToString()).ToList();

sw.Restart();
foreach (var item in additional)
{
    set.Add(item);
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
Also, in case it's important: a HashSet<> does not guarantee that it preserves insertion order. In practice items often come back in the order they were added when nothing has been removed, but you should not rely on it.
The only other thing I would suggest is a custom object that implements IComparable
class Multitype : IComparable
{
    public int? Number { get; set; }
    public string Words { get; set; }

    public int CompareTo(object obj)
    {
        Multitype other = obj as Multitype;
        if (Number != null && other != null && other.Number != null)
        {
            //...
        }
        else
        {
            //...
        }
    }
}
There will be some extra comparison steps between numbers, but not as much as string parsing.
Are you storing a ton of data? Is that performance difference really going to matter?
It's possible to use generics if you implement them on the class. Not sure if this solves your problem. Would be interested to hear the real-world example of a property that can have different types.
class Foo<T>
{
    public List<T> GenericList { get; set; }

    public Foo()
    {
        this.GenericList = new List<T>();
    }
}
If by "use List" you mean the object primitive or provided System.Object, that is an option, but I think it would behoove you to make your own wrapper object -- IntString or similar -- that would handle everything for you. It should implement IComparable, as the other gal mentioned.
You can increase the efficiency of sorting your object in collections by writing a CompareTo method that does exactly what you need it to. Writing a good CompareTo method is a whole can of worms in itself, so you should probably start a new question for that, if that's what you want.
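As a rough illustration only - the name IntString and the ordering rule here are my assumptions, not something from your code, and it uses the generic IComparable<IntString> rather than the non-generic interface - such a wrapper might start out like this:

    public class IntString : IComparable<IntString>
    {
        public int? Number { get; private set; }
        public string Words { get; private set; }

        public IntString(int number)   { Number = number; }
        public IntString(string words) { Words = words; }

        public int CompareTo(IntString other)
        {
            if (other == null) return 1;
            // Assumed rule: numbers sort before strings; adjust to whatever you actually need.
            if (Number.HasValue && other.Number.HasValue) return Number.Value.CompareTo(other.Number.Value);
            if (Number.HasValue) return -1;
            if (other.Number.HasValue) return 1;
            return string.Compare(Words, other.Words, StringComparison.Ordinal);
        }
    }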
If you're looking for a property that is strongly typed as a List<Int> or List<String> at instantiation, but can change afterwards, then you want an interface. IList exists, but won't help you, since that must also be strongly typed upon declaration. You should probably make something like IIntStringList that can only be one of List<Int> or List<String>.
Sorry this answer doesn't have that many details (I need to leave the office now), but I hope I've set you on the right track.
I need to store a collection of nodes:
class Node
{
int Value;
//other info
}
I have three requirements:
Need to be able to efficiently retrieve the node with the lowest Value in the collection
Need to be able to efficiently insert a node into the collection
Two nodes can have the same Value
I thought the best collection to use for this would be some sort of sorted list. That way requirement #1 is satisfied efficiently by just taking the first element from the sorted list. Requirement #2 is satisfied efficiently by inserting a new node in the right place in the list.
But the SortedList collection in .Net is like SortedDictionary and requires the key being sorted on to be unique, which violates requirement #3.
There appears to be no collection in .Net that satisfies these requirements, mainly because the self-sorting collections that do exist require keys being sorted on to be unique. What is the reason for this? I assume it cannot be an oversight. What am I not grasping here? I can find similar questions about this but they usually involve someone suggesting SortedList, followed by realizing this doesn't work, and then the conversation fades out without a standard solution. At least if someone would say "There is no collection in C# for this task, you need to hack something together" that would be an answer.
Is it acceptable to use a regular List<Node> and re-sort the list whenever a new node is added? Seems like that wouldn't be as efficient as inserting the node in the right place to begin with. Perhaps that is what I should do? Manually iterate over the list until I find the place to insert a new node myself?
If all you need is to efficiently insert, and quickly retrieve the item with the lowest value, then you don't need a sorted list. You need a heap. Check out A Generic Binary Heap Class.
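If you are on .NET 6 or later, the built-in PriorityQueue<TElement, TPriority> also covers all three requirements (it allows duplicate priorities); a minimal sketch, enqueueing each node with its Value as the priority:

    var queue = new PriorityQueue<Node, int>();

    // Requirement #2: insert is O(log n); requirement #3: duplicate priorities are fine.
    queue.Enqueue(new Node(), 5);
    queue.Enqueue(new Node(), 1);
    queue.Enqueue(new Node(), 1);

    // Requirement #1: the node enqueued with the lowest priority (Value) comes out first.
    Node lowest = queue.Dequeue();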
Make your list_key unique by adding the object id or another unique identifier: IDs 4 and 5, both having value "1" will become "1_4" and "1_5", which can be added to the sorted List without trouble and will be sorted as expected.
You could use a SortedList<int, List<NodeInfo>>, where you'll put the Value in the key and all the other properties in the value:
public class NodeList : SortedList<int, List<NodeInfo>>
{
    public void Add(int key, NodeInfo info)
    {
        if (this.Keys.Contains(key))
        {
            this[key].Add(info);
        }
        else
        {
            this.Add(key, new List<NodeInfo>() { info });
        }
    }

    public NodeInfo FirstNode()
    {
        if (this.Count == 0)
            return null;
        return this.First().Value.First();
    }
}

public class NodeInfo
{
    public string Info { get; set; }
    // TODO: add other members
}
Here's some sample usage:
var list = new NodeList();

// adding
list.Add(3, new NodeInfo() { Info = "some info 3" });

// inserting
for (int i = 0; i < 100000; i++)
{
    list.Add(1, new NodeInfo() { Info = "some info 1" });
    list.Add(2, new NodeInfo() { Info = "some info 2" });
    list.Add(1, new NodeInfo() { Info = "some info 1.1" });
}

// retrieving the first item
var firstNodeInfo = list.FirstNode();

// retrieving an item
var someNodeInfo = list[2].First();
In my opinion, it is acceptable to use a normal list and re-sort it after every insert. Sorting is pretty efficient in .NET. See this thread: String sorting performance degradation in VS2010 vs. VS2008
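A rough sketch of that approach (pending stands in for whatever node you are inserting, and it assumes Node.Value is accessible):

    var nodes = new List<Node>();

    nodes.Add(pending);
    nodes.Sort((a, b) => a.Value.CompareTo(b.Value)); // re-sort after the insert

    Node lowest = nodes[0]; // the node with the lowest Value is now at the front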
You can use OrderedMultiDictionary in Wintellect's Power Collections for .NET. That's exactly what you are looking for.
I am writing an application that validates some cities. Part of the validation is checking if the city is already in a list by matching the country code and cityname (or alt cityname).
I am storing my existing cities list as:
public struct City
{
    public int id;
    public string countrycode;
    public string name;
    public string altName;
    public int timezoneId;
}

List<City> cityCache = new List<City>();
I then have a list of location strings that contain country codes and city names etc. I split each string and then check whether the city already exists.
string cityString = GetCity();     // get the city string
string countryCode = GetCountry(); // get the country string
city = new City();                 // create a new city object

if (!string.IsNullOrEmpty(cityString)) // don't bother checking if no city was specified
{
    // check if the city exists in the list in the same country
    city = cityCache.FirstOrDefault(x => countryCode == x.countrycode && (Like(x.name, cityString) || Like(x.altName, cityString)));

    // if no city is found, search for a single match across any country
    if (city.id == default(int) && cityCache.Count(x => Like(x.name, cityString) || Like(x.altName, cityString)) == 1)
        city = cityCache.FirstOrDefault(x => Like(x.name, cityString) || Like(x.altName, cityString));
}

if (city.id == default(int))
{
    // city not matched
}
This is very slow for lots of records, as I am also checking other objects like airports and countries in the same way. Is there any way I can speed this up? Is there a faster collection for this kind of comparison than List<>, and is there a faster comparison function than FirstOrDefault()?
EDIT
I forgot to post my Like() function:
bool Like(string s1, string s2)
{
    if (string.IsNullOrEmpty(s1) || string.IsNullOrEmpty(s2))
        return s1 == s2;
    if (s1.ToLower().Trim() == s2.ToLower().Trim())
        return true;
    return Regex.IsMatch(Regex.Escape(s1.ToLower().Trim()), Regex.Escape(s2.ToLower().Trim()) + ".");
}
I would use a HashSet for the CityString and CountryCode.
Something like
var validCountryCode = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

if (validCountryCode.Contains(city.CountryCode))
{
}
etc...
Personally I would do all the validation in the constructor to ensure only valid City objects exist.
Other things to watch out for, performance-wise:
Use a HashSet when you're checking whether a value is in a list of valid values.
Use IEqualityComparer where appropriate, reuse the object to avoid the construction/GC costs.
Use a Dictionary for anything you need to lookup (e.g. timeZoneId)
Edit 1
Your cityCache could be something like:
var cityCache = new Dictionary<string, Dictionary<string, int>>();
var countryCode = "";
var cityCode = "";
var id = x;

public static bool IsCityValid(City c)
{
    return
        cityCache.ContainsKey(c.CountryCode) &&
        cityCache[c.CountryCode].ContainsKey(c.CityCode) &&
        cityCache[c.CountryCode][c.CityCode] == c.Id;
}
Edit 2
Didn't think I'd have to explain this, but based on the comments, maybe I do.
FirstOrDefault() is an O(n) operation. Every time you try to find something in a list, you can either be lucky and it is the first item, or unlucky and it is the last; on average you scan list.Count / 2 items. A dictionary, on the other hand, is an O(1) lookup. Using the IEqualityComparer it generates a HashCode() and looks up which bucket the item sits in. Only if there are lots of collisions will it then use Equals to find what you're after among the items in the same bucket. Even with a poor-quality HashCode() (short of always returning the same hash code), because Dictionary / HashSet use prime-number bucket counts, you split your data up and reduce the number of equality checks you need to perform.
So a list of 10 objects means you're on average running Like() 5 times.
A Dictionary of the same 10 objects, as below (depending on the quality of the HashCode), could need as little as one HashCode() call followed by one Equals() call.
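As an illustration of the IEqualityComparer point - the comparer below keys on country code plus city name, which is an assumption about what identifies a city in your data, and it only covers exact case-insensitive matches, not your partial Like() matching:

    class CityKeyComparer : IEqualityComparer<City>
    {
        public bool Equals(City x, City y)
        {
            return string.Equals(x.countrycode, y.countrycode, StringComparison.OrdinalIgnoreCase)
                && string.Equals(x.name, y.name, StringComparison.OrdinalIgnoreCase);
        }

        public int GetHashCode(City c)
        {
            return StringComparer.OrdinalIgnoreCase.GetHashCode(c.countrycode ?? "")
                 ^ StringComparer.OrdinalIgnoreCase.GetHashCode(c.name ?? "");
        }
    }

    // Lookups against this set are near O(1) instead of scanning the whole list:
    var citySet = new HashSet<City>(cityCache, new CityKeyComparer());
    bool exists = citySet.Contains(new City { countrycode = countryCode, name = cityString });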
This sounds like a good candidate for a binary tree.
For binary tree implementations in .NET, see: Objects that represent trees
EDIT:
If you want to search a collection quickly, and that collection is particularly large, then your best option is to sort it and implement a search algorithm based on that sorting.
Binary trees are a good option when you want to search quickly and insert items relatively infrequently. To keep your searches quick, though, you'll need to use a self-balancing binary tree.
For this to work properly, though, you'll also need a standard key to use for your cities. A numeric key would be best, but strings can work fine too. If you concatenated your city with other information (such as the state and country) you will get a nice unique key. You could also change the case to all upper- or lower-case to get a case-insensitive key.
If you don't have a key, then you can't sort your data. If you can't sort your data, then there aren't going to be many "quick" options.
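A rough sketch of the sort-then-search idea using the built-in List<T>.Sort and List<T>.BinarySearch with a composite, case-normalized key (the key format here is an assumption):

    // Build and sort the keys once, e.g. at load time.
    List<string> cityKeys = cityCache
        .Select(c => (c.countrycode + "|" + c.name).Trim().ToLowerInvariant())
        .ToList();
    cityKeys.Sort(StringComparer.Ordinal);

    // O(log n) lookup instead of a linear scan.
    string lookup = (countryCode + "|" + cityString).Trim().ToLowerInvariant();
    bool found = cityKeys.BinarySearch(lookup, StringComparer.Ordinal) >= 0;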
EDIT 2:
I notice that your Like function reworks your strings a lot. Every ToLower() and Trim() call creates a new string, which is expensive. You would be much better off performing the ToLower() and Trim() calls once, preferably when you are first loading your data. This will probably speed up your function considerably.
I'm having issues finding the most efficient way to remove duplicates from a list of strings (List<string>).
My current implementation is a dual foreach loop that checks whether each item occurs only once, removing the second occurrence otherwise.
I know there are MANY other questions out there, but all the best solutions require something newer than .NET 2.0, which is the current build environment I'm working in. (GM and Chrysler are very resistant to change... :) )
This limits the possible results by not allowing any LINQ, or HashSets.
The code I'm using is Visual C++, but a C# solution will work just fine as well.
Thanks!
This probably isn't what you're looking for, but if you have control over this, the most efficient way would be to not add them in the first place...
Do you have control over this? If so, all you'd need to do is a myList.Contains(currentItem) call before you add the item, and you're set.
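A minimal sketch of that check (fine for small lists, but note that Contains is a linear scan, so this is O(n) per add):

    if (!myList.Contains(currentItem))
    {
        myList.Add(currentItem);
    }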
You could do the following.
List<string> list = GetTheList();
Dictionary<string, object> map = new Dictionary<string, object>();
int i = 0;
while (i < list.Count)
{
    string current = list[i];
    if (map.ContainsKey(current))
    {
        list.RemoveAt(i);
    }
    else
    {
        i++;
        map.Add(current, null);
    }
}
This has the overhead of building a Dictionary<TKey,TValue> object, which duplicates the set of unique values in the list. But it's fairly efficient speed-wise.
I'm no Comp Sci PhD, but I'd imagine using a dictionary, with the items in your list as the keys would be fast.
Since a dictionary doesn't allow duplicate keys, you'd only have unique strings at the end of iteration.
Just remember, when providing a custom class, to override the Equals() method so that Contains() functions as required.
Example
List<CustomClass> clz = new List<CustomClass>();

public class CustomClass
{
    public override bool Equals(object obj)
    {
        // Put equality code here...
    }
}
If you're going the route of "just don't add duplicates", then checking "List.Contains" before adding an item works, but it's O(n^2) where n is the number of strings you want to add. It's no different from your current solution using two nested loops.
You'll have better luck using a hashset to store items you've already added, but since you're using .NET 2.0, a Dictionary can substitute for a hash set:
static List<T> RemoveDuplicates<T>(List<T> input)
{
    List<T> result = new List<T>(input.Count);
    Dictionary<T, object> hashSet = new Dictionary<T, object>();

    foreach (T s in input)
    {
        if (!hashSet.ContainsKey(s))
        {
            result.Add(s);
            hashSet.Add(s, null);
        }
    }
    return result;
}
This runs in O(n) and uses O(2n) space; it will generally work very well for up to 100K items. Actual performance depends on the average length of the strings -- if you really need maximum performance, you can exploit more powerful data structures like tries to make inserts even faster.