I am writing an application that validates cities. Part of the validation is checking whether the city is already in a list by matching the country code and city name (or alternative city name).
I am storing my existing cities list as:
public struct City
{
public int id;
public string countrycode;
public string name;
public string altName;
public int timezoneId;
}
List<City> cityCache = new List<City>();
I then have a list of location strings that contain country codes, city names, etc. I split each string and then check whether the city already exists.
string cityString = GetCity(); //get the city string
string countryCode = GetCountry(); //get the country string
City city = new City(); //create a new city object
if (!string.IsNullOrEmpty(cityString)) //don't bother checking if no city was specified
{
//check if city exists in the list in the same country
city = cityCache.FirstOrDefault(x => countryCode == x.countrycode && (Like(x.name, cityString) || Like(x.altName, cityString)));
//if no city is found, search for a single match across any country
if (city.id == default(int) && cityCache.Count(x => Like(x.name, cityString) || Like(x.altName, cityString)) == 1)
    city = cityCache.FirstOrDefault(x => Like(x.name, cityString) || Like(x.altName, cityString));
}
if (city.id == default(int))
{
//city not matched
}
This is very slow for lots of records, as I am also checking other objects like airports and countries in the same way. Is there any way I can speed this up? Is there a faster collection for this kind of comparison than List<>, and is there a faster comparison function than FirstOrDefault()?
EDIT
I forgot to post my Like() function:
bool Like(string s1, string s2)
{
if (string.IsNullOrEmpty(s1) || string.IsNullOrEmpty(s2))
return s1 == s2;
if (s1.ToLower().Trim() == s2.ToLower().Trim())
return true;
return Regex.IsMatch(Regex.Escape(s1.ToLower().Trim()), Regex.Escape(s2.ToLower().Trim()) + ".");
}
I would use a HashSet for the CityString and CountryCode.
Something like
var validCountryCode = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
if (validCountryCode.Contains(city.CountryCode))
{
}
etc...
Personally I would do all the validation in the constructor to ensure only valid City objects exist.
Other things to watch out for, performance-wise:
Use a HashSet if you're looking something up in a list of valid values.
Use IEqualityComparer where appropriate, reuse the object to avoid the construction/GC costs.
Use a Dictionary for anything you need to lookup (e.g. timeZoneId)
Edit 1
Your cityCache could be something like:
var cityCache = new Dictionary<string, Dictionary<string, int>>();
// populated elsewhere as: cityCache[countryCode][cityCode] = id;

public static bool IsCityValid(City c)
{
    return
        cityCache.ContainsKey(c.CountryCode) &&
        cityCache[c.CountryCode].ContainsKey(c.CityCode) &&
        cityCache[c.CountryCode][c.CityCode] == c.Id;
}
Edit 2
I didn't think I'd have to explain this, but based on the comments, maybe I do.
FirstOrDefault() is an O(n) operation. Every time you try to find something in a list, you can either be lucky and it is the first item, or unlucky and it is the last; on average you check list.Count / 2 items. A dictionary, on the other hand, is an O(1) lookup. Using the IEqualityComparer it will generate a HashCode() and look up which bucket the item sits in. Only if there are loads of collisions will it use Equals to find what you're after among the things in the same bucket. Even with a poor-quality HashCode() (short of always returning the same HashCode), because Dictionary / HashSet use prime-number bucket counts, you will split your list up, reducing the number of equality checks you need to complete.
So a list of 10 objects means you're running Like() 5 times on average.
A Dictionary of the same 10 objects (depending on the quality of the HashCode) could be as little as one HashCode() call followed by one Equals() call.
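To make that concrete, here is a minimal sketch of a hash-based city lookup with an IEqualityComparer. The CityKey type and its fields are invented for this example; the point is that one GetHashCode() plus (usually) one Equals() replaces a linear scan of Like() calls.
struct CityKey
{
    public string CountryCode;
    public string Name;
}

class CityKeyComparer : IEqualityComparer<CityKey>
{
    public bool Equals(CityKey x, CityKey y)
    {
        return string.Equals(x.CountryCode, y.CountryCode, StringComparison.OrdinalIgnoreCase)
            && string.Equals(x.Name, y.Name, StringComparison.OrdinalIgnoreCase);
    }

    public int GetHashCode(CityKey k)
    {
        // StringComparer keeps the hash case-insensitive, matching Equals above.
        int h1 = k.CountryCode == null ? 0 : StringComparer.OrdinalIgnoreCase.GetHashCode(k.CountryCode);
        int h2 = k.Name == null ? 0 : StringComparer.OrdinalIgnoreCase.GetHashCode(k.Name);
        return h1 * 31 + h2;
    }
}

// One O(1) lookup instead of scanning the whole list:
var byKey = new Dictionary<CityKey, City>(new CityKeyComparer());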
This sounds like a good candidate for a binary tree.
For binary tree implementations in .NET, see: Objects that represent trees
EDIT:
If you want to search a collection quickly, and that collection is particularly large, then your best option is to sort it and implement a search algorithm based on that sorting.
Binary trees are a good option when you want to search quickly and insert items relatively infrequently. To keep your searches quick, though, you'll need to use a balancing binary tree.
For this to work properly, though, you'll also need a standard key to use for your cities. A numeric key would be best, but strings can work fine too. If you concatenate your city with other information (such as the state and country) you will get a nice unique key. You could also change the case to all upper- or lower-case to get a case-insensitive key.
If you don't have a key, then you can't sort your data. If you can't sort your data, then there aren't going to be many "quick" options.
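As a hedged sketch of that approach: .NET's SortedDictionary<TKey, TValue> is backed by a balanced binary search tree (red-black), so lookups are O(log n). The composite key format below is an assumption for illustration only.
static string MakeKey(string countryCode, string cityName)
{
    // Concatenated, upper-cased composite key, as described above.
    return (countryCode.Trim() + "|" + cityName.Trim()).ToUpperInvariant();
}

var byKey = new SortedDictionary<string, City>(StringComparer.Ordinal);
// populate once: byKey[MakeKey(c.countrycode, c.name)] = c;

City match;
if (byKey.TryGetValue(MakeKey(countryCode, cityString), out match))
{
    // found in O(log n)
}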
EDIT 2:
I notice that your Like function manipulates your strings a lot. Strings are immutable, so every ToLower() and Trim() call allocates a new string, which is expensive. You would be much better off performing the ToLower() and Trim() normalization once, preferably when you first load your data. This will probably speed up your function considerably.
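A minimal sketch of that suggestion, normalizing into the existing fields at load time (assuming the original casing isn't needed elsewhere; rawCities is a hypothetical source):
foreach (var c in rawCities)
{
    cityCache.Add(new City
    {
        id = c.id,
        countrycode = c.countrycode,
        name = c.name == null ? null : c.name.Trim().ToLowerInvariant(),
        altName = c.altName == null ? null : c.altName.Trim().ToLowerInvariant(),
        timezoneId = c.timezoneId
    });
}
// The incoming cityString is then normalized once per lookup,
// instead of once per comparison inside Like():
string needle = cityString == null ? null : cityString.Trim().ToLowerInvariant();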
Related
I'm working on an algorithm for recommending restaurants to the client. These recommendations are based on a few filters, but mostly on comparing reviews people have left on restaurants. (I'll spare you the details.)
For calculating a Pearson correlation (a number which determines how well users fit with each other) I have to check where users have left a review on the same restaurant. To increase the number of matches, I've included a match on the price range of the subjects. I'll try to explain; here is my Restaurant class:
public class Restaurant
{
public Guid Id { get; set; }
public int PriceRange { get; set; }
}
This is a simplified version, but it's enough for my example. A PriceRange is an integer from 1 to 5 which determines how expensive the restaurant is.
Here's the for loop I'm using to check if they left reviews on the same restaurant or a review on a restaurant with the same pricerange.
//List<Review> user1Reviews is a list of all reviews from the first user
//List<Review> user2Reviews is a list of all reviews from the second user
Dictionary<Review, Review> shared_items = new Dictionary<Review, Review>();
foreach (var review1 in user1Reviews)
foreach (var review2 in user2Reviews)
if (review1.Restaurant.Id == review2.Restaurant.Id ||
review1.Restaurant.PriceRange == review2.Restaurant.PriceRange)
if (!shared_items.ContainsKey(review1))
shared_items.Add(review1, review2);
Now here's my actual problem. You can see I'm looping over the second list for each review the first user has left. Is there a way to improve the performance of these loops? I have tried using a HashSet and the .Contains() method, but I need to include more criteria (i.e. the price range). I couldn't figure out how to include that in a HashSet.
I hope it's not too confusing, and thanks in advance for any help!
Edit: After testing both linq and the for loops I have concluded that the for loops is twice as fast as using linq. Thanks for your help!
You could try replacing your inner loop with a LINQ query using the criteria of the outer loop:
foreach (var review1 in user1Reviews)
{
var review2 = user2Reviews.FirstOrDefault(r2 => r2.Restaurant.Id == review1.Restaurant.Id ||
r2.Restaurant.PriceRange == review1.Restaurant.PriceRange);
if (review2 != null)
{
if (!shared_items.ContainsKey(review1))
shared_items.Add(review1, review2);
}
}
If there are multiple matches you should use Where and deal with the potential list of results.
I'm not sure it would be any quicker though as you still have to check all the user2 reviews against the user1 reviews.
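For completeness, a sketch of the Where variant (keeping every match rather than just the first):
foreach (var review1 in user1Reviews)
{
    var matches = user2Reviews.Where(r2 => r2.Restaurant.Id == review1.Restaurant.Id ||
                                           r2.Restaurant.PriceRange == review1.Restaurant.PriceRange)
                              .ToList();
    if (matches.Count > 0 && !shared_items.ContainsKey(review1))
        shared_items.Add(review1, matches[0]); // or store/handle the whole list
}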
However, if you wrote a custom comparer for your restaurant class you could use this overload of Intersect to return the common reviews:
var commonReviews = user1Reviews.Intersect(user2Reviews, new RestaurantComparer());
Where RestaurantComparer looks something like this:
// Custom comparer for the Restaurant class
class RestaurantComparer : IEqualityComparer<Restaurant>
{
// Restaurants are equal if their ids and price ranges are equal.
public bool Equals(Restaurant x, Restaurant y)
{
//Check whether the compared objects reference the same data.
if (Object.ReferenceEquals(x, y)) return true;
//Check whether any of the compared objects is null.
if (Object.ReferenceEquals(x, null) || Object.ReferenceEquals(y, null))
return false;
//Check whether the properties are equal.
return x.Id == y.Id && x.PriceRange == y.PriceRange;
}
// If Equals() returns true for a pair of objects
// then GetHashCode() must return the same value for these objects.
public int GetHashCode(Restaurant restaurant)
{
    //Check whether the object is null.
    if (Object.ReferenceEquals(restaurant, null)) return 0;
    //Get hash code for the Id field.
    int hashId = restaurant.Id.GetHashCode();
    //Get hash code for the PriceRange field.
    int hashPriceRange = restaurant.PriceRange.GetHashCode();
    //Calculate the hash code for the restaurant.
    return hashId ^ hashPriceRange;
}
}
You basically need a fast way to locate a review by Id or PriceRange. Normally you would use a fast hash-based lookup structure like Dictionary<TKey, TValue> for a single key, or a composite key if the match operation were and. Unfortunately yours is or, so a single Dictionary doesn't work.
Well, not really. A single dictionary does not work, but you can use two dictionaries, and since a dictionary lookup is O(1), the operation will still be O(N) (rather than O(N * M) as with the inner loop / naïve LINQ).
Since the keys are not unique, instead of dictionaries you can use lookups, keeping the same efficiency:
var lookup1 = user2Reviews.ToLookup(r => r.Restaurant.Id);
var lookup2 = user2Reviews.ToLookup(r => r.Restaurant.PriceRange);
foreach (var review1 in user1Reviews)
{
var review2 = lookup1[review1.Restaurant.Id].FirstOrDefault() ??
              lookup2[review1.Restaurant.PriceRange].FirstOrDefault();
if (review2 != null)
{
// do something
}
}
I am trying to find the best way to determine if a DataTable
Contains duplicate data in a specific column
or
If any field within said column is not found in an external Dictionary<string, string>, or the value found there does not match a string literal.
This is what I've come up with:
List<string> dtSKUsColumn = _dataTable.Select()
                                      .Select(x => x.Field<string>("skuColumn"))
                                      .ToList();
bool hasError = dtSKUsColumn.Distinct().Count() != dtSKUsColumn.Count() ||
!_dataTable.AsEnumerable()
.All(r => allSkuTypes
.Any(s => s.Value == "normalSKU" &&
s.Key == r.Field<string>("skuColumn")));
allSkuTypes is a Dictionary<string, string> where the key is the SKU itself, and the value is the SKU type.
I cannot just operate on a 'distinct' _dataTable, because there is a column that must contain identical fields (Said column cannot be removed and inferred, since I need to preserve the state of _dataTable).
So my question:
Am I handling this in the best possible way, or is there a simpler and faster method?
UPDATE:
The DataTable is not obtained via an SQL query; rather, it is generated by a set of rules from a spreadsheet or CSV. I have to make do with the allSkuTypes and _dataTable objects as my only 'outside information.'
Your solution is not optimal.
Let N = _dataTable.Rows.Count and M = allSkuTypes.Count. Your algorithm has O(2 * N) space complexity (the memory allocated by the ToList and Distinct calls) and O(N * M) time complexity (due to linear search in allSkuTypes for each _dataTable record).
Here is IMO the optimal solution. It uses single pass over the _dataTable records, a HashSet<string> for detecting the duplicates and TryGetValue method of the Dictionary for checking the second rule, thus ending up with O(N) space and time complexity:
var dtSkus = new HashSet<string>();
bool hasError = false;
foreach (var row in _dataTable.AsEnumerable())
{
var sku = row.Field<string>("skuColumn");
string type;
if (!dtSkus.Add(sku) || !allSkuTypes.TryGetValue(sku, out type) || type != "normalSKU")
{
hasError = true;
break;
}
}
The additional benefit is that you have the row with the broken rule, and the code can easily be modified to take different actions depending on which rule is broken, or to collect/count only the first or all invalid records, etc.
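For example, a sketch of that modification which collects every offending row instead of stopping at the first:
var dtSkus = new HashSet<string>();
var invalidRows = new List<DataRow>();
foreach (var row in _dataTable.AsEnumerable())
{
    var sku = row.Field<string>("skuColumn");
    string type;
    if (!dtSkus.Add(sku) || !allSkuTypes.TryGetValue(sku, out type) || type != "normalSKU")
        invalidRows.Add(row); // keep going; report them all at the end
}
bool hasError = invalidRows.Count > 0;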
I am testing the speed of getting data from a Dictionary vs. a List.
I've used this code to test :
internal class Program
{
private static void Main(string[] args)
{
var stopwatch = new Stopwatch();
List<Grade> grades = Grade.GetData().ToList();
List<Student> students = Student.GetStudents().ToList();
stopwatch.Start();
foreach (Student student in students)
{
student.Grade = grades.Single(x => x.StudentId == student.Id).Value;
}
stopwatch.Stop();
Console.WriteLine("Using list {0}", stopwatch.Elapsed);
stopwatch.Reset();
students = Student.GetStudents().ToList();
stopwatch.Start();
Dictionary<Guid, string> dic = Grade.GetData().ToDictionary(x => x.StudentId, x => x.Value);
foreach (Student student in students)
{
student.Grade = dic[student.Id];
}
stopwatch.Stop();
Console.WriteLine("Using dictionary {0}", stopwatch.Elapsed);
Console.ReadKey();
}
}
public class GuidHelper
{
public static List<Guid> ListOfIds = new List<Guid>();
static GuidHelper()
{
for (int i = 0; i < 10000; i++)
{
ListOfIds.Add(Guid.NewGuid());
}
}
}
public class Grade
{
public Guid StudentId { get; set; }
public string Value { get; set; }
public static IEnumerable<Grade> GetData()
{
for (int i = 0; i < 10000; i++)
{
yield return new Grade
{
StudentId = GuidHelper.ListOfIds[i], Value = "Value " + i
};
}
}
}
public class Student
{
public Guid Id { get; set; }
public string Name { get; set; }
public string Grade { get; set; }
public static IEnumerable<Student> GetStudents()
{
for (int i = 0; i < 10000; i++)
{
yield return new Student
{
Id = GuidHelper.ListOfIds[i],
Name = "Name " + i
};
}
}
}
There is a list of students and a list of grades in memory; they have StudentId in common.
In the first approach I find the Grade of each student using LINQ on a list, which takes nearly 7 seconds on my machine. In the second approach I first convert the List into a dictionary and then find the grades of students by key, which takes less than a second.
When you do this:
student.Grade = grades.Single(x => x.StudentId == student.Id).Value;
As written it has to enumerate the List until it finds the entry that has the correct StudentId (does entry 0 match the lambda? No... Does entry 1 match the lambda? No... etc.). This is O(n). Since you do it once for every student, it is O(n^2).
However when you do this:
student.Grade = dic[student.Id];
If you want to find a certain element by key in a dictionary, it can instantly jump to where it is in the dictionary - this is O(1). O(n) for doing it for every student. (If you want to know how this is done - Dictionary runs a mathematical operation on the key, which turns it into a value that is a place inside the dictionary, which is the same place it put it when it was inserted)
So, dictionary is faster because you used a better algorithm.
The reason is that a dictionary is a lookup, while a list is an iteration.
Dictionary uses a hash lookup, while your list requires walking through the list from the beginning until it finds the result, each time.
To put it another way: the list will be faster than the dictionary on the first item, because there's nothing to look up. It's the first item, boom, it's done. But the second time the list has to look through the first item, then the second item. The third time through it has to look through the first item, then the second item, then the third item, etc.
So each iteration the lookup takes more and more time. The larger the list, the longer it takes. The dictionary, on the other hand, has a more or less fixed lookup time (it also increases as the dictionary gets larger, but at a much slower pace, so by comparison it's almost fixed).
When using a Dictionary you are using a key to retrieve your information, which enables it to find the item more efficiently; with a List you are using a Single LINQ expression, which, since it is a list, has no option other than to look through the entire list for the wanted item.
Dictionary uses hashing to search for the data. Each item in the dictionary is stored in buckets of items that contain the same hash. It's a lot quicker.
Try sorting your list; it will be a bit quicker then.
A dictionary uses a hash table, it is a great data structure as it maps an input to a corresponding output almost instantaneously, it has a complexity of O(1) as already pointed out which means more or less immediate retrieval.
The con is that, for the sake of performance, you need lots of space in advance (depending on the implementation, be it separate chaining or linear/quadratic probing, you may need at least as much space as you're planning to store, probably double in the latter case), and you need a good hashing algorithm that maps your input ("John Smith") uniquely to a corresponding output such as a position in an array (hash_array[34521]).
Also listing the entries in a sorted order is a problem. If I may quote Wikipedia:
Listing all n entries in some specific order generally requires a
separate sorting step, whose cost is proportional to log(n) per entry.
Have a read on linear probing and separate chaining for some gorier details :)
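To make separate chaining concrete, here is a toy illustration (a sketch, not production code): each bucket holds a small list of entries whose hash codes collide modulo the bucket count, and equality checks only run within one bucket.
class TinyHashTable
{
    // A prime bucket count helps spread keys across buckets.
    private readonly List<KeyValuePair<string, string>>[] buckets =
        new List<KeyValuePair<string, string>>[17];

    public void Add(string key, string value)
    {
        int i = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        if (buckets[i] == null)
            buckets[i] = new List<KeyValuePair<string, string>>();
        buckets[i].Add(new KeyValuePair<string, string>(key, value));
    }

    public bool TryGet(string key, out string value)
    {
        int i = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        var bucket = buckets[i];
        if (bucket != null)
        {
            foreach (var kv in bucket) // equality checks only within this bucket
            {
                if (kv.Key == key) { value = kv.Value; return true; }
            }
        }
        value = null;
        return false;
    }
}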
Dictionary is based on a hash table which is a rather efficient algorithm to look up things. In a list you have to go element by element in order to find something.
It's all a matter of data organization...
When it comes to lookup of data, a keyed collection is always faster than a non-keyed collection. This is because a non-keyed collection will have to enumerate its elements to find what you are looking for. While in a keyed collection you can just access the element directly via the key.
These are some nice articles for comparing list to dictionary.
Here. And this one.
From MSDN - Dictionary mentions close to O(1) but I think it depends on the types involved.
The Dictionary(TKey,TValue) generic class provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.
Note:
The speed of retrieval depends on the quality of the hashing algorithm of the type specified for TKey.
List(TValue) does not implement a hash lookup so it is sequential and the performance is O(n). It also depends on the types involved and boxing/unboxing needs to be considered.
I've got a generic list that looks like this:
List<PicInfo> pi = new List<PicInfo>();
PicInfo is a class that looks like this:
[ProtoContract]
public class PicInfo
{
[ProtoMember(1)]
public string fileName { get; set; }
[ProtoMember(2)]
public string completeFileName { get; set; }
[ProtoMember(3)]
public string filePath { get; set; }
[ProtoMember(4)]
public byte[] hashValue { get; set; }
public PicInfo() { }
}
What I'm trying to do is:
first, filter the list for duplicate file names and return the duplicate objects;
then, filter the returned list for duplicate hash values.
I can only find examples of how to do this that return anonymous types, but I need a generic list.
If someone can help me out, I'd appreciate it. Also, please explain your code; it's a learning process for me.
Thanks in advance!
[EDIT]
The generic list contains a list of objects. These objects are pictures. Every picture has a file name and a hash value (and some more data which is irrelevant at this point). Some pictures have the same name (duplicate file names), and I want to get a list of the duplicate file names from this generic list 'pi'.
But those pictures also have a hash value. From the file names that are identical, I want another list of those identical file names that also have identical hash values.
[/EDIT]
Something like this should work. Whether it is the best method I am not sure. It is not very efficient because for each element you are iterating through the list again to get the count.
List<PicInfo> pi = new List<PicInfo>();
IEnumerable<PicInfo> filt = pi.Where(x => pi.Count(z => z.fileName == x.fileName) > 1);
I hope the code isn't too complicated to need explaining. I always think it's best to work it out on your own anyway, but if anything is confusing then just ask and I'll explain.
If you want the second filter to be filtering for the same filename and same hash being a duplicate then you just need to extend the lambda in the Count to check against hash too.
Obviously if you just want filenames at the end then it is easy enough to do a Select to get just an enumerable list of those filenames, possibly with a Distinct if you only want them to appear once.
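For instance, something like this sketch would boil the matches down to just the duplicate file names, each appearing once:
// Project the duplicates down to their names, removing repeats.
List<string> duplicateNames = filt.Select(x => x.fileName)
                                  .Distinct()
                                  .ToList();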
NB. Code written by hand so do forgive typos. May not compile first time, etc. ;-)
Edit to explain code - spoilers! ;-)
In english what we want to do is the following:
for each item in the list we want to select it if and only if there is more than one item in the list with the same filename.
Breaking this down to iterate over the list and select things based on a criteria we use the Where method. The condition of our where method is
there is more than one item in the list with the same filename
for this we clearly need to count the list so we use pi.Count. However we have a condition that we are only counting if the filename matches so we pass in an expression to tell it only to count those things.
The expression will work on each item of the list and return true if we want to count it and false if we don't want to.
The filename we are interested in is on x, the item we are filtering. So we want to count how many items have a filename the same as x.fileName. Thus our expression is z => z.fileName == x.fileName. So z is our variable in this expression, and x.fileName in this context is unchanging as we iterate over z.
We then of course put in our criterion of >1 to get the boolean value we want.
If you wanted those that are duplicates when considering both the filename and hashValue, then you would expand the part in the Count to compare the hashValue as well. Since this is an array, you will want to use the SequenceEqual method to compare them value by value.
So your final code to get the duplicates on both values would be:
List<PicInfo> pi = new List<PicInfo>();
List<PicInfo> filt = pi.Where(x => pi.Count(z => z.fileName == x.fileName && z.hashValue.SequenceEqual(x.hashValue)) > 1).ToList();
Note that I didn't create the intermediary list and just went straight from the original list. You could go from the intermediate list but the code would be much the same if going from the original as from a filtered list.
I think you have to use the SequenceEqual method for finding duplicates
(http://msdn.microsoft.com/ru-ru/library/bb348567.aspx).
For the filter, use:
var p = pi.GroupBy(rs => rs.fileName) // group by name
.Where(rs => rs.Count() > 1) // find group whose count greater than 1
.Select(rs => rs.First()) // select 1st element from each group
.GroupBy(rs => rs.hashValue) // now group by hash value
.Where(rs => rs.Count() > 1) // find group has multiple values
.Select(rs => rs.First()) // select first element from group
       .ToList<PicInfo>();               // make the list of PicInfo from the result
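One caveat worth noting: hashValue is a byte[], and GroupBy compares arrays by reference by default, so the second grouping may not behave as intended. Passing a comparer makes it compare contents; this comparer is a sketch, not part of the original answer.
class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SequenceEqual(y); // element-by-element comparison
    }

    public int GetHashCode(byte[] a)
    {
        if (a == null) return 0;
        int h = 17;
        foreach (byte b in a) h = h * 31 + b;
        return h;
    }
}

// Usage: .GroupBy(rs => rs.hashValue, new ByteArrayComparer())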
How can I store multiple values of a large set to be able to find them quickly with a lambda expression based on a property with non-unique values?
Sample case (not optimized for performance):
class Product
{
public string Title { get; set; }
public int Price { get; set; }
public string Description { get; set; }
}
IList<Product> products = this.LoadProducts();
var q1 = products.Where(c => c.Title == "Hello"); // 1 product.
var q2 = products.Where(c => c.Title == "Sample"); // 5 products.
var q3 = products.Where(c => string.IsNullOrEmpty(c.Title)); // 12 345 products.
If title was unique, it would be easy to optimize performance by using IDictionary or HashSet. But what about the case where the values are not unique?
The simplest solution is to use a dictionary of collections of Product. The easiest way to get one is a Lookup:
var products = this.LoadProducts().ToLookup(p => p.Title);
var example1 = products["Hello"]; // 1 product
var example2 = products["Sample"]; // 5 products
Your third example is a little harder, but you could use ApplyResultSelector() for that.
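One option for that third case (an assumption on my part, not something the Lookup API does for you) is to normalize the key while building the lookup, so that null and empty titles land in a single group:
var byTitle = this.LoadProducts().ToLookup(p => p.Title ?? string.Empty);
var untitled = byTitle[string.Empty]; // 12 345 products (null or empty titles)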
What you need is the ability to run indexed queries in LINQ. (same as we do in SQL)
There is a library called i4o which apparently can solve your problem:
http://i4o.codeplex.com/
from their website:
i4o (index for objects) is the first class library that extends LINQ
to allow you to put indexes on your objects. Using i4o, the speed of
LINQ operations are often over one thousand times faster than without
i4o.
i4o works by allowing the developer to specify an
IndexSpecification for any class, and then using the
IndexableCollection to implement a collection of that class that
will use the index specification, rather than sequential search, when
doing LINQ operations that can benefit from indexing.
also the following provides an example of how to use i4o:
http://www.hookedonlinq.com/i4o.ashx
To make it short, you need to:
Add [Indexable()] attribute to your "Title" property
Use IndexableCollection<Product> as your data source.
From this point, any LINQ query that uses an indexable field will use the index rather than doing a sequential search, resulting in order-of-magnitude performance increases for queries using the index.
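A sketch of what that looks like, based purely on the description above; the exact i4o API (attribute placement, the ToIndexableCollection() extension) is an assumption and may differ by version.
public class Product
{
    [Indexable()] // index the property queries filter on
    public string Title { get; set; }
    public int Price { get; set; }
    public string Description { get; set; }
}

// Wrap the data so queries on Title can hit the index instead of scanning:
// var products = this.LoadProducts().ToIndexableCollection();
// var q1 = products.Where(c => c.Title == "Hello"); // index lookup, not a scan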