Split list of objects - c#

So, here is my code:
private List<IEnumerable<Row>> Split(IEnumerable<Row> rows, IEnumerable<DateTimePeriod> periods)
{
    List<IEnumerable<Row>> result = new List<IEnumerable<Row>>();
    foreach (var period in periods)
    {
        result.Add(rows.Where(row => row.Date >= period.begin && row.Date <= period.end));
    }
    return result;
}

private class DateTimePeriod
{
    public DateTime begin { get; set; }
    public DateTime end { get; set; }
}
As you can see, this code is not the best: it iterates through all rows for each period.
I need advice on how to optimize this code. Maybe there are suitable Enumerable methods for this?
Update: all rows and periods are ordered by date, and every row always falls into one of these periods.

A faster method would be to perform a join on the two structures, however Linq only supports equi-joins (joins where two expressions are equal). In your case you are joining on one value being in a range of values, so an equi-join is not possible.
Before starting to optimize, make sure it needs to be optimized. Would your program be significantly faster if this function were faster? How much of your app time is spent in this function?
If optimization wouldn't benefit the program overall, then don't worry about it - make sure it works and then focus on other features of the program.
That said, since you say the rows and periods are already sorted by date, you might get some performance benefit by using loops, looping through the rows until you're out of the current period, then moving to the next period. At least that way you don't enumerate rows (or periods) multiple times.
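For example, a hedged sketch of that single-pass idea, reusing the Row and DateTimePeriod types from the question (SplitSorted is just an illustrative name):
// Single pass over rows and periods, relying on both being sorted by date
// and on every row falling inside one of the periods (per the update).
private List<IEnumerable<Row>> SplitSorted(IEnumerable<Row> rows, IEnumerable<DateTimePeriod> periods)
{
    var result = new List<IEnumerable<Row>>();
    using (var e = rows.GetEnumerator())
    {
        bool hasRow = e.MoveNext();
        foreach (var period in periods)
        {
            var bucket = new List<Row>();
            // Consume rows until we move past the end of the current period.
            while (hasRow && e.Current.Date <= period.end)
            {
                if (e.Current.Date >= period.begin)
                    bucket.Add(e.Current);
                hasRow = e.MoveNext();
            }
            result.Add(bucket);
        }
    }
    return result;
}
This visits each row and each period once, so it is roughly O(n + m) instead of O(n x m).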

There is a little problem in your code: rows is an IEnumerable, so it may be enumerated multiple times inside the foreach. It's a good idea to materialize it into something more stable, like an array, outside of the foreach:
var myRows = rows as Row[] ?? rows.ToArray();
By the way, I changed your code to the following, using ReSharper:
var myRows = rows as Row[] ?? rows.ToArray();
return periods.Select(period => myRows.Where(row => row.Date >= period.begin && row.Date <= period.end)).ToList();

Your best chance to optimize an O(n x m) algorithm is to transform it into multiple consecutive O(n) operations. In order to gain time you must trade off space, so creating a lookup table based on the data in one of your enumerables may help in this case.
For example you can construct an int array that will have a value set for each day that belongs to a period (each period has another known hardcoded value). This would be your first O(n) loop. Then you do another O(m) loop and only check if the array position corresponding to row.Date is non zero (then you look up the actual value among the hardcoded ones and you get the actual Period).
Anyway, this is more of a general idea and implementation is important. If n and m are very small you might not get any benefit, but if they are large (huge) I can bet that the Split method will run faster.
Assuming that everything you work with is already in memory (no EF involved).
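As a rough illustration of that idea (a sketch only: it uses a dictionary keyed by calendar day rather than a raw int array, assumes period boundaries fall on whole days, and SplitWithLookup is just an illustrative name):
// Build a day -> period-index lookup (one entry per covered day),
// then assign each row with a single dictionary lookup.
private List<List<Row>> SplitWithLookup(IEnumerable<Row> rows, IList<DateTimePeriod> periods)
{
    var dayToPeriod = new Dictionary<DateTime, int>();
    for (int i = 0; i < periods.Count; i++)
    {
        for (var day = periods[i].begin.Date; day <= periods[i].end.Date; day = day.AddDays(1))
            dayToPeriod[day] = i;
    }

    var result = new List<List<Row>>();
    for (int i = 0; i < periods.Count; i++)
        result.Add(new List<Row>());

    foreach (var row in rows)
    {
        int index;
        if (dayToPeriod.TryGetValue(row.Date.Date, out index))
            result[index].Add(row);
    }
    return result;
}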

Iterate over strings that ".StartsWith" without using LINQ

I'm building a custom textbox to enable mentioning people in a social media context. This means that I detect when somebody types "#" and search a list of contacts for the string that follows the "#" sign.
The easiest way would be to use LINQ, with something along the lines of Members.Where(x => x.Username.StartsWith(str)). The problem is that the number of potential results can be extremely high (up to around 50,000), and performance is extremely important in this context.
What alternative solutions do I have? Is there anything similar to a dictionary (a hashtable-based solution) that would allow me to use Key.StartsWith without iterating over every single entry? If not, what would be the fastest and most efficient way to achieve this?
Do you have to show a dropdown of 50000? If you can limit your dropdown, you can for example just display the first 10.
var filteredMembers = new List<MemberClass>();
foreach (var member in Members)
{
    if (member.Username.StartsWith(str)) filteredMembers.Add(member);
    if (filteredMembers.Count >= 10) break;
}
Alternatively:
You can try storing all your members' usernames in a trie, in addition to your collection. That should give you better performance than looping through all 50,000 elements.
Assuming your usernames are unique, you can store your member information in a dictionary and use the usernames as the key.
This is a tradeoff of memory for performance of course.
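If you go the trie route mentioned above, a minimal prefix-trie sketch might look like this (purely illustrative, not taken from any library):
// Minimal prefix trie for username lookups. Add() inserts a username,
// StartsWith() returns up to maxHits usernames beginning with a prefix.
public class PrefixTrie
{
    private class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public string Word; // non-null when a username ends at this node
    }

    private readonly Node _root = new Node();

    public void Add(string username)
    {
        var node = _root;
        foreach (var c in username)
        {
            Node child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new Node();
                node.Children.Add(c, child);
            }
            node = child;
        }
        node.Word = username;
    }

    public List<string> StartsWith(string prefix, int maxHits)
    {
        var results = new List<string>();
        var node = _root;
        foreach (var c in prefix)
        {
            if (!node.Children.TryGetValue(c, out node)) return results; // no name has this prefix
        }
        Collect(node, results, maxHits);
        return results;
    }

    private static void Collect(Node node, List<string> results, int maxHits)
    {
        if (results.Count >= maxHits) return;
        if (node.Word != null) results.Add(node.Word);
        foreach (var child in node.Children.Values)
            Collect(child, results, maxHits);
    }
}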
It is not really clear where the data is stored in the first place. Are all the names in memory or in a database?
In case you store them in database, you can just use the StartsWith approach in the ORM, which would translate to a LIKE query on the DB, which would just do its job. If you enable full text on the column, you could improve the performance even more.
Now supposing all the names are already in memory. Remember the computer CPU is extremely fast so even looping through 50 000 entries takes just a few moments.
StartsWith method is optimized and it will return false as soon as it encounters a non-matching character. Finding the ones that actually match should be pretty fast. But you can still do better.
As others suggest, you could build a trie to store all the names and be able to search for matches pretty fast, but there is a disadvantage - building the trie requires you to read all the names and create the whole data structure, which is complex. Also you would be restricted to a given set of characters, and an unexpected character would have to be dealt with separately.
You can however group the names into "buckets". First start with the first character and create a dictionary with the character as a key and a list of names as the value. Now you have effectively narrowed every following search approximately 26 times (supposing the English alphabet). But you don't have to stop there - you can perform this on another level, for the second character in each group. And then the third, and so on.
With each level you are effectively narrowing each group significantly and the search will be much faster afterwards. But there is of course the up-front cost of building the data structure, so you always have to find the right trade-off for you. More work up-front = faster search, less work = slower search.
Finally, when the user types, with each new letter she narrows the target group. Hence, you can always maintain the set of relevant names for the current input and cut it down with each successive keystroke. This will prevent you from having to go from the beginning each time and will improve the efficiency significantly.
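As a rough sketch of that first-character bucketing (one level only, reusing the Members collection and MemberClass type from the snippet above, and assuming str is non-empty):
// Build once: group usernames by their first character so each search
// only scans one bucket instead of all 50,000 members.
var buckets = Members
    .GroupBy(m => char.ToUpperInvariant(m.Username[0]))
    .ToDictionary(g => g.Key, g => g.ToList());

// On each keystroke (str is the text typed after the "#"):
List<MemberClass> bucket;
IEnumerable<MemberClass> candidates = buckets.TryGetValue(char.ToUpperInvariant(str[0]), out bucket)
    ? bucket.Where(m => m.Username.StartsWith(str, StringComparison.OrdinalIgnoreCase))
    : Enumerable.Empty<MemberClass>();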
Use BinarySearch
This is a pretty normal case, assuming that the data are stored in-memory, and here is a pretty standard way to handle it.
Use a normal List<string>. You don't need a HashTable or a SortedList. However, an IEnumerable<string> won't work; it has to be a list.
Sort the list beforehand (using LINQ, e.g. OrderBy( s => s)), e.g. during initialization or when retrieving it. This is the key to the whole approach.
Find the index of the best match using BinarySearch. Because the list is sorted, a binary search can find the best match very quickly and without scanning the whole list like Select/Where might.
Take the first N entries after the found index. Optionally you can truncate the list if not all N entries are a decent match, e.g. if someone typed "AZ" and there are only one or two items before "BA."
Example:
public static IEnumerable<string> Find(List<string> list, string firstFewLetters, int maxHits)
{
    var startIndex = list.BinarySearch(firstFewLetters);

    //If negative, there's no exact match. The bitwise complement gives the index of the next larger element (the closest match).
    if (startIndex < 0)
    {
        startIndex = ~startIndex;
    }

    //Take maxHits items, or go till end of list
    var endIndex = Math.Min(startIndex + maxHits - 1, list.Count - 1);

    //Enumerate matching items
    for (int i = startIndex; i <= endIndex; i++)
    {
        var s = list[i];
        if (!s.StartsWith(firstFewLetters)) break; //This line is optional
        yield return s;
    }
}

Is it wise to create a variable to avoid calling Count() several times?

Many times I find myself in front of code like this:
var maybe = 'some random linq query'
int maybeCount = maybe.Count();
List<KeyValuePair<int, Customer>> lst2scan = maybe.Take(8).ToList();
for (int k = 0; k < 8; k++)
{
if (k + 1 <= maybeCount) custlst[k].Customer = lst2scan[k].Value;
else custlst[k].Customer = new Customer();
}
Each time I see code like this, I ask myself: should I create a variable so the loop doesn't recompute Count()? Maybe for a single Count() call in the loop it's pointless.
Does anybody have advice on the "right way" to code this in a loop? Do you decide case by case?
In this case, is my maybeCount variable useless?
Do you know whether Count() counts the items each time, or whether it just returns the value of a stored count?
Thanks for any advice to improve my knowledge.
If the count is guaranteed not to change, then yes, this will avoid calculating the count multiple times; this becomes more important when the count comes from a method that can take some time to execute.
If the count does change then this can result in erroneous results being generated.
In answer to your questions.
The way you have done it looks pretty reasonable
As mentioned, your maybeCount variable means that you won't have to repeat your LINQ query on each iteration.
The count will be determined each time you call it; it has no memory of what the count is.
Note: I am talking about the Enumerable.Count method here. List<T>.Count is a property that just returns a stored value.
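A tiny illustration of that difference (source here is just a stand-in for any sequence):
IEnumerable<int> query = source.Where(x => x > 0); // deferred - each Count() call re-enumerates
int maybeCount = query.Count();                    // enumerate once and cache the result

List<int> list = query.ToList();                   // or materialize once...
int listCount = list.Count;                        // ...then Count is a stored property, O(1)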
Well, it depends ;)
I wouldn't use a loop. I would do something like this:
var maybe = 'some random linq query'
// if necessary clear list
custlst.Clear();
custlst.AddRange(maybe.Take(8).Select(p => p.Value));
custlst.AddRange(Enumerable.Range(0, 8 - custlst.Count).Select(i => new Customer()));
In this case, yes.
It depends whether the underlying type is a List (a collection, in fact) or a simple enumerable. The implementation of Count will look for a precomputed Count value and use it. So, if you're cycling a list, the variable is almost useless (still a little faster, but I don't think it's relevant by any means); if you're cycling a LINQ query, a variable is recommended, because Count() is going to cycle the entire enumeration.
Just my 2c.
Edit: for reference, you can find the source for .Count() at https://github.com/dotnet/corefx/blob/master/src/System.Linq/src/System/Linq/Enumerable.cs (around line 1489); as you see, it checks for ICollection, and uses the precomputed Count property.
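In outline, a simplified sketch (not the exact framework source) of what Count() does:
public static int Count<TSource>(this IEnumerable<TSource> source)
{
    if (source == null) throw new ArgumentNullException("source");

    // Collections already know their size, so use the stored value.
    var collection = source as ICollection<TSource>;
    if (collection != null) return collection.Count;

    // Anything else has to be enumerated from start to finish.
    int count = 0;
    using (var e = source.GetEnumerator())
    {
        checked
        {
            while (e.MoveNext()) count++;
        }
    }
    return count;
}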

Looping through two collections - performance and optimization possibilities

This is probably a very common problem which has a lot of answers. I was not able to get to an answer because I am not very sure how to search for it.
I have two collections of objects - both come from the database, and in some cases those collections are of the same object type. Further, I need to do some operations for every combination of those collections. So, for example:
foreach(var a in collection1){
foreach(var b in collection2){
if(a.Name == b.Name && a.Value != b.Value)
//do something with this combination
else
//do something else
}
}
This is very inefficient and it gets slower based on the number of objects in both collections.
What is the best way to solve this type of problems?
EDIT:
I am using .NET 4 at the moment so I am also interested in suggestions using Parallelism to speed that up.
EDIT 2:
I have added above an example of the business rules that need to be performed on each combination of objects. However, the business rules defined in the example can vary.
EDIT 3:
For example, inside the loop the following will be done:
If the business rules are satisfied (see above) a record will be created in the database with a reference to object A and object B. This is one of the operations that I need to do. (Operations will be configurable from child classes using this class).
If you really have to process every item in list b for each item in list a, then it's going to take time proportional to a.Count * b.Count. There's nothing you can do to prevent it. Adding parallel processing will give you a linear speedup, but that's not going to make a dent in the processing time if the lists are even moderately large.
How large are these lists? Do you really have to check every combination of a and b? Can you give us some more information about the problem you're trying to solve? I suspect that there's a way to bring a more efficient algorithm to bear, which would reduce your processing time by orders of magnitude.
Edit after more info posted
I know that the example you posted is just an example, but it shows that you can find a better algorithm for at least some of your cases. In this particular example, you could sort a and b by name, and then do a straight merge. Or, you could sort b into an array or list, and use binary search to look up the names. Either of those two options would perform much better than your nested loops. So much better, in fact, that you probably wouldn't need to bother with parallelizing things.
Look at the numbers. If your a has 4,000 items in it and b has 100,000 items in it, your nested loop will do 400 million comparisons (a.Count * b.Count). But sorting is only n log n, and the merge is linear. So sorting and then merging will be approximately (a.Count * 12) + (b.Count * 17) + a.Count + b.Count, or in the neighborhood of 2 million comparisons. So that's approximately 200 times faster.
Compare that to what you can do with parallel processing: only a linear speedup. If you have four cores and you get a pure linear speedup, you'll only cut your time by a factor of four. The better algorithm cut the time by a factor of 200, with a single thread.
You just need to find better algorithms.
LINQ might also provide a good solution. I'm not an expert with LINQ, but it seems like it should be able to make quick work of something like this.
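For instance, here is a hedged sketch of one such approach, using a LINQ lookup keyed on the Name member from the example (only the matching branch is shown):
// Build a lookup on collection2 once (O(m)), then probe it for each item in
// collection1 (O(n)) instead of scanning collection2 every time.
var byName = collection2.ToLookup(b => b.Name);

foreach (var a in collection1)
{
    foreach (var b in byName[a.Name])   // only items whose Name matches a.Name
    {
        if (a.Value != b.Value)
        {
            // do something with this combination
        }
        // non-matching combinations are simply skipped here
    }
}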
If you need to check all the variants one by one, you can't do anything better. BUT you can parallelize the loops. For example, if you are using C# 4.0 you can use a parallel foreach loop.
You can find an example here... http://msdn.microsoft.com/en-us/library/dd460720.aspx
foreach(var a in collection1){
Parallel.ForEach(collection2, b =>
{
//do something with a and b
} //close lambda expression
);
}
In the same way you can parallelize the first loop as well.
First of all, there is presumably a reason you are searching for a value from the first collection in the second collection.
For example, if you want to know whether a value exists in the second collection, you should put the second collection in a HashSet; this will allow you to do a fast lookup. Creating the HashSet and accessing it is roughly O(1) per lookup versus O(n) for looping over the collection.
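A minimal sketch of that idea, assuming you only need to know whether a matching Name exists in collection2:
// One-time O(m) setup: put the names from collection2 into a HashSet.
var namesInB = new HashSet<string>(collection2.Select(b => b.Name));

// O(n) scan of collection1 with O(1) membership checks.
foreach (var a in collection1)
{
    if (namesInB.Contains(a.Name))
    {
        // do something with this combination
    }
    else
    {
        // do something else
    }
}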
Parallel.ForEach(a, currentA => Parallel.ForEach(b, currentB =>
{
// do something with currentA and currentB
}));

Remove duplicates from array of struct

I can't figure out how to remove duplicate entries from an array of structs.
I have this struct:
public struct stAppInfo
{
public string sTitle;
public string sRelativePath;
public string sCmdLine;
public bool bFindInstalled;
public string sFindTitle;
public string sFindVersion;
public bool bChecked;
}
I have changed the stAppInfo struct to a class here, thanks to Jon Skeet.
The code is like this: (short version)
stAppInfo[] appInfo = new stAppInfo[listView1.Items.Count];
int i = 0;
foreach (ListViewItem item in listView1.Items)
{
appInfo[i].sTitle = item.Text;
appInfo[i].sRelativePath = item.SubItems[1].Text;
appInfo[i].sCmdLine = item.SubItems[2].Text;
appInfo[i].bFindInstalled = (item.SubItems[3].Text.Equals("Sí")) ? true : false;
appInfo[i].sFindTitle = item.SubItems[4].Text;
appInfo[i].sFindVersion = item.SubItems[5].Text;
appInfo[i].bChecked = (item.SubItems[6].Text.Equals("Sí")) ? true : false;
i++;
}
I need the appInfo array to be unique on the sTitle and sRelativePath members; the other members can be duplicated.
EDIT:
Thanks to all for the answers, but this application is "portable": I mean I just need the .exe file, and I don't want to add other files like referenced *.dll files, so please no external references. This app is intended to be used from a pen drive.
All data comes from a *.ini file. What I do is (pseudocode):
ReadFile()
FillDataFromFileInAppInfoArray()
DeleteDuplicates()
FillListViewControl()
When I want to save that data into a file I have these options:
Using ListView data
Using the appInfo array (is this faster?)
Any other?
EDIT2:
Big thanks to: Jon Skeet, Michael Hays thanks for your time guys!!
Firstly, please don't use mutable structs. They're a bad idea in all kinds of ways.
Secondly, please don't use public fields. Fields should be an implementation detail - use properties.
Thirdly, it's not at all clear to me that this should be a struct. It looks rather large, and not particularly "a single value".
Fourthly, please follow the .NET naming conventions so your code fits in with all the rest of the code written in .NET.
Fifthly, you can't remove items from an array, as arrays are created with a fixed size... but you can create a new array with only unique elements.
LINQ to Objects will let you do that already using GroupBy as shown by Albin, but a slightly neater (in my view) approach is to use DistinctBy from MoreLINQ:
var unique = appInfo.DistinctBy(x => new { x.sTitle, x.sRelativePath })
.ToArray();
This is generally more efficient than GroupBy, and also more elegant in my view.
Personally I generally prefer using List<T> over arrays, but the above will create an array for you.
Note that with this code there can still be two items with the same title, and there can still be two items with the same relative path - there just can't be two items with the same relative path and title. If there are duplicate items, DistinctBy will always yield the first such item from the input sequence.
EDIT: Just to satisfy Michael, you don't actually need to create an array to start with, or create an array afterwards if you don't need it:
var query = listView1.Items
.Cast<ListViewItem>()
.Select(item => new stAppInfo
{
sTitle = item.Text,
sRelativePath = item.SubItems[1].Text,
bFindInstalled = item.SubItems[3].Text == "Sí",
sFindTitle = item.SubItems[4].Text,
sFindVersion = item.SubItems[5].Text,
bChecked = item.SubItems[6].Text == "Sí"
})
.DistinctBy(x => new { x.sTitle, x.sRelativePath });
That will give you an IEnumerable<stAppInfo> which is lazily streamed. Note that if you iterate over it more than once, however, it will iterate over listView1.Items the same number of times, performing the same uniqueness comparisons each time.
I prefer this approach over Michael's as it makes the "distinct by" columns very clear in semantic meaning, and removes the repetition of the code used to extract those columns from a ListViewItem. Yes, it involves building more objects, but I prefer clarity over efficiency until benchmarking has proved that the more efficient code is actually required.
What you need is a Set. It ensures that the items entered into it are unique (based on some qualifier which you will set up). Here is how it is done:
First, change your struct to a class. There is really no getting around that.
Second, provide an implementation of IEqualityComparer<stAppInfo>. It may be a hassle, but it is the thing that makes your set work (which we'll see in a moment):
public class AppInfoComparer : IEqualityComparer<stAppInfo>
{
    public bool Equals(stAppInfo x, stAppInfo y) {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return Equals(x.sTitle, y.sTitle) && Equals(x.sRelativePath, y.sRelativePath);
    }

    // this part is a pain, but this one is already written
    // specifically for your question.
    public int GetHashCode(stAppInfo obj) {
        unchecked {
            return ((obj.sTitle != null ? obj.sTitle.GetHashCode() : 0) * 397)
                ^ (obj.sRelativePath != null ? obj.sRelativePath.GetHashCode() : 0);
        }
    }
}
Then, when it is time to make your set, do this:
var appInfoSet = new HashSet<stAppInfo>(new AppInfoComparer());
foreach (ListViewItem item in listView1.Items)
{
var newItem = new stAppInfo {
sTitle = item.Text,
sRelativePath = item.SubItems[1].Text,
sCmdLine = item.SubItems[2].Text,
bFindInstalled = (item.SubItems[3].Text.Equals("Sí")) ? true : false,
sFindTitle = item.SubItems[4].Text,
sFindVersion = item.SubItems[5].Text,
bChecked = (item.SubItems[6].Text.Equals("Sí")) ? true : false};
appInfoSet.Add(newItem);
}
appInfoSet now contains a collection of stAppInfo objects with unique Title/Path combinations, as per your requirement. If you must have an array, do this:
stAppInfo[] appInfo = appInfoSet.ToArray();
Note: I chose this implementation because it looks like the way you are already doing things. It has an easy-to-read loop (though I do not need the counter variable). It does not involve LINQ (which can be troublesome if you aren't familiar with it). It requires no external libraries outside of what the .NET Framework provides to you. And finally, it provides an array just like you've asked. As for reading the file in from an INI file, hopefully you see that the only thing that will change is your foreach loop.
Update
Hash codes can be a pain. You might have been wondering why you need to compute them at all. After all, couldn't you just compare the values of the title and relative path after each insert? Well sure, of course you could, and that's exactly how another set, called SortedSet works. SortedSet makes you implement IComparer in the same way that I implemented IEqualityComparer above.
So, in this case, AppInfoComparer would look like this:
private class AppInfoComparer : IComparer<stAppInfo>
{
    // return -1 if x < y, 1 if x > y, or 0 if they are equal
    public int Compare(stAppInfo x, stAppInfo y)
    {
        var comparison = x.sTitle.CompareTo(y.sTitle);
        if (comparison != 0) return comparison;
        return x.sRelativePath.CompareTo(y.sRelativePath);
    }
}
And then the only other change you need to make is to use SortedSet instead of HashSet:
var appInfoSet = new SortedSet<stAppInfo>(new AppInfoComparer());
It's so much easier in fact, that you are probably wondering what gives? The reason that most people choose HashSet over SortedSet is performance. But you should balance that with how much you actually care, since you'll be maintaining that code. I personally use a tool called Resharper, which is available for Visual Studio, and it computes these hash functions for me, because I think computing them is a pain, too.
(I'll talk about the complexity of the two approaches, but if you already know it, or are not interested, feel free to skip it.)
SortedSet has a complexity of O(log n): each time you enter a new item, it will effectively go to the halfway point of your set and compare. If it doesn't find your entry, it will go to the halfway point between its last guess and the group to the left or right of that guess, quickly whittling down the places for your element to hide. For a million entries, this takes about 20 attempts. Not bad at all.
But, if you've chosen a good hashing function, then HashSet can do the same job, on average, in one comparison, which is O(1). And before you think 20 is not really that big a deal compared to 1 (after all, computers are pretty quick), remember that you had to insert those million items, so while HashSet took about a million attempts to build that set up, SortedSet took several million attempts. But there is a price: HashSet breaks down (very badly) if you choose a poor hashing function. If the hash codes for lots of items are not unique, then those items will collide in the HashSet, which will then have to try again and again. If lots of items collide with the exact same number, then they will retrace each other's steps, and you will be waiting a long time. By the time you reach the millionth entry, the total work is on the order of a million times a million attempts; HashSet has devolved into O(n^2).
What's important with those big-O notations (which is what O(1), O(log n), and O(n^2) are, in fact) is how quickly the number in parentheses grows as you increase n. Slow growth or no growth is best. Quick growth is sometimes unavoidable. For a dozen or even a hundred items, the difference may be negligible, but if you can get into the habit of programming efficient functions as easily as the alternatives, then it's worth conditioning yourself to do so, as problems are cheapest to correct closest to the point where you created them.
Use LINQ2Objects, group by the things that should be unique and then select the first item in each group.
var noDupes = appInfo.GroupBy(
x => new { x.sTitle, x.sRelativePath })
.Select(g => g.First()).ToArray();
!!! Array of structs (value type) + sorting or any kind of search ==> a lot of unboxing operations.
I would suggest to stick with recommendations of Jon and Henk, so make it as a class and use generic List<T>.
Use LINQ GroupBy or DistinctBy; to me it is simpler to use the built-in GroupBy, but it is also interesting to take a look at another popular library, perhaps it gives you some insights.
BTW, also take a look at the LambdaComparer; it will make your life easier each time you need this kind of in-place sorting, searching, etc.

What .NET collection provides the fastest search

I have 60k items that need to be checked against a 20k lookup list. Is there a collection object (like List, HashTable) that provides an exceptionally fast Contains() method? Or will I have to write my own? In other words, does the default Contains() method just scan each item, or does it use a better search algorithm?
foreach (Record item in LargeCollection)
{
if (LookupCollection.Contains(item.Key))
{
// Do something
}
}
Note. The lookup list is already sorted.
In the most general case, consider System.Collections.Generic.HashSet as your default "Contains" workhorse data structure, because it takes constant time to evaluate Contains.
The actual answer to "What is the fastest searchable collection" depends on your specific data size, ordered-ness, cost-of-hashing, and search frequency.
If you don't need ordering, try HashSet<Record> (new to .Net 3.5)
If you do, use a List<Record> and call BinarySearch.
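A hedged sketch of both options, reusing the names from the question's snippet and assuming item.Key and the lookup entries are strings:
// Option 1: HashSet - O(1) average lookups, no ordering required.
var lookupSet = new HashSet<string>(LookupCollection);
foreach (Record item in LargeCollection)
{
    if (lookupSet.Contains(item.Key))
    {
        // Do something
    }
}

// Option 2: sorted List + BinarySearch - O(log n) lookups, no hashing.
var lookupList = LookupCollection.ToList();
lookupList.Sort(); // the question says the lookup list is already sorted, so this may be redundant
foreach (Record item in LargeCollection)
{
    if (lookupList.BinarySearch(item.Key) >= 0)
    {
        // Do something
    }
}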
Have you considered List.BinarySearch(item)?
You said that your large collection is already sorted so this seems like the perfect opportunity? A hash would definitely be the fastest, but this brings about its own problems and requires a lot more overhead for storage.
You should read this blog that speed tested several different types of collections and methods for each using both single and multi-threaded techniques.
According to the results, a BinarySearch on a List and a SortedList were the top performers, consistently running neck and neck when looking up something as a "value".
When using a collection that allows for "keys", the Dictionary, ConcurrentDictionary, Hashset, and HashTables performed the best overall.
I've put a test together:
First - 3 chars with all of the possible combinations of A-Z0-9
Fill each of the collections mentioned here with those strings
Finally - search and time each collection for a random string (same string for each collection).
This test simulates a lookup when there is guaranteed to be a result.
Then I changed the initial collection from all possible combinations to only 10,000 random 3 character combinations, this should induce a 1 in 4.6 hit rate of a random 3 char lookup, thus this is a test where there isn't guaranteed to be a result, and ran the test again:
IMHO HashTable, although fastest, isn't always the most convenient, since you're working with objects. But a HashSet is so close behind that it's probably the one to recommend.
Just for fun (you know, FUN) I ran it with 1.68M rows (4 characters).
Keep both lists x and y in sorted order.
If x = y, do your action, if x < y, advance x, if y < x, advance y until either list is empty.
The run time of this intersection is proportional to min(size(x), size(y)).
Don't run a .Contains() loop; this is proportional to x * y, which is much worse.
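A minimal sketch of that merge, assuming x and y are List<string> instances sorted with an ordinal comparison:
// Walk both sorted lists in lockstep; advance whichever side is behind.
int i = 0, j = 0;
while (i < x.Count && j < y.Count)
{
    int cmp = string.CompareOrdinal(x[i], y[j]);
    if (cmp == 0)
    {
        // Do your action for the matching pair, then advance both sides.
        i++;
        j++;
    }
    else if (cmp < 0)
    {
        i++; // x is behind, advance x
    }
    else
    {
        j++; // y is behind, advance y
    }
}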
If it's possible to sort your items then there is a much faster way to do this than doing key lookups into a hashtable or b-tree. Though if your items aren't sortable you can't really put them into a b-tree anyway.
Anyway, if they are sortable, sort both lists; then it's just a matter of walking the lookup list in order.
Walk lookup list
While items in check list <= lookup list item
if check list item = lookup list item do something
Move to next lookup list item
If you're using .Net 3.5, you can make cleaner code using:
foreach (Record item in LookupCollection.Intersect(LargeCollection))
{
//dostuff
}
I don't have .Net 3.5 here and so this is untested. It relies on an extension method. Note that LookupCollection.Intersect(LargeCollection) is probably not the same as LargeCollection.Intersect(LookupCollection) ... the latter is probably much slower.
This assumes LookupCollection is a HashSet
If you aren't worried about squeezing out every single last bit of performance, the suggestion to use a HashSet or binary search is solid. Your datasets just aren't large enough that this is going to be a problem 99% of the time.
But if this just one of thousands of times you are going to do this and performance is critical (and proven to be unacceptable using HashSet/binary search), you could certainly write your own algorithm that walked the sorted lists doing comparisons as you went. Each list would be walked at most once and in the pathological cases wouldn't be bad (once you went this route you'd probably find that the comparison, assuming it's a string or other non-integral value, would be the real expense and that optimizing that would be the next step).
