This is a real performance problem.
public int FindPreviousFC(int framecode)
{
    if (SetTable == null)
        throw new NullReferenceException("Log_Roadb not loaded.");

    int previousFrameCode = 0;
    // Linear scan: compare every row's framecode until the requested one is found.
    for (int i = 1; i < SetTable.Rows.Count; i++)   // start at 1: row 0 has no previous framecode
    {
        if (framecode == Convert.ToInt32(SetTable.Rows[i][0]))
        {
            previousFrameCode = Convert.ToInt32(SetTable.Rows[i - 1][0]);
            break;
        }
    }
    return previousFrameCode;
}
If the data in SetTable is ordered on framecode, then you can use a binary search through the data structure to reduce the number of lookups.
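For example, a minimal sketch of that, assuming SetTable is sorted ascending on the integer framecode in column 0 and reusing the question's member and method names:

// Sketch only: requires SetTable to be sorted ascending on column 0.
public int FindPreviousFC(int framecode)
{
    // Extract the framecodes once; cache this array if the method is called often.
    int[] codes = SetTable.Rows
        .Cast<DataRow>()                                 // needs System.Linq and System.Data
        .Select(r => Convert.ToInt32(r[0]))
        .ToArray();

    int index = Array.BinarySearch(codes, framecode);    // O(log n) instead of O(n)
    return index > 0 ? codes[index - 1] : 0;             // 0 = not found, or no previous row
}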
If there are no patterns in the data that you can exploit, optimizing performance may become tricky. This assumes that you can't export the data from SetTable into a structure where lookups are faster.
If this Find method is being called frequently on the same set of data, then you may also want to consider creating an index structure (dictionary) to speed up subsequent lookups. This may mitigate the cost of iterating over the same data over and over.
Also, as an aside, don't throw a NullReferenceException when you check SetTable for null; throw ArgumentNullException instead. Null reference exceptions are thrown by the CLR when a null reference is dereferenced ... they shouldn't be thrown by your own code.
You might get some improvement by exchanging the rows with the columns in the table. Getting elements sequentially from one row of the table is faster than getting every nth element, because sequential access makes better use of the CPU cache (fewer cache misses).
Most of your time is going to be spent converting text to integers. Since you say this is a time problem it sounds like you're calling this a lot--is there anything you can do to store the data as integers instead of strings?
Use a Dictionary:
Key -- SetTable.Rows[i][0]
Value -- SetTable.Rows[i-1][0]
Then when you get a framecode, just look it up in the dictionary. If it's there, return the value.
You can gain a little more efficiency by using Convert.ToInt32 on both the key and value before storing them in the Dictionary; then no further conversions are necessary.
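A minimal sketch of that idea, assuming SetTable is the question's DataTable with the framecode in column 0:

// Sketch only: build the index once (e.g. right after SetTable is loaded).
private Dictionary<int, int> previousByFrameCode;   // needs System.Collections.Generic

private void BuildFrameCodeIndex()
{
    previousByFrameCode = new Dictionary<int, int>();
    for (int i = 1; i < SetTable.Rows.Count; i++)           // row 0 has no previous framecode
    {
        int current = Convert.ToInt32(SetTable.Rows[i][0]);
        int previous = Convert.ToInt32(SetTable.Rows[i - 1][0]);
        previousByFrameCode[current] = previous;             // key: framecode, value: previous framecode
    }
}

public int FindPreviousFC(int framecode)
{
    int previous;
    previousByFrameCode.TryGetValue(framecode, out previous);   // leaves 0 when not found
    return previous;
}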
Assuming
(1) SetTable is a DataTable
(2) framecodeColumn is perhaps your column name
(3) framecodeColumn occurs at index 0 (first column)
try the following:
SetTable.Select("framecodeColumn < framecodeValuePassedToYourMethod", "framecodeColumn DESC")[0][0]
Basically, get an array of matching DataRows from the "Select()" method by passing a filter, sort the result in descending order, and the first row will be the row that you are looking for.
Of course, please protect this line with all the needed checks (for example, see whether there are indeed results that satisfy the filter condition by checking the length of the returned array, and so on).
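Put together with those checks, a sketch might look like this (the column name "framecodeColumn" is still just a placeholder):

// Sketch only: "framecodeColumn" is a placeholder for the real column name.
DataRow[] matches = SetTable.Select(
    "framecodeColumn < " + framecode,    // filter: rows with a smaller framecode
    "framecodeColumn DESC");             // sorted descending, so [0] is the closest one

int previousFrameCode = matches.Length > 0
    ? Convert.ToInt32(matches[0][0])
    : 0;                                 // nothing smaller found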
I have read several different sources over the years that indicate that when storing a collection of data, a List<T> is efficient when you want to insert objects, and an IEnumerable<T> is best for enumerating over a collection.
In LINQ-to-Entities, there is the AsEnumerable() function, which returns an IEnumerable<T>, but it will not resolve the SQL created by the LINQ statement until you start enumerating over the list.
What if I want to store objects from LINQ to Entities in a collection and then query on that collection later?
Using this strategy causes the SQL to be resolved by adding a WHERE clause and querying each record separately. I specifically don't want to do that because I'm trying to limit network chatter:
var myDataToLookup = context.MyData.AsEnumerable();
foreach (var myOtherDatum in myOtherDataList)
{
    // gets a single record from the database each time
    var myDatum = myDataToLookup.SingleOrDefault(w => w.key == myOtherDatum.key);
}
How do I resolve the SQL upfront so myDataToLookup actually contains the data in memory? I've tried ToArray:
var myDataToLookup = context.MyData.ToArray();
But I recently learned that it actually uses more memory than ToList does:
Is it better to call ToList() or ToArray() in LINQ queries?
Should I use a join instead?
var myCombinedData = from o in myOtherDataList
                     join d in myDataToLookup on o.key equals d.key
                     select new { myOtherData = o, myData = d };
Should I use ToDictionary and store my key as the key to the dictionary? Or am I worrying too much about this?
If you're using LINQ to Entities then you should not worry about whether ToArray is slower than ToList. There is almost no difference between them in terms of performance, and LINQ to Entities itself will be the bottleneck anyway.
Regarding a dictionary: it is a structure optimized for reads by key, though there is an additional cost when adding new items. So, if you will read by key a lot and add new items only occasionally, that's the way to go. But to be honest, you probably should not bother at all: unless the data set is large, you won't see a difference.
Think of IEnumerable, ICollection and IList/IDictionary as a hierarchy, each one inheriting from the previous one. Arrays add a level of restriction and complexity on top of lists. Simply put, IEnumerable gives you iteration only. ICollection adds counting, and IList then gives richer functionality, including finding, adding and removing elements by index or via lambda expressions. Dictionaries provide efficient access via a key. Arrays are much more static.
So, the answer then depends on your requirements. If it is appropriate to hold the data in memory and you need to frequently re-query it then I usually convert the Entity result to a List. This also loads the data.
If access via a set of keys is paramount then I use a Dictionary.
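For example, a rough sketch of that pattern using the names from the question (MyData as the entity type and key as its key property are assumptions based on the snippets above):

// Sketch only: materialise the data once, keyed for fast in-memory lookups.
// Assumes key values are unique, since the question uses SingleOrDefault.
var myDataByKey = context.MyData.ToDictionary(d => d.key);   // single round trip to the database

foreach (var myOtherDatum in myOtherDataList)
{
    if (myDataByKey.TryGetValue(myOtherDatum.key, out var myDatum))
    {
        // myDatum is already in memory; no per-record query is issued
    }
}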
I cannot remember the last time I used an array except for infrequent and very specific purposes.
So, not a direct answer, but as your question and the other replies indicate, there isn't a single answer and the solution will be a compromise.
When I code, I measure both performance and the amount of data carried over the network. Here is how I look at things based on your example above.
Let's say your result returns 100 records. Your code has now run a query on the server and performed 1 second of processing (I made the number up for the sake of argument).
Then you need to cast it to a list, which is going to be 1 more second of processing. Then you want to find all records that have a value of 1. The code will now loop through the entire list to find the values with 1 and then return you the result. That is, let's say, another 1 second of processing, and it finds 10 records.
Your network is going to carry 10 records that took 3 seconds to process.
If you move your logic to your data layer and make your query search right away for the records that you want, you can save 2 seconds of processing and still only carry 10 records across the network. The bonus is also that you can just use IEnumerable<T> as the result and not have to cast it to a list, thus eliminating the 1 second of casting to a list and the 1 second of iterating through the list.
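A sketch of that idea with the names from the original question (the MyData entity type and key property are assumptions):

// Sketch only: push the filter into the query so only matching rows cross the network.
var wantedKeys = myOtherDataList.Select(o => o.key).ToList();

IEnumerable<MyData> matches = context.MyData
    .Where(d => wantedKeys.Contains(d.key));   // translates to a single SQL query with an IN clause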
I hope this helps answer your question.
I have a set of objects of the same type in memory and each has multiple immutable int properties (but not only them).
I need to find an object there (or several) whose properties are in a small range near specified values, e.g. a == 5+-1 && b == 21+-2 && c == 9 && any d.
What's the best way to store objects so I can efficiently retrieve them like this?
I thought about making a SortedList for each property and using BinarySearch, but I have a lot of properties, so I would like a more generic way instead of so many SortedLists.
It's important that the set itself is not immutable: I need an ability to add/remove items.
Is there something like an in-memory DB for objects (not just data)?
Just to expand on #j_random_hacker's answer a bit: the usual approach to 'estimates of the selectivity' is to build a histogram for the index. But you might already intuitively know which criterion is going to yield the smallest initial result set out of "a == 5+-1 && b == 21+-2 && c == 9". Most likely it is "c == 9", unless there's an exceptionally high number of duplicate values and a small universe of potential values for 'c'.
So, a simple analysis of the predicates would be an easy starting point. Equality conditions are highly likely to be the most selective (exhibit the highest selectivity).
From that point, an RDBMS will conduct a sequential scan of the records in that result set to filter on the remaining predicates. That's probably your best approach, too.
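A sketch of that idea in plain C#; Item, its int fields, and the tolerances are stand-ins taken from the example criteria in the question (a == 5+-1, b == 21+-2, c == 9):

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch only: index on the most selective equality-tested property (c here),
// then scan just that small bucket for the remaining range predicates.
class Item { public int a, b, c, d; }

class ItemIndex
{
    private readonly Dictionary<int, List<Item>> byC = new Dictionary<int, List<Item>>();

    public void Add(Item item)
    {
        if (!byC.TryGetValue(item.c, out var bucket))
            byC[item.c] = bucket = new List<Item>();
        bucket.Add(item);
    }

    public void Remove(Item item)
    {
        if (byC.TryGetValue(item.c, out var bucket))
            bucket.Remove(item);
    }

    // a == targetA +- 1, b == targetB +- 2, c == targetC, any d
    public IEnumerable<Item> Query(int targetA, int targetB, int targetC)
    {
        if (!byC.TryGetValue(targetC, out var bucket))
            return Enumerable.Empty<Item>();
        return bucket.Where(x => Math.Abs(x.a - targetA) <= 1
                              && Math.Abs(x.b - targetB) <= 2);
    }
}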
Or, there's any number of in-memory, small-footprint, SQL-capable DBMSs that will do the heavy lifting for you (eXtremeDB, SQLite, RDM, ... google is your friend) and/or that have lower-level interfaces which won't do all the work for you (still, most of it) but also won't impose SQL on you.
First, having lots of SortedLists is not bad design. It's essentially the way that all modern RDBMSes solve the same problem.
Further to this: If there was a simple, general, close-to-optimally-efficient way to answer such queries, RDBMSes would not bother with the comparatively complicated and slow hack of query plan optimisation: that is, generating large numbers of candidate query plans and then heuristically estimating which one will take the least time to execute.
Admittedly, queries with many joins between tables are what tends to make the space of possible plans huge in practice with RDBMSes, and you don't seem to have those here. But even with just a single table (set of objects), if there are k fields that can be used for selecting rows (objects), then you could theoretically have k! different indices (SortedLists of (key, value) pairs in which the key is some ordered sequence of the k field values, and the value is e.g. a memory pointer to the object) to choose from. If the outcome of the query is a single object (or alternatively, if the query contains a non-range clause for all k fields) then the index used won't matter -- but in every other case, each index will in general perform differently, so a query planner would need to have accurate estimates of the selectivity of each clause in order to choose the best index to use.
I am trying to essentially see if entities exist in a local context and sort them accordingly. This function seems to be faster than others we have tried; it runs in about 50 seconds for 1000 items, but I am wondering if there is something I can do to improve the efficiency. I believe the Find here is slowing it down significantly, as a simple foreach iteration over 1000 items takes milliseconds and benchmarking shows the bottleneck is there. Any ideas would be helpful. Thank you.
Sample code:
foreach (var entity in entities)
{
    var localItem = db.Set<T>().Find(Key);
    if (localItem != null)
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
If this is a database (which, from the comments, I've gathered it is...), you would be better off doing fewer queries.
list1.AddRange(db.Set<T>().Where(x => x.Key == Key));
list2.AddRange(db.Set<T>().Where(x => x.Key != Key));
This would be 2 queries instead of 1000+.
Also be aware of the fact that by adding each one to a List<T>, you're keeping two large collections in memory. So if 1000+ turns into 10000000, you're going to have interesting memory issues.
See this post on my blog for more information: http://www.artisansoftware.blogspot.com/2014/01/synopsis-creating-large-collection-by.html
If I understand correctly, the database seems to be the bottleneck? If you want to efficiently select data from a database relation whose attribute x should match an equality (==) criterion, you should consider creating a secondary access path for that attribute (an index structure; a short sketch follows the list below). Depending on your database system and the value distribution in your table, this might be a hash index (especially good for equality checks) or a B+-tree (an allrounder) or whatever your system offers you.
However, this only works if:
(1) you don't just fetch the full data set once and then have to live with that copy in your application,
(2) adding (another) index to the relation is not out of the question (e.g. it may not be worth having one for a single need), and
(3) adding an index would actually be effective - it won't be, for example, if the attribute you are querying on has very few unique values.
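If you happen to be using EF Core code-first, for instance, such an index can be declared in the model configuration. This is only a sketch with placeholder names ("MyEntity", "KeyValue"); with other stacks you would create the index directly in the database:

// Sketch only: assumes EF Core code-first; goes inside your DbContext class.
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<MyEntity>()
        .HasIndex(e => e.KeyValue);   // secondary access path for equality lookups on KeyValue
}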
I found your answers very helpful, but here is ultimately how I solved the problem. It seemed .Find was the bottleneck.
var tableDictionary = db.Set<T>().ToDictionary(x => x.KeyValue, x => x);

foreach (var entity in entities)
{
    if (tableDictionary.ContainsKey(entity.KeyValue))
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
This ran with 900+ rows in about a tenth of a second, which for our purposes was efficient enough.
Rather than querying the DB for each item, you can just do one query, get all of the data (since you want all of the data from the DB eventually) and you can then group it in memory, which can be done (in this case) about as efficiently as in the database. By creating a lookup of whether or not the key is equal, we can easily get the two groups:
var lookup = db.Set<T>().ToLookup(item => item.Key == Key);
var list1 = lookup[true].ToList();
var list2 = lookup[false].ToList();
(You can use AddRange instead if the lists have previous values that should also be in them.)
At the moment I am using a custom class derived from HashSet. There's a point in the code where I select items under a certain condition:
var c = clusters.Where(x => x.Label != null && x.Label.Equals(someLabel));
It works fine and I get those elements. But is there a way that I could receive the index of such an element within the collection, to use with the ElementAt method, instead of whole objects?
It would look more or less like this:
var c = select element index in collection under certain condition;
int index = c.ElementAt(0); //get first index
clusters.ElementAt(index).RunObjectMthod();
Is manually iterating over the whole collection a better way? I need to add that it's in a bigger loop, so this Where clause is performed multiple times for different someLabel strings.
Edit
What do I need this for? clusters is a set of clusters over some document collection. Documents are grouped into clusters by topic similarity. One of the last steps of the algorithm is to discover a label for each cluster, but the algorithm is not perfect and sometimes produces two or more clusters with the same label. What I want to do is simply merge those clusters into one big cluster.
Sets don't generally have indexes. If position is important to you, you should be using a List<T> instead of (or possibly as well as) a set.
Now SortedSet<T> in .NET 4 is slightly different, in that it maintains a sorted value order. However, it still doesn't implement IList<T>, so access by index with ElementAt is going to be slow.
If you could give more details about why you want this functionality, it would help. Your use case isn't really clear at the moment.
In the case where you hold elements in a HashSet and sometimes need to get elements by index, consider using the ToList() extension method in such situations. That way you use the features of HashSet and can still take advantage of indexes when you need them.
HashSet<T> hashset = new HashSet<T>();
//the special situation where we need index way of getting elements
List<T> list = hashset.ToList();
//doing our special job, for example mapping the elements to EF entities collection (that was my case)
//we can still operate on hashset for example when we still want to keep uniqueness through the elements
There's no such thing as an index with a hash set. One of the ways that hash sets gain efficiency in some cases is by not having to maintain one.
I also don't see what the advantage would be here. If you were to obtain the index and then use it, this would be less efficient than just obtaining the element (obtaining the index would be just as expensive, and then you'd have an extra operation on top).
If you want to do several operations on the same object, just hold onto that object.
If you want to do something on several objects, do so by iterating through them (a normal foreach, or a foreach over the results of a Where(), etc.). If you want to do something on several objects and then do something else on those same objects, and you have to do it in such batches rather than doing all the operations in the same foreach, then store the results of the Where() in a List<T>.
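For the merge-by-label use case from the question's edit, a sketch along those lines might look like this (MergeInto is a hypothetical merge operation on your cluster class):

// Sketch only: group clusters by label and fold duplicates into the first one.
var duplicateGroups = clusters
    .Where(c => c.Label != null)
    .GroupBy(c => c.Label)
    .Where(g => g.Count() > 1)
    .ToList();                          // materialise before mutating the set

foreach (var dupGroup in duplicateGroups)
{
    var target = dupGroup.First();
    foreach (var extra in dupGroup.Skip(1))
    {
        extra.MergeInto(target);        // hypothetical: move the extra cluster's documents into target
        clusters.Remove(extra);         // HashSet removal works by object, no index needed
    }
}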
Why not use a dictionary?
Dictionary<string, int> dic = new Dictionary<string, int>();
for (int i = 0; i < 10; i++)
{
    dic.Add("value " + i, dic.Count + 1);
}

string find = "value 3";
int position = dic[find];
Console.WriteLine("the position of " + find + " is " + position);
On a LINQ result like this:
var result = from x in Items select x;
List<T> list = result.ToList<T>();
However, the ToList<T>() call is really slow. Is it because it makes the list mutable and therefore the conversion is slow?
In most cases I can manage with just my IEnumerable (or a Paralell.DistinctQuery), but now I want to bind the items to a DataGridView, so I need something other than IEnumerable. Any suggestions on how to gain performance on ToList, or on a replacement for it?
On 10 million records in the IEnumerable, the .ToList<T> takes about 6 seconds.
.ToList() is slow in comparison to what?
If you are comparing
var result = from x in Items select x;
List<T> list = result.ToList<T>();
to
var result = from x in Items select x;
you should note that since the query is evaluated lazily, the first line doesn't do much at all. It doesn't retrieve any records. Deferred execution makes this comparison completely unfair.
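A small sketch (LINQ to Objects, purely illustrative) showing that the first line does essentially no work:

using System;
using System.Linq;

// Sketch only: the projection runs only when the sequence is actually enumerated.
Func<int, int> fetch = x => { Console.WriteLine("fetching " + x); return x; };

var result = from x in Enumerable.Range(1, 3)
             select fetch(x);          // nothing printed yet: result is just a query object

Console.WriteLine("calling ToList...");
var list = result.ToList();            // "fetching 1..3" is printed only now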
It's because LINQ likes to be lazy and do as little work as possible. This line:
var result = from x in Items select x;
despite your choice of name, isn't actually a result, it's just a query object. It doesn't fetch any data.
List<T> list = result.ToList<T>();
Now you've actually requested the result, hence it must fetch the data from the source and make a copy of it. ToList guarantees that a copy is made.
With that in mind, it's hardly surprising that the second line is much slower than the first.
No, it's not creating the list that takes time, it's fetching the data that takes time.
Your first code line doesn't actually fetch the data, it only sets up an IEnumerable that is capable of fetching the data. It's when you call the ToList method that it will actually get all the data, and that is why all the execution time is in the second line.
You should also consider if having ten million lines in a grid is useful at all. No user is ever going to look through all the lines, so there isn't really any point in getting them all. Perhaps you should offer a way to filter the result before getting any data at all.
I think it's because of memory reallocations: ToList cannot know the size of the collection beforehand, so it cannot allocate enough storage up front to hold all the items. Therefore, it has to reallocate the List<T>'s internal storage as it grows.
If you can estimate the size of your resultset, it'll be much faster to preallocate enough elements using List<T>(int) constructor overload, and then manually add items to it.
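A sketch of that, in the same notation as the question (T stands for the element type, result is the lazily evaluated query, and estimatedCount is whatever estimate you can make):

// Sketch only: preallocate the backing storage so the list never has to grow.
int estimatedCount = 10_000_000;              // your estimate of the result size

List<T> list = new List<T>(estimatedCount);   // List<T>(int capacity) constructor overload
foreach (var item in result)
{
    list.Add(item);                           // no reallocations as long as the estimate holds
}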