Efficient design to search for objects by multiple parameters with range - c#

I have a set of objects of the same type in memory, and each has multiple immutable int properties (but not only those).
I need to find an object in it (or several) whose properties are within a small range of specified values, e.g. a == 5 ± 1 && b == 21 ± 2 && c == 9 && any d.
What's the best way to store the objects so I can retrieve them efficiently like this?
I thought about creating a SortedList for each property and using BinarySearch, but I have a lot of properties, so I'd like a more generic approach than maintaining so many SortedLists.
It's important that the set itself is not immutable: I need the ability to add/remove items.
Is there something like an in-memory DB for objects (not just data)?

Just to expand on #j_random_hacker's answer a bit: the usual approach to 'estimates of the selectivity' is to build a histogram for the index. But you might already know intuitively which criterion will yield the smallest initial result set out of "a == 5 ± 1 && b == 21 ± 2 && c == 9". Most likely it is "c == 9", unless 'c' has an exceptionally high number of duplicate values and a small universe of potential values.
So a simple analysis of the predicates is an easy starting point. Equality conditions are highly likely to be the most selective (i.e., to exhibit the highest selectivity).
From that point, an RDBMS will conduct a sequential scan of the records in that initial result set to filter on the remaining predicates. That's probably your best approach, too.
Or, there are any number of in-memory, small-footprint, SQL-capable DBMSes that will do the heavy lifting for you (eXtremeDB, SQLite, RDM, ... Google is your friend) and/or that have lower-level interfaces that won't do all the work for you (still, most of it) but also won't impose SQL on you.
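To make that concrete, here is a minimal sketch of the idea in C#; the Item record and all property names are hypothetical, not from the question. It indexes the equality predicate with a lookup and then sequentially scans the small candidate set for the range predicates.
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical item type standing in for the asker's objects.
public record Item(int A, int B, int C, int D);

public static class SelectiveLookup
{
    // Index the most selective predicate (the equality on C) once, then
    // sequentially scan the small candidate set for the range predicates.
    public static List<Item> Find(ILookup<int, Item> byC,
                                  int c, int a, int aTol, int b, int bTol)
    {
        return byC[c]                                  // only items with C == c
            .Where(i => Math.Abs(i.A - a) <= aTol &&   // range filter on A
                        Math.Abs(i.B - b) <= bTol)     // range filter on B
            .ToList();
    }
}

// Usage: build the lookup once, rebuild or maintain it as items change.
// var byC = items.ToLookup(i => i.C);
// var hits = SelectiveLookup.Find(byC, c: 9, a: 5, aTol: 1, b: 21, bTol: 2);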

First, having lots of SortedLists is not bad design. It's essentially the way that all modern RDBMSes solve the same problem.
Further to this: If there was a simple, general, close-to-optimally-efficient way to answer such queries, RDBMSes would not bother with the comparatively complicated and slow hack of query plan optimisation: that is, generating large numbers of candidate query plans and then heuristically estimating which one will take the least time to execute.
Admittedly, queries with many joins between tables are what tends to make the space of possible plans huge in practice with RDBMSes, and you don't seem to have those here. But even with just a single table (set of objects), if there are k fields that can be used for selecting rows (objects), then you could theoretically have k! different indices (SortedLists of (key, value) pairs in which the key is some ordered sequence of the k field values, and the value is e.g. a memory pointer to the object) to choose from. If the outcome of the query is a single object (or alternatively, if the query contains a non-range clause for all k fields) then the index used won't matter -- but in every other case, each index will in general perform differently, so a query planner would need to have accurate estimates of the selectivity of each clause in order to choose the best index to use.
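For illustration only, here is a minimal sketch of the kind of per-property SortedList index the question describes, with a binary-searched range lookup. The RangeIndex name and shape are my own; in practice you would keep one such index per field and start from the clause you estimate to be most selective, then filter the candidates on the remaining clauses.
using System;
using System.Collections.Generic;

// One per-property index: items bucketed by an int key, kept in a SortedList
// so that a range [lo, hi] can be located with a binary search over the keys.
public class RangeIndex<TItem>
{
    private readonly Func<TItem, int> _key;
    private readonly SortedList<int, List<TItem>> _buckets =
        new SortedList<int, List<TItem>>();

    public RangeIndex(Func<TItem, int> key) => _key = key;

    public void Add(TItem item)
    {
        int k = _key(item);
        if (!_buckets.TryGetValue(k, out var bucket))
            _buckets[k] = bucket = new List<TItem>();
        bucket.Add(item);
    }

    public void Remove(TItem item)
    {
        if (_buckets.TryGetValue(_key(item), out var bucket))
            bucket.Remove(item);
    }

    // All items whose key lies in [lo, hi].
    public IEnumerable<TItem> Range(int lo, int hi)
    {
        var keys = _buckets.Keys;                 // already sorted
        for (int i = LowerBound(keys, lo); i < keys.Count && keys[i] <= hi; i++)
            foreach (var item in _buckets.Values[i])
                yield return item;
    }

    // Index of the first key >= value (classic lower-bound binary search).
    private static int LowerBound(IList<int> keys, int value)
    {
        int lo = 0, hi = keys.Count;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (keys[mid] < value) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}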

Related

Do I need to index on a id in EF Core if I'm searching for an id in 2 different columns?

If I do a query like the one below, where I'm searching for the same ID but on two different columns, should I have an index like this? Or should I create 2 separate indexes, one for each column?
modelBuilder.Entity<Transfer>()
    .HasIndex(p => new { p.SenderId, p.ReceiverId });
Query:
var transfersCount = await _dbContext.Transfers
    .Where(p => p.ReceiverId == user.Id || p.SenderId == user.Id)
    .CountAsync();
What if I have a query like the one below; would I need a multi-column index on all 4 columns?
var transfersCount = await _dbContext.Transfers
    .Where(p => (p.SenderId == user.Id || p.ReceiverId == user.Id) &&
                (!transferParams.Status.HasValue || p.TransferStatus == (TransferStatus)transferParams.Status) &&
                (!transferParams.Type.HasValue || p.TransferType == (TransferType)transferParams.Type))
    .CountAsync();
I recommend two single-column indices.
The two single-column indices will perform better in this query because both columns would be in a fully ordered index. By contrast, in a multi-column index, only the first column is fully ordered in the index.
If you were using an AND condition for the sender and receiver, then you would benefit from a multi-column index. The multi-column index is ideal for situations where multiple columns have conditional statements that must all be evaluated to build the result set (e.g., WHERE receiver = 1 AND sender = 2). In an OR condition, a multi-column index would be leveraged as though it were a single-column index only for the first column; the second column would be unindexed.
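As a sketch, the two single-column indexes would look something like this in EF Core model configuration (entity and property names taken from the question):
// Inside your DbContext.
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    // One index per column, so either side of the OR can be served by a
    // fully ordered index.
    modelBuilder.Entity<Transfer>()
        .HasIndex(p => p.SenderId);

    modelBuilder.Entity<Transfer>()
        .HasIndex(p => p.ReceiverId);
}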
The full intricacies of index design would take well more than an SO answer to explain; there are probably whole books about it, and it features as a reasonable proportion of a database administrator's job.
Indexes have a cost to maintain, so you generally strive to have the fewest possible that offer you the most flexibility with what you want to do. Generally an index will have some columns that define its key and a reference to the rows in the table that have those keys. When using an index, the database engine can quickly look up the key and discover which rows it needs to read. It then looks up those rows as a secondary operation.
Indexes can also store table data that isn't part of the lookup key, so you might find yourself creating indexes that also track other columns from the row, so that by the time the database has found the key it's looking for in the index it also has access to the row data the query wants and doesn't then need to launch a second lookup operation to find the row. If a query wants too many rows from a table, the database might decide not to use the index at all; there's some threshold beyond which it's faster to just read all the rows directly from the table and search them rather than suffer the indirection of using the index to find which rows need to be read.
The columns that an index covers can serve more than one query; order is important. If you always query a person by name and also sometimes query by age, but you never query by age alone, it would be better to index (name, age) than (age, name). An index on (name, age) can serve a query for just WHERE name = ..., and also WHERE name = ... AND age = .... If you use an OR keyword in a where clause, you can consider that as a separate query entirely that would need its own index; indeed, the database might decide to run "name or age" as two parallel queries and combine the results to remove duplicates. If your app's needs later change so that instead of querying just a mix of (name), (name and age) it is now frequently querying (name), (name and age), (name or age), (age), (age and height), then it might make sense to have two indexes: (name, age) plus (age, height). The database can use part or all of both of these to serve the common queries. Remember that using part of an index only works from left to right: an index on (name, age) wouldn't typically serve a query for age alone.
If you're using SQL Server and SSMS, you might find that showing the query plan also reveals a missing-index recommendation, and it's worth considering carefully whether an index needs to be added. Apps deployed to Microsoft Azure also get automatic analysis of common queries whose performance suffers from the lack of an index, which can be the impetus to look at the query being run and see how existing or new indexes might be extended or rearranged to cover it. As first noted, it's not really something a single SO answer of a few lines can prep you for with an "always do this and it will be fine". Companies operating at large scale hire people whose sole mission is to make sure the database runs well; they usually grumble a lot about the devs, and more so about things like Entity Framework, because an EF LINQ query is a layer disconnected from the actual SQL being run and may not be the most optimal approach to getting the data. All of these things you have to contend with.
In this particular case, it seems like an index on SenderId+TransferStatus+TransferType and another on ReceiverId+TransferStatus+TransferType could help the two queries shown, but I wouldn't go as far as to say "definitely do that" without taking a holistic view of everything this table contains, how many different values there are in those columns, and what it's used for in the context of the app. If Sender/Receiver are unique, there may be no point in adding more columns to the index as keys. If TransferStatus and TransferType vary such that some combination of them helps uniquely identify a particular row out of hundreds, then it may make sense; but then, if this query only runs once a day compared to another that is used 10 times a second... There are too many variables and unknowns to provide a concrete answer to the question as presented. Don't optimize prematurely; indexing columns just because they're used in some WHERE clause somewhere would be premature.
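Purely for illustration, those composite indexes would look something like this in EF Core (not a firm recommendation, for the reasons given above):
// Column order matters: the equality column every query filters on comes first,
// so the index prefix can always be used.
modelBuilder.Entity<Transfer>()
    .HasIndex(p => new { p.SenderId, p.TransferStatus, p.TransferType });

modelBuilder.Entity<Transfer>()
    .HasIndex(p => new { p.ReceiverId, p.TransferStatus, p.TransferType });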

Most efficient collection for storing data from LINQ to Entities?

I have read several different sources over the years that indicate that when storing a collection of data, a List<T> is efficient when you want to insert objects, and an IEnumerable<T> is best for enumerating over a collection.
In LINQ-to-Entities, there is the AsEnumerable() function, that will return an IEnumerable<T>, but it will not resolve the SQL created by the LINQ statement until you start enumerating over the list.
What if I want to store objects from LINQ to Entities in a collection and then query on that collection later?
Using this strategy causes the SQL to be resolved by adding a WHERE clause and querying each record separately. I specifically don't want to do that because I'm trying to limit network chatter:
var myDataToLookup = context.MyData.AsEnumerable();
foreach (var myOtherDatum in myOtherDataList)
{
    // gets a single record from the database each time.
    var myDatum = myDataToLookup.SingleOrDefault(w => w.key == myOtherDatum.key);
}
How do I resolve the SQL upfront so myDataToLookup actually contains the data in memory? I've tried ToArray:
var myDataToLookup = context.MyData.ToArray();
But I recently learned that it actually uses more memory than ToList does:
Is it better to call ToList() or ToArray() in LINQ queries?
Should I use a join instead?
var myCombinedData = from o in myOtherDataList
                     join d in myDataToLookup on
                         o.key equals d.key
                     select new { myOtherData = o, myData = d };
Should I use ToDictionary and store my key as the key to the dictionary? Or am I worrying too much about this?
If you're using LINQ to Entities then you should not worry if ToArray is slower than ToList. There is almost no difference between them in terms of performance and LINQ to Entities itself will be a bottleneck anyway.
Regarding a dictionary: it is a structure optimized for reads by key. There is an additional cost when adding new items, though. So, if you will read by key a lot and add new items only occasionally, then that's the way to go. But to be honest, you probably should not bother at all: if the data size is not big, you won't see a difference.
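If you do go the dictionary route, a minimal sketch would look like this (property names reused from the question's example; ToDictionary assumes the keys are unique, ToLookup tolerates duplicates):
// One round trip to materialize the data, then O(1) lookups by key.
var myDataToLookup = context.MyData.ToDictionary(d => d.key);

foreach (var myOtherDatum in myOtherDataList)
{
    myDataToLookup.TryGetValue(myOtherDatum.key, out var myDatum);
    // myDatum is null here when there is no matching record.
}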
Think of IEnumerable, ICollection and IList/IDictionary as a hierarchy, each one inheriting from the previous one. Arrays add a level of restriction and complexity on top of lists. Simply put, IEnumerable gives you iteration only. ICollection adds counting, and IList then gives richer functionality, including finding, adding and removing elements by index or via lambda expressions. Dictionaries provide efficient access via a key. Arrays are much more static.
So, the answer then depends on your requirements. If it is appropriate to hold the data in memory and you need to frequently re-query it then I usually convert the Entity result to a List. This also loads the data.
If access via a set of keys is paramount then I use a Dictionary.
I cannot remember the last time I used an array, except for infrequent and very specific purposes.
So, not a direct answer, but as your question and the other replies indicate, there isn't a single answer and the solution will be a compromise.
When I code and measure performance and data carried over the network, here is how I look at things based on your example above.
Let's say your result returns 100 records. Your code has now run a query on the server and performed 1 second of processing (I made the number up for sake of argument).
Then you need to convert it to a list, which is going to be 1 more second of processing. Then you want to find all records that have a value of 1. The code will now loop through the entire list to find the values with 1 and then return you the result. This is, let's say, another 1 second of processing, and it finds 10 records.
Your network is going to carry over 10 records that took 3 seconds to process.
If you move your logic to your data layer and make your query search right away for the records that you want, you can save 2 seconds of processing and still only carry 10 records across the network. The bonus is also that you can just use IEnumerable<T> as the result and not have to convert it to a list, thus eliminating the 1 second of converting to a list and the 1 second of iterating through the list.
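A rough sketch of what pushing the filter into the data layer could look like; the keys variable and the property names are assumptions based on the example above:
// The Where clause is translated to SQL (an IN list), so only the matching
// rows cross the network.
var keys = myOtherDataList.Select(o => o.key).ToList();

var matches = context.MyData
    .Where(d => keys.Contains(d.key))   // runs on the server
    .AsEnumerable();                    // rows stream back when enumerated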
I hope this helps answer your question.

Compare very large lists of database objects in c#

I have inherited a poorly designed database table (no primary key or indexes, oversized nvarchar fields, dates stored as nvarchar, etc.). This table has roughly 350,000 records. I get handed a list of around 2,000 potentially new records at predefined intervals, and I have to insert any of the potentially new records if the database does not already have a matching record.
I initially tried making comparisons in a foreach loop, but it quickly became obvious that there was probably a much more efficient way. After doing some research, I then tried the .Any(), .Contains(), and .Exclude() methods.
My research leads me to believe that the .Exclude() method would be the most efficient, but I get out-of-memory errors when trying that. The .Any() and .Contains() methods seem to both take roughly the same time to complete (which is faster than the foreach loop).
The structure of the two lists is identical, and each contains multiple strings. I have a few questions that I have not found satisfying answers to, if you don't mind.
When comparing two lists of objects (made up of several strings), is the .Exclude() method considered to be the most efficient?
Is there a way to use projection when using the .Exclude() method? What I would like to find a way to accomplish would be something like:
List<Data> storedData = db.Data;
List<Data> incomingData = someDataPreviouslyParsed;

// No projection; this runs out of memory
var newData = incomingData.Exclude(storedData).ToList();

// Pseudocode for what I would like to figure out if it is possible:
// first use projection on db so as to not get a bunch of irrelevant data
List<Data> storedData = db.Data.Select(x => new { x.field1, x.field2, x.field3 });
var newData = incomingData.Select(x => new { x.field1, x.field2, x.field3 }).Exclude(storedData).ToList();
Using a raw SQL statement in SQL Server Management Studio, the query takes slightly longer than 10 seconds. Using EF, it seems to take in excess of a minute. Is that poorly optimized SQL generated by EF, or is that overhead from EF that makes such a difference?
Would raw SQL in EF be a better practice in a situation like this?
Semi-Off-Topic:
When grabbing the data from the database and storing it in the variable storedData, does that eliminate the usefulness of any indexes (should there be any) stored in the table?
I hate to ask so many questions, and I'm sure that many (if not all) of them are quite noobish. However, I have nowhere else to turn, and I have been looking for clear answers all day. Any help is very much so appreciated.
UPDATE
After further research, I have found what seems to be a very good solution to this problem. Using EF, I grab the 350,000 records from the database, keeping only the columns I need to create a unique record. I then take that data and convert it to a dictionary, grouping the kept columns as the key (as can be seen here). This solves the problem of there already being duplicates in the returned data, and gives me something fast to compare my newly parsed data against. The performance increase was very noticeable!
I'm still not sure if this would be approaching the best practice, but I can certainly live with the performance of this. I have also seen some references to ToLookup() that I may try to get working to see if there is a performance gain there as well. Nevertheless, here is some code to show what I did:
var storedDataDictionary = storedData
    .GroupBy(k => (k.Field1 + k.Field2 + k.Field3 + k.Field4))
    .ToDictionary(g => g.Key, g => g.First());

foreach (var item in parsedData)
{
    if (storedDataDictionary.ContainsKey(item.Field1 + item.Field2 + item.Field3 + item.Field4))
    {
        // duplicateData is a previously defined list
        duplicateData.Add(item);
    }
    else
    {
        // newData is a previously defined list
        newData.Add(item);
    }
}
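For comparison, here is a small sketch of the ToLookup() alternative mentioned above, using the same field names; a lookup tolerates duplicate keys, so the GroupBy/First step isn't needed:
// Build the lookup once from the stored data.
var storedLookup = storedData
    .ToLookup(k => k.Field1 + k.Field2 + k.Field3 + k.Field4);

foreach (var item in parsedData)
{
    var key = item.Field1 + item.Field2 + item.Field3 + item.Field4;
    if (storedLookup.Contains(key))
        duplicateData.Add(item);   // already in the database
    else
        newData.Add(item);         // genuinely new record
}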
No reason to use EF for that.
Grab only the columns that are required for you to decide whether to update or insert the record (i.e. those that represent the missing "primary key"). Don't waste memory on the other columns.
Build a HashSet of the existing primary keys (if the primary key is a number, a HashSet of int; if it spans multiple columns, combine them into a string).
Check your 2,000 items against the HashSet; that is very fast.
Update or insert items with raw SQL.
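A minimal sketch of the HashSet check, reusing the field names from the question's update (the '|' separator is my own addition, to avoid accidental collisions when concatenating fields):
// Build the set of existing composite keys once.
var existingKeys = new HashSet<string>(
    storedData.Select(d => d.Field1 + "|" + d.Field2 + "|" + d.Field3 + "|" + d.Field4));

// Anything whose key is not in the set is a new record.
var newRecords = parsedData
    .Where(p => !existingKeys.Contains(p.Field1 + "|" + p.Field2 + "|" + p.Field3 + "|" + p.Field4))
    .ToList();

// newRecords can then be inserted with raw SQL (or a bulk-insert helper).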
I suggest you consider doing it in SQL, not C#. You don't say what RDBMS you are using, but you could look at the MERGE statement, e.g. (for SQL Server 2008):
https://technet.microsoft.com/en-us/library/bb522522%28v=sql.105%29.aspx
Broadly, the statement checks whether a record is 'new': if so, you can INSERT it; if not, there are UPDATE and DELETE capabilities, or you can just ignore it.

LINQ Intersection of two different types

I have two different list types. I need to remove the elements from list1 that don't have a matching element in list2 satisfying certain criteria.
Here is what I tried; it seems to work, but each element is listed twice.
var filteredTracks =
    from mtrack in mTracks
    join ftrack in tracksFileStatus on mtrack.id equals ftrack.Id
    where mtrack.id == ftrack.Id && ftrack.status == "ONDISK" && ftrack.content_type_id == 234
    select mtrack;
Ideally I don't want to create a new copy of the filteredTracks, is it possible modify mTracks in place?
If you're getting duplicates, it's because your id fields are not unique in one or both of the two sequences. Also, you don't need to say where mtrack.id == ftrack.Id since that condition already has to be met for the join to succeed.
I would probably use loops here, but if you are dead set on LINQ, you may need to group tracksFileStatus by its Id field. It's hard to tell by what you posted.
As far as "modifying mTracks in place", this is probably not possible or worthwhile (I'm assuming that mTracks is some type derived from IEnumerable<T>). If you're worried about the efficiency of this approach, then you may want to consider using another kind of data structure, like a dictionary with Id values as the keys.
Since the Q was about lists primarily...
this is probably better LINQ-wise:
var test = (from m in mTracks
            from f in fTracks
            where m.Id == f.Id && ...
            select m);
However you should optimize, e.g.
Are your lists sorted? If they are, see e.g. Best algorithm for synchronizing two IList in C# 2.0
If it's coming from Db (it's not clear here), then you need to build your linq query based on the SQL / relations and indexes you have in the Db and go a bit different route.
If I were you, I'd make a query (for each of the lists, presuming it's not DB-bound) so that the tracks are sorted in the first place (sorted on whatever is used to compare them, usually),
then enumerate the two in parallel (using enumerators), comparing other things in the process (like in that link).
That's likely the most efficient way.
If/when it comes from a database, optimize at the 'source', i.e. fetch the data already sorted and filtered as much as you can. And basically, build the SQL first, or inspect the SQL generated by the LINQ query (let me know if you need the link).
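A rough sketch of that sorted, parallel-walk approach; the Track and FileStatus types, their integer id/Id properties, and the uniqueness of Ids on each side are all assumptions modelled on the question, and plain indices are used instead of raw enumerators for brevity:
// Both inputs must already be sorted by the comparison key.
static List<Track> IntersectSorted(List<Track> tracks, List<FileStatus> statuses)
{
    var result = new List<Track>();
    int i = 0, j = 0;
    while (i < tracks.Count && j < statuses.Count)
    {
        if (tracks[i].id < statuses[j].Id) i++;        // track has no match yet
        else if (tracks[i].id > statuses[j].Id) j++;   // status has no match yet
        else
        {
            result.Add(tracks[i]);
            i++; j++;          // advance both; assumes unique keys on each side
        }
    }
    return result;
}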

Is this an efficient way of doing a linq query with dynamic order by fields?

I have the following method to apply a sort to a list of objects (simplified it for the example):
private IEnumerable<Something> SetupOrderSort(IEnumerable<Something> input,
                                              SORT_TYPE sort)
{
    IOrderedEnumerable<Something> output = input.OrderBy(s => s.FieldA)
                                                .ThenBy(s => s.FieldB);
    switch (sort)
    {
        case SORT_TYPE.FIELD1:
            output = output.ThenBy(s => s.Field1);
            break;
        case SORT_TYPE.FIELD2:
            output = output.ThenBy(s => s.Field2);
            break;
        case SORT_TYPE.UNDEFINED:
            break;
    }
    return output.ThenBy(s => s.FieldC).ThenBy(s => s.FieldD)
                 .AsEnumerable();
}
What I need is to be able to insert a specific field in the midst of the order-by clause. By default the ordering is: FieldA, FieldB, FieldC, FieldD.
When a sort field is specified, though, I need to insert the specified field between FieldB and FieldC in the sort order.
Currently there are only 2 possible fields to sort by, but there could be up to 8. Performance-wise, is this a good approach? Is there a more efficient way of doing this?
EDIT: I saw the following thread as well: Dynamic LINQ OrderBy on IEnumerable<T>, but I thought it was overkill for what I needed. This is a snippet of code that executes a lot, so I just want to make sure I am not doing something that could easily be done better and that I am just missing.
Don't try and "optimize" stuff you haven't proved slow with a profiler.
It's highly unlikely that this will be slow enough to notice. I strongly suspect the overhead of actually sorting the list is far higher than that of the switch statement.
The important question is: Is this code maintainable? Will you forget to add another case the next time you add a property to Something? If that will be a problem, consider using the MS Dynamic Query sample, from the VS 2008 C# samples page.
Otherwise, you're fine.
There's nothing inefficient about your method, but there is something unintuitive about it, which is that you can't sort by multiple columns - something that end users are almost sure to want to do.
I might hand-wave this concern away on the chance that both columns are unique, but the fact that you subsequently hard-code in another sort at the end leads me to believe that Field1 and Field2 are neither related nor unique, in which case you really should consider the possibility of having an arbitrary number of levels of sorting, perhaps by accepting an IEnumerable<SORT_TYPE> or params SORT_TYPE[] argument instead of a single SORT_TYPE.
Anyway, as far as performance goes, the OrderBy and ThenBy extensions have deferred execution, so each successive ThenBy in your code is probably no more than a few CPU instructions; it's just wrapping one function in another. It will be fine; the actual sorting will be far more expensive.
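For illustration, a sketch of the params SORT_TYPE[] idea mentioned above (using a C# 8 switch expression; the mapping from SORT_TYPE to key selector mirrors the question's switch):
// Accept any number of sort keys and chain ThenBy calls between FieldB and FieldC.
private IEnumerable<Something> SetupOrderSort(IEnumerable<Something> input,
                                              params SORT_TYPE[] sorts)
{
    var output = input.OrderBy(s => s.FieldA).ThenBy(s => s.FieldB);

    foreach (var sort in sorts)
    {
        output = sort switch
        {
            SORT_TYPE.FIELD1 => output.ThenBy(s => s.Field1),
            SORT_TYPE.FIELD2 => output.ThenBy(s => s.Field2),
            _ => output                   // SORT_TYPE.UNDEFINED: no extra key
        };
    }

    return output.ThenBy(s => s.FieldC).ThenBy(s => s.FieldD);
}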
