Dictionary/List speed for database update - c#

I've got my model updating the database according to some information that comes in in the form of a Dictionary. The way I currently do it is below:
SortedItems = db.SortedItems.ToList();
foreach (SortedItem si in SortedItems)
{
    string key = si.PK1 + si.PK2 + si.PK3 + si.PK4;
    if (updates.ContainsKey(key) && updates[key] != si.SortRank)
    {
        si.SortRank = updates[key];
        db.SortedItems.ApplyCurrentValues(si);
    }
}
db.SaveChanges();
Would it be faster to iterate through the dictionary and do a DB lookup for each item? The dictionary only contains the items that have changed, and can hold anywhere from 2 items to the entire set. My idea for the alternate method would be:
foreach (KeyValuePair<string, int?> kvp in updates)
{
    SortedItem si = db.SortedItems.Single(s => (s.PK1 + s.PK2 + s.PK3 + s.PK4).Equals(kvp.Key));
    si.SortRank = kvp.Value;
    db.SortedItems.ApplyCurrentValues(si);
}
db.SaveChanges();
EDIT: Assume the number of updates is usually about 5-20% of the db entries

Let's look (say the database has 1000 items and the dictionary holds 50 updates):
Method 1:
You'd iterate through all 1000 items in the database
You'd probe the dictionary for every one of those items and have 950 misses against the dictionary
You'd still have 50 update calls to the database.
Method 2:
You'd iterate every item in the dictionary with no misses in the dictionary
You'd have 50 individual lookup calls to the database.
You'd have 50 update calls to the database.
This really depends on how big the dataset is and what % on average get modified.
You could also do something like this:
Method 3:
Build a set of all the keys from the dictionary
Query the database once for all items matching those keys
Iterate over the results and update each item (a rough sketch follows)
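For instance, a sketch of Method 3 against the model from the question; it assumes the PK columns are strings, so the concatenated key can be translated into the SQL query, and that the loaded entities are change-tracked:
var keys = updates.Keys.ToList();
var changed = db.SortedItems
    .Where(s => keys.Contains(s.PK1 + s.PK2 + s.PK3 + s.PK4))
    .ToList();
foreach (SortedItem si in changed)
{
    // the loaded entities are tracked, so assigning the new value is enough before SaveChanges
    si.SortRank = updates[si.PK1 + si.PK2 + si.PK3 + si.PK4];
}
db.SaveChanges();
That turns the work into one SELECT with an IN clause plus the update statements, instead of loading the whole table or issuing one lookup per changed key.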
Personally, I would try to determine your typical case scenario and profile each solution to see which is best. I really think the 2nd solution, though, will result in a ton of database and network hits if you have a large set and a large number of updates, since for each update it would have to hit the database twice (once to get the item, once to update the item).
So yes, this is a very long-winded way of saying, "it depends..."
When in doubt, I'd code both and time them based on reproductions of production scenarios.

To add to James' answer, you would get the fastest results using a stored proc (or a regular SQL command).
The problem with LINQ-to-Entities (and other LINQ providers, if they haven't been updated recently) is that they don't know how to produce SQL updates with where clauses:
update SortedItems set SortRank = @NewRank where PK1 = @PK1 and (etc.)
A stored procedure would do this at the server side, and you would only need a single db call.
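If a stored procedure is not an option, the "regular SQL command" route could look roughly like this for one changed row. This is a hedged sketch: it assumes an ObjectContext-style context (which ApplyCurrentValues suggests) and that the four PK values are available separately rather than only as the concatenated dictionary key; newRank and pk1..pk4 are illustrative names.
// requires using System.Data.SqlClient;
db.ExecuteStoreCommand(
    "update SortedItems set SortRank = @rank where PK1 = @pk1 and PK2 = @pk2 and PK3 = @pk3 and PK4 = @pk4",
    new SqlParameter("@rank", newRank), // illustrative local variables
    new SqlParameter("@pk1", pk1),
    new SqlParameter("@pk2", pk2),
    new SqlParameter("@pk3", pk3),
    new SqlParameter("@pk4", pk4));
This avoids loading the entity just to change one column, but it still issues one command per changed row, whereas a stored procedure could handle the whole batch in a single call.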

Related

Compare very large lists of database objects in c#

I have inherited a poorly designed database table (no primary key or indexes, oversized nvarchar fields, dates stored as nvarchar, etc.). This table has roughly 350,000 records. I get handed a list of around 2,000 potentially new records at predefined intervals, and I have to insert any of the potentially new records if the database does not already have a matching record.
I initially tried making comparisons in a foreach loop, but it quickly became obvious that there was probably a much more efficient way. After doing some research, I then tried the .Any(), .Contains(), and .Except() methods.
My research leads me to believe that the .Except() method would be the most efficient, but I get out-of-memory errors when trying that. The .Any() and .Contains() methods both seem to take roughly the same time to complete (which is faster than the foreach loop).
The structure of the two lists is identical, and each contains multiple strings. I have a few questions that I have not found satisfying answers to, if you don't mind.
When comparing two lists of objects (made up of several strings), is the .Except() method considered to be the most efficient?
Is there a way to use projection with the .Except() method? What I would like to accomplish is something like:
List<Data> storedData = db.Data.ToList();
List<Data> incomingData = someDataPreviouslyParsed;
// No projection - this is the version that runs out of memory
var newData = incomingData.Except(storedData).ToList();
// Pseudocode for what I would like to figure out, if it is possible:
// first use projection on db so as to not pull a bunch of irrelevant data
var storedKeys = db.Data.Select(x => new { x.field1, x.field2, x.field3 });
var newData2 = incomingData.Select(x => new { x.field1, x.field2, x.field3 }).Except(storedKeys).ToList();
Using a raw SQL statement in SQL Server Management Studio, the query takes slightly longer than 10 seconds. Using EF, it seems to take in excess of a minute. Is that poorly optimized SQL generated by EF, or is that EF overhead that makes such a difference?
Would raw SQL in EF be a better practice in a situation like this?
Semi-Off-Topic:
When grabbing the data from the database and storing it in the variable storedData, does that eliminate the usefulness of any indexes (should there be any) stored in the table?
I hate to ask so many questions, and I'm sure that many (if not all) of them are quite noobish. However, I have nowhere else to turn, and I have been looking for clear answers all day. Any help is very much so appreciated.
UPDATE
After further research, I have found what seems to be a very good solution to this problem. Using EF, I grab the 350,000 records from the database, keeping only the columns I need to create a unique record. I then take that data and convert it to a dictionary, grouping the kept columns as the key (as can be seen here). This solves the problem of there already being duplicates in the returned data, and gives me something fast to work with to compare my newly parsed data against. The performance increase was very noticeable!
I'm still not sure if this approaches best practice, but I can certainly live with the performance of this. I have also seen some references to ToLookup() that I may try to get working to see if there is a performance gain there as well (a sketch of that variant follows the code). Nevertheless, here is some code to show what I did:
var storedDataDictionary = storedData
    .GroupBy(k => (k.Field1 + k.Field2 + k.Field3 + k.Field4))
    .ToDictionary(g => g.Key, g => g.First());
foreach (var item in parsedData)
{
    if (storedDataDictionary.ContainsKey(item.Field1 + item.Field2 + item.Field3 + item.Field4))
    {
        // duplicateData is a previously defined list
        duplicateData.Add(item);
    }
    else
    {
        // newData is a previously defined list
        newData.Add(item);
    }
}
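The ToLookup() variant mentioned above would be only a small change. This is an untested sketch; a lookup tolerates duplicate keys on its own, so the GroupBy step goes away:
var storedDataLookup = storedData.ToLookup(k => k.Field1 + k.Field2 + k.Field3 + k.Field4);
foreach (var item in parsedData)
{
    if (storedDataLookup.Contains(item.Field1 + item.Field2 + item.Field3 + item.Field4))
    {
        duplicateData.Add(item);
    }
    else
    {
        newData.Add(item);
    }
}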
No reason to use EF for that.
Grab only the columns that are required for you to decide whether to update or insert the record (i.e. those that make up the missing "primary key"). Don't waste memory on the other columns.
Build a HashSet of the existing primary keys (if the primary key is a number, a HashSet of int; if it is composite, combine the parts into a string).
Check your 2,000 items against the HashSet - that is very fast.
Update or insert the items with raw SQL. (A rough sketch of these steps follows.)
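For example, a hedged sketch of the first steps, reusing the field names from the question (the "|" separator is just to keep concatenated keys unambiguous, and whether you pull the key columns through EF or a plain data reader is up to you):
// Build the key set from only the columns that identify a record.
var existingKeys = new HashSet<string>(
    db.Data.Select(d => d.Field1 + "|" + d.Field2 + "|" + d.Field3 + "|" + d.Field4));

// Check the ~2,000 incoming items against the set; HashSet lookups are effectively O(1).
var toInsert = incomingData
    .Where(d => !existingKeys.Contains(d.Field1 + "|" + d.Field2 + "|" + d.Field3 + "|" + d.Field4))
    .ToList();

// toInsert can then be written out with raw SQL (e.g. SqlBulkCopy or batched INSERTs).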
I suggest you consider doing it in SQL, not C#. You don't say what RDBMS you are using, but you could look at the MERGE statement, e.g. (for SQL Server 2008):
https://technet.microsoft.com/en-us/library/bb522522%28v=sql.105%29.aspx
Broadly, the statement checks whether a record is 'new' - if so, you can INSERT it; if not, there are UPDATE and DELETE capabilities, or you can just ignore it.
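For illustration only, here is a sketch of driving a MERGE from C#. It assumes the ~2,000 candidate rows have already been bulk-copied into a staging table, and every table and column name here (and the connectionString variable) is made up:
// requires using System.Data.SqlClient;
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(@"
    MERGE dbo.TargetTable AS t
    USING dbo.IncomingStaging AS s
        ON  t.Field1 = s.Field1
        AND t.Field2 = s.Field2
        AND t.Field3 = s.Field3
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Field1, Field2, Field3, Field4)
        VALUES (s.Field1, s.Field2, s.Field3, s.Field4);", conn))
{
    conn.Open();
    cmd.ExecuteNonQuery(); // one round trip; the comparison work stays on the server
}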

How to save and reuse an IQueryable, or its Where clause?

I have an IQueryable that has a Where clause with many parameters. I could save each parameter to the ASP.NET session and recreate the IQueryable from scratch, but I figured it is easier to save only one thing to the session: the IQueryable, or at least the Where clause of the IQueryable.
How to do this?
The query:
IQueryable<DAL.TradeCard> data = dc.TradeCard.Include("Address").Include("Vehicle");
data = data.Where(it =>
    (tbOrderNumber.Text == null || it.orderNumber == tbOrderNumber.Text) &&
    (tbPlateNumber.Text == null || it.Vehicle.plateNumber == tbPlateNumber.Text));
(now there are only 2 but there will be many more parameters)
PS: I don't want to save the query result to the session.
I'm not sure if this will work, but you could try to use DataContext.GetCommand to get the SQL command for the query and then save it to use it with DataContext.ExecuteQuery.
I had a whole big answer ready, but I misunderstood what you're trying to achieve :)
Ok, so let me write out what I think you're trying to do:
Server gets request ("get data for paramA = 1, paramB = 2, ... paramZ = 24")
Server runs a series of "Where" and gets a filtered result set
Server sends data to client
Client does some stuff that operates on the same set of filtered data. But you don't want the server to re-run the query! And you cannot save the data to the client's session because it's a lot of records.
Until the client explicitly calls the query with different params, the query should not be re-run.
I was working on a similar problem lately, but there isn't one magic bullet :)
Some ideas for the solution:
Cache a list of IDs. Unless the data goes into hundreds of thousands of records, you can probably save the IDs of the selected items to the session. It's what, 4-8 bytes per ID plus overhead? But that does re-run the query, just more efficiently: data = source.Include("...").Where(i => IdsFromSession.Contains(i.Id));
(Added in edit) Cache the query input string/object/however your search values are passed. You can probably fairly easily serialize it and use that (or a hash of it) as a server-side cache key.
(I love the idea, but it's a bit wonky :) ) Cache the Wheres! Now, this works like this:
Create a method that takes expressions instead of Funcs
Pass your where lambdas to that method and have that magic method actually return Funcs for "Where"
Get a unique hash for the where lambdas
Check the server-side cache for that hash; if needed, run the query and save the results under that hash.
Now, this is a huge overkill and overengineered solution, but I am personally dying to actually implement this. It would look like this:
class MagicClass { // don't have time for name-inventing :)
    private List<string> hashes = new List<string>();
    public string Hash { get { return String.Join("_", hashes); } }

    public Func<TIn, bool> MagicWhere<TIn>(Expression<Func<TIn, bool>> where)
    {
        var v = new MagicExpressionVisitor();
        v.Visit(where);
        hashes.Add(v.ExpressionHash);
        return where.Compile(); // I think that should do...
    }
}
class MagicExpressionVisitor : ExpressionVisitor
{
    public string ExpressionHash { get; set; }
    // Override ExpressionVisitor methods to get a possibly unique hash depending on what's actually in that expression
}
Usage:
var magic = new MagicClass();
var filtered = data.Where(magic.MagicWhere<DAL.TradeCard>(it => it.IsSomething && it.Name != "some name"));
if (!Cache.HasKey(magic.Hash))
    Cache[magic.Hash] = filtered.ToList();
return Cache[magic.Hash];
This is obviously untested, but doable. It won't execute the Wheres against the data source if such a query was already run (within the cache period). It has two advantages over caching IDs: 1. it works for many clients simultaneously (so a second client will benefit from the fact that the first one ran the same query), and 2. it doesn't touch the data source at all.
If the data source is a DB, you can probably find out what the actual SQL command being run is, SHA-### it, and save the results under that hash, but my crazy solution works for other data sources, like LINQ-to-Objects etc. ;)
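For what it's worth, one hedged way to flesh out MagicExpressionVisitor is to append a token for every node visited, so structurally identical lambdas produce the same hash (whether that is "unique enough" depends on your queries; this is a sketch, not a drop-in implementation):
// requires using System.Linq.Expressions; and using System.Text;
class MagicExpressionVisitor : ExpressionVisitor
{
    private readonly StringBuilder _hash = new StringBuilder();
    public string ExpressionHash { get { return _hash.ToString(); } }

    public override Expression Visit(Expression node)
    {
        if (node != null)
            _hash.Append(node.NodeType).Append(':').Append(node.Type.Name).Append('|');
        return base.Visit(node);
    }

    protected override Expression VisitConstant(ConstantExpression node)
    {
        // include constant values so different search parameters hash differently
        _hash.Append(node.Value ?? "null").Append('|');
        return base.VisitConstant(node);
    }
}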

Issue appending and deleting linked tables in Access using C#

I have a piece of code that goes through all the linked tables and tables in an Access database, and for every table (all linked in this case) that matches a certain criterion it should add a new table and delete the old one. The new table is on a SQL Server database and the old one on Oracle, however this is irrelevant. The code is:
var dbe = new DBEngine();
Database db = dbe.OpenDatabase(@"C:\Users\x339\Documents\Test.accdb");
foreach (TableDef tbd in db.TableDefs)
{
    if (tbd.Name.Contains("CLOASEUCDBA_T_"))
    {
        useddatabases[i] = tbd.Name;
        string tablename = CLOASTableDictionary[tbd.Name];
        string tablesourcename = CLOASTableDictionary[tbd.Name].Substring(6);
        var newtable = db.CreateTableDef(tablename.Trim());
        newtable.Connect = "ODBC;DSN=sql server copycloas;Trusted_Connection=Yes;APP=Microsoft Office 2010;DATABASE=ILFSView;";
        newtable.SourceTableName = tablesourcename;
        db.TableDefs.Append(newtable);
        db.TableDefs.Delete(tbd.Name);
        i++;
    }
}
foreach (TableDef tbd in db.TableDefs)
{
    Console.WriteLine("After loop " + tbd.Name);
}
There are 3 linked tables in this database: 'CLOASEUCDBA_T_AGENT', 'CLOASEUCDBA_T_CLIENT' and 'CLOASEUCDBA_T_BASIC_POLICY'. The issue with the code is that it updates the first two tables perfectly, but for some unknown reason it never finds the third. Then in the second loop, it prints it out... it seems to just skip over 'CLOASEUCDBA_T_BASIC_POLICY'. I really don't know why. The weird thing is that if I run the code again, it will change 'CLOASEUCDBA_T_BASIC_POLICY'. Any help would be greatly appreciated.
Modifying a collection while you are iterating over it can sometimes mess things up. Try a slightly different approach (sketched below):
Iterate over the TableDefs collection and build a List (or perhaps a Dictionary) of the items you need to change. Then,
Iterate over the List and update the items in the TableDefs collection.
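In code, that two-pass approach could look roughly like this. It is only a sketch and reuses the names from the question (CLOASTableDictionary, useddatabases, i) as-is:
// First pass: collect the names of the linked tables to replace.
var toReplace = new List<string>();
foreach (TableDef tbd in db.TableDefs)
{
    if (tbd.Name.Contains("CLOASEUCDBA_T_"))
        toReplace.Add(tbd.Name);
}

// Second pass: append the new linked tables and delete the old ones.
foreach (string oldName in toReplace)
{
    useddatabases[i] = oldName;
    string tablename = CLOASTableDictionary[oldName];
    string tablesourcename = CLOASTableDictionary[oldName].Substring(6);
    var newtable = db.CreateTableDef(tablename.Trim());
    newtable.Connect = "ODBC;DSN=sql server copycloas;Trusted_Connection=Yes;APP=Microsoft Office 2010;DATABASE=ILFSView;";
    newtable.SourceTableName = tablesourcename;
    db.TableDefs.Append(newtable);
    db.TableDefs.Delete(oldName);
    i++;
}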

Multiple operations under linq2Entities

I've been using Linq2Entities for quite some time now on small-scale programs, so usually my queries are basic (return elements with a certain value in a specific column, add/update an element, remove an element, ...).
Now I'm moving to a larger-scale program and, given my experience, went with Linq2Entities, and everything was fine until I had to remove a large number of elements.
In Linq2Entities I can only find a remove method that takes one entity - so to remove 1000 elements I first need to retrieve them, then apply a remove one by one, then call SaveChanges.
Either I am missing something or I feel a) this is really bad performance-wise and b) I would be better off making this a procedure in the database.
Am I right in my assumptions? Should massive removals/additions be made as procedures only? Or is there a way to do this efficiently with Linq2Entities that I am missing?
If you have the primary key values, then you can use the other features of the context by creating the objects manually, setting their key(s), attaching them to the context, and setting their state to Deleted.
var ids = new List<Guid>();
foreach (var id in ids)
{
    Employee employee = new Employee();
    employee.Id = id;
    var entityEntry = context.Entry(employee);
    entityEntry.State = EntityState.Deleted;
}
context.SaveChanges();

Efficiency of C# Find on 1000+ records

I am trying to essentially see if entities exist in a local context and sort them accordingly. This function seems to be faster than others we have tried; it runs in about 50 seconds for 1000 items, but I am wondering if there is something I can do to improve the efficiency. I believe the Find here is slowing it down significantly, as a simple foreach iteration over 1000 items takes milliseconds and benchmarking shows the bottleneck there. Any ideas would be helpful. Thank you.
Sample code:
foreach (var entity in entities)
{
    var localItem = db.Set<T>().Find(Key);
    if (localItem != null)
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
If this is a database (which from the comments I've gathered it is...)
You would be better off doing fewer queries.
list1.AddRange(db.Set<T>().Where(x => x.Key == Key));
list2.AddRange(db.Set<T>().Where(x => x.Key != Key));
This would be 2 queries instead of 1000+.
Also be aware of the fact that by adding each one to a List<T>, you're keeping 2 large arrays. So if 1000+ turns into 10000000, you're going to have interesting memory issues.
See this post on my blog for more information: http://www.artisansoftware.blogspot.com/2014/01/synopsis-creating-large-collection-by.html
If I understand correctly, the database seems to be the bottleneck? If you want to (effectively) select data from a database relation whose attribute x should match an ==-criterion, you should consider creating a secondary access path for that attribute (an index structure). Depending on your database system and the distribution in your table, this might be a hash index (especially good for == checks) or a B+-tree (an all-rounder) or whatever else your system offers you.
However, this only works if:
you don't just fetch the full data set once and have to live with that in your application;
adding (another) index to the relation is not out of the question (e.g. it might not be worth having for a single need);
adding an index would actually be effective - it won't be if, for example, the attribute you are querying on has very few unique values.
I found your answers very helpful, but here is ultimately how I solved the problem. It seemed .Find was the bottleneck.
var tableDictionary = db.Set<T>().ToDictionary(x => x.KeyValue, x => x);
foreach (var entity in entities)
{
    if (tableDictionary.ContainsKey(entity.KeyValue))
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
This ran with 900+ rows in about a tenth of a second, which for our purposes was efficient enough.
Rather than querying the DB for each item, you can just do one query, get all of the data (since you want all of the data from the DB eventually) and you can then group it in memory, which can be done (in this case) about as efficiently as in the database. By creating a lookup of whether or not the key is equal, we can easily get the two groups:
var lookup = db.Set<T>().ToLookup(item => item.Key == Key);
var list1 = lookup[true].ToList();
var list2 = lookup[false].ToList();
(You can use AddRange instead if the lists have previous values that should also be in them.)
