I receive a daily XML file which I use to update a database with the content. The file is always a complete file, i.e. everything is included whether it is changed or not. I am using Linq2Sql to update the database and I was debating whether to check if anything had changed in each record (most will not change) and only update those which did change, or just update each record with the current data.
I feel that I need to hit the database with an update for each record so that I can weed out the records which are not included in the XML file. I set a processed date on each record, then revisit those not processed and delete them. Then I wondered whether I should just find the corresponding record in the database and update the object with the current information whether it has changed or not. That led me to take a closer look at the SQL generated for updates. I found that only the data which has changed is set in the UPDATE statement sent to the database, but the WHERE clause includes all of the columns in the record, not just the primary key. This seems very wasteful in terms of data flying around the system, so I started wondering why this is the case and whether there is a setting on the LINQ to SQL context to use only the primary key in the clause.
So I have two questions:
Why does the LinqToSql WHERE clause include all of the current data, not just the primary key?
Is there a way to configure the context to only use the primary key in the where clause?
This is optimistic concurrency - it's basically making sure that it doesn't stomp on changes made by anything else. You can tweak the concurrency settings in various ways, although I'm not an expert on it.
The MSDN page for Linq to Sql optimistic concurrency is a good starting point.
If you have a column representing the "version" of the row (e.g. an autoupdated timestamp) you can use just that - or you can just set UpdateCheck=Never on all the columns if you know nothing else will have changed the data.
You haven't really described your use of the processed date in enough detail to comment on that part.
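To illustrate the version-column suggestion above, here is a minimal hand-written mapping sketch (the Product class, table and column names are invented for the example): with IsVersion set on a rowversion column, LINQ to SQL checks only the primary key and that column in the generated WHERE clause.

    using System.Data.Linq;         // Binary
    using System.Data.Linq.Mapping; // Table, Column attributes

    [Table(Name = "dbo.Products")]  // hypothetical table, for illustration only
    public class Product
    {
        [Column(IsPrimaryKey = true)]
        public int Id { get; set; }

        [Column]
        public string Name { get; set; }

        // A SQL Server rowversion/timestamp column marked IsVersion = true acts as
        // the concurrency token: updates are checked against the key plus this column
        // instead of every column in the row.
        [Column(IsVersion = true, IsDbGenerated = true, CanBeNull = false)]
        public Binary RowVersion { get; set; }
    }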
To answer #2, in the dbml designer, set the property "Update Check" equal to "Never" on the column level for each column in the table to avoid the generation of massive where clauses.
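For hand-written mappings, the equivalent of that designer setting looks roughly like this (the class and column names are placeholders): with UpdateCheck.Never on every non-key column, only the primary key ends up in the WHERE clause.

    using System.Data.Linq.Mapping;

    [Table(Name = "dbo.Products")]  // placeholder table name
    public class Product
    {
        [Column(IsPrimaryKey = true)]
        public int Id { get; set; }

        // UpdateCheck.Never excludes the column from the concurrency check,
        // i.e. it is never added to the UPDATE statement's WHERE clause.
        [Column(UpdateCheck = UpdateCheck.Never)]
        public string Name { get; set; }

        [Column(UpdateCheck = UpdateCheck.Never)]
        public decimal Price { get; set; }
    }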
I have a web application where users register by clicking a "Join" button. There can be a lot of users on the website, and to keep my database queries fast I chose not to add a foreign key constraint in the database (though it is a relational database).
Now, when a user with the same userId opens the application in two different browsers and hits the "Join" button at exactly the same time, two rows for the same user are added to the database, which is wrong.
The ideas I have to stop this are:
Do the check/insertion logic in a stored procedure, inside a transaction with the SERIALIZABLE isolation level; but with this approach the table will be locked even when two different users hit the "Join" button at the same time.
Use the lock keyword in C# and perform the check/insertion logic inside it, but I believe that if the same user comes from two browsers, each request could acquire its own lock and still produce two entries in the database. It might also create a problem for different users, as their code would be waiting for the first one to free the resource.
Use optimistic concurrency, which is supported out of the box by Entity Framework, but I am not sure whether it will solve my problem.
Could you please help me with this?
You can easily solve your problem by creating a unique index on the user name. Only the first insert will be saved; the next one will be reported as an error, because it would break the unique index.
In fact, it should be the primary key.
According to your comments, your table is huge. So it must be much worse to search for a row in the whole table without an index on every insert operation than to update an index on each insert/delete/update operation. You should take this into account.
Anyway, the only way to avoid inserting a value that already exists is to check for it, one way or another.
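As a sketch of what this looks like from C# (the Members table, its columns and the ADO.NET plumbing here are assumptions, not from the question): with the unique index in place, the second concurrent insert fails with SQL Server's duplicate-key error (2627 for a unique constraint, 2601 for a unique index), which you can catch and treat as "already joined".

    using System.Data.SqlClient;

    public static class JoinService
    {
        // Assumes something like: CREATE TABLE dbo.Members (UserId INT NOT NULL, JoinedAt DATETIME2 NOT NULL);
        // with                    CREATE UNIQUE INDEX UX_Members_UserId ON dbo.Members (UserId);
        public static bool TryJoin(string connectionString, int userId)
        {
            const string sql =
                "INSERT INTO dbo.Members (UserId, JoinedAt) VALUES (@UserId, SYSUTCDATETIME());";

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(sql, connection))
            {
                command.Parameters.AddWithValue("@UserId", userId);
                connection.Open();
                try
                {
                    command.ExecuteNonQuery();
                    return true;    // first click wins
                }
                catch (SqlException ex) when (ex.Number == 2627 || ex.Number == 2601)
                {
                    return false;   // duplicate key: this user has already joined
                }
            }
        }
    }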
Optimistic concurrency has nothing to do with that. Optimistic concurrency is about reading data, modifying it, and saving the changes, without locking the table. What optimistic concurrency does can be explained in these steps:
read the original row from the DB, without any locks or transactions
the app modifies the original row
when the app tries to save the changes, it checks whether the row in the DB is exactly as it was when it was read in step 1. If it is, the changes are saved. If it isn't, a concurrency exception is thrown.
So optimistic concurrency will not help you.
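For reference only, this is roughly what those three steps look like with Entity Framework's concurrency token (a sketch assuming EF 6 and an invented Member entity); note that it only detects conflicting changes to an existing row, which is exactly why it does not prevent a second INSERT.

    using System.ComponentModel.DataAnnotations;
    using System.Data.Entity;
    using System.Data.Entity.Infrastructure;

    public class Member
    {
        public int Id { get; set; }
        public string UserName { get; set; }

        [Timestamp]                  // concurrency token, compared in the UPDATE's WHERE clause
        public byte[] RowVersion { get; set; }
    }

    public class ShopContext : DbContext
    {
        public DbSet<Member> Members { get; set; }
    }

    public static class ConcurrencyDemo
    {
        public static void Rename(int memberId, string newName)
        {
            using (var context = new ShopContext())
            {
                var member = context.Members.Find(memberId);   // step 1: read, no locks held
                if (member == null) return;

                member.UserName = newName;                     // step 2: modify in memory
                try
                {
                    context.SaveChanges();                     // step 3: save only if the row is unchanged
                }
                catch (DbUpdateConcurrencyException)
                {
                    // the row was modified (or deleted) by someone else after step 1
                }
            }
        }
    }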
I insist on using a unique index, which is the safest, simplest, and probably most performant solution.
I would use Entity Framework and its optimistic concurrency.
It will wrap the operation in a transaction and handle these problems for you. Remember to place both an identity and a primary key on the table. If the username has to be unique, then add a unique annotation (constraint) on the table as well.
I'm using SQL Server and I have a specific table that can contain ~1 million to ~10 million records max.
For each record I retrieve I do some checks (I run a few simple lines of code), and then I want to mark the record as checked at DateTime.Now.
So what I do is retrieve a record, check some stuff, run an 'update' query to set the 'last_checked_time' field to DateTime.Now, and then move on to the next record.
I can then get all the records ordered by their 'last_checked_time' field (ascending) and iterate over them in order of their check time.
Is this good practice? Will it remain speedy as long as I have no more than 10 million records in that table?
I've read somewhere that every 'update' query is actually a deletion and a creation of a new record.
I'd also like to mention that these records will be frequently retrieved by my ASP.NET website.
I was thinking of writing the 'last_checked_time' to a local text/binary file, but I'm guessing that would mean implementing something the database can already do for me.
If you need that "last checked time" value then the best, most efficient, place to hold it is on the row in the table. It doesn't matter how many rows there are in the table, each update will affect just the row(s) you updated.
How an update is implemented is up to the DBMS, but it is not generally done by deleting and re-inserting the row.
I would recommend retrieving your data or a portion of the data, doing your checks on all of them and sending the updates back in transactions to let the database operate more effectively. This would provide for fewer round trips.
As to whether this is good practice, I would say yes, especially since you are using it in your queries. Definitely do not store the last checked time in a file and try to match it up after you load your database data. The RDBMS is designed to handle this for you efficiently. Don't reinvent the wheel using cubes.
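A sketch of the batching recommendation above (the dbo.Items table and its columns are assumptions): mark a whole batch of already-checked ids with one parameterized statement inside a transaction, rather than one round trip per row. Keep batches comfortably below SQL Server's limit of roughly 2,100 parameters per command.

    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Linq;

    public static class CheckTimeUpdater
    {
        // dbo.Items(Id INT PRIMARY KEY, ..., last_checked_time DATETIME2 NULL) is a placeholder schema.
        public static void MarkChecked(string connectionString, IReadOnlyCollection<int> checkedIds)
        {
            if (checkedIds.Count == 0) return;

            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();
                using (var transaction = connection.BeginTransaction())
                {
                    // One parameterized statement per batch instead of one round trip per row.
                    var names = checkedIds.Select((_, i) => "@p" + i).ToArray();
                    var sql = "UPDATE dbo.Items SET last_checked_time = SYSDATETIME() " +
                              "WHERE Id IN (" + string.Join(", ", names) + ");";

                    using (var command = new SqlCommand(sql, connection, transaction))
                    {
                        var i = 0;
                        foreach (var id in checkedIds)
                            command.Parameters.AddWithValue(names[i++], id);
                        command.ExecuteNonQuery();
                    }

                    transaction.Commit();
                }
            }
        }
    }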
Personally, I see no issues with it. It seems perfectly reasonable to store the last checked time in the database, especially since it might be used in queries (for example, to find records that haven't been checked in over a week).
Maybe (just maybe) you could create a new table containing two columns: the id of the row in the first table and the checked date.
That way you wouldn't alter the original table, but depending on how you use the data and the check date you might be forced to write a join query, which is maybe something you also don't want to do.
It makes sense to store the 'checked time' as part of the row you're updating, rather than in a separate file or even a separate table in the database. This approach should provide optimal performance and help to maintain consistency. Solutions involving more than one table or external data stores may introduce a requirement for distributed or multi-table transactional updates that involve significant locking, which can negatively impact performance and make it much more difficult to guarantee consistency.
In general, solutions that minimize the scope of transactions and, by extension, locking, are worth striving for. Also, simplicity itself is a useful goal.
Using only Microsoft-based technologies (MS SQL Server, C#, EAB, etc.), if you needed to keep track of the changes made to a record in a database, which strategy would you use? Triggers, AOP on the DAL, something else? And how would you display the collected data? Is there a pattern for this? Is there a tool or framework that helps implement this kind of solution?
The problem with Change Data Capture is that it isn't flexible enough for real auditing. You can't add the columns you need. Also, it purges the records every three days by default (you can change this, but I don't think you can keep them forever), so you have to have a job that copies the records to a real audit table if you need to keep the data for a long time, which is typical of the need to audit records (we never purge our audit records).
I prefer the trigger approach. You have to be careful when you write the triggers to ensure that they capture the data when multiple records are changed. We have two tables for each audited table: one to store the datetime and the id of the user or process that took the action, and one to store the old and new data. Since we run a lot of multi-record processes this is critical for us. If someone reports one bad record, we want to be able to see whether it was a process that made the change and, if so, what other records might have been affected as well.
At the time you create the audit process, create the scripts to restore a set of audited data to the old values. It's a lot easier to do this when under the gun to fix things, if you already have this set up.
SQL Server 2008 R2 has this built in - look up Change Data Capture in Books Online.
This is probably not a popular opinion, but I'm going to throw it out there anyhow.
I prefer stored procedures for all database writes. If auditing is required, it's right there in the stored procedure. There's no magic happening outside the code, everything that happens is documented right at the point where writes occur.
If, in the future, a table needs to change, one has to go to the stored procedure to make the change. The need to update the audit is documented right there. And because we used a stored procedure, it's simpler to "version" both the table and its audit table.
I've got a SQL database with various tables that store info about a product (it's for an online shop), and I'm coding in C#. There are options associated with a given product and, as mentioned, the info recorded about these options is spread across a few tables when saved.
Now when I come to edit this product in the CMS I see a list of the existing product options and I can add to that list or delete from it, as you'd expect.
When I save the product I need to check whether each record already exists and, if so, update it; if not, save a new record. I'm trying to find an efficient way of doing this. It's very important that I maintain the IDs associated with the product options, so clearing them all out each time and re-saving them isn't viable, unfortunately.
To describe it again, possibly more clearly: imagine I have a collection of options when I load the product; this is loaded into memory and added to or deleted from depending on what the user chooses. When they click 'Save' I need to check which options are updates and which are new to the list.
Any suggestions of an efficient way of doing this?
Thanks.
If the efficiency you are looking to achieve is in relation to the number of round trips to the database then you could write a stored procedure to do the update or insert for you.
In most cases, however, it's not really necessary to avoid the SELECT first; provided you have appropriate primary keys or unique indexes on your tables, this should be very quick.
If the efficiency is in terms of elegant or reduced code on the server side then I would look at using some sort of ORM, for example Entity Framework 4.0. With a proper ORM architecture you can almost stop thinking in terms of the database records and INSERT/UPDATE and just work with a collection of objects in memory.
I usually do this by performing the following:
For each item, execute an update query that will update the item if it exists.
After each update, check how many rows were updated (using @@ROWCOUNT in SQL Server). If zero rows were updated, execute an insert to create the row (see the sketch below).
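A minimal C# sketch of that first approach (dbo.ProductOptions and its columns are made up, and the key is assumed to be supplied by the client rather than an IDENTITY): the return value of ExecuteNonQuery plays the role of @@ROWCOUNT.

    using System.Data.SqlClient;

    public static class ProductOptionStore
    {
        // dbo.ProductOptions(Id INT PRIMARY KEY, ProductId INT, Name NVARCHAR(100)) is a made-up schema;
        // the connection is assumed to be open already.
        public static void Save(SqlConnection connection, int id, int productId, string name)
        {
            const string updateSql =
                "UPDATE dbo.ProductOptions SET Name = @Name WHERE Id = @Id AND ProductId = @ProductId;";
            const string insertSql =
                "INSERT INTO dbo.ProductOptions (Id, ProductId, Name) VALUES (@Id, @ProductId, @Name);";

            using (var update = new SqlCommand(updateSql, connection))
            {
                update.Parameters.AddWithValue("@Id", id);
                update.Parameters.AddWithValue("@ProductId", productId);
                update.Parameters.AddWithValue("@Name", name);

                // ExecuteNonQuery returns the number of rows affected - the @@ROWCOUNT check.
                if (update.ExecuteNonQuery() > 0)
                    return;                         // an existing row was updated
            }

            using (var insert = new SqlCommand(insertSql, connection))
            {
                insert.Parameters.AddWithValue("@Id", id);
                insert.Parameters.AddWithValue("@ProductId", productId);
                insert.Parameters.AddWithValue("@Name", name);
                insert.ExecuteNonQuery();           // no row existed, so create it
            }
        }
    }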
Alternatively, you can do the opposite, if you create a unique constraint that prevents duplicate rows:
For each item, try to insert it.
If the insert fails because of the constraint (check the error code), perform the update instead.
Run a select query checking for the ID. If it exists then you need to update. If it does not exist then you need to insert.
Without more details I'm not really sure what else to tell you. This is fairly standard.
I need to update about 250k rows on a table and each field to update will have a different value depending on the row itself (not calculated based on the row id or the key but externally).
I tried with a parametrized query but it turns out to be slow (I still can try with a table-value parameter, SqlDbType.Structured, in SQL Server 2008, but I'd like to have a general way to do it on several databases including MySql, Oracle and Firebird).
Making a huge concat of individual updates is also slow (BUT about 2 times faster than making thousands of individual calls (roundtrips!) using parametrized queries)
What about creating a temp table and running an update joining my table and the tmp one? Will it work faster?
How slow is "slow"?
The main problem with this is that it would create an enormous entry in the database's log file (in case there's a power failure half-way through the update, the database needs to log each action so that it can rollback in the event of failure). This is most likely where the "slowness" is coming from, more than anything else (though obviously with such a large number of rows, there are other ways to make the thing inefficient [e.g. doing one DB roundtrip per update would be unbearably slow], I'm just saying once you eliminate the obvious things, you'll still find it's pretty slow).
There are a few ways you can do it more efficiently. One would be to do the update in chunks, say 1,000 rows at a time. That way, the database writes lots of small log entries rather than one really huge one.
Another way would be to turn off - or turn "down" - the database's logging for the duration of the update. In SQL Server, for example, you can set the recovery model to "simple" or "bulk-logged", which would speed it up considerably (with the caveat that you are more at risk if there's a power failure or something during the update).
Edit Just to expand a little more, probably the most efficient way to actually execute the queries in the first place would be to do a BULK INSERT of all the new rows into a temporary table, and then do a single UPDATE of the existing table from that (or to do the UPDATE in chunks of 1,000 as I said above). Most of my answer was addressing the problem once you've implemented it like that: you'll still find it's pretty slow...
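A sketch of that temp-table approach for SQL Server specifically (dbo.TargetTable and the column names are placeholders; MySQL, Oracle and Firebird have their own bulk-load mechanisms): bulk-load the new values once, then apply them with a single set-based UPDATE.

    using System.Data;
    using System.Data.SqlClient;

    public static class BulkUpdater
    {
        // newValues carries the key plus the new values, e.g. columns Id (int) and Value (nvarchar).
        public static void Run(string connectionString, DataTable newValues)
        {
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();

                // 1. Stage the new values in a session-scoped temp table.
                using (var create = new SqlCommand(
                    "CREATE TABLE #NewValues (Id INT PRIMARY KEY, Value NVARCHAR(200));", connection))
                {
                    create.ExecuteNonQuery();
                }

                using (var bulk = new SqlBulkCopy(connection))
                {
                    bulk.DestinationTableName = "#NewValues";
                    bulk.WriteToServer(newValues);      // one bulk load instead of 250k round trips
                }

                // 2. Apply everything with a single set-based UPDATE.
                using (var update = new SqlCommand(
                    "UPDATE t SET t.Value = s.Value " +
                    "FROM dbo.TargetTable t JOIN #NewValues s ON s.Id = t.Id;", connection))
                {
                    update.CommandTimeout = 0;          // the single big UPDATE may still take a while
                    update.ExecuteNonQuery();
                }
            }
        }
    }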
call a stored procedure if possible
If the columns updated are part of indexes you could
drop these indexes
do the update
re-create the indexes.
If you need these indexes to retrieve the data, well, it doesn't help.
You should use SqlBulkCopy with the KeepIdentity flag set.
As part of a SqlTransaction do a query to SELECT all the records that need updating and then DELETE THEM, returning those selected (and now removed) records. Read them into C# in a single batch. Update the records on the C# side in memory, now that you've narrowed the selection and then SqlBulkCopy those updated records back, keys and all. And don't forget to commit the transaction. It's more work, but it's very fast.
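For reference, the option is SqlBulkCopyOptions.KeepIdentity, which keeps the source identity values rather than letting the destination generate new ones. A small sketch of the write-back step (the table name and the updatedRows DataTable are assumptions):

    using System.Data;
    using System.Data.SqlClient;

    public static class BulkWriter
    {
        // Writes the modified rows back, preserving their original identity values.
        public static void WriteBack(SqlConnection connection, SqlTransaction transaction, DataTable updatedRows)
        {
            using (var bulk = new SqlBulkCopy(connection, SqlBulkCopyOptions.KeepIdentity, transaction))
            {
                bulk.DestinationTableName = "dbo.TargetTable";  // placeholder table name
                bulk.WriteToServer(updatedRows);
            }
        }
    }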
Here's what I would do:
Retrieve the entire table, that is, the columns you need in order to calculate/retrieve/find/produce the changes externally
Calculate/produce those changes
Run a bulk insert to a temporary table, uploading the information you need server-side in order to do the changes. This would require the key information + new values for all the rows you intend to change.
Run SQL on the server to copy new values from the temporary table into the production table.
Pros:
Running the final step server-side is faster than running tons and tons of individual SQL, so you're going to lock the table in question for a shorter time
Bulk insert like this is fast
Cons:
Requires extra space in your database for the temporary table
Produces more log data, logging both the bulk insert and the changes to the production table
Here are things that can make your updates slow:
executing updates one by one through parametrized query
solution: do update in one statement
large transaction creates big log entry
see codeka's answer
updating indexes (the RDBMS updates the index after each row; if you change an indexed column, this can be very costly on a large table)
if you can, drop indices before update and recreate them after
updating a field that has a foreign key constraint - for each updated record the RDBMS will go and look for the appropriate key
if you can, disable foreign key constraints before update and enable them after update
triggers and row level checks
if you can, disable triggers before update and enable them after