Suppose we have the following situation:
We have 2-3 tables in a database with a huge amount of data (say 50-100 million records), and we want to add 2k new records. Before adding them we need to check the DB for duplicates: if any of the 2k records already exist in the DB, we should ignore them. But to find out whether a new record is a duplicate or not, we need info from both tables (for example, we need to do a left join).
The idea of the solution is: one task or thread prepares the data suitable for comparison and pushes it into a queue (in batches, not record by record), so the queue (or ConcurrentQueue) is a global variable. A second thread takes a batch from the queue and looks through it. But there's a problem: memory keeps growing...
How can I free the memory after I've worked through a batch?
P.S. If somebody has another idea for how to optimize this process, please describe it...
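To make the setup concrete, here is a minimal sketch of the producer/consumer layout described above, assuming a .NET BlockingCollection<T> with a bounded capacity so the producer blocks instead of piling batches up in memory (all class and method names are placeholders):

using System.Collections.Generic;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class BatchPipeline
{
    // Bounded queue: holds at most 4 batches, so the producer blocks on Add()
    // instead of letting batches pile up in memory.
    private readonly BlockingCollection<List<Record>> _queue =
        new BlockingCollection<List<Record>>(boundedCapacity: 4);

    public void Run()
    {
        var producer = Task.Run(() =>
        {
            foreach (var batch in LoadComparisonBatches())   // placeholder loader
                _queue.Add(batch);                           // blocks while the queue is full
            _queue.CompleteAdding();
        });

        var consumer = Task.Run(() =>
        {
            foreach (var batch in _queue.GetConsumingEnumerable())
            {
                CheckForDuplicates(batch);                   // placeholder comparison step
                // the batch goes out of scope here, so the GC can reclaim it
            }
        });

        Task.WaitAll(producer, consumer);
    }

    private IEnumerable<List<Record>> LoadComparisonBatches() { yield break; }  // placeholder
    private void CheckForDuplicates(List<Record> batch) { }                     // placeholder
    private class Record { }                                                    // placeholder
}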
This is not a specific answer to the question you are asking, because what you are asking doesn't really make sense to me.
If you are looking to update specific rows:
INSERT INTO tablename (UniqueKey, columnname1, columnname2, ...)
VALUES (UniqueKeyValue, value1, value2, ...)
ON DUPLICATE KEY UPDATE columnname1 = value1, columnname2 = value2, ...
If not, simply remove the ON DUPLICATE KEY UPDATE clause (or, in MySQL, use INSERT IGNORE to skip duplicates silently).
This would be darn fast, considering it would use the unique index of whatever field you want to be unique and just do an insert or update. No need to validate against a separate table or anything.
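A rough sketch of how this could be driven from C#, assuming a MySQL backend (ON DUPLICATE KEY UPDATE is MySQL syntax) and placeholder table, column, and DTO names:

using System.Collections.Generic;
using MySql.Data.MySqlClient;

class NewRecord { public int UniqueKey; public string Column1; public string Column2; }  // placeholder DTO

static class Upserter
{
    // Hypothetical statement: table and column names are placeholders.
    const string Sql =
        "INSERT INTO tablename (UniqueKey, columnname1, columnname2) " +
        "VALUES (@key, @col1, @col2) " +
        "ON DUPLICATE KEY UPDATE columnname1 = @col1, columnname2 = @col2";

    public static void Upsert(IEnumerable<NewRecord> records, string connectionString)
    {
        using (var conn = new MySqlConnection(connectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            {
                foreach (var r in records)
                using (var cmd = new MySqlCommand(Sql, conn, tx))
                {
                    cmd.Parameters.AddWithValue("@key", r.UniqueKey);
                    cmd.Parameters.AddWithValue("@col1", r.Column1);
                    cmd.Parameters.AddWithValue("@col2", r.Column2);
                    cmd.ExecuteNonQuery();
                }
                tx.Commit();  // one transaction for the whole 2k-row batch
            }
        }
    }
}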
Related
This may be a dumb question, but I wanted to be sure. I am creating a WinForms app, using a C# OleDbConnection to connect to an MS Access database. Right now I am using "SELECT * FROM table_name" and looping through each row to see if it is the row with the criteria I want, then breaking out of the loop if it is. I wonder if performance would improve if I used something like "SELECT * FROM table_name WHERE id=something", so basically used a WHERE clause instead of looping through every row?
The best way to validate the performance of anything is to test it. Otherwise, a lot of assumptions are made about what is best versus how things actually perform.
With that said, using a WHERE clause will be better 100% of the time than retrieving the data and then filtering via a loop. This is for a few different reasons, but ultimately you are letting the server filter the rows before any data is retrieved, versus retrieving every row and then filtering on the client. Relational data should be dealt with using set logic, which is how a WHERE clause works: it operates on the data set. The loop is not set logic; it compares each individual row, expensively, and discards those that don't meet the criteria.
Don’t take my word for it though. Try it out. Especially try it out when your app has a lot of data in the table.
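For example, a hedged sketch of the WHERE version through OleDb (connection string, table, and column names are placeholders; OleDb parameters are positional):

using System.Data.OleDb;

// Sketch: let Access filter the row instead of looping client-side.
string accessConnectionString = "...";   // placeholder Access/OleDb connection string
int someId = 42;                         // placeholder criteria value

using (var conn = new OleDbConnection(accessConnectionString))
using (var cmd = new OleDbCommand("SELECT * FROM table_name WHERE id = ?", conn))
{
    cmd.Parameters.AddWithValue("?", someId);   // OleDb parameters are positional
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        if (reader.Read())
        {
            // only the matching row comes back; no client-side loop needed
        }
    }
}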
Yes, of course.
Say you have an Access database file shared in a folder, and you deploy your .NET desktop application to each workstation.
And furthermore, say the table has 1 million rows.
If you do this:
SELECT * from tblInvoice WHERE InvoiceNumber = 123245
Then ONLY one row is pulled down the network pipe, and this holds true EVEN if the table has 1 million rows. Traversing and pulling 1 million rows is going to take a HUGE amount of time, but if you add criteria to your select, then in this case it would be about 1 million times faster to pull one row as opposed to the whole table.
And say this is/was multi-user? Then again, even over a network, ONLY ONE record that meets your criteria will be pulled. The only requirement for this "one row pull" over the network? The Access data engine needs a usable index on that criteria. By default the PK column (ID) always has that index, so no worries there. But if, as per above, we are pulling invoice numbers from a table, then having an index on that column (InvoiceNumber) is required for the data engine to pull only one row. If no index can be used, then behind the scenes all rows are pulled until a match occurs, which over a network means significant amounts of data will be pulled without that index (or, if local, pulled from the file on disk).
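If that index doesn't exist yet, it can be created once, either in Access itself or from code; a sketch assuming the tblInvoice/InvoiceNumber names above:

using System.Data.OleDb;

// One-time sketch: give the data engine an index it can use for the lookup above.
string accessConnectionString = "...";   // placeholder connection string

using (var conn = new OleDbConnection(accessConnectionString))
using (var cmd = new OleDbCommand(
    "CREATE INDEX idxInvoiceNumber ON tblInvoice (InvoiceNumber)", conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();   // after this, the WHERE lookup pulls a single row via the index
}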
I want to emphasize that I'm looking for ideas, not necessarily a concrete answer since it's difficult to show what my queries look like, but I don't believe that's needed.
The process looks like this:
Table A keeps filling up, like a bucket - an SQL job keeps calling SP_Proc1 every minute or less and it inserts multiple records into table A.
At the same time a C# process keeps calling another procedure SP_Proc2 every minute or less that does an ordered TOP 5 select from table A and returns the results to the C# method. After C# code finishes processing the results it deletes the selected 5 records from table A.
The problematic part is this: it is necessary that the records from table A be processed 5 at a time in the order specified, but a few times a month SP_Proc2 selects the ordered TOP 5 records in the wrong order, even though all the records are present in table A and have correct values in the columns used for ordering.
Something to note:
I'm ordering by integers, not varchar.
The C# part is using 1 thread.
Both SP_Proc1 and SP_Proc2 use a transaction and use the READ COMMITTED or READ COMMITTED SNAPSHOT transaction isolation level.
One column that is used for ordering is a computed value, but a very simple one. It just checks if another column in table A is not null and sets the computed column to either 1 or 0.
There's a unique nonclustered index on primary key Id and a clustered index composed of the same columns used for ordering in SP_Proc2.
I'm using SQL Server 2012 (v11.0.3000)
I'm beginning to think that this might be a SQL Server bug, or that maybe the records or index in table A get corrupted and then deleted by the C# process, and that's why I can't catch it.
Edit:
To clarify: SP_Proc1 commits a big batch of N records to table A at once, and SP_Proc2 pulls records from table A in batches of 5; it orders the records in the table and selects the TOP 5. Sometimes the wrong batch is selected: the batch itself is ordered correctly, but a different batch should have been selected according to the ORDER BY. I believe Rob Farley might have the right idea.
My guess is that your “out of order TOP 5” is ordered, but that a later five overlaps. Like, one time you get 1231, 1232, 1233, 1234, and 1236, and the next batch is 1235, 1237, and so on.
This can be an issue with locking and blocking. You've indicated your processes use transactions, so it wouldn't surprise me if your 1235 hasn't been committed yet and is simply skipped by your snapshot isolation, while your 1236 gets picked up.
It doesn’t sound like there’s a bug here. What I’m describing above is a definite feature of snapshot isolation. If you must have 1235 picked up in an earlier batch than 1236, then don’t use snapshot isolation, and force your table to be locked until each block of inserts is finished.
An alternative suggestion would be to use a table lock (TABLOCK) in both the reading and the writing procedures.
Though this is expensive, if you want absolute consistency then this may be the way to go.
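A hedged sketch of what the reading side could look like with such a lock (table, column, and connection names are placeholders; the same hint would go into the insert procedure, and the concurrency cost is real):

using System.Data.SqlClient;

// Sketch: read the next ordered batch of 5 under an exclusive table lock, so a
// half-committed insert batch can never be interleaved. Names are placeholders.
string connectionString = "...";         // placeholder SQL Server connection string
const string readBatchSql = @"
    SELECT TOP (5) Id, Payload
    FROM dbo.TableA WITH (TABLOCKX, HOLDLOCK)   -- hold the table lock until commit
    ORDER BY ComputedFlag, SomeIntColumn;";

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var tx = conn.BeginTransaction())
    using (var cmd = new SqlCommand(readBatchSql, conn, tx))
    {
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // process the row; the DELETE of these 5 rows belongs in this same transaction
            }
        }
        tx.Commit();   // releases the table lock; the insert procedure would take the same lock
    }
}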
We are creating a client-server application using WPF/C# with SQL. We generate a unique number by checking the DB to get the last maximum number, incrementing that max value by 1, and storing the result in the DB. When another user is working on the same screen and creating unique numbers at the same time, in some cases the unique numbers get duplicated and an exception is thrown.
We found this is a concurrency issue.
Indeed, fetching a number out, adding one, and hoping it still isn't in use is a thread-race and a race between multiple clients - and should be avoided.
Options:
use an IDENTITY column in the database, and let the database generate the value itself during INSERT; the database server knows how to do this safely and reliably
if that isn't possible, you might want to delay this code until you are ready to INSERT so it is all part of a single database operation - and even then, if it isn't in a "serializable transaction" (with key-range read locks, etc), then you would have to loop on "get the max, increment, try to insert but note that we might have lost a race, so only insert if the value doesn't exist - which it might; repeat from start if unsuccessful"
alternatively, you could create the new record when you first need the number (even though the rest of the data isn't available), noting that you might still need the "loop until successful" approach
Frankly, the IDENTITY column approach is the simplest.
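A minimal sketch of the IDENTITY approach, assuming SQL Server and placeholder table/column names, with the generated number read back via OUTPUT:

using System.Data.SqlClient;

// Sketch: let the database hand out the number atomically. Assumes the table was
// created with something like: CREATE TABLE Orders (Id INT IDENTITY(1,1) PRIMARY KEY, ...)
string connectionString = "...";         // placeholder connection string
string customerName = "placeholder";     // placeholder data

const string insertSql =
    "INSERT INTO Orders (CustomerName) OUTPUT INSERTED.Id VALUES (@name)";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(insertSql, conn))
{
    cmd.Parameters.AddWithValue("@name", customerName);
    conn.Open();
    int newNumber = (int)cmd.ExecuteScalar();   // safe under any number of concurrent clients
}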
Finally, we followed the Singleton pattern with a lock to resolve this issue.
Thanks.
I'm implementing DynamoDB in our project. We have to put large data strings into the database, so we are splitting the data into small pieces and inserting multiple rows with only one attribute value changed: the part of the string. One column (the range key) contains the number of the part. Inserting and selecting data works perfectly fine for small and large strings. The problem is deleting an item. I read that when you want to delete an item you need to specify its primary key (hash key, or hash key and range key, depending on the table). But what if I want to delete items that have a particular value for one of the attributes? Do I need to scan (scan, not query) the entire table and run a delete or batch delete for each row? Or is there some other solution that doesn't require two operations? What I'm trying to do is avoid scanning the entire table. I think we will have about 100-1000 million rows in such a table, so scanning will be very slow.
Thanks for help.
There is no way to delete an arbitrary item in DynamoDB. You indeed need to know the hash_key and the range_key.
If a query does not fit your needs here (i.e. you do not even know the hash_key), then you're stuck.
The best option would be to re-think your data modeling: build a custom index or do a 'lazy delete'.
To achieve a 'lazy delete', use a table as a queue of elements to delete. Periodically, run an EMR job over it to do all the deletes in a batch with a single scan operation. It's really not the best solution, but it's the only way I can think of to avoid re-modeling.
TL;DR: There is no real way, only workarounds. I highly recommend that you re-model at least part of your data.
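To illustrate the "you need both keys" point: a sketch with the AWS SDK for .NET that deletes every part of one large string when its hash key is known, by querying the range keys first (table, key, and attribute names are assumptions; pagination is omitted):

using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

static class PartCleaner
{
    // Sketch: delete all parts of one large string, given its hash key "DocId".
    public static async Task DeleteAllPartsAsync(IAmazonDynamoDB client, string docId)
    {
        var query = await client.QueryAsync(new QueryRequest
        {
            TableName = "StringParts",                                 // placeholder table name
            KeyConditionExpression = "DocId = :id",
            ExpressionAttributeValues = new Dictionary<string, AttributeValue>
            {
                [":id"] = new AttributeValue { S = docId }
            },
            ProjectionExpression = "DocId, PartNumber"                 // only fetch the keys
        });

        foreach (var item in query.Items)
        {
            await client.DeleteItemAsync(new DeleteItemRequest
            {
                TableName = "StringParts",
                Key = new Dictionary<string, AttributeValue>
                {
                    ["DocId"] = item["DocId"],
                    ["PartNumber"] = item["PartNumber"]                // the range key is required too
                }
            });
        }
    }
}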
Before inserting a new value into a table, I need to change one field in all rows of that table.
What is the best way to do this? In C# code, or with a trigger? If C#, can you show me the code?
UPD
*NEW VERSION of the question*
Before inserting a new value into a table, I need to change one field in all rows of that table that have a specific ID (it is an FK to another table).
What is the best way to do this? In C# code, or with a trigger? If C#, can you show me the code?
You should probably consider changing your design; this doesn't sound like it will scale well. I would probably do it with a trigger if it is always required, but if not, I'd use ExecuteCommand.
var ctx = new MyDataContext();                         // LINQ to SQL data context
ctx.ExecuteCommand("UPDATE myTable SET foo = 'bar'");  // set-based update of every row in one round trip
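For the updated question (only rows with a specific FK), the same approach works with parameters; column and variable names are placeholders:

// Parameterized variant for the updated question: only touch rows with a given FK.
// "parentId" and "newValue" are placeholders; ExecuteCommand passes {0}/{1} as real
// SQL parameters rather than concatenating strings.
ctx.ExecuteCommand("UPDATE myTable SET foo = {0} WHERE parentId = {1}", newValue, parentId);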
Looking at your comment on Paul's answer, I feel like I should chime in here. We have a few tables where we need to keep a history of each entry in that table. We implement this by creating a separate table for each. For example, we may have a Comment table, and then a CommentArchive table with a foreign key reference to the CommentId in the Comment table.
A trigger on the Comment table ensures that each time certain fields in the Comment table are updated, the "old" version (which is accessible via the deleted table in the trigger) gets pushed to the CommentArchive table. Obviously, this means several CommentArchive entries may exist for each Comment, but if you're only looking for the "active" comments, you just look in the Comment table. And if you need information about the history of a comment, you can easily use LINQ to SQL to jump from the Comment you're interested in to the CommentArchives that reference it.
Because the triggers we use in the above example only insert a single row into the archive table for each update, they run very quickly and we get good performance. We had issues recently when I tried making the triggers more complex and we started getting deadlocks with as few as 15 concurrent transactions. So the lesson is: keep these triggers simple, and make them touch as few rows in as few tables as possible.
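For illustration, a hedged sketch of such a trigger, created once as DDL (only Comment, CommentArchive, and CommentId come from the description above; the other column names are placeholders):

using (var ctx = new MyDataContext())   // placeholder LINQ to SQL DataContext, as in the answer above
{
    ctx.ExecuteCommand(@"
        CREATE TRIGGER trg_Comment_Archive ON Comment
        AFTER UPDATE
        AS
        BEGIN
            SET NOCOUNT ON;
            -- 'deleted' holds the pre-update version of every touched row
            INSERT INTO CommentArchive (CommentId, Body, ArchivedAt)
            SELECT d.CommentId, d.Body, GETDATE()
            FROM deleted AS d;
        END");
}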