I have a large amount of data stored in a SQL database that is constantly being updated. I need to find the best way to update the Solr index, noting that I have many related tables, for example Product, ProductDetails, ProductStocks, etc.
There seem to be two solutions for this:
1) DIH deltaQueryImport - query the database for all records with LastUpdated greater than last_index_time and import those records into Solr. The DIH is scheduled every 30 minutes, so data changed between runs is not yet reflected, and too much time is spent on the queries due to the number of updated records.
2) Task Queue - every time a product is updated in the database, we queue a task to index that record to Solr.
I just want to know your recommendation and the pros and cons of each approach.
I worked on a project with a similar scenario. We decided to implement your 2nd solution.
A push solution is preferable to a pull solution. With push, you can achieve near-real-time updates, which is usually a big plus for the business (a sketch of the push approach follows the list below).
But with this solution, you need to consider the following:
Batch initial load.
Size of queue if you have a batch update.
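For illustration, here is a minimal sketch of the push approach under some assumptions: the queue is an in-process BlockingCollection standing in for whatever queue you actually use (MSMQ, RabbitMQ, a database table, ...), and SendToSolr is a placeholder for the real Solr client call.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical sketch of the push approach: product updates enqueue an
// indexing task, and a background worker drains the queue and pushes
// the changed documents to Solr.
public class ProductIndexQueue
{
    private readonly BlockingCollection<int> _productIds = new BlockingCollection<int>();

    // Called from the data-access layer whenever a product (or any of its
    // related rows: details, stocks, ...) is written to the database.
    public void Enqueue(int productId) => _productIds.Add(productId);

    // Long-running consumer; in a real system this could be a Windows
    // service or a dedicated queue consumer process.
    public Task StartConsumer() => Task.Run(() =>
    {
        foreach (int productId in _productIds.GetConsumingEnumerable())
        {
            // Re-read the product with its joins and post it to Solr.
            // SendToSolr is a placeholder for your Solr client call
            // (e.g. an HTTP POST to /update followed by a commit).
            SendToSolr(productId);
        }
    });

    private void SendToSolr(int productId)
    {
        Console.WriteLine($"Indexing product {productId}");
    }
}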
My table storage has approximately 1-2 million records, and I have a daily job that needs to retrieve all the records that do not have a property A and do some further processing.
It is expected that there are about 1 - 1.5 million records without property A. I understand there are two approaches.
Query all records then filter results after
Do a table scan
Currently, we use the approach where we query all records and filter in C#. However, the task runs in an Azure Function App, and the query to retrieve all the results sometimes takes over 10 minutes, which is the limit for Azure Functions.
I'm trying to understand why retrieving 1 million records takes so long and how to optimise the query. In the existing design of the table, the partition key and row key are identical and are a GUID - this leads me to believe that there is one entity per partition.
Looking at Microsoft docs, here are some key Table Storage limits (https://learn.microsoft.com/en-us/azure/storage/common/storage-scalability-targets#azure-table-storage-scale-targets):
Maximum request rate per storage account: 20,000 transactions per second, which assumes a 1-KiB entity size
Target throughput for a single table partition (1 KiB-entities): Up to 2,000 entities per second.
My initial guess is that I should use another partition key to group 2,000 entities per partition to achieve the target throughput of 2,000 per second per partition. Would this mean that 2,000,000 records could in theory be returned in 1 second?
Any thoughts or advice appreciated.
I found this question after blogging on the very topic. I have a project where I am using the Azure Functions Consumption plan and have a big Azure Storage Table (3.5 million records).
Here's my blog post:
https://www.joelverhagen.com/blog/2020/12/distributed-scan-of-azure-tables
I have mentioned a couple of options in this blog post but I think the fastest is distributing the "table scan" work into smaller work items that can be easily completed in the 10-minute limit. I have an implementation linked in the blog post if you want to try it out. It will likely take some adapting to your Azure Function but most of the clever part (finding the partition key ranges) is implemented and tested.
This looks to be essentially what user3603467 is suggesting in his answer.
I see two approaches to retrieving 1 million+ records in a batch process where the result must be saved to a single medium, like a file.
First) You identify/select all the primary ids/keys of the related data. Then you spawn parallel jobs with chunks of these ids/keys, where each job reads the actual data and processes it. Each job then reports its result to the single medium.
Second) You identify/select (for update) the top n rows of related data and mark them with an "in progress" state. Use concurrency locking here; that should prevent others from picking that data up if this is done in parallel.
I would go for the first solution if possible, since it is the simplest and cleanest. The second solution works best if you can use "select for update"; I don't know whether that is supported on Azure Table Storage.
You'll need to parallelise the task. As you don't know the partition keys, run separate queries over PK ranges, one per leading character. Write a query where PK > 'A' && PK < 'B', another where PK > 'B' && PK < 'C', and so on, then join the results in memory. This is super easy to do in a single function; in JS just use Promise.all([]).
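That answer is written for JS; purely as an illustration, here is a rough C# equivalent, assuming the Azure.Data.Tables SDK and lowercase GUID partition keys (so the split is over the hex characters 0-f rather than the whole alphabet). All names are illustrative.

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Data.Tables;

public static class ParallelTableScan
{
    // Splits the scan into one query per leading hex character of the
    // PartitionKey (the keys in the question are GUIDs, assumed lowercase)
    // and runs the 16 range queries concurrently.
    public static async Task<List<TableEntity>> ScanAsync(string connectionString, string tableName)
    {
        var client = new TableClient(connectionString, tableName);
        const string chars = "0123456789abcdefg"; // 'g' closes the last range

        var tasks = Enumerable.Range(0, 16).Select(async i =>
        {
            string filter = $"PartitionKey ge '{chars[i]}' and PartitionKey lt '{chars[i + 1]}'";
            var results = new List<TableEntity>();
            await foreach (TableEntity entity in client.QueryAsync<TableEntity>(filter))
            {
                results.Add(entity);
            }
            return results;
        }).ToList();

        var partitions = await Task.WhenAll(tasks);
        return partitions.SelectMany(p => p).ToList();
    }
}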
I have roughly 30M rows per day to insert/update in SQL Server; what are my options?
If I use SqlBulkCopy, does it handle not inserting data that already exists?
In my scenario I need to be able to run this over and over with the same data without duplicating data.
At the moment I have a stored procedure with an update statement and an insert statement which read data from a DataTable.
What should I be looking for to get better performance?
The usual way to do something like this is to maintain a permanent work table (or tables) that have no constraints on them. Often these might live in a separate work database on the same server.
To load the data, you empty the work tables and blast the data in via BCP/bulk copy. Once the data is loaded, you do whatever cleanup and/or transforms are necessary to prep it. Once that's done, as a final step, you migrate the data to the real tables by performing the update/delete/insert operations necessary to implement the delta between the old data and the new, or by simply truncating the real tables and reloading them.
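A minimal sketch of the load step, assuming ADO.NET's SqlBulkCopy; work.ProductStaging and the batch size are placeholders.

using System.Data;
using System.Data.SqlClient;

public static class StagingLoader
{
    // Empties the work table and bulk-loads the new data into it.
    // "work.ProductStaging" is a placeholder name for the unconstrained
    // work table described above.
    public static void Load(string connectionString, DataTable rows)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            using (var truncate = new SqlCommand("TRUNCATE TABLE work.ProductStaging", connection))
            {
                truncate.ExecuteNonQuery();
            }

            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "work.ProductStaging";
                bulk.BatchSize = 10000;   // commit in chunks rather than one huge batch
                bulk.WriteToServer(rows);
            }
        }
    }
}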
Another option, if you've got something resembling a steady stream of data flowing in, might be to set up a daemon to monitor for the arrival of data and then do the inserts. For instance, if your data arrives as flat files dropped into a directory via FTP or the like, the daemon can monitor the directory for changes and do the necessary work (as above) when they arrive.
One thing to consider, if this is a production system, is that doing massive insert/delete/update statements is likely to cause blocking while the transaction is in-flight. Also, a gigantic transaction failing and rolling back has its own disadvantages:
The rollback can take quite a while to process.
Locks are held for the duration of the rollback, so more opportunity for blocking and other contention in the database.
Worst of all, after all that happens you've achieved no forward motion, so to speak: a lot of time and effort, and you're right back where you started.
So, depending on your circumstances, you might be better off doing your inserts/updates/deletes in smaller batches so as to guarantee forward progress. 30 million rows over 24 hours works out to roughly 350 per second.
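One common way to keep each transaction small is to loop over a TOP (n) statement until no rows are affected; a sketch follows (the table, column, and batch size are placeholders).

using System.Data.SqlClient;

public static class BatchedWriter
{
    // Applies an update in batches of 5,000 rows so locks are held only
    // briefly and a failure loses at most one small batch.
    // The SQL below is a placeholder; the WHERE clause must exclude rows
    // that have already been processed or the loop will not terminate.
    public static void UpdateInBatches(string connectionString)
    {
        const string sql = @"
            UPDATE TOP (5000) dbo.Product
            SET    Processed = 1
            WHERE  Processed = 0;";

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            int affected;
            do
            {
                using (var command = new SqlCommand(sql, connection))
                {
                    command.CommandTimeout = 120;
                    affected = command.ExecuteNonQuery();
                }
            } while (affected > 0);
        }
    }
}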
Bulk insert into a holding table, then perform either a single MERGE statement or an UPDATE and an INSERT statement. Either way, you want to compare your source table to your holding table to see which action to perform.
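A hedged sketch of that idea: bulk insert into the holding table (as in the earlier sketch) and then a single MERGE into the real table. All table and column names are placeholders, and the match key must be adapted to the real schema.

using System.Data.SqlClient;

public static class HoldingTableMerge
{
    // After bulk-inserting into the holding table, a single MERGE applies
    // inserts and updates against the real table in one statement.
    // dbo.Product / dbo.ProductHolding and the column list are placeholders.
    private const string MergeSql = @"
        MERGE dbo.Product AS target
        USING dbo.ProductHolding AS source
              ON target.ProductId = source.ProductId
        WHEN MATCHED AND (target.Name <> source.Name OR target.Price <> source.Price) THEN
            UPDATE SET target.Name = source.Name,
                       target.Price = source.Price
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (ProductId, Name, Price)
            VALUES (source.ProductId, source.Name, source.Price);";

    public static int Merge(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(MergeSql, connection))
        {
            command.CommandTimeout = 0; // the merge can run for a while on 30M rows
            connection.Open();
            return command.ExecuteNonQuery();
        }
    }
}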
I have various large data modification operations in a project built on c# and Fluent NHibernate.
The DB is sqlite (on disk rather than in memory as I'm interested in performance)
I wanted to check performance of these so I created some tests to feed in large amounts of data and let the processes do their thing. The results from 2 of these processes have got me pretty confused.
The first is a fairly simple case of taking data supplied in an XML file doing some light processing and importing it. The XML contains around 172,000 rows and the process takes a total of around 60 seconds to run with the actual inserts taking around 40 seconds.
In the next process, I do some processing on the same set of data. So I have a DB with approx 172,000 rows in one table. The process then works through this data, doing some heavier processing and generating a whole bunch of DB updates (inserts and updates to the same table).
In total, this results in around 50,000 rows inserted and 80,000 updated.
In this case, the processing takes around 30 seconds, which is fine, but saving the changes to the DB takes over 30 minutes! And it crashes before it finishes with an SQLite 'disk or I/O error'.
So the question is: why are the inserts/updates in the second process so much slower? They are working on the same table of the same database with the same connection. In both cases, IStatelessSession is used and ado.batch_size is set to 1000.
In both cases, the code that does the update looks like this:
BulkDataInsert((IStatelessSession session) =>
{
    foreach (Transaction t in transToInsert) { session.Insert(t); }
    foreach (Transaction t in transToUpdate) { session.Update(t); }
});
(The first process has no 'transToUpdate' line as it only does inserts. Removing the update line and just doing the inserts still takes almost 10 minutes.)
The transTo* variables are Lists of the objects to be updated/inserted.
BulkDataInsert creates the session and handles the DB transaction.
I didn't understand your second process. However, here are some things to consider:
Are there any clustered or non-clustered indexes on the table?
How many disk drives do you have?
How many threads are writing to the DB in the second test?
It seems that you are experiencing IO bottlenecks that can be resolved by having more disks, more threads, indexes, etc.
So, assuming a lot of things, here is what I "think" is happening:
In the first test your table probably has no indexes, and since you are just inserting data, it is a sequential insert in a single thread which can be pretty fast - especially if you are writing to one disk.
Now, in the second test, you are reading data and then updating data. Your SQL instance has to find the record that it needs to update. If you do not have any indexes this "find" action is basically a table scan, which will happen for each one of those 80,000 row updates. This will make your application really really slow.
The simplest thing you could probably do is add a clustered index on the table for a unique key, and the best option is to use the columns that you are using in the where clause to "update" those rows.
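Since the question's database is SQLite (accessed through NHibernate), the index itself would be an ordinary SQLite index rather than a SQL Server clustered index; here is a sketch using System.Data.SQLite, with guessed table and column names.

using System.Data.SQLite;

public static class SqliteIndexing
{
    // One-off helper: index the column used in the UPDATE's WHERE clause so each
    // of the ~80,000 updates becomes an index seek instead of a full table scan.
    // "Transaction" and "ExternalId" are guesses at the real table/column names.
    public static void EnsureIndex(string connectionString)
    {
        using (var connection = new SQLiteConnection(connectionString))
        {
            connection.Open();
            using (var command = new SQLiteCommand(
                "CREATE INDEX IF NOT EXISTS IX_Transaction_ExternalId " +
                "ON [Transaction] (ExternalId);", connection))
            {
                command.ExecuteNonQuery();
            }
        }
    }
}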
Hope this helps.
DISCLAIMER: I made quite a few assumptions
The problem was due to my test setup.
As is pretty common with NHibernate-based projects, I had been using in-memory SQLite databases for unit testing. These work great, but one downside is that if you close the session, it destroys the database.
Consequently, my unit of work implementation contains a 'PreserveSession' property to keep the session alive and just create new transactions when needed.
My new performance tests are using on-disk databases but they still use the common code for setting up test databases and so have PreserveSession set to true.
It seems that having several sessions left open (even though they're not doing anything) starts to cause problems after a while, including the performance drop-off and the disk I/O error.
I re-ran the second test with PreserveSession set to false, and immediately I'm down from over 30 minutes to under 2 minutes, which is more like what I'd expect.
I have a project that uses Entity Framework (v1 with .NET 3.5). It's been in use for a few years, but it's now being used by more people. We started getting timeout errors and have tracked it down to a few things. For simplicity's sake, let's say my database has three tables: product, part, and product_part. There are ~1400 parts and a handful of products.
The user has the ability to add any number of parts to a product. My problem is that when many parts are added to a product, the inserts take a long time. I think it's mostly due to network traffic/delay, but inserting all 1400 takes around a minute. If someone tries to view the details of a part while those records are being inserted, I get a timeout and can see a block in SQL Server's Activity Monitor.
What can I do to avoid this? My apologies if this has been asked before and I missed it.
Thanks,
Nick
I think the root problem is that your write transaction is taking so long. EF is not good at executing mass DML. It executes each insert in a separate network roundtrip and separate statement.
If you want to insert 1400 rows and performance matters, do the insert in one single statement using a TVP (INSERT ... SELECT * FROM @tvp). Or switch to bulk copy, but I don't think that will be advantageous at only 1400 rows.
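A sketch of the TVP approach, assuming SQL Server 2008+ and .NET 3.5 SP1 or later; dbo.ProductPartType is a made-up user-defined table type, and the table/column names are guesses based on the question.

using System.Data;
using System.Data.SqlClient;

public static class ProductPartWriter
{
    // Sends all rows in one round trip via a table-valued parameter and
    // inserts them with a single statement. dbo.ProductPartType is a
    // user-defined table type you would need to create first, e.g.:
    //   CREATE TYPE dbo.ProductPartType AS TABLE (ProductId int, PartId int);
    // The DataTable's columns must match the type's columns.
    public static void InsertParts(string connectionString, DataTable parts)
    {
        const string sql =
            "INSERT INTO dbo.product_part (product_id, part_id) " +
            "SELECT ProductId, PartId FROM @tvp;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            SqlParameter p = command.Parameters.AddWithValue("@tvp", parts);
            p.SqlDbType = SqlDbType.Structured;
            p.TypeName = "dbo.ProductPartType";

            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}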
If your read transactions are getting blocked, and this is a problem, switch on snapshot isolation. That takes care of the readers 100% as they never block under snapshot isolation.
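For completeness, these are one-time database settings; a sketch via ADO.NET, where MyDatabase is a placeholder and the READ_COMMITTED_SNAPSHOT statement requires no other open connections to the database.

using System.Data.SqlClient;

public static class SnapshotSettings
{
    // ALLOW_SNAPSHOT_ISOLATION enables the explicit SNAPSHOT isolation level;
    // READ_COMMITTED_SNAPSHOT makes ordinary READ COMMITTED readers use row
    // versions instead of shared locks, so they stop blocking behind the
    // long-running insert transaction.
    public static void Enable(string connectionString)
    {
        const string sql =
            "ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON; " +
            "ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}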
We are building an application which requires a daily insertion of approximately 1.5 million rows of data per table. We have 16 tables.
We keep track of 3-day historical data including the current day's data.
The application is done using C#; on the server side, we run an exe that fills the data tables during market hours (4.5 hours), and we update the 16 tables every 5 seconds.
On the client side, the application gets user queries which require the most recently inserted data (in the last 5 seconds) and a historical point, which could be today or earlier, and plots them somehow.
We are having some serious performance issues, as one query might take 1 second or more, which is too much. The question is: for today's data that is being inserted at runtime, can we make use of caching instead of going to the database each time we want something from today's data? Would that be more efficient? And if so, how can we do that?
P.S. One day's data is approximately 300 MB, and we have enough RAM.
Keep a copy of the data along with the datetime you used to retrieve the data. The next time, retrieve only the new data, which minimizes the amount of data you send over the wire.
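A minimal sketch of that idea: an in-memory cache that only asks the database for rows newer than the last fetch. The Quote type, the column names, and QueryNewRows are placeholders, not the poster's code.

using System;
using System.Collections.Generic;
using System.Linq;

public class TodayDataCache
{
    private readonly List<Quote> _cache = new List<Quote>();
    private DateTime _lastFetchUtc = DateTime.UtcNow.Date; // start of today

    // Returns today's data, pulling only the rows inserted since the
    // previous call instead of re-reading the whole day every time.
    public IReadOnlyList<Quote> GetTodaysData()
    {
        DateTime now = DateTime.UtcNow;
        // QueryNewRows is a placeholder for the real data access, e.g.
        // "SELECT ... WHERE InsertedAtUtc > @since AND InsertedAtUtc <= @until".
        _cache.AddRange(QueryNewRows(_lastFetchUtc, now));
        _lastFetchUtc = now;
        return _cache;
    }

    private IEnumerable<Quote> QueryNewRows(DateTime sinceUtc, DateTime untilUtc)
    {
        return Enumerable.Empty<Quote>(); // placeholder for the database query
    }
}

public class Quote
{
    public DateTime InsertedAtUtc { get; set; }
    public decimal Value { get; set; }
}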
If all the queries run in the operation amount to only about 1 second, maybe the issue you are seeing is that the UI is freezing. If that is the case, don't do the work on the UI thread.
Update (based on comments): the code you run in the controls' event handlers runs on the UI thread, which is what causes the UI to freeze. There isn't a single way to run it on a separate thread; I suggest BackgroundWorker for this scenario. See the sketch below.
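A minimal BackgroundWorker sketch of the pattern (this is only an illustration, not the community example from the original post; the query and plotting calls are placeholders):

using System;
using System.ComponentModel;
using System.Windows.Forms;

public class ChartForm : Form
{
    private readonly BackgroundWorker _worker = new BackgroundWorker();
    private readonly Button _refreshButton = new Button { Text = "Refresh" };

    public ChartForm()
    {
        Controls.Add(_refreshButton);
        _refreshButton.Click += (s, e) => { if (!_worker.IsBusy) _worker.RunWorkerAsync(); };

        _worker.DoWork += (s, e) =>
        {
            // Runs on a thread-pool thread: do the slow database query here.
            e.Result = QueryLatestData();   // placeholder for the real query
        };
        _worker.RunWorkerCompleted += (s, e) =>
        {
            // Back on the UI thread: safe to touch controls and plot.
            PlotData(e.Result);             // placeholder for the charting code
        };
    }

    private object QueryLatestData() { return null; }
    private void PlotData(object data) { }
}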