I'm using MongoDB in my project to reduce fetch times. While comparing the time taken to fetch data from SQL and NoSQL, SQL takes 50ms for each fetch from the database, but NoSQL takes around 180ms for the first fetch and 15ms for subsequent fetches. How can I reduce the fetching time for that first fetch in NoSQL?
Try creating an index on your collection to make queries run faster:
http://docs.mongodb.org/manual/core/indexes/
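For example, if the slow query filters on one field, a single index on that field usually does it. A minimal sketch assuming a recent 2.x C# driver; the connection string, database, collection, and field names are placeholders:
using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var collection = client.GetDatabase("mydb").GetCollection<BsonDocument>("items");

// Index the field(s) your query filters and sorts on.
var keys = Builders<BsonDocument>.IndexKeys.Ascending("createdAt");
collection.Indexes.CreateOne(new CreateIndexModel<BsonDocument>(keys));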
The first query is probably taking more time because it loads the working set into your RAM. Details about working set can be found here:
http://docs.mongodb.org/manual/faq/storage/#what-is-the-working-set
To avoid this problem you may want to pre-load the working set before any of your actual users hit the database. One way I can think of is a cron job, run hourly (or at whatever frequency you find suitable), that issues this query to keep the working set in memory so subsequent queries are fast.
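A rough sketch of such a warm-up job, again assuming the 2.x C# driver; the connection string, collection, and filter are placeholders. The point is just to run the same query your application runs so the relevant documents and index pages stay in RAM:
using System;
using MongoDB.Bson;
using MongoDB.Driver;

// Run this from a scheduled task / cron at whatever frequency keeps the
// working set resident; the results are simply discarded.
var client = new MongoClient("mongodb://localhost:27017");
var collection = client.GetDatabase("mydb").GetCollection<BsonDocument>("items");

var filter = Builders<BsonDocument>.Filter.Gte("createdAt", DateTime.UtcNow.AddDays(-1));
var warmed = collection.Find(filter).ToList();
Console.WriteLine($"Warmed {warmed.Count} documents");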
Related
I have a large amount of data stored in a SQL database that is being updated constantly. I need to find the best way to update the Solr index, bearing in mind that I have many related tables, for example Product, ProductDetails, ProductStocks, etc.
There seem to be two solutions for this:
1) DIH deltaQueryImport - query the database for all records that have LastUpdated greater than last_index_time and then import those records into Solr for indexing. The DIH is scheduled every 30 minutes, so changes made between scheduled runs are not reflected until the next run, and too much time is spent on the queries due to the number of updated records.
2) Task Queue - every time a product is updated in the database, we queue a task to index that record to Solr.
I just want to know your recommendation and the pros and cons of each approach.
I worked on a project with a similar scenario. We decided to implement your 2nd solution.
A push solution is preferable to a pull solution. With push you can achieve near-real-time updates, which is usually a big plus for the business.
But with this solution you need to consider the following (a rough sketch of the queue approach follows the list):
Batch initial load.
Size of queue if you have a batch update.
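Here is a bare-bones sketch of the push approach in C#. The class, method, and batch-size values are placeholders, and the actual Solr call is left as a stub; a bounded queue is one way to handle the "size of queue" concern above, since a huge batch update then applies back-pressure instead of exhausting memory.
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SolrIndexQueue
{
    // Bounded so a large batch update blocks producers instead of growing without limit.
    private static readonly BlockingCollection<int> Queue =
        new BlockingCollection<int>(boundedCapacity: 100_000);

    // Call this wherever a product row is inserted or updated.
    public static void OnProductChanged(int productId) => Queue.Add(productId);

    // Single background consumer: drain the queue in batches.
    public static Task Start() => Task.Run(() =>
    {
        var batch = new List<int>();
        foreach (var id in Queue.GetConsumingEnumerable())
        {
            batch.Add(id);
            if (batch.Count >= 500 || Queue.Count == 0)
            {
                PushBatchToSolr(batch);
                batch.Clear();
            }
        }
    });

    private static void PushBatchToSolr(IReadOnlyCollection<int> ids)
    {
        // Placeholder: join Product, ProductDetails, ProductStocks, etc. for these
        // ids, build the denormalised documents, post them to the Solr update
        // handler, and commit.
    }
}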
We're seeing some very variable latencies when querying our Azure Table Storage data. We have a number of items each fetching time series data which is broken up by day as follows:
Partition key: {DATA_TYPE}_{YYYYMMdd} - 4 different data types with about 2 years of data in total
Row Key: {DataObjectId} - About 3-4,000 records per day.
A record itself is a JSON-encoded array of DateTime values at 15-minute intervals.
I want to retrieve time-series data for a specific object for the last few days, so I constructed the following query:
string.Format("(PartitionKey ge '{0}') and (PartitionKey le '{1}') and (RowKey eq '{2}')", lowDate, highDate, DataObjectId);
As above we have records going over 2-3 years now.
On the whole the query time is fairly speedy, 600-800 ms. However, once or twice we get a couple of cases where it seems to take a very long time to retrieve data from these partitions, i.e. one or two queries have taken 50+ seconds to return data.
We are not aware of the system being under dramatic load. In fact, frustratingly, all the graphs we've found in the portal suggest no real problems.
Some suggestions that come to mind:
1) Add the year component first, making the partition keys immediately more selective.
However the most frustrating thing is the variation in time taken to do the queries.
The Azure Storage latency shown in the Azure portal averages about 117.2 ms, and the maximum reported is 294 ms. I have interpreted this as network latency.
Of course, any suggestions are gratefully received. The most vexing thing is that the execution time is so variable. In a very small number of cases we see our application resorting to the use of continuation tokens because the query has taken over 5 seconds to complete.
https://msdn.microsoft.com/en-us/library/azure/dd179421.aspx
I have been looking at this for a while.
I've not found an answer to why querying across partitions suffered such variable latency. I had assumed that it would work well with the indexes.
However, the solution seems to be simply to request the data from the 6 different partitions separately. That way every query takes advantage of both the PartitionKey and RowKey indexing. Once this was implemented, our queries began returning much faster.
I would still like to understand why querying across partitions was so slow, but I can only assume the query resulted in a table scan, which has variable latency.
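For reference, the per-partition querying looks roughly like this, assuming the classic WindowsAzure.Storage SDK; the number of days is a parameter and the key formats follow the scheme described above, so treat it as a sketch rather than production code:
using System;
using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage.Table;

public static class TimeSeriesReader
{
    public static IEnumerable<DynamicTableEntity> GetRecentDays(
        CloudTable table, string dataType, string dataObjectId, int days)
    {
        for (int i = 0; i < days; i++)
        {
            string partitionKey = dataType + "_" + DateTime.UtcNow.AddDays(-i).ToString("yyyyMMdd");

            // Exact PartitionKey + exact RowKey: the service can answer this with
            // an index seek instead of scanning a key range across partitions.
            string filter = TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey),
                TableOperators.And,
                TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, dataObjectId));

            foreach (var entity in table.ExecuteQuery(new TableQuery().Where(filter)))
                yield return entity;
        }
    }
}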
I have various large data modification operations in a project built on c# and Fluent NHibernate.
The DB is SQLite (on disk rather than in memory, as I'm interested in performance).
I wanted to check performance of these so I created some tests to feed in large amounts of data and let the processes do their thing. The results from 2 of these processes have got me pretty confused.
The first is a fairly simple case of taking data supplied in an XML file doing some light processing and importing it. The XML contains around 172,000 rows and the process takes a total of around 60 seconds to run with the actual inserts taking around 40 seconds.
In the next process, I do some processing on the same set of data. So I have a DB with approx 172,000 rows in one table. The process then works through this data, doing some heavier processing and generating a whole bunch of DB updates (inserts and updates to the same table).
In total, this results in around 50,000 rows inserted and 80,000 updated.
In this case, the processing takes around 30 seconds, which is fine, but saving the changes to the DB takes over 30 minutes, and it crashes before it finishes with an SQLite 'disk I/O error'.
So the question is: why are the inserts/updates in the second process so much slower? They are working on the same table of the same database with the same connection. In both cases, IStatelessSession is used and ado.batch_size is set to 1000.
In both cases, the code that does the update looks like this:
BulkDataInsert((IStatelessSession session) =>
{
foreach (Transaction t in transToInsert) { session.Insert(t); }
foreach (Transaction t in transToUpdate) { session.Update(t); }
});
(Although the first process has no 'transToUpdate' line as it's inserts only; removing the update line from the second process and just doing the inserts still takes almost 10 minutes.)
The transTo* variables are Lists of the objects to be updated/inserted.
BulkDataInsert creates the session and handles the DB transaction.
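Roughly, BulkDataInsert looks like this (a simplified reconstruction, not the exact code; it assumes using NHibernate; and that _sessionFactory is the configured ISessionFactory):
private void BulkDataInsert(Action<IStatelessSession> work)
{
    // One stateless session and one transaction around the whole batch,
    // so ado.batch_size batching can do its thing.
    using (IStatelessSession session = _sessionFactory.OpenStatelessSession())
    using (ITransaction tx = session.BeginTransaction())
    {
        work(session);
        tx.Commit();
    }
}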
I didn't understand your second process. However, here are some things to consider:
Are there any clustered or non-clustered indexes on the table?
How many disk drives do you have?
How many threads are writing to the DB in the second test?
It seems that you are experiencing IO bottlenecks that can be resolved by having more disks, more threads, indexes, etc.
So, assuming a lot of things, here is what I "think" is happening:
In the first test your table probably has no indexes, and since you are just inserting data, it is a sequential insert in a single thread which can be pretty fast - especially if you are writing to one disk.
Now, in the second test, you are reading data and then updating data. Your SQL instance has to find the record that it needs to update. If you do not have any indexes this "find" action is basically a table scan, which will happen for each one of those 80,000 row updates. This will make your application really really slow.
The simplest thing you could probably do is add a clustered index on the table for a unique key, and the best option is to use the columns that you are using in the where clause to "update" those rows.
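For example, in SQLite that boils down to an ordinary CREATE INDEX on the lookup column(s). A sketch through the NHibernate session's ADO.NET connection, where sessionFactory is the configured ISessionFactory and the table and column names are made up; index whatever your updates filter on:
using (var session = sessionFactory.OpenSession())
using (var cmd = session.Connection.CreateCommand())
{
    // Index the column(s) used to locate rows during the update pass.
    cmd.CommandText = "CREATE INDEX IF NOT EXISTS ix_transaction_lookup " +
                      "ON \"Transaction\" (ExternalId)";
    cmd.ExecuteNonQuery();
}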
Hope this helps.
DISCLAIMER: I made quite a few assumptions
The problem was due to my test setup.
As is pretty common with NHibernate-based projects, I had been using in-memory SQLite databases for unit testing. These work great, but one downside is that if you close the session, it destroys the database.
Consequently, my unit of work implementation contains a 'PreserveSession' property to keep the session alive and just create new transactions when needed.
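In outline, the relevant part looks something like this (a simplified sketch, not the exact code; ISession and ITransaction are the NHibernate types):
public class UnitOfWork : IDisposable
{
    private readonly ISession _session;

    // When true, Dispose leaves the session open (so an in-memory SQLite DB
    // survives) and callers just open new transactions on the same session.
    public bool PreserveSession { get; set; }

    public UnitOfWork(ISession session, bool preserveSession)
    {
        _session = session;
        PreserveSession = preserveSession;
    }

    public ITransaction BeginTransaction() => _session.BeginTransaction();

    public void Dispose()
    {
        if (!PreserveSession)
            _session.Dispose();
    }
}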
My new performance tests are using on-disk databases but they still use the common code for setting up test databases and so have PreserveSession set to true.
It seems that having several sessions all left open (even though they're not doing anything) starts to cause problems after a while, including the performance drop-off and the disk I/O error.
I re-ran the second test with PreserveSession set to false, and immediately I was down from over 30 minutes to under 2 minutes, which is more where I'd expect it to be.
I have a program that retrieves data from SQL based on a specific date range. The problem is that when the date range is set to a year or more, loading the data is so slow that sometimes the program stops responding. Is there a way to avoid this?
You can load the data in a background thread using the BackgroundWorker component.
It will still take time, but the program won't be frozen.
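For example (LoadData, BindData, DateRange, and selectedRange are placeholders for your own data access, UI binding, and parameter types):
using System.ComponentModel;
using System.Windows.Forms;

var worker = new BackgroundWorker();

worker.DoWork += (s, e) =>
{
    // Runs on a thread-pool thread; don't touch UI controls here.
    e.Result = LoadData((DateRange)e.Argument);
};

worker.RunWorkerCompleted += (s, e) =>
{
    // Back on the UI thread; safe to update the grid or chart.
    if (e.Error == null)
        BindData(e.Result);
    else
        MessageBox.Show(e.Error.Message);
};

// Kick it off from e.g. the button click handler:
worker.RunWorkerAsync(selectedRange);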
Alternatively, you can modify your program to load less data.
For example, you can move the logic that uses the data to a sproc on the server.
Another option is to prevent the user selecting a date range that is so large.
This may appear restrictive but usually when the user is presented with 10,000 separate records, they realise they need to make their query more specific. The time taken to retrieve the large data set is just a waste of server, network and the user's time.
We are building an application which requires a daily insertion of approximately 1.5 million rows of data per table. We have 16 tables.
We keep track of 3-day historical data including the current day's data.
The application is done using C#; on the server side, we run an exe that fills the data tables during market hours (4.5 hours), and we update the 16 tables every 5 seconds.
On the client side, the application gets user queries which require the most recently inserted data ( in the last 5 seconds) and a historical point which could be today or before, and plots them somehow.
We are having some serious performance issues, as one query might take 1 second or more, which is too much. The question is: for today's data that is being inserted at runtime, can we make use of caching instead of going to the database each time we want something from today's data? Would that be more efficient? And if so, how can we do that?
P.S. One day's data is approximately 300 MB, and we have enough RAM.
Keep a copy of the data along with the datetime you used to retrieve the data. The next time, retrieve only the new data, which minimizes the amount of data you send over the wire.
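A rough sketch of that idea; the fetch delegate stands in for your real data-access call and the type names are made up:
using System;
using System.Collections.Generic;

public class TodayCache<TRow>
{
    private readonly List<TRow> _rows = new List<TRow>();
    private DateTime _lastFetchUtc = DateTime.UtcNow.Date;   // start of today

    // fetchRowsSince(ts) should return only rows inserted after ts.
    public IReadOnlyList<TRow> GetRows(Func<DateTime, IEnumerable<TRow>> fetchRowsSince)
    {
        DateTime fetchStartedUtc = DateTime.UtcNow;
        _rows.AddRange(fetchRowsSince(_lastFetchUtc));   // only the delta crosses the wire
        _lastFetchUtc = fetchStartedUtc;
        return _rows;
    }
}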
If all the queries run in the operation amount to about 1 second, maybe the issue you are seeing is that the UI is freezing. If that is the case, don't do the work on the UI thread.
Update (based on comments): the code you run in the controls' event handlers runs on the UI thread, which is what causes the UI to freeze. There isn't a single way to run it on a separate thread; I suggest BackgroundWorker for this scenario. See the community-provided example at the end.