What is a good design for caching the results of an expensive search in an ASP.NET system?
Any ideas would be welcomed ... particularly those that don't require inventing a complex infrastructure of our own.
Here are some general requirements related to the problem:
Each search can produce from zero to several hundred result records
Each search is relatively expensive and time-consuming to execute (5-15 seconds at the database)
Results must be paginated before being displayed at the client to avoid information overload for the user
Users expect to be able to sort, filter, and search within the results returned
Users expect to be able to quickly switch between pages in the search results
Users expect to be able to select multiple items (via checkbox) on any number of pages
Users expect relatively snappy performance once a search has finished
I see some possible options for where and how to implement caching:
1. Cache on the server (in session or App cache), use postbacks or Ajax panels to facilitate efficient pagination, sorting, filtering, and searching.
PROS: Easy to implement, decent support from ASP.NET infrastructure
CONS: Very chatty, memory intensive on server, data may be cached longer than necessary; prohibits load balancing practices
2. Cache at the server (as above) but using serializable structures that are moved out of memory after some period of time to reduce memory pressure on the server
PROS: Efficient use of server memory; ability to scale out using load balancing;
CONS: Limited support from .NET infrastructure; potentially fragile when data structures change; places additional load on the database; significantly more complicated
3. Cache on the client (using JSON or XML serialization), use client-side JavaScript to paginate, sort, filter, and select results.
PROS: User experience can approach "rich client" levels; most browsers can handle JSON/XML natively - decent libraries exist for manipulation (e.g. jQuery)
CONS: Initial request may take a long time to download; significant memory footprint on client machines; will require hand-crafted JavaScript at some level to implement
4. Cache on the client using a compressed/encoded representation of the data - call back into server to decode when switching pages, sorting, filtering, and searching.
PROS: Minimized memory impact on server; allows state to live as long as client needs it; slightly improved memory usage on client over JSON/XML
CONS: Large data sets moving back and forth between client/server; slower performance (due to network I/O) as compared with pure client-side caching using JSON/XML; much more complicated to implement - limited support from .NET/browser
5. Some alternative caching scheme I haven't considered...
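Whichever option is chosen, the operations in the requirements (sort, filter, page, select across pages) reduce to cheap list operations once the full result set is held somewhere. The following is only a rough sketch of that idea, written in Python rather than C# to keep it short; the record fields (`id`, `title`, `price`) and the function name `view_page` are hypothetical and not part of any ASP.NET API.

```python
from operator import itemgetter

def view_page(results, page, page_size=25, sort_key=None, descending=False, predicate=None):
    """Return (rows_for_page, total_matching) from an already-cached result list."""
    rows = results if predicate is None else [r for r in results if predicate(r)]
    if sort_key is not None:
        rows = sorted(rows, key=itemgetter(sort_key), reverse=descending)
    start = (page - 1) * page_size  # pages are 1-based
    return rows[start:start + page_size], len(rows)

# Hypothetical cached results from one expensive search.
cached = [{"id": i, "title": f"item {i}", "price": (i * 37) % 100} for i in range(1, 201)]

# Selections are tracked by record id, so they survive paging and re-sorting.
selected_ids = {3, 65}

rows, total = view_page(cached, page=2, page_size=10, sort_key="price",
                        predicate=lambda r: r["price"] < 50)
checked = [r["id"] in selected_ids for r in rows]  # checkbox state for this page
```

Keeping the selection as a set of ids, separate from the cached rows, means the same approach works whether the rows live on the server or in the browser.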
For #1, have you considered using a state server (even SQL server) or a shared cache mechanism? There are plenty of good ones to choose from, and Velocity is getting very mature - will probably RTM soon. A cache invalidation scheme that is based on whether the user creates a new search, hits any other page besides search pagination, and finally a standard timeout (20 minutes) should be pretty successful at weeding your cache down to a minimal size.
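The invalidation scheme described above (new search, navigation away from the search pages, standard timeout) could look roughly like the sketch below. It is in Python rather than C# for brevity, and the class and method names (`SearchResultCache`, `on_navigate`) are made up for illustration; a real ASP.NET implementation would sit on top of Session, the application cache, or a distributed cache.

```python
import time

class SearchResultCache:
    """Per-user cache of one search's results with three invalidation triggers."""

    def __init__(self, ttl_seconds=20 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the timeout can be tested
        self._entries = {}           # user_id -> (query, results, last_access)

    def store(self, user_id, query, results):
        # Trigger 1: a new search replaces whatever the user had cached before.
        self._entries[user_id] = (query, results, self._clock())

    def get(self, user_id, query):
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        cached_query, results, last_access = entry
        # Trigger 3: timeout since the last access (sliding expiration).
        if cached_query != query or self._clock() - last_access > self._ttl:
            del self._entries[user_id]
            return None
        self._entries[user_id] = (cached_query, results, self._clock())
        return results

    def on_navigate(self, user_id, is_search_page):
        # Trigger 2: leaving the search/pagination pages discards the results.
        if not is_search_page:
            self._entries.pop(user_id, None)
```

With at most one entry per user and these three triggers, the cache size stays roughly proportional to the number of users actively paging through results.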
References:
SharedCache (FOSS)
NCache ($995/CPU)
StateServer (~$1200/server)
StateMirror ("Enterprise pricing")
Velocity (Free?)
If you are able to wait until March 2010, .NET 4.0 comes with a new System.Caching.CacheProvider, which promises lots of implementations (disk, memory, SQL Server/Velocity as mentioned).
There's a good slideshow of the technology here. However, it is a little bit of "roll your own", or a lot of it in fact. But there will probably be a lot of closed and open source providers written for the Provider model when the framework is released.
For the six points you state, a few questions crop up:
What is contained in the search results? Just string data or masses of metadata associated with each result?
How big is the set you're searching?
How much memory would you use storing the entire set in RAM? Or at least keeping a cache of the most popular 10 to 100 search terms? Being smart and caching related searches after the first search might be another idea.
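As a rough illustration of keeping only the most popular searches in memory, here is a small bounded cache. It uses least-recently-used eviction as an approximation of "most popular", which is an assumption on my part rather than something from the question; the names `PopularSearchCache` and `get_or_search` are hypothetical, and the sketch is in Python rather than C#.

```python
from collections import OrderedDict

class PopularSearchCache:
    """Keeps results for at most `capacity` search terms, evicting the least recently used."""

    def __init__(self, capacity=100):
        self._capacity = capacity
        self._items = OrderedDict()  # normalized term -> results

    def get_or_search(self, term, search_fn):
        key = term.strip().lower()          # treat "Hotels " and "hotels" as one term
        if key in self._items:
            self._items.move_to_end(key)    # mark as recently used
            return self._items[key]
        results = search_fn(key)            # the expensive 5-15 second search
        self._items[key] = results
        if len(self._items) > self._capacity:
            self._items.popitem(last=False) # drop the least recently used term
        return results
```

Whether this helps depends on how often different users repeat the same terms; if almost every search is unique, the hit rate will be low.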
5-15 seconds for a result is a long time to wait for a search so I'm assuming it's something akin to an expedia.com search where multiple sources are being queried and lots of information returned.
From my limited experience, the biggest problem with the client-side only caching approach is Internet Explorer 6 or 7. Server only and HTML is my preference with the entire result set in the cache for paging, expiring it after some sensible time period. But you might've tried this already and seen the server's memory getting eaten.
Raising an idea under the "alternative" caching scheme. This doesn't answer your question with a given cache architecture, but rather goes back to your original requirements of your search application.
Even if/when you implement your own cache, its effectiveness can be less than optimal -- especially as your search index grows in size. Cache hit rates will decrease as your index grows. At a certain inflection point, your search may actually slow down due to resources dedicated to both searching and caching.
Most search sub-systems implement their own internal caching architecture as a means of efficiency in operation. Solr, an open-source search system built on Lucene, maintains its own internal cache to provide for speedy operation. There are other search systems that would work for you, and they take similar strategies to results caching.
I would recommend you consider a separate search architecture if your search index warrants it, as caching for free-text keyword search is complex to implement effectively.
Since you say any ideas are welcome:
We have been using the enterprise library caching fairly successfully for caching result sets from a LINQ result.
http://msdn.microsoft.com/en-us/library/cc467894.aspx
It supports custom cache expiration, so should support most of your needs (with a little bit of custom code) there. It also has quite a few backing stores including encrypted backing stores if privacy of searches is important.
It's pretty fully featured.
My recommendation is a combination of #1 and #3:
Cache the query results on the server.
Make the results available as both a full page and as a JSON view.
Cache each page retrieved dynamically at the client, but send a REQUEST each time the page changes.
Use ETAGs to do client cache invalidation.
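To make the ETag step concrete, here is a minimal sketch of the server side of that exchange: the client sends the ETag it already holds in `If-None-Match`, and the server answers `304 Not Modified` when the page has not changed. The function names and the choice of a truncated SHA-256 hash are assumptions for illustration, and the sketch is in Python rather than C#.

```python
import hashlib
import json

def make_etag(page_rows):
    """Derive a quoted ETag from the serialized page content."""
    body = json.dumps(page_rows, sort_keys=True).encode("utf-8")
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def handle_page_request(page_rows, if_none_match=None):
    """Return (status, headers, body) for one JSON page of results."""
    etag = make_etag(page_rows)
    if if_none_match == etag:
        # The client's cached copy is still valid: send no body.
        return 304, {"ETag": etag}, b""
    body = json.dumps(page_rows, sort_keys=True).encode("utf-8")
    return 200, {"ETag": etag, "Content-Type": "application/json"}, body
```

The request is still sent on every page change, as recommended above, but an unchanged page costs only headers on the wire.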
Have a look at SharedCache- it makes 1/2 pretty easy and works fine in a load balanced system. Free, open source, and we've been using it for about a year with no issues.
While pondering your options, consider that no user wants to page through data. We force that on them as an artifact of trying to build applications on top of browsers in HTML, which inherently do not scale well. We have invented all sorts of hackery to fake application state on top of this, but it is essentially a broken model.
So, please consider implementing this as an actual rich client in Silverlight or Flash. You will not beat that user experience, and it is simple to cache data much larger than is practical in a regular web page. Depending on the expected user behavior, your overall bandwidth could be optimized because the round trips to the server will get only a tight data set instead of any ASP.NET overhead.
We are currently facing problems due to a large number of cached objects. We cache data from an ERP system (for an online shop), and IIS recycles the web application when it reaches the maximum amount of memory, so we lose all cached objects. As this makes caching somewhat problematic, we are looking for a different way to cache the objects.
I have found AppFabric from Microsoft to be a pretty neat solution, as it is already included in our Windows Server licenses.
However, I still fear that we will have enormous performance problems when using AppFabric (Velocity) instead of the MemoryCache class (our current caching solution).
So my question is: is this a solution for our problem, or am I overthinking this, and is AppFabric fast enough?
Grid Dynamics did a great report on using AppFabric here. While I don't know the numbers for your specific cache operations, the report showed great numbers performance wise for AppFabric. In one test, they wanted to see how the size of the cache impacted the cache operations performance. When just reading the data, it had little to no impact on the cache operations performance. When updating, there was impact on cache operations performance, but not a ridiculous amount. When testing object size and performance, obviously, larger objects lowered the performance (throughput performance here). Overall, the report has solid tests and statistics that show that the performance of AppFabric Cache is excellent.
No, Grid Dynamics does not compare the results to other products, but they do show you what the performance of AppFabric Cache is like in different tests. They have a particularly useful Appendix section that can provide details to help people in different usage scenarios.
As always, using a solution that is not on the same machine as the IIS instance will add a little bit of time to the fetching of session data from the cache, but we are talking a small amount of time.
If I am understanding your situation correctly, then there are object caching solutions available that let you cache objects in memory and expire them according to your application logic or when the cache starts filling up.
AppFabric is not a very mature product in this regard, especially when talking about an "inproc" cache. You'd need a client cache, which is really a subset of the distributed cache (meaning all the cached objects) that resides "in proc" and is kept synchronized with the distributed cache.
One solution that I'd recommend is to use NCache as a distributed cache and use its client caching feature for your ERP objects.
I have an application that queries the database for records. There can be thousands of records, which can drive up the memory usage of the process and eventually lead to a crash or slow responses.
A paginated query is a solution for this, but the information in the records keeps changing. To give a consistent experience, we need to show the information as it was at the time the user made the query.
With paging, the content could change as the user moves from page to page. I believe client-side caching could solve this problem.
One approach I have found is to store the results on disk in XML format and query them using LINQ to XML. Are there any proven client-side caching mechanisms that work with a desktop application (not web)?
See some pattern like http://msdn.microsoft.com/en-us/library/ff664753
It talks about the use of the Enterprise Library Caching Application Block that lets developers incorporate a local cache in their applications.
Read also http://www.codeproject.com/Articles/8977/Using-Cache-in-Your-WinForms-Applications
Enterprise Library 5.0 can be found here http://msdn.microsoft.com/en-us/library/ff632023
Memory usage shouldn't really be an issue unless you are letting your cache grow indefinitely. There is little benefit to pre-fetching too many pages the user may never see, or in holding on to pages that the user has not viewed for a long time. Dynamically fetching the next/previous page would keep performance high, but you should clear from the cache pages that have been edited or are older than a certain timespan. Clearing from the cache simply requires discarding all references to the page (e.g. removing it from any lists or dictionaries) and allowing the garbage collector to do its work.
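A minimal sketch of such a page cache, assuming a hypothetical `fetch_page` callback that loads one page from the database: pages are fetched on demand, dropped when edited, and dropped when older than a maximum age. It is written in Python rather than C#, and all names are illustrative.

```python
import time

class PageCache:
    """Caches fetched pages; stale or edited pages are discarded and re-fetched on demand."""

    def __init__(self, fetch_page, max_age_seconds=300, clock=time.monotonic):
        self._fetch_page = fetch_page   # hypothetical loader: page_number -> rows
        self._max_age = max_age_seconds
        self._clock = clock
        self._pages = {}                # page_number -> (rows, fetched_at)

    def get(self, page_number):
        self.evict_stale()
        if page_number not in self._pages:
            self._pages[page_number] = (self._fetch_page(page_number), self._clock())
        return self._pages[page_number][0]

    def invalidate(self, page_number):
        # Call this after a page's rows have been edited.
        self._pages.pop(page_number, None)

    def evict_stale(self):
        now = self._clock()
        stale = [n for n, (_, at) in self._pages.items() if now - at > self._max_age]
        for n in stale:
            del self._pages[n]  # dropping the reference lets the GC reclaim the rows

    def __len__(self):
        return len(self._pages)
```

Pre-fetching the next or previous page would simply be an extra `get(page_number + 1)` call made in the background.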
You can also potentially store a WeakReference to your objects and let the garbage collector collect your objects if it needs to, but this gives you less control over what is and isn't cached.
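The weak-reference idea can be illustrated with Python's `weakref` module; this is an analogy, not the .NET `WeakReference` API, and the exact moment of collection differs between runtimes. The `Page` class here is a hypothetical stand-in for a cached page of records.

```python
import gc
import weakref

class Page:
    """Hypothetical page of records (plain lists cannot be weakly referenced in Python)."""
    def __init__(self, number, rows):
        self.number = number
        self.rows = rows

# Values in this dictionary do not keep their objects alive.
cache = weakref.WeakValueDictionary()

page = Page(1, ["row a", "row b"])
cache[1] = page
still_cached = cache.get(1) is page   # available while a strong reference exists

del page                              # drop the only strong reference
gc.collect()                          # make collection explicit for this demonstration
after_release = cache.get(1)          # the entry disappears once the object is collected
```

The loss of control mentioned above is visible here: the cache has no say in when an entry vanishes, so callers must always be prepared to re-fetch.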
Alternatively, there are some very good third-party solutions for this, especially if it's a grid control. The DevExpress grid controls have an excellent server mode that can handle very large data sets with good performance.
I have inherited a project from a developer who was rather fond of session variables. He has used them to store all sorts of global stuff - datatables, datasets, locations of files, connection strings etc. I am a little worried that this may not be very scalable and we do have the possibility of a lot more users in the immediate future.
Am I right to be concerned, and if so why?
Is there an easy way to see how much memory this is all using on the live server at the moment?
What would be the best approach for re-factoring this to use a better solution?
Yes, I would say that you do have some cause for concern. Overuse of session can cause a lot of performance issues. Ideally, session should only be used for information that is specific to the user. Obviously there are exceptions to this rule, but keep that in mind when you're refactoring.
As for the refactoring itself, I would look into caching any large objects that are not user-specific, and removing anything that doesn't need to be in session. Don't be afraid to make a few trips to the database to retrieve information when you need it. Go with the option that puts the least overall strain on the server. The trick is keeping it balanced and distributing the weight as evenly as possible across the various layers of the application.
It was probably due to poor design, and yes you should be concerned if you plan on getting heavier traffic or scaling the site.
Connection strings should be stored in web.config. Seems like you would have to do some redesigning of the data-layer and how the pages pass data to each other to steer away from storing datatables and datasets in Session. For example, instead of storing a whole dataset in Session, store, or pass by url, something small (like an ID) that can be used to re-query the database.
Sessions always hurt scalability. However, once sessions are being used, the impact of a little bit more data in a session isn't that bad.
Still, it has to be stored somewhere, has to be retrieved from somewhere, so it's going to have an impact. It's going to really hurt if you have to move to a web-farm to deal with being very successful, since that's harder to do well in a scalable manner. I'd start by taking anything that should be global in the true sense (shared between all sessions) and move it into a truly globally-accessible location.
Then anything that depended upon the previous request, I'd have be sent by that request.
Doing both of those would reduce the amount they were used for immensely (perhaps enough to turn off sessions and get the massive scalability boost that gives).
Depending on the IIS version, using Session to store state can have an impact on scaling. The later versions of IIS are better.
However, the main problem I have run into is that sessions expire and then your data is lost; you may provide your own Session_OnEnd handler where it is possible to regenerate your session.
Overall yes, you should be concerned about this.
Session is a "per user" type of storage that is in memory. Looking at the memory usage of the ASP.NET Worker Process will give you an idea of memory usage, but you might need to use third-party tools if you want to dig deeper into what is in it. In addition, session gets really "fun" when you start load balancing etc.
ConnectionStrings and other information that is not "per user" should really not be handled in a "per user" storage location.
As for creating a solution for this though, a lot is going to depend on the data itself, as you might need to find multiple other opportunities/locations to get/store the info.
You are right in feeling concerned about this.
Connection strings should be stored in Web.config and always read from there. The Web.config file is cached, so storing things in there and then on Session is redundant and unnecessary. The same can be said for locations of files: you can probably create key,value pairs in the appSettings section of your web.config to store this information.
As far as storing datasets, datatables, etc; I would only store this information on Session if getting them from the database is really expensive and provided the data is not too big. A lot of people tend to do this kind of thing w/o realizing that their queries are very fast and that database connections are pooled.
If getting the data from the database does take long, the first thing I would try to remedy would be the speed of my queries. Am I missing indexes? What does the execution plan of my queries show? Am I doing table scans, etc.?
One scenario where I currently store information on Session (or Cache) is when I have to call an external web service that takes more than 2 seconds on average to retrieve what I need. Once I get this data I don't need to get it again on every page hit, so I cache it.
Obviously an application that stores pretty much everything it can on Session is going to have scalability issues because memory is a limited resource.
If memory is the issue, why not change the session mode to SQL Server, so that session data is stored in SQL Server? This requires few code changes.
How to store session data in SQL Server:
http://msdn.microsoft.com/en-us/library/ms178586.aspx
The catch is that the classes stored in SQL Server must be serializable; you can use Json.NET to do just that.
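The serializability requirement amounts to being able to turn a session object into a blob and back without losing anything. Below is a rough sketch of that round trip using JSON, in Python rather than C# with Json.NET; the `CartState` class and the helper names are hypothetical.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class CartState:
    """Hypothetical per-user session object made only of JSON-friendly fields."""
    user_id: int
    item_ids: list = field(default_factory=list)
    coupon: str = ""

def to_session_blob(state):
    # What would be written to the out-of-process session store.
    return json.dumps(asdict(state))

def from_session_blob(blob):
    # What would be read back on the next request, possibly on another server.
    return CartState(**json.loads(blob))
```

Objects that hold open connections, file handles, or large DataSets do not survive this kind of round trip well, which is one more reason to keep session contents small.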
I have coded up an ASP.NET website and am running it on win'08 (remotely hosted). The application queries 11 very large Lucene indexes (each ~100GB). I open IndexSearchers on Page_load() and keep them open for the duration of the user session.
My questions:
The queries take ~5 seconds to complete, which is understandable since these are very large indexes, but users want faster responses. I would like to squeeze out better performance. (I did look over the Apache Lucene website and tried some of the ideas there.) I'm interested in whether and how you tweaked it further, especially from an ASP.NET perspective.
One idea was to use Solr instead of querying Lucene directly. But that seems counter-intuitive: it introduces another abstraction in between and might add to the latency. Is it worth the headache of porting to Solr? Can anyone share some metrics on the improvement you got after switching to Solr, if it was worth it?
Are there some key things that could be done in Solr that could be replicated to speed up response times?
Some questions / ideas:
Are you hitting all 11 indexes for a single request?
Can you reorganize the indexes so that you hit only 1 index (i.e. sharding) ?
Have you run a profile of the application (using dotTrace or similar tool)? Where is the time spent? Lucene.Net?
If most of the time is spent on Lucene.Net, then if you migrate to Solr the latency should be negligible (compared to the rest of the spent time). Plus, Solr can be easily distributed to increase performance.
I'm not all too familiar with Lucene (I use Solr) but if you're searching 11 indexes per request, can you run those searches in parallel (e.g. with TPL) ?
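As a sketch of the parallel approach, the snippet below queries several indexes concurrently and merges the hits by score. It uses Python threads rather than the .NET TPL, and the fake index searchers returning `(score, doc_id)` tuples are stand-ins invented for the example; real Lucene scores from separate indexes may not be directly comparable without care.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def search_all(indexes, query, top_n=10):
    """Query every index concurrently and merge hits by descending score."""
    with ThreadPoolExecutor(max_workers=len(indexes)) as pool:
        per_index = list(pool.map(lambda search: search(query), indexes))
    all_hits = (hit for hits in per_index for hit in hits)
    return heapq.nlargest(top_n, all_hits, key=lambda hit: hit[0])

def make_fake_index(name, scores):
    # Stand-in for a real index searcher; returns (score, doc_id) tuples.
    def search(query):
        return [(s, f"{name}:{query}:{i}") for i, s in enumerate(scores)]
    return search

indexes = [make_fake_index("a", [0.9, 0.2]), make_fake_index("b", [0.7, 0.5])]
top = search_all(indexes, "lucene", top_n=3)
```

With this shape, total latency approaches that of the slowest single index rather than the sum of all eleven, provided the disks and CPU can serve the searches concurrently.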
The biggest thing is removing the search from the web tier, and isolating it to its own tier (a search tier). That way, you have a dedicated box with dedicated resources that have the indexes loaded, and "warmed up" in cache, instead of having each user have their own copy of an index reader.
I'm still yet to find a decent solution to my scenario. Basically I have an ASP.NET MVC website which has a fair bit of database access to make the views (2-3 queries per view) and I would like to take advantage of caching to improve performance.
The problem is that the views contain data that can change irregularly, like it might be the same for 2 days or the data could change several times in an hour.
The queries are quite simple (select... from where...) and not huge joins, each one returns on average 20-30 rows of data (with about 10 columns).
The queries are quite simple at the site's current stage, but over time the owner will be adding more data and the visitor numbers will increase. They are large at the moment and I would be looking at caching as traffic will mostly be coming from Google AdWords etc. and fast-loading pages will be a benefit (apparently).
The site will be hosted on a Microsoft SQL Server 2005 database (But can upgrade to 2008 if required).
Do I either:
Set the caching to the minimum time an item doesn't change for (e.g. cache for, say, 3 minutes) and tell the owner that any changes will take up to 3 minutes to appear?
Find a way to force the cache to clear and reprocess on changes (e.g. if the owner adds an item in the administration panel, it clears the relevant caches)?
Forget caching altogether?
Or is there an option that would better suit this scenario?
If you are using Sql Server, there's also another option to consider:
Use the SqlCacheDependency class to have your cache invalidated when the underlying data is updated. Obviously this achieves a similar outcome to option 2.
I might actually have to agree with Agileguy though - your query descriptions seem pretty simplistic. Thinking forward and keeping caching in mind while you design is a good idea, but have you proven that you actually need it now? Option 3 seems a heck of a lot better than option 1, assuming you aren't actually dealing with significant performance problems right now.
Premature optimization is the root of all evil ;)
That said, if you are going to Cache I'd use a solution based around option 2.
You have less opportunity for "dirty" data in that manner.
Kindness,
Dan
2nd option is the best. Shouldn't be so hard if the same app edits/caches data. Can be more tricky if there is more than one app.
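When the same application both edits and caches the data, option 2 can be as simple as having every write path clear the affected cache entries. The following sketch shows the pattern in Python rather than C#; `ViewCache`, `admin_add_product`, and the in-memory `products` dictionary standing in for the database are all hypothetical.

```python
class ViewCache:
    """Caches view data by key until the owning application explicitly invalidates it."""

    def __init__(self, load):
        self._load = load   # function: key -> fresh data (e.g. runs the query)
        self._data = {}

    def get(self, key):
        if key not in self._data:
            self._data[key] = self._load(key)
        return self._data[key]

    def invalidate(self, *keys):
        for key in keys:
            self._data.pop(key, None)

# Hypothetical "database" and admin action.
products = {"shoes": ["boot"]}
cache = ViewCache(load=lambda category: list(products[category]))

def admin_add_product(category, name):
    products[category].append(name)
    cache.invalidate(category)  # clear only the view data affected by this change
```

The tricky case mentioned above, more than one application writing to the database, is exactly where this breaks down, because the second application has no way to call `invalidate`.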
If you can't go that way, the 1st might be acceptable too. With some tweaks (e.g. I would try to update the cache silently on another thread when it hits the timeout) it might work well enough (if the data are allowed to be a bit old).
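The "update silently on another thread" tweak might look something like this: once the timeout passes, the old value is still returned immediately while a background thread reloads it. This is only a sketch in Python rather than C#, with hypothetical names, and it loads synchronously on the very first request because there is nothing stale to serve yet.

```python
import threading
import time

class RefreshingCache:
    """Serves the cached value immediately and refreshes it in the background after the TTL."""

    def __init__(self, load, ttl_seconds=180, clock=time.monotonic):
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._value = None
        self._loaded_at = None
        self._refresher = None  # background thread, if one is running

    def get(self):
        with self._lock:
            if self._loaded_at is None:
                # First request has nothing to serve, so it loads synchronously.
                self._value, self._loaded_at = self._load(), self._clock()
                return self._value
            expired = self._clock() - self._loaded_at > self._ttl
            if expired and self._refresher is None:
                # Start at most one refresh; other callers keep getting the old value.
                self._refresher = threading.Thread(target=self._refresh, daemon=True)
                self._refresher.start()
            return self._value  # possibly stale, but returned without waiting

    def _refresh(self):
        fresh = self._load()    # the slow work happens outside the lock
        with self._lock:
            self._value, self._loaded_at = fresh, self._clock()
            self._refresher = None
```

Visitors never wait on the query after the first load; the cost is that data can be older than the TTL by however long the reload takes.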
Never drop caching if it's possible to keep it. Everyone knows the "premature optimization..." verse, but caching is one of those things that can increase the scalability/performance of an application dramatically.