ASP.NET Lucene Performance Improvements question - c#

I have coded up an ASP.NET website running on Windows Server 2008 (remotely hosted). The application queries 11 very large Lucene indexes (each ~100GB). I open IndexSearchers in Page_Load() and keep them open for the duration of the user session.
My questions:
The queries take ~5 seconds to complete - understandable given these are very large indexes - but users want faster responses. I am curious how to squeeze out better performance. (I did look over the Apache Lucene website and tried some of the ideas there.) I'm interested in whether and how you tweaked things further, especially from an ASP.NET perspective.
One idea was to use Solr instead of querying Lucene directly. But that seems counter-intuitive: introducing another abstraction in between might add to the latency. Is it worth the headache of porting to Solr? Can anyone share some metrics on what improvement you got after switching to Solr, if it has been worth it?
Are there some key things that could be done in Solr that could be replicated to speed up response times?

Some questions / ideas:
Are you hitting all 11 indexes for a single request?
Can you reorganize the indexes so that you hit only 1 index (i.e. sharding) ?
Have you run a profile of the application (using dotTrace or similar tool)? Where is the time spent? Lucene.Net?
If most of the time is spent on Lucene.Net, then if you migrate to Solr the latency should be negligible (compared to the rest of the spent time). Plus, Solr can be easily distributed to increase performance.
I'm not too familiar with Lucene (I use Solr), but if you're searching 11 indexes per request, can you run those searches in parallel (e.g. with the TPL)? See the sketch below.
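For illustration only, a minimal sketch of that fan-out with PLINQ (part of the TPL family). It assumes Lucene.Net 3.x-style APIs and a collection of already-opened IndexSearcher instances (one per index); the merge-by-score step is deliberately naive.

```csharp
using System.Collections.Generic;
using System.Linq;
using Lucene.Net.Search;

public static class ParallelSearch
{
    // Runs the same query against every index concurrently and merges the hits by score.
    // Note: a ScoreDoc's doc id is only meaningful relative to the searcher it came from,
    // so each hit is kept paired with its searcher.
    public static List<KeyValuePair<IndexSearcher, ScoreDoc>> SearchAll(
        IList<IndexSearcher> searchers, Query query, int topN)
    {
        return searchers
            .AsParallel()
            .SelectMany(s => s.Search(query, topN).ScoreDocs
                              .Select(sd => new KeyValuePair<IndexSearcher, ScoreDoc>(s, sd)))
            .OrderByDescending(pair => pair.Value.Score)   // naive merge by score
            .Take(topN)
            .ToList();
    }
}
```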

The biggest thing is removing the search from the web tier and isolating it in its own tier (a search tier). That way you have a dedicated box with dedicated resources that has the indexes loaded and "warmed up" in cache, instead of having each user hold its own copy of an index reader.
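As a rough illustration of the "shared, warmed-up reader" idea (not the poster's code): one application-wide, read-only IndexSearcher per index, created lazily and reused by every request instead of being opened per session in Page_Load. The path is a placeholder and Lucene.Net 3.x-style APIs are assumed.

```csharp
using System;
using System.IO;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class SearcherPool
{
    // Lazy<T> gives thread-safe, one-time initialization shared by the whole application.
    private static readonly Lazy<IndexSearcher> Searcher = new Lazy<IndexSearcher>(() =>
    {
        var dir = FSDirectory.Open(new DirectoryInfo(@"D:\indexes\index01")); // placeholder path
        return new IndexSearcher(IndexReader.Open(dir, true));                // true = read-only
    });

    public static IndexSearcher Instance
    {
        get { return Searcher.Value; }
    }
}
```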

Related

Caching big data in .NET Core Web API

I have a Web API that provides complex statistical/forecast data. I have one endpoint that can take even 20s to complete, so I started looking at caching to boost the performance. My situation is very different from those described in many examples, so I need help.
Long story short, the method returns a batch of forecasts and statistics for an item. For a single item it's as quick as 50ms, which is good. But there is also a (very complex) method that needs 2000-3000 items AT ONCE to calculate different statistics. And this is a problem.
There are probably around 250,000 items in the database, around 200M rows in one table. The good part is: Table only updates ONCE per day and I would need around 1GB of data (around 80M "optimized" rows).
So my idea was: once per day (I know exactly when) the API would query, transform, optimize and put into memory 1GB of data from that table, and during the day it will be lightning fast.
My question is, is it a good idea? Should I use some external provider (like Memcached or Redis) or just a singleton list with proper locking using semaphores etc?
If Memcache, how can I do it? I don't want to cache this table "as is". It's too big. I need to do some transformation first.
Thanks!
This is a good solution if you are not limited by server RAM, imo. Since it's .NET Core you can try System.Runtime.Caching.MemoryCache.
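For what it's worth, a minimal sketch of that idea, assuming System.Runtime.Caching (available on .NET Core via the System.Runtime.Caching NuGet package). ForecastRow and the loader delegate are placeholders for the "optimized" rows and the once-a-day query/transform step.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.Caching;

public class ForecastRow { /* placeholder for the "optimized" row shape */ }

public class ForecastCache
{
    private const string Key = "forecast-rows";
    private static readonly MemoryCache Cache = MemoryCache.Default;
    private static readonly object Gate = new object();   // avoids two threads reloading at once

    public IReadOnlyList<ForecastRow> GetRows(Func<IReadOnlyList<ForecastRow>> loadAndTransform)
    {
        var rows = Cache.Get(Key) as IReadOnlyList<ForecastRow>;
        if (rows != null) return rows;

        lock (Gate)
        {
            rows = Cache.Get(Key) as IReadOnlyList<ForecastRow>;
            if (rows == null)
            {
                rows = loadAndTransform();                              // the once-a-day query + transform
                Cache.Set(Key, rows, DateTimeOffset.Now.AddHours(24));  // expire after a day
            }
            return rows;
        }
    }
}
```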

Max Amount / LIMIT of ASP.NET sites on one server

My question is simple. About 2 years ago we began migrating to ASP.NET from ASP Classic.
Our issue is we currently have about 350 sites on a server and the server seems to be getting bogged down. We have been trying various things to improve performance (query optimizations, disabling ViewState, Session State, etc.) and they have all worked, but as we add more sites we end up using more of the server's resources, so the improvements we made in code are virtually erased.
Basically we're now at a tipping point; our CPUs currently average near 100%. Our IS department would like us to find new ways to rework the code on the sites to improve performance.
I have a theory, that we are simply at the limit on the amount of sites one server can handle.
Any ideas? Please only respond if you have a good idea of what you are talking about. I've heard a lot of people theorize about the situation. I need someone who has actual knowledge of what might be going on.
Here are the details.
250 ASP.NET Sites
250 Admin Sites (Written in ASP.NET, basically they are backend admin sites)
100 Classic ASP Sites
Running on a virtualized Windows Server 2003.
3 CPUs, 4 GB Memory.
Memory stays around 3 - 3.5 GB
CPUs spike very badly, sometimes remaining near 100% for short periods of time (30 - 180 seconds)
The database is on a separate server and is SQL SERVER 2005.
It looks like you've reached that point. You've optimised your apps, you've looked at server performance, you can see you are hitting peak memory usage and maxing out the CPU, and, let's face it, administering so many websites can't be easy.
Also, the spec of your VM isn't fantastic. Its memory, in particular, probably isn't great for the number of sites you have.
You have plenty of reasons to move.
However, some things to look at:
1) How many of those 250 sites are actually used? Which ones are the peak performance offenders? Those ones are prime candidates for being moved off onto their own box.
2) How many are not used at all? Can you retire any?
3) You are running on a virtual machine. What kind of virtual machine platform are you using? What other servers are running on that hardware?
4) What kind of redundancy do you currently have? 250 sites on one box with no backup? If you have a backup server, you could use that to round robin requests, or as a web farm, sharing the load.
Let's say you decide to move. The first thing you should probably think about is how.
Are you going to simply halve the number of sites? 125 + admins on one box, 125 + admins on the other? Or are you going to move the most used?
Or you could have several virtual machines, all active, as part of a web farm or load balanced system.
By the sounds of things, though, there's a real resistance to buy more hardware.
At some point, you are going to have to though, as sometimes, things just get old or get left behind. New servers have much more processing power and memory in the same space, and can be cheaper to run.
Oh, and one more thing. The cost of all those repeated optimizations and testing probably could easily be offset by buying more hardware. That's no excuse for not doing any optimization at all, of course, and I am impressed by the number of sites you are running, especially if you have a good number of users, but there is a balance, and I hope you can tilt towards the "more hardware" side of it some more.
I think you've answered your own question really. You've optimised the sites, you've got the database server on a different server. And you have 600 sites (250 + 250 + 100).
The answer is pretty clear to me. Buy a box with more memory and CPU power.
There is no real limit on the number of sites your server can handle: if all 600 sites had no users, you wouldn't have much load on the server.
I think you might find a better answer at serverfault, but here are my 2 cents.
You can scale up or scale out.
Scale up -- upgrade the machine with more memory / more cores in the CPU.
Scale out -- distribute the load by splitting the sites across 2 or more servers. 300 on server A, 300 on server B, or 200 each across 3 servers.
As #uadrive mentions, this is an issue of load, not of # of sites.
Just thinking this through, it seems like you would be better off measuring the # of users hitting the server instead of # of sites. You could have 300 sites and only half are used. Knowing the usage would be better in my mind.
There's no simple formula answer, like "you can have a maximum of 47.3 sites per gig of RAM". You could surely maintain performance with many more sites if each site had only one user per day. There are likely servers that have only two sites but performance is terrible because each hit requires a massive database query.
In practice, the only way to approach this is empirically: When performance starts to degrade, you have a problem. The fact that somebody wrote in a book somewhere that a server with such-and-such resources should be able to support more sites is of little value if, in practice, YOUR server can't support YOUR sites and YOUR users.
Realistic options are:
(a) Optimize your code and database queries. You say you've already done that. Maybe you can do more. It's unlikely that your code is now the absolute best that it can possibly be, but it may well be that the effort to find further improvements will be hugely expensive.
(b) Buy a bigger server.
(c) Break your sites across multiple servers, and either update DNS or install a front-end to map requests to the correct server.
Maxing out on CPU use can be a good sign, in the sense that moving to a larger server, or dividing the sites between multiple servers, is likely to help.
There are many things you can do to help improve performance and scalability (in fact, I've written a book on this subject -- see my profile).
It's difficult to make meaningful suggestions without knowing much more about your apps, but here are a few quick tips that might help to get you started:
Multiple AppPools are expensive. How many sites do you have per AppPool? Combine multiple sites per AppPool if you can
Minimize client round-trips: improve client and proxy-level caching, offload static files to a CDN, use image sprites, merge multiple CSS and JS files
Enable output caching on pages and/or controls where possible (see the sketch after this answer)
Enable compression for static files (more CPU use on first access, but less after that)
Avoid Session state altogether if you can (prefer cookies for state management). If you can't, then at least set EnableSessionState="ReadOnly" for pages that don't need to write session state, or "false" for pages that don't need it at all
Many things on the SQL Server side: caching, SqlCacheDependency, command batching, grouping multiple insert/update/deletes into a single transaction, using stored procedures instead of dynamic SQL, using async ADO.NET instead of LINQ or EF, make sure your DB logs are on separate spindles from data, etc
Look for algorithmic issues with your code; for example, hash tables are often better than linear searches, etc
Minimize cookie sizes, and only set cookies on pages, not on static content.
In addition, using a VM is likely to cost you up to about 10% in performance -- make sure it's really worth that for what it buys you in terms of improved manageability.
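As a hedged illustration of the output-caching tip above (not part of the original answer): the programmatic equivalent of the <%@ OutputCache %> page directive in a WebForms code-behind. The duration, parameter name and page class are placeholders; EnableSessionState itself is set in the <%@ Page %> directive rather than in code.

```csharp
using System;
using System.Web;
using System.Web.UI;

public partial class ProductList : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Let ASP.NET and downstream caches keep the rendered page for 60 seconds.
        Response.Cache.SetCacheability(HttpCacheability.Public);
        Response.Cache.SetExpires(DateTime.UtcNow.AddSeconds(60));
        Response.Cache.SetMaxAge(TimeSpan.FromSeconds(60));
        Response.Cache.SetValidUntilExpires(true);
        Response.Cache.VaryByParams["categoryId"] = true;   // one cached copy per category
    }
}
```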

Design considerations for supporting around 400+ concurrent users in web application

I am at the start of a mid-sized ASP.NET C# project with an application performance requirement to support around 400+ concurrent users.
What are the things I need to keep in mind while architecting an application to meet such performance and availability standards? Pages need to be served in under 5 seconds. I plan to have the application and database on separate physical machines. From a coding and application layering perspective:
If I have the database layer exposed to the application layer via a WCF service, will it hamper performance? Should I use a direct TCP connection instead?
Will it matter if I am using Entity Framework or some other ORM, or the Enterprise Library Data Access Application Block?
Should I log exceptions to database or a text file?
How do I check while development if the code being built is going to meet those performance standards eventually? Or is this even a point I need to worry about at development stage?
Do I need to put my database connection code, and other classes holding lookup data that rarely changes for the life of the application, in static classes so they are available throughout the life of the application?
What kind of caching policy should I apply?
What free tools can I use to measure and test performance? I know of red-gate performance measurement tools but that has a high license cost, so free tools are what I'd prefer.
I apologize if this question is too open ended. Any tips or thoughts on how I should proceed?
Thanks for your time.
An important consideration when designing a scalable application is to make it stateless. No sessions. Another important consideration is to cache everything you can in order to reduce database queries. This cache should be distributed to other machines which are specifically designed to store it. Then all you have to do is throw in an additional server when the application starts to run slowly due to an increased user load.
As far as your questions about WCF are concerned: you can use WCF, and it won't be a bottleneck for your application. It will add an additional layer which will slow things down a bit, but if you want to expose a reusable layer that can scale independently on its own, WCF is great.
ORMs might indeed introduce a performance slowdown in your application. It's more due to the fact that you have less control over the generated SQL queries and it is thus more difficult to tune them. This doesn't mean that you shouldn't use an ORM; just be careful about the SQL it emits and tune it with your DB admin. There are also lightweight ORMs such as Dapper, PetaPoco and Massive that you might consider.
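For illustration only, a small sketch of what the Dapper route can look like. The connection string, table and Product type are placeholders; the point is that the SQL stays hand-written and fully under your control for tuning.

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;
using Dapper;

public class Product
{
    public int Id { get; set; }
    public string Name { get; set; }
    public decimal Price { get; set; }
}

public static class ProductRepository
{
    public static IList<Product> GetByCategory(string connectionString, int categoryId)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            // Dapper maps the selected columns onto Product's properties.
            return connection.Query<Product>(
                "SELECT Id, Name, Price FROM dbo.Products WHERE CategoryId = @categoryId",
                new { categoryId }).ToList();
        }
    }
}
```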
As far as static classes are concerned, they won't improve performance that much compared to instance classes. A class instantiation on the CLR is a pretty fast operation as Ayende explains. Static classes will introduce tight coupling between your data access layer and your consuming layer. So you can forget about static classes for the moment.
For error logging, I would recommend ELMAH.
For benchmarking there are quite a lot of tools; Apache Bench is one that is simple to use.
There's always a trade-off between developer productivity, maintainability and performance; you can only really make that trade-off sensibly if you can measure. Productivity is measured by how long it takes to get something done; maintainability is harder to measure, but luckily, performance is fairly easy to quantify. In general, I'd say to optimize for productivity and maintainability first, and only optimize for performance if you have a measurable problem.
To work in this way, you need to have performance targets, and a way of regularly assessing the solution against those targets - it's very hard to retro-fit performance into a project. However, optimizing for performance without proven necessity tends to lead to obscure, hard-to-debug software solutions.
Firstly, you need to turn your performance target into numbers you can measure; for web applications, that's typically "dynamic page requests per second". 400 concurrent users probably don't all request pages at exactly the same time - they usually spend some time reading the page, completing forms etc. On the other hand, AJAX-driven sites request a lot more dynamic pages.
Use Excel or something to work from peak concurrent users to dynamic page generations per second based on wait time, requests per interaction, and build in a buffer - I usually over-provision by 50%.
For instance:
400 concurrent users with a session length of 5 interactions and
2 dynamic pages per interaction means 400 * 5 * 2 = 4000 page requests.
With a 30 seconds wait time, those requests will be spread over 30 * 5 = 150 seconds.
Therefore, your average page requests / second is 4000 / 150 = 27 requests / second.
With a 50% buffer, you need to be able to support a peak of roughly 40 requests / second.
That's not trivial, but by no means exceptional.
Next, set up a performance testing environment whose characteristics you completely understand and can replicate, and can map to the production environment. I usually don't recommend re-creating production at this stage. Instead, reduce your page generations / second benchmark to match the performance testing environment (e.g. if you have 4 servers in production and only 2 in the performance testing environment, reduce by half).
As soon as you start developing, regularly (at least once a week, ideally every day) deploy your work-in-progress to this testing environment. Use a load test generator (Apache Benchmark or Apache JMeter work for me), write load tests simulating typical user journeys (but without the wait time), and run them against your performance test environment. Measure success by hitting your target "page generations / second" benchmark. If you don't hit the benchmark, work out why (Redgate's ANTS profiler is your friend!).
Once you get closer to the end of the project, try to get a test environment that's closer to the production system in terms of infrastructure. Deploy your work, and re-run your performance tests, increasing the load to reflect the "real" pages / second requirement. At this stage, you should have a good idea of the performance characteristics of the app, so you're really only validating your assumptions. It's usually a lot harder and more expensive to get such a "production-like" environment, and it's usually a lot harder to make changes to the software, so you should use this purely to validate, not to do the regular performance engineering work.

Caching architecture for search results in an ASP.NET application

What is a good design for caching the results of an expensive search in an ASP.NET system?
Any ideas would be welcomed ... particularly those that don't require inventing a complex infrastructure of our own.
Here are some general requirements related to the problem:
Each search can produce from zero to several hundred result records
Each search is relatively expensive and time-consuming to execute (5-15 seconds at the database)
Results must be paginated before being displayed at the client to avoid information overload for the user
Users expect to be able to sort, filter, and search within the results returned
Users expect to be able to quickly switch between pages in the search results
Users expect to be able to select multiple items (via checkbox) on any number of pages
Users expect relatively snappy performance once a search has finished
I see some possible options for where and how to implement caching:
1. Cache on the server (in session or App cache), use postbacks or Ajax panels to facilitate efficient pagination, sorting, filtering, and searching.
PROS: Easy to implement, decent support from ASP.NET infrastructure
CONS: Very chatty, memory intensive on server, data may be cached longer than necessary; prohibits load balancing practices
2. Cache at the server (as above), but using serializable structures that are moved out of memory after some period of time to reduce memory pressure on the server
PROS: Efficient use of server memory; ability to scale out using load balancing;
CONS: Limited support from .NET infrastructure; potentially fragile when data structures change; places additional load on the database; significantly more complicated
3. Cache on the client (using JSON or XML serialization), use client-side Javascript to paginate, sort, filter, and select results.
PROS: User experience can approach "rich client" levels; most browsers can handle JSON/XML natively - decent libraries exist for manipulation (e.g. jQuery)
CONS: Initial request may take a long time to download; significant memory footprint on client machines; will require hand-crafted Javascript at some level to implement
4. Cache on the client using a compressed/encoded representation of the data - call back into server to decode when switching pages, sorting, filtering, and searching.
PROS: Minimized memory impact on server; allows state to live as long as client needs it; slightly improved memory usage on client over JSON/XML
CONS: Large data sets moving back and forth between client/server; slower performance (due to network I/O) as compared with pure client-side caching using JSON/XML; much more complicated to implement - limited support from .NET/browser
5. Some alternative caching scheme I haven't considered...
For #1, have you considered using a state server (even SQL server) or a shared cache mechanism? There are plenty of good ones to choose from, and Velocity is getting very mature - will probably RTM soon. A cache invalidation scheme that is based on whether the user creates a new search, hits any other page besides search pagination, and finally a standard timeout (20 minutes) should be pretty successful at weeding your cache down to a minimal size.
References:
SharedCache (FOSS)
NCache ($995/CPU)
StateServer (~$1200/server)
StateMirror ("Enterprise pricing")
Velocity (Free?)
If you are able to wait until March 2010, .NET 4.0 comes with a new System.Caching.CacheProvider, which promises lots of implementations (disk, memory, SQL Server/Velocity as mentioned).
There's a good slideshow of the technology here. However, it is a little bit of "roll your own" - or a lot of it, in fact. But there will probably be a lot of closed and open source providers written for the Provider model when the framework is released.
For the six points you state, a few questions crop up:
What is contained in the search results? Just string data or masses of metadata associated with each result?
How big is the set you're searching?
How much memory would you use storing the entire set in RAM? Or at least having a cache of the most popular 10 to 100 search terms? Also, being smart and caching related searches after the first search might be another idea.
5-15 seconds for a result is a long time to wait for a search so I'm assuming it's something akin to an expedia.com search where multiple sources are being queried and lots of information returned.
From my limited experience, the biggest problem with the client-side-only caching approach is Internet Explorer 6 or 7. Server-only with HTML is my preference, keeping the entire result set in the cache for paging and expiring it after some sensible time period. But you might've tried this already and seen the server's memory getting eaten.
Raising an idea under the "alternative" caching scheme: this doesn't answer your question with a given cache architecture, but rather goes back to the original requirements of your search application.
Even if/when you implement your own cache, its effectiveness can be less than optimal - especially as your search index grows in size. Cache hit rates will decrease as your index grows. At a certain inflection point, your search may actually slow down due to resources dedicated to both searching and caching.
Most search sub-systems implement their own internal caching architecture as a means of efficiency in operation. Solr, an open-source search system built on Lucene, maintains its own internal cache to provide for speedy operation. There are other search systems that would work for you, and they take similar strategies to results caching.
I would recommend that you consider a separate search architecture if your search index warrants it, as caching on a free-text keyword search basis is a complex operation to implement effectively.
Since you say any ideas are welcome:
We have been using the Enterprise Library Caching Application Block fairly successfully for caching result sets from LINQ queries.
http://msdn.microsoft.com/en-us/library/cc467894.aspx
It supports custom cache expiration, so it should cover most of your needs there (with a little bit of custom code). It also has quite a few backing stores, including encrypted backing stores if privacy of searches is important.
It's pretty fully featured.
My recommendation is a combination of #1 and #3:
Cache the query results on the server.
Make the results available as both a full page and as a JSON view.
Cache each page retrieved dynamically at the client, but send a REQUEST each time the page changes.
Use ETags to do client cache invalidation (see the sketch below).
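A rough sketch of the ETag step, assuming classic ASP.NET's HttpCachePolicy; the helper name and the idea of a "result set version" string are made up for illustration.

```csharp
using System.Net;
using System.Web;

public static class EtagHelper
{
    // Returns true if the client's cached copy is still valid and a 304 was sent.
    public static bool TrySend304(HttpContext context, string resultSetVersion)
    {
        string etag = "\"" + resultSetVersion + "\"";
        context.Response.Cache.SetCacheability(HttpCacheability.Private);
        context.Response.Cache.SetETag(etag);

        if (context.Request.Headers["If-None-Match"] == etag)
        {
            context.Response.StatusCode = (int)HttpStatusCode.NotModified; // 304
            context.Response.SuppressContent = true;
            return true;
        }
        return false;   // caller renders/serializes the page of results as usual
    }
}
```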
Have a look at SharedCache - it makes options 1 and 2 pretty easy and works fine in a load-balanced system. Free, open source, and we've been using it for about a year with no issues.
While pondering your options, consider that no user wants to page through data. We force that on them as an artifact of trying to build applications on top of browsers in HTML, which inherently do not scale well. We have invented all sorts of hackery to fake application state on top of this, but it is essentially a broken model.
So, please consider implementing this as an actual rich client in Silverlight or Flash. You will not beat that user experience, and it is simple to cache data much larger than is practical in a regular web page. Depending on the expected user behavior, your overall bandwidth could be optimized because the round trips to the server will get only a tight data set instead of any ASP.NET overhead.

ASP.NET MVC Caching scenario

I'm still yet to find a decent solution to my scenario. Basically I have an ASP.NET MVC website which has a fair bit of database access to make the views (2-3 queries per view) and I would like to take advantage of caching to improve performance.
The problem is that the views contain data that can change irregularly, like it might be the same for 2 days or the data could change several times in an hour.
The queries are quite simple (select... from where...) and not huge joins, each one returns on average 20-30 rows of data (with about 10 columns).
The queries are quite simple at the site's current stage, but over time the owner will be adding more data and the visitor numbers will increase. They aren't large at the moment, but I would be looking at caching as traffic will mostly be coming from Google AdWords etc. and fast-loading pages will be a benefit (apparently).
The site will be using a Microsoft SQL Server 2005 database (but can upgrade to 2008 if required).
Do I either:
Set the caching to the minimum time an item doesn't change for (e.g. cache for, say, 3 minutes) and tell the owner that any changes will take up to 3 minutes to appear?
Find a way to force the cache to clear and reprocess on changes (E.g. if the owner adds an item in the administration panel it clears the relevant caches)
Forget caching all together
Or is there an option that would be suit this scenario?
If you are using Sql Server, there's also another option to consider:
Use the SqlCacheDependency class to have your cache invalidated when the underlying data is updated. Obviously this achieves a similar outcome to option 2.
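For illustration, a minimal sketch of that approach using the polling-based SqlCacheDependency. It assumes the database and table have been enabled for notifications (aspnet_regsql) and that a matching <sqlCacheDependency> entry named "SiteDb" exists in web.config; the table and cache key names are placeholders.

```csharp
using System;
using System.Web;
using System.Web.Caching;

public static class ProductCache
{
    public static object GetProducts(Func<object> loadFromDatabase)
    {
        var cache = HttpRuntime.Cache;
        object products = cache["products"];
        if (products == null)
        {
            products = loadFromDatabase();
            var dependency = new SqlCacheDependency("SiteDb", "Products");
            // The entry is evicted automatically when ASP.NET detects a change to the table.
            cache.Insert("products", products, dependency);
        }
        return products;
    }
}
```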
I might actually have to agree with Agileguy though - your query descriptions seem pretty simplistic. Thinking forward and keeping caching in mind while you design is a good idea, but have you proven that you actually need it now? Option 3 seems a heck of a lot better than option 1, assuming you aren't actually dealing with significant performance problems right now.
Premature optimization is the root of all evil ;)
That said, if you are going to Cache I'd use a solution based around option 2.
You have less opportunity for "dirty" data in that manner.
Kindness,
Dan
The 2nd option is the best. It shouldn't be so hard if the same app edits/caches the data; it can be more tricky if there is more than one app.
If you can't go that way, the 1st might be acceptable too. With some tweaks (e.g. I would try to update the cache silently on another thread when it hits its timeout; see the sketch below) it might work well enough, if the data is allowed to be a bit old.
Never drop caching if it's possible. Everyone knows the "premature optimization..." verse, but caching is one of those things that can increase the scalability/performance of an application dramatically.
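To illustrate the "update the cache silently on another thread" tweak mentioned above, a small self-contained sketch: a timer rebuilds the cached value in the background on a fixed interval, so requests never pay the rebuild cost. The class name and the refresh interval are placeholders.

```csharp
using System;
using System.Threading;

public class SilentRefreshCache<T> : IDisposable
{
    private readonly Func<T> _load;
    private readonly Timer _timer;
    private volatile object _boxed;   // boxed so the reference swap is atomic

    public SilentRefreshCache(Func<T> load, TimeSpan refreshInterval)
    {
        _load = load;
        _boxed = load();                                       // initial fill
        _timer = new Timer(_ => _boxed = _load(), null,        // rebuild in the background
                           refreshInterval, refreshInterval);
    }

    public T Value
    {
        get { return (T)_boxed; }                              // always served from memory
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}
```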
