I have been working with ASP.NET Web API over recent weeks with great success. It has really helped me produce an interface for mobile clients to program against over HTTP.
I reached a point where I need some assistance.
I have a new endpoint which will query a database and could return 100K results. I am using OData to filter the data and return a paginated set of it.
As this could happen for multiple requests, I am concerned about performance. Returning 100K records from the database every time is not ideal, so I have some ideas.
The first one is to cache the 100K results and let OData do its magic on that every time. I am working with the AppFabric distributed cache as it's a load-balanced environment. However, caching that amount of data in AppFabric could cause memory complications, so I think I am best avoiding this.
Next option is to forget about the magic of OData and send the filters I use in to the database and return only the required data each time. So in other words hit the db every time.
I could look at using a caching handler like the version outlined in this article to cache in the HTTP cache -> http://byterot.blogspot.ie/2012/06/aspnet-web-api-caching-handler.html The drawback of this is that if the data gets updated via another system, which it may, the cached data is not expired.
Any other tips as to how I might handle this scenario: a large amount of data, filtered with OData, in conjunction with Web API?
This is a question that's likely to result in a wide variety of answers. That said, let me put on my pre-MSFT hat and give you my two cents.
A lot of architecture questions are best answered with the consultant's answer, "It depends." The answer depends in your case on a few things specifically. Some developers have a problem with caching layers because there are additional things to think about. An ACID-compliant database buys you a lot of insurance that you have at least a very finite amount of eventual consistency.
If it were me making this decision, I would be considering a few things:
How many rows am I returning on a regular basis?
Are they the same rows over and over?
How big is that in memory? (100k is really not that many rows; you're right about not wanting those 100k rows to hit the disk every time, but it's probably not a problem to keep them all in memory; SQL Server would probably do this for you anyway.)
What am I willing to deal with re: eventual consistency? Do I want some other software to deal with it? (What frequently scares people about caches are things like ensuring that invalidation and insertion get done properly and consistently from different applications/different places in the application.)
Given the information you've already provided (tiered architecture, willingness to try a distributed cache) I think you should pursue a caching layer. There are lots of good caches out there. AppFabric worked fine for us before I worked at Microsoft, but I've also dealt with a variety of other caching layers as well.
Assuming you use Entity Framework, the best option would be to return EF's IQueryable directly. This way the magic of OData will work directly against your database: query options such as $filter, $top and $skip will be mapped directly to your SQL query.
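For example, a minimal sketch of such a controller, assuming an EF context named AppDbContext with an Orders set (both names are placeholders); depending on your Web API/OData package version the attribute is [Queryable] or [EnableQuery]:

```csharp
using System.Linq;
using System.Web.Http;

// Sketch only: AppDbContext and Order are assumed types from your EF model.
public class OrdersController : ApiController
{
    private readonly AppDbContext _db = new AppDbContext();

    // [Queryable] (or [EnableQuery] in newer OData packages) applies $filter,
    // $orderby, $top and $skip to the IQueryable before it is executed, so the
    // filtering and paging end up in the SQL sent to the database.
    [Queryable]
    public IQueryable<Order> Get()
    {
        return _db.Orders.AsNoTracking();
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _db.Dispose();
        base.Dispose(disposing);
    }
}
```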
The best way is to use a distributed cache, which you are already doing. But the cache provider you are using, i.e. AppFabric, has some limitations; by limitations I mean feature limitations. Check out NCache, which is a mature and feature-rich third-party distributed cache provider.
If you want to understand the differences between NCache and AppFabric, check out the YouTube link below, FYI:
http://www.youtube.com/watch?v=3CPi1QlskrU
The caching that I have pointed out in the blog http://byterot.blogspot.ie/2012/06/aspnet-web-api-caching-handler.html applies to HTTP caching, also known as output caching. The data itself is not actually cached on the server but on the client or on mid-stream cache servers, so it is not suitable for what you have in mind.
A friend and I are developing a web application using ASP.NET MVC 4 and Entity Framework 6. We implemented the Repository pattern, and of course the Entity Framework context is initialized and disposed each time a user makes a request to access data in the database, so Entity Framework's first-level cache is not exploited.
I want to implement second-level caching in Entity Framework, but he says it is not necessary and that we can use the ASP.NET cache. My questions are:
When to use each type of cache?
How do they compare in terms of speed?
When to use each type of cache?
There are other performance-improvement patterns besides writing your own EF provider, which is what you would have to do in order to implement a second-level cache. My advice (and that's all it is, advice) would be to never implement an EF second-level cache.
Getting caching right is very difficult, especially cache invalidation.
If your ultimate goal is to deliver web pages quickly, the ASP.NET cache (or OutputCache) is one way to achieve that. However, you then have to choose when to invalidate the cache, which again can be difficult.
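For instance, a minimal output-caching sketch; the controller, repository abstraction and duration here are purely illustrative:

```csharp
using System.Web.Mvc;

public class ProductsController : Controller
{
    private readonly IProductRepository _repository; // hypothetical data-access abstraction

    public ProductsController(IProductRepository repository)
    {
        _repository = repository;
    }

    // Cache the rendered result for 60 seconds, varied by the id parameter.
    // Picking the duration and deciding how much staleness you can tolerate
    // is exactly the invalidation problem mentioned above.
    [OutputCache(Duration = 60, VaryByParam = "id")]
    public ActionResult Details(int id)
    {
        return View(_repository.GetProduct(id));
    }
}
```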
How do they compare in terms of speed?
Caches usually store data in RAM because that's the fastest way to get at data physically. Ultimately, though, speed is going to depend. What cache provider are you using? Is your deployment load-balanced, and if so, do the servers share the same cache? How do they access it? When you are dealing with the web, you have to consider that the data will be going over the network, so all of those latency issues play a factor too (payload size, hops, etc.).
If performance is really a problem, you may want to look into patterns that use tools like Redis or other NoSQL providers to store pre-computed, denormalized sets of your data for faster access. You can also go outside of EF and craft custom SQL for the data access points that are giving you particular performance issues.
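Since you are already on EF 6, one lightweight way to run hand-written SQL for a hot spot without leaving the context is Database.SqlQuery<T>; a rough sketch (the table, columns and OrderSummary type are assumptions):

```csharp
using System.Linq;

// Sketch only: OrderSummary is a plain class whose property names match the
// selected column names; dbContext is your existing DbContext instance.
var summaries = dbContext.Database
    .SqlQuery<OrderSummary>(
        @"SELECT o.Id, o.CustomerId, o.Total
          FROM dbo.Orders AS o
          WHERE o.CustomerId = @p0",
        customerId)   // positional parameter bound to @p0
    .ToList();
```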
I really don't think you are going to get any other answers that don't say "it depends" one way or another.
I normally use SQL Server and C# for all the projects I do; however, I am looking at a project that could potentially grow to billions of rows of data, and I don't feel comfortable doing this in SQL Server.
The data I will be storing is
datetime
ipAddress
linkId
possibly other string related data
I have only ever dealt with relational databases before, and hence was looking for some guidance on what database technology would be best suited for this type of data storage: one that could scale, and do so at a low cost (compared to sharding SQL Server).
I would then need to pull this data out based on linkId.
Also, would I be able to do ordering within the query to the DB, or would that best be done in the application?
EDIT: It will be cloud based. Hence I was looking at SQL Azure, which I have used extensively, however it just starts causing issues as the row count goes up.
Since you are looking for general guidance, I feel it is ok to provide an answer that you have prematurely dismissed ;-). Microsoft SQL Server can definitely handle this situation (in the generic sense of having a table of those fields and billions of rows). I have personally worked on a data warehouse that had 4 nodes, each of which had the main fact table holding 1.2 to 1.5 billion rows (and growing) and responded to queries quickly enough, despite some aspects of the data model and indexing that could have been done better. It is a web-based application with many users hitting it all day long (though some periods of the day are much heavier than others). Also, that fact table was much wider than the table you are describing, unless that "possibly other string related data" is rather large (but there are ways to properly model that as well). True, the free Express edition might not meet your needs, but Standard Edition likely would, and it is not super expensive. Enterprise has a nice feature for doing online index rebuilds, but that alone might not warrant the huge jump in license fees.
Keep in mind that with little to no description of what you are actually trying to accomplish with this data, it is hard for me to say that MS SQL Server will definitely meet your needs. But, given that you seemed to have ruled it out entirely on the basis of the large number of rows you might possibly get, I can at least speak to that situation: with good data modeling, good index design, and regular index maintenance, MS SQL Server can definitely handle billions of rows. Now, whether or not it is the best choice for your project depends on what you are trying to do, what the client is comfortable with maintaining, etc.
Good luck :)
EDIT:
When I said (above) that the queries came back "quickly enough", I meant anywhere from 1 to 90 seconds, depending on various factors. Keep in mind that these were not simple queries, and in my opinion, several improvements could be made to the data modeling and index strategy.
I intentionally left out the Table Partitioning feature not only because it is only in Enterprise Edition, but also because it is more often misunderstood and hence misused than understood and used properly. Table/Index partitioning in SQL Server is not a means of "sharding".
I also did not mention Column Store indexes because they are only available in Enterprise Edition. However, for projects large enough to justify the cost, Column Store indexes are certainly worth investigating. They were introduced in SQL Server 2012 and came with the restriction that the table could not be updated once the Column Store index was created. You can get around that, to a degree, using Table Partitioning, but in SQL Server 2014 that restriction will be removed.
Given that this needs to be cloud-based and that you use .Net / C#, if you really are only talking about a few tables (so far just the stated one and the implied "Link" table--source of LinkID) and hence might not need relationships or some of the other RDBMS features, then one option is to use Amazon's DynamoDB. DynamoDB is part of AWS (Amazon Web Services) and is a NoSQL database. Development and even the initial stage of rolling out a project are made a bit easier by their low-end, free tier. As of 2013-11-04, the main DynamoDB page states that:
AWS Free Tier includes 100MB of Storage, 5 Units of Write Capacity, and 10 Units of Read Capacity with Amazon DynamoDB.
Here is some documentation: Overview, How to Query with .Net, and general .Net SDK.
BE AWARE: When looking into how much you think it might cost, be sure to include related AWS pieces, such as Network usage, etc.
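As a very rough sketch of what pulling rows by linkId might look like with the AWS SDK for .NET document model (the table name, key schema and attribute names are assumptions, not something from your project):

```csharp
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.DocumentModel;

// Assumes a DynamoDB table whose hash key is "linkId"; region and credentials
// are picked up from your AWS configuration.
var client = new AmazonDynamoDBClient();
var table = Table.LoadTable(client, "LinkHits");   // "LinkHits" is a hypothetical table name

var filter = new QueryFilter("linkId", QueryOperator.Equal, 12345);
var search = table.Query(filter);

do
{
    foreach (Document hit in search.GetNextSet())
    {
        // e.g. hit["datetime"], hit["ipAddress"]; attribute names are placeholders
    }
} while (!search.IsDone);
```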
I have a piece of software that does heavy processing based on some files.
I have to query some tables in SQL Server during that process, and this is killing the DB and the application's performance (other applications use the same tables).
After optimizing queries and code I am getting better results, but not enough. After some research I reached a solution: caching some query results. My idea is to cache the rows of one specific table (identified as the overhead) that the file being processed needs.
I was thinking of using AppFabric Caching (I'm on the MS stack). I made some tests, and it has high memory usage for small objects (the AppFabric caching service uses ~350 MB of RAM without any objects). But I also need to run some queries against this cached table (like searching by last name, SSN, birth date, etc.).
My second option is MongoDB as a cache store. I've researched this, and most of the people I read recommend using memcached or Redis, but I'm using Windows servers and those are not officially supported.
Is using Mongo as a cache store a good approach in this case? Or is AppFabric Caching + tag search better?
It is hard to tell which is better because we don't know enough about your bottlenecks. A lot depends on the characteristics of the data you're discussing. If the data is fairly static, is not requested constantly, but is time-consuming to compile, a good solution might be a materialized view. If the data is requested frequently, then you are better off caching it on some server (e.g. AppFabric).
There are many techniques and possibilities. But you really need to think of the network traffic, demand, size, etc, etc. And it is hard to answer this here without knowing all the details.
It looks like you are on the right track, but maybe all you need is just a parameterized query. Hard to tell. But I would add a materialized view to the roster you just posted. Maybe all you need is to build this view from the data you need and simply access its contents.
My question to you would be: what are your long-term goals or estimates for your application? If this is the highest load you are going to experience, then tuning the DB or using MVL would be an answer. But the long-term solution to this is distributed caching, and you are already thinking along those lines. Your data requirement is what we'd call "reference data" or "lookup data", and once you are executing multiple lookups with limited DB resources there will be performance issues and your DB will become a performance bottleneck.
So the solution, which you are already thinking of, is to cache this "reference" data so you don't need to go to the database, while at the same time keeping the cache synchronized with the database.
I wouldn't be too sure about AppFabric, as it will have the same support issues you mention. What is your budget like? Could you consider spending on a caching solution like NCache?
I am making a member-based web app in ASP.NET MVC 3 and I am trying to plan ahead. At first our user base will not be huge, but as with any software the potential for a sudden volume spike is always a possibility.
Thinking ahead to this scenario, I know that the database is the bottleneck area on most web apps. We are using MSSQL 2008 R2. We will have dedicated servers with several client databases; each client has their own database, so if one server begins to bottleneck we can scale vertically, or move some of the databases to a new server and begin filling it up.
To access the databases we primarily use LINQ to SQL, and we are currently refactoring some of our code to make use of the IQueryable mechanisms to lazy-load content, but each page contains quite a bit of content from various parts of the database.
We also have a few large databases that are used for widgets in the program that rarely change but have millions of rows. The goal with those is to somehow sync them to the primary source and distribute them across several machines and then load balance those servers.
With this layout should I even worry about caching, or will the built-in caching mechanisms in MSSQL be sufficient?
If so, where should I begin? I have looked briefly at AppFabric, but it looks as though it is for Azure only?
Resources:
How to cache data in a MVC application
http://stephenwalther.com/blog/archive/2008/08/28/asp-net-mvc-tip-39-use-the-velocity-distributed-cache.aspx
http://stephenwalther.com/blog/archive/2008/08/29/asp-net-mvc-tip-40-don-t-cache-pages-that-require-authentication.aspx
Lazy loading is a performance killer. It's better to load the entire object graph with one join than to lazy-load the other properties. This is especially the case with a list of objects: if you iterate, you'll end up issuing a lazy-load query for each item in the list. Furthermore, every call to the db has overhead. Fewer calls = better performance.
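In LINQ to SQL (which you mention using), the usual way to pull the graph in one go is DataLoadOptions.LoadWith; a sketch, with Customer/Orders standing in for your own entities:

```csharp
using System.Data.Linq;

// Sketch: eagerly load each Customer's Orders in the same query instead of
// issuing one extra query per customer while iterating.
using (var context = new MyDataContext()) // MyDataContext is your LINQ to SQL context
{
    var loadOptions = new DataLoadOptions();
    loadOptions.LoadWith<Customer>(c => c.Orders);
    context.LoadOptions = loadOptions;

    foreach (var customer in context.Customers)
    {
        // customer.Orders is already populated; no lazy-load round trip here.
    }
}
```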
SO was a top 1000 website before it needed two database servers. I think you'll be ok.
If your revenue model says "each client will have its own database", then your scaling issues should be really easy to solve. Sounds like you already have a plan to scale up with more servers as your client base increases. What's the problem?
Caching on the web tier is usually the first scaling fix you'll have to worry about. You probably don't need to do a fresh db call with each page request.
Overall this sounds like a lot of premature optimization. Your traffic hasn't reached a point where you need to be worried about scaling. Make these kinds of decisions at the last possible second.
The database cache is different from most caches - it can of course load used data into memory and re-use query plans, but that isn't really a cache as such.
AppFabric is definitely not just Azure; after all, if it was, you wouldn't be able to install it (and use it) locally :) but in truth there is little between AppFabric, Redis and memcached (the latter lacks persistence, of course).
But I think you should initially look at using the built-in ASP.NET caching: both data caching via HttpContext.Cache, and caching of entire responses (or, in MVC 3, partials). Obviously you should have a broad idea of what data is used heavily by lots of requests and is safe to re-use: cache that!
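A minimal cache-aside sketch against the built-in cache (the key, the 10-minute expiration, and the Widget/LoadWidgetsFromDatabase names are placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.Web;
using System.Web.Caching;

public static IList<Widget> GetWidgets()
{
    const string cacheKey = "widget-list"; // hypothetical key

    // HttpRuntime.Cache is the same cache exposed as HttpContext.Cache,
    // but usable outside of a request as well.
    var cached = HttpRuntime.Cache[cacheKey] as IList<Widget>;
    if (cached != null)
        return cached;

    IList<Widget> widgets = LoadWidgetsFromDatabase(); // hypothetical data-access call

    HttpRuntime.Cache.Insert(
        cacheKey,
        widgets,
        null,                              // no CacheDependency
        DateTime.UtcNow.AddMinutes(10),    // absolute expiration (illustrative)
        Cache.NoSlidingExpiration);

    return widgets;
}
```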
Just make sure you treat all cached data as immutable (if you need to update the cache, re-add a modified value; don't modify the existing objects). Reason: it won't work the same if you start needing to use distributed caching, as that uses serialization, and any changes you make won't be seen by the next request.
Background: we've got a number of server processes and client apps that are used entirely internally, in a fairly controlled environment. We capture a significant amount of data every day that goes into a couple of database machines. Most everything is C#, with a few C++ apps.
Just about every app has some basic (if not extensive) dependence on database data, whether it's for historical data, daily-calculated values, or assorted parameters. As the whole environment has gotten a bit more sprawling, I've been wondering about the sense of sticking an intermediary in between all the client and server apps and the database, a sort of "database data broker". Any app that needs values from the db would make a request to the data broker, instead of calling a DLL wrapper function that calls a stored proc.
One immediate downside is that the data would make two trips across the network: from the db to the broker, and from the broker to the calling app. That seems like poor form, but the amount of data in each request would be small enough that I'm OK with it as far as performance goes.
One (seeming) upside is that it would be trivial to set up a test environment, as it would entail just setting up a test data broker, with no maintaining of db connection strings locally anywhere else. Also, I've been pondering creating a mini request language so you wouldn't have to enumerate functions for each dataset you might request (instead of GetX() and GetY(), there would be Get("name = X")).
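Something like this rough sketch of the broker contract, just to make the idea concrete (all names invented):

```csharp
using System.Data;

// Hypothetical broker interface: callers ask for a named dataset plus criteria,
// and the broker is the only component that knows connection strings and procs.
public interface IDataBroker
{
    DataTable Get(string dataset, string criteria);   // e.g. broker.Get("DailyValues", "name = X")
    T GetScalar<T>(string dataset, string criteria);
}
```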
Am I over-engineering this, or is it possibly a worthy architecture?
Edit: thanks for all the great comments so far; great food for thought.
It depends on what you're trying to accomplish with it. According to Rocky Lhotka, you should only add a tier if you are forced to, kicking and screaming all the way.
I agree with him: don't tier unless you need to. I think there are valid reasons to add additional tiers, usually for purposes of security, scalability and maintainability. The question becomes: is yours a valid reason?
It looks like the major reason is maintainability. Does it outweigh the benefits you get by not having the tier?
only you can answer these:
what are the benefits of doing this?
what are the problems/risks of doing this?
do you need this to make testing easier or even possible?
if you make this change and when it goes live and crashes will you be fired?
if you make the changes and it goes live will you get a promotion?
etc...
As the former architect of a system that also used a database heavily as a "hub," I can say that there are several drawbacks that you should be aware of. Our system used databases:
As a transaction store (typical OLTP stuff)
As a staging queue (submitted but unprocessed transactions)
As a historical data store (results of processed transactions)
As an interoperation layer (untranslated commands or transactions issued from other systems)
One of the major drawbacks is ownership cost. When your databases become the single point of failure for so many types of operations, it becomes necessary to ensure that they are all hosted in high-availability environments. This is not only expensive from a hardware perspective, but it is also expensive to support deployments to HA environments, since developers typically have very limited visibility into the internals.
A second drawback is that you have to seriously design integrity in to all of your tables. In a typical SOA environment, you have complete control over how data is modified. When you expose it through database tables, you must consider that any application with the right credentials will have the ability to modify data. Because of this, you must carefully consider utilitarian implementations of constraints. If you had a single service managing persistence, you could be much looser in constraints on the database and enforce them in code.
Third, if you ever want to expose any functionality that the database tables currently allow you to provide to outside parties, you must write service code anyway, so you might be better served doing it strategically as opposed to reacting to requests.
Fourth, UI interaction directly with the data layer creates security risks, especially if the client is a thick client.
Finally, writing code that responds to events (service calls) is much easier than polling code. Typically, organizations that rely heavily on database polling end up reinventing the wheel every time a new project requires a new "monitoring service." It can be avoided by creating a "framework," but those have their own pitfalls (primarily around prescription versus adoption).
This is just a laundry list of problems I have encountered. It's not necessarily meant to dissuade you from using databases for these functions, but it helps to know the dangers ahead of time so you can at least plan for them if they ever do become issues.
EDIT
Just thought of another scenario that caused us pains. Versioning your changes can be difficult. For example, if you need to change the shape of a table (normalize/denormalize), it has a cascading effect if multiple applications rely on it. In a SOA scenario, it is much easier, because you can keep your old API, change the internal interaction so that it works with the changed tables, and allow consumers to migrate to the new version on their own schedule.
A data broker sounds like a really good way to abstract out the multiple data sources for your apps. It would be easy to consolidate, change repositories, or otherwise move data around if needed in the future.
I may be misunderstanding something, but it seems to me like you should consider some kind of entity framework, that is, a framework you can use to "map" your interaction with the db to domain objects. That way you work locally on domain objects that get filled from your db, and when it is time to persist the state of your objects back to the database, the framework handles all the connections back and forth. This way you can also easily mock these domain objects for unit testing without needing a db connection.
Check out NHibernate for a good entity framework alternative.
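For instance, a small abstraction along these lines (names invented) lets the apps work against domain objects and makes it easy to substitute an in-memory fake in unit tests:

```csharp
using System.Collections.Generic;

// Hypothetical repository contract the client apps would code against; the real
// implementation sits on top of NHibernate or EF, while the test double below
// never touches a database. Parameter is a placeholder domain class with a Name property.
public interface IParameterRepository
{
    Parameter GetByName(string name);
    void Save(Parameter parameter);
}

public class FakeParameterRepository : IParameterRepository
{
    private readonly Dictionary<string, Parameter> _store =
        new Dictionary<string, Parameter>();

    public Parameter GetByName(string name)
    {
        return _store[name];
    }

    public void Save(Parameter parameter)
    {
        _store[parameter.Name] = parameter;
    }
}
```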
If you already have the database-related know-how, I think it's not a bad decision.
Good things that I can think of:
If the data model is consistent, you can plug in new tools easily without making any changes to the other apps.
Maybe you can keep the database running more reliably than your apps, so if one of them fails, the others can still keep working.
You can make backups and perform rollbacks using the database tools.
You can do emergency fixes by manipulating the data directly with SQL or some visual tool.
But if you have to learn new frameworks along the way, maybe the benefits are not worth the extra initial effort.
"any app that needs values from the db makes a request to the data broker"
When database technology was being invented over 40 years ago, the people doing that inventing had ideas along the lines of "any app that needs values from the db makes a request to the dbms".
Have you ever pondered the possibility that YOU ALREADY HAVE a "data broker", and that there might be very little added value in creating a second one of your own?