We have a table in our database which has around 2,500,000 rows (around 3GB). Is it technically possible to view the data in this table in a Silverlight application which queries this data using WCF? Potentially, I see issues with the maximum buffer size and timeout errors. We may need the entire data set to be used for visualization purposes.
Please guide me if there is a practical solution to this problem.
Moving 3GB to a client is not going to work.
"...for visualization purposes."
Better prepare the visualization server-side. That will be slow enough.
Generally in this sort of situation if you need to view individual records then you would use a paging strategy. So your call to WCF would be for a page worth of records and you would display those records and the user would click on a next / previous button or some such.
As for the visualisation you should look to perform some transformation / reduction on the server as 2.5 million records is akin to displaying one data point per pixel on your screen.
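As a rough illustration of that kind of server-side reduction, the sketch below collapses a large series into a fixed number of buckets (one per horizontal pixel, say) and keeps the min, max and mean of each. The names `Bucket`, `Downsampler` and `Reduce` are hypothetical, and the input is assumed to be a list of values already ordered the way they will be drawn.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical summary of one bucket of consecutive values.
public sealed class Bucket
{
    public double Min { get; }
    public double Max { get; }
    public double Mean { get; }

    public Bucket(double min, double max, double mean)
    {
        Min = min;
        Max = max;
        Mean = mean;
    }
}

public static class Downsampler
{
    // Collapses `values` into at most `bucketCount` buckets of consecutive items.
    public static List<Bucket> Reduce(IReadOnlyList<double> values, int bucketCount)
    {
        if (values == null) throw new ArgumentNullException(nameof(values));
        if (bucketCount <= 0) throw new ArgumentOutOfRangeException(nameof(bucketCount));

        var result = new List<Bucket>();
        if (values.Count == 0) return result;

        // Ceiling division so that every value lands in some bucket.
        int size = (values.Count + bucketCount - 1) / bucketCount;

        for (int start = 0; start < values.Count; start += size)
        {
            int end = Math.Min(start + size, values.Count);
            double min = values[start], max = values[start], sum = 0;
            for (int i = start; i < end; i++)
            {
                min = Math.Min(min, values[i]);
                max = Math.Max(max, values[i]);
                sum += values[i];
            }
            result.Add(new Bucket(min, max, sum / (end - start)));
        }
        return result;
    }
}
```

With 2,500,000 rows and 1,024 buckets, each bucket summarizes about 2,442 rows, so the client receives roughly a thousand small objects instead of the whole table.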
First of all, have a look here.
Transferring 3GB of data from disk to disk can take quite a few minutes, let alone sending it across the network. I think you have bigger fish to fry; the WCF limitation is irrelevant here.
So let's assume that after a few minutes or hours you got the data across the wire: where do you store it? Your Silverlight app, if running inside the browser, cannot grow to 3GB (even on a 64-bit machine), and even if it could, it would not make any sense, especially since that amount of data will take a lot more space once transformed into objects.
Here is what I would do:
Get the server to provide snapshots/views of the data that are useful, e.g. summaries, OLAP cubes, ...
For each record, provide the minimum data required.
If you need detail on a record, fetch it in a separate call.
Well, I believe you're not going to show 2.5 million rows in the same listing.
If you develop good paging for the data and the way you query it is optimal, I don't see a problem with WCF.
I agree that querying data through a WCF interface is less efficient than a standalone solution with direct access to the infrastructure, but if you need to host business logic and data for N clients in an SOA solution, or it's a client-server solution, you'll need to be sure that your queries are efficient.
Suggestions:
Use an OR/M. NHibernate is a good choice, since it has a lot of ways of tweaking performance, and paging is made easy by the LINQ provider and the QueryOver API in NHibernate 3.0. It also has a very interesting caching scheme, which will help your application work efficiently against your 2.5-million-row database.
Do caching. NHibernate may help you in this area, but think it through: depending on the client technology (Web, Windows...), you'll find good options for caching presentation views (ASP.NET output caching, for example).
Think about how you're going to serialize objects in WCF: SOAP or JSON? You may be interested in JSON because the serialized objects are usually smaller, which saves network traffic.
If you have questions, just leave a comment!
OK, many users have talked about how to do this technically, but what is the sense in what someone asked you to build, apparently without thinking?
2.5 million rows make no sense in a grid. Zero. Showing 80 rows per page (wide screen, tilted 90 degrees), that would be 31,250 pages' worth of data. You cannot even skip to a specific page. Ignoring load times, even IF (!) you load all that, it just makes no sense to have this amount in a grid. Filter it down, then load what you need page by page. But the key here is to force the user to filter BEFORE even thinking about a grid. And once you have the rows, let's not get into talking about the performance of the grid.
To show you how bad this is: forget the grid. If you assign ONE PIXEL to every data item, you need about 3.2 screens of 1024*768 pixels (786,432 pixels each) to show the data. This is one pixel per item.
So, at the end of the day, even IF you managed to get this working (which is impossible), you would end up with a nonsensical, unusable application.
Related
I'm trying to build a product catalog application in ASP.NET and C# that will allow a user to select product attributes from a series of drop-down menus, with a list of relevant products appearing in a gridview.
On page load, the options for each of the drop-downs are queried from the database, as well as the entire product catalog for the gridview. Currently this catalog stands at over 6000 items, but we're looking at perhaps five or six times that when the application goes live.
The query that pulls this catalog runs in less than a second when executed in SQL Server Management Studio, but takes upwards of ten seconds to render on the web page. We've refined the query as much as we know how: pulling only the columns that will show in our gridview (as opposed to saying select * from ...) and adding the with (nolock) command to the query to pull data without waiting for updates, but it's still too slow.
I've looked into SqlCacheDependency, but all the directions I can find assume I'm using a SqlDataSource object. I can't do this because every time the user makes a selection from the menu, a new query is constructed and sent to the database to refine the list of displayed products.
I'm out of my depth here, so I'm hoping someone can offer some insight. Please let me know if you need further information, and I'll update as I can.
EDIT: FYI, paging is not an option here. The people I'm building this for are standing firm on that point. The best I can do is wrap the gridview in a div with overflow: auto set in the CSS.
The tables I'm dealing with aren't going to update more than once every few months, if that; is there any way to cache this information client-side and work with it that way?
Most of your solution will come in a few forms (none of which have to do with a Gridview):
Good indexes. Create good indexes for the tables that pull this data; good indexes are defined as:
Indexes that store as little information as is actually needed to display the product. The less data stored per row, the more rows fit on each 8K page in SQL Server.
Covering indexes: Your SQL Query should match exactly what you need (not SELECT *) and your index should be built to cover that query (hence why it's called a 'covering index')
Good table structure: this goes along with the index. The fewer joins needed to pull the information, the faster you can pull it.
Paging. You shouldn't ever pull all 6000+ objects at once -- what user can view 6000 objects at once? Even if a theoretical superhuman could process that much data, that's never going to be your median use case. Pull 50 or so at a time (if you really even need that many) or structure your site such that you're always pulling what's relevant to the user, instead of everything (keep in mind this is not a trivial problem to solve).
The beautiful part of paging is that your clients don't even need to know you've implemented paging. One such technique is called "Infinite Scrolling". With it, you can go ahead and fetch the next N rows while the customer is scrolling to them.
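To make the infinite-scrolling idea concrete, here is a minimal sketch of a "give me the next N rows after the last one I already have" call. `Product`, `ProductFeed` and `NextBatch` are hypothetical names, and the in-memory sequence stands in for the database; in a real application the same filter and limit would be expressed in the SQL query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type for the product grid.
public sealed class Product
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class ProductFeed
{
    // Returns up to `batchSize` products whose Id is greater than `lastSeenId`,
    // in Id order. Pass a value below the smallest Id (e.g. 0) for the first batch.
    public static List<Product> NextBatch(IEnumerable<Product> source, int lastSeenId, int batchSize)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        return source
            .Where(p => p.Id > lastSeenId)
            .OrderBy(p => p.Id)
            .Take(batchSize)
            .ToList();
    }
}
```

The page would request the next batch when the user scrolls near the bottom, passing the Id of the last row it already shows; an empty or short batch means the end of the list has been reached.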
If, as you're saying, paging really is not an option (although I really doubt it; please explain why you think it isn't, and I'm pretty sure someone will find a solution), there's really no way to speed up this kind of operation.
As you noticed, it's not the query that's taking long, it's the data transfer. Copying the data from one memory space (sql) to another (your application) is not that fast, and displaying this data is orders of magnitude slower.
Edit: why are your clients "firm on that point"? Why do they think it's not possible otherwise? Why do they think it's the best solution?
There are many options for showing a large set of data in a grid, but they are third-party software.
Try using jQuery/JavaScript grids with AJAX calls. They will help you render a large number of rows on the client. You can even use a cache to avoid querying the database many times.
These are good grids that will help you show thousands of rows in a web browser:
http://www.trirand.com/blog/
https://github.com/mleibman/SlickGrid
http://demos.telerik.com/aspnet-ajax/grid/examples/overview/defaultcs.aspx
http://w2ui.com/web/blog/7/JavaScript-Grid-with-One-Million-Records
I hope it helps.
You can load all the rows into a DataTable on the client using a background thread when the application (web page) starts. Then use only the DataTable to populate your grids etc., so you do not have to hit SQL again until you need to read or write different data. (All the other answers cover the other options.)
I have around 1000 rows of data. On the ASPX page, whenever the user clicks the sort button, the result is sorted by a specific column.
I proposed sorting the result in the SQL query, which is much easier with just an ORDER BY clause.
However, my manager insisted that I store the result in an array and then sort the data within the array, because he thinks that calling the database every time the user clicks the sort button will hurt performance.
Just out of curiosity - Does it really matter?
Also, if we disregard the number of rows, performance wise, which of these methods is actually more efficient?
Well, there are three options:
(1) Sort in the SQL
(2) Sort server-side, in your ASP code
(3) Sort client-side, in your Javascript
There's little reason to go with (2), I'd say. It's meat and drink to a database to sort as it returns data: that's what a database is designed to do.
But there's a strong case for (3) if you want to have a button that the user can click. This means it's all done client-side, so you have no need to send anything to the web server. If you have only a few rows (and 1000 is really very few these days), it'll feel much faster, because you won't have to wait for sending the request and getting a response.
Realistically, if you've got so many things that Javascript is too slow as a sorting mechanism, you've got too many things to display them all anyway.
In short, if this is a one-off thing for displaying the initial page, and you don't want the user to have to interact with the page and sort on different columns etc., then go with (1). But if the user is going to want to sort things after the page has loaded, then (3) is your friend.
Short Answer
Ah... screw it: there's no short answer to a question like this.
Longer Answer
The best solution depends on a lot of factors. The question is somewhat vague, but for the sake of simplicity let's assume that the 1000 rows are stored in the database and are being retrieved by the client.
Now, a few things to get out of the way:
Performance can mean a variety of things in a variety of situations.
Sorting is (relatively) expensive, no matter where you do it.
Sorting is least expensive when done in the database, as the database already has all the necessary data and is optimized for these operations.
Posting a question on SO to "prove your manager wrong" is a bad idea. (The question could easily have been asked without mentioning the manager.)
Your manager believes that you should upload all the data to the client and do all the processing there. This idea has some merit. With a reasonably sized dataset processing on the client will almost always be faster than making a round trip to the server. Here's the caveat: you have to get all of that data to the client first, and that can be a very expensive operation. 1000 rows is already a big payload to send to a client. If your data set grows much larger then you would be crazy to send all of it at once, particularly if the user really only needs a few rows. In that case you'll have to do some form of paging on the server side, sending chunks of data as the user requests it, usually 10 or 20 rows at a time. Once you start paging at the server your sorting decision is made for you: you have no choice but to do your sorting there. How else would you know which rows to send?
For most "line-of-business" apps your query processing belongs in the database. My generalized recommendation: by all means do your sorting and paging in the database, then return the requested data to the client as a JSON object. Please don't regenerate the entire web page just to update the data in the grid. (I've made this mistake and it's embarrassing.) There are several JavaScript libraries dedicated solely to rendering grids from AJAX data. If this method is executed properly your page will be incredibly responsive and your database will do what it does best.
We had a problem similar to this at my last employer. We had to return large sets of data efficiently, quickly and consistently into a DataGridView object.
The solution they came up with was to have a set of filters the user could use to narrow down the query results, and to cap the number of rows returned at 500. Sorting was then done by the program on an array of those objects.
The reasons behind this were:
Most people will not process that many rows; they are usually looking for a specific item (hence the filters).
Sorting on the client side did save the server a bunch of time, especially when there was the potential for thousands of people to be querying the data at the same time.
Performance of the GUI object itself started to become an issue at some point (reason for limiting the returns)
I hope that helps you a bit.
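A minimal sketch of that approach: the query result is filtered and capped once, and later sort clicks re-order the rows already in memory. `Row`, `GridData`, `Load`, `SortBy` and the column names are hypothetical; only the 500-row cap comes from the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type shown in the grid.
public sealed class Row
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}

public static class GridData
{
    public const int MaxRows = 500;

    // Applies the user's filter and caps the result (stand-in for the filtered query).
    public static List<Row> Load(IEnumerable<Row> source, Func<Row, bool> filter)
    {
        return source.Where(filter).Take(MaxRows).ToList();
    }

    // Re-sorts rows already held in memory; no database round trip.
    public static List<Row> SortBy(IEnumerable<Row> rows, string column, bool descending)
    {
        Func<Row, IComparable> key;
        switch (column)
        {
            case "Name": key = r => r.Name; break;
            case "Price": key = r => r.Price; break;
            default: throw new ArgumentException("Unknown column: " + column, nameof(column));
        }
        var ordered = descending ? rows.OrderByDescending(key) : rows.OrderBy(key);
        return ordered.ToList();
    }
}
```

Whether this beats an ORDER BY in the database depends on the factors discussed in the other answers; the sketch only shows that the in-memory variant is small once the row count is bounded.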
From both a data-modeling perspective and an application-architecture perspective, it's "best practice" to put sorting/filtering into the "controller" portion of the MVC pattern. That is directly opposed to the above answer, which several have already voted for.
The answer to the question is really: "It depends"
If the application stays at only one table, no joins, and a low number of rows, then sorting in JavaScript on the client is likely going to win performance tests.
However, since it's already ASPX, you may be preparing for your data/model to expand. Once there are more tables and joins, and if the UI includes a data grid where the choice of sort column changes on a per-client basis, then maybe the middle tier should be handling this sorting for your application.
I suggest reviewing Tom Dykstra's classic Contoso University ASP.NET example, which has been updated with Entity Framework and MVC 5. It includes a section on Sorting, Filtering and Paging. This example shows the value of proper MVC architecture and the ease of implementing sorting/filtering on multiple columns.
Remember, applications change (read: "grow") over time so plan for it using an architecture pattern such as MVC.
I'm looking for a design solution for a pattern that I am going to have to repeat quite a lot throughout a website I am designing. It is going to be ASP.NET MVC front-end, with C# WCF web services connecting using NHibernate to SQL database.
It's a social networking site, so imagine Facebook here to get a conceptual idea. What I'm looking for is an efficient and performant way to return paginated results from large datasets; for example, a user may have 150 emails. I want to return them 10 at a time depending on what page they're on, returning only the 10 that relate to that page rather than loading all 150 items into memory and displaying 10 at a time. I think the user experience is better with a faster initial load, even at the cost of a slightly longer delay when changing pages. After all, when do you look at emails 6 months old? The usual case is that you only care about the first page of results anyway. Similarly, a user may have had a number of interactions since their last login (e.g. your notifications feed on Facebook), but again I only want to load N results at a time. In this instance, rather than having pages, you would click a "Display more" button, which would fetch the next N results and display them with another "Display more" link, and so forth; you can keep clicking until you reach the end of the dataset. I can imagine both would use the same design, though, as they are technically both paginated results, just with different UI output and flow.
Can anyone offer some advice on a good design to use for this, bearing in mind my data retrieval is using NHibernate Queryable or Enumerables? Would I want to be loading all data from DB in one hit then using an iterator pattern to only return N rows from the service layer, keeping the rest of the list held in memory on the server open in the user's session context so if I made another call to retrieve the next N rows, it would be held in place and keep returning N rows until the iterator finished, or would it be best to simply retrieve N rows from the database and return those, holding nothing in session context? I can see how to return top 10 results from Queryable as
var results = (from email in emails where email.UserId == userId select email).Take(10);
But I'm not sure how efficient this is. Is this the fastest way of doing it? Furthermore, I don't see how to start at a certain position; this will always return only the first 10, not the second 10, the third 10, etc.
So I'm a bit unsure how the best way to proceed is and was hoping for some pointers and advice from people who have done something similar. Bearing in mind with my website performance is going to be of the essence so the user experience needs to be pretty sharp and interactive with refreshing new results. Basically if you were trying to simulate a facebook news feed/wall - how would you implement it with the above architecture?
Thanks!
You can use Skip in combination with Take:
// Add an orderby before Skip/Take so that pages stay stable between calls.
var results = (from email in emails where email.UserId == userId select email)
    .Skip((currentPage - 1) * 10)
    .Take(10);
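Building on the Skip/Take query, a stateless service call might look like the sketch below. `PagedResult<T>` and `Paging.GetPage` are hypothetical names, not NHibernate or WCF types, and the query passed in is assumed to be ordered already, because Skip and Take only give stable pages over a deterministic ordering.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical DTO returned to the client for one page of results.
public sealed class PagedResult<T>
{
    public IList<T> Items { get; set; }
    public int Page { get; set; }
    public int PageSize { get; set; }
    public int TotalCount { get; set; }

    // True while there are rows beyond the current page.
    public bool HasMore { get { return Page * PageSize < TotalCount; } }
}

public static class Paging
{
    // `page` is 1-based. `orderedQuery` must already have an OrderBy applied.
    public static PagedResult<T> GetPage<T>(IQueryable<T> orderedQuery, int page, int pageSize)
    {
        if (orderedQuery == null) throw new ArgumentNullException(nameof(orderedQuery));
        if (page < 1) throw new ArgumentOutOfRangeException(nameof(page));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));

        return new PagedResult<T>
        {
            // Only the requested rows are materialized.
            Items = orderedQuery.Skip((page - 1) * pageSize).Take(pageSize).ToList(),
            Page = page,
            PageSize = pageSize,
            // A second, cheap query so the client can render page links.
            TotalCount = orderedQuery.Count()
        };
    }
}
```

The same method can serve both the numbered pages and a "Display more" button: the button simply requests page + 1 while `HasMore` is true, and nothing needs to be kept in the session between calls.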
About the web service: You really should make it a stateless web service. You could use the ASP.NET Web API for this. This enables you to build a RESTful web service.
Do I want to be loading all the data in one hit...
Definitely not, you only want to pull down the records you need, not the ones you may need.
...using an iterator pattern to only return N rows from the service layer, keeping the rest of the list held in memory on the server open in the user's session context...
Scalability goes right out the window with that idea.
...or would it be best to simply retrieve N rows from the database and return those, holding nothing in session context?
Now you're starting to get on the right track...
In general, you want to let the database do as much of the querying as possible, i.e. you don't want to hit the database and then have to further query the results (however, that's not always avoidable). In other words, you want to delegate most, if not all, of the heavy lifting to the database.
You mentioned you are using NHibernate, which is a pretty powerful ORM. The good news is that it does a lot of the work for you in terms of query optimization, caching data, etc. Like most ORMs nowadays, NHibernate uses deferred execution with its queries, so just watch out for things like hitting the database too early, and choose carefully when to eager load data instead of performing multiple queries. There is a lot to learn with NHibernate; if you haven't already, it's worth taking the time to read up on it before diving in, as it will save you a lot of hassle in the long run.
Bearing in mind with my website performance is going to be of the essence so the user experience needs to be pretty sharp and interactive with refreshing new results
In terms of the performance (I assume you mean page load speeds) you would just want to ajaxify your site, i.e. load what needs to be loaded with the page, pull the rest in the background and update the page dynamically. To achieve the "refreshing new results" part you need to look at polling the server and pulling down new data. I am pretty sure Facebook use a technique called long polling, which essentially keeps an active request open with the server for a set amount of time so the data appears to arrive "instantly". Polling is a different ball game altogether though; it's about striking a balance between server load and how "fresh" the data needs to be. That's something you would need to decide yourself, and the answer usually depends on the type of data and the hardware capabilities of the server.
There are some links about it (like this) out there, but I liked this guy's approach. I don't know if I'd use his PagedQueryable, but his IPageable, IPagedEnumerable and PagedEnumerable are really interesting. Besides, his project introduction page may give you some ideas on how to roll your own pagination.
I'm looking for a "best practice" way to handle incoming time series data.
One data point consists for example of time, height, width etc. for every "tick". Is it a good idea to save n data points in-memory with a collection class and later "flush" the points to a database after reaching the limits of the collection?
Or should the data points be directly written to the database in the first place, so that my object can run queries against it?
I know that this is little information about my requirements, so the question is how fast is the data access to a database compared to a hybrid in-memory and database solution.
Say there are at most 500 data points per second to handle and the data has to be calculated somehow on every point incoming. With a pure database solution, one has to run a store query on every incoming point. I guess this is not effective, but I don't know if such a database is able to "listen" and do this fast.
A nice feature for the database would be to send the points to subscribers. Is this possible with SQL Server?
Thanks, Juergen
Putting the "sending to subscribers" requirement aside, don't get into the trap of premature optimization.
I would try the simplest solution first, which is probably just writing the data into the database as it arrives. Then run stress tests. If the performance isn't up to scratch, find the bottlenecks and optimize them out.
Turning to the "sending to subscribers" requirement, this isn't really something which relational database platforms are typically designed for (they are more about storing data and exposing it for on-demand retrieval). A pub-sub type requirement is usually best solved using some kind of message bus. Perhaps take a look at something like NServiceBus.
If it is not multi-user, then holding data points in memory with a collection class is definitely a winner.
If it is multi-user, then I would go for some sort of shared in-memory data structure on the server side and persist it to the db from time to time.
I would say the bigger question is how you plan on storing this in SQL. I would queue the data points in memory for a period of time (1 second?) and then write a single row to the database with a blob or nvarchar field containing all the data for that second, as this will let the database scale further. The row could also contain some summary information about what happened in that second, which you could use when querying the data to reduce load during selects. Of course, this wouldn't be feasible if you want to run direct queries against the individual data points.
It all depends what you plan to do with the data...
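The buffer-then-flush idea from the answers above can be sketched as follows. `DataPoint`, `PointBuffer` and the `flush` callback are hypothetical; the callback is where a batched database write would go, and the time/height/width fields simply mirror the example in the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of one incoming tick.
public struct DataPoint
{
    public DateTime Time;
    public double Height;
    public double Width;
}

public sealed class PointBuffer
{
    private readonly int _capacity;
    private readonly Action<IReadOnlyList<DataPoint>> _flush;
    private readonly object _gate = new object();
    private List<DataPoint> _points;

    public PointBuffer(int capacity, Action<IReadOnlyList<DataPoint>> flush)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        if (flush == null) throw new ArgumentNullException(nameof(flush));
        _capacity = capacity;
        _flush = flush;
        _points = new List<DataPoint>(capacity);
    }

    // Adds a point and hands the batch to the callback once the buffer is full.
    public void Add(DataPoint point)
    {
        List<DataPoint> full = null;
        lock (_gate)
        {
            _points.Add(point);
            if (_points.Count >= _capacity)
            {
                full = _points;
                _points = new List<DataPoint>(_capacity);
            }
        }
        if (full != null) _flush(full); // called outside the lock
    }

    // Flushes whatever is buffered, e.g. from a timer or at shutdown.
    public void Flush()
    {
        List<DataPoint> pending;
        lock (_gate)
        {
            if (_points.Count == 0) return;
            pending = _points;
            _points = new List<DataPoint>(_capacity);
        }
        _flush(pending);
    }
}
```

At 500 points per second, a capacity of 500 works out to roughly one write per second. Points still in the buffer are lost if the process dies, which is part of the trade-off against writing each point directly.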
I am making a member-based web app in ASP.NET MVC 3 and I am trying to plan ahead. At first our user base will not be huge, but as with any software, a sudden volume spike is always a possibility.
Thinking ahead to this scenario, I know that the database is the bottleneck on most web apps. We are using MSSQL 2008 R2. We will have dedicated servers with several client databases; each client has their own database, so if one server begins to bottleneck we can scale vertically or move some of the databases to a new server and begin filling it up.
To access the databases we primarily use LINQ to SQL, and we are currently refactoring some of our code to make use of the IQueryable mechanisms to lazy load content, but each page contains quite a bit of content from various parts of the database.
We also have a few large databases that are used for widgets in the program that rarely change but have millions of rows. The goal with those is to somehow sync them to the primary source and distribute them across several machines and then load balance those servers.
With this layout should I even worry about caching, or will the built-in caching mechanisms in MSSQL be sufficient?
If so, where should I begin? I have looked briefly at AppFabric, but it looks as though it is for Azure only?
Resources:
How to cache data in a MVC application
http://stephenwalther.com/blog/archive/2008/08/28/asp-net-mvc-tip-39-use-the-velocity-distributed-cache.aspx
http://stephenwalther.com/blog/archive/2008/08/29/asp-net-mvc-tip-40-don-t-cache-pages-that-require-authentication.aspx
Lazy loading is a performance killer. It's better to load the entire object graph with one join than to lazy load other properties. This is especially the case with a list of objects: if you iterate, you'll end up lazy loading for each item in the list. Furthermore, every call to the db has overhead. Fewer calls = better performance.
SO was a top 1000 website before it needed two database servers. I think you'll be ok.
If your revenue model says "each client will have its own database", then your scaling issues should be really easy to solve. It sounds like you already have a plan to scale up with more servers as your client base increases. What's the problem?
Caching on the web tier is usually the first scaling fix you'll have to worry about. You probably don't need to do a fresh db call with each page request.
Overall this sounds like a lot of premature optimization. Your traffic hasn't reached a point where you need to be worried about scaling. Make these kinds of decisions at the last second possible.
The database cache is different to most caches: it can of course load used data into memory and re-use query plans, but that isn't really a cache as such.
AppFabric is definitely not just Azure; after all, if it was, you wouldn't be able to install it (and use it) locally :) But in truth there is little between AppFabric, redis and memcached (the latter lacks persistence, of course).
But I think you should initially look at using the inbuilt ASP.NET caching; both data caching via HttpContext.Cache, and caching of entire responses (or, in MVC 3, partials). Obviously you should have a broad idea of what data is used heavily by lots of requests, and is safe to re-use: cache that!
Just make sure you treat all cached data as immutable (if you need to update the cache, re-add a modified value; don't modify the existing objects). The reason: it won't work the same if you start needing distributed caching, as that uses serialization, and any changes you make won't be seen by the next request.
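To illustrate the "re-add, don't modify" rule, the sketch below stores an immutable value and replaces it with a new instance on update. `UserProfile` and `ProfileCache` are hypothetical, and a `ConcurrentDictionary` stands in for `HttpContext.Cache` or a distributed cache so that the example stays self-contained.

```csharp
using System;
using System.Collections.Concurrent;

// Immutable: all state is set in the constructor and never changes.
public sealed class UserProfile
{
    public int Id { get; }
    public string DisplayName { get; }

    public UserProfile(int id, string displayName)
    {
        Id = id;
        DisplayName = displayName;
    }

    // Returns a modified copy instead of changing this instance.
    public UserProfile WithDisplayName(string displayName)
    {
        return new UserProfile(Id, displayName);
    }
}

public static class ProfileCache
{
    // Stand-in for the real cache (an assumption made for this sketch).
    private static readonly ConcurrentDictionary<int, UserProfile> Store =
        new ConcurrentDictionary<int, UserProfile>();

    public static UserProfile GetOrLoad(int id, Func<int, UserProfile> load)
    {
        return Store.GetOrAdd(id, load);
    }

    // Updating means putting a new value under the same key.
    public static void Rename(int id, string displayName)
    {
        UserProfile current;
        if (Store.TryGetValue(id, out current))
        {
            Store[id] = current.WithDisplayName(displayName);
        }
    }
}
```

Because every update goes through an explicit write to the cache, the same code keeps working if the store is later swapped for one that serializes values, where in-place mutation of a previously fetched object would silently be lost.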