I am designing a database and I would like to normalize it. In one query I will be joining about 30-40 tables. Will this hurt the website's performance if it ever becomes extremely popular? This will be the main query and it will be called about 50% of the time. In the other queries I will be joining about two tables.
I have a choice right now to normalize or not to normalize, but if normalization becomes a problem in the future I may have to rewrite 40% of the software, and that could take me a long time. Does normalization really hurt in this case? Should I denormalize now, while I have the time?
I quote: "normalize for correctness, denormalize for speed - and only when necessary"
I refer you to: In terms of databases, is "Normalize for correctness, denormalize for performance" a right mantra?
HTH.
When performance is a concern, there are usually better alternatives than denormalization:
Creating appropriate indexes and statistics on the involved tables
Caching
Materialized views (Indexed views in MS SQL Server)
Having a denormalized copy of your tables (used exclusively for the queries that need it), in addition to the normalized tables that are used in most cases. This requires writing synchronization code, which could run either as a trigger or as a scheduled job, depending on how fresh the data needs to be; a rough sketch of such a job follows below.
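To make that last option concrete, here is a minimal sketch of what a scheduled refresh job could look like from application code. The table and column names (Report_Listing, Items, Categories, Users) are invented for illustration and the SQL assumes SQL Server; treat it as the shape of the idea, not a finished implementation.

```csharp
// Rough sketch only: refresh a denormalized reporting table from the
// normalized source tables. Report_Listing, Items, Categories and Users
// are made-up names; add or remove joins to match your real schema.
using System.Data.SqlClient;

public static class DenormalizedRefreshJob
{
    public static void Refresh(string connectionString)
    {
        const string sql = @"
            BEGIN TRANSACTION;
            DELETE FROM Report_Listing;      -- throw away the stale copy
            INSERT INTO Report_Listing (ItemId, ItemName, CategoryName, OwnerName)
            SELECT i.ItemId, i.Name, c.Name, u.DisplayName
            FROM   Items i
            JOIN   Categories c ON c.CategoryId = i.CategoryId
            JOIN   Users u      ON u.UserId     = i.OwnerId;
            COMMIT TRANSACTION;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            // Reads hit Report_Listing; writes still go to the normalized tables.
            command.ExecuteNonQuery();
        }
    }
}
```

Run it from a scheduled task (or SQL Server Agent) at whatever interval matches how stale the reporting data is allowed to be.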
Normalization can hurt performance. However this is no reason to denormalize prematurely.
Start with full normalization and then you'll see if you have any performance problems. At the rate you are describing (1000 updates/inserts per day) I don't think you'll run into problems unless the tables are huge.
And even if you do, there are tons of database optimization options (indexes, prepared statements and stored procedures, materialized views, ...) that you can use before resorting to denormalization.
Maybe I'm missing something here, but if your architecture requires you to join 30 to 40 tables in a single query, and that query is the main use of your site, then you have larger problems.
I agree with the others: don't prematurely optimize your site. However, you should optimize your architecture to account for your main use case. A 40-table join for a query run over 50% of the time is not optimized, IMO.
Don't make early optimizations. Denormalization isn't the only way to speed up a website. Your caching strategy is also quite important and if that query of 30-40 tables is of fairly static data, caching the results may prove to be a better optimization.
Also, take into account the ratio of reads to writes. If you are doing approximately 10 reads for every insert or update, you could say that the data is fairly static, hence you should cache it for some period of time.
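For instance, in an ASP.NET site the built-in cache can hold the result of the heavy query for a few minutes. This is only a sketch: the cache key, the five-minute window and the LoadDashboardFromDatabase/DashboardRow names are all made up.

```csharp
// Sketch: serve a mostly-static, expensive query result from the ASP.NET cache.
// LoadDashboardFromDatabase, DashboardRow and the cache key are placeholders.
using System;
using System.Collections.Generic;
using System.Web;
using System.Web.Caching;

public static class DashboardCache
{
    private const string CacheKey = "dashboard-rows";

    public static IList<DashboardRow> GetDashboard()
    {
        var cached = HttpRuntime.Cache[CacheKey] as IList<DashboardRow>;
        if (cached != null)
            return cached;                          // no 30-40 table join on this request

        IList<DashboardRow> rows = LoadDashboardFromDatabase();   // the existing big query
        HttpRuntime.Cache.Insert(
            CacheKey, rows, null,
            DateTime.Now.AddMinutes(5),             // tune to how stale the data may be
            Cache.NoSlidingExpiration);
        return rows;
    }

    // Placeholder for the real data-access call.
    private static IList<DashboardRow> LoadDashboardFromDatabase()
    {
        return new List<DashboardRow>();
    }

    public class DashboardRow { /* columns of the big query */ }
}
```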
If you end up denormalizing your schema, your writes will also become more expensive and potentially slow things down as well.
Really analyze your problem before making too many optimizations, and wait to see where the bottlenecks in your system really are; you might end up being surprised by what you should optimize in the first place.
What is the most common and easiest-to-implement solution to improve speed for a SQL Server 2008 R2 database and a .NET 3.5 application?
We have an application with the following attributes:
- Small number of simultaneous clients (~200 at most).
- Complex math operations on the SQL Server side.
- We are imitating something like Oracle's row-level security (thus using TVFs and stored procedures instead of querying tables directly).
- The main problem is that users perform a high volume of updates/inserts/deletes/calculations, and they freak out because they have to wait for pages to reload while those actions complete.
The questions I need clarification on are as follows:
What is faster: returning the whole dataset from SQL Server and performing the math functions on the C# side, or performing the calculation functions on the SQL side (and thus not returning the extra columns)? Or is it purely hardware dependent?
Will caching improve performance (for example, if we add a Redis cache)? Or are caching solutions only worthwhile for a large number of clients?
Is it bad practice to pre-calculate some of the data and store it somewhere in the database (so that when a user requests it, it has already been calculated)? Or is this what caching is supposed to do? If it is not bad practice, how do you configure SQL Server to do the calculations when resources are available?
How can caching improve performance if it still needs to go to the database to see whether any records were updated?
General suggestions and comments are also welcome.
Let's separate the answer into two parts: the performance of your query execution, and caching to improve that performance.
I believe you should start by addressing the load on your SQL server and optimizing the processes running on it as much as possible; this should remove most of the need to implement any caching.
From your question it appears that you have a system used both for transactional processing and for aggregations/calculations. This often results in conflicts when the two kinds of task lock each other's resources: a long query performing math operations may lock or hold an object required by the UI.
Optimizing these workloads to run side by side and improving query efficiency is the key to increased performance.
To start, I'll take your questions in order. What is faster? It depends on the actual aggregation you are performing: if you are dealing with set operations, i.e. SUM/AVG of a column, keep it in SQL; on the other hand, if you find yourself writing a cursor in the procedure, move that work to C#. Cursors will kill your performance!
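As a rough illustration of "keep set operations in SQL" (the Transactions table, its columns and the parameter are made-up names): the server returns one number instead of streaming every row for C# to add up.

```csharp
// Illustration only: Transactions/Amount/AccountId are made-up names.
// The aggregation runs on the server; only a single value crosses the wire.
using System;
using System.Data.SqlClient;

public static class AccountTotals
{
    public static decimal TotalForAccount(string connectionString, int accountId)
    {
        const string sql =
            "SELECT SUM(Amount) FROM Transactions WHERE AccountId = @accountId";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@accountId", accountId);
            connection.Open();
            object result = command.ExecuteScalar();
            return result == DBNull.Value ? 0m : (decimal)result;   // SUM over zero rows returns NULL
        }
    }
}
```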
You asked whether it is bad practice to aggregate data into a separate store and later query that repository; it is actually a best practice :). You end up with one database serving the transactional, high-paced clients and another storing the aggregated information, quickly and easily available for your other needs. Taking this to the next step leads you to a data warehouse, so that is definitely where you want to be heading once you have a lot of information and calculations.
Lastly, caching: this is tricky and really depends on the specific nature of your needs. I'd say take the approach above and spend the time improving the processes; I expect the end result will make caching redundant.
One of your best friends for this task is SQL Profiler: run a trace on statement-completed events to see which statements have the highest duration/IO/CPU, and tackle those first.
Good luck!
I am trying to use SQLite in my application as a sort of cache. I say "sort of" because items never expire from my cache and I am not storing any actual data. I simply need the cache to store all the IDs I have processed before; I don't want to process anything twice.
I am entering items into the cache at 10,000 messages/sec, for a total of 150 million messages. My table is pretty simple: it has only one text column, which stores the IDs. I was doing this all in memory using a dictionary; however, I am processing millions of messages and, although that approach is fast, I ran out of memory after some time.
I have researched SQLite and performance and I understand that configuration is key; however, I am still getting horrible performance on inserts (I haven't tried selects yet). I am not able to keep up with even 5,000 inserts/sec. Maybe this is as good as it gets.
My connection string is as below:
Data Source=filename;Version=3;Count Changes=off;Journal Mode=off;
Pooling=true;Cache Size=10000;Page Size=4096;Synchronous=off
Thanks for any help you can provide!
If you are doing lots of inserts or updates at once, put them in a transaction.
Also, if you are executing essentially the same SQL each time, use a parameterized statement.
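A rough sketch combining both suggestions, assuming the System.Data.SQLite provider and a one-column table such as CREATE TABLE seen_ids (id TEXT): one transaction wraps the whole batch, and a single parameterized INSERT is reused for every ID.

```csharp
// Sketch: one transaction per batch plus a reused parameterized INSERT.
// Assumes the System.Data.SQLite provider and a table named seen_ids (made up).
using System.Collections.Generic;
using System.Data.SQLite;

public static class IdCache
{
    public static void InsertBatch(SQLiteConnection connection, IEnumerable<string> ids)
    {
        using (SQLiteTransaction transaction = connection.BeginTransaction())
        using (SQLiteCommand command = new SQLiteCommand(
                   "INSERT INTO seen_ids (id) VALUES (@id)", connection, transaction))
        {
            var idParameter = new SQLiteParameter("@id");
            command.Parameters.Add(idParameter);

            foreach (string id in ids)
            {
                idParameter.Value = id;          // same statement, new value each time
                command.ExecuteNonQuery();
            }

            transaction.Commit();                // one commit for the whole batch,
                                                 // instead of an implicit transaction per row
        }
    }
}
```

Batching a few thousand IDs per transaction is usually where the big win is; without an explicit transaction, SQLite commits after every single INSERT.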
Have you looked at the SQLite Optimization FAQ? (It's a bit old.)
SQLite performance tuning and optimization on embedded systems
If you have many threads writing to the same database, you're going to run into concurrency problems at that many transactions per second. SQLite always locks the whole database for writes, so only one write transaction can be processed at a time.
An alternative is Oracle Berkeley DB with SQLite. The latest version of Berkeley DB includes a SQLite front end that has a page-level locking mechanism instead of database-level locking. This provides a much higher number of transactions per second when there is a high concurrency requirement.
http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
It includes the same SQLite.NET provider and is supposed to be a drop-in replacement.
Since your requirements are so specific, you may be better off with something more dedicated, like memcached. This will give you a very high-throughput caching implementation that is a lot more memory efficient than a simple hashtable.
Is there a port of memcache to .Net?
I know the answer to this question is, for the most part, "it depends", but I wanted to see if anyone had some pointers.
We execute queries on each request in ASP.NET MVC. On each request we need to get user-rights information and various data for the views we are displaying. How many is too many? I know I should be conscious of the number of queries I am executing. I would assume that if they are small, optimized queries, half a dozen should be okay? Am I right?
What do you think?
Premature optimization is the root of all evil :)
First create your application; if it is sluggish you will have to determine the cause and optimize that part. Sure, reducing the number of queries will save you time, but so will optimizing the queries you do have to run.
You could spend a whole day shaving 50% off a query that only took 2 milliseconds to begin with, or spend 2 hours removing some INNER JOINs that made another query take 10 seconds. Analyse what's wrong before you start optimising.
The optimal amount would be zero.
Given that this is most likely not achievable, the only reasonable thing to say about it is: "as few as possible".
- Simplify your site design until it's as simple as possible while still meeting your client's requirements.
- Cache information that can be cached.
- Pre-load information into the cache outside the request, where you can.
- Ask only for the information that you need in that request.
If you need to make a lot of independent queries for a single request, parallelise the loading as much as possible.
What you're left with is the 'optimal' amount for that site.
If that's too slow, you need to review the above again.
User rights information may be able to be cached, as may other common information you display everywhere.
You can probably get away with caching more than the requirements strictly demand. For instance, you can probably cache "live" information such as product stock levels and the user's shopping cart. Use SQL change notifications to let you expire and repopulate the cache in the background.
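Here is a hedged sketch of that idea using SqlDependency (query notifications). The cache key, the dbo.StockLevels table and the StockRow type are invented, and Service Broker has to be enabled on the database for notifications to fire.

```csharp
// Sketch: cache a result set and let a query notification evict it when the
// underlying rows change. Table, cache key and StockRow are made-up names.
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Web;

public static class StockLevelCache
{
    private const string CacheKey = "stock-levels";

    public static void PrimeCache(string connectionString)
    {
        SqlDependency.Start(connectionString);   // call once per app domain, e.g. in Application_Start

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
                   "SELECT ProductId, QuantityOnHand FROM dbo.StockLevels", connection))
        {
            var dependency = new SqlDependency(command);
            dependency.OnChange += (sender, e) => HttpRuntime.Cache.Remove(CacheKey);

            connection.Open();
            using (SqlDataReader reader = command.ExecuteReader())
            {
                HttpRuntime.Cache[CacheKey] = ReadRows(reader);
            }
        }
    }

    private static List<StockRow> ReadRows(SqlDataReader reader)
    {
        var rows = new List<StockRow>();
        while (reader.Read())
            rows.Add(new StockRow { ProductId = reader.GetInt32(0), QuantityOnHand = reader.GetInt32(1) });
        return rows;
    }

    public class StockRow
    {
        public int ProductId;
        public int QuantityOnHand;
    }
}
```

Notifications are one-shot: when one fires, the next request finds the cache empty and can call PrimeCache again to re-subscribe and repopulate it.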
As few as possible.
Use caching for lookups. Also store some light-weight data (such as permissions) in the session.
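A minimal sketch of the session idea, assuming a hypothetical LoadPermissionsFromDatabase helper and a "permissions" session key:

```csharp
// Sketch: load permissions once per session instead of once per request.
// LoadPermissionsFromDatabase and the session key are placeholders.
using System.Collections.Generic;
using System.Web;

public static class PermissionHelper
{
    public static IList<string> GetPermissions(HttpContextBase context, int userId)
    {
        var permissions = context.Session["permissions"] as IList<string>;
        if (permissions == null)
        {
            permissions = LoadPermissionsFromDatabase(userId);   // one query per session
            context.Session["permissions"] = permissions;
        }
        return permissions;
    }

    private static IList<string> LoadPermissionsFromDatabase(int userId)
    {
        return new List<string>();   // placeholder for the real lookup
    }
}
```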
Q: Do you have a performance problem related to database queries?
Yes? A: Fewer than you have now.
No? A: The exact same number you have now.
If it ain't broke, don't fix it.
While refactoring and optimizing to save a few milliseconds is a fun and intellectually rewarding way for programmers to spend time, it is often a waste of time.
Also, changing your code to combine database requests could come at the cost of simplicity and maintainability in your code. That is, while it may be technically possible to combine several queries into one, that could require removing the conceptual isolation of business objects in your code, which is bad.
You can make as many queries as you want, until your site gets too slow.
As many as necessary, but no more.
In other words, the performance bottlenecks will not come from the number of queries, but what you do in the queries and how you deal with the data (e.g. caching a huge yet static resultset might help).
Along with all the other recommendations about making fewer trips, it also depends on how much data is retrieved on each round trip. If it is just a few bytes, then being a bit chatty probably won't hurt performance. However, if each trip returns hundreds of kilobytes, performance will suffer much sooner.
You have answered your own question: "it depends".
That said, trying to justify an optimal number of queries per HTTP request isn't really a meaningful exercise. If your SQL server has really good hardware behind it, you can run a good number of queries in very little time and still have a low turnaround time for the HTTP request. So basically, "it depends", as you rightly said.
As the comments above indicate, some caching is likely appropriate for your situation. And like your question suggests, the real answer is "it depends." Generally, the fewer the queries, the better since each query has a cost associated with it. You should examine your data model and your application's requirements to determine what is appropriate.
For example, if a user's rights are likely to be static during the user's session, it makes sense to cache the rights data so fewer queries are required. If aspects of the data displayed in your View are also static for a user's session, these could also be cached.
In my C# 3.5 application, the code performs the following steps:
1. Loop through a collection (of length 10).
2. For each item in step 1, fetch records from an Oracle database by executing a stored procedure (the record count is typically 100).
3. Process the items fetched in step 2.
4. Go to the next item in step 1.
My question, with regard to performance: is it a good idea to fetch all the items from step 2 (i.e. 10 * 100 = 1,000 records) in one shot, rather than connecting to the database on each iteration and retrieving the records for one item at a time?
Thanks.
Yes, it's slightly better, because you will lose the overhead of connecting to the DB, but you will still have the overhead of 10 stored procedure calls. If you can find a way to pass all 10 items as a parameter to the stored procedure and make just one call, I think you will get better performance.
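One common way to do that is to pass the 10 IDs as a single delimited parameter and have the procedure split it and return one result set. The sketch below is only illustrative: get_records_for_ids, its parameters and the System.Data.OracleClient usage are assumptions, not your actual procedure.

```csharp
// Sketch: one stored-procedure call for all 10 items instead of 10 calls.
// get_records_for_ids, p_id_list and p_results are hypothetical; the procedure
// is assumed to split the list and return a REF CURSOR with all matching rows.
using System.Data;
using System.Data.OracleClient;

public static class RecordFetcher
{
    public static void ProcessAll(string connectionString, string[] itemIds)
    {
        using (var connection = new OracleConnection(connectionString))
        using (var command = new OracleCommand("get_records_for_ids", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.Parameters.Add("p_id_list", OracleType.VarChar).Value = string.Join(",", itemIds);
            command.Parameters.Add("p_results", OracleType.Cursor).Direction = ParameterDirection.Output;

            connection.Open();
            using (IDataReader reader = command.ExecuteReader())   // one round trip for ~1,000 rows
            {
                while (reader.Read())
                {
                    // Process each record here instead of once per item.
                }
            }
        }
    }
}
```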
Depending on how intense the connection steps are, it might be better to fetch all the records at once. However, keep in mind that premature optimization is the root of all evil. :-)
Generally it is better to pull all the records from the database in one stored procedure call.
This is countered when the stored procedure call is long-running or otherwise heavy enough to cause contention on the table. In your case, however, with only 1,000 records, I doubt that will be an issue.
Yes, it is an incredibly good idea. The key to database performance is to run as many operations in bulk as possible.
For example, consider just the interaction between PL/SQL and SQL. These two languages run on the same server and are very thoroughly integrated. Yet I routinely see an order of magnitude performance increase when I reduce or eliminate any interaction between the two. I'm sure the same thing applies to interaction between the application and the database.
Even though the number of records may be small, bulking your operations is an excellent habit to get into. It's not premature optimization, it's a best practice that will save you a lot of time and effort later.
I am building a system where data will be added by a user every 30 seconds or so. In this case, should I go for batch inserts (insert after every 2 minutes), or should I insert every time the user enters data? The system is to be built on C# 3.5 and SQL Server.
Start with normal inserts. You're nowhere near having to optimize this.
If performance becomes a problem, or it would be obvious that it may be a concern, only then do you need to look at optimizing -- and even then, it may not be an issue with inserts! Use a profiler to determine the bottleneck.
Once every 30 seconds is no significant stress. By the KISS principle, I prefer one-by-one in this case.
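To show how little there is to it, here is a minimal sketch of the one-by-one approach; the Readings table and its columns are made up.

```csharp
// Sketch of the "insert as it arrives" approach; Readings and its columns are made up.
using System.Data.SqlClient;

public static class ReadingWriter
{
    public static void Save(string connectionString, int sensorId, decimal value)
    {
        const string sql =
            "INSERT INTO Readings (SensorId, Value, RecordedAt) VALUES (@sensorId, @value, GETUTCDATE())";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@sensorId", sensorId);
            command.Parameters.AddWithValue("@value", value);
            connection.Open();
            command.ExecuteNonQuery();   // one cheap round trip roughly every 30 seconds
        }
    }
}
```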
If it is only every 30 seconds, I would insert immediately, provided the insert is as quick as it should be (< 2 seconds even for large data).
If you foresee future growth towards more frequent transactions, then consider batching the inserts.
It really depends on your requirements, are you able to give more information?
For example:
Are you running a web app?
Is the data time-sensitive (to be used by other users)?
Do you have to worry about concurrent usage of the data?
What volume of transactions are you looking at?
For small amounts of data I would simply insert the rows as the user submits them; the gain from queuing them up for a batch insert will likely be minimal, and it simplifies your implementation. If usage turns out to be quite high, come back and look at optimising the design. It comes back to the general rule of avoiding premature optimisation :)