Join queries and when it's too much - c#

I find that I am using a lot of join queries, especially to get statistics about user operations from my database. Queries like this are not uncommon:
from io in db._Owners where io.tenantId == tenantId
join i in db._Instances on io.instanceId equals i.instanceId
join m in db._Machines on i.machineId equals m.machineId
select ...
My app is still not active, so I have no way of judging if these queries will be computationally prohibitive in real-life. My query:
Is there a limit to when doing too many 'joins' is too much, and can that be described without getting real-life operating stats?
What are my alternatives? For example, is it better to just create additional tables to hold statistics that I update as I go, rather than pulling together different table sources each time I want statistics?

If you do not have performance information then do not optimize.
Premature optimization is the root of all evil.
1) I don't think you'll ever reach the "limit".
2) This is called denormalization; premature denormalization is just wasted effort if you don't know whether a problem exists.
I'd say your query looks pretty normal.

1) Is there a limit to when doing too many 'joins' is too much
No, the number of joins isn't an issue so much as the structure of the data within each table, presence and use of indexes and what needs to be done to get data out.
Normalized data is commonly a primary goal in relational DB design. You typically consider denormalization as a means of optimizing queries only as necessary because of the added effort required to maintain data consistency.
If you're really concerned, post your data model ERD (database tables & how they relate) and the database you are using for the project (because not all databases are the same).

Unless you have very high traffic, and as long as indexes are properly set, etc., you shouldn't have problems.
For reporting/analysis, some places will create a data warehouse, which in its most basic form is a [partially] denormalized copy of your main database. They are easier to report on since one table usually contains most, if not all, of the data needed in a report. They can also be much faster to read from since you don't have to join so much. However, they'll require more disk space (duplicated data). If writes are allowed, they'll be slower (you have to update all the duplicated data) and you'll have the problem of keeping that duplicated data consistent.
In other words, unless you're only doing reporting (or read-only access), keep the joins.

Related

Best way to improve performance in WPF application

I'm currently working on a WPF application which was built using Entity Framework (database first) to access data in a SQL Server database.
In the past, the database was on an internal server and I did not notice any problem regarding the performance of the application, even though the database is very badly implemented (only tables, no views, no indexes or stored procedures). I'm the one who created it, but it was my first job and I was not very good with databases, so I felt like Entity Framework was the best approach, letting me focus mainly on code.
However, the database is now on another server which is waaay slower. As you guessed, the application now has big performance issues (more than 10 seconds to load a dozen rows, same to insert new rows, ...).
Should I stay with entity framework but try to improve performance by altering the database adding views and stored procedure ?
Should I get rid of Entity Framework and use only "basic" code (and improve the database at the same time)?
Is there a simple ORM I could use instead of EF ?
Time is not an issue here, I can use all the time I want to improve the application, but I can't seem to make a decision about the best way to make my application evolve.
The database is quite simple (around 10 tables), the only thing that could complicate things is that I store files in there. So I'm not sure I can really use whatever I want. And I don't know if it's important, but I need to display quite a few calculated fields. Any advice?
Feel free to ask any relevant questions.
For performance profiling, the first place I recommend looking is an SQL profiler. This can capture the exact SQL statements that EF is running, and help identify possible performance culprits. I cover a few of these here. The Schema issues are probably the most relevant place to start. The title targets MVC, but most of the items relate to WPF and any application.
A good, simple profiler that I use for SQL Server is ExpressProfiler. (https://github.com/OleksiiKovalov/expressprofiler)
With the move to a new server, and it now sending the data over the wire rather than pulling from a local database, the performance issues you're noticing will most likely fall under the category of "loading too much, too often". Now you won't only be waiting for the database to load the data, but also for it to package it up and send it over the wire. Also, does the new database represent the same data volume and serve only a single client, or is it now serving multiple clients? Another catch for developers is "works on my machine", where local testing databases are smaller and aren't dealing with concurrent queries against the server (where locks and such can impact performance).
From here, run a copy of the application with an isolated database server (no other clients hitting it to reduce "noise") with the profiler running against it. The things to look out for:
Lazy Loading - These are cases where you have queries to load data, but then see lots (dozens to hundreds) of additional queries being spun off. Your code may say "run this query and populate this data", which you expect should be 1 SQL query, but by touching lazy-loaded properties it can spin off a great many other queries.
The solution to lazy loading: if you need the extra data, eager load it with .Include(). If you only need some of the data, look into using .Select() to select view models / DTOs of the data you need rather than relying on complete entities. This will eliminate lazy load scenarios, but may require some significant changes to your code to work with view models/DTOs. Tools like AutoMapper can help greatly here. Read up on .ProjectTo() to see how AutoMapper can work with IQueryable to eliminate lazy load hits.
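As a rough illustration of those two options (the entity names, the context, and OrderSummaryDto below are made up for the example, not taken from the question):

using System.Data.Entity;   // EF6: provides the lambda overload of .Include()
using System.Linq;

public class OrderSummaryDto
{
    public int Id { get; set; }
    public int ItemCount { get; set; }
}

// Option 1: eager load the related data you know you need, so touching
// order.Items later does not fire extra queries.
var orders = context.Orders
    .Include(o => o.Items)                    // hypothetical navigation property
    .Where(o => o.CustomerId == customerId)
    .ToList();

// Option 2: project straight into a DTO; only the listed columns are queried
// and no lazy loading can fire afterwards.
var summaries = context.Orders
    .Where(o => o.CustomerId == customerId)
    .Select(o => new OrderSummaryDto
    {
        Id = o.Id,
        ItemCount = o.Items.Count()
    })
    .ToList();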
Reading too much - Loading entities can be expensive, especially if you don't need all of that data. Culprits for performance include excessive use of .ToList() which will materialize entire entity sets where a subset of data is needed, or a simple exists check or count would suffice. For example, I've seen code that does stuff like this:
var data = context.MyObjects.SingleOrDefault(x => x.IsActive && x.Id == someId);
return (data != null);
This should be:
var isData = context.MyObjects.Where(x => x.IsActive && x.Id == someId).Any();
return isData;
The difference between the two is that in the first example, EF will effectively do a SELECT * operation, so in the case where data is present it will return back all columns into an entity, only to later check if the entity was present. The second statement will run a faster query to simply return back whether a row exists or not.
var myDtos = context.MyObjects.Where(x => x.IsActive && x.ParentId == parentId)
    .ToList()   // problem: materializes the full entities here, before the .Select()
    .Select(x => new ObjectDto
    {
        Id = x.Id,
        Name = x.FirstName + " " + x.LastName,
        Balance = calculateBalance(x.OrderItems.ToList()),
        Children = x.Children.ToList()
            .Select(c => new ChildDto
            {
                Id = c.Id,
                Name = c.Name
            }).ToList()
    }).ToList();
Statements like this can go on and get rather complex, but the real problem is the .ToList() before the .Select(). Often these creep in because devs try to do something that EF doesn't understand, like calling a method (e.g. calculateBalance()), and it "works" by first calling .ToList(). The problem here is that you materialize the entire entity at that point and switch to Linq2Object. This means that any "touches" on related data, such as .Children, will now trigger lazy loads, and further .ToList() calls can saturate more data into memory which might otherwise have been reduced in the query. The thing to look out for is .ToList() calls: try removing them, select simpler values before calling .ToList(), and then feed that data into view models where the view models can calculate the resulting data.
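For example, the query above could be reshaped along these lines (a sketch only; it assumes calculateBalance just sums the order item amounts, and the Amount column is invented for the illustration):

var myDtos = context.MyObjects
    .Where(x => x.IsActive && x.ParentId == parentId)
    .Select(x => new
    {
        x.Id,
        x.FirstName,
        x.LastName,
        OrderAmounts = x.OrderItems.Select(oi => oi.Amount),     // hypothetical column
        Children = x.Children.Select(c => new { c.Id, c.Name })
    })
    .ToList()                                  // one translated query, no lazy loads fire
    .Select(x => new ObjectDto
    {
        Id = x.Id,
        Name = x.FirstName + " " + x.LastName,
        Balance = x.OrderAmounts.Sum(),         // calculated in memory from simple values
        Children = x.Children
            .Select(c => new ChildDto { Id = c.Id, Name = c.Name })
            .ToList()
    })
    .ToList();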
The worst culprit like this I've seen was due to a developer wanting to use a function in a Where clause:
var data = context.MyObjects.ToList().Where(x => calculateBalance(x) > 0).ToList();
That first ToList() statement will attempt to saturate the whole table to entities in memory. A big performance impact beyond just the time/memory/bandwidth needed to load all of this data is simply the # of locks the database must make to reliably read/write data. The fewer rows you "touch" and the shorter you touch them, the nicer your queries will play with concurrent operations from multiple clients. These problems magnify greatly as systems transition to being used by more users.
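If the balance can be expressed in terms the provider understands, the filter can stay on the server instead. A sketch, under the assumption that calculateBalance is really just a sum over an OrderItems.Amount column (that assumption is mine, not from the original code):

var data = context.MyObjects
    .Where(x => x.OrderItems.Sum(oi => (decimal?)oi.Amount) > 0)   // translated to SQL; only matching rows come back
    .ToList();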
Provided you've eliminated extra lazy loads and unnecessary queries, the next thing to look at is query performance. For operations that seem slow, copy the SQL statement out of the profiler and run that in the database while reviewing the execution plan. This can provide hints about indexes you can add to speed up queries. Again, using .Select() can greatly increase query performance by using indexes more efficiently and reducing the amount of data the server needs to pull back.
For file storage: Are these stored as columns in a relevant table, or in a separate table that is linked to the relevant record? What I mean by this, is if you have an Invoice record, and also have a copy of an invoice file saved in the database, is it:
Invoices
    InvoiceId
    InvoiceNumber
    ...
    InvoiceFileData

or

Invoices
    InvoiceId
    InvoiceNumber
    ...

InvoiceFile
    InvoiceId
    InvoiceFileData
It is a better structure to keep large, seldom used data in separate tables rather than combined with commonly used data. This keeps queries to load entities small and fast, where that expensive data can be pulled up on-demand when needed.
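A sketch of the second layout mapped as two EF entities sharing a primary key (the DbSet names, the filter, and selectedInvoiceId are made up for the example):

public class Invoice
{
    public int InvoiceId { get; set; }
    public string InvoiceNumber { get; set; }
    public virtual InvoiceFile File { get; set; }   // 1:1, only loaded when asked for
}

public class InvoiceFile
{
    public int InvoiceId { get; set; }              // shared primary key with Invoices
    public byte[] InvoiceFileData { get; set; }
}

// Day-to-day queries never touch the blob column:
var invoices = context.Invoices
    .Where(i => i.InvoiceNumber.StartsWith("INV-2017"))
    .ToList();

// The file itself is pulled on demand, for one record, when the user opens it:
var fileData = context.InvoiceFiles
    .Where(f => f.InvoiceId == selectedInvoiceId)
    .Select(f => f.InvoiceFileData)
    .SingleOrDefault();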
If you are using GUIDs for keys (as opposed to ints/longs), are you leveraging newsequentialid()? (assuming SQL Server) Keys set to use newid(), or Guid.NewGuid() in code, will lead to index fragmentation and poor performance. If you populate the IDs via database defaults, switch them over to use newsequentialid() to help reduce the fragmentation. If you populate IDs via code, have a look at writing a Guid generator that mimics newsequentialid() (SQL Server) or a pattern suited to your database. SQL Server and Oracle store/index GUID values differently, so having the "static-like" part of the UUID bytes in the higher-order vs. lower-order bytes of the data will aid indexing performance. Also consider index maintenance and other database maintenance jobs to help keep the database server running efficiently.
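A minimal sketch of a client-side sequential Guid generator along those lines, assuming SQL Server's sort order (it compares the last six bytes of a uniqueidentifier first). This is an illustration of the idea, not the actual newsequentialid() algorithm:

using System;

public static class SequentialGuid
{
    // Overwrites the 6 bytes SQL Server sorts on first with a millisecond timestamp,
    // so values generated over time land near each other in the clustered index.
    public static Guid NewSequential()
    {
        byte[] guidBytes = Guid.NewGuid().ToByteArray();
        long millis = DateTime.UtcNow.Ticks / TimeSpan.TicksPerMillisecond;

        byte[] stamp = BitConverter.GetBytes(millis);
        if (BitConverter.IsLittleEndian)
            Array.Reverse(stamp);                   // big-endian so later values sort higher

        Array.Copy(stamp, 2, guidBytes, 10, 6);     // bytes 10..15 are compared first by SQL Server
        return new Guid(guidBytes);
    }
}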
When it comes to index tuning, database server reports are your friends. After you've eliminated most, or at least some, serious performance offenders from your code, the next thing is to look at real-world use of your system. The best way to learn where to target your code/index investigations is to look at the most-used and most problematic queries that the database server identifies. Where these are EF queries, you can usually reverse-engineer which EF query is responsible based on the tables being hit. Grab these queries and feed them through the execution plan to see if there is an index that might help matters. Indexing is something that developers either forget about or get prematurely concerned about. Too many indexes can be just as bad as too few. I find it's best to monitor real-world usage before deciding on what indexes to add.
This should hopefully give you a start on things to look for and kick the speed of that system up a notch. :)
First you need to run a performance profiler and find out what the bottleneck is; it could be the database, the Entity Framework configuration, the Entity Framework queries and so on.
In my experience, Entity Framework is a good option for this kind of application, but you need to understand how it works.
Also, which version of Entity Framework are you using? The latest version is 6.2 and has some performance improvements that older ones do not have, so if you are using an old one I suggest updating it.
Based on the comments I am going to hazard a guess that it is mostly a bandwidth issue.
You had an application that was working fine when it was co-located, perhaps a single switch, gigabit ethernet and 200m of cabling.
Now that application is trying to send or retrieve data to/from a remote server, probably over the public internet through an unknown number of internal proxies in contention with who knows what other traffic, and it doesn't perform as well.
You also mention that you store files in the database, and your schema has fields like Attachment.data and Doc.file_content. This suggests that you could be trying to transmit large quantities (perhaps megabytes) of data for a simple query and that is where you are falling down.
Some general pointers:
Add indexes anywhere you are joining tables or for values you commonly query on.
Be aware of the difference between lazy & eager loading in Entity Framework. There is no right or wrong answer, but you should know which approach you are using and why.
Split any file content into its own table, with the same primary key as the main table, or play with different EF classes to make sure you only retrieve files when you need to use them.

SQL - Storing values that have dynamic properties and constraints

I'm not sure if what I'm attempting to do is simply incorrect/impossible or if there is an easier way and that I'm missing the point.
I'm using SQL Server 2012
What I would like to do is have a table that can store rows with values relating to stored properties in another table. Basically, key value pair. The thing is, I would like to determine which key values can be used by which entities.
For example,
I would like one table listing various companies, another to store 'files' created for each company - this is used to store historical information, another listing various production departments (stages in production), another listing production figures (KGs, Units, etc), and one listing the actual production captured against these figures for each month. There are also tables in place to show which production departments can use which production figures, as well as which company has which production departments.
Some of the companies have the same stages in production as well as additional stages that the others don't.
These figures are captured on a monthly basis ONLY, so I have a table describing all the months of a year.
Each production department may have similar types of recordings to be captured, though they don't all have the same production readings.
Here's a link to a graphical representation of the table layouts:
http://tinypic.com/r/30a51mx/8
My end result is to auto-populate / update the table with newly added figures as the user enters this section of the program (by passing through the FileID), and to allow the user to edit this using a datagridview (or at least select a value to be edited from the datagridview).
I will then need to write reports later on that will need to pivot on this information.
Any help or suggestions would be greatly appreciated.
Thanks
For an effective DB design it is very important to understand two major requirements:
Should the DB design be done keeping in mind ease of use from the application's point of view, or efficient storage?
This point is by and large decided keeping in mind the following factors:
How much data are we going to store, so we need to have some idea about cost of storage factoring redundancy. Good normalised DB reduces redundancy.
Your DB is normalised very well, but is that really needed? The cost of storage is very low these days, so a design which is slightly more redundant should be OK - unless, of course, you plan to use the Standard version of SQL Server, which has its own limitations in terms of DB size.
Is data retrieval and update slow or fast? The more normalised the DB is, the more JOINs are expected. In your case, if you want to return values for multiple properties, say n, in a single result, you'd need to make n joins on the ProductionProperty table, which will reduce query performance and hence slow the user experience (one way to avoid the n joins is sketched after this answer). So if your UI is not very demanding and your users can live with a small lag, go ahead with a normalised DB design.
ORM mismatch - since the relational database model and the object model (assuming your programming language follows OOP concepts) usually mismatch, and will mismatch heavily in a normalised scenario like this, you'll need to spend more hours coding through or troubleshooting scenarios, and changes to either of these models can be painful. I suggest you use a good ORM framework to counter this and be aware of the common ORM mismatch scenarios.
Will you have a separate Reporting DB or Reporting tables? Basically this translates into: is your DB an OLTP database or a Reporting database? If this DB is going to be worked on heavily by data entry people day in and day out, the normalised form should be suitable, provided Point #1 is satisfied. If however reporting is a major need, then the de-normalised form should be preferred (which means you do not need so many separate tables).
PS: Master data should be kept in a table of its own. Months are definitely master data, and so is UoM, unless you plan to do CRUD on the UoM measures too. Also note that it hardly matters whether month is kept in a separate table, especially when the same business logic/constraints can be enforced on columns in SQL.
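As a footnote to the JOIN point above: one way to avoid joining the property table once per property is to pull the key/value rows in a single query and pivot them in code. The entity and column names below are guesses at the ERD, purely for illustration:

// Hypothetical shapes: ProductionValues(FileId, PropertyId, MonthId, Value),
// ProductionProperties(PropertyId, Name).
var rows = (from v in db.ProductionValues
            join p in db.ProductionProperties on v.PropertyId equals p.PropertyId
            where v.FileId == fileId && v.MonthId == monthId
            select new { p.Name, v.Value })
           .ToList();

// One dictionary per file/month instead of n self-joins:
var figures = rows.ToDictionary(r => r.Name, r => r.Value);
// figures["KGs"], figures["Units"], ...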

Is there a design pattern that would help to compare a big amount of data?

I need to compare particular content of 2 SQL tables located in different servers: Table1 and Table2.
I want to compare each row from Table1 against the whole content of Table2.
The comparison logic is kind of complicated, so I want to apply a logical operator that I will write in C#. So I don't want to do the comparison in the SQL query itself.
My concern is that the size of the data I will work on will be around 200 MB.
I was thinking of loading the data into a DataTable by using ADO.NET and doing the comparison in memory.
What would you recommend? Is there already a pattern-like approach to comparing massive data?
200 MB should not be a problem. A .NET application can handle much more than that at once.
But even so, I would probably use a forward-only data reader for Table 1, just because there's no good reason not to, and that should reduce the amount of memory required. You can keep table 2 in memory with whatever structure you are accustomed to.
You can use two SqlDataReaders. They only have one row in memory at a time, are forward only, and extremely efficient. After getting the row back from the reader you can then compare the values. Here is an example.
See MSDN.
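A rough sketch of the reader-plus-in-memory idea from the answers above: keep Table2 in memory, stream Table1 with a forward-only reader, and run the C# comparison per row. The connection strings, table and column names are placeholders:

using System.Collections.Generic;
using System.Data.SqlClient;

string connString1 = "...";   // server hosting Table1
string connString2 = "...";   // server hosting Table2

// Load Table2 once into memory (~200 MB is fine for a .NET process).
var table2 = new List<string[]>();
using (var conn = new SqlConnection(connString2))
using (var cmd = new SqlCommand("SELECT KeyCol, PayloadCol FROM Table2", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
        while (reader.Read())
            table2.Add(new[] { reader.GetString(0), reader.GetString(1) });
}

// Stream Table1 forward-only: only one of its rows is held at a time.
using (var conn = new SqlConnection(connString1))
using (var cmd = new SqlCommand("SELECT KeyCol, PayloadCol FROM Table1", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            string key1 = reader.GetString(0);
            string payload1 = reader.GetString(1);
            foreach (var row2 in table2)
            {
                // the complicated C# comparison logic goes here
            }
        }
    }
}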
The most scalable solution is to create SQLCLR functions to execute the comparisons you want.
You should probably avoid a row-by-row comparison at all costs. The network latency and delays due to round-tripping will result in extremely slow execution.
A quick&dirty solution is to extract the data to local files then do the comparison as you will pay the network tax only once. Unfortunately, you lose the speedup provided by database indexes and query optimizations.
A similar solution is to load all the data once in memory and then use indexing structures like dictionaries to provide additional speedup. This is probably doable as your data can fit in memory. You still pay the network tax only once but gain from faster execution.
The most scalable solution is to create SQLCLR code to create one or more functions that will perform the comparisons you want. This way you avoid the network tax altogether, avoid creating and optimizing your own structures in memory and can take advantage of indexes and optimizations.
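For reference, a SQLCLR scalar function is just a static C# method decorated with SqlFunction; this is a bare-bones sketch, with the real comparison rule left as a placeholder:

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class RowComparisons
{
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlBoolean ValuesMatch(SqlString a, SqlString b)
    {
        if (a.IsNull || b.IsNull)
            return SqlBoolean.False;

        // The complicated comparison rule goes here; it runs inside SQL Server,
        // so the data being compared never has to cross the network.
        return a.Value.Trim().Equals(b.Value.Trim(), System.StringComparison.OrdinalIgnoreCase);
    }
}

Once the assembly is deployed you call it from T-SQL like any scalar function, e.g. dbo.ValuesMatch(t1.Col, t2.Col).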
These solutions may not be applicable, depending on the actual logic of the comparisons you are doing. The first two rely on sorting the data correctly.
1) Binary search - you can find the matching row in table 2 without scanning through all of table 2 by using a binary search; this will significantly reduce the number of comparisons.
2) If you are looking for overlaps/matches/missing rows between the two tables, you can sort both tables in the same order. Then you can loop through the two tables simultaneously, keeping a pointer to the current row of each table. If table 1 is "ahead" of table 2, you only increment the table 2 pointer until they are either equal, or table 2 is ahead. Then once table 2 is ahead, you start incrementing table 1 until it is ahead, and so on (see the code sketch after this list). In this way you only have to loop through each record from each table one time, and you are guaranteed that there were no matches that you missed.
If table 1 and table 2 match, then that is a match. While table 1 is ahead, every row in table 2 is "missing" from table 1, and vice versa.
This solution would also work if you only need to take some action if the rows are in a certain range of each other or something.
3) If you have to actually do some action for every row in table 2 for every row in table 1, then it's just two nested loops, and there is not much you can do to optimize that other than making the comparison/work as efficient as possible. You could possibly multi-thread it though, depending on what the work is and where your bottleneck is.
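Here is the sketch referred to in point 2: both result sets are sorted on the same key and two indices walk them in lock-step. The Row type and Key property are placeholders for whatever your rows look like:

using System.Collections.Generic;

class Row { public string Key; /* plus whatever columns you compare */ }

static void CompareSorted(List<Row> table1, List<Row> table2)
{
    int i = 0, j = 0;
    while (i < table1.Count && j < table2.Count)
    {
        int cmp = string.CompareOrdinal(table1[i].Key, table2[j].Key);
        if (cmp == 0)
        {
            // keys match: compare the remaining columns of the two rows here
            i++; j++;
        }
        else if (cmp < 0)
        {
            // table1 row has no counterpart: "missing" from table2
            i++;
        }
        else
        {
            // table2 row has no counterpart: "missing" from table1
            j++;
        }
    }
    // anything left over in either list is missing from the other table
}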
Can you stage the data to the same database using a quick ETL/SSIS job? This would allow you to do set operations, which might be easier to deal with. If not, I would agree with the recommendations for a forward-only data reader with one table in memory.
A couple years ago I wrote a db table comparison tool, which is now an open-source project called Data Comparisons.
You can check out the source code if you want. There is a massive optimization you can make when the two tables you're comparing are on the same physical server, because you can write a SQL query to take care of this. I called this the "Quick compare" method in Data Comparisons, and it's available whenever you're sharing the same connection string for both sides of the comparison.
When they're on two different servers, however, you have no choice but to pull the data into memory and compare the rows there. Using SqlDataReaders would work. However, it's complicated when you must know exactly what's different (what rows are missing from table A or table B, what rows are different, etc). For that reason my method was to use DataTables, which are slower but at least they provide you with the necessary functionality.
Building this tool was a learning process for me. There are probably opportunities for optimization with the in-memory comparison. For example, loading the data into a Dictionary and doing your comparisons off of primary keys with Linq would probably be faster. You could even try Parallel Linq and see if that helps. And as Jeffrey L Whitledge mentioned, you might as well use a SqlDataReader for one of the tables while the other is stored in memory.

How to maximize performance?

I have a problem which I cannot seem to get around no matter how hard I try.
This company works in market analysis and has pretty large tables (300K - 1M rows) and MANY columns (think 250-300) which we do some calculations on.
I'll try to get straight to the problem:
The problem is the filtering of the data. All databases I've tried so far are way too slow to select data and return it.
At the moment I am storing the entire table in memory and filtering using dynamic LINQ.
However, while this is quite fast (about 100 ms to filter 250 000 rows) I need better results than this...
Is there any way I can change something in my code (not the data model) which could speed the filtering up?
I have tried using:
DataTable.Select, which is slow.
Dynamic LINQ, which is better, but still too slow.
Normal LINQ (just for testing purposes), which is almost good enough.
Fetching from MySQL and doing the processing later on, which is badass slow.
At the beginning of this project we thought that some high-performance database would be able to handle this, but I tried:
H2 (IKVM)
HSQLDB (compiled ODBC-driver)
CubeSQL
MySQL
SQL
SQLite
...
And they are all very slow to interface with from .NET and get results from.
I have also tried splitting the data into chunks and combining them later in runtime to make the total amount of data which needs filtering smaller.
Is there any way in this universe I can make this faster?
Thanks in advance!
UPDATE
I just want to add that I have not created this database in question.
To add some figures, if I do a simple select of 2 fields in the database query window (SQLyog), like this (visit_munic_name is indexed):
SELECT key1, key2 FROM table1 WHERE filter1 = filterValue1
It takes 125 milliseconds on 225639 rows.
Why is it so slow? I have tested 2 different boxes.
Of course they must change something, obviously?
You do not explain what exactly you want to do, or why filtering a lot of rows is important. Why should it matter how fast you can filter 1M rows to get an aggregate if your database can precalculate that aggregate for you? In any case it seems you are using the wrong tools for the job.
On one hand, 1M rows is a small number of rows for most databases. As long as you have the proper indexes, querying shouldn't be a big problem. I suspect that either you do not have indexes on your query columns or you want to perform ad-hoc queries on non-indexed columns.
Furthermore, it doesn't matter which database you use if your data schema is wrong for the job. Analytical applications typically use star schemas to allow much faster queries for a lot more data than you describe.
All databases used for analysis purposes use special data structures which require that you transform your data to a form they like.
For typical relational databases you have to create star schemas that are combined with cubes to precalculate aggregates.
Column databases store data in a columnar format usually combined with compression to achieve fast analytical queries, but they require that you learn to query them in their own language, which may be very different than the SQL language most people are accustomed to.
On the other hand, the way you query (LINQ or DataTable.Select or whatever) has minimal effect on performance. Picking the proper data structure is much more important.
For instance, using a Dictionary<> is much faster than using any of the techniques you mentioned. A dictionary lookup finds a single value in memory by key. Executing DataTable.Select without indexes, or using LINQ to DataSets or to Objects, is essentially the same as scanning all entries of an array or a List<> for a specific value, because that is what all these methods do - scan an entire list sequentially.
The various LINQ providers do not do the job of a database. They do not optimize your queries. They just execute what you tell them to execute. Even doing a binary search on a sorted list is faster than using the generic LINQ providers.
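For example (a sketch; rows stands for the in-memory records, VisitMunicName mirrors the indexed visit_munic_name column from the question, and RowType is a placeholder class):

using System.Linq;

// Scanning: every filter call walks all 250,000 rows (what LINQ to Objects does).
var slow = rows.Where(r => r.VisitMunicName == filterValue1).ToList();

// Indexing once: build a lookup keyed on the filter column, then each query is a hash probe.
ILookup<string, RowType> byMunic = rows.ToLookup(r => r.VisitMunicName);
var fast = byMunic[filterValue1].ToList();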
There are various things you can try, depending on what you need to do:
If you are looking for a quick way to slice and dice your data, use an existing product like the PowerPivot functionality of Excel 2010. PowerPivot loads and compresses MANY millions of rows in an in-memory columnar format and allows you to query your data just as you would with a pivot table, and even define joins with other in-memory sources.
If you want a more repeatable process you can either create the appropriate star schemas in a relational database or use a columnar database. In either case you will have to write the scripts to load your data in the proper structures.
If you are creating your own application, you really need to investigate the various algorithms and structures used by other similar in-memory tools.

Ensure "Reasonable" queries only

In our organization we have the need to let employees filter data in our web application by supplying WHERE clauses. It's worked great for a long time, but we occasionally run into users providing queries that require full table scans on large tables or inefficient joins, etc.
Some clown might write something like:
select * from big_table where
Name in (select name from some_table where name like '%search everything%')
or name in ('a', 'b', 'c')
or price < 20
or price > 40
or exists (select 1 from some_other_table where col1 + col2 + col3 = 4)
or exists (select 1 from table_a, table_b)
Obviously, this is not a great way to query these tables with computed values, non-indexed columns, lots of OR's and an unrestricted join on table_a and table_b.
But for a user, this may make total sense.
So what's the best way, if any, to allow internal users to supply a query to the database while ensuring that it won't lock a dozen tables and hang the webserver for 5 minutes?
I'm guessing there's a programmatic way in C#/SQL Server to get the execution plan for a query before it runs. And if so, what factors contribute to cost? Estimated I/O cost? Estimated CPU cost? What would be reasonable limits at which to tell the user that his query's no good?
EDIT: We're a market research company. We have thousands of surveys, each with their own data. We have dozens of researchers that want to slice that data in arbitrary ways. We have tools to let them construct "valid" filters using a GUI, but some "power users" want to supply their own queries. I realize this isn't standard or best practice, but how else can I let dozens of users query tables for the rows they want using arbitrarily complex conditions and ever-changing conditions?
The premise of your question states:
In our organization we have the need to let employees filter data in our web application by supplying WHERE clauses.
I find this premise to be flawed on its face. I can't imagine a situation where I would allow users to do this. In addition to the problems you have already identified, you are opening yourself up to SQL Injection attacks.
I would highly recommend reassessing your requirements to see if you can't build a safer, more focused way of allowing your users to search.
However, if your users really are sophisticated (and trusted!) enough to be supplying WHERE clauses directly, they need to be educated on what they can and can't submit as a filter.
You can try using the following:
SET SHOWPLAN_ALL ON   -- return estimated plan rows instead of executing the statements
GO
SET FMTONLY ON        -- return result set metadata only, no data rows
GO
<<< Your SQL code here >>>
GO
SET FMTONLY OFF
GO
SET SHOWPLAN_ALL OFF
GO
Then you can parse through what you've got. As to where to draw the line on various things, that's going to take some experience. There are some things to watch for, but nothing that is cut and dried. It's often more of an art to examine the query plans than a science.
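If you go this route from C#, the plan rows come back like any other result set. A sketch along those lines, where the cost threshold and connection handling are things you would have to calibrate for your own environment:

using System;
using System.Data.SqlClient;

static class QueryGate
{
    // Returns true if the estimated plan cost of userSql stays under maxCost.
    // While SHOWPLAN_ALL is ON the statements are not executed; only plan rows come back.
    public static bool IsCheapEnough(string connectionString, string userSql, double maxCost)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var planOn = new SqlCommand("SET SHOWPLAN_ALL ON", conn))
                planOn.ExecuteNonQuery();
            try
            {
                double worstCost = 0;
                using (var cmd = new SqlCommand(userSql, conn))
                using (var reader = cmd.ExecuteReader())
                {
                    do
                    {
                        while (reader.Read())
                        {
                            object cost = reader["TotalSubtreeCost"];
                            if (cost != DBNull.Value)
                                worstCost = Math.Max(worstCost, Convert.ToDouble(cost));
                        }
                    } while (reader.NextResult());
                }
                return worstCost <= maxCost;
            }
            finally
            {
                using (var planOff = new SqlCommand("SET SHOWPLAN_ALL OFF", conn))
                    planOff.ExecuteNonQuery();
            }
        }
    }
}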
As others have pointed out though, I think that your problem goes deeper than the technology implications. The fact that you let unqualified people access your database in such a way is the underlying problem. From past experience, I often see this in companies where they are too lazy or too inexperienced to properly capture their application's requirements. I'm not saying that this is necessarily the case with your corporate environment, but that's what I've seen.
In addition to trying to control what the users enter (which is a losing battle; there will always be a new hire who will come up with an imaginative query), I'd look into Resource Governor, see Managing SQL Server Workloads with Resource Governor. You put the ad-hoc queries into a separate pool and cap the allocated resources. This way you can mitigate the problem by limiting the amount of damage a bad query can do to other tasks.
And you should also consider giving access to the data by other means, like Power Pivot, and let users massage their data as hard as they want in their own Excel. Business power users love that, and the impact on the transaction processing server is minimal.
Instead of allowing employees to directly write (append to) queries, and then trying to calculate the query cost before running it, why not create some kind of Advanced Search or filter feature that is NOT writing SQL you cannot control?
In very large enterprise organizations this is a common practice on internal applications. Often during your design phase you will limit the criteria or put sensible limits on data ranges, but once the business gets hold of the app there will be calls from business unit management to remove the restrictions. In my organization this is a management problem, not an engineering issue.
What we did was profile all of the criteria and find the largest offenders, both the users and the types of queries that caused the most problems, and put limitations on some of the queries. Also, some very expensive queries that were used on a regular basis were added to the app, and the app cached the results and ran the queries when load was low. We also created canned optimized queries for standard users and gave only specified users the ability to search for anything. Just a couple of ideas.
You could make a data model for your database and allow users to use SQL Reporting Services' Report Builder. It's GUI-based and doesn't require writing WHERE clauses, so there should be a limit to how much damage they can do.
Or you could warehouse a copy of the db for the purpose of user queries, update the db every hour or so, and let them go to town... :)
I have worked a few places where this also came up. What we ended up doing was NOT allowing users unconstrained access, and promising to have IT do their best to provide queries when needed. The issue was that the database is fairly complicated, and even if users could write grammatically and syntactically correct SQL, they don't necessarily understand the relationships between the tables. In other words, even if they could write their own SQL they would get the wrong answers. We convinced the users that the risk of making the wrong decision based on a flawed or incomplete understanding of the 200 tables in the database was too high. Better to get the right answer after a day than the wrong one instantly.
The other part of this is what does IT do when user A writes a query and gets 1 answer, then user B writes what he thinks is the same query and gets a different answer? Is it IT's job to find the differences? To fix both pieces of SQL? etc. The bottom line is that I would not allow them access. I would load the system with predefined queries, as others have mentioned, and try to train mgmt why that is the only way it will work in the long run.
If you have so much data and you want to give your customers the ability to analyse and view the information as they want, I strongly recommend thinking about OLAP technologies.
I guess you've never heard of SQL injection attacks? What if the user enters a DROP DATABASE command after the WHERE clause?
This is the reason that direct SELECT permission is almost never given to users in the vast majority of applications.
A far better approach would be to engineer your application around use cases so that you are able to cover a reasonable percentage of requirements with specifically designed filters/aggregation/layout options.
There are a myriad of ways to do this so some analysis of your specific problem domain will definitely be required together with research into viable methods.
Whilst direct SQL access is the most flexible for your users, long-running queries are likely to be just the start of your headaches. SQL injection is a big concern here, whether its source is malicious or simply misguided.
(Chad mentioned this in a comment, but I think it deserves to be an answer.)
Maybe you should copy data that needs to be queried ad-hoc into a separate database, to isolate any problems from the majority of users.
