I am developing a C# application working with millions of records retrieved from a relational database (SQL Server). My main table "Positions" contains the following columns:
PositionID, PortfolioCode, SecurityAccount, Custodian, Quantity
Users must be able to retrieve Quantities consolidated by some predefined set of columns, e.g. {PortfolioCode, SecurityAccount} or {PortfolioCode, Custodian}.
First, I simply used dynamic queries in my application code but, as the database grew, the queries became slower.
I wonder if it would be a good idea to add another table that will contain the consolidated quantities. I guess it depends on the distribution of those groups?
Besides, how would I synchronize the source table with the consolidated one?
In SQL Server you could use indexed views to do this. An indexed view keeps the aggregates synchronized with the underlying table, but it will slow down inserts into that table:
http://technet.microsoft.com/en-us/library/ms191432.aspx
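A minimal sketch of such an indexed view for one of the grouping sets, assuming Quantity is declared NOT NULL (the view and index names here are made up):

CREATE VIEW dbo.vPositionsByPortfolioAccount
WITH SCHEMABINDING
AS
SELECT
    PortfolioCode,
    SecurityAccount,
    SUM(Quantity) AS TotalQuantity,
    COUNT_BIG(*)  AS RowCnt   -- COUNT_BIG(*) is required in an indexed view with GROUP BY
FROM dbo.Positions
GROUP BY PortfolioCode, SecurityAccount;
GO

-- The unique clustered index is what makes SQL Server materialize and maintain the view.
CREATE UNIQUE CLUSTERED INDEX IX_vPositionsByPortfolioAccount
    ON dbo.vPositionsByPortfolioAccount (PortfolioCode, SecurityAccount);

Each predefined grouping set would need its own view, so weigh the extra write overhead against how many consolidations you really need.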
If it's purely a count of grouped rows in a single table, would standard indexing not suffice here? More info on your structure would be useful.
Edit: Also, it sounds a little like you're using your OLTP server as a reporting server? If so, have you considered whether a data warehouse and an ETL process might be appropriate?
Background
My backend has a SQL Server 2012 database with around 20 tables (the number may grow over time), and each table will initially hold roughly 100 - 1000 rows, which may also increase in the future.
One of my colleagues developed a web application that uses this database and lets clients do CRUD and the usual business logic.
Problem
My task is to create a reporting page for this web application. What I will be doing is giving clients the ability to export all of the data for all of their deeply nested objects from SQL, from all tables or only a couple, with all columns or only a few... in Excel, PDF and other formats in the future. I might also need to query a 3rd party in my business logic to gather further information (out of scope for now).
What can I do to achieve the above?
What I know
I can't think of any efficient and extendable solution, as it will involve hundreds of columns and about 20 tables. All I can think of is adding hundreds of views for what I might require, but that doesn't sound practical either.
Should I look into BI or SQL Server Reporting Services, or should this be done in code using an ORM like EF? Or is there any open-source code already out there for such generic operations? I am totally confused.
Please note I am asking what to use, not how to use it. Hope I didn't offend anyone.
If you aren't concerned with the client having access to all your database object names, you could write up something yourself without too much effort. If you are creating a page, you could query the system views to get a list of all table and column names to populate some sort of filtering (dropdowns, listboxes, etc.).
You can get a list of all the tables:
select object_id, name from sys.tables
You can get a list of all columns per table:
select object_id, name from sys.columns
object_id is the common key between the views.
Then you could write some dynamic SQL based on the export requirements if you plan to export through SQL.
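As a rough sketch of that dynamic SQL (the table name is a placeholder for whatever the user picked in your filtering UI, and here every column is included for simplicity):

DECLARE @table sysname = N'SomeTable';   -- assumed to come from the user's selection
DECLARE @cols  nvarchar(max);

-- Build a comma-separated column list from sys.columns for the chosen table.
SELECT @cols = STUFF((
    SELECT N', ' + QUOTENAME(c.name)
    FROM sys.columns AS c
    JOIN sys.tables  AS t ON t.object_id = c.object_id
    WHERE t.name = @table
    FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'), 1, 2, N'');

DECLARE @sql nvarchar(max) = N'SELECT ' + @cols + N' FROM ' + QUOTENAME(@table) + N';';
EXEC sys.sp_executesql @sql;

The result set can then be fed to whatever Excel/PDF export library you settle on.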
I have an Entity Framework DbContext with two different DbSets.
In my view I am combining these two sets into the same view model and listing them in the same table.
I want to support table paging to be able to query only a page of records at a time sorted by a particular column. I can't see how to do this without reading all of the records from the database and then paging from memory.
For example, I want to be able to sort by date ascending since both tables have a date column. I could simply take the page size from both tables and then sort in memory but the problem comes into play when I am skipping records. I do not know how many to skip in each table since it depends on how many records are found in the other table.
Is there a way to manipulate Entity Framework to do this?
It is possible.
Join them in the database (this can be done in EF).
Project that (select new {}) into the final object.
Order by, skip, take on that projection.
It will be crap performance-wise, but there is no way around that given you have a broken database model. It basically has to build a temporary view of all the rows for SQL to find the first ones - that will be slow.
Your best bet is going to be to combine them with a stored procedure or view, and then map that sp/view into Entity Framework. Combining them on the client is going to kill performance - let the server do it for you; it is clearly a server-side task.
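A rough sketch of what that server-side view and its paged query could look like (the table and column names are invented, since the question doesn't give them; OFFSET/FETCH needs SQL Server 2012 or later):

CREATE VIEW dbo.vCombinedItems AS
SELECT Id, Title, CreatedDate, 'TableA' AS Source FROM dbo.TableA
UNION ALL
SELECT Id, Title, CreatedDate, 'TableB' AS Source FROM dbo.TableB;
GO

-- Third page of 20 rows, sorted by the shared date column ascending.
SELECT Id, Title, CreatedDate, Source
FROM dbo.vCombinedItems
ORDER BY CreatedDate ASC
OFFSET 40 ROWS FETCH NEXT 20 ROWS ONLY;

Mapped into the DbContext, the view can then be queried with the usual OrderBy/Skip/Take and the paging stays on the server.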
Don't ask why but there are four databases. One of which I have rights to modify the schema. Let's call it external. Again, it's a legacy deal but there are about 60 tables in one of the other three databases, called main. Each record in those tables has a field that links it to a record in a corresponding table in external.
PetaPoco will make quick work of a lot of the trouble. Tentatively, I've tried multiple Database.tt files to manipulate all four databases. Is there a better way?
Should I create synonyms or views in external that refer to the goods in the other databases? And then only use one Database.tt on external?
Is a combined POCO for the linked tables reasonable?
Database.tt is only used to pre-generate some POCOs out of your schema. I can hardly believe you are going to leave the generated code there without modification. Normally I would start there and then change it to model the links more reasonably (with complex properties for the linked tables).
As to linked-table queries: since they must be executed as one query, you can only keep a connection to one database, so a linked table (synonym or view) is necessary. But beware of low performance. Cross-database joins can sometimes be 10 times slower than local joins, depending on the SQL. If you have nested selects across tables in multiple databases, it is better to use a temp table to avoid performance issues.
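A small sketch of the synonym/view idea, using invented table names (whether Database.tt picks up synonyms depends on the template, so a view is the safer bet if it only enumerates tables and views):

USE external;
GO
-- Point a local name at a table that lives in the "main" database.
CREATE SYNONYM dbo.MainOrders FOR main.dbo.Orders;
GO
-- Or expose it as a view, optionally pre-joining the linked record held in external.
CREATE VIEW dbo.OrdersWithExternal AS
SELECT o.OrderID, o.OrderDate, x.ExtraInfo
FROM main.dbo.Orders AS o
JOIN dbo.OrdersExt   AS x ON x.OrderID = o.OrderID;

With everything visible from external, a single Database.tt run against that one database can generate the POCOs, including a combined one for a linked pair if the view above is used.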
I have 3 tables:
Item_Detail: ID, Name
ItemPurchased_Detail: QtyPurchased, RateOfPurchase, DiscountReceived
ItemSold_Detail: QtySold, RateofSale, DiscountGiven
Also, I have a MAIN TABLE, ITEM_FULL_DETAIL, which contains all the columns from the above 3 tables.
I have a WinForms application with a single form that contains all the textboxes to insert data into the ITEM_FULL_DETAIL table. The user would input all the data and click the SUBMIT button.
I want to insert the data first into the MAIN TABLE, and then it should distribute the data individually to all 3 tables. What should I use for this: triggers, procedures, views or joins?
Also, I am using the ITEM_FULL_DETAIL table because I want to protect my actual tables from any loss of data, such as in the case of a power outage.
Should I use a temporary table in place of the ITEM_FULL_DETAIL table, or is it fine using the current one?
Is there any other way?
You can use database triggers, or insert all records at the application level.
You should probably re-think your design: duplicating the same data in different tables is usually a bad idea. In this case, you could replace ITEM_FULL_DETAIL with a view, then maintain the data in the underlying tables. That way you only have one copy of the data, so you don't need to worry about inconsistencies between tables.
If you did that then you could either insert new data into the 3 underlying tables in the correct order (probably the best idea) or use an INSTEAD OF trigger on the ITEM_FULL_DETAIL view (more complicated). The INSERTs can be done using an ORM, ADO.NET, a stored procedure or whatever suits your application design best.
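A minimal sketch of that view, assuming the purchase and sale tables carry (or would be given) an ItemID foreign key back to Item_Detail, which the question does not show:

CREATE VIEW dbo.ITEM_FULL_DETAIL
AS
SELECT d.ID, d.Name,
       p.QtyPurchased, p.RateOfPurchase, p.DiscountReceived,
       s.QtySold, s.RateofSale, s.DiscountGiven
FROM dbo.Item_Detail AS d
LEFT JOIN dbo.ItemPurchased_Detail AS p ON p.ItemID = d.ID
LEFT JOIN dbo.ItemSold_Detail      AS s ON s.ItemID = d.ID;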
If you do have a good reason for duplicating your data, then it would be helpful if you could share it, because someone may have a better suggestion for that scenario.
Also, I am using the ITEM_FULL_DETAIL table because I want to protect my actual tables from any loss of data, such as in the case of a power outage.
...What? How do you suppose you are protecting your tables? What are you trying to prevent? There is absolutely no need for the ITEM_FULL_DETAIL table if what you are worried about is data integrity. You're probably creating a situation in which data integrity can be compromised by using this intermediate table.
Are you aware of transactions? Use them. If two out of three tables are written to and then the power on the client goes off before the 3rd write can complete, the transaction will fail and the partial data will be rolled back.
Unless I'm totally missing the point here..
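For illustration, a minimal sketch of that transactional insert straight into the three tables (the parameter names are stand-ins for the form's textbox values, and the ID/ItemID linkage is assumed):

BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO dbo.Item_Detail (Name) VALUES (@Name);
    DECLARE @ItemID int = SCOPE_IDENTITY();   -- assumes ID is an identity column

    INSERT INTO dbo.ItemPurchased_Detail (ItemID, QtyPurchased, RateOfPurchase, DiscountReceived)
    VALUES (@ItemID, @QtyPurchased, @RateOfPurchase, @DiscountReceived);

    INSERT INTO dbo.ItemSold_Detail (ItemID, QtySold, RateofSale, DiscountGiven)
    VALUES (@ItemID, @QtySold, @RateofSale, @DiscountGiven);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;  -- any partial writes are undone
    THROW;
END CATCH

Either all three rows are committed together or none of them are, which is exactly the protection the intermediate table was meant to provide.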
I want to distribute a large amount of data to different C# applications. For example, my table contains millions of records. I would like to specify that the first 3 million records are processed by App1, the next 3 million by another C# application, App2, and so on. Table rows are deleted and added as required. Now I want to write a SQL query that will process the first 3 million records. If 5 records are deleted from App1, then App1 must fetch the next 5 records from App2, and App2 from App3, so that the amount of data always remains constant in each app.
I have used LIMIT in the SQL query, but I didn't get the required output. How can I write the SQL query for this, and how should I design the C# application?
It looks a bit as if you want to take over from the database and do, in your own application, the processing that a database is tasked and tailored to do. You talk of an SQL query with a LIMIT clause. Don't use that. Millions of records is not much in database terms. If you have performance issues, you may need to index your table or revisit the query design (watch its execution plan for performance issues).
If you really cannot let the database do the task and you need to process them one by one in your application, the network latency and bandwidth is likely to be an earlier candidate for performance issues, which you won't make any faster by using multiple apps (let alone the cost of such queries).
If my observations are wrong and your processing of the records must take place outside the database and the network is not a bottleneck, nor are the processors or the database machine and multiple applications will provide a performance gain, then I suggest you create a dispatch application that processes the records and makes them available to your other applications (or better: threads) through normal POCOs. This creates a much easier way of spreading the processing and the dispatch application (or thread) can work as some kind of funnel for your processing applications.
However, look at the cost / benefit equation: is the trouble really going to gain you some performance, or is it better to revisit your design and find a more practical solution?
That sounds like a really bad idea. Requesting a limit of 3 million records is a very slow operation.
An alternative approach would be to have an instance number column and have each instance of your application reserve rows as it needs them by writing its instance number into this column. Process your data in smaller chunks if possible.
Adding an index to the instance number column will allow you to count how many rows you have already handled and also to find the next batch of 1000 (for example) that haven't been assigned to any instance yet.
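A rough sketch of that reservation pattern (the table and column names are invented; InstanceNumber is NULL until a row has been claimed):

-- Claim the next 1000 unassigned rows for instance 2 and return them to the app.
UPDATE TOP (1000) dbo.WorkItems
SET InstanceNumber = 2
OUTPUT inserted.*
WHERE InstanceNumber IS NULL;

-- Supporting index so counting handled rows and finding the next batch stays cheap.
CREATE INDEX IX_WorkItems_InstanceNumber ON dbo.WorkItems (InstanceNumber);

If several instances claim rows concurrently you would also want to think about locking hints (e.g. READPAST), but the single UPDATE ... OUTPUT keeps the claim itself atomic.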
I would benefit from a better understanding of the details of the application and the process to get, select, delete, etc. However, to give it a shot at a viable answer:
In short, use partitioned tables and distributed views. Each application is "keyed" to those tables through the common partitioned view; if any application has to act on another table (or "key"), it can use the same view and act on the other tables.
In more detail ...
If you have the Enterprise or Developer edition of SQL Server, or any other edition that provides distributed views, then you can create three or more tables with a partitioning column ("App1", "App2", "App3"), as Mark Byers has said, which would then distribute the ability to process the data evenly.
Now create a view (WITH SCHEMABINDING) along the lines of: SELECT Field1, Field2, Field3, ... FROM table1 UNION ALL SELECT Field1, Field2, Field3, ... FROM table2 UNION ALL SELECT Field1, Field2, Field3, ... FROM table3 (a partitioned view needs UNION ALL rather than UNION).
Create a unique clustered key on the one or two fields that uniquely identify your data. When this is done, you can select/delete/update from the view WHERE partitioncolumn = 'App1' AND id = ?. This arrangement allows action queries (insert/update/delete) through the view while only acting on the table that holds that partition's data.
So App1 sends an "App1" WHERE filter, and the db engine only acts on table1 even though the view spans all of the tables.
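A condensed sketch of that layout with invented names (Records_App2 and Records_App3 would mirror Records_App1, differing only in the CHECK value):

CREATE TABLE dbo.Records_App1 (
    RecordID int           NOT NULL,
    AppKey   varchar(10)   NOT NULL CHECK (AppKey = 'App1'),
    Payload  nvarchar(200) NULL,
    CONSTRAINT PK_Records_App1 PRIMARY KEY (AppKey, RecordID)
);
GO

-- The partitioned view over all three member tables.
CREATE VIEW dbo.Records AS
SELECT RecordID, AppKey, Payload FROM dbo.Records_App1
UNION ALL
SELECT RecordID, AppKey, Payload FROM dbo.Records_App2
UNION ALL
SELECT RecordID, AppKey, Payload FROM dbo.Records_App3;
GO

-- Each application filters on its own key; the CHECK constraints let the
-- optimiser eliminate the tables that cannot contain matching rows.
SELECT RecordID, Payload FROM dbo.Records WHERE AppKey = 'App1';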