SqlDataReader involving query with joins - c#

I have two tables linked by a foreign key (ID). Table1 has 1 million records and Table2 has 50 million.
I would like to read each record from Table1 together with all of its associated records in Table2. I could use SqlDataReader and implement a Peek() function for this, as discussed here (How do I implement a Peek() function on a DataReader?), with two queries:
select ID, Col1 from Table1 order by ID
select ID, col2 from Table2 order by ID
But the downside of the peek approach is that I have to compare each child record with the parent before advancing the pointer of the parent result.
If I use a join in SQL Server, it will perform the join operation and then start streaming the result, which requires a lot of memory.
Another approach would be to divide the join operation into batches, but this involves firing multiple SQL queries, which I don't want.
Can you please suggest an alternative approach to achieve this?
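For reference, the two-query approach in the question amounts to a merge over two ID-ordered streams. The sketch below is only an illustration of that idea on plain sequences, not code from the linked thread; the names SortedMerge and GroupChildren are hypothetical, and it assumes both inputs are sorted ascending by ID.

```csharp
using System;
using System.Collections.Generic;

public static class SortedMerge
{
    // Pairs each parent with its children in a single forward pass.
    // Assumption: both sequences are already sorted ascending by ID.
    public static IEnumerable<(TParent Parent, List<TChild> Children)> GroupChildren<TParent, TChild>(
        IEnumerable<TParent> parents,
        IEnumerable<TChild> children,
        Func<TParent, int> parentId,
        Func<TChild, int> childId)
    {
        using (IEnumerator<TChild> child = children.GetEnumerator())
        {
            bool hasChild = child.MoveNext();

            foreach (TParent parent in parents)
            {
                int id = parentId(parent);
                var matches = new List<TChild>();

                // Skip children whose ID sorts before this parent (orphans).
                while (hasChild && childId(child.Current) < id)
                    hasChild = child.MoveNext();

                // Collect every child that belongs to this parent.
                while (hasChild && childId(child.Current) == id)
                {
                    matches.Add(child.Current);
                    hasChild = child.MoveNext();
                }

                yield return (parent, matches);
            }
        }
    }
}
```

Only the children of the current parent are held in memory at any time. To feed this from the database, each SqlDataReader could be wrapped in an iterator that yields one row per Read() call; keeping two readers open at once would need either two connections or MultipleActiveResultSets enabled on the connection.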

If I understand your problem correctly, you might want to look into using a partitioned table. Here's the MySQL manual page on partitioning, and here's a Stack Overflow question that deals with partitioning and joins

Related

BULK INSERT across multiple related tables?

I need to do a BULK INSERT of several hundred-thousand records across 3 tables. A simple breakdown of the tables would be:
TableA
--------
TableAID (PK)
TableBID (FK)
TableCID (FK)
Other Columns
TableB
--------
TableBID (PK)
Other Columns
TableC
--------
TableCID (PK)
Other Columns
The problem with a bulk insert, of course, is that it only works with one table so FK's become a problem.
I've been looking around for ways to work around this, and from what I've gleaned from various sources, using a SEQUENCE column might be the best bet. I just want to make sure I have correctly cobbled together the logic from the various threads and posts I've read on this. Let me know if I have the right idea.
First, I would modify the tables to look like this:
TableA
--------
TableAID (PK)
TableBSequence
TableCSequence
Other Columns
TableB
--------
TableBID (PK)
TableBSequence
Other Columns
TableC
--------
TableCID (PK)
TableCSequence
Other Columns
Then, from within the application code, I would make five calls to the database with the following logic:
Request X Sequence numbers from TableC, where X is the known number of records to be inserted into TableC. (1st DB call.)
Request Y Sequence numbers from TableB, where Y is the known number of records to be inserted into TableB (2nd DB call.)
Modify the existing objects for A, B and C (which are models generated to mirror the tables) with the now known Sequence numbers.
Bulk insert to TableA. (3rd DB call)
Bulk insert to TableB. (4th DB call)
Bulk insert to TableC. (5th DB call)
And then, of course, we would always join on the Sequence.
I have three questions:
Do I have the basic logic correct?
In Tables B and C, would I remove the clustered index from the PK and put it on the Sequence instead?
Once the Sequence numbers are requested from Tables B and C, are they then somehow locked between the request and the bulk insert? I just need to make sure that between the request and the insert, some other process doesn't request and use the same numbers.
Thanks!
EDIT:
After typing this up and posting it, I've been reading deeper into the SEQUENCE documentation. I think I misunderstood it at first. SEQUENCE is not a column type. For the actual column in the table, I would just use an INT (or maybe a BIGINT, depending on the number of records I expect to have). The actual SEQUENCE object is an entirely separate entity whose job is to generate numeric values on request and keep track of which ones have already been generated. So, if I understand correctly, I would create two SEQUENCE objects, one to be used in conjunction with Table B and one with Table C.
So that answers my third question.
Do I have the basic logic correct?
Yes. The other common approach here is to bulk load your data into a staging table, and do something similar on the server-side.
From the client you can request ranges of sequence values using the sp_sequence_get_range stored procedure.
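A minimal sketch of requesting such a range from C# might look like the following. It assumes an already open SqlConnection and an existing SEQUENCE object; the class and method names (SequenceRanges, ReserveRange, Expand) are hypothetical, and Expand assumes the sequence was created with an increment of 1.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class SequenceRanges
{
    // Reserves `count` values from the named SEQUENCE object in one call
    // and returns the first value of the reserved range.
    public static long ReserveRange(SqlConnection openConnection, string sequenceName, long count)
    {
        using (var command = new SqlCommand("sys.sp_sequence_get_range", openConnection))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.Parameters.Add("@sequence_name", SqlDbType.NVarChar, 776).Value = sequenceName;
            command.Parameters.Add("@range_size", SqlDbType.BigInt).Value = count;

            // The procedure hands back the first value as sql_variant.
            SqlParameter first = command.Parameters.Add("@range_first_value", SqlDbType.Variant);
            first.Direction = ParameterDirection.Output;

            command.ExecuteNonQuery();
            return Convert.ToInt64(first.Value);
        }
    }

    // Expands a reserved range into individual key values.
    // Assumption: the sequence increment is 1.
    public static IEnumerable<long> Expand(long firstValue, long count)
    {
        for (long i = 0; i < count; i++)
            yield return firstValue + i;
    }
}
```

The values returned by Expand can then be stamped onto the in-memory objects for B and C (and the matching references in A) before the three bulk inserts.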
In Tables B and C, would I remove the clustered index from the PK
No, as you later noted the sequence just supplies the PK values for you.
Sorry, I read your question wrong at first. I see now that you are trying to generate your own PKs rather than allowing MS SQL to generate them for you. Scratch my above comment.
As David Browne mentioned, you might want to use a staging table to avoid the strain you'll put on your app's heap. Use tempdb and do the modifications directly on the table using a single transaction for each table. Then, copy the staging tables over to their target or use a MERGE if appending. If you are enforcing FK's, you can temporarily remove those constraints if you choose to insert in reverse order (C=>B=>A). You also may want to consider temporarily removing indexes if experiencing performance issues during the insert. Last, consider using SSIS instead of a custom app.

Entity Framework - how can I optimize “Contains” statement?

In our current application we have some performance issues with some of our queries. Usually we have something like:
List<int> idList = some data here…;
var query = from a in someTable where idList.Contains(a.Id) select a;
While this is acceptable for simple queries, it becomes a bottleneck when we have more items in idList (in some queries we have about 700 IDs to check, for example).
Is there any way to use something other than Contains? We are thinking of using some temporary tables to first insert the IDs and then execute a join instead of Contains, but it would seem Entity Framework does not support such operations (creating temporary tables in code) :(
What else can we try?
I suggest using LINQPad; it offers a "Transform to SQL" option which allows you to see your query in SQL syntax.
There is a chance that this is already the optimal solution (if you're not into messy stuff).
You might try holding the idList as a sorted array and replacing the Contains method with a binary search (you can implement your own extension).
You can try this:
var query = someTable.Where(a => idList.Any(b => b == a.Id));
If you don't mind having a physical table you could use a semi-temporary table. The basic idea is:
Create a physical table with a "query id" column
Generate a unique ID (not random, but unique)
Insert data into the table tagging the records with the query ID
Pass the query id to the main query, using it to join to the link table
Once the query is complete, delete the temporary records
At worst if something goes wrong you will have orphaned records in the link table (which is why you use a unique query ID).
It's not the cleanest solution but it will be faster than using Contains if you have a lot of values to check against.
When Entity Framework starts being a performance bottleneck, generally it's time to write actual SQL.
So what you could do for example is build a table-valued function that takes a table-valued parameter (your list of IDs) as parameter. The function would just return the result of your JOIN.
The table-valued function feature requires EF5, so it might not be an option if you're really stuck with EF4.
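As a rough sketch of the client side of a table-valued parameter in plain ADO.NET: the ID list is packed into a DataTable and passed as a structured parameter. The table type name dbo.IntList and the helper names are assumptions made for illustration; the type would have to exist on the server with a single int column.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class IdListParameter
{
    // Packs the IDs into a one-column DataTable, the shape a
    // table-valued parameter expects.
    public static DataTable ToDataTable(IEnumerable<int> ids)
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        foreach (int id in ids)
            table.Rows.Add(id);
        return table;
    }

    // Builds the structured parameter to attach to a SqlCommand.
    public static SqlParameter Create(string parameterName, IEnumerable<int> ids)
    {
        return new SqlParameter(parameterName, SqlDbType.Structured)
        {
            TypeName = "dbo.IntList", // hypothetical user-defined table type
            Value = ToDataTable(ids)
        };
    }
}
```

The server-side function or query can then join the target table against the parameter instead of expanding 700 values into an IN list.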
The idea is to refactor your queries to get rid of idList.
For example, say you need to return the list of orders placed by male users aged 18-25 from France. If you filter the users table by age, sex and country to get an idList of users, you end up with 700+ IDs. Instead, join the Orders table with Users and apply the filters to the Users table. That way you don't have two requests (one for IDs and one for orders), and it works much faster because it can use indexes while joining the tables.
Makes sense?

Which approach is better to retrieve data from a database

I am confused about selecting two approaches.
Scenario
There are two tables, Table 1 and Table 2. Table 1 contains the users' data, for example first name, last name, etc.
Table 2 contains the cars each user has, with their descriptions, i.e. color, registration number, etc.
Now, if I want all the information for all users, which approach completes in the minimum time?
Approach 1.
Query for all rows in Table 1 and store them all in a list, for example.
Then loop through the list and, for each user saved in the first step, query Table 2 for that user's data.
Approach 2
Query for all rows in Table 1 and, while saving each row, get all of its values from Table 2 and save them too.
In terms of the work the system does, I think the two might be the same, because the same number of records is processed in both approaches.
If there is a better idea, please let me know.
Your two approaches will have about the same performance (slow because of N+1 queries). It would be faster to do a single query like this:
select *
from T1
left join T2 on ...
order by T1.PrimaryKey
Your client app can then interpret the results and have all the data from a single query. An alternative would be:
select *, 1 as Tag
from T1
union all
select *, 2 as Tag
from T2
order by T1.PrimaryKey, Tag
This is just pseudo code but you could make it work.
The union-all query will have surprisingly good performance because SQL Server will do a "merge union", which works like a merge join. This pattern also works for multi-level parent-child relationships, although not as well.

Best way of acquiring information from several database tables

I have a medical database that keeps different types of data on patients: examinations, lab results, x-rays... each type of record exists in a separate table. I need to present this data on one table to show the patient's history with a particular clinic.
My question: what is the best way to do it? Should I do a SELECT from each table where the patient ID matches, order them by date, and then keep them in some artificial list-like structure (ordered by date)? Or is there a better way of doing this?
I'm using WPF and SQL Server 2008 for this app.
As others have said, JOIN is the way you'd normally do this. However, if there are multiple rows in one table for a patient then there's a chance you'll get data in some columns repeated across multiple rows, which often you don't want. In that case it's sometimes easier to use UNION or UNION ALL.
Let's say you have two tables, examinations and xrays, each with a PatientID, a Date and some extra details. You could combine them like this:
SELECT PatientID, ExamDate [Date], ExamResults [Details]
FROM examinations
WHERE PatientID = @patient
UNION ALL
SELECT PatientID, XrayDate [Date], XrayComments [Details]
FROM xrays
WHERE PatientID = @patient
Now you have one big result set with PatientID, Date and Details columns. I've found this handy for "merging" multiple tables with similar, but not identical, data.
If this is something you're going to be doing often, I'd be tempted to create a denormalized view over all of the patient data (joining the appropriate tables) and index the appropriate column(s) in the view. Then use the appropriate method (stored procedure, etc.) to retrieve the data for a passed-in patient ID.
Use a JOIN to get data from several tables.
You can use a join (can't remember which type exactly) to get all the records from each table for a specific patient. The way this works depends on your database design.
I'd do it with separate SELECT statements, since a simple JOIN probably won't do due to the fact that some tables might have more than 1 row for the patient.
So I would retrieve multiple result sets into a simple DataSet, add a DataRelation, cache the object and query it down the line (by date, by exam type, subsets, ...)
The main point is that you have all the data handy, even cached if needed, in a structure which is easily queried and filtered.
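A small sketch of the DataSet plus DataRelation idea follows. The table and column names reuse the ones from the UNION ALL example above; the class name and the sample rows are made up for illustration, and in a real application the tables would be filled from the database rather than by hand.

```csharp
using System;
using System.Data;

public static class PatientHistoryCache
{
    // Builds a DataSet with a parent table, a child table and a relation
    // between them. The rows are made-up sample data.
    public static DataSet Build()
    {
        var patients = new DataTable("Patients");
        patients.Columns.Add("PatientID", typeof(int));

        var exams = new DataTable("Examinations");
        exams.Columns.Add("PatientID", typeof(int));
        exams.Columns.Add("ExamDate", typeof(DateTime));
        exams.Columns.Add("ExamResults", typeof(string));

        var data = new DataSet();
        data.Tables.Add(patients);
        data.Tables.Add(exams);

        // Link child rows to their parent through PatientID.
        data.Relations.Add("PatientExams",
            patients.Columns["PatientID"], exams.Columns["PatientID"]);

        patients.Rows.Add(1);
        exams.Rows.Add(1, new DateTime(2010, 1, 5), "normal");
        exams.Rows.Add(1, new DateTime(2010, 3, 9), "follow-up");
        return data;
    }
}
```

With the relation in place, `data.Tables["Patients"].Rows[0].GetChildRows("PatientExams")` returns the examination rows for that patient, and the other record types could be added as further child tables in the same way.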

How to optimize this SQL Query (from C#)

I am a newbie to DB programming and need help with optimizing this query:
Given tables A, B and C, where I am interested in one column from each of them, how can I write a query that gets one column from each table into three different arrays/lists in my C# code?
I am currently running three different queries against the DB but want to accomplish the same in one query (to save two trips to the DB).
@patmortech Use UNION ALL instead of UNION if you don't care about duplicate values or if you can only get unique values (because you are querying via primary or unique keys). Much faster performance with UNION ALL.
There is no sense of "arrays" in SQL. There are tables, rows, and columns. Resultsets return a SET of rows and columns. Can you provide an example of what you are looking for? (DDL of source tables and sample data would be helpful.)
As others have said, you can send up multiple queries to the server within a single execute statement and return multiple resultsets via ADO.NET. You use the DataReader .NextResult() command to return the next resultset.
See here for more information: MSDN
Section: Retrieving Multiple Result Sets using NextResult
Here is some sample code:
// Requires: using System; using System.Data.SqlClient;
static void RetrieveMultipleResults(SqlConnection connection)
{
    using (connection)
    {
        // Two SELECT statements in one batch produce two result sets.
        SqlCommand command = new SqlCommand(
            "SELECT CategoryID, CategoryName FROM dbo.Categories;" +
            "SELECT EmployeeID, LastName FROM dbo.Employees",
            connection);
        connection.Open();

        using (SqlDataReader reader = command.ExecuteReader())
        {
            // do/while so that an empty result set does not end the loop early.
            do
            {
                Console.WriteLine("\t{0}\t{1}", reader.GetName(0),
                    reader.GetName(1));

                while (reader.Read())
                {
                    Console.WriteLine("\t{0}\t{1}", reader.GetInt32(0),
                        reader.GetString(1));
                }
            } while (reader.NextResult()); // advance to the next result set
        }
    }
}
With a stored procedure you can return more than one result set from the database and have a dataset filled with more than one table, you can then access these tables and fill your arrays/lists.
You can do 3 different SELECT statements and execute in 1 call. You will get 3 results sets back. How you leverage those results depends on what data technology you are using. LINQ? Datasets? Data Adapter? Data Reader? If you can provide that information (perhaps even sample code) I can tell you exactly how to get what you need.
Not sure if this is exactly what you had in mind, but you could do something like this (as long as all three columns are the same data type):
select field1, 'TableA' as TableName from tableA
UNION
select field2, 'TableB' from tableB
UNION
select field3, 'TableC' from tableC
This would give you one big resultset with all the records. Then you could use a data reader to read the results, keep track of what the previous record's TableName value was, and whenever it changes you could start putting the column values into another array.
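One way to sketch the client side of this is shown below, on plain tuples rather than a live data reader. It is a variation on the description above: rows are keyed on the TableName value instead of watching for it to change, so the result does not depend on the order in which the rows arrive. The names TaggedRows and Split are hypothetical, and the column values are assumed to be strings.

```csharp
using System.Collections.Generic;

public static class TaggedRows
{
    // Groups (value, tableName) rows into one list per table name.
    public static Dictionary<string, List<string>> Split(
        IEnumerable<(string Value, string TableName)> rows)
    {
        var lists = new Dictionary<string, List<string>>();
        foreach (var (value, tableName) in rows)
        {
            // Create the list for this table the first time its tag is seen.
            if (!lists.TryGetValue(tableName, out List<string> list))
            {
                list = new List<string>();
                lists[tableName] = list;
            }
            list.Add(value);
        }
        return lists;
    }
}
```

With a data reader, each row would contribute `(reader.GetString(0), reader.GetString(1))` to the input sequence.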
Take the three trips. The answers so far suggest how far you would need to advance from "new to db programming" to do what you want. Master the simplest ways first.
If they are three huge results, then I suspect you're trying to do something in C# that would better be done in SQL on the database without bringing back the data. Without more detail, this sounds suspiciously like an antipattern.
