Segmented Data Load: Table Records to XML - C#

I have a sequence of SQL queries that produce very large result sets which I have to run against a database and write to files. There are about 80 queries, and each one returns anywhere from 1,000 to 10,000,000 records. I cannot change the queries themselves. What I'm trying to do is read 500,000 records at a time for each query and write them to a file. Here's what I have so far:
void WriteXml(string tableName, string queryString)
{
    int pageSize = 500000;
    int currentIndex = 0;
    using (SqlConnection connection = new SqlConnection(CONNECTION_STRING))
    {
        using (SqlCommand command = new SqlCommand(queryString, connection))
        {
            try
            {
                connection.Open();
                SqlDataAdapter dataAdapter = new SqlDataAdapter(command);
                int rowsRead = 0, count = 0, index = 0;
                do
                {
                    DataSet dataSet = new DataSet("SomeDatasetName");
                    rowsRead = dataAdapter.Fill(dataSet, currentIndex, pageSize, tableName);
                    currentIndex += rowsRead;
                    if (dataSet.Tables.Count > 0 && rowsRead > 0)
                    {
                        dataSet.Tables[0].WriteXml(
                            string.Format(@"OutputXml\{0}_{1}.xml", tableName, index++),
                            XmlWriteMode.WriteSchema);
                    }
                }
                while (rowsRead > 0);
            }
            catch (Exception e)
            {
                Log(e);
            }
        }
    }
}
This works, but it's very, very slow. I'm pretty sure I'm doing something wrong here, because when I run it the application hogs most of my memory (I have 6 GB) and it takes forever. I started it last night and it is still running. I understand I'm dealing with a lot of records, but I don't think it's something that should take this many hours to run.
Is this the right way to do paged/segmented data read from a database? Is there any way this method could be optimized or is there any other way I can approach this?
Do let me know if I'm not clear on anything and I'll try to provide clarification.

The paging overloads for DataAdapter.Fill still get the entire result set beneath the covers. Read here:
http://msdn.microsoft.com/en-us/library/tx1c9c2f%28vs.71%29.aspx
the part that pertains to your question:
The DataAdapter provides a facility for returning only a page of data, through overloads of the Fill method. However, this might not be the best choice for paging through large query results because, while the DataAdapter fills the target DataTable or DataSet with only the requested records, the resources to return the entire query are still used. To return a page of data from a data source without using the resources required to return the entire query, specify additional criteria for your query that reduces the rows returned to only those required.
In LINQ to SQL there are convenient Skip and Take methods for paging through data. You could roll your own with a parameterized query constructed to do the same thing. Here is an example that skips 100 rows and takes 20:
SELECT TOP 20 [t0].[CustomerID], [t0].[CompanyName]
FROM [Customers] AS [t0]
WHERE (NOT (EXISTS(
    SELECT NULL AS [EMPTY]
    FROM (
        SELECT TOP 100 [t1].[CustomerID]
        FROM [Customers] AS [t1]
        WHERE [t1].[City] = @p0
        ORDER BY [t1].[CustomerID]
        ) AS [t2]
    WHERE [t0].[CustomerID] = [t2].[CustomerID]
    ))) AND ([t0].[City] = @p1)
ORDER BY [t0].[CustomerID]
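If the 80 queries really cannot be wrapped with extra filter criteria, another option that fits the "cannot change the queries" constraint is to keep a single streaming read and just split the output: the query runs once and the client never holds more than one page in memory. This is a minimal sketch, not the paged-query approach described above, and it assumes the same CONNECTION_STRING, tableName and queryString as the original method:
// Stream the unmodified query with a SqlDataReader and start a new XML file
// every pageSize rows; only one page of rows is buffered at any time.
void WriteXmlStreaming(string tableName, string queryString)
{
    const int pageSize = 500000;

    using (var connection = new SqlConnection(CONNECTION_STRING))
    using (var command = new SqlCommand(queryString, connection))
    {
        command.CommandTimeout = 0;   // long-running extract
        connection.Open();

        using (SqlDataReader reader = command.ExecuteReader())
        {
            // Build an empty DataTable with the reader's column layout.
            var page = new DataTable(tableName);
            for (int i = 0; i < reader.FieldCount; i++)
                page.Columns.Add(reader.GetName(i), reader.GetFieldType(i));

            int fileIndex = 0;
            object[] values = new object[reader.FieldCount];

            while (reader.Read())
            {
                reader.GetValues(values);
                page.Rows.Add(values);

                if (page.Rows.Count == pageSize)
                {
                    page.WriteXml(string.Format(@"OutputXml\{0}_{1}.xml", tableName, fileIndex++),
                                  XmlWriteMode.WriteSchema);
                    page.Clear();   // drop the written rows before reading the next page
                }
            }

            if (page.Rows.Count > 0)   // remainder
                page.WriteXml(string.Format(@"OutputXml\{0}_{1}.xml", tableName, fileIndex),
                              XmlWriteMode.WriteSchema);
        }
    }
}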

Related

cmd.ExecuteScalar() works but throws ORA-25191 Exception

My code works and the function returns the correct SELECT COUNT(*) value, but it still throws an ORA-25191 exception (Cannot reference overflow table of an index-organized table) at
retVal = Convert.ToInt32(cmd.ExecuteScalar());
Since I use the function very often, the exceptions slow down my program tremendously.
private int getSelectCountQueryOracle(string Sqlquery)
{
    try
    {
        int retVal = 0;
        using (DataTable dataCount = new DataTable())
        {
            using (OracleCommand cmd = new OracleCommand(Sqlquery))
            {
                cmd.CommandType = CommandType.Text;
                cmd.Connection = oraCon;
                using (OracleDataAdapter dataAdapter = new OracleDataAdapter())
                {
                    retVal = Convert.ToInt32(cmd.ExecuteScalar());
                }
            }
        }
        return retVal;
    }
    catch (Exception ex)
    {
        exceptionProtocol("Count Function", ex.ToString());
        return 1;
    }
}
This function is called in a foreach loop
// function call in a foreach loop which goes through table names
foreach (DataRow row in dataTbl.Rows)
{
    ...
    tableNameFromRow = row["TABLE_NAME"].ToString();
    tableRows = getSelectCountQueryOracle("select count(*) as 'count' from " + tableNameFromRow);
    tableColumns = getSelectCountQueryOracle("SELECT COUNT(*) as 'count' FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name='" + tableNameFromRow + "'");
    ...
}
dataTbl.Rows in this outer loop, in turn, comes from the query
SELECT * FROM USER_TABLES ORDER BY TABLE_NAME
If you're using a database-agnostic API like ADO.NET, you would almost always want to use the API's own facilities to fetch metadata rather than writing custom queries against each database's metadata tables. The various ADO.NET providers are much more likely to write data dictionary queries that handle all the corner cases, and those queries are much more likely to be optimized than the ones you would write yourself. So rather than writing your own query to populate the dataTbl data table, you'd want to use the GetSchema method:
DataTable dataTbl = connection.GetSchema("Tables");
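A short sketch of what that looks like in this code (connString is assumed; the columns of the returned DataTable, such as TABLE_NAME here, can vary by provider, so check them in a debugger):
// Enumerate tables through the provider's schema collections instead of USER_TABLES.
using (OracleConnection connection = new OracleConnection(connString))
{
    connection.Open();
    DataTable dataTbl = connection.GetSchema("Tables");

    foreach (DataRow row in dataTbl.Rows)
    {
        string tableNameFromRow = row["TABLE_NAME"].ToString();
        // run the per-table count here, exactly as in the existing loop
    }
}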
If you want to keep your custom-coded data dictionary query for some reason, you'd need to filter out the IOT overflow tables since you can't query those directly.
select *
from user_tables
where iot_type IS NULL
or iot_type != 'IOT_OVERFLOW'
Be aware, however, that there are likely to be other tables you don't want to try to get a count from. For example, the dropped column indicates whether a table has been dropped; presumably you don't want to count the rows of an object sitting in the recycle bin, so you'd want a dropped = 'NO' predicate as well. And you can't do a count(*) on a nested table, so you'd want a nested = 'NO' predicate too if your schema happens to contain nested tables. There are probably other corner cases, depending on the exact set of features your particular schema uses, that the provider's developers have already handled and that you would otherwise have to deal with yourself.
So I'd start with
select *
from user_tables
where ( iot_type IS NULL
or iot_type != 'IOT_OVERFLOW')
and dropped = 'NO'
and nested = 'NO'
but know that you'll probably need or want to add some additional filters depending on the specific features users make use of. I'd certainly much rather let the fine folks who develop the ADO.NET provider worry about all those corner cases than track down all of them myself.
Taking a step back, though, I'd question why you're regularly doing a count(*) on every table in a schema and why you need an exact answer. In most cases where you're doing counts, either it's a one-off where you don't much care how long it takes (e.g. a validation step after a migration), or an approximate count would be sufficient (e.g. listing the biggest tables in the system to triage some effort, or tracking growth over time for projections). In the latter case you could just use the counts already stored in the data dictionary, user_tables.num_rows, from the last time statistics were gathered.
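If an approximate figure would do, those statistics-based counts can be read for the whole schema in one round trip instead of one COUNT(*) per table. A rough sketch, assuming oraCon is the already-open OracleConnection used elsewhere in this code and that statistics are reasonably fresh:
// Read num_rows (as of the last statistics gathering) for every regular table.
string sql = @"SELECT table_name, NVL(num_rows, 0) AS approx_rows
               FROM user_tables
               WHERE dropped = 'NO' AND nested = 'NO'";

using (OracleCommand cmd = new OracleCommand(sql, oraCon))
using (OracleDataReader reader = cmd.ExecuteReader())
{
    while (reader.Read())
    {
        string tableName = reader.GetString(0);
        long approxRows = Convert.ToInt64(reader["approx_rows"]);
        // use tableName / approxRows instead of calling getSelectCountQueryOracle
    }
}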
This article helped me to solve my problem.
I've changed my query to this:
SELECT * FROM user_tables
WHERE iot_type IS NULL OR iot_type != 'IOT_OVERFLOW'
ORDER BY TABLE_NAME

Why does my SQL update for 20,000 records take over 5 minutes?

I have a piece of C# code which updates two specific columns for ~1000x20 records in a database on the localhost. As far as I know (though I am really far from being a database expert), it should not take long, but it takes more than 5 minutes.
I tried SQL transactions, with no luck. SqlBulkCopy seems a bit overkill, since it's a large table with dozens of columns and I only have to update one or two columns for a set of records, so I would like to keep it simple. Is there a better approach to improve efficiency?
The code itself:
public static bool UpdatePlayers(List<Match> matches)
{
    using (var connection = new SqlConnection(Database.myConnectionString))
    {
        connection.Open();
        SqlCommand cmd = connection.CreateCommand();
        foreach (Match m in matches)
        {
            cmd.CommandText = "";
            foreach (Player p in m.Players)
            {
                // Some player specific calculation, which takes almost no time.
                p.Morale = SomeSpecificCalculationWhichMilisecond();
                p.Condition = SomeSpecificCalculationWhichMilisecond();
                cmd.CommandText += "UPDATE [Players] SET [Morale] = @morale, [Condition] = @condition WHERE [ID] = @id;";
                cmd.Parameters.AddWithValue("@morale", p.Morale);
                cmd.Parameters.AddWithValue("@condition", p.Condition);
                cmd.Parameters.AddWithValue("@id", p.ID);
            }
            cmd.ExecuteNonQuery();
        }
    }
    return true;
}
Updating 20,000 records one at a time is a slow process, so taking over 5 minutes is to be expected.
From your query, I would suggest putting the data into a temp table, then joining the temp table to the update. This way the table being updated only has to be scanned once, and all the values are updated in a single statement.
Note: it could still take a while to do the update if you have indexes on the fields you are updating and/or there is a large amount of data in the table.
Example update query:
UPDATE P
SET [Morale] = TT.[Morale], [Condition] = TT.[Condition]
FROM [Players] AS P
INNER JOIN #TempTable AS TT ON TT.[ID] = P.[ID];
Populating the temp table
How to get the data into the temp table is up to you. I suspect you could use SqlBulkCopy but you might have to put it into an actual table, then delete the table once you are done.
If possible, I recommend putting a Primary Key on the ID column in the temp table. This may speed up the update process by making it faster to find the related ID in the temp table.
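A rough sketch of that whole flow with SqlBulkCopy; the temp table layout and the INT/FLOAT column types are assumptions, so adjust them to match the real Players table:
// Bulk-load the calculated values into a temp table, then run one joined UPDATE.
using (var connection = new SqlConnection(Database.myConnectionString))
{
    connection.Open();

    // 1. Create the temp table (it lives for the lifetime of this connection).
    using (var create = new SqlCommand(
        "CREATE TABLE #TempTable ([ID] INT PRIMARY KEY, [Morale] FLOAT, [Condition] FLOAT);",
        connection))
    {
        create.ExecuteNonQuery();
    }

    // 2. Build the rows in memory and bulk copy them in.
    var table = new DataTable();
    table.Columns.Add("ID", typeof(int));
    table.Columns.Add("Morale", typeof(double));
    table.Columns.Add("Condition", typeof(double));
    foreach (Match m in matches)
        foreach (Player p in m.Players)
            table.Rows.Add(p.ID, p.Morale, p.Condition);

    using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "#TempTable" })
    {
        bulk.WriteToServer(table);
    }

    // 3. One set-based update, as in the query above.
    using (var update = new SqlCommand(
        @"UPDATE P
          SET [Morale] = TT.[Morale], [Condition] = TT.[Condition]
          FROM [Players] AS P
          INNER JOIN #TempTable AS TT ON TT.[ID] = P.[ID];", connection))
    {
        update.ExecuteNonQuery();
    }
}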
Minor improvements:
use a string builder for the command text
ensure your parameter names are actually unique
clear your parameters for the next use
depending on how many players are in each match, batch N commands together rather than one match at a time (a sketch combining these is shown below)
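A compact sketch combining the first three points (batching several matches into one command is a straightforward extension of the same pattern). Types and members are the question's own; StringBuilder needs System.Text:
// One parameterized UPDATE per row, unique parameter names, one round trip per match.
using (var connection = new SqlConnection(Database.myConnectionString))
{
    connection.Open();
    foreach (Match m in matches)
    {
        using (SqlCommand cmd = connection.CreateCommand())   // fresh parameters each match
        {
            var sql = new StringBuilder();
            int i = 0;
            foreach (Player p in m.Players)
            {
                sql.AppendFormat(
                    "UPDATE [Players] SET [Morale] = @morale{0}, [Condition] = @condition{0} WHERE [ID] = @id{0};", i);
                cmd.Parameters.AddWithValue("@morale" + i, p.Morale);
                cmd.Parameters.AddWithValue("@condition" + i, p.Condition);
                cmd.Parameters.AddWithValue("@id" + i, p.ID);
                i++;
            }
            cmd.CommandText = sql.ToString();
            cmd.ExecuteNonQuery();   // one round trip per match
        }
    }
}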
Bigger improvement:
use a table-valued parameter and a MERGE SQL statement, which should look something like this (untested):
CREATE TYPE [MoraleUpdate] AS TABLE (
    [Id] ...,
    [Condition] ...,
    [Morale] ...
)
GO
MERGE [dbo].[Players] AS [Target]
USING @Updates AS [Source]
ON [Target].[Id] = [Source].[Id]
WHEN MATCHED THEN
    UPDATE SET [Morale] = [Source].[Morale],
               [Condition] = [Source].[Condition];
DataTable dt = new DataTable();
dt.Columns.Add("Id", typeof(...));
dt.Columns.Add("Morale", typeof(...));
dt.Columns.Add("Condition", typeof(...));
foreach (...)
{
    dt.Rows.Add(p.Id, p.Morale, p.Condition);
}
SqlParameter sqlParam = cmd.Parameters.AddWithValue("@Updates", dt);
sqlParam.SqlDbType = SqlDbType.Structured;
sqlParam.TypeName = "dbo.[MoraleUpdate]";
cmd.ExecuteNonQuery();
You could also implement a DbDataReader or stream SqlDataRecord rows to the server while you are calculating them.
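A hedged sketch of that streaming idea using SqlDataRecord from Microsoft.SqlServer.Server: an iterator is accepted as the value of a Structured parameter, so rows are sent as they are produced instead of being buffered in a DataTable first. The column order follows the MoraleUpdate type above; the INT/FLOAT types are assumptions:
// Stream rows for the table-valued parameter instead of materializing a DataTable.
static IEnumerable<SqlDataRecord> ToRecords(IEnumerable<Player> players)
{
    SqlMetaData[] layout =
    {
        new SqlMetaData("Id", SqlDbType.Int),
        new SqlMetaData("Condition", SqlDbType.Float),
        new SqlMetaData("Morale", SqlDbType.Float)
    };
    foreach (Player p in players)
    {
        var record = new SqlDataRecord(layout);
        record.SetInt32(0, p.ID);
        record.SetDouble(1, p.Condition);
        record.SetDouble(2, p.Morale);
        yield return record;   // pushed to the server as the TVP is read
    }
}

// usage: replaces the DataTable shown above (SelectMany needs System.Linq)
SqlParameter updates = cmd.Parameters.Add("@Updates", SqlDbType.Structured);
updates.TypeName = "dbo.[MoraleUpdate]";
updates.Value = ToRecords(matches.SelectMany(m => m.Players));
cmd.ExecuteNonQuery();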

C# SqlDataAdapter Fill Inconsistent Results

I would greatly appreciate some help with my SqlDataAdapter code below.
I am getting inconsistent results and am at a loss as to what the issue is.
SCENARIO:
This is a windows form project.
I have a routine that loops through a list of stations and calls the code below for each station.
The problem code executes the query (SQL Server) and fills a dataset with the returned rows.
Depending on what the station is up to, the returned rowcount will be >= 0 (typically 1 or 2).
PROBLEM:
The code does not always fill the dataset. Sometimes the dataset fills and sometimes it does not; more accurately, sometimes the row count is correct and sometimes the row count is 0.
Currently, it is the same subset of stations that are not filling correctly.
There are no issues if rowcount is actually = 0.
TROUBLESHOOTING SO FAR:
The code does not throw any exceptions.
Looking at the parameters/variables in the Locals window while stepping through, I do not see any differences between stations that return correctly and those that do not.
The query itself works without issue. I can execute the query in SSMS with the same parameters and I get the expected/correct results in SSMS.
I do not see any difference in the SSMS query results between stations that return/fill correctly and those that do not (columns all have data, etc.).
My first thought was that I must have something amiss with these stations in the underlying tables, but again I see nothing out of place (no missing columns, nulls, etc.).
I have looked at quite a few posts regarding SqlDataAdapter and fill, but they appear to be dealing with all or nothing problems.
I would think if there was a problem with the code, the fill would fail all of the time.
I would also think if there was a problem with the query, the problem would be present all of the time.
I would further think if there was a problem with the data, I would see something in the query results in SSMS.
CODE:
string strSql =
#"SELECT
ISNULL(a.ToolGroupId, 0) AS ToolGroupId, ISNULL(a.PartId, 0) AS PartId,
ISNULL(a.CycleCount, 0) AS CycleCount, ISNULL(a.BDT, 0) AS BDT,
start.starttime, endx.endtime,
ISNULL(start.ProgName, 'none') AS Program, ISNULL(a.gName, 'none') AS gName,
ISNULL(a.ProductionMetrics, 1) AS ProductionMetrics
FROM
(SELECT
MIN(datetime) AS starttime, ProgName
FROM
V_CycleTime
WHERE
eocr_no = @stationId
AND datetime >= @startTime
AND datetime < @endTime
GROUP BY
ProgName) AS start
LEFT JOIN
(SELECT
MAX(datetime) AS endtime, ProgName
FROM
V_CycleTime
WHERE
eocr_no = @stationId
AND datetime >= @startTime
AND datetime < @endTime
GROUP BY
ProgName) AS endx ON (endx.ProgName = start.ProgName)
LEFT JOIN
(SELECT
ISNULL(p.ToolGroupId, 0) AS ToolGroupId,
ISNULL(p.PartId, 0) AS PartId,
COUNT(ID) AS CycleCount,
AVG(Expected_Ct) AS BDT,
Program, p.gName, p.ProductionMetrics
FROM
V_CycleTime
LEFT JOIN
(SELECT
ToolGroupId, Program, PartId, t.gName, t.ProductionMetrics
FROM
PartsToPrograms
LEFT JOIN
(SELECT ID, gName, ProductionMetrics
FROM ToolGroups) t ON (t.ID = PartsToPrograms.ToolGroupId)
WHERE
StationId = @stationId
AND IsActive = 1) p ON (p.Program = V_CycleTime.ProgName)
WHERE
eocr_no = @stationId
AND datetime >= @startTime
AND datetime < @endTime
GROUP BY
p.ToolGroupId, p.PartId, Program, p.gName, p.ProductionMetrics) AS A ON (a.Program = start.ProgName)
WHERE
a.ToolGroupID IS NOT NULL
ORDER BY
start.starttime;";
// retrieve the dataset
DataSet ds = new DataSet();
string connectionString = ConfigurationManager.ConnectionStrings["connString"].ConnectionString;
using (var adapter = new SqlDataAdapter(strSql, connectionString))
{
    try
    {
        adapter.SelectCommand.Parameters.AddWithValue("@stationId", GlobalVars.stationId);
        adapter.SelectCommand.Parameters.AddWithValue("@startTime", GlobalVars.currentHourStartTime);
        adapter.SelectCommand.Parameters.AddWithValue("@endTime", GlobalVars.currentHourEndTime);
        adapter.Fill(ds);
    }
    catch (SqlException ex)
    {
        SimpleLogger.SimpleLog.Log(ex);
        return;
    }
}
// test row count
int rowCount = ds.Tables[0].Rows.Count;
if (rowCount == 0)
{
    // do this
}
else
{
    // do that
}
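One thing worth ruling out when SSMS and the application disagree is parameter typing: AddWithValue infers the SqlDbType from the .NET value it is given, so if, say, GlobalVars.currentHourStartTime were a string rather than a DateTime, the datetime comparisons could behave differently than they do in SSMS. A small, hedged variation with explicitly typed parameters (assuming the station id is an int and the two time values are DateTime):
// Explicitly typed parameters take AddWithValue's type inference out of the picture.
// The types here are assumptions; match them to the actual GlobalVars members.
adapter.SelectCommand.Parameters.Add("@stationId", SqlDbType.Int).Value = GlobalVars.stationId;
adapter.SelectCommand.Parameters.Add("@startTime", SqlDbType.DateTime).Value = GlobalVars.currentHourStartTime;
adapter.SelectCommand.Parameters.Add("@endTime", SqlDbType.DateTime).Value = GlobalVars.currentHourEndTime;
adapter.Fill(ds);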

OracleDataReader returning only last row on pagination query

I'm using Oracle.ManagedDataAccess to return data from my database, and I really need to page the results because there are a lot of records in this table.
So I'm using the second answer from this post for paging, and it works when I run it in an Oracle client.
The final query looks like this:
select *
from (
    select rownum as rn, a.*
    from (
        select u.*
        from users u
        order by u.user_code
    ) a
)
where rownum <= :myReturnSize
and rn > (:myReturnPage-1) * :myReturnSize;
But when I call it from the .NET code below, it returns only the last record of the 100 I asked for.
OracleParameter[] parameters = new OracleParameter[]
{
    new OracleParameter("myReturnPage", page), // 1
    new OracleParameter("myReturnSize", size)  // 100
};
List<User> usersList = new List<User>();
using (OracleConnection conn = new OracleConnection(connString))
{
    using (OracleCommand cmd = new OracleCommand(sbSelect.ToString(), conn))
    {
        conn.Open();
        cmd.CommandType = CommandType.Text;
        cmd.Parameters.AddRange(parameters);
        using (OracleDataReader odr = cmd.ExecuteReader())
        {
            if (!odr.IsClosed && odr.HasRows)
            {
                while (odr.Read())
                {
                    User userToReturn = new User();
                    FillUserEntity(userToReturn, odr);
                    usersList.Add(userToReturn);
                }
            }
        }
    }
}
return usersList.AsQueryable();
Even more bizarre is that when I run the query without pagination in the same method, it returns all the records, more than 723,000.
Any help would be appreciated.
Thanks a lot.
By default, ODP.NET binds parameters by position and not by name, so you need to invert the order when creating the OracleParameter array, or simply set the BindByName property to true, like this:
cmd.BindByName = true;
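For completeness, a minimal sketch of how that fits into the code from the question (BindByName must be set before the command is executed; with it, the order of the OracleParameter array and the repeated :myReturnSize placeholder no longer matter):
using (OracleCommand cmd = new OracleCommand(sbSelect.ToString(), conn))
{
    cmd.BindByName = true;               // match parameters by name, not position
    cmd.CommandType = CommandType.Text;
    cmd.Parameters.AddRange(parameters); // same array as in the question
    // ... ExecuteReader and fill the user list exactly as before
}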
Oracle tends to prefer stored procedures over direct text (because reasons). I've had more than a few "it works in SQL Developer but not .Net!" situations that were solved by putting it all together in a stored proc within a package on the database side. That also decouples your query from your application, so if the query has to change you don't have to recompile the app. Your app then just makes the same call as before, but to the stored procedure, probably using an OracleDataAdapter.
Can you confirm whether your query gives the correct output from the Oracle client?
The problem is with
where rownum <= :myReturnSize
It will always end up returning the row where rownum = :myReturnSize.
One possible solution is:
select *
from (
    select rownum as rnum, a.*
    from (
        select rownum as rn, u.*
        from users u
        order by u.user_code
    ) a
)
where rnum <= :myReturnSize
and rn > (:myReturnPage-1) * :myReturnSize;

How to use start and limit for paging

I am trying to use paging with a grid. The following method returns everything in the table to the grid. How do I account for start and limit, where start is the page number and limit is the number of records per page? Basically, the ExtJS paging toolbar expects my method to handle start and limit on demand. I have tried so many solutions but just can't seem to get it to work. That's why I am putting this out here in the simplest way.
This is my C# end:
public string myRecord(int start, int limit)
{
    List<gridPaging> result = new List<gridPaging>();
    using (SqlConnection con = new SqlConnection(ConfigurationManager.ConnectionStrings["ApplicationServices2"].ConnectionString))
    {
        SqlCommand cmd = con.CreateCommand();
        cmd.CommandText = "SELECT * FROM myTable ORDER BY Q1";
        con.Open();
        SqlDataReader reader = cmd.ExecuteReader();
        while (reader.Read())
        {
            gridPaging gp = new gridPaging();
            gp.Column1 = reader["Column1"].ToString().Trim();
            gp.Column2 = reader["Column2"].ToString().Trim();
            gp.Column3 = reader["Column3"].ToString().Trim();
            gp.Column4 = reader["Column4"].ToString().Trim();
            result.Add(gp);
        }
        return JsonConvert.SerializeObject(result);
    }
}
If this is similar to your current implementation, you could modify your SQL to take advantage of ROW_NUMBER: T-SQL: Paging with ROW_NUMBER()
Alternately, if you had some sort of LINQ implementation, you can use .Skip() and .Take() methods to do your paging.
In T-SQL, you have two built-ins that help you here; the first is the Row_Number function, which assigns a unique, increasing ordinal to each row of a result set as ordered by a special ORDER BY clause, and the second is the TOP keyword, which limits the maximum number of rows to be returned.
Basically, your query should look like this:
SELECT TOP (@limit) * FROM myTable WHERE (ROW_NUMBER() OVER (ORDER BY Q1)) > @limit * @start ORDER BY Q1
You then plug in values for the parameters @start and @limit from your C# code using Command.CreateParameter. For, say, the third page of results (using a zero-indexed start value of 2) with 15 results per page, this evaluates to the statement
SELECT TOP 15 * FROM myTable WHERE (ROW_NUMBER() OVER (ORDER BY Q1)) > 30 ORDER BY Q1
... which provides rows from the overall query from 31 to 45, the first two pages' queries having produced rows 1-15 and 16-30 respectively.
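One caveat: SQL Server only allows ROW_NUMBER() in the SELECT and ORDER BY clauses, not directly in a WHERE clause, so in practice the paging filter goes in a CTE or derived table. A rough sketch of the question's method with that applied (table, column, and connection-string names are taken from the question; start is treated as a zero-based page number, as in the example above):
public string myRecord(int start, int limit)
{
    // Number the rows once, then keep only the requested page.
    const string sql = @"
        WITH Numbered AS (
            SELECT *, ROW_NUMBER() OVER (ORDER BY Q1) AS RowNum
            FROM myTable
        )
        SELECT *
        FROM Numbered
        WHERE RowNum > @start * @limit AND RowNum <= (@start + 1) * @limit
        ORDER BY RowNum;";

    List<gridPaging> result = new List<gridPaging>();
    using (SqlConnection con = new SqlConnection(
        ConfigurationManager.ConnectionStrings["ApplicationServices2"].ConnectionString))
    using (SqlCommand cmd = new SqlCommand(sql, con))
    {
        cmd.Parameters.AddWithValue("@start", start);   // zero-based page number
        cmd.Parameters.AddWithValue("@limit", limit);   // records per page
        con.Open();
        using (SqlDataReader reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                gridPaging gp = new gridPaging();
                gp.Column1 = reader["Column1"].ToString().Trim();
                gp.Column2 = reader["Column2"].ToString().Trim();
                gp.Column3 = reader["Column3"].ToString().Trim();
                gp.Column4 = reader["Column4"].ToString().Trim();
                result.Add(gp);
            }
        }
    }
    return JsonConvert.SerializeObject(result);
}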
